pandas 用python清理一个非常混乱的文本文件

fykwrbwg  于 2023-05-05  发布在  Python
关注(0)|答案(1)|浏览(172)

我有一个凌乱的文本文件,包含有数百个表的数据库的元数据,我试图将其发送到pandas,这样我就有了一个很好的CSV,但这个很难,我可能错过了一些导致问题的早期内容
myfile.txt =

Table: "DBO"."PSDSTTIME"
Column list, count 8:
v_FileID r_FileID ColumnID DomainID Name         Type     Precision Scale MaxLength Nullability Nulls Min                         Max                         #Distinct 
-------- -------- -------- -------- ------------ -------- --------- ----- --------- ----------- ----- --------------------------- --------------------------- --------- 
    1058     1058        0        1 DSTID        CHAR     24              12        Y           N     0LastSunMar                 4LastSunOct                 21        
    1058     1058        1        2 DSTABSOLUTE  CHAR     2               1         Y           N     N                           N                           1         
    1058     1058        2        3 DSTMONTH     CHAR     4               2         Y           N     10                          9                           8         
    1058     1058        3        4 DSTDAY       SMALLINT                 2         Y           N     1                           5                           5         
    1058     1058        4        5 DSTDAYOFWEEK CHAR     2               1         Y           N     0                           5                           4         
    1058     1058        5        6 DSTHOUR      SMALLINT                 2         Y           N     0                           4                           5         
    1058     1058        6        7 DSTMINUTE    SMALLINT                 2         Y           N     0                           0                           1         
    1058     1058        7        8 DESCR        CHAR     60              30        Y           N     First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21        

Ready>Check Files...OK
Load Data...OK
Table: "DBO"."PS_PRCSRECUR"
Column list, count 29:
v_FileID r_FileID ColumnID DomainID Name              Type      Precision Scale MaxLength Nullability Nulls Min                              Max                              #Distinct 
-------- -------- -------- -------- ----------------- --------- --------- ----- --------- ----------- ----- -------------------------------- -------------------------------- --------- 
    1048     1048        0        1 RECURTYPE         SMALLINT                  2         Y           N     2                                2                                1         
    1048     1048        1        2 RECURNAME         CHAR      60              30        Y           N     Daily Purge                      Y_DAILY3_NT2                     12        
    1048     1048        2        3 VERSION           INTEGER                   2         Y           N     1                                4                                3         
    1048     1048        3        4 DAYOFMONTH        SMALLINT                  2         Y           N     0                                0                                1         
    1048     1048        4        5 RUN1STWEEK        SMALLINT                  2         Y           N     1                                1                                1         
    1048     1048        5        6 RUN2NDWEEK        SMALLINT                  2         Y           N     2                                2                                1         
    1048     1048        6        7 RUN3RDWEEK        SMALLINT                  2         Y           N     3                                3                                1         
    1048     1048        7        8 RUN4THWEEK        SMALLINT                  2         Y           N     4                                4                                1         
    1048     1048        8        9 RUN5THWEEK        SMALLINT                  2         Y           N     5                                5                                1         
    1048     1048        9       10 RUN6THWEEK        SMALLINT                  2         Y           N     6                                6                                1         
    1048     1048       10       11 RUNSUNDAY         SMALLINT                  2         Y           N     0                                1                                2         
    1048     1048       11       12 RUNMONDAY         SMALLINT                  2         Y           N     2                                2                                1         
    1048     1048       12       13 RUNTUESDAY        SMALLINT                  2         Y           N     3                                3                                1         
    1048     1048       13       14 RUNWEDNESDAY      SMALLINT                  2         Y           N     4                                4                                1         
    1048     1048       14       15 RUNTHURSDAY       SMALLINT                  2         Y           N     5                                5                                1         
    1048     1048       15       16 RUNFRIDAY         SMALLINT                  2         Y           N     6                                6                                1         
    1048     1048       16       17 RUNSATURDAY       SMALLINT                  2         Y           N     0                                7                                2         
    1048     1048       17       18 RECURDAYSPCL      SMALLINT                  2         Y           N     0                                0                                1         
    1048     1048       18       19 BEGINDTTM         TIMESTAMP                 32        Y           N     1999-01-01-17.00.00.000000000000 2014-08-23-10.59.24.000000000000 12        
    1048     1048       19       20 REPEATRECURRENCE  SMALLINT                  3         Y           N     0                                60                               8         
    1048     1048       20       21 REPEATUNIT        SMALLINT                  2         Y           N     0                                2                                3         
    1048     1048       21       22 DURATION          SMALLINT                  3         Y           N     0                                24                               4         
    1048     1048       22       23 DURATIONUNIT      SMALLINT                  2         Y           N     0                                1                                2         
    1048     1048       23       24 INITIATEWHEN      SMALLINT                  2         Y           N     0                                1                                2         
    1048     1048       24       25 RECURDESCR        CHAR      60              30        Y           N     Daily Purge                      Y_DAILY3_NT2 3Times              12        
    1048     1048       25       26 LASTUPDDTTM       TIMESTAMP                 32        Y           N     2001-08-14-16.25.54.000000000000 2014-08-23-10.59.38.000000000000 11        
    1048     1048       26       27 LASTUPDOPRID      CHAR      60              30        Y           N     CruiseC                          dcg01                            4         
    1048     1048       27       28 ENDDTTM           TIMESTAMP                 32        Y           Y     2005-02-01-00.00.00.000000000000 2005-02-01-00.00.00.000000000000 1         
    1048     1048       28       29 RECUR_USECURRDATE SMALLINT                  2         Y           N     0                                1                                2         

Ready>Check Files...OK

这就是我试图接近的

import re
with open('lmprod_metadata.txt', encoding='utf8') as f:
    lines = f.readlines()

pattern1 = re.compile('^\s*\d+\s+\d+\s+\d+\s+\d+.*$')
pattern2 = re.compile('^Table:.*$')

lines = [line for line in lines if pattern1.match(line) or pattern2.match(line)]

lines2 = [line.strip() for line in lines]

lines3 = [line for line in lines2]

# my thought for the following was to get table name in the rows to help standardize the data
table_names = []
data_columns = []
for i in range(len(lst)):
    if lst[i].startswith('Table'):
        table_name = lst[i].split()[1]
    else:
        data_columns.append(table_name + ' ' + lst[i])

以下是示例文本文件中显示的前两个表的输出

"DBO"."PSDSTTIME" 1058     1058        0        1 DSTID        CHAR     24              12        Y           N     0LastSunMar                 4LastSunOct                 21
"DBO"."PSDSTTIME" 1058     1058        1        2 DSTABSOLUTE  CHAR     2               1         Y           N     N                           N                           1
"DBO"."PSDSTTIME" 1058     1058        2        3 DSTMONTH     CHAR     4               2         Y           N     10                          9                           8
"DBO"."PSDSTTIME" 1058     1058        3        4 DSTDAY       SMALLINT                 2         Y           N     1                           5                           5
"DBO"."PSDSTTIME" 1058     1058        4        5 DSTDAYOFWEEK CHAR     2               1         Y           N     0                           5                           4
"DBO"."PSDSTTIME" 1058     1058        5        6 DSTHOUR      SMALLINT                 2         Y           N     0                           4                           5
"DBO"."PSDSTTIME" 1058     1058        6        7 DSTMINUTE    SMALLINT                 2         Y           N     0                           0                           1
"DBO"."PSDSTTIME" 1058     1058        7        8 DESCR        CHAR     60              30        Y           N     First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21
"DBO"."PS_PRCSRECUR" 1048     1048        0        1 RECURTYPE         SMALLINT                  2         Y           N     2                                2                                1
"DBO"."PS_PRCSRECUR" 1048     1048        1        2 RECURNAME         CHAR      60              30        Y           N     Daily Purge                      Y_DAILY3_NT2                     12
"DBO"."PS_PRCSRECUR" 1048     1048        2        3 VERSION           INTEGER                   2         Y           N     1                                4                                3
"DBO"."PS_PRCSRECUR" 1048     1048        3        4 DAYOFMONTH        SMALLINT                  2         Y           N     0                                0                                1
"DBO"."PS_PRCSRECUR" 1048     1048        4        5 RUN1STWEEK        SMALLINT                  2         Y           N     1                                1                                1
"DBO"."PS_PRCSRECUR" 1048     1048        5        6 RUN2NDWEEK        SMALLINT                  2         Y           N     2                                2                                1
"DBO"."PS_PRCSRECUR" 1048     1048        6        7 RUN3RDWEEK        SMALLINT                  2         Y           N     3                                3                                1
"DBO"."PS_PRCSRECUR" 1048     1048        7        8 RUN4THWEEK        SMALLINT                  2         Y           N     4                                4                                1
"DBO"."PS_PRCSRECUR" 1048     1048        8        9 RUN5THWEEK        SMALLINT                  2         Y           N     5                                5                                1
"DBO"."PS_PRCSRECUR" 1048     1048        9       10 RUN6THWEEK        SMALLINT                  2         Y           N     6                                6                                1
"DBO"."PS_PRCSRECUR" 1048     1048       10       11 RUNSUNDAY         SMALLINT                  2         Y           N     0                                1                                2
"DBO"."PS_PRCSRECUR" 1048     1048       11       12 RUNMONDAY         SMALLINT                  2         Y           N     2                                2                                1
"DBO"."PS_PRCSRECUR" 1048     1048       12       13 RUNTUESDAY        SMALLINT                  2         Y           N     3                                3                                1
"DBO"."PS_PRCSRECUR" 1048     1048       13       14 RUNWEDNESDAY      SMALLINT                  2         Y           N     4                                4                                1
"DBO"."PS_PRCSRECUR" 1048     1048       14       15 RUNTHURSDAY       SMALLINT                  2         Y           N     5                                5                                1
"DBO"."PS_PRCSRECUR" 1048     1048       15       16 RUNFRIDAY         SMALLINT                  2         Y           N     6                                6                                1
"DBO"."PS_PRCSRECUR" 1048     1048       16       17 RUNSATURDAY       SMALLINT                  2         Y           N     0                                7                                2
"DBO"."PS_PRCSRECUR" 1048     1048       17       18 RECURDAYSPCL      SMALLINT                  2         Y           N     0                                0                                1
"DBO"."PS_PRCSRECUR" 1048     1048       18       19 BEGINDTTM         TIMESTAMP                 32        Y           N     1999-01-01-17.00.00.000000000000 2014-08-23-10.59.24.000000000000 12
"DBO"."PS_PRCSRECUR" 1048     1048       19       20 REPEATRECURRENCE  SMALLINT                  3         Y           N     0                                60                               8
"DBO"."PS_PRCSRECUR" 1048     1048       20       21 REPEATUNIT        SMALLINT                  2         Y           N     0                                2                                3
"DBO"."PS_PRCSRECUR" 1048     1048       21       22 DURATION          SMALLINT                  3         Y           N     0                                24                               4
"DBO"."PS_PRCSRECUR" 1048     1048       22       23 DURATIONUNIT      SMALLINT                  2         Y           N     0                                1                                2
"DBO"."PS_PRCSRECUR" 1048     1048       23       24 INITIATEWHEN      SMALLINT                  2         Y           N     0                                1                                2
"DBO"."PS_PRCSRECUR" 1048     1048       24       25 RECURDESCR        CHAR      60              30        Y           N     Daily Purge                      Y_DAILY3_NT2 3Times              12
"DBO"."PS_PRCSRECUR" 1048     1048       25       26 LASTUPDDTTM       TIMESTAMP                 32        Y           N     2001-08-14-16.25.54.000000000000 2014-08-23-10.59.38.000000000000 11
"DBO"."PS_PRCSRECUR" 1048     1048       26       27 LASTUPDOPRID      CHAR      60              30        Y           N     CruiseC                          dcg01                            4
"DBO"."PS_PRCSRECUR" 1048     1048       27       28 ENDDTTM           TIMESTAMP                 32        Y           Y     2005-02-01-00.00.00.000000000000 2005-02-01-00.00.00.000000000000 1
"DBO"."PS_PRCSRECUR" 1048     1048       28       29 RECUR_USECURRDATE SMALLINT                  2         Y           N     0                                1                                2

所以我的问题是我已经清理它到这一点的方式不处理任意数量的空格好吧,我已经尝试了OpenRefine,cleaninng列表进一步,和不同的分隔符在Pandas没有成功,我不想失去任何数据
这里尝试使用上一个示例中的data_columns表进一步清理数据,因此所有数据都以空格分隔

lst = []

for item in data_columns:
    clean_item = item.replace('\n', '').replace('\r', '').replace('\t', ' ')
    lst.append(clean_item)

输出(包括更少的,因为我用完了字符)

"DBO"."PSDSTTIME" 1058     1058        0        1 DSTID        CHAR     24              12        Y           N     0LastSunMar                 4LastSunOct                 21
"DBO"."PSDSTTIME" 1058     1058        1        2 DSTABSOLUTE  CHAR     2               1         Y           N     N                           N                           1
"DBO"."PSDSTTIME" 1058     1058        2        3 DSTMONTH     CHAR     4               2         Y           N     10                          9                           8
"DBO"."PSDSTTIME" 1058     1058        3        4 DSTDAY       SMALLINT                 2         Y           N     1                           5                           5
"DBO"."PSDSTTIME" 1058     1058        4        5 DSTDAYOFWEEK CHAR     2               1         Y           N     0                           5                           4
"DBO"."PSDSTTIME" 1058     1058        5        6 DSTHOUR      SMALLINT                 2         Y           N     0                           4                           5
"DBO"."PSDSTTIME" 1058     1058        6        7 DSTMINUTE    SMALLINT                 2         Y           N     0                           0                           1
"DBO"."PSDSTTIME" 1058     1058        7        8 DESCR        CHAR     60              30        Y           N     First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21
"DBO"."PS_PRCSRECUR" 1048     1048        0        1 RECURTYPE         SMALLINT                  2         Y           N     2                                2                                1
"DBO"."PS_PRCSRECUR" 1048     1048        1        2 RECURNAME         CHAR      60              30        Y           N     Daily Purge                      Y_DAILY3_NT2                     12
"DBO"."PS_PRCSRECUR" 1048     1048        2        3 VERSION           INTEGER                   2         Y           N     1                                4                                3

这是对Pandas的尝试

df = pd.read_csv('myfile.txt', delimiter='\s+')
# and
df = pd.read_csv('myfile.txt', sep='\s+')

导致此错误:

ParserError                               Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 df = pd.read_csv('myfile.txt', delimiter='\s+')

not including full output, running out of characters

ParserError: Error tokenizing data. C error: Expected 15 fields in line 9, saw 22

我已经删除了一些其他的文本清理代码,我试过了...这不是最好的举动,但哦,好吧,我想必须有一些上游我可以做的,我只是不确定?

7ivaypg9

7ivaypg91#

read_fwf似乎是一个很好的 * 用例 *:

import re
from io import StringIO

with open("myfile.txt", "r") as f:
    content = f.read()
    splits = re.findall(r"Table:\s*(.*?)\n.*?\n(.*?)\n(?=Ready|$)", content, flags=re.DOTALL)
  
dfs = {table[0].replace('"', ""): pd.read_fwf(StringIO(table[1])).loc[1:] for table in splits}
​​
df = pd.concat(dfs).reset_index(level=0, names="table_name")

输出:
要单独访问每个表,请使用dict key-indexing

print(dfs["DBO.PSDSTTIME"])
​
   v_FileID r_FileID ColumnID  ...                          Min                          Max #Distinct
1      1058     1058        0  ...                  0LastSunMar                  4LastSunOct        21
2      1058     1058        1  ...                            N                            N         1
3      1058     1058        2  ...                           10                            9         8
..      ...      ...      ...  ...                          ...                          ...       ...
6      1058     1058        5  ...                            0                            4         5
7      1058     1058        6  ...                            0                            0         1
8      1058     1058        7  ...  First Friday in May, 2:00am  Third Sunday in Oct, 2:00am        21

[8 rows x 14 columns]

print(dfs["DBO.PS_PRCSRECUR"])

   v_FileID r_FileID ColumnID  ...                               Min                               Max #Distinct
1      1048     1048        0  ...                                 2                                 2         1
2      1048     1048        1  ...                       Daily Purge                      Y_DAILY3_NT2        12
3      1048     1048        2  ...                                 1                                 4         3
..      ...      ...      ...  ...                               ...                               ...       ...
27     1048     1048       26  ...                           CruiseC                             dcg01         4
28     1048     1048       27  ...  2005-02-01-00.00.00.000000000000  2005-02-01-00.00.00.000000000000         1
29     1048     10
48       28  ...                                 0                                 1         2

[29 rows x 14 columns]

使用concat,您可以在同一个 DataFrame 中获得所有表:

print(df)

          table_name v_FileID r_FileID  ...                               Min                               Max #Distinct
1      DBO.PSDSTTIME     1058     1058  ...                       0LastSunMar                       4LastSunOct        21
2      DBO.PSDSTTIME     1058     1058  ...                                 N                                 N         1
3      DBO.PSDSTTIME     1058     1058  ...                                10                                 9         8
..               ...      ...      ...  ...                               ...                               ...       ...
27  DBO.PS_PRCSRECUR     1048     1048  ...                           CruiseC                             dcg01         4
28  DBO.PS_PRCSRECUR     1048     1048  ...  2005-02-01-00.00.00.000000000000  2005-02-01-00.00.00.000000000000         1
29  DBO.PS_PRCSRECUR     1048     1048  ...                                 0                                 1         2

[37 rows x 15 columns]

相关问题