我有一个凌乱的文本文件,包含有数百个表的数据库的元数据,我试图将其发送到pandas,这样我就有了一个很好的CSV,但这个很难,我可能错过了一些导致问题的早期内容
myfile.txt =
Table: "DBO"."PSDSTTIME"
Column list, count 8:
v_FileID r_FileID ColumnID DomainID Name Type Precision Scale MaxLength Nullability Nulls Min Max #Distinct
-------- -------- -------- -------- ------------ -------- --------- ----- --------- ----------- ----- --------------------------- --------------------------- ---------
1058 1058 0 1 DSTID CHAR 24 12 Y N 0LastSunMar 4LastSunOct 21
1058 1058 1 2 DSTABSOLUTE CHAR 2 1 Y N N N 1
1058 1058 2 3 DSTMONTH CHAR 4 2 Y N 10 9 8
1058 1058 3 4 DSTDAY SMALLINT 2 Y N 1 5 5
1058 1058 4 5 DSTDAYOFWEEK CHAR 2 1 Y N 0 5 4
1058 1058 5 6 DSTHOUR SMALLINT 2 Y N 0 4 5
1058 1058 6 7 DSTMINUTE SMALLINT 2 Y N 0 0 1
1058 1058 7 8 DESCR CHAR 60 30 Y N First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21
Ready>Check Files...OK
Load Data...OK
Table: "DBO"."PS_PRCSRECUR"
Column list, count 29:
v_FileID r_FileID ColumnID DomainID Name Type Precision Scale MaxLength Nullability Nulls Min Max #Distinct
-------- -------- -------- -------- ----------------- --------- --------- ----- --------- ----------- ----- -------------------------------- -------------------------------- ---------
1048 1048 0 1 RECURTYPE SMALLINT 2 Y N 2 2 1
1048 1048 1 2 RECURNAME CHAR 60 30 Y N Daily Purge Y_DAILY3_NT2 12
1048 1048 2 3 VERSION INTEGER 2 Y N 1 4 3
1048 1048 3 4 DAYOFMONTH SMALLINT 2 Y N 0 0 1
1048 1048 4 5 RUN1STWEEK SMALLINT 2 Y N 1 1 1
1048 1048 5 6 RUN2NDWEEK SMALLINT 2 Y N 2 2 1
1048 1048 6 7 RUN3RDWEEK SMALLINT 2 Y N 3 3 1
1048 1048 7 8 RUN4THWEEK SMALLINT 2 Y N 4 4 1
1048 1048 8 9 RUN5THWEEK SMALLINT 2 Y N 5 5 1
1048 1048 9 10 RUN6THWEEK SMALLINT 2 Y N 6 6 1
1048 1048 10 11 RUNSUNDAY SMALLINT 2 Y N 0 1 2
1048 1048 11 12 RUNMONDAY SMALLINT 2 Y N 2 2 1
1048 1048 12 13 RUNTUESDAY SMALLINT 2 Y N 3 3 1
1048 1048 13 14 RUNWEDNESDAY SMALLINT 2 Y N 4 4 1
1048 1048 14 15 RUNTHURSDAY SMALLINT 2 Y N 5 5 1
1048 1048 15 16 RUNFRIDAY SMALLINT 2 Y N 6 6 1
1048 1048 16 17 RUNSATURDAY SMALLINT 2 Y N 0 7 2
1048 1048 17 18 RECURDAYSPCL SMALLINT 2 Y N 0 0 1
1048 1048 18 19 BEGINDTTM TIMESTAMP 32 Y N 1999-01-01-17.00.00.000000000000 2014-08-23-10.59.24.000000000000 12
1048 1048 19 20 REPEATRECURRENCE SMALLINT 3 Y N 0 60 8
1048 1048 20 21 REPEATUNIT SMALLINT 2 Y N 0 2 3
1048 1048 21 22 DURATION SMALLINT 3 Y N 0 24 4
1048 1048 22 23 DURATIONUNIT SMALLINT 2 Y N 0 1 2
1048 1048 23 24 INITIATEWHEN SMALLINT 2 Y N 0 1 2
1048 1048 24 25 RECURDESCR CHAR 60 30 Y N Daily Purge Y_DAILY3_NT2 3Times 12
1048 1048 25 26 LASTUPDDTTM TIMESTAMP 32 Y N 2001-08-14-16.25.54.000000000000 2014-08-23-10.59.38.000000000000 11
1048 1048 26 27 LASTUPDOPRID CHAR 60 30 Y N CruiseC dcg01 4
1048 1048 27 28 ENDDTTM TIMESTAMP 32 Y Y 2005-02-01-00.00.00.000000000000 2005-02-01-00.00.00.000000000000 1
1048 1048 28 29 RECUR_USECURRDATE SMALLINT 2 Y N 0 1 2
Ready>Check Files...OK
这就是我试图接近的
import re
with open('lmprod_metadata.txt', encoding='utf8') as f:
lines = f.readlines()
pattern1 = re.compile('^\s*\d+\s+\d+\s+\d+\s+\d+.*$')
pattern2 = re.compile('^Table:.*$')
lines = [line for line in lines if pattern1.match(line) or pattern2.match(line)]
lines2 = [line.strip() for line in lines]
lines3 = [line for line in lines2]
# my thought for the following was to get table name in the rows to help standardize the data
table_names = []
data_columns = []
for i in range(len(lst)):
if lst[i].startswith('Table'):
table_name = lst[i].split()[1]
else:
data_columns.append(table_name + ' ' + lst[i])
以下是示例文本文件中显示的前两个表的输出
"DBO"."PSDSTTIME" 1058 1058 0 1 DSTID CHAR 24 12 Y N 0LastSunMar 4LastSunOct 21
"DBO"."PSDSTTIME" 1058 1058 1 2 DSTABSOLUTE CHAR 2 1 Y N N N 1
"DBO"."PSDSTTIME" 1058 1058 2 3 DSTMONTH CHAR 4 2 Y N 10 9 8
"DBO"."PSDSTTIME" 1058 1058 3 4 DSTDAY SMALLINT 2 Y N 1 5 5
"DBO"."PSDSTTIME" 1058 1058 4 5 DSTDAYOFWEEK CHAR 2 1 Y N 0 5 4
"DBO"."PSDSTTIME" 1058 1058 5 6 DSTHOUR SMALLINT 2 Y N 0 4 5
"DBO"."PSDSTTIME" 1058 1058 6 7 DSTMINUTE SMALLINT 2 Y N 0 0 1
"DBO"."PSDSTTIME" 1058 1058 7 8 DESCR CHAR 60 30 Y N First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21
"DBO"."PS_PRCSRECUR" 1048 1048 0 1 RECURTYPE SMALLINT 2 Y N 2 2 1
"DBO"."PS_PRCSRECUR" 1048 1048 1 2 RECURNAME CHAR 60 30 Y N Daily Purge Y_DAILY3_NT2 12
"DBO"."PS_PRCSRECUR" 1048 1048 2 3 VERSION INTEGER 2 Y N 1 4 3
"DBO"."PS_PRCSRECUR" 1048 1048 3 4 DAYOFMONTH SMALLINT 2 Y N 0 0 1
"DBO"."PS_PRCSRECUR" 1048 1048 4 5 RUN1STWEEK SMALLINT 2 Y N 1 1 1
"DBO"."PS_PRCSRECUR" 1048 1048 5 6 RUN2NDWEEK SMALLINT 2 Y N 2 2 1
"DBO"."PS_PRCSRECUR" 1048 1048 6 7 RUN3RDWEEK SMALLINT 2 Y N 3 3 1
"DBO"."PS_PRCSRECUR" 1048 1048 7 8 RUN4THWEEK SMALLINT 2 Y N 4 4 1
"DBO"."PS_PRCSRECUR" 1048 1048 8 9 RUN5THWEEK SMALLINT 2 Y N 5 5 1
"DBO"."PS_PRCSRECUR" 1048 1048 9 10 RUN6THWEEK SMALLINT 2 Y N 6 6 1
"DBO"."PS_PRCSRECUR" 1048 1048 10 11 RUNSUNDAY SMALLINT 2 Y N 0 1 2
"DBO"."PS_PRCSRECUR" 1048 1048 11 12 RUNMONDAY SMALLINT 2 Y N 2 2 1
"DBO"."PS_PRCSRECUR" 1048 1048 12 13 RUNTUESDAY SMALLINT 2 Y N 3 3 1
"DBO"."PS_PRCSRECUR" 1048 1048 13 14 RUNWEDNESDAY SMALLINT 2 Y N 4 4 1
"DBO"."PS_PRCSRECUR" 1048 1048 14 15 RUNTHURSDAY SMALLINT 2 Y N 5 5 1
"DBO"."PS_PRCSRECUR" 1048 1048 15 16 RUNFRIDAY SMALLINT 2 Y N 6 6 1
"DBO"."PS_PRCSRECUR" 1048 1048 16 17 RUNSATURDAY SMALLINT 2 Y N 0 7 2
"DBO"."PS_PRCSRECUR" 1048 1048 17 18 RECURDAYSPCL SMALLINT 2 Y N 0 0 1
"DBO"."PS_PRCSRECUR" 1048 1048 18 19 BEGINDTTM TIMESTAMP 32 Y N 1999-01-01-17.00.00.000000000000 2014-08-23-10.59.24.000000000000 12
"DBO"."PS_PRCSRECUR" 1048 1048 19 20 REPEATRECURRENCE SMALLINT 3 Y N 0 60 8
"DBO"."PS_PRCSRECUR" 1048 1048 20 21 REPEATUNIT SMALLINT 2 Y N 0 2 3
"DBO"."PS_PRCSRECUR" 1048 1048 21 22 DURATION SMALLINT 3 Y N 0 24 4
"DBO"."PS_PRCSRECUR" 1048 1048 22 23 DURATIONUNIT SMALLINT 2 Y N 0 1 2
"DBO"."PS_PRCSRECUR" 1048 1048 23 24 INITIATEWHEN SMALLINT 2 Y N 0 1 2
"DBO"."PS_PRCSRECUR" 1048 1048 24 25 RECURDESCR CHAR 60 30 Y N Daily Purge Y_DAILY3_NT2 3Times 12
"DBO"."PS_PRCSRECUR" 1048 1048 25 26 LASTUPDDTTM TIMESTAMP 32 Y N 2001-08-14-16.25.54.000000000000 2014-08-23-10.59.38.000000000000 11
"DBO"."PS_PRCSRECUR" 1048 1048 26 27 LASTUPDOPRID CHAR 60 30 Y N CruiseC dcg01 4
"DBO"."PS_PRCSRECUR" 1048 1048 27 28 ENDDTTM TIMESTAMP 32 Y Y 2005-02-01-00.00.00.000000000000 2005-02-01-00.00.00.000000000000 1
"DBO"."PS_PRCSRECUR" 1048 1048 28 29 RECUR_USECURRDATE SMALLINT 2 Y N 0 1 2
所以我的问题是我已经清理它到这一点的方式不处理任意数量的空格好吧,我已经尝试了OpenRefine,cleaninng列表进一步,和不同的分隔符在Pandas没有成功,我不想失去任何数据
这里尝试使用上一个示例中的data_columns表进一步清理数据,因此所有数据都以空格分隔
lst = []
for item in data_columns:
clean_item = item.replace('\n', '').replace('\r', '').replace('\t', ' ')
lst.append(clean_item)
输出(包括更少的,因为我用完了字符)
"DBO"."PSDSTTIME" 1058 1058 0 1 DSTID CHAR 24 12 Y N 0LastSunMar 4LastSunOct 21
"DBO"."PSDSTTIME" 1058 1058 1 2 DSTABSOLUTE CHAR 2 1 Y N N N 1
"DBO"."PSDSTTIME" 1058 1058 2 3 DSTMONTH CHAR 4 2 Y N 10 9 8
"DBO"."PSDSTTIME" 1058 1058 3 4 DSTDAY SMALLINT 2 Y N 1 5 5
"DBO"."PSDSTTIME" 1058 1058 4 5 DSTDAYOFWEEK CHAR 2 1 Y N 0 5 4
"DBO"."PSDSTTIME" 1058 1058 5 6 DSTHOUR SMALLINT 2 Y N 0 4 5
"DBO"."PSDSTTIME" 1058 1058 6 7 DSTMINUTE SMALLINT 2 Y N 0 0 1
"DBO"."PSDSTTIME" 1058 1058 7 8 DESCR CHAR 60 30 Y N First Friday in May, 2:00am Third Sunday in Oct, 2:00am 21
"DBO"."PS_PRCSRECUR" 1048 1048 0 1 RECURTYPE SMALLINT 2 Y N 2 2 1
"DBO"."PS_PRCSRECUR" 1048 1048 1 2 RECURNAME CHAR 60 30 Y N Daily Purge Y_DAILY3_NT2 12
"DBO"."PS_PRCSRECUR" 1048 1048 2 3 VERSION INTEGER 2 Y N 1 4 3
这是对Pandas的尝试
df = pd.read_csv('myfile.txt', delimiter='\s+')
# and
df = pd.read_csv('myfile.txt', sep='\s+')
导致此错误:
ParserError Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 df = pd.read_csv('myfile.txt', delimiter='\s+')
not including full output, running out of characters
ParserError: Error tokenizing data. C error: Expected 15 fields in line 9, saw 22
我已经删除了一些其他的文本清理代码,我试过了...这不是最好的举动,但哦,好吧,我想必须有一些上游我可以做的,我只是不确定?
1条答案
按热度按时间7ivaypg91#
read_fwf
似乎是一个很好的 * 用例 *:输出:
要单独访问每个表,请使用dict key-indexing:
使用
concat
,您可以在同一个 DataFrame 中获得所有表: