pandas 如何从python写入txt,其中将标题中的日期复制到txt文件中的每条记录

8yoxcaq7  于 2023-03-16  发布在  Python
关注(0)|答案(2)|浏览(110)

我有一个文本文件,它看起来像下面(仅快照):

DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000

3年来,我有大约1m条记录。第一条记录“DC000D20221110012022100019”是标题,其日期位于[6:14]。我需要将此数据导入到数据框中进行探索性分析,对于此分析,我需要每条记录显示日期,而不是标题。因此,我需要类似以下内容:

DC000D20221110012022100019
DC011D20221110   AV0019000300080180003340501031800481200000
DC011D20221110   AV0019000300083180003361901031900071900000
DC011D20221110   AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D20221209   CG0019000300080220100264401000000000000000
DC011D20221209   CG0019000300080220400885101000039990700000
DC011D20221209   CG0019000300080220400885101000040013000000

这样我就可以很容易地把它导入Pandasdf了。
形成上述主文本文件,我准备另一个txt文件,其中只有DC011记录,如下所示:

子模块DC011的文件

File1=open(r"\path\CRGDEC\CRGDEC.txt")
File2=open(r"\path\CRGDEC_DC011.txt", "w")

for line in File1.readlines():
    if (line.startswith('DC011')):
        File2.write(line)

但是这样做会删除DC000的头记录,而且我无法使用df中的fillna()选项来准备Date列。
注意:我也有类似方式的其他模块(如DC012、DC013、DC014)。

djp7away

djp7away1#

如果您只是添加一个if来检查标题记录,您也可以写入它(并记住要附加到后续行的日期)

File1=open(r"\path\CRGDEC\CRGDEC.txt")
File2=open(r"\path\CRGDEC_DC011.txt", "w")

saveDate = ""   # just in case
for line in File1.readlines():
    if (line.startswith('DC000'):
        saveDate = line[6:14]
        File2.write(line)
    if (line.startswith('DC011')):
       line2write = line[:6] + saveDate + line[7:]
        File2.write(line2write)

我没有测试过,但应该差不多。

ux6nzvsh

ux6nzvsh2#

您可以通过在read_csv中指定分隔符和列名直接导入Pandas,然后进行一些操作以获得所需的格式:

import pandas as pd
from io import StringIO

data = StringIO("""DC000D20221110012022100019
DC011D           AV0019000300080180003340501031800481200000
DC011D           AV0019000300083180003361901031900071900000
DC011D           AV0019000300089180003378701032100515800000
DC000D20221209012022100019
DC011D           CG0019000300080220100264401000000000000000
DC011D           CG0019000300080220400885101000039990700000
DC011D           CG0019000300080220400885101000040013000000
""")

df =  pd.read_csv(data, sep=r"\s+", engine="python", names=['ref', 'value'])
df['date'] = df.loc[df['value'].isna(), 'ref'].str[6:14]
df['date'] = df['date'].ffill()
mask = ~df['value'].isna()
df.loc[mask, 'ref'] = df[mask]['ref'] + df[mask]['date']
print(df.drop(columns='date'))

输出:

ref                                       value
0  DC000D20221110012022100019                                        None
1              DC011D20221110  AV0019000300080180003340501031800481200000
2              DC011D20221110  AV0019000300083180003361901031900071900000
3              DC011D20221110  AV0019000300089180003378701032100515800000
4  DC000D20221209012022100019                                        None
5              DC011D20221209  CG0019000300080220100264401000000000000000
6              DC011D20221209  CG0019000300080220400885101000039990700000
7              DC011D20221209  CG0019000300080220400885101000040013000000

相关问题