pandas: How can I extract the table in the plain text below as a pandas DataFrame in Python?

368yc8dk  posted on 2023-01-07 in Python

I have a text file like the one below. I want to extract the table embedded in this plain text as a pandas DataFrame.
What I have:
"data.txt"

Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994
Discussion List:       edgar-interest@town.hall.org
To Subscribe to List:  Majordomo@town.hall.org
General Information:   info@radio.com
E-mail server:         mail@town.hall.org
Anonymous FTP:         ftp://town.hall.org/edgar/daily-index/form.070194.idx

Form Type   Company Name            CIK         Date Filed  File Name 
-----------------------------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt   
10-B        XYZ.                    121234      54547474    file2.txt   
10-A        LMN.                    346765      12352356    file3.txt

What I want: a pandas DataFrame with the following structure

Form Type   Company Name     CIK          Date Filed   File Name 
-------------------------------------------------------------------
| 10-C      | ABC          | 310254     | 19940701   | file1.txt  | 
| 10-B      | XYZ          | 121234     | 54547474   | file2.txt  |
| 10-A      | LMN          | 346765     | 12352356   | file3.txt  |
-------------------------------------------------------------------

Here is my code:

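import re

# `test` initially holds the full contents of data.txt as one string
test = test.split('\n')
# drop every line up to and including the first dashed separator line
while not re.search('^--*', test[0]): test.pop(0)
test.pop(0)

rows = []
for row in test:
  rows.append(row.split())  # split each data line on whitespace

print(rows)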

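I find the first dashed line and then append the rows that follow it to a list. Later I convert that list to a DataFrame. However, I believe there must be a cleaner way to do this, which is why I am asking for your input.
Thanks, and happy new year!! :)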
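
For reference, the conversion step mentioned above could be sketched roughly as follows. This is only a sketch under stated assumptions: the `sample` string is a shortened stand-in for data.txt, the column names are taken from the line just before the dashed separator, and no field in the data rows contains whitespace. All values stay as strings.

```python
import re

import pandas as pd

# Shortened stand-in for the contents of data.txt (assumption for this sketch)
sample = """\
Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994

Form Type   Company Name            CIK         Date Filed  File Name
-----------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt
10-B        XYZ.                    121234      54547474    file2.txt
10-A        LMN.                    346765      12352356    file3.txt
"""

lines = sample.split("\n")

# Index of the first line that consists only of dashes
sep_idx = next(i for i, line in enumerate(lines) if re.match(r"-+$", line.strip()))

# Column names sit on the line before the separator, two or more spaces apart
columns = re.split(r"\s{2,}", lines[sep_idx - 1].strip())

# Data rows follow the separator; skip blank lines and split on whitespace
rows = [line.split() for line in lines[sep_idx + 1:] if line.strip()]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

Because the rows are split manually, numeric columns such as CIK would still need an explicit conversion (for example with `pd.to_numeric`) if numbers are required.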


j0pj023g  #1

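Find the empty line that separates the extra lines at the start from the table you need, then pass the rest of the file buffer to the pd.read_table function: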

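import pandas as pd

with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if not line:  # find empty line
            break
    # regex separator (2+ spaces) needs the python engine; comment='--' drops the dashed line
    df = pd.read_table(f, sep=r'\s{2,}', header=0, comment='--', engine='python')
    print(df)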

Output:

  Form Type Company Name     CIK  Date Filed  File Name
0      10-C         ABC.  310254    19940701  file1.txt
1      10-B         XYZ.  121234    54547474  file2.txt
2      10-A         LMN.  346765    12352356  file3.txt

7dl7o3gd  #2

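There is a method pd.read_fwf(), which accepts parameters for skipping records, for example skiprows (the number of initial lines to skip) and an option for skipping blank lines.
Use this function to read the file if you know in advance where the data starts.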
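
A minimal sketch of that idea, not necessarily what the answerer had in mind: it assumes you know how many lines come before the first data row (5 in the shortened `sample` below, which stands in for data.txt), so it skips the metadata, the header line, and the dashed line, and supplies the column names by hand. The column boundaries are left to the default inference in `pd.read_fwf`, which assumes no field contains spaces.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the contents of data.txt (assumption for this sketch)
sample = """\
Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994

Form Type   Company Name            CIK         Date Filed  File Name
-----------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt
10-B        XYZ.                    121234      54547474    file2.txt
10-A        LMN.                    346765      12352356    file3.txt
"""

# Column names copied from the header line of the file
columns = ["Form Type", "Company Name", "CIK", "Date Filed", "File Name"]

# skiprows=5 skips the two metadata lines, the blank line, the header line
# and the dashed line, so only the data rows are parsed
df = pd.read_fwf(StringIO(sample), skiprows=5, header=None, names=columns)
print(df)
```

With a real file you would pass the path instead of `StringIO(sample)` and adjust `skiprows` to the actual number of lines before the data. The header line is skipped on purpose here: names such as "Form Type" contain a single space, which can confuse the automatic column detection.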


w80xi6nr  #3

You could try something like this:

import re
from io import StringIO

import pandas as pd

input_string = 'the content of my text file (PLEASE REPLACE)' # TODO replace with actual content or code to load the content
nr_of_separators = 20 # minimum number of '-' in the separator line (I just assumed 20, I did not count)
# split once on the first run of at least nr_of_separators dashes: [header part, data part]
header_split_from_data = re.split('-' * nr_of_separators + '+', input_string, maxsplit=1)

# create columns assuming there is always at least two spaces between column names
headerlines = header_split_from_data[0].strip().split('\n') # split the lines
columns = re.split(r'\s{2,}', headerlines[-1].strip()) # extract the column names from the last line of the header

# create csv string from data assuming the data does not contain strings with spaces
lines = header_split_from_data[1].strip().split('\n') # separate lines
csv_string_lines = [','.join(line.split()) for line in lines] # this splits the string on whitespaces
csv_string = '\n'.join(csv_string_lines)

# load dataframe
df = pd.read_csv(StringIO(csv_string), sep=",", header=None, names=columns)

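There is probably a simpler and more efficient way to do this, but this is an easy-to-implement version that I came up with (haters would call it quick and dirty :D).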
