pandas: How do I extract the table in the plaintext below as a pandas DataFrame in Python?

Posted by 368yc8dk on 2023-01-07 in Python

I have a text file like the one below, and I want to extract the table it contains as a pandas DataFrame.
What I have:
"data.txt"

Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994
Discussion List:       edgar-interest@town.hall.org
To Subscribe to List:  Majordomo@town.hall.org
General Information:   info@radio.com
E-mail server:         mail@town.hall.org
Anonymous FTP:         ftp://town.hall.org/edgar/daily-index/form.070194.idx

Form Type   Company Name            CIK         Date Filed  File Name 
-----------------------------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt   
10-B        XYZ.                    121234      54547474    file2.txt   
10-A        LMN.                    346765      12352356    file3.txt

What I want: a pandas DataFrame with the following structure

Form Type   Company Name     CIK          Date Filed   File Name 
-------------------------------------------------------------------
| 10-C      | ABC          | 310254     | 19940701   | file1.txt  | 
| 10-B      | XYZ          | 121234     | 54547474   | file2.txt  |
| 10-A      | LMN          | 346765     | 12352356   | file3.txt  |
-------------------------------------------------------------------

Here is my code:

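import re

# test holds the full contents of data.txt as one string
test = test.split('\n')
# drop every line up to and including the dashed separator
while not re.search('^--*', test[0]): test.pop(0)
test.pop(0)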

rows = []
for row in test:
  rows.append(row.split())

print(rows)

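I find the first occurrence of the dashed line and then append the rows that follow it to a list, which I later convert to a DataFrame. However, I believe there must be a cleaner way to do this, which is why I am asking for your contributions/support.
Thanks, and Happy New Year!! :)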

pandas

Source: https://stackoverflow.com/questions/74977241/how-do-i-extract-the-table-in-the-below-plaintext-as-a-pandas-dataframe-in-pytho

3 Answers

j0pj023g #1

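Find the empty line that separates the extra lines at the start from the table you want, then pass the rest of the file buffer to the pd.read_table function: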

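import pandas as pd

with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if not line:  # find empty line
            break
    # f is now positioned at the table header; columns are split on 2+ whitespace characters
    df = pd.read_table(f, sep=r'\s{2,}', header=0, comment='--', engine='python')
    print(df)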

Output:

Form Type Company Name     CIK  Date Filed  File Name
0      10-C         ABC.  310254    19940701  file1.txt
1      10-B         XYZ.  121234    54547474  file2.txt
2      10-A         LMN.  346765    12352356  file3.txt
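
To try the idea without a file on disk, the sample can be fed through io.StringIO. The sketch below is a variation, not the answer's exact code: it looks for the dashed line instead of the blank line and drops it by hand, so it does not rely on the comment argument. SAMPLE (an abbreviated copy of the question's file) and read_plain_table are names made up for illustration.

```python
import re
from io import StringIO

import pandas as pd

# Abbreviated copy of the sample from the question (illustrative only).
SAMPLE = """\
Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994

Form Type   Company Name            CIK         Date Filed  File Name
---------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt
10-B        XYZ.                    121234      54547474    file2.txt
10-A        LMN.                    346765      12352356    file3.txt
"""


def read_plain_table(text):
    """Hypothetical helper: parse the table around the first dashed line."""
    lines = text.splitlines()
    # Index of the first line that consists only of dashes.
    dash_idx = next(
        i for i, line in enumerate(lines) if re.fullmatch(r"-{2,}", line.strip())
    )
    # Keep the header line just above the dashes plus every non-blank line below.
    table_lines = [lines[dash_idx - 1]]
    table_lines += [line for line in lines[dash_idx + 1:] if line.strip()]
    # Columns are separated by runs of two or more whitespace characters.
    return pd.read_csv(
        StringIO("\n".join(table_lines)), sep=r"\s{2,}", engine="python"
    )


df = read_plain_table(SAMPLE)
print(df)
```

Splitting on two or more whitespace characters keeps two-word headers such as "Form Type" in one piece, which a plain whitespace split would not.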

2023-01-07

7dl7o3gd #2

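pandas has a method, pd.read_fwf(), which accepts parameters for skipping records in the file, such as the number of initial lines to skip and whether to skip blank lines.
Use this function to read the file if you know in advance where the data starts.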
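
This answer came without code, so here is a rough sketch of how pd.read_fwf might be applied to the sample. It makes two assumptions that are mine, not the answer's: the column names sit on the line directly above the dashed line, and every value starts at or after the position where its column name starts. The column boundaries are taken from the header line and passed as colspecs, and skiprows jumps past everything up to and including the dashed line. SAMPLE is an abbreviated copy of the question's file.

```python
import re
from io import StringIO

import pandas as pd

# Abbreviated copy of the sample from the question (illustrative only).
SAMPLE = """\
Description:           Daily Index of EDGAR Dissemination Feed by Form Type
Last Data Received:    July 01, 1994

Form Type   Company Name            CIK         Date Filed  File Name
---------------------------------------------------------------------
10-C        ABC.                    310254      19940701    file1.txt
10-B        XYZ.                    121234      54547474    file2.txt
10-A        LMN.                    346765      12352356    file3.txt
"""

lines = SAMPLE.splitlines()
# Assumption: the column names are on the line directly above the dashed line.
dash_idx = next(i for i, line in enumerate(lines) if line.startswith("--"))
header = lines[dash_idx - 1]

# A column name is a run of words joined by single spaces; note where each starts.
matches = list(re.finditer(r"\S+(?: \S+)*", header))
names = [m.group() for m in matches]
starts = [m.start() for m in matches]
# Each field spans from its own start to the next start; None means end of line.
colspecs = list(zip(starts, starts[1:] + [None]))

df = pd.read_fwf(
    StringIO(SAMPLE),
    colspecs=colspecs,
    names=names,
    header=None,
    skiprows=dash_idx + 1,  # preamble, header line and dashed line
)
print(df)
```

If the layout is fixed and known ahead of time, the computed colspecs and skiprows could simply be replaced with hard-coded values.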

2023-01-07

w80xi6nr #3

You could try something like this:

import re
from io import StringIO

import pandas as pd

input_string = 'the content of my text file (PLEASE REPLACE)' # TODO replace with actual content or code to load the content
nr_of_separators = 20 # minimum number of '-' in the separator line (I just assumed 20, I did not count)
# split once, on the first run of at least nr_of_separators dashes
header_split_from_data = re.split('-{%d,}' % nr_of_separators, input_string, maxsplit=1)

# create columns assuming there are always at least two spaces between column names
headerlines = header_split_from_data[0].strip().split('\n') # split the lines
columns = re.split(r'\s{2,}', headerlines[-1].strip()) # extract the column names from the last line of the header

# create csv string from data assuming the data does not contain strings with spaces
lines = header_split_from_data[1].strip().split('\n') # separate lines
csv_string_lines = [','.join(line.split()) for line in lines] # this splits the string on whitespace
csv_string = '\n'.join(csv_string_lines)

# load dataframe
df = pd.read_csv(StringIO(csv_string), sep=",", header=None, names=columns)

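There is probably a simpler and more efficient way to do this, but this is an easy-to-implement version that I came up with (haters would call it quick and dirty :D).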
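
One limitation of the version above is that it assumes no value contains a space. If company names can contain single spaces, splitting both the header and the data rows on runs of two or more spaces may be enough. The sketch below shows that variation; the TABLE string, including the "ABC Corp." name, is made-up data for illustration and not part of the question's file.

```python
import re

import pandas as pd

# Made-up rows in the same layout as the question's table;
# the first company name contains a single space on purpose.
TABLE = """\
Form Type   Company Name            CIK         Date Filed  File Name
---------------------------------------------------------------------
10-C        ABC Corp.               310254      19940701    file1.txt
10-B        XYZ.                    121234      54547474    file2.txt
"""

# First line is the header, second is the dashed line, the rest is data.
header, _dashes, *data = TABLE.splitlines()
two_or_more_spaces = re.compile(r"\s{2,}")

columns = two_or_more_spaces.split(header.strip())
rows = [two_or_more_spaces.split(line.strip()) for line in data if line.strip()]

# Every value is still a string here; convert the numeric column explicitly.
df = pd.DataFrame(rows, columns=columns)
df["CIK"] = df["CIK"].astype(int)
print(df)
```

This still breaks if a value contains two consecutive spaces, or if two columns are separated by only one space.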

2023-01-07
