Python: problem extracting tabular data from a PDF

Posted by gcuhipw9 on 2023-02-15 in Python

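I am trying to extract a table from a PDF that lists the names of many media sources. The desired output is a single comprehensive CSV file with one column listing all the sources.
I am trying to write a simple Python script to extract the tabular data from the PDF file. The output I am able to get is one CSV per table. I then use the concat function to merge all the files. The result is messy: the file contains extra punctuation and a lot of whitespace.
Can someone help me get a better result?
Code: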

from camelot import read_pdf
import glob
import pandas as pd

# Get all the tables within the file
all_tables = read_pdf("/Users/zago/code/pdftext/pdftextvenv/mimesiweb.pdf", pages = 'all')

# Show the total number of tables in the file
print("Total number of tables: {}".format(all_tables.n))
 
# print all the tables in the file
for t in range(all_tables.n):
    print("Table n°{}".format(t))
    print((all_tables[t].df).head())

#convert to excel or csv 
#all_tables.export('table.xlsx', f="excel")
all_tables.export('table.csv', f="csv")

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',') for f in all_filenames ])
#export to csv
combined_csv.to_csv("combined_csv_tables.csv", index=False, encoding="utf-8")

(Images in the original post: starting point PDF, result for 1 CSV, combined CSV.)
Thanks

python

Source: https://stackoverflow.com/questions/75445706/problem-extracting-tabular-data-from-a-pdf

1 answer

plicqrtu1#

Select only the first column of each file before concatenating, then save.
Just use the following line of code:

combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',').iloc[:,0] for f in all_filenames ])

Output:

In [25]: combined_csv
Out[25]:
0                 Interni.it
1                  Intima.fr
2              Intimo Retail
3     Intimoda Magazine - En
4          Intorno Tirano.it
               ...
47          Alessandria Oggi
48               Aleteia.org
49              Alibi Online
50               Alimentando
51       All About Italy.net
Length: 2824, dtype: object
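
The index in the output above restarts for every file, and the question also mentions stray whitespace. A minimal sketch of one way to handle both is shown here; it is not part of the original answer. The function name `combine_first_columns`, the `header` parameter, the column name `source` and the sample data are all hypothetical. The default `header=None` assumes the exported CSVs have no header row; pass `header=0` if yours do.

```python
import io
import pandas as pd


def combine_first_columns(csv_sources, header=None):
    """Concatenate the first column of several CSVs into one cleaned Series.

    header=None assumes the CSVs have no header row (an assumption, not
    something stated in the original post); pass header=0 otherwise.
    """
    columns = [
        pd.read_csv(src, encoding="utf-8", header=header, dtype=str).iloc[:, 0]
        for src in csv_sources
    ]
    # ignore_index=True gives one continuous index instead of one per file.
    combined = pd.concat(columns, ignore_index=True)
    # Collapse runs of whitespace (including newlines) and trim the ends.
    combined = combined.str.replace(r"\s+", " ", regex=True).str.strip()
    # Drop missing and empty cells, then renumber the rows.
    combined = combined[combined.notna() & (combined != "")]
    return combined.reset_index(drop=True).rename("source")


# Hypothetical sample data standing in for two exported table files.
sample_a = io.StringIO('"Interni.it","x"\n"  Intima.fr ","y"\n')
sample_b = io.StringIO('"Intimo\n Retail","z"\n"",""\n')
result = combine_first_columns([sample_a, sample_b])
print(result.tolist())
```

With the files from the question, the same function could be called as `combine_first_columns(all_filenames).to_csv("combined_csv_tables.csv", index=False)`. Note that `glob.glob('*.csv')` also matches `combined_csv_tables.csv` if the script is run a second time, so a narrower pattern may be safer.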

Final CSV output:

Answered on 2023-02-15
