我想将使用Tabula下载的PDF文件中的数据读取到Google Sheets中,当我将读取的数据传输到Google Sheets中时,我收到一个错误。我知道我下载的数据是脏的,但我想在Google Sheets中清理它。
从完整代码部分的PDF部分下载数据
import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)
[ Anderson 19,212 9,013 74 1,034 42 174 189 28 0 0.1
0 Bedford 11,486 3,395 25 306 8 47 75 5 0 0
1 Benton 4,716 1,474 12 83 13 11 14 2 0 0
2 Bledsoe 3,622 897 7 95 4 9 18 2 0 0
3 Blount 37,443 12,100 83 1,666 72 250 313 51 1 1
4 Bradley 29,768 7,070 66 1,098 44 143 210 29 1 1
5 Campbell 9,870 2,248 32 251 25 43 45 5 0 0
6 Cannon 4,007 1,127 8 106 7 18 29 3 0 0
7 Carroll 7,756 2,327 22 181 20 18 39 2 0 0
8 Carter 16,898 3,453 30 409 20 54 130 26 0 0
9 Cheatham 11,297 3,878 26 463 13 50 99 8 0 0
10 Chester 5,081 1,243 5 115 4 12 10 4 0 0
11 Claiborne 8,602 1,832 16 192 24 27 29 2 0 0
12 Clay 2,141 707 2 47 2 10 11 0 0 0
13 Cocke 9,791 1,981 21 211 19 27 59 2 0 2
14 Coffee 14,417 4,743 32 517 23 62 113 9 0 1
15 Crockett 3,982 1,303 7 76 3 8 13 1 0 0
16 Cumberland 20,413 5,202 37 471 26 53 99 17 0 1
17 Davidson 84,550 148,864 412 9,603 304 619 2,459 106 0 6
18 Decatur 3,588 894 5 70 4 8 16 2 0 0
19 DeKalb 5,171 1,569 10 117 6 29 49 0 0 0
20 Dickson 13,233 4,722 32 489 18 58 94 9 0 3
21 Dyer 10,180 2,816 19 193 13 27 48 3 0 0
22 Fayette 13,055 5,874 19 261 16 37 62 21 0 0
23 Fentress 6,038 1,100 10 107 14 11 37 1 0 0
24 Franklin 11,532 4,374 28 319 16 36 66 7 0 0
25 Gibson 13,786 5,258 26 305 18 36 66 8 0 0
26 Giles 7,970 2,917 16 162 11 11 41 1 0 0
27 Grainger 6,626 1,154 17 130 12 28 26 4 0 0
28 Greene 18,562 4,216 28 481 29 56 152 14 0 0
29 Grundy 3,636 999 11 80 3 13 19 0 0 0
30 Hamblen 15,857 4,075 30 443 27 73 93 8 0 0
31 Hamilton 78,733 55,316 147 5,443 138 349 1,098 121 0 0
32 Hancock 1,843 322 4 42 1 5 13 0 0 0
33 Hardeman 4,919 4,185 18 84 11 13 30 9 0 0
34 Hardin 8,012 1,622 15 134 22 48 96 0 0 0
35 Hawkins 16,648 3,507 31 397 12 52 91 7 0 3
36 Haywood 3,013 3,711 11 60 10 10 19 0 0 0
37 Henderson 8,138 1,800 13 172 9 27 39 1 0 0
38 Henry 9,508 3,063 18 223 15 27 60 4 0 0
39 Hickman 5,695 1,824 20 161 19 15 39 18 0 0
40 Houston 2,182 866 9 88 4 7 12 0 0 0
41 Humphreys 4,930 1,967 17 166 12 23 26 5 0 0
42 Jackson 3,236 1,129 2 62 1 7 17 1 0 0
43 Jefferson 14,776 3,494 34 497 22 76 115 8 0 1
44 Johnson 5,410 988 11 102 7 9 39 6 0 0
45 Knox 105,767 62,878 382 7,458 227 986 1,634 122 0 9
46 Lake 1,357 577 5 18 1 6 6 0 0 0, Lauderdale 4,884 3,056 14 87 13 10 14.1 \
0 Lawrence 12,420 2,821 21 271 13 36 77
1 Lewis 3,585 890 14 59 8 9 42
2 Lincoln 10,398 2,554 19 231 13 39 46
3 Loudon 17,610 4,919 41 573 22 77 87
只是我提取的数据的一个样本。同样,这不是我完全想象的,但是作为一个初学者,我想在工作表中清理它
HERE is an image of the PDF I was downloading data from.
Here is the link to download the PDF I am downloading data from
现在我想导入gspread和gpsread_dataframe以上传到Google工作表选项卡,这是我遇到问题的地方。
编辑:虽然这两个部分都没有包含我的所有编码,但现在顶部和底部包含了我迄今为止完成的所有编码。
from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')
/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
260 # If header-related params are True, the values are adjusted
261 # to allow space for the headers.
--> 262 y, x = dataframe.shape
263 index_col_size = 0
264 column_header_size = 0
AttributeError: 'list' object has no attribute 'shape'
这是否与我的数据是如何从我的PDF中提取的有关?
1条答案
按热度按时间06odsfpq1#
df
似乎是一个列表,因此尝试将参数output_format='dataframe'
传递给tabula.read_pdf()
函数,如下所示:此外,我建议您看一看PEP8样式指南,以便更好地了解如何编写格式良好的脚本。