I have a PDF and I am using the PDE package to extract tables from it. This works, but not quite the way I want.

library(PDE)

myTables <- PDE_pdfs2table(pdf = 'GPI-2023-Web.pdf')
Following file is processing: 'GPI-2023-Web.pdf'
No filter words chosen for analysis.
The following table was detected but not processable for extraction: Table 3.2 shows a breakdown of the change in the e
27 table(s) found in 'GPI-2023-Web.pdf'.
Analysis of 'GPI-2023-Web.pdf' complete.

This extracts every table and dumps each one as a separate CSV file into a subfolder named tables.

cd tables/
[tables]$ ls
GPI-2023-Web_#010_table1.csv        GPI-2023-Web_#024_table3.csv
GPI-2023-Web_#011_table1.csv        GPI-2023-Web_#025_table1.csv
GPI-2023-Web_#012_table1.csv        GPI-2023-Web_#026_table1.csv
GPI-2023-Web_#013_table3.csv        GPI-2023-Web_#027_table1.csv
GPI-2023-Web_#014_table3.csv        GPI-2023-Web_#02_table1.csv
GPI-2023-Web_#015_table3.csv        GPI-2023-Web_#03_table1.csv
GPI-2023-Web_#017_table3.csv        GPI-2023-Web_#04_table1.csv
GPI-2023-Web_#018_table3.csv        GPI-2023-Web_#05_table1.csv
GPI-2023-Web_#019_table3.csv        GPI-2023-Web_#06_table1.csv
GPI-2023-Web_#01_table1.csv     GPI-2023-Web_#07_table1.csv
GPI-2023-Web_#020_table3.csv        GPI-2023-Web_#08_table1.csv
GPI-2023-Web_#021_table3.csv        GPI-2023-Web_#09_table1.csv
GPI-2023-Web_#022_table1.csv        GPI-2023-Web_page39_w.table-000039.png
GPI-2023-Web_#023_table2.csv
[tables]$ grep -l 'Safety and Security domain' *.csv
GPI-2023-Web_#011_table1.csv
GPI-2023-Web_#01_table1.csv
GPI-2023-Web_#023_table2.csv
GPI-2023-Web_#03_table1.csv
[tables]$ vi GPI-2023-Web_#01_table1.csv
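
The same lookup can also be done from R after extraction instead of with grep. The following is a minimal sketch and not part of the PDE package: find_tables_with is a hypothetical helper name, and it assumes the CSV files sit in a tables subfolder as in the listing above. It does a literal, byte-wise match so that files written in a non-UTF-8 encoding do not cause errors.

```r
# Hypothetical helper (not a PDE function): return the CSV files under `dir`
# that contain the literal text `caption` on any line.
find_tables_with <- function(dir, caption) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  hits <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    # fixed = TRUE: treat `caption` as plain text, not as a regex
    # useBytes = TRUE: compare bytes, so the file encoding does not matter
    any(grepl(caption, lines, fixed = TRUE, useBytes = TRUE))
  }, logical(1))
  files[hits]
}

# Example usage, assuming the folder layout shown above:
# find_tables_with("tables", "Safety and Security domain")
```

This still extracts every table first and filters afterwards, so it is a post-processing workaround rather than a way to extract a single table.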

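Although I can pick out the specific table I want and post-process it, I would like to extract one very specific table, titled Table 1.1: Safety and Security domain, and nothing else.
Is this possible?
Using PDE_pdfs2table_searchandfilter sounded promising, but none of the search.words and filter.words options I tried actually worked. It still extracted many tables.
Note: the PDF file above can be downloaded here: GPI-2023-Web.pdf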

PDE_pdfs2table_searchandfilter(
  pdf = 'GPI-2023-Web.pdf',
  search.words = 'TABLE 1\\.1\\b', # short for c('TABLE 1\\.1\\b')
  #ignore.case.sw = FALSE,        # search words are case sensitive (default)
  #regex.sw = TRUE,               # use regex rules for search words
  eval.abbrevs = FALSE,           # don't detect abbreviations, use search words as they are
  exp.nondetc.tabs = FALSE,       # don't save images for failed to read tables
  write.tab.doc.file = FALSE      # don't write info about failed to read tables
)

1 Answer

qnzebej01#

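PDE_pdfs2table_searchandfilter works well for this, especially when search.words is given as a regular expression (treating search words as regular expressions is the default behavior).
For this specific example, you can use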

search.words = 'TABLE 1\\.1\\b'

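The first escape sequence, \. (the double backslash in the string literal evaluates to a single backslash before it is passed to the regex engine), matches a literal dot. In a regular expression an unescaped dot . is a special character that matches any single character, so the regex 1.1 (no escape) matches "1.1" but also "101".
The second escape sequence, \b, stands for a word boundary. Without it, the regex 1\\.1 matches 1.1 but also the first part of 1.11 (a partial match).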
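
The effect of the two escape sequences can be checked with base R's grepl, independently of PDE. This is only an illustrative sketch: the caption strings are made up for the demonstration, and it shows how R's own regex engine treats the patterns, not how PDE applies them internally.

```r
# Made-up caption strings, for illustration only
captions <- c("TABLE 1.1",
              "TABLE 101",
              "TABLE 1.11",
              "TABLE 1.1: Safety and Security domain")

# In an R string literal "\\" is one backslash, so this pattern has 4 characters: 1 \ . 1
pattern_length <- nchar("1\\.1")

unescaped <- grepl("TABLE 1.1", captions)       # dot matches any character
escaped   <- grepl("TABLE 1\\.1", captions)     # dot must be a literal dot
bounded   <- grepl("TABLE 1\\.1\\b", captions)  # and a word boundary must follow

print(data.frame(captions, unescaped, escaped, bounded))
```

Only the bounded pattern rejects both "TABLE 101" and "TABLE 1.11" while still accepting the caption followed by a colon.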
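The full call to PDE_pdfs2table_searchandfilter can be the one shown just above this answer (the basic arguments whose values are already the defaults are commented out).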

2023-10-13

R: Extracting a single numbered table from a PDF using the PDE package
