How to extend an .RDS text corpus using R and the readr package

qmb5sa22  posted on 2023-05-20 in Other

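I am trying to extend a text corpus that was provided to me. The corpus is stored as an .RDS file, and I need to extend it with the text from 20 different PDF documents, where each PDF file becomes its own document entry in the corpus.
The packages I am using in the project are: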

  • readr
  • tidyverse
  • tidytext
  • quanteda
  • tm

These are the paths of all the PDFs I am trying to convert to text and add to the corpus:

pdf_paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "NGODocuments/F2662175_Allied-Startups_NGO.pdf",
           "NGODocuments/F2662292_Civil-Liberties_NGO.pdf",
           "NGODocuments/F2662654_PGEU_NGO.pdf",
           "NGODocuments/F2663061_Not-for-profit-law_NGO.pdf",
           "NGODocuments/F2663127_Eurocities_NGO.pdf",
           "NGODocuments/F2663268_European-Disability_NGO.pdf",
           "NGODocuments/F2663380_Information-Accountability_NGO.pdf",
           "NGODocuments/F2665208_Hospital-Pharmacy_NGO.pdf",
           "NGODocuments/F2665222_European-Radiology_NGO.pdf",
           "BusinessDocs/123_DeepMind_Business.pdf",
           "BusinessDocs/1234_LinedIn_Business.pdf",
           "BusinessDocs/12345_AVAAZ_Business.pdf",
           "BusinessDocs/F2488672_SAZKA_Business.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf",
           "BusinessDocs/F2662771_SICK_Business.pdf",
           "BusinessDocs/F2662846_sanofi_Business.pdf",
           "BusinessDocs/F2662935_EnBV_Business.pdf", 
           "BusinessDocs/F2662941_Siemens_Business.pdf",
           "BusinessDocs/F2662944_BlackBerry_Business.pdf")

This is the code I wrote to try to extract the text and then extend the corpus:

pdf_text <- lapply(pdf_paths, read_file)
corpus <- tm::Corpus(VectorSource(pdf_text))

prev_corpus <- readRDS("data_corpus_aiact.RDS")
new_corpus <- c(prev_corpus, corpus)
writeCorpus(new_corpus, filenames = pdf_paths)

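However, when I run this code, the line that creates new_corpus fails with this error:
Error: as.corpus() only works on corpus objects.
I have searched the web for a solution, but nothing I found seems to work. I did try the pdftools package once, but converting the PDF files to text produced an error saying a document has an invalid font weight, which is why I switched to readr.
The goal is to produce a new corpus that contains everything in the old corpus plus the new documents, and to save it as a new .RDS file.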

zbdgwd5y  #1

Here is how I would do it, using only quanteda and readtext:

library("quanteda")
#> Package version: 3.3.0.9001
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

prev_corpus <- readRDS("~/Downloads/pdf documents/data_corpus_aiact.rds")
pdfpath <- "~/Downloads/pdf documents/PDF documents/NGODocuments/*.pdf"

new_corpus <- readtext::readtext(pdfpath, 
                                 docvarsfrom = "filenames",
                                 docvarnames = c("id", "actor", "type_actor")) |>
    corpus()
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight

Your PDF files contain some oddities, but that is not unusual. You should inspect the texts to check that readtext::readtext() converted them correctly.
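One way to do that inspection is sketched below, in base R only. It assumes the extracted texts are available as a named character vector, which is what as.character(new_corpus) returns for a quanteda corpus. The vector `texts`, the "Broken (NGO)" entry, and the 20-character threshold are all made up for illustration.

```r
# Stand-in for as.character(new_corpus): a named character vector of texts.
# The entries are illustrative; "Broken (NGO)" is a hypothetical failed PDF.
texts <- c(
  "EPIC (NGO)"            = "FEEDBACK OF THE ELECTRONIC PRIVACY INFORMATION",
  "Allied-Startups (NGO)" = "Feedback reference F2662175 Submitted on 13 July 2021",
  "Broken (NGO)"          = ""
)

# Per-document size summary: very small counts suggest a failed extraction.
summary_tbl <- data.frame(
  doc     = names(texts),
  n_char  = unname(nchar(texts)),
  n_words = unname(lengths(strsplit(trimws(texts), "\\s+"))),
  stringsAsFactors = FALSE
)
print(summary_tbl)

# Flag documents below an arbitrary length threshold (assumption: 20 chars).
suspicious <- summary_tbl$doc[summary_tbl$n_char < 20]
print(suspicious)

# Eyeball the first characters of each document for garbled output.
print(substr(texts, 1, 40))
```

Any document that ends up in `suspicious` is worth opening by hand, since a scanned or image-only PDF typically yields little or no text.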
Now we can change the document names to match the format used in the RDS file:

docnames(new_corpus) <- with(docvars(new_corpus),
                             paste0(actor, " (", type_actor, ")"))
print(new_corpus, 2)
#> Corpus consisting of 40 documents and 3 docvars.
#> EPIC (NGO) :
#> "          FEEDBACK OF THE ELECTRONIC PRIVACY INFORMATION CEN..."
#> 
#> Allied-Startups (NGO) :
#> "Feedback reference F2662175 Submitted on 13 July 2021 Submit..."
#> 
#> [ reached max_ndoc ... 38 more documents ]
head(docvars(new_corpus))
#>         id              actor type_actor
#> 1  1234567               EPIC        NGO
#> 2 F2662175    Allied-Startups        NGO
#> 3 F2662292    Civil-Liberties        NGO
#> 4 F2662654               PGEU        NGO
#> 5 F2663061 Not-for-profit-law        NGO
#> 6 F2663127         Eurocities        NGO
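The id, actor, and type_actor columns shown above come from the file names, which readtext splits on the underscore separator when docvarsfrom = "filenames" is set. The base R sketch below only mimics that idea to make the mapping visible. It is not readtext's implementation, and the object name `filename_docvars` is hypothetical.

```r
# Two of the paths from the question, used as sample input.
paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf")

# Drop the directory and the extension, then split on "_".
stems <- tools::file_path_sans_ext(basename(paths))
parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# Assemble the three fields under the names used in the answer.
filename_docvars <- data.frame(
  id         = parts[, 1],
  actor      = parts[, 2],
  type_actor = parts[, 3],
  stringsAsFactors = FALSE
)
print(filename_docvars)
```

This only works when every file name has exactly three underscore-separated parts. Hyphens inside a part, as in "Not-for-profit-law", are harmless, but an extra underscore would shift the fields.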

Some of the new document names will clash with names in the old corpus, and in quanteda document names should be unique. So:

# to avoid duplicated docids
duplicated_index <- which(docnames(new_corpus) %in% docnames(prev_corpus))
docnames(new_corpus)[duplicated_index] <- 
    paste(docnames(new_corpus)[duplicated_index], "new")
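Before merging, it may be worth confirming that the renaming really removed every clash. The sketch below shows the same suffixing logic on plain character vectors, followed by a check. The vectors `old_names` and `new_names` are made-up stand-ins for docnames(prev_corpus) and docnames(new_corpus).

```r
# Hypothetical stand-ins for docnames(prev_corpus) and docnames(new_corpus).
old_names <- c("Access Now (NGO)", "AVAAZ (NGO)")
new_names <- c("EPIC (NGO)", "AVAAZ (NGO)")

# Same idea as above: suffix every new name that already exists in the old set.
clash <- new_names %in% old_names
new_names[clash] <- paste(new_names[clash], "new")

# Sanity checks: no overlap with the old names, and no duplicates overall.
stopifnot(!any(new_names %in% old_names))
stopifnot(anyDuplicated(c(old_names, new_names)) == 0)
print(new_names)
```

Note that a fixed "new" suffix can collide again if the same documents are added a second time, so the check is worth rerunning whenever the corpus is extended.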

Now we can simply combine them; the + operator will automatically match the docvar columns.

# combine the two
new_corpus <- prev_corpus + new_corpus
print(new_corpus, 0, 0)
#> Corpus consisting of 60 documents and 3 docvars.
head(docvars(new_corpus))
#>                                 actor type_actor   id
#> 1                          Access Now        NGO <NA>
#> 2                                 ACM        NGO <NA>
#> 3                      AlgorithmWatch        NGO <NA>
#> 4                               AVAAZ        NGO <NA>
#> 5                     Bits of Freedom        NGO <NA>
#> 6 Centre for Democracy and Technology        NGO <NA>
tail(docvars(new_corpus))
#>                  actor type_actor       id
#> 55           Impact-AI        NGO F2665589
#> 56         Croation-AI        NGO F2665590
#> 57               GLEIF        NGO F2665591
#> 58 Fraud-Corruption-AI        NGO F2665605
#> 59      Future-Society        NGO F2665611
#> 60   Climate-Change-AI        NGO F2665623

Created on 2023-05-15 with reprex v2.0.2
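The question also asks for the result to be saved as a new .RDS file, a step the reprex does not show. A minimal sketch using base R's saveRDS() and readRDS() follows. The object `merged` is a stand-in for the combined corpus, and the output file name is hypothetical.

```r
# Stand-in for the merged corpus; saveRDS() serialises any R object the same way.
merged <- c("Access Now (NGO)" = "old text", "EPIC (NGO)" = "new text")

# Hypothetical output name, written to a temporary directory for this sketch.
out_file <- file.path(tempdir(), "data_corpus_aiact_extended.rds")
saveRDS(merged, out_file)

# Read it back and confirm the round trip preserved the object.
reloaded <- readRDS(out_file)
stopifnot(identical(reloaded, merged))
```

In the actual workflow the call would be saveRDS(new_corpus, ...) with a path of your choosing. Writing to a new file name rather than overwriting data_corpus_aiact.rds keeps the original corpus intact.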
