How to merge text that spills over multiple rows in R

Posted by wbrvyc0a on 2023-06-27 in Other

I'm using tabulizer to scrape PDFs in R, and I find that long text frequently spills over into multiple rows, like this:

              item quantity price
1   PURCHASE ORDER       NA    NA
2    ITEM CODE ABC       NA    NA
3  1 Computer soft        3 10.99
4             ware       NA    NA
5         2 Mathem        2  8.50
6 atics curriculum       NA    NA
7  and calculators       NA    NA

I'd like to merge the text that spans multiple rows into a smaller number of rows, so that it looks like this:

  item                                     quantity price
1 PURCHASE ORDER ITEM CODE ABC                   NA    NA
2 1 Computer software                             3 10.99
3 2 Mathematics curriculum and calculators        2  8.50

Here is the code to recreate these data frames:

# Messy data frame (as extracted from the PDF)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df

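# Clean data frame (the desired result); stored under a separate name
# so that it does not overwrite the messy `df` above
item <- c("PURCHASE ORDER ITEM CODE ABC", "1 Computer software", "2 Mathematics curriculum and calculators")
quantity <- c(NA, 3, 2)
price <- c(NA, 10.99, 8.50)
df_clean <- data.frame(item, quantity, price)
df_clean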

56lgkhnf  #1

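You can identify the groups from the rows that have missing values in both quantity and price: such a row continues the text of the row above it. Then create a unique group identifier for the text that belongs together. Finally, paste the text together and take the quantity and price values from the first observation in each group: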

library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#>               item quantity price
#> 1   PURCHASE ORDER       NA    NA
#> 2    ITEM CODE ABC       NA    NA
#> 3  1 Computer soft        3 10.99
#> 4             ware       NA    NA
#> 5         2 Mathem        2  8.50
#> 6 atics curriculum       NA    NA
#> 7  and calculators       NA    NA

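df %>% 
  # Flag 1 = row starts a new group; 0 = continuation row (both values missing).
  # The first row always starts a group.
  mutate(grp = ifelse(is.na(quantity) & is.na(price) & row_number() != 1, 0, 1),
         grp = cumsum(grp)) %>%   # running total turns the flags into group ids
  group_by(grp) %>% 
  reframe(item = paste(item, collapse = " "), 
          across(c(quantity, price), first)) %>%
  select(-grp)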
#> # A tibble: 3 × 3
#>   item                                      quantity price
#>   <chr>                                        <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#> 2 1 Computer soft ware                             3  11.0
#> 3 2 Mathem atics curriculum and calculators        2   8.5

Created on 2023-06-21 with reprex v2.0.2
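To make the grouping step easier to follow, here is a minimal sketch of the same idea in base R. It is not the answerer's code: the names `starts_group` and `merged` are made up for illustration, and it assumes the `df` from the question.

```r
# Rebuild the messy data frame from the question
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware",
          "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)

# A row starts a new group when it carries a quantity or a price;
# the first row always starts a group (hypothetical helper name)
starts_group <- !(is.na(df$quantity) & is.na(df$price))
starts_group[1] <- TRUE

# The running total of the flags is the group id: 1 1 2 2 3 3 3
grp <- cumsum(starts_group)

# Paste the text per group and keep the first quantity/price of each group
merged <- data.frame(
  item      = vapply(split(df$item, grp), paste, character(1), collapse = " "),
  quantity  = vapply(split(df$quantity, grp), function(x) x[1], numeric(1)),
  price     = vapply(split(df$price, grp), function(x) x[1], numeric(1)),
  row.names = NULL
)
merged
```

Note that joining with `collapse = " "` yields "1 Computer soft ware" rather than the "1 Computer software" shown in the question. The data alone does not say whether a fragment continues a word or starts a new one, so recovering the unbroken words would need extra information.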


ki1q1bka  #2

Here is one option for tidying your data:
Note: IMHO the first two rows ("PURCHASE ....") reflect the column names and should be dropped.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)

df |> 
  mutate(no = as.numeric(str_extract(item, "^\\d+"))) |> 
  fill(no) |> 
  replace_na(list(no = 0)) |> 
  group_by(no) |> 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |> 
  select(-no)
#> # A tibble: 3 × 3
#>   item                                      quantity price  
#>   <chr>                                     <chr>    <chr>  
#> 1 PURCHASE ORDER ITEM CODE ABC              ""       ""     
#> 2 1 Computer soft ware                      "3"      "10.99"
#> 3 2 Mathem atics curriculum and calculators "2"      "8.5"
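Because `paste0()` is applied to every column, quantity and price come back as character columns, with empty strings where all values were missing. If numeric columns are needed afterwards, one possible follow-up is sketched below in base R. The data frame `res` is a hand-typed stand-in for the output above and `num_cols` is a made-up name; inside the pipeline the same conversion could be written as `mutate(across(c(quantity, price), as.numeric))`.

```r
# Stand-in for the character result shown above
res <- data.frame(
  item = c("PURCHASE ORDER ITEM CODE ABC",
           "1 Computer soft ware",
           "2 Mathem atics curriculum and calculators"),
  quantity = c("", "3", "2"),
  price = c("", "10.99", "8.5")
)

# Convert the two columns back to numeric; empty strings become NA
num_cols <- c("quantity", "price")
res[num_cols] <- lapply(res[num_cols], as.numeric)
str(res)
```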

c9x0cxw0  #3

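From your post it is not clear whether quantity or price could ever have values on more than one row of the same line item. So I just took the sum; you can decide how you want to aggregate these rows.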

library(dplyr)

df |>
  mutate(line = cumsum(grepl("^\\d+", item))) |>
  summarize(item = paste(item, collapse = " "),
            across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = TRUE) else NA),
            .by = line) |>
  select(-line)

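The first "line" doesn't have any values and also doesn't start with a number like the others. As @stefan mentioned, if these rows are some kind of header that you want to exclude, you can simplify the across statement to across(c(quantity, price), na.omit) and change summarize to reframe.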

Output

                                       item quantity price
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50
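The header-dropping variant described above can be sketched as follows. This is one reading of that suggestion, not output posted by the answerer; it assumes dplyr 1.1.0 or later (for `reframe()` and `.by`) and the `df` from the question. The name `out` is made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for reframe() and .by

# Rebuild the messy data frame from the question
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware",
          "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)

out <- df |>
  # A row whose text starts with digits opens a new line item;
  # the header rows before the first item get line 0
  mutate(line = cumsum(grepl("^\\d+", item))) |>
  # na.omit() returns zero values for the all-NA header group,
  # so that group contributes no rows to the result
  reframe(item = paste(item, collapse = " "),
          across(c(quantity, price), na.omit),
          .by = line) |>
  select(-line)
out
```

With this variant a line item that has more than one non-missing quantity or price would produce several rows, so it only suits data where each item carries at most one value per column.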
