How to combine text that has spilled over multiple rows in R

Posted by wbrvyc0a on 2023-06-27 in Other


I am using tabulizer to scrape PDFs in R, and I find that text that wraps in the PDF often spills over into multiple rows, like this:

              item quantity price
1   PURCHASE ORDER       NA    NA
2    ITEM CODE ABC       NA    NA
3  1 Computer soft        3 10.99
4             ware       NA    NA
5         2 Mathem        2  8.50
6 atics curriculum       NA    NA
7  and calculators       NA    NA

I would like to combine the text that spans multiple rows into one row per item, so that it looks like this:

item                                    quantity price
1 PURCHASE ORDER ITEM CODE ABC                NA    NA  
2 1 Computer software                           3 10.99
3 2 Mathematics curriculum and calculators      2  8.5

Here is code to recreate these tibbles:

#Messy tibble
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df

#Clean tibble (desired result), stored separately so it does not overwrite the messy df
item <- c("PURCHASE ORDER ITEM CODE ABC", "1 Computer software", "2 Mathematics curriculum and calculators")
quantity <- c(NA, 3, 2)
price <- c(NA, 10.99, 8.50)
df_clean <- data.frame(item, quantity, price)
df_clean

Source: https://stackoverflow.com/questions/76527223/how-to-combine-text-that-has-spilled-over-multiple-rows-in-r

3 Answers


56lgkhnf1#

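You can identify the groups from the rows that have missing values in both quantity and price: such rows continue the previous item. First, create a unique group identifier for the text that belongs together. Then paste the text together and take the quantity and price values from the first observation in each group: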

library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#>               item quantity price
#> 1   PURCHASE ORDER       NA    NA
#> 2    ITEM CODE ABC       NA    NA
#> 3  1 Computer soft        3 10.99
#> 4             ware       NA    NA
#> 5         2 Mathem        2  8.50
#> 6 atics curriculum       NA    NA
#> 7  and calculators       NA    NA

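df %>%
  # A row with a value in quantity or price (or the very first row) starts a new group
  mutate(grp = ifelse(is.na(quantity) & is.na(price) & row_number() != 1, 0, 1),
         grp = cumsum(grp)) %>%
  group_by(grp) %>%
  # Paste the text fragments together and keep the first quantity and price per group
  reframe(item = paste(item, collapse = " "),
          across(c(quantity, price), first)) %>%
  select(-grp)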
#> # A tibble: 3 × 3
#>   item                                      quantity price
#>   <chr>                                        <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#> 2 1 Computer soft ware                             3  11.0
#> 3 2 Mathem atics curriculum and calculators        2   8.5

Created on 2023-06-21 with reprex v2.0.2
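
To see why the cumulative sum yields one identifier per line item, the grouping step can be isolated in base R. This is only a sketch on the example vectors from the question; the names `starts_item`, `grp` and `merged` are illustrative and not part of the answer's code.

```r
# Example columns from the question
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware",
          "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price    <- c(NA, NA, 10.99, NA, 8.50, NA, NA)

# A row starts a new group when it carries a value in either column;
# the first row is forced to start a group as well.
starts_item <- !(is.na(quantity) & is.na(price))
starts_item[1] <- TRUE

# The running total of the start flags is the group identifier
grp <- cumsum(starts_item)
grp
#> [1] 1 1 2 2 3 3 3

# Paste the text fragments that share an identifier
merged <- unname(tapply(item, grp, paste, collapse = " "))
merged
```

Fragments are joined with a space, so a word that was split mid-word in the PDF (for example "soft" and "ware") stays split; the grouping alone cannot tell a broken word from two separate words.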


ki1q1bka2#

Here is one option to tidy your data:
Note: IMHO the first two rows ("PURCHASE ....") reflect the column names and should be dropped.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)

df |> 
  mutate(no = as.numeric(str_extract(item, "^\\d+"))) |> 
  fill(no) |> 
  replace_na(list(no = 0)) |> 
  group_by(no) |> 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |> 
  select(-no)
#> # A tibble: 3 × 3
#>   item                                      quantity price  
#>   <chr>                                     <chr>    <chr>  
#> 1 PURCHASE ORDER ITEM CODE ABC              ""       ""     
#> 2 1 Computer soft ware                      "3"      "10.99"
#> 3 2 Mathem atics curriculum and calculators "2"      "8.5"
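
Because `paste0()` returns character, quantity and price come back as strings in the result above. If numeric columns are needed, one possible follow-up is to convert them back. The sketch below rebuilds that result by hand (the object name `res` is made up for illustration) and relies on base R's `as.numeric()`, which turns empty strings into `NA`.

```r
# Hand-built copy of the character result shown above
res <- data.frame(
  item = c("PURCHASE ORDER ITEM CODE ABC",
           "1 Computer soft ware",
           "2 Mathem atics curriculum and calculators"),
  quantity = c("", "3", "2"),
  price = c("", "10.99", "8.5")
)

# Convert the pasted strings back to numbers; "" becomes NA
res$quantity <- as.numeric(res$quantity)
res$price <- as.numeric(res$price)
str(res)
```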


c9x0cxw03#

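From your post it is not clear whether quantity or price could have values on more than one row of the same line item. So I just took the sum; you can decide how to aggregate those rows.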

library(dplyr)

df |>
  mutate(line = cumsum(grepl("^\\d+", item))) |>
  summarize(item = paste(item, collapse = " "),
            across(c(quantity, price), ~ if (any(!is.na(.x))) sum(.x, na.rm = TRUE) else NA_real_),
            .by = line) |>
  select(-line)

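The first "line" does not have any values and does not start with a number like the others. As @stefan mentioned, if these rows are some kind of header that you want to exclude, you can simplify the across statement to across(c(quantity, price), na.omit) and change summarize to reframe.

Output

                                       item quantity price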
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50
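
The reframe variant mentioned above could look roughly like the following. It is a sketch rather than tested output from the answer, it assumes dplyr 1.1.0 or later (for `reframe()` and `.by`), and the name `no_header` is only illustrative. For the header group `na.omit()` returns zero values, so that group contributes no rows.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for reframe() and .by

item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware",
          "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)

no_header <- df |>
  # Rows starting with a number open a new line item; header rows get line 0
  mutate(line = cumsum(grepl("^\\d+", item))) |>
  # na.omit() keeps only the non-missing values, so a group without any
  # values (the header) is recycled to zero rows and disappears
  reframe(item = paste(item, collapse = " "),
          across(c(quantity, price), na.omit),
          .by = line) |>
  select(-line)
no_header
```

On the example data this should leave only the two numbered line items. If a line item ever carried more than one non-missing quantity or price, this variant would return several rows for it, which is a reason to prefer the sum-based version when the data are less regular.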
