Grouping text data in a corpus by a data frame variable

Posted by uemypmqf on 2023-01-28 in Other


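I have a data frame in R with one column on which I need to run some basic text analysis. I was able to adapt the code from this source to my needs. However, I now need to run the same analysis on groups of the data. I have included the dput of a small example here.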

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", 
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance", 
"waiting on wireline", 
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")

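I would like to group by the variable Pad.Name. I tried the corpus_group function from quanteda, as well as the corpus function from the same package with the arguments docid_field = dat$Pad.Name and text_field = dat$Message. Neither seemed to work.
For each unique Pad.Name, the output I want is the most frequently used words, say the top 10, along with the count of each word. Something like the following, though obviously with the real counts:
Edit: the table option never seems to work here, so here is a dput and a data frame of my desired output.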

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", 
"LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3, 
2, 2, 2)), class = "data.frame", row.names = c(NA, -4L))

output <- data.frame(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", "LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,2,2,2))

Source: https://stackoverflow.com/questions/75240863/grouping-text-data-in-a-corpus-by-a-data-frame-variable
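For reference, here is a rough sketch of how the quanteda route described in the question might be made to work. It assumes quanteda version 3 or later plus the separate quanteda.textstats package, and it relies on `text_field` taking a column name (or index) rather than the column vector itself, which may be why the original attempt failed. The object names `corp`, `toks`, `dfmat` and `freq` are illustrative only.

```r
library(quanteda)
library(quanteda.textstats)

# Sample data from the question
dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

# text_field is given as a column name; Pad.Name is kept as a document variable
corp <- corpus(dat, text_field = "Message")

# Tokenize, drop English stopwords, and build a document-feature matrix
# (dfm() lower-cases features by default)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)

# Top 10 features within each Pad.Name group (assumption: quanteda >= 3,
# where groups can name a document variable without quotes)
freq <- textstat_frequency(dfmat, n = 10, groups = Pad.Name)
freq[, c("group", "feature", "frequency")]
```

The `group`, `feature` and `frequency` columns correspond to the Pad.Name, Word and Count columns of the desired output.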

2 Answers

kkbh8khc1#

Would dplyr and tidytext work?

library(tidytext)
library(dplyr)

as_tibble(data) %>% 
  # split to words
  unnest_tokens(word,Message) %>% 
  # filter out stopwords
  anti_join(get_stopwords()) %>% 
  # count by (Pad.Name, word) groups 
  count(Pad.Name, word, name = "Count", sort = TRUE) %>%
  # output is sorted by Count and ungrouped, so this keeps the top 4 rows overall
  slice_head(n = 4) %>% 
  arrange(Pad.Name, desc(Count))
#> Joining, by = "word"
#> # A tibble: 4 × 3
#>   Pad.Name   word     Count
#>   <chr>      <chr>    <int>
#> 1 LEE        waiting      2
#> 2 LEE        wireline     2
#> 3 MISSOURI W pump         3
#> 4 MISSOURI W maint        2

Input:

data <- structure(list(Pad.Name = c(
  "MISSOURI W", "MISSOURI W", "MISSOURI W",
  "LEE", "LEE", "LEE"
), Message = c(
  "pump maint", "PUMP MAINT", "Pump Maintenance",
  "waiting on wireline",
  "seating the ball", "Waiting on wireline"
)), row.names = 11:16, class = "data.frame")

Created on 2023-01-26 with reprex v2.0.2
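The pipeline above keeps the top 4 rows of the whole table, which happens to match the small example. For the stated goal of the top 10 words per Pad.Name, a grouped variant along the following lines may be closer to what is needed. This is only a sketch: `top_n_words` and `top_words` are illustrative names, and `with_ties = FALSE` is an assumption about how ties at the cutoff should be handled.

```r
library(dplyr)
library(tidytext)

# Sample data from the question
data <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

top_n_words <- 10  # illustrative: number of words to keep per Pad.Name

top_words <- data %>%
  # one row per (lower-cased) word
  unnest_tokens(word, Message) %>%
  # drop stopwords; naming the join column silences the "Joining" message
  anti_join(get_stopwords(), by = "word") %>%
  # count words within each Pad.Name
  count(Pad.Name, word, name = "Count") %>%
  # keep the n most frequent words in every group
  group_by(Pad.Name) %>%
  slice_max(Count, n = top_n_words, with_ties = FALSE) %>%
  ungroup()

top_words
```

With `with_ties = TRUE` (the default), a group can return more than `top_n_words` rows when several words share the cutoff count.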


mrphzbgm2#

You can split by Pad.Name, strsplit the strings, and count the words with table.

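# one data frame per pad
. <- split(dat, dat$Pad.Name)
# per pad: lower-case, split on spaces, tabulate, reshape to Word/Count columns
. <- lapply(., \(s) data.frame(row.names = NULL, s["Pad.Name"],
  setNames(stack(table(unlist(strsplit(tolower(s$Message), " "))))[2:1],
           c("Word", "Count") )))
# bind the per-pad results back into one data frame
. <- do.call(rbind, unname(.))
# order by Count, then by the Word factor, descending; head() keeps the
# first 10 rows overall, not per pad
head(.[order(.$Count, .$Word, decreasing = TRUE),], 10)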
#    Pad.Name        Word Count
#9 MISSOURI W        pump     3
#7 MISSOURI W       maint     2
#6        LEE    wireline     2
#5        LEE     waiting     2
#2        LEE          on     2
#8 MISSOURI W maintenance     1
#4        LEE         the     1
#3        LEE     seating     1
#1        LEE        ball     1

Data

dat <- structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", 
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance", 
"waiting on wireline", 
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")
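The `head(..., 10)` call above takes the first 10 rows of the combined table rather than 10 per pad, and `data.frame()` recycles `s["Pad.Name"]` only when its row count divides evenly into the number of distinct words. A base R sketch that truncates inside each group and fills in the pad name explicitly could look like the following. The helper name `top_words` and its `n` argument are illustrative, and no stopword removal is attempted here.

```r
# Sample data from the question
dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

# Illustrative helper: the n most frequent words for each Pad.Name
top_words <- function(dat, n = 10) {
  per_pad <- lapply(split(dat, dat$Pad.Name), function(s) {
    # lower-case the messages and split them on whitespace
    words <- unlist(strsplit(tolower(s$Message), "\\s+"))
    # tabulate, then convert to a plain named integer vector
    tab <- table(words)
    counts <- setNames(as.vector(tab), names(tab))
    # most frequent first, truncated to n words within this pad
    counts <- head(sort(counts, decreasing = TRUE), n)
    data.frame(Pad.Name = s$Pad.Name[1],
               Word = names(counts),
               Count = unname(counts))
  })
  # stack the per-pad tables without carrying list names into row names
  do.call(rbind, unname(per_pad))
}

top_words(dat, n = 3)
```

Filtering `words` against a stopword vector before tabulating would drop entries such as "on" and "the".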
