R: Count keyword occurrences, each keyword in its own column

js81xvg6 posted on 2023-05-20 in Other

Suppose I have data like this:

df <- read.table(text = "title     date    text
blablabla   22.07.2023  'blablablabla Blue blablabla'
blablabla   23.06.2023  'bala Blue blabla Blue Night Blue'
blablabla   23.08.2023  'bala Mountain blabla House Night Blue'",
header = TRUE, stringsAsFactors = FALSE)

and a vector words containing what I consider keywords:

words <- c("House", "Mountain", "Blue", "Night")

What I want is to count how many times each of the words occurs in df$text, with each word counted separately in its own column. So far I have this code:

library(tidyverse)
df %>%
  # extract instances of keywords:
  mutate(
    keyword = str_extract_all(text, 
                              str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")
  )) %>%
  # turn into alternation pattern:
  mutate(keyword = lapply(keyword, function(x) str_c(x, collapse = "|"))) %>%
  # create row ID:
  mutate(row = row_number()) %>%
  # separate into rows splitting by `|`:
  separate_rows(keyword, sep = '\\|') %>% 
  # cast each keyword into its own column:
  pivot_wider(names_from = keyword, values_from = keyword, 
              values_fn = function(x) 1, values_fill = 0
              ) %>%
  select(-row)
# A tibble: 3 × 7
  title     date       text                                   Blue Night Mountain House
  <chr>     <chr>      <chr>                                 <dbl> <dbl>    <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla               1     0        0     0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue          1     1        0     0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue     1     1        1     1

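This is not what I want, because the function(x) 1 part does not sum the matches; it only records whether a word is present. How can I change this to get the following output: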

# A tibble: 3 × 7
  title     date       text                                   Blue Night Mountain House
  <chr>     <chr>      <chr>                                 <dbl> <dbl>    <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla               1     0        0     0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue          3     1        0     0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue     1     1        1     1

Answer #1 (hyrbngr7)

Another approach: split on spaces, set the factor levels to the keywords, and get a frequency table:

cbind(df[ "text" ],
      t(sapply(strsplit(df$text, " ", fixed = TRUE), 
               function(i) table(factor(i, levels = words))))
      )
#                                    text House Mountain Blue Night
# 1           blablablabla Blue blablabla     0        0    1     0
# 2      bala Blue blabla Blue Night Blue     0        0    3     1
# 3 bala Mountain blabla House Night Blue     1        1    1     1
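
The zero counts come from the factor levels: table() reports every level of a factor, including levels that never occur, while tokens that are not among the levels become NA and are left out. A minimal sketch on the second row's text (the names tokens and counts are illustrative only):

```r
words  <- c("House", "Mountain", "Blue", "Night")
tokens <- strsplit("bala Blue blabla Blue Night Blue", " ", fixed = TRUE)[[1]]

# Restricting the levels to the keywords turns every other token into NA
# (which table() ignores by default) and keeps keywords that never occur
# as levels with a count of 0.
counts <- table(factor(tokens, levels = words))
counts
# counts: House 0, Mountain 0, Blue 3, Night 1
```

Note that this matching is case-sensitive and relies on tokens being separated by single spaces; a token with attached punctuation such as "Blue," would not be counted.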

Alternatively, your own solution can be fixed by using length as the summary function:

#...
pivot_wider(names_from = keyword, values_from = keyword, 
            values_fn = length, values_fill = 0)
#...
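
For reference, here is one way the complete pipeline might look with values_fn = length. This is a sketch rather than the exact code from the question: it swaps the collapse/separate_rows steps for unnest(), which likewise yields one row per match, and it rebuilds df with data.frame() so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  title = "blablabla",
  date  = c("22.07.2023", "23.06.2023", "23.08.2023"),
  text  = c("blablablabla Blue blablabla",
            "bala Blue blabla Blue Night Blue",
            "bala Mountain blabla House Night Blue")
)
words <- c("House", "Mountain", "Blue", "Night")

# Case-insensitive alternation of the keywords, bounded by word boundaries
pattern <- str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")

result <- df %>%
  # list column holding every match found in each text
  mutate(keyword = str_extract_all(text, pattern)) %>%
  # one row per match (rows without any match are dropped here)
  unnest(keyword) %>%
  # count the matches per keyword instead of flagging presence
  pivot_wider(names_from = keyword, values_from = keyword,
              values_fn = length, values_fill = 0)
result
```

Because the pattern is case-insensitive, a match such as "blue" would end up in its own column next to "Blue" unless the matches are normalized first.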

Answer #2 (e0bqpujr)

Another option is data.table together with stringr:

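library(data.table)
library(stringr)
setDT(df)  # convert df to a data.table by reference, once
for (word in words) {
  # add one integer column per keyword with its number of matches in text
  set(df, j = word, value = str_count(df$text, word))
}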

Output:

       title       date                                  text House Mountain  Blue Night
      <char>     <char>                                <char> <int>    <int> <int> <int>
1: blablabla 22.07.2023           blablablabla Blue blablabla     0        0     1     0
2: blablabla 23.06.2023      bala Blue blabla Blue Night Blue     0        0     3     1
3: blablabla 23.08.2023 bala Mountain blabla House Night Blue     1        1     1     1
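
One caveat: str_count(df$text, word) treats word as a regular expression and counts it anywhere in the string, so Blue would also be counted inside a longer token such as Blueberry. If whole-word, case-insensitive matching is wanted (as in the question's pattern), one possible variant wraps each keyword in \\b boundaries. The sample text and the count_whole_word helper below are hypothetical, not part of the original answer, and the keywords are assumed to contain no regex metacharacters:

```r
library(stringr)

words <- c("House", "Mountain", "Blue", "Night")
text  <- c("Blueberry Blue", "bala Blue blabla Blue Night Blue")

# Hypothetical helper: whole-word, case-insensitive count of one keyword
count_whole_word <- function(x, word) {
  str_count(x, regex(str_c("\\b", word, "\\b"), ignore_case = TRUE))
}

plain   <- str_count(text, "Blue")         # substring matches:  2 3
bounded <- count_whole_word(text, "Blue")  # whole words only:   1 3

# One column per keyword, one row per text
counts <- sapply(words, function(w) count_whole_word(text, w))
counts
```

The helper can be dropped into the loop above in place of the plain str_count() call.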

Answer #3 (yyhrrdl8)

You can try:

df %>%
    mutate(tokens = strsplit(text, " ")) %>%
    unnest(tokens) %>%
    filter(tokens %in% words) %>%
    pivot_wider(
        names_from = tokens,
        values_from = tokens,
        values_fn = length,
        values_fill = 0
    ) %>%
    left_join(df)

which gives:

# A tibble: 3 × 7
  title     date       text                            Blue Night Mountain House
  <chr>     <chr>      <chr>                          <int> <int>    <int> <int>
1 blablabla 22.07.2023 blablablabla Blue blablabla        1     0        0     0
2 blablabla 23.06.2023 bala Blue blabla Blue Night B…     3     1        0     0
3 blablabla 23.08.2023 bala Mountain blabla House Ni…     1     1        1     1
