Suppose I have data like this:
df <- read.table(text = "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
  header = TRUE, stringsAsFactors = FALSE)
and a vector, words, holding what I consider keywords:
words <- c("House", "Mountain", "Blue", "Night")
What I want is to count how many times each of the words occurs in df$text, with every word counted separately in its own column. So far I have this code:
library(tidyverse)
df %>%
  # extract instances of keywords (case-insensitive, whole words only):
  mutate(
    keyword = str_extract_all(text,
      str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")
    )) %>%
  # collapse the matches of each row into one `|`-separated string:
  mutate(keyword = lapply(keyword, function(x) str_c(x, collapse = "|"))) %>%
  # create row ID:
  mutate(row = row_number()) %>%
  # separate into rows splitting by `|`:
  separate_rows(keyword, sep = '\\|') %>%
  # cast each keyword into its own column:
  pivot_wider(names_from = keyword, values_from = keyword,
    values_fn = function(x) 1, values_fill = 0
  ) %>%
  select(-row)
# A tibble: 3 × 7
title date text Blue Night Mountain House
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla 1 0 0 0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue 1 1 0 0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue 1 1 1 1
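This is not what I want, because the function(x) 1 part does not sum anything; it only records whether a word is present. How can I change this to get the following output: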
# A tibble: 3 × 7
title date text Blue Night Mountain House
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 blablabla 22.07.2023 blablablabla Blue blablabla 1 0 0 0
2 blablabla 23.06.2023 bala Blue blabla Blue Night Blue 3 1 0 0
3 blablabla 23.08.2023 bala Mountain blabla House Night Blue 1 1 1 1
3 Answers
hyrbngr7 1#
Another approach: split on spaces, set the factor levels, and build a frequency table:
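A minimal base R sketch of the split / factor / table idea, assuming the df and words from the question. The names tokens, counts and result are introduced here for illustration only. Unlike the regex in the question, this variant is case-sensitive and only matches whole space-separated tokens.

```r
df <- read.table(text = "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
  header = TRUE, stringsAsFactors = FALSE)
words <- c("House", "Mountain", "Blue", "Night")

# split every text on single spaces: one character vector per row
tokens <- strsplit(df$text, " ")

# turn each token vector into a factor whose levels are the keywords
# (non-keywords become NA and are dropped by table()), then tabulate
counts <- vapply(
  tokens,
  function(x) as.vector(table(factor(x, levels = words))),
  integer(length(words))
)

# vapply() returns keywords in rows, so transpose and label the columns
counts <- t(counts)
colnames(counts) <- words

result <- cbind(df, counts)
result
```

Because the factor levels are fixed up front, a keyword that never occurs still gets a column of zeros.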
Or fix your solution by using length:
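A sketch of what the length fix could look like, assuming the df and words from the question: passing values_fn = length makes pivot_wider() count the rows per keyword instead of returning a constant 1. As a simplification that is not part of the question's code, the collapse and separate_rows() steps are replaced here by unnest(), and result is an illustrative name.

```r
library(tidyverse)

df <- read.table(text = "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
  header = TRUE, stringsAsFactors = FALSE)
words <- c("House", "Mountain", "Blue", "Night")

result <- df %>%
  # extract all keyword matches as a list column:
  mutate(
    keyword = str_extract_all(text,
      str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")
    ),
    # row ID so that identical title/date/text rows stay distinct:
    row = row_number()
  ) %>%
  # one row per match:
  unnest(keyword) %>%
  # one column per keyword; length() counts the matches per row:
  pivot_wider(names_from = keyword, values_from = keyword,
    values_fn = length, values_fill = 0
  ) %>%
  select(-row)

result
```

Note that unnest() drops rows without any match by default, and that a match in a different case (for example "blue") would end up in its own column; neither situation occurs in the sample data.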
e0bqpujr 2#
Another option is to use data.table together with stringr.
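One way a data.table plus stringr solution might look, as a sketch under the assumption that df and words are defined as in the question. The name dt is illustrative. str_count() counts the matches per row, and := adds one column per keyword by reference.

```r
library(data.table)
library(stringr)

df <- read.table(text = "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
  header = TRUE, stringsAsFactors = FALSE)
words <- c("House", "Mountain", "Blue", "Night")

dt <- as.data.table(df)

# for every keyword, count whole-word, case-insensitive matches in `text`
# and store the counts in a new column named after the keyword
dt[, (words) := lapply(words, function(w) {
  str_count(text, regex(paste0("\\b", w, "\\b"), ignore_case = TRUE))
})]

dt
```

Since the columns are created from words directly, every keyword gets a column even if it never occurs, and the column order follows words.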
yyhrrdl8 3#
You can try the following. It gives the desired output.
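One thing that could be tried here is a base R sketch built on gregexpr() and regmatches(), which is not necessarily what this answer originally used. It assumes the df and words from the question; counts and result are illustrative names.

```r
df <- read.table(text = "title date text
blablabla 22.07.2023 'blablablabla Blue blablabla'
blablabla 23.06.2023 'bala Blue blabla Blue Night Blue'
blablabla 23.08.2023 'bala Mountain blabla House Night Blue'",
  header = TRUE, stringsAsFactors = FALSE)
words <- c("House", "Mountain", "Blue", "Night")

# for each keyword, find all whole-word matches per row and count them;
# sapply() over a character vector names the result columns after `words`
counts <- sapply(words, function(w) {
  m <- gregexpr(paste0("\\b", w, "\\b"), df$text,
                ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(df$text, m))
})

result <- cbind(df, counts)
result
```

Rows without a match yield an empty match list, so lengths() returns 0 for them and no extra fill step is needed.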