Count frequency of words of a text variable with Hive

wtzytmuj posted on 2021-05-27 in Hadoop


I have a variable in which each row is a sentence. Example:

-Row1 "Hey, how are you?"
-Row2 "Hey, Who is there?"

I want the output to be the count grouped by word.
Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any ideas?
Thanks!

hadoop Hive hiveql Counter text

Source: https://stackoverflow.com/questions/59855489/count-frequency-of-words-of-a-text-variable-with-hive

2 answers

41ik7eoe1#

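Hive cannot do this on its own. You can read the data from Hive into a DataFrame and process it there with Python. Your question then becomes how to count word frequencies in a DataFrame column.
Count the frequency of words in a DataFrame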

2021-05-27

fxnxkyjh2#

This is possible in Hive. Split on non-alphabetic characters, use lateral view + explode, then count the words:

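-- Sample data; replace this CTE with your own table
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- lowercase, then split on runs of non-letter characters
  select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
) s lateral view explode(words) w as word  -- one row per array element
where w.word != ''  -- drop the empty token left by trailing punctuation
group by w.word;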

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1
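
The result above comes back in no guaranteed order. If the goal is to see the most frequent words first, a minimal sketch is to add an order by and a limit to the same query. The limit of 10 is an arbitrary value chosen for illustration, and your_data is still the inline sample from the answer rather than a real table:

```sql
-- Sketch only: same pipeline as above, sorted by frequency.
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word as word, count(*) as cnt
from
(
  -- input is lowercased first, so [^a-z]+ is enough here
  select split(lower(initial_string), '[^a-z]+') as words from your_data
) s lateral view explode(words) w as word
where w.word != ''
group by w.word
-- order by the select-list aliases; ties are broken alphabetically
order by cnt desc, word
limit 10;  -- 10 is an assumed cutoff, adjust as needed
```

With the sample data, hey (count 2) would be listed first, followed by the remaining words in alphabetical order.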

Another approach is the sentences function, which returns an array of tokenized sentences (an array of word arrays):

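-- Sample data; replace this CTE with your own table
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- sentences() returns array<array<string>>: one word array per sentence
  select sentences(lower(initial_string)) sentences from your_data
) d lateral view explode(sentences) s as sentence  -- one row per sentence
    lateral view explode(s.sentence) w as word     -- one row per word
group by w.word;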

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

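The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") )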
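
To check how sentences tokenizes a given string before exploding it, it can help to call the function on its own. The sketch below assumes a Hive version that allows select without a from clause. The 'en' and 'US' values in the second call are illustrative assumptions for the optional arguments, not values taken from the answer:

```sql
-- Inspect the nested array directly; the example string is the one quoted above.
select sentences('Hello there! How are you?') as tokens;

-- Same call with the optional lang and locale arguments
-- ('en' and 'US' are assumed values for illustration).
select sentences('Hello there! How are you?', 'en', 'US') as tokens;
```

Because punctuation is already dropped by the tokenizer, the second query in the answer needs no where w.word != '' filter, unlike the split-based version.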

2021-05-27
