Counting word frequency in a text variable using Hive

wtzytmuj  posted on 2021-05-27 in Hadoop

I have a variable in which every row is a sentence. Example:

-Row1 "Hey, how are you?"
-Row2 "Hey, Who is there?"

I would like the output to be a count grouped by word.
Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any ideas?
Thanks!


41ik7eoe  #1

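Hive cannot do this on its own. You can read the data from Hive into a DataFrame and process it there with Python. Your question then becomes how to count word frequency in a DataFrame column.
See: Count the frequency of words in a DataFrame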
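
A minimal sketch of that Python route, for illustration only. It assumes the rows have already been fetched from Hive into a plain Python list; the `rows` variable and the `word_counts` helper are hypothetical names, not part of any Hive or DataFrame API. It uses only the standard library, and the same split-and-count logic would carry over to a DataFrame column.

```python
import re
from collections import Counter


def word_counts(rows):
    """Count lower-cased words across all rows, splitting on non-letter characters."""
    counts = Counter()
    for sentence in rows:
        # Split on runs of non-letters and drop the empty strings that
        # appear when a row starts or ends with punctuation.
        words = [w for w in re.split(r"[^a-z]+", sentence.lower()) if w]
        counts.update(words)
    return counts


# Hypothetical sample data: the two rows from the question.
rows = ["Hey, how are you?", "Hey, Who is there?"]
counts = word_counts(rows)
for word in sorted(counts):
    print(word, counts[word])
```

Lower-casing before counting means "Hey" and "hey" land in the same bucket; drop the `.lower()` call if case should be preserved.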


fxnxkyjh  #2

This is possible in Hive. Split on non-alphabetic characters, use lateral view + explode, then count the words:

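-- Sample data: two rows, one sentence each
with your_data as (
  select stack(2,
           'Hey, how are you?',
           'Hey, Who is there?'
         ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- lower-case, then split on runs of non-letter characters
  select split(lower(initial_string), '[^a-zA-Z]+') words from your_data
) s
lateral view explode(words) w as word  -- one row per word
where w.word != ''                     -- drop empty tokens left by trailing punctuation
group by w.word;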

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

Another approach is the sentences function, which returns an array of tokenized sentences (an array of word arrays):

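-- Sample data: two rows, one sentence each
with your_data as (
  select stack(2,
           'Hey, how are you?',
           'Hey, Who is there?'
         ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- sentences() returns array<array<string>>: one word array per sentence
  select sentences(lower(initial_string)) sentences from your_data
) d
lateral view explode(sentences) s as sentence  -- one row per sentence
lateral view explode(s.sentence) w as word     -- one row per word
group by w.word;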

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

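The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") )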
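
Why the sentences-based query needs two lateral views can be sketched in Python as a loose analogy. The nested lists below are hand-written stand-ins for what `sentences(lower(initial_string))` is assumed to return for the two sample rows (consistent with the result table above). Nothing here calls Hive, and the variable names are hypothetical.

```python
from collections import Counter

# Assumed per-row output of sentences(): one list of sentences per row,
# each sentence being a list of words.
rows = [
    [["hey", "how", "are", "you"]],   # row 1: a single sentence
    [["hey", "who", "is", "there"]],  # row 2: a single sentence
]

counts = Counter()
for sentence_list in rows:           # one entry per input row
    for sentence in sentence_list:   # first explode: row -> sentences
        for word in sentence:        # second explode: sentence -> words
            counts[word] += 1        # group by word, count(*)

for word in sorted(counts):
    print(word, counts[word])
```

Each nesting level of the array needs its own explode step, which is why the split-based query gets by with one lateral view while this one uses two.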
