with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)
select w.word, count(*) cnt
from
(
select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
)s lateral view explode(words) w as word
where w.word!=''
group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
Another approach is the sentences function, which returns an array of tokenized sentences (each sentence is an array of words):
with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)
select w.word, count(*) cnt
from
(
select sentences(lower(initial_string)) sentences from your_data
)d lateral view explode(sentences) s as sentence
lateral view explode(s.sentence) w as word
group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
2 Answers
41ik7eoe1#
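Hive cannot do this on its own. You can read the data from Hive into a DataFrame and process it there with Python. Your question then becomes how to count word frequencies in a DataFrame column.
Count the frequency of words in a DataFrame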
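For illustration, here is a minimal sketch of that Python-side counting. It assumes the rows have already been fetched from Hive into a plain list of strings; the sample rows and the `word_counts` helper are hypothetical, and plain Python is used instead of a DataFrame library only to keep the sketch self-contained.

```python
import re
from collections import Counter


def word_counts(rows):
    """Count lowercase alphabetic words across an iterable of strings."""
    counts = Counter()
    for text in rows:
        # Lowercase, split on runs of non-letters, and drop empty tokens.
        words = [w for w in re.split(r"[^a-z]+", text.lower()) if w]
        counts.update(words)
    return counts


# Hypothetical sample data standing in for the column read from Hive.
rows = ["Hey, how are you?", "Hey, Who is there?"]
counts = word_counts(rows)

for word, cnt in sorted(counts.items()):
    print(word, cnt)
```

With a DataFrame the same logic would be applied to the text column; the split pattern is the part that matters.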
fxnxkyjh2#
This is possible in Hive: split on non-alphabetic characters, use lateral view + explode, then count the words.
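sentences(string str, string lang, string locale) tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).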
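To make the nested return shape concrete, here is a rough Python sketch of an array of sentences, each an array of words, and of why two explode steps are needed to reach individual words. The `rough_sentences` helper is hypothetical and only approximates the behaviour by splitting on '.', '!' and '?'. It is not how Hive implements sentences, whose boundary detection can also take the language and locale into account.

```python
import re
from collections import Counter


def rough_sentences(text):
    """Rough stand-in for sentences(): a list of sentences, each a list of words."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    parts = re.split(r"[.!?]+", text)
    tokenized = [re.findall(r"[A-Za-z]+", part) for part in parts]
    # Drop empty sentences left over after the final punctuation mark.
    return [words for words in tokenized if words]


nested = rough_sentences("Hello there! How are you?")
print(nested)  # [['Hello', 'there'], ['How', 'are', 'you']]

# The outer loop corresponds to exploding the array of sentences,
# the inner loop to exploding each sentence into words.
counts = Counter(word for sentence in nested for word in sentence)
print(counts["are"])  # 1
```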