What delimiters does the Hive ngrams UDF use to tokenize?

jbose2ul, posted on 2021-06-26 in Hive

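I am doing sentiment analysis.
I need to count the vocabulary (distinct words) in a text.
The ngrams UDF seems to do a good job of identifying unigrams. I would like to know which delimiters it uses to determine unigrams/tokens. This matters if I want to reproduce the vocabulary count with the split UDF. For example, given the following text (a product review):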
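I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.
the ngrams UDF yields 82 unigrams/tokens: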

SELECT count(*) FROM 
(SELECT explode(ngrams(sentences(upper(reviewtext)),1,9999999))  
FROM  amazon.Food_review_part_small WHERE asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR') t;
82

However, using the split UDF with whitespace, comma, period, hyphen, and double quote as delimiters, I get 85 unigrams/tokens:

select  count(distinct(te)) FROM amazon.Food_review_part_small 
lateral view explode(split(upper(reviewtext), '[\\s,.-]|\"')) t as te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR';
85

Of course, I could find hardly any documentation. Does anyone know which delimiters the ngrams UDF uses to determine unigram tokens?


ntjbwcob1#

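The ngrams UDAF does not split the data; it already expects an array of strings or an array of arrays of strings as input. In this case it is the sentences UDF that splits the data. From its Java annotation: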

+ "Unnecessary punctuation, such as periods and commas in English, is automatically stripped."
+ " If specified, 'lang' should be a two-letter ISO-639 language code (such as 'en'), and "
+ "'country' should be a two-letter ISO-3166 code (such as 'us'). Not all country and "
+ "language codes are fully supported, and if an unsupported code is specified, a default "

If you run the following query:

select sentences("I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.");

you get the following result:

[["I","was","aboslutely","shocked","to","see","how","much","1","oz","really","was"],["At","I","mistakenly","assumed","it","would","be","a","decent","sized","can"],["As","locally","I","am","able","to","buy","a","medium","sized","tube","of","wasabi","paste","for","around","but","never","used","it","fast","enough","so","it","would","get","old"],["I","figured","a","powder","would","be","better","so","I","can","mix","it","as","I","needed","it"],["When","I","opened","the","box","and","dug","thru","the","packing","and","saw","this","little","little","can","I","started","looking","for","the","hidden","cameras","thought","this","HAD","to","be","a","joke"],["Nope","and","it's","NOT","returnable","either"],["SO","I","HAVE","LEARNED","MY","LESSON"],["Please","just","be","aware","if","you","should","decide","you","want","this","EXPENSIVE","wasabi","powder"]]

As you can see, the sentences UDF strips some "noise" such as "$7.60" and "$3", as well as empty strings.
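
To see how this likely explains the 82 vs. 85 gap, here is a small Python sketch (not Hive code) that reproduces both counts on the review text. The `sentence_like_tokens` rule (split on whitespace, trim commas and periods, keep only tokens that start with a letter or digit) is only my rough approximation of what the output above suggests; it is an assumption, not the actual implementation of sentences.

```python
import re

# Review text copied from the query above (typos kept as-is).
review = (
    "I was aboslutely shocked to see how much 1 oz really was. "
    "At $7.60, I mistakenly assumed it would be a decent sized can. "
    "As locally I am able to buy a medium sized tube of wasabi paste "
    "for around $3, but never used it fast enough so it would get old. "
    "I figured a powder would be better, so I can mix it as I needed it. "
    "When I opened the box and dug thru the packing and saw this little "
    "little can, I started looking for the hidden cameras ... thought "
    "this HAD to be a joke. Nope .. and it's NOT returnable either. "
    "SO I HAVE LEARNED MY LESSON. Please just be aware if you should "
    "decide you want this EXPENSIVE wasabi powder."
)
text = review.upper()

# Same delimiters as the split() query in the question:
# whitespace, comma, period, hyphen, double quote; empty strings dropped.
split_tokens = {t for t in re.split(r'[\s,.-]|"', text) if t}

# Assumed approximation of sentences(): whitespace chunks, surrounding
# commas/periods trimmed, tokens not starting with a letter or digit dropped.
sentence_like_tokens = set()
for chunk in text.split():
    word = chunk.strip(",.")
    if word and word[0].isalnum():
        sentence_like_tokens.add(word)

print(len(split_tokens))                            # 85
print(len(sentence_like_tokens))                    # 82
print(sorted(split_tokens - sentence_like_tokens))  # ['$3', '$7', '60']
```

On this text the three extra tokens in the split version are "$7", "60" and "$3": splitting on the period cuts "$7.60" in two, whereas sentences drops both dollar amounts entirely.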
