What separators does the Hive ngrams UDF use to tokenize?

Posted by jbose2ul on 2021-06-26 in Hive


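I am working on sentiment analysis.
I need to count the vocabulary (distinct words) in a text.
The ngrams UDF seems to do a good job of determining unigrams. I would like to know which separators it uses to determine unigrams/tokens. This matters if I want to emulate the vocabulary count with the split UDF. For example, given the following text (a product review):
I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.
the ngrams UDF yields 82 unigrams/tokens: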

SELECT count(*) FROM 
(SELECT explode(ngrams(sentences(upper(reviewtext)),1,9999999))  
FROM  amazon.Food_review_part_small WHERE asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR') t;
82

However, using the split UDF with whitespace, comma, period, hyphen and double quote as separators yields 85 unigrams/tokens:

select  count(distinct(te)) FROM amazon.Food_review_part_small 
lateral view explode(split(upper(reviewtext), '[\\s,.-]|\"')) t as te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR';
85

I can hardly find any documentation on this. Does anyone know which separators the ngrams UDF uses to determine unigram tokens?

Hive split nlp sentiment-analysis n-gram

Source: https://stackoverflow.com/questions/49021741/what-separators-does-the-hive-ngram-udf-use-to-tokenize

1 answer


ntjbwcob1#

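The ngrams UDAF does not split the data. It expects its input to already be an array of strings or an array of arrays of strings. In this case it is the sentences UDF that splits the data. From its Java annotation: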

+ "Unnecessary punctuation, such as periods and commas in English, is automatically stripped."
+ " If specified, 'lang' should be a two-letter ISO-639 language code (such as 'en'), and "
+ "'country' should be a two-letter ISO-3166 code (such as 'us'). Not all country and "
+ "language codes are fully supported, and if an unsupported code is specified, a default "

If you run the following query

select sentences("I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.");

you will get the following result

[["I","was","aboslutely","shocked","to","see","how","much","1","oz","really","was"],["At","I","mistakenly","assumed","it","would","be","a","decent","sized","can"],["As","locally","I","am","able","to","buy","a","medium","sized","tube","of","wasabi","paste","for","around","but","never","used","it","fast","enough","so","it","would","get","old"],["I","figured","a","powder","would","be","better","so","I","can","mix","it","as","I","needed","it"],["When","I","opened","the","box","and","dug","thru","the","packing","and","saw","this","little","little","can","I","started","looking","for","the","hidden","cameras","thought","this","HAD","to","be","a","joke"],["Nope","and","it's","NOT","returnable","either"],["SO","I","HAVE","LEARNED","MY","LESSON"],["Please","just","be","aware","if","you","should","decide","you","want","this","EXPENSIVE","wasabi","powder"]]

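As you can see, the sentences UDF removes some "noise" such as "$7.60" and "$3", and it also drops empty strings. This likely accounts for the gap between 82 and 85: with the split regex from the question, "$7", "60" and "$3" survive as three extra tokens.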
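The output above is what you would expect if tokenization follows locale-aware boundary rules rather than a fixed list of separators. Here is a minimal Java sketch of that idea. It assumes java.text.BreakIterator for sentence and word boundaries, plus an assumed filter that keeps only tokens starting with a letter or digit. The names SentenceSketch, tokenize, words and vocabularySize are hypothetical, and this is an illustration, not Hive's source code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: a hypothetical re-creation, not Hive's implementation.
class SentenceSketch {

    // Split text into sentences, then each sentence into words, using
    // locale-aware boundaries instead of a fixed separator list.
    static List<List<String>> tokenize(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            result.add(words(text.substring(start, end), locale));
        }
        return result;
    }

    // Collect the word tokens of one sentence.
    static List<String> words(String sentence, Locale locale) {
        List<String> kept = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(sentence);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = sentence.substring(start, end);
            // Assumed filter: drop whitespace, punctuation and symbol-led tokens.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Count distinct upper-cased tokens, i.e. the vocabulary size.
    static int vocabularySize(String text, Locale locale) {
        Set<String> vocab = new TreeSet<>();
        for (List<String> sentence : tokenize(text.toUpperCase(locale), locale)) {
            vocab.addAll(sentence);
        }
        return vocab.size();
    }

    public static void main(String[] args) {
        String review = "Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON.";
        System.out.println(tokenize(review, Locale.ENGLISH));
        System.out.println(vocabularySize(review, Locale.ENGLISH));
    }
}
```

Counting distinct tokens this way, rather than with a regex split, is one way to approximate the vocabulary count outside Hive. The exact boundaries can vary with the JDK version and the locale, so treat the numbers as indicative.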

Answered 2021-06-26
