文本域词频统计的clickhouse方法

6jjcrrmo  于 2021-07-15  发布在  ClickHouse
关注(0)|答案(2)|浏览(922)

我有一个clickhouse表,其中一个字段包含文本描述(约300个单词)。
例如评论:

  1. Rev_id Place_id Stars Category Text
  2. 1 12 3 Food Nice food but a bad dirty place.
  3. 2 31 4 Sport Not bad, they have everything.
  4. 3 55 1 Bar Poor place,bad audience.

我想做一些词数分析,比如一般词频统计(每个词出现的次数)或者每个类别的前k个词。
在示例中:

  1. word count
  2. bad 3
  3. place 2

... 有没有一种方法可以只在clickhouse中完成而不涉及编程语言?

hvvq6cgz

hvvq6cgz1#

  1. SELECT
  2. arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
  3. count()
  4. FROM
  5. (
  6. SELECT 'Nice food but a bad dirty place.' AS x
  7. UNION ALL
  8. SELECT 'Not bad, they have everything.'
  9. UNION ALL
  10. SELECT 'Poor place,bad audience.'
  11. )
  12. GROUP BY w
  13. ORDER BY count() DESC
  14. ┌─w──────────┬─count()─┐
  15. 4
  16. bad 3
  17. place 2
  18. have 1
  19. Poor 1
  20. food 1
  21. Not 1
  22. they 1
  23. audience 1
  24. Nice 1
  25. but 1
  26. dirty 1
  27. a 1
  28. everything 1
  29. └────────────┴─────────┘
  30. SELECT CATEGORY, ....
  31. GROUP BY CATEGORY, w
展开查看全部
v09wglhw

v09wglhw2#

如果它适用于你的情况,我会考虑使用alphatokens作为一个更有效的。

  1. SELECT
  2. category,
  3. arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x), alphaTokens(text))) token,
  4. count() count
  5. FROM
  6. (
  7. /* test data */
  8. SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
  9. FROM
  10. (
  11. SELECT arrayJoin([
  12. (1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
  13. (4, 12, 3, 'Food', ' the the the the good food ..'),
  14. (2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
  15. (3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
  16. )
  17. )
  18. GROUP BY category, token
  19. ORDER BY count DESC
  20. LIMIT 5;
  21. /*
  22. ┌─category─┬─token────┬─count─┐
  23. │ Food │ food │ 2 │
  24. │ Food │ bad │ 1 │
  25. │ Bar │ audience │ 1 │
  26. │ Food │ Nice │ 1 │
  27. │ Bar │ Poor │ 1 │
  28. └──────────┴──────────┴───────┘
  29. * /

使用topk的示例:

  1. SELECT
  2. category,
  3. arrayReduce('topK(3)',
  4. arrayFilter(x -> (NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x)), groupArrayArray(alphaTokens(text)))) AS result
  5. FROM
  6. (
  7. /* test data */
  8. SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
  9. FROM
  10. (
  11. SELECT arrayJoin([
  12. (1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
  13. (4, 12, 3, 'Food', ' the the the the good food ..'),
  14. (2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
  15. (3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
  16. )
  17. )
  18. GROUP BY category;
  19. /* result
  20. ┌─category─┬─result─────────────────┐
  21. │ Bar │ ['Poor','place','bad'] │
  22. │ Food │ ['food','Nice','bad'] │
  23. │ Sport │ ['Not','bad','they'] │
  24. └──────────┴────────────────────────┘
  25. * /

ps:在处理之前降低所有字符串/标记可能是有意义的

展开查看全部

相关问题