文本域词频统计的clickhouse方法

6jjcrrmo 于 2021-07-15 发布在 ClickHouse

关注(0)|答案(2)|浏览(925)

我有一个clickhouse表，其中一个字段包含文本描述（约300个单词）。
例如评论：

Rev_id    Place_id    Stars    Category    Text
1         12           3        Food       Nice food but a bad dirty place.
2         31           4        Sport      Not bad, they have everything.
3         55           1        Bar        Poor place,bad audience.

我想做一些词数分析，比如一般词频统计（每个词出现的次数）或者每个类别的前k个词。
在示例中：

word    count
bad     3
place   2

... 有没有一种方法可以只在clickhouse中完成而不涉及编程语言？

clickhouse word-count

来源：https://stackoverflow.com/questions/60597863/clickhouse-approach-for-word-frequency-count-on-textual-field

2条答案

按热度按时间

hvvq6cgz1#

SELECT
    arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
    count()
FROM
(
    SELECT 'Nice food but a bad dirty place.' AS x
    UNION ALL
    SELECT 'Not bad, they have everything.'
    UNION ALL
    SELECT 'Poor place,bad audience.'
)
GROUP BY w
ORDER BY count() DESC

┌─w──────────┬─count()─┐
│            │       4 │
│ bad        │       3 │
│ place      │       2 │
│ have       │       1 │
│ Poor       │       1 │
│ food       │       1 │
│ Not        │       1 │
│ they       │       1 │
│ audience   │       1 │
│ Nice       │       1 │
│ but        │       1 │
│ dirty      │       1 │
│ a          │       1 │
│ everything │       1 │
└────────────┴─────────┘

SELECT CATEGORY, ....
GROUP BY CATEGORY, w

赞(0）回复(0）举报 2021-07-15

v09wglhw2#

如果它适用于你的情况，我会考虑使用alphatokens作为一个更有效的。

SELECT
    category,
    arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x), alphaTokens(text))) token,
    count() count
FROM
(
    /* test data */
    SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
    FROM
    (
        SELECT arrayJoin([
          (1, 12, 3, 'Food', 'Nice      food but a bad dirty place.'), 
          (4, 12, 3, 'Food', ' the the the the good food   ..'), 
          (2, 31, 4, 'Sport', 'Not bad,,, they have everything.'), 
          (3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
    )
)
GROUP BY category, token
ORDER BY count DESC
LIMIT 5;
/*
┌─category─┬─token────┬─count─┐
│ Food     │ food     │     2 │
│ Food     │ bad      │     1 │
│ Bar      │ audience │     1 │
│ Food     │ Nice     │     1 │
│ Bar      │ Poor     │     1 │
└──────────┴──────────┴───────┘

* /

使用topk的示例：

SELECT
    category,
    arrayReduce('topK(3)', 
                arrayFilter(x -> (NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x)), groupArrayArray(alphaTokens(text)))) AS result
FROM
(
    /* test data */
    SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
    FROM
    (
        SELECT arrayJoin([
          (1, 12, 3, 'Food', 'Nice      food but a bad dirty place.'), 
          (4, 12, 3, 'Food', ' the the the the good food   ..'), 
          (2, 31, 4, 'Sport', 'Not bad,,, they have everything.'), 
          (3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
    )
)
GROUP BY category;
/* result
┌─category─┬─result─────────────────┐
│ Bar      │ ['Poor','place','bad'] │
│ Food     │ ['food','Nice','bad']  │
│ Sport    │ ['Not','bad','they']   │
└──────────┴────────────────────────┘

* /

ps：在处理之前降低所有字符串/标记可能是有意义的

赞(0）回复(0）举报 2021-07-15

我来回答

文本域词频统计的clickhouse方法

2条答案

相关问题

热门标签

最新问答