SQL query to "collapse" adjacent rows

Posted by c86crjj0 on 2021-07-15 in ClickHouse

We have an application that captures the searches users perform. Because of the way our search works (we return results after only a few characters) and the speed at which people type, we end up with a log entry for every search/keystroke. It looks like this:

(It looks a bit like a Christmas tree lying on its side...)
Internally we need this data to count the number of searches (i.e. API calls), but for reporting to customers it is not great to report all the "half" queries.
I am looking for a way to collapse these rows into a single row containing the longest/last search term.
There is one catch: a user (cid) can perform more than one search within a session, but if we look at the timestamps we can split them apart.
It would have to go something like this (a rough sketch of these steps follows the list):
1) group rows that are no more than 2 seconds apart
2) order by the length of the query (or take the last one) to get the final query
3) group by term and count how often each term is used, for reporting
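
A rough sketch of what these steps could look like, assuming a recent ClickHouse version with window functions and a log table named search_log with columns ts, cid and ev (placeholder names based on the sample below):

SELECT term, count() AS searches
FROM
(
    SELECT
        cid,
        session_id,
        argMax(ev, ts) AS term          -- step 2: keep the last (longest) query of each session
    FROM
    (
        SELECT
            cid, ts, ev,
            -- step 1: a new session starts whenever the gap to the previous row exceeds 2 seconds
            sum(gap > 2) OVER (PARTITION BY cid ORDER BY ts) AS session_id
        FROM
        (
            SELECT
                cid, ts, ev,
                toInt32(ts) - lagInFrame(toInt32(ts))
                    OVER (PARTITION BY cid ORDER BY ts
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS gap
            FROM search_log
        )
    )
    GROUP BY cid, session_id
)
GROUP BY term;                          -- step 3: count how often each final term occurs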
The data as text:

2019-12-09  2019-12-09 12:58:45 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cleaner
2019-12-09  2019-12-09 12:58:45 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cleane
2019-12-09  2019-12-09 12:58:44 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum clean
2019-12-09  2019-12-09 12:58:43 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum clea
2019-12-09  2019-12-09 12:58:43 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cle
2019-12-09  2019-12-09 12:58:42 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cl
2019-12-09  2019-12-09 12:58:41 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum c
2019-12-09  2019-12-09 12:58:40 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum
2019-12-09  2019-12-09 12:58:39 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuu
2019-12-09  2019-12-09 12:58:38 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacu
2019-12-09  2019-12-09 12:58:37 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vac
2019-12-09  2019-12-09 12:58:15 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widget
2019-12-09  2019-12-09 12:58:14 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widge
2019-12-09  2019-12-09 12:58:13 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widg
2019-12-09  2019-12-09 12:58:12 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue wid
2019-12-09  2019-12-09 12:58:12 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue wi
2019-12-09  2019-12-09 12:58:11 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue w
2019-12-09  2019-12-09 12:58:10 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue
2019-12-09  2019-12-09 12:58:09 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blu
2019-12-09  2019-12-09 12:57:38 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widget
2019-12-09  2019-12-09 12:57:37 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widge
2019-12-09  2019-12-09 12:57:36 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widg
2019-12-09  2019-12-09 12:57:35 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   wid

Expected result:

vacuum cleaner  1
blue widget     1
widget          1

6qftjkof #1

A log file that looks like an upside-down Christmas tree; it is probably safe to assume the last entry is at the top, so if you can grab the first record of each group your problem is solved, including cases where someone mistypes something and then corrects it. That makes the date and time almost irrelevant to finding a solution.
Assuming your field "cid" identifies one search, here is a solution based on your data that produces exactly your expected result. The input data should be in a table named "inputTable", created as:

CREATE TABLE `inputTable` (
  `date` datetime DEFAULT NULL,
  `ts` datetime DEFAULT NULL,
  `tid` varchar(255) DEFAULT NULL,
  `cid` varchar(255) DEFAULT NULL,
  `xid` varchar(255) DEFAULT NULL,
  `xvar` int(11) DEFAULT NULL,
  `ec` varchar(255) DEFAULT NULL,
  `ea` varchar(255) DEFAULT NULL,
  `ev` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
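
For quick testing, a handful of INSERT statements with a subset of the question's sample rows (kept in the same newest-first order) can populate inputTable:

INSERT INTO `inputTable` VALUES
  ('2019-12-09', '2019-12-09 12:58:45', '5dea585477c94502b52c43fb', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'search', 1, 'search', 'query', 'vacuum cleaner'),
  ('2019-12-09', '2019-12-09 12:58:44', '5dea585477c94502b52c43fb', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'search', 1, 'search', 'query', 'vacuum clean'),
  ('2019-12-09', '2019-12-09 12:58:15', '5dea585477c94502b52c43fb', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'search', 0, 'search', 'query', 'blue widget'),
  ('2019-12-09', '2019-12-09 12:58:14', '5dea585477c94502b52c43fb', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'search', 0, 'search', 'query', 'blue widge'),
  ('2019-12-09', '2019-12-09 12:57:38', '5dea585477c94502b52c43fb', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'search', 1, 'search', 'query', 'widget'),
  ('2019-12-09', '2019-12-09 12:57:37', '5dea585477c94502b52c43fb', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'search', 1, 'search', 'query', 'widge');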

Run the code below against different scenarios to find the edge cases where it may not work and needs fine-tuning.

DROP PROCEDURE IF EXISTS SP_cursor1;
DELIMITER $$
CREATE PROCEDURE `SP_cursor1`()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE tempCid VARCHAR(255);

    DECLARE _date DATETIME;
    DECLARE _ts DATETIME;
    DECLARE _tid VARCHAR(255);
    DECLARE _cid VARCHAR(255);
    DECLARE _xid VARCHAR(255);
    DECLARE _xvar INT;
    DECLARE _ec VARCHAR(255);
    DECLARE _ea VARCHAR(255);
    DECLARE _ev VARCHAR(255);

    DECLARE cursor1 CURSOR for 
    SELECT * FROM inputTable;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    OPEN cursor1;
    SET tempCid = '';
    DROP TABLE IF EXISTS tempTbl;
    CREATE TABLE tempTbl AS 
        SELECT * FROM inputTable WHERE 1 = 0;
    read_loop: LOOP
        fetch cursor1 into _date, _ts, _tid, _cid, _xid, _xvar, _ec, _ea, _ev;
        IF done THEN
            LEAVE read_loop;
        END IF;    
        IF tempCid <> _cid THEN
            INSERT INTO tempTbl
                SELECT _date, _ts, _tid, _cid, _xid, _xvar, _ec, _ea, _ev;
            SET tempCid = _cid;
        END IF;
    END LOOP;
    CLOSE cursor1;
    SELECT ev FROM tempTbl;
    DROP TABLE tempTbl;
END$$
DELIMITER ;

CALL SP_cursor1();

Hope this helps.

9nvpjoqh #2

I am not sure a time-based approach is best, whatever value you choose (e.g. 2 seconds): a user may pause longer while typing a search term. Instead, I suggest keeping only those strings that are not themselves a substring of a longer search term typed later within the same session. This simple approach can be done in standard SQL:

SELECT ev,
       COUNT(ev) AS ev_count
FROM tbl t1
WHERE
  (SELECT COUNT(*)
   FROM tbl t2
   WHERE t2.cid = t1.cid
     AND t2.date <= t1.date
     AND t1.ev <> t2.ev
     AND t2.ev LIKE CONCAT(t1.ev, '%')) = 0
GROUP BY ev;

For example, if the user had mistyped the term (say with a stray "r") before deleting it and typing "vacuum cleaner" correctly, the query would return both the mistyped term and "vacuum cleaner".
See the demo here (written for MySQL, but ClickHouse should behave similarly).

ddarikpa #3

I assume the user can not only add new characters but also delete them, so the "Christmas tree" rule does not hold for every final query.
This query returns the latest search input of each session (the final query), which may not be the longest input in the session.

SELECT search_input, count()
FROM (
  SELECT 
    /* create group of pairs (input_seconds, input_text). */
    groupArray((toInt32(ts), ev)) inputs,
    /* define the end of each session. */
    arrayMap((x, index) -> index = 1 ? 1 : (inputs[index - 1].1 - x.1 > 2 /* 2 is max delay between inputs */ ? index : 0), inputs, arrayEnumerate(inputs)) session_end_points,
    /* take the latest input in each session. */
    arrayMap(x -> inputs[x].2, arrayFilter(x -> x > 0, session_end_points)) search_inputs,
    arrayJoin(search_inputs) search_input
  FROM (
    /* test data, sorted descending (newest first) */
    SELECT toDateTime(data.1) ts, data.2 cid, data.3 ev
    FROM (
      SELECT arrayJoin([
      ('2019-12-09 12:58:55', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'X'),    

      ('2019-12-09 12:58:45', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleaner'),
      ('2019-12-09 12:58:45', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleane'),
      ('2019-12-09 12:58:44', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
      ('2019-12-09 12:58:43', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clea'),
      ('2019-12-09 12:58:43', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cle'),
      ('2019-12-09 12:58:42', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cl'),
      ('2019-12-09 12:58:41', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum c'),
      ('2019-12-09 12:58:40', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum'),
      ('2019-12-09 12:58:39', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuu'),
      ('2019-12-09 12:58:38', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacu'),
      ('2019-12-09 12:58:37', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),

      ('2019-12-09 12:58:15', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widget'),
      ('2019-12-09 12:58:14', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widge'),
      ('2019-12-09 12:58:13', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widg'),
      ('2019-12-09 12:58:12', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue wid'),
      ('2019-12-09 12:58:12', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue wi'),
      ('2019-12-09 12:58:11', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue w'),
      ('2019-12-09 12:58:10', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue'),
      ('2019-12-09 12:58:09', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blu'),

      ('2019-12-09 12:57:38', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widget'),
      ('2019-12-09 12:57:37', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widge'),
      ('2019-12-09 12:57:36', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widg'),
      ('2019-12-09 12:57:35', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'wid'),

      ('2019-12-09 12:58:34', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vaX'),  

      ('2019-12-09 12:58:30', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
      ('2019-12-09 12:58:28', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleane'),
      ('2019-12-09 12:58:26', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
      ('2019-12-09 12:58:24', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clea'),
      ('2019-12-09 12:58:22', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cle'),

      ('2019-12-09 12:58:15', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),
      ('2019-12-09 12:58:14', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacu'),
      ('2019-12-09 12:58:13', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),

      ('2019-12-09 12:57:34', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vaX')]) data)
    ORDER BY ts DESC)
  GROUP BY cid)
GROUP BY search_input;

/* Result:
┌─search_input───┬─count()─┐
│ widget         │       1 │
│ vacuum cleaner │       1 │
│ blue widget    │       1 │
│ vac            │       1 │
│ vaX            │       2 │
│ X              │       1 │
│ vacuum clean   │       1 │
└────────────────┴─────────┘

*/
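
To run the same array-based logic against a real log table rather than the inline test data, a sketch (assuming a table named search_log with placeholder columns ts, cid and ev) could look like:

SELECT search_input, count()
FROM (
  SELECT
    /* pairs of (seconds, input_text) per cid; relies on the newest-first ordering below */
    groupArray((toInt32(ts), ev)) inputs,
    /* mark the end of each session: a gap of more than 2 seconds to the previous (newer) entry */
    arrayMap((x, index) -> index = 1 ? 1 : (inputs[index - 1].1 - x.1 > 2 ? index : 0), inputs, arrayEnumerate(inputs)) session_end_points,
    /* take the latest input of each session */
    arrayMap(x -> inputs[x].2, arrayFilter(x -> x > 0, session_end_points)) search_inputs,
    arrayJoin(search_inputs) search_input
  FROM (
    SELECT ts, cid, ev
    FROM search_log
    ORDER BY ts DESC)
  GROUP BY cid)
GROUP BY search_input;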

gz5pxeao #4

This question looks old, but I will share my thoughts anyway; it may help someone.
Looking at the log file, what I see is that it lacks a strong grouping key for picking each customer's latest search. I notice from the log that each search by each customer completes within one minute. Given that, what I would do is derive a new column (say timestamp_x) from the timestamp of each log entry, formatted as "yyyy-MM-dd HH:mm" (so the seconds are truncated or set to 00). That gives a strong set of grouping keys (timestamp_x and cid).
After that, I would group the dataset by timestamp_x and cid, query it using those group keys, and take the latest value of each group by ordering on the timestamp.
In T-SQL this can be done with a WITH statement (a common table expression). I do not know the ClickHouse equivalent, but I am sure the logic above can be implemented in any SQL dialect.
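
As a rough ClickHouse sketch of that idea (illustrative only, assuming the inputTable columns from the first answer): truncate the timestamp to the minute with toStartOfMinute, group by (cid, minute), and take the latest term per group with argMax.

SELECT last_ev AS search_term, count() AS searches
FROM
(
    SELECT
        cid,
        toStartOfMinute(ts) AS ts_minute,   -- truncated minute: the "timestamp_x" grouping key
        argMax(ev, ts) AS last_ev           -- latest term typed within that (cid, minute) group
    FROM inputTable
    GROUP BY cid, ts_minute
)
GROUP BY last_ev;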
Hope this helps!
