聚合配置单元sql中的clickstream数据(组中的第一个值)cte sql

ddarikpa  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(383)

我正在用apachehivesql分析来自不同网站的网站数据,我想找出客户旅程中涉及的域。clickstream数据具有以下结构:

id1 domain
--- ---
1    1
1    1
1    1
1    3
1    1
1    2
3    4
3    5
3    5
6    7
6    6
6    7

为了以正确的方式分析数据,我想将数据转换为以下结构:

id1 domain
--- ---
1    1
1    3
1    1
1    2
3    4
3    5
6    7
6    6
6    7

如何在配置单元sql中实现这一点?
我必须找出哪个接触点包含在哪个序列/顺序中,并消除域之间的冗余点击。最后通牒是,此孔查询将生成如下表:

1: 1>3>1>2
3: 4>5
6: 7>6>7

我不会说这是一个特定于Hive的问题!

rkue9o1l

rkue9o1l1#

解决办法是

SELECT * FROM (
SELECT id
       ,domain
       ,LAG(domain) OVER (PARTITION BY id ORDER BY date) AS lag_domain
FROM t_domain
) t1
WHERE lag_domain NOT LIKE domain
f4t66c6m

f4t66c6m2#

在sql中,我们可以在hive中使用cte,我想我们必须将varchar(100)改为string,我怀疑corelated子查询在hive中是否有效。

;WITH list (id1 , domain, [level], list)
     AS (
    SELECT id1, domain, 1, CAST(domain AS varchar(100)) AS list
    FROM Hivetable
    WHERE domain IS NOT NULL

    UNION ALL

    SELECT list.id1, conn.domain, list.[level]+1,
          CAST(list+' <- '+CAST(conn.domain AS varchar(100)) AS varchar(100)) AS list
    FROM list
    INNER JOIN Hivetable AS conn ON list.domain=conn.id1
    WHERE conn.domain IS NOT NULL)

SELECT id1, list AS blockedByChain
FROM list
WHERE [level]=(SELECT MAX(sub.[level]) FROM list AS sub WHERE sub.id1=list.id1)
ORDER BY id1, [level], domain;

相关问题