Google BigQuery ranking and customer clicks problem

Posted by tvmytwxo on 2021-07-29 in Java

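I'm using Google BigQuery to track customer clicks through to my website. I follow a simple set of rules:
1. I only want to see the first hit source if a customer has come through the same source multiple times consecutively on the same day.
2. I only want to see the first hit source if a customer has come through the same source multiple times consecutively across different days.
3. I want to see all hit sources if they appear on the same day but are not consecutive.
At the moment, I'm using the following: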

rank() over (partition by customer_code, hit_source order by hit_timedate) rnk

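If I filter on "where rnk = 1", this covers the first two rules. It gives me only the distinct hit sources, regardless of whether they fall on the same day, because hit_timedate includes the time of day. But it doesn't cover the third rule, because the rank is partitioned by hit source, so a source that reappears later continues the earlier count instead of starting again at 1.
I would appreciate it if someone could help me with this.
Edit:
I'm not sure how to add/upload a sample dataset, so I'll try to do it here: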

Customer_Code       Hit_Source               Hit_Timedate
     101             Facebook             25/05/2020 10:30am
     101             Facebook             25/05/2020 11:45am
     101             Facebook             25/05/2020 11:55am
     101             Twitter              25/05/2020 12:30am
     101             Facebook             25/05/2020 13:00pm 
     101             Google               25/05/2020 15:00pm
     101             Instagram            26/05/2020 09:00am

The ideal result set would look like this:

Customer_Code       Hit_Source               Hit_Timedate        Rank
     101             Facebook             25/05/2020 10:30am       1
     101             Facebook             25/05/2020 11:45am       2
     101             Facebook             25/05/2020 11:55am       3
     101             Twitter              25/05/2020 12:30am       1
     101             Facebook             25/05/2020 13:00pm       1
     101             Google               25/05/2020 15:00pm       1
     101             Instagram            26/05/2020 09:00am       1

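So, based on my rules, I want to achieve the above. The main problem I'm facing is getting row 5 of the example ranked as "1". I want that because the last two "Facebook" hits are not consecutive. But I'm struggling to do this alongside the first two rules, which I have already implemented.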
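To make the gap between the current query and the desired result concrete, here is a minimal Python sketch (not BigQuery code) for the sample customer. The function names `rank_by_source` and `rank_within_runs` are hypothetical, and the sketch assumes the hits are already sorted by time and that all timestamps are distinct, so rank behaves like a row number.

```python
from collections import defaultdict
from itertools import groupby

# Hit sources for customer 101, in chronological order (from the sample data).
hits = ["Facebook", "Facebook", "Facebook", "Twitter", "Facebook", "Google", "Instagram"]


def rank_by_source(sources):
    """Mimic rank() over (partition by hit_source order by hit_timedate)."""
    seen = defaultdict(int)
    ranks = []
    for source in sources:
        seen[source] += 1          # the count keeps growing across the whole history
        ranks.append(seen[source])
    return ranks


def rank_within_runs(sources):
    """Restart the rank at 1 whenever the source differs from the previous hit."""
    ranks = []
    for _, run in groupby(sources):  # groupby only groups *consecutive* equal items
        ranks.extend(range(1, len(list(run)) + 1))
    return ranks


print(rank_by_source(hits))    # [1, 2, 3, 1, 4, 1, 1]
print(rank_within_runs(hits))  # [1, 2, 3, 1, 1, 1, 1]
```

The fifth hit (Facebook at 13:00) gets rank 4 when ranking by source, so a "rnk = 1" filter drops it. Ranking within consecutive runs gives it rank 1, as in the ideal result set.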

Answer #1 (ebdffaop)

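You can use lag() and a cumulative count:

select t.*,
       -- 1 + running count of source changes within the customer's day
       1 + countif(prev_source != hit_source) over (partition by customer_code, datetime_trunc(hit_timedate, day) order by hit_timedate) as ranking
from (select t.*,
             -- previous hit source for the same customer on the same day
             lag(hit_source) over (partition by customer_code, datetime_trunc(hit_timedate, day)
                                   order by hit_timedate
                                  ) as prev_source
      from t
     ) t;

The idea is to create a flag indicating whether the previous source differs from the current row's, and to add 1 to a running count whenever it does. The 1 + is there because the count starts at 0 and you want it to start at 1. The result numbers each run of consecutive hits from the same source within a customer's day: every row in a run shares the same ranking value, and the value goes up by one each time the source changes.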
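As a rough illustration of the lag-plus-running-count idea, the following Python sketch walks the rows in time order, remembers the previous source within each customer and day, and increments a counter whenever the source changes. The function name `run_numbers` is hypothetical, the rows are assumed to be pre-sorted by customer and time, and the Twitter hit is assumed to be at 12:30 in the afternoon.

```python
from datetime import datetime

# (customer_code, hit_source, hit_timedate), sorted by customer and time.
rows = [
    (101, "Facebook", datetime(2020, 5, 25, 10, 30)),
    (101, "Facebook", datetime(2020, 5, 25, 11, 45)),
    (101, "Facebook", datetime(2020, 5, 25, 11, 55)),
    (101, "Twitter", datetime(2020, 5, 25, 12, 30)),
    (101, "Facebook", datetime(2020, 5, 25, 13, 0)),
    (101, "Google", datetime(2020, 5, 25, 15, 0)),
    (101, "Instagram", datetime(2020, 5, 26, 9, 0)),
]


def run_numbers(rows):
    """Number the runs of consecutive identical sources per customer and day."""
    result = []
    prev_key = None
    prev_source = None
    changes = 0
    for customer, source, ts in rows:
        key = (customer, ts.date())
        if key != prev_key:
            # New partition: reset the "lag" value and the running count.
            prev_key, prev_source, changes = key, None, 0
        if prev_source is not None and prev_source != source:
            changes += 1  # the source changed, so a new run starts here
        result.append(1 + changes)
        prev_source = source
    return result


print(run_numbers(rows))  # [1, 1, 1, 2, 3, 4, 1]
```

Because the partition includes the day, the Instagram hit on the following day starts again at 1.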

Answer #2 (fjnneemd)

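To contribute further to the community, I'll share a different approach using BigQuery's built-in lag(), sum(), case when, and min() functions.
The following code uses the sample data you provided and is split into two steps (marked in the comments):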

with data as (
SELECT 101 as Customer_Code,"Facebook" as Hit_Source ,DATETIME(2020,05,25,10,30,00) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Facebook" as Hit_Source ,DATETIME(2020,05,25,11,45,0) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Facebook" as Hit_Source ,DATETIME(2020,05,25,11,55,0) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Twitter" as Hit_Source ,DATETIME(2020,05,25,12,30,0) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Facebook" as Hit_Source ,DATETIME(2020,05,25,13,00,0) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Google" as Hit_Source ,DATETIME(2020,05,25,15,00,0) as Hit_Timedate UNION ALL
SELECT 101 as Customer_Code,"Instagram" as Hit_Source ,DATETIME(2020,05,25,09,00,0) as Hit_Timedate 
),

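# step 1: attach each customer's previous hit source, in time order

data1 as (
SELECT Customer_Code, Hit_Source, Hit_Timedate,
       LAG(Hit_Source, 1) OVER (PARTITION BY Customer_Code ORDER BY Hit_Timedate) as PrevHit
from data
),

# step 2: running count of source changes, so each run of consecutive identical sources shares one rank_aux

data2 as (
SELECT Customer_Code, Hit_Source, PrevHit, Hit_Timedate,
       SUM(CASE WHEN Hit_Source = PrevHit THEN 0 ELSE 1 END)
         OVER (PARTITION BY Customer_Code ORDER BY Hit_Timedate, Hit_Source) AS rank_aux
FROM data1
)

# final query: rank the hits inside each run; rank = 1 marks the first hit of a run
SELECT Customer_Code, Hit_Source, MIN(Hit_Timedate) AS first_Hit_Timedate,
       RANK() OVER (PARTITION BY Customer_Code, rank_aux ORDER BY Hit_Timedate) as rank
FROM data2
GROUP BY Customer_Code, Hit_Source, rank_aux, Hit_Timedate
ORDER BY Customer_Code, first_Hit_Timedate, Hit_Source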

And the final output:

Row Customer_Code   Hit_Source  first_Hit_Timedate  rank
1   101             Instagram   2020-05-25T09:00:00 1
2   101             Facebook    2020-05-25T10:30:00 1
3   101             Facebook    2020-05-25T11:45:00 2
4   101             Facebook    2020-05-25T11:55:00 3
5   101             Twitter     2020-05-25T12:30:00 1
6   101             Facebook    2020-05-25T13:00:00 1
7   101             Google      2020-05-25T15:00:00 1

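Note that step 1 creates a new column, PrevHit, holding the previous value of Hit_Source. Step 2 then creates a new column, rank_aux, so that the results can be grouped correctly. Here is the output of step 2 (for explanation purposes only):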

Row Customer_Code   Hit_Source  PrevHit     Hit_Timedate        rank_aux
1   101             Instagram   null        2020-05-25T09:00:00 1
2   101             Facebook    Instagram   2020-05-25T10:30:00 2
3   101             Facebook    Facebook    2020-05-25T11:45:00 2
4   101             Facebook    Facebook    2020-05-25T11:55:00 2
5   101             Twitter     Facebook    2020-05-25T12:30:00 3
6   101             Facebook    Twitter     2020-05-25T13:00:00 4
7   101             Google      Facebook    2020-05-25T15:00:00 5

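Note that rows 2, 3, and 4 all have rank_aux = 2, which is what we want: rows that share a rank_aux value belong to the same run of consecutive hits. Ranking by Hit_Timedate within each rank_aux group gives the final output shown above, and keeping only the rows with rank = 1 leaves the earliest Hit_Timedate of each run.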
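The same two steps, plus the final "first hit of each run" filter, can be sketched in plain Python. This is only an illustration of the logic for a single customer, not BigQuery code. The function name `first_hit_per_run` is hypothetical, and the data mirrors the rows in the `data` CTE of this answer.

```python
from datetime import datetime
from itertools import accumulate

# (Hit_Source, Hit_Timedate) for customer 101, as defined in the answer's data CTE.
hits = [
    ("Facebook", datetime(2020, 5, 25, 10, 30)),
    ("Facebook", datetime(2020, 5, 25, 11, 45)),
    ("Facebook", datetime(2020, 5, 25, 11, 55)),
    ("Twitter", datetime(2020, 5, 25, 12, 30)),
    ("Facebook", datetime(2020, 5, 25, 13, 0)),
    ("Google", datetime(2020, 5, 25, 15, 0)),
    ("Instagram", datetime(2020, 5, 25, 9, 0)),
]


def first_hit_per_run(hits):
    """Return (rank_aux values, first hit of each run) for one customer."""
    ordered = sorted(hits, key=lambda hit: hit[1])  # ORDER BY Hit_Timedate
    sources = [source for source, _ in ordered]
    prev_hits = [None] + sources[:-1]               # step 1: LAG(Hit_Source)
    # Step 2: running sum that adds 1 whenever the source differs from PrevHit.
    rank_aux = list(accumulate(0 if s == p else 1 for s, p in zip(sources, prev_hits)))
    firsts = {}
    for hit, aux in zip(ordered, rank_aux):
        firsts.setdefault(aux, hit)                 # keep the earliest hit per run
    return rank_aux, list(firsts.values())


rank_aux, firsts = first_hit_per_run(hits)
print(rank_aux)                            # [1, 2, 2, 2, 3, 4, 5]
print([source for source, _ in firsts])    # ['Instagram', 'Facebook', 'Twitter', 'Facebook', 'Google']
```

The rank_aux values match the step 2 table, and the Facebook hit at 13:00 survives the filter because the Twitter hit before it starts a new run.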
