jaccard相似度计算

cbwuti44  于 2021-06-26  发布在  Hive
关注(0)|答案(1)|浏览(484)

我的数据如下:

CustomerId   Category
  100            2
  100            2
  100            3
  100            6
  100            4
  200            3
  200            6
  200            7
  300            2

所以我想要的结果是jaccard相似性指数:
jaccard(100200)=2(项目共享)/5(项目并集)
jaccard(100300)=1(项目共享)/4(项目并集)
jaccard(200300)=0(项目共享)/4(项目并集)。
我最初尝试的是寻找术语的并集和交集,但我不确定这是否是最有效的方法。另外,我想避免重复像jaccard(100300)和jaccard(300100)出现在一起。有人能帮忙吗?

select t1.customer_id, t2.customer_id,
              sum(case when t1.category_id = t2.category_id then 1 else 0 end) intersection,
sum(case when t1.category = t2.category then 1
         when t1.category <> t2.category then 1 else 0 end)
union
    from t t1 cross join
         t t2
  Where t1.customer_id <> t2.customer_id
    group by t1.customer_id, t2.customer_id

不幸的是,我也检查了,我有一个客户购买多个项目在同一类别。所以我编辑了这个表,以反映Customer100在类别2中有两个项目。但是,它不应更改jaccard相似性度量值。

kmbjn2e3

kmbjn2e31#

你不需要一个 cross join . 通过计算一对的不同类别id的总和并从中减去相交的类别id,得到分母。

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
     / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
JOIN t t2 ON t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

如果不支持不等式 join ,使用

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)
       / (count(DISTINCT t1.category_id)+count(DISTINCT t2.category_id)-sum(CASE WHEN t1.category_id = t2.category_id THEN 1 ELSE 0 END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

如果您只需要对的交叉点计数,下面的查询就足够了。

select t1.customer_id as id1, t2.customer_id as id2
,sum(case when t1.category_id = t2.category_id then 1 else 0 end) as intersection
from t t1 
join t t2 on t1.customer_id<t2.customer_id
group by t1.customer_id, t2.customer_id

编辑:根据op的评论,一个客户可以多次拥有同一类别,但只应计算一次。

SELECT t1.customer_id AS id1,
       t2.customer_id AS id2,
       1.0*COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)
       / (COUNT(DISTINCT t1.category_id)+COUNT(DISTINCT t2.category_id)
         -COUNT(DISTINCT CASE WHEN t1.category_id = t2.category_id THEN t1.category_id END)) AS jaccard_similarity
FROM t t1
CROSS JOIN t t2 
WHERE t1.customer_id<t2.customer_id
GROUP BY t1.customer_id, t2.customer_id

相关问题