sqlimpala按最频繁值减少id

btxsgosb  于 2021-06-26  发布在  Impala
关注(0)|答案(1)|浏览(399)

给出如下表格:

  1. +--+------------------+-----------+
  2. |id| diagnosis_age| diagnosis|
  3. +--+------------------+-----------+
  4. | 1|2.1843037179180302| 315.320000|
  5. | 1| 2.80033330216659| 315.320000|
  6. | 1| 2.8222365762732| 315.320000|
  7. | 1| 5.64822705794013| 325.320000|
  8. | 1| 5.686557787521759| 335.320000|
  9. | 2| 5.70572315231258| 315.320000|
  10. | 2| 5.724888517103389| 315.320000|
  11. | 3| 5.744053881894209| 315.320000|
  12. | 3|5.7604813374292005| 315.320000|
  13. | 3| 5.77993740687426| 315.320000|
  14. +--+------------------+-----------+

我想通过选择最常诊断的诊断和年龄,将每个id减少到一行。
输出如下所示:

  1. +--+------------------+-----------+
  2. |id| diagnosis_age| diagnosis|
  3. +--+------------------+-----------+
  4. | 1|2.1843037179180302| 315.320000|
  5. | 2| 5.70572315231258| 315.320000|
  6. | 3| 5.744053881894209| 315.320000|
  7. +--+------------------+-----------+

我试过这样的方法:

  1. SELECT id, diagnosis, age,
  2. COUNT(distinct diagnosis) OVER (partition by id)
  3. FROM table

但我真的什么也做不到!
使用以下代码:

  1. wc=Window().partitionBy("id", "diagnosis").orderBy("diagnosis_age")
  2. wc2=Window().partitionBy("id")
  3. ddfc.withColumn("count", F.count("diagnosis").over(wc))\
  4. .withColumn("max", F.max("count").over(wc2))\
  5. .filter("count=max")\
  6. .groupBy("id").agg(F.first("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
  7. .orderBy("id")\
  8. .groupBy("diagnosis")\
  9. .count()\
  10. .orderBy("count", ascending = False)\
  11. .show(5)

我得到:

  1. +--------------+-----+
  2. |diagnosis |count|
  3. +--------------+-----+
  4. | V20.2|22179|
  5. | 382.900000|12985|
  6. | 389.900000|11333|
  7. | 381.810000| 7448|
  8. | 493.900000| 3249|
  9. +--------------+-----+

鉴于您的代码(修改为按诊断分组并获得5个最常见的诊断)输出:

  1. +--------------+------+
  2. |diagnosis |count |
  3. +--------------+------+
  4. | 389.900000 |13268 |
  5. | 382.900000 | 7572 |
  6. | V20.2 | 6193 |
  7. | 381.810000 | 4735 |
  8. | V72.19 | 4115 |
  9. +--------------+------+

谢谢你的帮助。

a11xaf1n

a11xaf1n1#

在统计学中,这是最常见的值,称为模式。
您可以使用聚合和 row_number() :

  1. select id, diagnosis, age
  2. from (select id, diagnosis, min(age) as age, count(*) as cnt,
  3. row_number() over (partition by id order by count(*) desc) as seqnum
  4. from t
  5. group by id, diagnosis
  6. ) da
  7. where seqnum = 1;

相关问题