Creating columns from groupby elements using SQL, pandas, or PySpark?

rjee0c15 · asked 2021-05-29 · Spark

Here is my data:

| ID | Date   |
|----|--------|
| 1  | 20-Mar |
| 1  | 30-Mar |
| 1  | 20-Apr |
| 2  | 10-Mar |
| 2  | 12-Mar |
| 3  | 20-Mar |
| 4  | 20-Mar |
| 4  | 9-Mar  |

I want to group by ID and create new columns holding the dates, like this:

| ID | Date_1 | Date_2 | Date_3 | Date_4 |
|----|--------|--------|--------|--------|
| 1  | 20-Mar | 30-Mar | 20-Apr |        |
| 2  | 10-Mar | 12-Mar |        |        |
| 3  | 20-Mar |        |        |        |
| 4  | 9-Mar  | 20-Mar |        |        |

Thanks


fcipmucu1#

You can try it like this:

```python
import pandas as pd

# number the rows within each ID group (Date1, Date2, ...), then pivot those labels into columns
df = df.groupby('ID').apply(lambda x: x.assign(flag=["Date" + str(s + 1) for s in range(len(x))])).reset_index(drop=True)
res = df.pivot(index='ID', columns='flag', values='Date')
print(res)
```
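For reference, a minimal sketch of the setup needed to try this out, assuming the column names `ID` and `Date` from the question (not part of the original answer):

```python
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2, 3, 4, 4],
    "Date": ["20-Mar", "30-Mar", "20-Apr", "10-Mar", "12-Mar", "20-Mar", "20-Mar", "9-Mar"],
})
```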

ttcibm8c2#

In SQL, you can use `row_number()` and conditional aggregation:

```sql
select
    id,
    max(case when rn = 1 then date end) date_1,
    max(case when rn = 2 then date end) date_2,
    max(case when rn = 3 then date end) date_3,
    max(case when rn = 4 then date end) date_4
from (
    select t.*, row_number() over(partition by id order by date) rn
    from mytable t
) t
group by id
```
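Since the question is tagged Spark, the same query can also be run from PySpark by registering the DataFrame as a temporary view. A minimal sketch, assuming the data is in a DataFrame `df` with columns `ID` and `Date` and an active SparkSession `spark` (these names are assumptions, not part of the original answer):

```python
# register the DataFrame so it can be queried as "mytable"
df.createOrReplaceTempView("mytable")

spark.sql("""
    select id,
           max(case when rn = 1 then date end) as date_1,
           max(case when rn = 2 then date end) as date_2,
           max(case when rn = 3 then date end) as date_3,
           max(case when rn = 4 then date end) as date_4
    from (select t.*, row_number() over (partition by id order by date) as rn
          from mytable t) t
    group by id
""").show()
```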

xpszyzbs3#

Use `groupBy` with the `collect_list` function, then create the date columns from the array index values. Example:

```python
# sample dataframe
df.show()
# +---+------+
# | ID|  Date|
# +---+------+
# |  1|20-Mar|
# |  1|30-Mar|
# |  1|20-Apr|
# |  2|10-Mar|
# +---+------+

from pyspark.sql.functions import *

df.groupBy("id").agg(collect_list(col("Date")).alias("tmp")) \
  .withColumn("Date_1", col("tmp")[0]) \
  .withColumn("Date_2", col("tmp")[1]) \
  .withColumn("Date_3", col("tmp")[2]) \
  .withColumn("Date_4", col("tmp")[3]) \
  .drop("tmp") \
  .show(10, False)
# +---+------+------+------+------+
# |id |Date_1|Date_2|Date_3|Date_4|
# +---+------+------+------+------+
# |1  |20-Mar|30-Mar|20-Apr|null  |
# |2  |10-Mar|null  |null  |null  |
# +---+------+------+------+------+
```

Starting from Spark 2.4, you can use the `element_at` function:

df.groupBy("id").agg(collect_list(col("Date")).alias("tmp")).
withColumn("Date_1",element_at(col("tmp"),1)).
withColumn("Date_2",element_at(col("tmp"),2)).
withColumn("Date_3",element_at(col("tmp"),3)).
withColumn("Date_4",element_at(col("tmp"),4)).
drop("tmp").
show(10,False)
Dynamic way:

```python
df1 = df.groupBy(col("id")).agg(collect_list(col("date")).alias("tmp"))

# get the max size of the array
size = df.groupBy("id").agg(collect_list(col("Date")).alias("tmp")).select(max(size("tmp"))).collect()[0][0]

# range(size + 1) yields one extra, all-null column (date_4 in the output below)
df1.select([df1.id] + [df1.tmp[i].alias("date_" + str(i + 1)) for i in range(size + 1)]) \
   .show()
# +---+------+------+------+------+
# | id|date_1|date_2|date_3|date_4|
# +---+------+------+------+------+
# |  1|20-Mar|30-Mar|20-Apr|  null|
# |  2|10-Mar|  null|  null|  null|
# +---+------+------+------+------+
```
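As an aside (not from the original answer): PySpark's built-in `groupBy().pivot()` combined with a window `row_number` can produce one column per position without pre-computing the array size. A rough sketch, assuming the same `df` with columns `ID` and `Date`:

```python
from pyspark.sql import Window
from pyspark.sql.functions import row_number, first, concat, lit

# number the dates within each ID, turn the number into a column label,
# then pivot the labels into columns
# (ordering is lexicographic on the raw date strings, as in the SQL answer)
w = Window.partitionBy("ID").orderBy("Date")
(df.withColumn("pos", concat(lit("Date_"), row_number().over(w).cast("string")))
   .groupBy("ID")
   .pivot("pos")
   .agg(first("Date"))
   .show())
```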


wqlqzqxt4#

- Load the Dataframe
- Group by `ID` and collect the dates into an array `DateArr`
- Run a select on `ID` and add the columns dynamically using `map` (available in Python + Scala)

If you want your code to be more independent of the number of columns, check Solution 2.
```scala
import org.apache.spark.sql.functions._

object ArrayToColumns {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess   // helper that returns the SparkSession

    import spark.implicits._

    // sample data from the question
    val df = List(
      (1, "20-Mar"), (1, "30-Mar"), (1, "20-Apr"),
      (2, "10-Mar"), (2, "12-Mar"),
      (3, "20-Mar"),
      (4, "20-Mar"), (4, "09-Mar")
    ).toDF("ID", "Date")

    df.groupBy("ID")
      .agg(collect_list("Date").as("DateArr"))
      .select(col("ID") +: (0 until 4).map(i => coalesce(col("DateArr")(i), lit("")).alias(s"Date_${i + 1}")): _*)
      .show()
  }
}
```

Solution 2

val dfNew = df.groupBy("ID")
.agg(collect_list("Date").as("DateArr"))

val maxArraySize : Int = dfNew.select(max(size(col("DateArr")).as("ArraySize"))).head().getInt(0)

dfNew
  .select(col("ID") +: (0 until maxArraySize).map(i => coalesce(col("DateArr")(i),lit("")).alias(s"Date_${i+1}")): _*)
  .show()
