Merging all rows into a single row

falq053o  posted 2021-05-29  in  Spark
Follow (0) | Answers (4) | Views (369)

I have the following DataFrame:


# +--------------+--------------+--------+---------+---------+
# |PROVINCE_STATE|COUNTRY_REGION|2/1/2020|2/10/2020|2/11/2020|
# +--------------+--------------+--------+---------+---------+
# |             -|     Australia|      12|       15|       15|
# +--------------+--------------+--------+---------+---------+

I need to merge all the rows into one, with each date column holding the sum per country/region. The problem is that I have many more columns than shown and I don't know how to do this dynamically. I tried groupBy but it still didn't work. Thanks.


63lcw9qa1#

Use aggregation:

select '-' as province_state, country_region,
       sum(`2/1/2020`), sum(`2/10/2020`), sum(`2/11/2020`)
from t
group by country_region;

I don't know what you mean by "dynamically." As a SQL query, each expression has to be listed individually.
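
If "dynamically" means the date columns change between loads, one possible workaround (a sketch of mine, not part of the answer above) is to generate that same SQL text from the DataFrame's column list in PySpark, assuming the DataFrame is registered as a temp view named `t` and that its first two columns are the province/state and country/region:

# Minimal sketch: build the aggregation SQL from df.columns; every column
# after the first two is assumed to be a date column to be summed.
df.createOrReplaceTempView("t")

sum_exprs = ", ".join("sum(`{c}`) as `{c}`".format(c=c) for c in df.columns[2:])
query = ("select '-' as province_state, country_region, "
         + sum_exprs
         + " from t group by country_region")

spark.sql(query).show()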


qeeaahzv2#

Try this in PySpark. One approach is to use window functions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([
    ("new south wales", "aus", 4, 4, 4), ("victoria", "aus", 4, 4, 4),
    ("queensland", "aus", 3, 5, 5), ("south australia", "aus", 1, 2, 2)
]).toDF(["province_state", "country_region", "2/1/2020", "2/10/2020", "2/11/2020"])

df.show()
#
# +---------------+--------------+--------+---------+---------+
# | province_state|country_region|2/1/2020|2/10/2020|2/11/2020|
# +---------------+--------------+--------+---------+---------+
# |new south wales|           aus|       4|        4|        4|
# |       victoria|           aus|       4|        4|        4|
# |     queensland|           aus|       3|        5|        5|
# |south australia|           aus|       1|        2|        2|
# +---------------+--------------+--------+---------+---------+

# Window over each country, used to replace every date column with its per-country sum
w = Window().partitionBy('country_region')

# Ordered window, used only to number the rows so we can keep one row per country
w1 = Window().partitionBy('country_region').orderBy('country_region')

for column in df.columns:
    if column not in ['country_region', 'province_state']:
        df = df.withColumn(column, F.sum(column).over(w))

# Keep a single row per country_region
df1 = df.withColumn("r_no", F.row_number().over(w1)).where(F.col('r_no') == 1)

df1.select(F.lit('_').alias('province_state'),
           *[column for column in df1.columns if column not in ['province_state']]) \
   .drop(F.col('r_no')).show()

# +--------------+--------------+--------+---------+---------+
# |province_state|country_region|2/1/2020|2/10/2020|2/11/2020|
# +--------------+--------------+--------+---------+---------+
# |             _|           aus|      12|       15|       15|
# +--------------+--------------+--------+---------+---------+

ddhy6vgd3#

If your first two columns are always province/state and country/region, and the other n columns are dates, you can try the following (Scala):

import org.apache.spark.sql.functions._
val dateCols = df.columns.drop(2).map(c => sum(c).as(c)) // select all columns except first 2 and perform sum on each of them
df.groupBy('country_region).agg(dateCols.head,dateCols.tail:_*).show()

Python version:

import pyspark.sql.functions as f
dateCols = [f.sum(c).alias(c) for c in df.columns[2:]] # select all columns except the first 2, sum each of them, and keep the original names
df.groupBy('country_region').agg(*dateCols).show()

Output:

+--------------+--------+---------+---------+
|country_region|2/1/2020|2/10/2020|2/11/2020|
+--------------+--------+---------+---------+
|           aus|      12|       15|       15|
+--------------+--------+---------+---------+
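
If the result should also carry the PROVINCE_STATE placeholder from the question's expected output, a small follow-up (my addition, not part of the answer above) is to prepend a literal '-' column to the aggregated frame:

import pyspark.sql.functions as f

# Re-add province_state as a constant '-' so the shape matches the question's expected output
agg_df = df.groupBy('country_region').agg(*[f.sum(c).alias(c) for c in df.columns[2:]])
agg_df.select(f.lit('-').alias('province_state'), *agg_df.columns).show()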

mf98qq944#

Try this:

from pyspark.sql import functions as F
from dateutil.parser import parse

def is_date(string, fuzzy=False):
    """Return True if the string can be parsed as a date, False otherwise."""
    try:
        parse(string, fuzzy=fuzzy)
        return True
    except ValueError:
        return False

# Group on a literal '-' plus the country, and sum every column whose name parses as a date
df.groupBy(F.lit('-').alias("PROVINCE_STATE"), 'COUNTRY_REGION')\
  .agg(*(F.sum(x).cast('int').alias(x) for x in df.columns if is_date(x))).show()

# +--------------+--------------+--------+---------+---------+
# |PROVINCE_STATE|COUNTRY_REGION|2/1/2020|2/10/2020|2/11/2020|
# +--------------+--------------+--------+---------+---------+
# |             -|     Australia|      12|       15|       15|
# +--------------+--------------+--------+---------+---------+
