Create a new calculated column on groupby in PySpark

Posted by s2j5cfk0 on 2021-07-13 in Spark

I have the following DataFrame in PySpark, which has already been grouped by the "accountname" column.

accountname |   namespace   |   cost    |   cost_to_pay
account001  |   ns1         |   93      |   9
account001  |   Transversal |   93      |   25
account002  |   ns2         |   50      |   27
account002  |   Transversal |   50      |   12

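I need a new column holding "cost" - "cost_to_pay" from the row where "namespace" == "Transversal", and I need that result repeated in every row of the new column for the same account, like this: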

accountname |   namespace   |   cost    |   cost_to_pay |   new_column1                                         
account001  |   ns1         |   93      |   9           |   68                    
account001  |   Transversal |   93      |   25          |   68
account002  |   ns2         |   50      |   27          |   38
account002  |   Transversal |   50      |   12          |   38

68 is the result of 93 - 25 for account001, and 38 is the result of 50 - 12 for account002.
Any idea how I can achieve this?

python apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/66371159/create-a-new-calculated-column-on-groupby-in-pyspark

2 Answers

szqfcxe2 #1

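You can take the max of the masked difference over a window partitioned by accountname, which gives every row the difference for its account: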

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'new_column1',
    F.max(
        F.when(
            F.col('namespace') == 'Transversal',
            F.col('cost') - F.col('cost_to_pay')
        )
    ).over(Window.partitionBy('accountname'))
)

df2.show()
+-----------+-----------+----+-----------+-----------+
|accountname|  namespace|cost|cost_to_pay|new_column1|
+-----------+-----------+----+-----------+-----------+
| account001|        ns1|  93|          9|         68|
| account001|Transversal|  93|         25|         68|
| account002|        ns2|  50|         27|         38|
| account002|Transversal|  50|         12|         38|
+-----------+-----------+----+-----------+-----------+
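
This works because F.when without an otherwise() yields null for the rows where namespace is not 'Transversal', and max ignores nulls, so each partition's max is simply its 'Transversal' difference. A minimal plain-Python sketch of that logic (no Spark involved; None stands in for null, and the names masked_diff and per_account are made up for illustration):

```python
rows = [
    ("account001", "ns1", 93, 9),
    ("account001", "Transversal", 93, 25),
    ("account002", "ns2", 50, 27),
    ("account002", "Transversal", 50, 12),
]

def masked_diff(namespace, cost, cost_to_pay):
    # Models F.when(cond, value) with no otherwise(): non-matching rows give None.
    return cost - cost_to_pay if namespace == "Transversal" else None

# Models max(...).over(Window.partitionBy('accountname')):
# one max per account, with None values skipped.
per_account = {}
for accountname, namespace, cost, cost_to_pay in rows:
    value = masked_diff(namespace, cost, cost_to_pay)
    if value is not None:
        per_account[accountname] = max(value, per_account.get(accountname, value))

# Every row in a partition receives that partition's result.
result = [row + (per_account.get(row[0]),) for row in rows]
for row in result:
    print(row)
```

If an account had several 'Transversal' rows, max would keep the largest difference; an account with none would get null in the new column.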

2021-07-13

mqkwyuun #2

If df is your DataFrame after the groupby, you can build df_temp like this:

from pyspark.sql import functions as F

# keep only the 'Transversal' rows and compute the difference for each account
df_temp = df.filter(F.col('namespace') == 'Transversal')
df_temp = df_temp.withColumn('new_column1', F.col('cost') - F.col('cost_to_pay'))
df_temp = df_temp.select('accountname', 'new_column1')  # keep only relevant columns

# you might want some extra checks here, like dropping duplicates, etc.

# finally, join df_temp with your main dataframe df
df = df.join(df_temp, on='accountname', how='left')
df = df.na.fill({'new_column1': 0})  # if you wish to fill nulls with a predefined value, like 0

2021-07-13
