I have a Spark DataFrame with columns id, date_from, and price. Example:
id date_from price
10000012 2021-08-12 19283.334
10000012 2021-05-16 4400.0
10000012 2021-06-08 5718.69
10000012 2021-07-09 15283.333
10000012 2021-07-02 9087.5
10000012 2021-07-04 15283.333
10000012 2021-06-22 9061.111
10000012 2021-06-26 9076.667
10000012 2021-06-27 9080.77
10000012 2021-07-10 15283.333
10000012 2021-08-14 19283.334
10000012 2021-05-09 4400.0
10000012 2021-05-12 4400.0
10000012 2021-06-17 9065.64
10000012 2021-05-19 4400.0
10000166 2021-05-06 5801.4287
10000166 2021-04-01 4954.375
10000166 2021-04-22 5173.7856
10000166 2021-06-27 12655.429
10000166 2021-02-23 5167.5
I want to compute the minimum price and the average price. For this I tried:
from pyspark.sql.functions import col, min, mean  # Spark SQL functions, not Python builtins

groupBy_id = ["id"]
aggregate = ["price"]
funs = [min, mean]
exprs = [f(col(c)) for f in funs for c in aggregate]
df = df.groupby(*groupBy_id).agg(*exprs)
And also:

from pyspark.sql.functions import min, avg

df = df.groupby("id").agg(min("price").alias("min(norm_price)"), avg("price").alias("avg(norm_price)"))
But some min(norm_price) values are greater than the corresponding avg(norm_price) value. Output:
id,min(norm_price),avg(norm_price)
10000012,11150.0,10287.276085889778
10000166,10370.761904761903,6082.360302835207
10000185,5054.642857142857,5424.533834586466
10000421,3990.0,3990.0
What am I doing wrong?
2 Answers

Answer 1 (tjjdgumg):
You need to make sure norm_price is of type double, not string. Otherwise min will return the minimum string rather than the minimum number.
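This answer's point is easy to verify in plain Python: string comparison is lexicographic, so the "smallest" string can be the largest number, which matches the question's output where min(norm_price) exceeds avg(norm_price).

```python
# With string values, min() compares character by character:
# "19283.334" < "4400.0" because "1" < "4".
prices = ["4400.0", "19283.334", "9087.5"]
print(min(prices))                    # "19283.334" -- the largest number!
print(min(float(p) for p in prices))  # 4400.0 -- correct numeric minimum
```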
Answer 2 (ijxebb2r):

I did something quite simple:

This gave me the result I wanted: