pyspark - How to get basic statistics (mean, min, max) along with quantiles (25%, 50%) for numerical columns in a single DataFrame

twh00eeo asked on 2021-05-29 in Spark

I have a Spark DataFrame:

spark_df = spark.createDataFrame(
    [(1, 7, 'foo'), 
     (2, 6, 'bar'),
     (3, 4, 'foo'),
     (4, 8, 'bar'),
     (5, 1, 'bar')
    ],
    ['v1', 'v2', 'id'] 
)

Expected output:

   id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0  bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1  foo  2.000000  5.5      1        4        some-value  some-value  some-value  some-value

So far I can get basic statistics such as mean, min, and max, but not the quantiles. I know this is easy to do in Pandas, but I haven't managed it in PySpark.
I am also aware of approxQuantile, but I can't combine the basic stats with the quantiles in PySpark.
So far I can get basic statistics such as mean and min by using agg. I also want the quantiles in the same DataFrame:

from pyspark.sql import functions as F

# Aggregations to apply to every numerical column
func = [F.mean, F.min]
NUMERICAL_FEATURE_LIST = ['v1', 'v2']
GROUP_BY_FIELDS = ['id']

# One aggregate expression per (function, column) pair
exp = [f(F.col(c)) for f in func for c in NUMERICAL_FEATURE_LIST]
df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*exp)
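Since percentile_approx is a built-in Spark SQL aggregate function, the expression list above can be extended so the quantiles land in the same aggregation — a minimal sketch, reusing the names defined above:

# Approximate quantiles via F.expr, aliased to match the expected output
quantile_exp = [
    F.expr(f"percentile_approx({c}, {q})").alias(f"{q}({c})")
    for q in (0.25, 0.5)
    for c in NUMERICAL_FEATURE_LIST
]
df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*(exp + quantile_exp))
df_fin.show()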

kgqe7b3p1#

Perhaps this helps:

val spark_df = Seq(
  (1, 7, "foo"),
  (2, 6, "bar"),
  (3, 4, "foo"),
  (4, 8, "bar"),
  (5, 1, "bar")
).toDF("v1", "v2", "id")

spark_df.show(false)
spark_df.printSchema()

// default = "count", "mean", "stddev", "min", "25%", "50%", "75%", "max"
spark_df.summary().show(false)

    /**
      * +---+---+---+
      * |v1 |v2 |id |
      * +---+---+---+
      * |1  |7  |foo|
      * |2  |6  |bar|
      * |3  |4  |foo|
      * |4  |8  |bar|
      * |5  |1  |bar|
      * +---+---+---+
      *
      * root
      * |-- v1: integer (nullable = false)
      * |-- v2: integer (nullable = false)
      * |-- id: string (nullable = true)
      *
      * +-------+------------------+------------------+----+
      * |summary|v1                |v2                |id  |
      * +-------+------------------+------------------+----+
      * |count  |5                 |5                 |5   |
      * |mean   |3.0               |5.2               |null|
      * |stddev |1.5811388300841898|2.7748873851023217|null|
      * |min    |1                 |1                 |bar |
      * |25%    |2                 |4                 |null|
      * |50%    |3                 |6                 |null|
      * |75%    |4                 |7                 |null|
      * |max    |5                 |8                 |foo |
      * +-------+------------------+------------------+----+
      */
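The same summary API exists in PySpark (since 2.3), so the spark_df from the question works directly — a quick sketch, limited to the statistics asked for:

# DataFrame.summary takes the names of the statistics to compute
spark_df.summary("mean", "min", "25%", "50%").show()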

If you need the output in the grouped format shown in the question, use the answer below.


ztigrdn82#

The describe method computes statistics such as mean, min, and max for the numeric columns of a DataFrame:
df.describe().show()


blmhpbnm3#

I think syntax like this is what you're looking for:

spark_df.createOrReplaceTempView("spark_table")
spark.sql("SELECT id, AVG(v1) AS avg_v1, AVG(v2) AS avg_v2, \
 MIN(v1) AS min_v1, MIN(v2) AS min_v2, \
 percentile_approx(v1, 0.25) AS p25_v1, percentile_approx(v2, 0.25) AS p25_v2, \
 percentile_approx(v1, 0.5) AS p50_v1, percentile_approx(v2, 0.5) AS p50_v2 \
 FROM spark_table GROUP BY id").show(5)

It helps to create aliases, since the unformatted column names are awkward to work with.
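The same query can also be written with the DataFrame API, keeping the aliases — a minimal sketch, assuming the spark_df from the question:

from pyspark.sql import functions as F

# DataFrame-API equivalent of the SQL above; percentile_approx is
# reached through F.expr so it also works on Spark versions before 3.1.
stats = spark_df.groupBy("id").agg(
    F.avg("v1").alias("avg_v1"), F.avg("v2").alias("avg_v2"),
    F.min("v1").alias("min_v1"), F.min("v2").alias("min_v2"),
    F.expr("percentile_approx(v1, 0.25)").alias("p25_v1"),
    F.expr("percentile_approx(v2, 0.25)").alias("p25_v2"),
    F.expr("percentile_approx(v1, 0.5)").alias("p50_v1"),
    F.expr("percentile_approx(v2, 0.5)").alias("p50_v2"),
)
stats.show()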
