pyspark: counting cells in a column based on a condition

7kjnsjlb posted on 2021-05-27 in Spark

Suppose I have this DataFrame...

from pyspark.sql.types import StructType, StructField, IntegerType

TEST_schema = StructType([StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
TEST_data = [(5,-1),(4,-1),(3,3),(2,2),(1,-1),(0,-1),(0,-1),(0,2),(0,-1)]
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()

+----+----+
|col1|col2|
+----+----+
|   5|  -1|
|   4|  -1|
|   3|   3|
|   2|   2|
|   1|  -1|
|   0|  -1|
|   0|  -1|
|   0|   2|
|   0|  -1|
+----+----+

What I want to do is count the number of -1 values in col2 starting from the row where col1 == 1. For the data above, that count should be 4.
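
To make the intent concrete, here is a plain-Python sketch of the same computation over TEST_data (assuming the rows keep the order in which they are listed); it is only meant to illustrate the expected result of 4:

TEST_data = [(5,-1),(4,-1),(3,3),(2,2),(1,-1),(0,-1),(0,-1),(0,2),(0,-1)]

# Index of the first row where col1 == 1.
start = next(i for i, (c1, _) in enumerate(TEST_data) if c1 == 1)

# Count the -1 values in col2 from that row onward: prints 4.
print(sum(1 for _, c2 in TEST_data[start:] if c2 == -1))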


s1ag04yj1#

This code might help you:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.window import Window

test_schema = StructType([StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
test_data = [(5,-1),(4,-1),(3,3),(2,2),(1,-1),(0,-1),(0,-1),(0,2),(0,-1)]
df = sqlContext.createDataFrame(test_data, test_schema)
df.show()

# Add a row number so the original insertion order can be referenced.
# Ordering by a constant keeps the rows as they arrive here, but without
# a real ordering column this is not guaranteed to be deterministic.
w = Window.orderBy(F.lit('A'))
df = df.withColumn("row_num", F.row_number().over(w))

# Window spanning from the current row down to the last row.
w1 = Window.orderBy('row_num').rowsBetween(Window.currentRow, Window.unboundedFollowing)

# For each row, count how many col2 values equal -1 from that row onward.
df.withColumn('count', F.count(F.when(df.col2 == -1, 1)).over(w1)).show()
'''
+----+----+-------+-----+
|col1|col2|row_num|count|
+----+----+-------+-----+
|   5|  -1|      1|    6|
|   4|  -1|      2|    5|
|   3|   3|      3|    4|
|   2|   2|      4|    4|
|   1|  -1|      5|    4|
|   0|  -1|      6|    3|
|   0|  -1|      7|    2|
|   0|   2|      8|    1|
|   0|  -1|      9|    1|
+----+----+-------+-----+
'''
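
If only the single number is needed (the 4 the question asks for), one way, sketched here as a follow-up to the code above, is to read the windowed count at the row where col1 == 1. This reuses df, w1, and F from the answer; the variable names result_df and answer are just illustrative:

# Take the windowed count at the row where col1 == 1; for this data it is 4.
result_df = df.withColumn('count', F.count(F.when(df.col2 == -1, 1)).over(w1))
answer = result_df.filter(result_df.col1 == 1).select('count').first()[0]
print(answer)  # 4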
