Counting zeros per row in PySpark

Posted by vyswwuz2 on 2023-08-02 in Spark

I want to count, for each row, the number of zeros in specific columns of a DataFrame in PySpark. I am using the code below:

selected_columns = Combined_Final.columns[-12:]
Combined_Final  = Combined_Final.withColumn("zero_count", sum([col(column) == lit(0) for column in selected_columns]))

I get the error below:

TypeError: Invalid argument, not a string or column:of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.


Please help.

wlsrxk51 #1

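In this case you can use the Spark higher-order function filter together with concat_ws to get the count.

Explanation:

  • split(concat_ws("",*df.columns),"") -> concatenates the values of all columns and splits the result into an array of single characters.
  • size(filter(temp, n -> n == '0')) -> uses the higher-order function filter to keep only the '0' characters, then counts them with size. Note that this counts '0' characters rather than zero-valued cells, so a value such as 10 also contributes one.

Example:

from pyspark.sql.functions import concat_ws, expr, split

# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([('ab0','c9','d0')],['i','j','k'])
df.withColumn("temp", split(concat_ws("", *df.columns), "")) \
  .withColumn("count_of_0", expr("size(filter(temp, n -> n == '0'))")) \
  .drop("temp") \
  .show(10, False)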
#+---+---+---+----------+
#|i  |j  |k  |count_of_0|
#+---+---+---+----------+
#|ab0|c9 |d0 |2         |
#+---+---+---+----------+
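The difference between counting '0' characters and counting zero-valued cells can be seen without Spark. The following is a plain-Python model of the expression above, offered as a rough sketch only: the function names are hypothetical, and details such as how concat_ws treats nulls are ignored.

```python
# Plain-Python model of split(concat_ws("", ...), "") followed by
# size(filter(..., n -> n == '0')). No Spark is needed; names are hypothetical.

def count_zero_chars(row):
    """Count '0' characters across all cells of a row (what the expression measures)."""
    joined = "".join(str(cell) for cell in row)  # models concat_ws("", ...)
    chars = list(joined)                         # models split(..., "")
    return len([c for c in chars if c == "0"])   # models size(filter(...))


def count_zero_cells(row):
    """Count cells whose value is exactly 0 (what the question asks for)."""
    return sum(1 for cell in row if cell == 0)


print(count_zero_chars(("ab0", "c9", "d0")))  # 2
print(count_zero_chars((10, 0, 500)))         # 4
print(count_zero_cells((10, 0, 500)))         # 1
```

For numeric columns the two counts can diverge, as the last two lines show, so the character-based approach fits best when every cell is a single digit or when zero characters are really what is wanted.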


hsgswve4 #2

Perhaps use a window?

from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.createDataFrame([(0, 64287, 0, 14114), (0, 14141, 2255, 5232)], ['a', 'b', 'c', 'd'])
selected_columns = ['a', 'c']

# define a single window spanning the whole DataFrame
df = df.withColumn('part', F.lit(1))
window = Window.partitionBy('part')

# helper: count the rows that satisfy a condition
count_with_condition = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

# replace each selected column with its zero count over the window
for col in selected_columns:
    df = df.withColumn(col, count_with_condition(F.col(col) == 0).over(window))

# every row now holds the same counts, so keep a single one
df = df.select(selected_columns).distinct()

Input:

+---+-----+----+-----+
|  a|    b|   c|    d|
+---+-----+----+-----+
|  0|64287|   0|14114|
|  0|14141|2255| 5232|
+---+-----+----+-----+


Output:

+---+---+
|  a|  c|
+---+---+
|  2|  1|
+---+---+
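Note that the output above is one zero count per column, taken over all rows, whereas the question's zero_count column is one count per row. A small plain-Python sketch on the same input makes the two readings explicit; it uses no Spark, and the variable names are illustrative only.

```python
# Same data as the input table above, modelled as plain dictionaries.
rows = [
    {"a": 0, "b": 64287, "c": 0, "d": 14114},
    {"a": 0, "b": 14141, "c": 2255, "d": 5232},
]
selected_columns = ["a", "c"]

# Per column: how many rows hold 0 in that column (what the window version computes).
zeros_per_column = {
    name: sum(1 for row in rows if row[name] == 0) for name in selected_columns
}

# Per row: how many of the selected columns hold 0 in that row (the question's zero_count).
zeros_per_row = [
    sum(1 for name in selected_columns if row[name] == 0) for row in rows
]

print(zeros_per_column)  # {'a': 2, 'c': 1}
print(zeros_per_row)     # [2, 1]
```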

gcuhipw9 #3

Create an array column from the selected columns and aggregate the number of zeros.

from pyspark.sql import functions as f

data = [
    (1, 1, 0, 1),
    (2, 2, 0, 2),
    (3, 0, 0, 0),
    (4, 0, 0, 5)
]

df = spark.createDataFrame(data, ['id', 'c1', 'c2', 'c3'])

selected_cols = df.columns[-3:]

df.withColumn('zero_count', f.aggregate(f.array(*selected_cols), f.lit(0), lambda acc, x: f.when(x == 0, acc + 1).otherwise(acc))) \
  .show(truncate=False)

+---+---+---+---+----------+
|id |c1 |c2 |c3 |zero_count|
+---+---+---+---+----------+
|1  |1  |0  |1  |1         |
|2  |2  |0  |2  |1         |
|3  |0  |0  |0  |3         |
|4  |0  |0  |5  |2         |
+---+---+---+---+----------+
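As for the TypeError in the question: it is consistent with Python's built-in sum having been shadowed by pyspark.sql.functions.sum, for example after from pyspark.sql.functions import *. This is an assumption, since the question does not show its imports. One way to avoid depending on which sum is in scope is to build the per-row count as a SQL expression string and pass it to expr. The helper below is hypothetical, not part of PySpark, and assumes column names that contain no backticks.

```python
def zero_count_sql(columns):
    """Build a Spark SQL expression string that counts zero-valued cells per row.

    Hypothetical helper, not part of PySpark. NULL cells are treated as non-zero.
    """
    if not columns:
        return "0"
    # One 0/1 term per column; the terms are added together by Spark at query time.
    terms = [f"CASE WHEN `{name}` = 0 THEN 1 ELSE 0 END" for name in columns]
    return " + ".join(terms)


print(zero_count_sql(["c1", "c2", "c3"]))

# Assumed usage with a DataFrame like the one above (requires PySpark):
# from pyspark.sql import functions as F
# df = df.withColumn("zero_count", F.expr(zero_count_sql(selected_cols)))
```

Because the addition happens inside the SQL expression, no Python-level sum is called at all. The CASE form also returns 0 for a NULL cell instead of turning the whole row's count into NULL.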
