I have the following PySpark DataFrame:
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# spark is available by default in the pyspark shell; create it otherwise.
spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('user', StringType(), True),
    StructField('created', IntegerType(), True),
    StructField('month_1', FloatType(), True),
    StructField('month_2', FloatType(), True),
    StructField('month_3', FloatType(), True),
    StructField('month_4', FloatType(), True),
])

data = [['tom', 2, np.nan, 1.0, 1.0, 1.0],
        ['nick', 1, 1.0, np.nan, np.nan, np.nan],
        ['jack', 3, np.nan, np.nan, 1.0, 1.0],
        ['jason', 2, np.nan, 1.0, 1.0, np.nan]]

df = spark.createDataFrame(data, schema)
df.show()
+-----+-------+-------+-------+-------+-------+
| user|created|month_1|month_2|month_3|month_4|
+-----+-------+-------+-------+-------+-------+
| tom| 2| NaN| 1.0| 1.0| 1.0|
| nick| 1| 1.0| NaN| NaN| NaN|
| jack| 3| NaN| NaN| 1.0| 1.0|
|jason| 2| NaN| 1.0| 1.0| NaN|
+-----+-------+-------+-------+-------+-------+
I want to fill the NaN values based on the value of the created column:
If the month column's number (the suffix of the month_ column name) is greater than or equal to created, the value should be 1.0.
If the month column's number is less than created, the value should be 0.0.
The desired output should be:
+-----+-------+-------+-------+-------+-------+
| user|created|month_1|month_2|month_3|month_4|
+-----+-------+-------+-------+-------+-------+
| tom| 2| 0.0| 1.0| 1.0| 1.0|
| nick| 1| 1.0| 1.0| 1.0| 1.0|
| jack| 3| 0.0| 0.0| 1.0| 1.0|
|jason| 2| 0.0| 1.0| 1.0| 1.0|
+-----+-------+-------+-------+-------+-------+
1 Answer
rekjcdws1#
You can replace the NaN values with nanvl and create the conditional value with when:
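The answer's code block did not survive extraction, so the following is a minimal sketch of that nanvl/when approach, assuming the month number is taken from the month_N column-name suffix and compared against created; the month_cols and filled names are illustrative:

from pyspark.sql import functions as F

# Month columns to fill; the month number is the numeric suffix of each name.
month_cols = ['month_1', 'month_2', 'month_3', 'month_4']

filled = df.select(
    'user',
    'created',
    *[
        # nanvl keeps the existing value where it is not NaN; where it is NaN
        # it falls back to the when expression: 1.0 if the month number is
        # >= created, otherwise 0.0 (cast to float to match the column type).
        F.nanvl(
            F.col(c),
            F.when(F.col('created') <= int(c.split('_')[1]), 1.0)
             .otherwise(0.0)
             .cast('float')
        ).alias(c)
        for c in month_cols
    ]
)
filled.show()

With the example DataFrame above, this reproduces the desired output.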