pyspark: Is it possible to create a dynamic number of DataFrames based on non-null values?

Posted by kyxcudwk on 2021-07-13 in Spark

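I have a PySpark DataFrame:

Name   Age   UserName  Password
Joe    34    Null      Null
Alice  21    Null      Null
Null   Null  User1     Pass1
Null   Null  User2     Pass2

From the DataFrame above, I want to create two DataFrames like the ones below, by somehow detecting which columns hold null values:

Name   Age
Joe    34
Alice  21

UserName  Password
User1     Pass1
User2     Pass2

Is there a way to do this?

Sample JSON files in the "source" directory: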

{
 "name": "joe",
 "age": 34
}

{
 "name": "alice",
 "age": 21
}

{
 "username": "user1",
 "password": "pass1"
}

{
 "username": "user2",
 "password": "pass2"
}

Code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("Test")
spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

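# Note: by default Spark expects one JSON object per line; if each file holds a
# pretty-printed object spanning several lines, pass multiLine=True as well.
json_data = spark.read.json("source")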

Answer #1, by rm5edbpk

If you always have the same fixed set of columns, I would simply cover all the cases:

import pyspark.sql.functions as f

# df is the DataFrame loaded in the question (json_data there).

# name, age and username present; password missing
df2 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNull())

# name and age present; username and password missing
df3 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNull() & f.col("password").isNull())

# only name present
df4 = df.where(f.col("name").isNotNull() & f.col("age").isNull() & f.col("username").isNull() & f.col("password").isNull())

# only name missing
df5 = df.where(f.col("name").isNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNotNull())

# username and password present; name and age missing
df6 = df.where(f.col("name").isNull() & f.col("age").isNull() & f.col("username").isNotNull() & f.col("password").isNotNull())

# ... and so on for the remaining combinations

Answer #2, by h79rfbju

You can simply use select + dropna():

df1 = df.select("name", "age").dropna()

df1.show()
# +-----+---+
# | name|age|
# +-----+---+
# |  joe| 34|
# |alice| 21|
# +-----+---+

df2 = df.select("username", "password").dropna()

df2.show()
# +--------+--------+
# |username|password|
# +--------+--------+
# |   user1|   pass1|
# |   user2|   pass2|
# +--------+--------+
