用Pyspark中的比例替换缺失值

bnl4lu3b  于 2023-04-19  发布在  Spark
关注(0)|答案(1)|浏览(135)

我必须将df列Type的缺失值替换为80%的“R”值和20%的“NR”值,因此必须将16个缺失值替换为“R”值,将4个替换为“NR”值
| Id_a|国家|类型|
| --------------|--------------|--------------|
| a1|恩加尔恩||
| a2|恩加尔恩||
| b1|恩加尔恩||
| b2|恩加尔恩||
| c1|恩加尔恩||
| C2|恩加尔恩||
| Er3|恩加尔恩||
| PO9|恩加尔恩||
| EE4|恩加尔恩||
| E4|恩加尔恩||
| T5|恩加尔恩||
| u8|恩加尔恩||
| r4|恩加尔恩||
| zx1|恩加尔恩||
| 11d|恩加尔恩||
| 二十二|恩加尔恩||
| 2p|恩加尔恩||
| 3jk|恩加尔恩||
| 56小时|恩加尔恩||
| A78|恩加尔恩||
我的想法是创建一个这样的计数器,并为前16行截肢'R'和最后4截肢'NR',任何建议如何做到这一点?

window = Window.orderBy('Id_a')
Df= Df.withColumn('Counter',row_number().over(window))
Id_a国家类型计数器
a1恩加尔恩1
a2恩加尔恩
b1恩加尔恩
b2恩加尔恩
c1恩加尔恩
C2恩加尔恩
Er3恩加尔恩
PO9恩加尔恩
EE4恩加尔恩
E4恩加尔恩10个
T5恩加尔恩十一
u8恩加尔恩十二岁
r4恩加尔恩十三
zx1恩加尔恩十四岁
11d恩加尔恩十五岁
二十二恩加尔恩十六岁
2p恩加尔恩十七岁
3jk恩加尔恩十八岁
56小时恩加尔恩十九岁
A78恩加尔恩二十

重要的是创建一种方法,下次如何复制这种情况,例如下次我可能必须替换70% R 30% NR,或15% R和85% NR等。

lzfw57am

lzfw57am1#

首先,你可以创建两个 Dataframe ,一个有空值,另一个没有空值,然后在空值的 Dataframe 上,你可以使用apache spark中的randomSplit函数将其拆分为2个 Dataframe ,使用你指定的比率,最后你可以将3个 Dataframe 合并以获得想要的结果:

spark = SparkSession.builder.appName("DateDataFrame").getOrCreate()
data = [
    ("a1", "Engalnd", None),
    ("a2", "Engalnd", None),
    ("b1", "Engalnd", None),
    ("b2", "Engalnd", None),
    ("c1", "Engalnd", None),
    ("c2", "Engalnd", None),
    ("er3", "Engalnd", None),
    ("po9", "Engalnd", None),
    ("ee4", "Engalnd", None),
    ("e4", "Engalnd", None),
    ("t5", "Engalnd", None),
    ("u8", "Engalnd", None),
    ("r4", "Engalnd", None),
    ("zx1", "Engalnd", None),
    ("11d", "Engalnd", None),
    ("22", "Engalnd", None),
    ("2p", "Engalnd", None),
    ("3jk", "Engalnd", None),
    ("56h", "Engalnd", None),
    ("a78", "Engalnd", None),
    ("xxx", "Engalnd", "value1"),
    ("zzz", "Engalnd", "value2"),
]
df = spark.createDataFrame(data, ['Id_a', 'Country', 'Type'])

missingTypeDf = df.filter(col("Type").isNull())
notMissingTypeDf = df.filter(~col("Type").isNull())

from pyspark.sql.functions import rand
fractions = [0.8, 0.2]

# Split the DataFrame
chunkDf1, chunkDf2 = missingTypeDf.randomSplit(fractions, seed = 13)

chunkDf1 = chunkDf1.withColumn("Type", lit("R"))
chunkDf2 = chunkDf2.withColumn("Type", lit("NR"))

resultDf = notMissingTypeDf.unionByName(chunkDf1).unionByName(chunkDf2)

resultDf.show(200, truncate=False)

+----+-------+------+
|Id_a|Country|Type  |
+----+-------+------+
|xxx |Engalnd|value1|
|zzz |Engalnd|value2|
|a1  |Engalnd|R     |
|a2  |Engalnd|R     |
|b1  |Engalnd|R     |
|b2  |Engalnd|R     |
|c1  |Engalnd|R     |
|er3 |Engalnd|R     |
|po9 |Engalnd|R     |
|ee4 |Engalnd|R     |
|e4  |Engalnd|R     |
|t5  |Engalnd|R     |
|u8  |Engalnd|R     |
|r4  |Engalnd|R     |
|zx1 |Engalnd|R     |
|22  |Engalnd|R     |
|2p  |Engalnd|R     |
|3jk |Engalnd|R     |
|c2  |Engalnd|NR    |
|11d |Engalnd|NR    |
|56h |Engalnd|NR    |
|a78 |Engalnd|NR    |
+----+-------+------+

相关问题