Removing duplicate values from a PySpark DataFrame

Asked by zkure5ic on 2021-07-09 · Spark

I'm analyzing the Stack Overflow survey data. One of the columns, `DatabaseWorkedWith`, holds a semicolon-separated list of values, so I used `split` and then `explode` to create a new row for each value. But now the values of the other columns are duplicated across those rows. Is there a way to leave the other columns null instead of duplicating them?
Here is my code:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if __name__ == '__main__':

    session = SparkSession.builder.master("local[*]").appName("StackOverFlow Developer Survey").getOrCreate()

    options = {"header": "true", "inferSchema": "true"}
    initialDF = session.read.options(**options).csv("E:\\DataProject\\developer_survey_2020\\survey_results_public.csv")

    initialDF.show()

    demographicsDF = initialDF.select("Respondent", "Age", "Age1stCode", "Country", "EdLevel", "Gender", "Sexuality")

    # demographicsDF.show()

    databaseworkedwithDF = initialDF.select("Respondent", "DatabaseWorkedWith", "DatabaseDesireNextYear")

    # Split each semicolon-separated string into an array of values
    databaseworkedwithDF = databaseworkedwithDF.withColumn("DatabaseWorkedWith2", f.split("DatabaseWorkedWith", ";")) \
                            .withColumn("DatabaseDesireNextYear2", f.split("DatabaseDesireNextYear", ";"))

    # Explode each array into one row per element; this is what duplicates
    # the values of the other columns
    databaseworkedwithDF = databaseworkedwithDF.withColumn("DatabaseWorkedWith2", f.explode("DatabaseWorkedWith2")) \
                            .withColumn("DatabaseDesireNextYear2", f.explode("DatabaseDesireNextYear2"))

    databaseworkedwithDF.show(truncate=False)
```

And the output:

```
+----------+-----------------------------------------+----------------------+--------------------+-----------------------+
|Respondent|DatabaseWorkedWith                       |DatabaseDesireNextYear|DatabaseWorkedWith2 |DatabaseDesireNextYear2|
+----------+-----------------------------------------+----------------------+--------------------+-----------------------+
|1         |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server  |Elasticsearch       |Microsoft SQL Server   |
|1         |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server  |Microsoft SQL Server|Microsoft SQL Server   |
|1         |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server  |Oracle              |Microsoft SQL Server   |
|2         |NA                                       |NA                    |NA                  |NA                     |
|3         |NA                                       |NA                    |NA                  |NA                     |
|4         |NA                                       |NA                    |NA                  |NA                     |
+----------+-----------------------------------------+----------------------+--------------------+-----------------------+
```

`DatabaseDesireNextYear2` is simply repeated instead. Can it be null in the second and third rows?

liwlm1x9 · answer #1

You can use `posexplode` and null out the other columns wherever `pos` is not equal to 0:


```
# Example dataframe

df.show(truncate=False)
+----------+-----------------------------------------+----------------------+
|Respondent|DatabaseWorkedWith                       |DatabaseDesireNextYear|
+----------+-----------------------------------------+----------------------+
|1         |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server  |
+----------+-----------------------------------------+----------------------+

import pyspark.sql.functions as F

df2 = df.select(
    '*',
    F.posexplode(F.split('DatabaseWorkedWith', ';')).alias('pos', 'DatabaseWorkedWith2')
).select(
    # Keep the original columns only where pos == 0; when() without
    # otherwise() returns null for every other row
    *[F.when(F.col('pos') == 0, F.col(c)).alias(c) for c in df.columns],
    'DatabaseWorkedWith2'
)

df2.show(truncate=False)
+----------+-----------------------------------------+----------------------+--------------------+
|Respondent|DatabaseWorkedWith                       |DatabaseDesireNextYear|DatabaseWorkedWith2 |
+----------+-----------------------------------------+----------------------+--------------------+
|1         |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server  |Elasticsearch       |
|null      |null                                     |null                  |Microsoft SQL Server|
|null      |null                                     |null                  |Oracle              |
+----------+-----------------------------------------+----------------------+--------------------+
```
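
The answer above only explodes `DatabaseWorkedWith`, while the question also splits `DatabaseDesireNextYear`. Here is a minimal sketch of how the same trick could cover both columns, assuming `df` is the example dataframe above; the `pos1`, `pos2`, and `df3` names are just for illustration. Spark allows only one generator per `select`, so the two `posexplode` calls are chained:

```python
import pyspark.sql.functions as F

# Explode each column in its own select, keeping both position indexes
df2 = df.select(
    '*',
    F.posexplode(F.split('DatabaseWorkedWith', ';')).alias('pos1', 'DatabaseWorkedWith2')
).select(
    '*',
    F.posexplode(F.split('DatabaseDesireNextYear', ';')).alias('pos2', 'DatabaseDesireNextYear2')
)

df3 = df2.select(
    # The original columns survive only on the very first combination
    *[F.when((F.col('pos1') == 0) & (F.col('pos2') == 0), F.col(c)).alias(c) for c in df.columns],
    # Each exploded column is blanked after its own first occurrence
    F.when(F.col('pos2') == 0, F.col('DatabaseWorkedWith2')).alias('DatabaseWorkedWith2'),
    F.when(F.col('pos1') == 0, F.col('DatabaseDesireNextYear2')).alias('DatabaseDesireNextYear2'),
)

df3.show(truncate=False)
```

For the example row this keeps all original values plus `Elasticsearch` and `Microsoft SQL Server` on the first line, and on the second and third lines leaves only `Microsoft SQL Server` and `Oracle` in `DatabaseWorkedWith2` with nulls everywhere else, which matches the nulls asked for in the question.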
