I am analyzing Stack Overflow survey data. One of the columns, DatabaseWorkedWith,
is a semicolon-separated list of values, so I use split followed by explode to create a new row for each value. But now the values of the other columns are duplicated across those rows. Is there a way to avoid duplicating the other columns and put null in them instead?
Here is my code.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

if __name__ == '__main__':
    session = SparkSession.builder.master("local[*]").appName("StackOverFlow Developer Survey").getOrCreate()
    options = {"header": "true", "inferSchema": "true"}
    initialDF = session.read.options(**options).csv("E:\\DataProject\\developer_survey_2020\\survey_results_public.csv")
    initialDF.show()
    demographicsDF = initialDF.select("Respondent", "Age", "Age1stCode", "Country", "EdLevel", "Gender", "Sexuality")
    #demographicsDF.show()
    databaseworkedwithDF = initialDF.select("Respondent", "DatabaseWorkedWith", "DatabaseDesireNextYear")
    databaseworkedwithDF = databaseworkedwithDF.withColumn("DatabaseWorkedWith2", f.split("DatabaseWorkedWith", ";")) \
        .withColumn("DatabaseDesireNextYear2", f.split("DatabaseDesireNextYear", ";"))
    databaseworkedwithDF = databaseworkedwithDF.withColumn("DatabaseWorkedWith2", f.explode("DatabaseWorkedWith2")) \
        .withColumn("DatabaseDesireNextYear2", f.explode("DatabaseDesireNextYear2"))
    databaseworkedwithDF.show(truncate=False)
And the output:
+----------+-----------------------------------------+----------------------+--------------------+-----------------------+
|Respondent|DatabaseWorkedWith |DatabaseDesireNextYear|DatabaseWorkedWith2 |DatabaseDesireNextYear2|
+----------+-----------------------------------------+----------------------+--------------------+-----------------------+
|1 |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server |Elasticsearch |Microsoft SQL Server |
|1 |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server |Microsoft SQL Server|Microsoft SQL Server |
|1 |Elasticsearch;Microsoft SQL Server;Oracle|Microsoft SQL Server |Oracle |Microsoft SQL Server |
|2 |NA |NA |NA |NA |
|3 |NA |NA |NA |NA |
|4 |NA |NA |NA |NA |
`DatabaseDesireNextYear2` repeats the column value instead. Can the second and third rows be null?
1 Answer
You can use `posexplode` and null out the other columns when `pos` is not equal to 0: