How can I convert tabular data into sentences or another readable format using PySpark?

3xiyfsfu  posted on 2021-06-10  in Cassandra

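My data is in the tabular format shown in the image. How can I convert it into a readable format, so that each row reads something like "member_id belongs to region", and so on for the other columns?
Can anyone help me write a function that converts tabular data into readable sentences?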

bfrts1fy #1

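You can add a new column named "sentence" as shown below, building it with the concat function. I also write the resulting DataFrame to a file, in case you want the sentences as a CSV file.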

>>> from pyspark.sql.functions import col, concat, lit, trim
>>> df.show()
+-----+---------+---+----+
|fname|    lname|age|dept|
+-----+---------+---+----+
| Jack|  Felice | 25|  IT|
| Mike| Gilbert | 30|  CS|
| John|     Shen| 45|  DR|
+-----+---------+---+----+

>>> # trim() strips stray whitespace from the names; lit(" is ") keeps a space on both sides of "is".
>>> df1 = df.withColumn(
...     "sentence",
...     concat(
...         trim(col("fname")), lit(" "), trim(col("lname")), lit(" is "),
...         col("age"), lit(" years old and he works in the "),
...         col("dept"), lit(" department."),
...     ),
... ).select("sentence")
>>> df1.show(10, truncate=False)
+---------------------------------------------------------------+
|sentence                                                       |
+---------------------------------------------------------------+
|Jack Felice is 25 years old and he works in the IT department. |
|Mike Gilbert is 30 years old and he works in the CS department.|
|John Shen is 45 years old and he works in the DR department.   |
+---------------------------------------------------------------+

>>> df1.write.format("csv").option("header", "true").save("/out/")

[Image: CSV output]
