Applying withColumn on a PySpark array column

mfuanj7w asked on 2021-05-27 in Spark

Here is my code:

from pyspark.sql import Row, SparkSession

# spark is predefined in the pyspark shell; create it explicitly otherwise
spark = SparkSession.builder.getOrCreate()

department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')

# Row factory: builds Row objects with the given field names
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee1, employee2])

departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

I want to concatenate firstName and lastName inside the array:

from pyspark.sql import functions as sf

df2 = df1.withColumn(
    "employees.FullName",
    sf.concat(sf.col('employees.firstName'), sf.col('employees.lastName')),
)
df2.printSchema()

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |-- employees.FullName: array (nullable = true)
 |    |-- element: string (containsNull = true)

My new FullName column ends up at the parent (top) level. How can I get it inside the array, so that the schema looks like this?

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |    |    |-- FullName: string (nullable = true)

7nbnzgx9 1#

One way is to use inline_outer to explode the employees array, concat_ws to build the full name, and then repack the columns with array and struct.

from pyspark.sql import functions as F

cols = ['firstName', 'lastName', 'email', 'salary', 'FullName']

(df1.selectExpr("department", "inline_outer(employees)")
    .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName"))
    .select("department",
            F.array(F.struct(*[F.col(x).alias(x) for x in cols])).alias("employees"))
    .printSchema())

# root
#  |-- department: struct (nullable = true)
#  |    |-- id: string (nullable = true)
#  |    |-- name: string (nullable = true)
#  |-- employees: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- firstName: string (nullable = true)
#  |    |    |-- lastName: string (nullable = true)
#  |    |    |-- email: string (nullable = true)
#  |    |    |-- salary: long (nullable = true)
#  |    |    |-- FullName: string (nullable = false)
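
Note that inline_outer explodes the array, so each employee lands on its own row and the rebuilt array holds a single struct per row; you would need to group by department and collect the structs back to restore the original two-element arrays. Alternatively, a higher-order function can rewrite each element in place and keep the array shape intact. A minimal sketch, assuming Spark 3.1+ for the Python transform API (on Spark 2.4+ the same idea can be written with F.expr and a SQL lambda):

from pyspark.sql import functions as F

# Rewrite each array element in place, appending a FullName field.
# transform() keeps the per-row array shape, so no explode/regroup is needed.
df2 = df1.withColumn(
    "employees",
    F.transform(
        "employees",
        lambda e: F.struct(
            e["firstName"].alias("firstName"),
            e["lastName"].alias("lastName"),
            e["email"].alias("email"),
            e["salary"].alias("salary"),
            F.concat_ws(" ", e["firstName"], e["lastName"]).alias("FullName"),
        ),
    ),
)
df2.printSchema()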
