Applying withColumn to a PySpark array column

mfuanj7w · asked 2021-05-27 · in Spark

Here is my code:

    from pyspark.sql import *

    department1 = Row(id='123456', name='Computer Science')
    department2 = Row(id='789012', name='Mechanical Engineering')
    Employee = Row("firstName", "lastName", "email", "salary")
    employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
    employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)
    departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
    departmentWithEmployees2 = Row(department=department2, employees=[employee1, employee2])
    departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
    df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

I want to concatenate firstName and lastName inside the array.

    from pyspark.sql import functions as sf

    df2 = df1.withColumn("employees.FullName",
                         sf.concat(sf.col('employees.firstName'), sf.col('employees.lastName')))
    df2.printSchema()

    root
     |-- department: struct (nullable = true)
     |    |-- id: string (nullable = true)
     |    |-- name: string (nullable = true)
     |-- employees: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- firstName: string (nullable = true)
     |    |    |-- lastName: string (nullable = true)
     |    |    |-- email: string (nullable = true)
     |    |    |-- salary: long (nullable = true)
     |-- employees.FullName: array (nullable = true)
     |    |-- element: string (containsNull = true)

My new column FullName ends up at the parent (root) level. How can I put it inside the array, like this:

    root
     |-- department: struct (nullable = true)
     |    |-- id: string (nullable = true)
     |    |-- name: string (nullable = true)
     |-- employees: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- firstName: string (nullable = true)
     |    |    |-- lastName: string (nullable = true)
     |    |    |-- email: string (nullable = true)
     |    |    |-- salary: long (nullable = true)
     |    |    |-- FullName: string (nullable = true)

7nbnzgx9 · answer 1#

One approach is to explode the array with inline_outer, build the full name with concat_ws, and then reassemble the employees column with array and struct.

    from pyspark.sql import functions as F

    df1.selectExpr("department", "inline_outer(employees)")\
       .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName"))\
       .select("department",
               F.array(F.struct(*[F.col(x).alias(x) for x in
                       ['firstName', 'lastName', 'email', 'salary', 'FullName']]))
                .alias("employees"))\
       .printSchema()

    # root
    #  |-- department: struct (nullable = true)
    #  |    |-- id: string (nullable = true)
    #  |    |-- name: string (nullable = true)
    #  |-- employees: array (nullable = false)
    #  |    |-- element: struct (containsNull = false)
    #  |    |    |-- firstName: string (nullable = true)
    #  |    |    |-- lastName: string (nullable = true)
    #  |    |    |-- email: string (nullable = true)
    #  |    |    |-- salary: long (nullable = true)
    #  |    |    |-- FullName: string (nullable = false)
