Exploding array values using PySpark

cczfrluj  posted on 2021-05-27 in Spark

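I am new to PySpark, and I need to explode my array of values so that each value ends up in its own new column. I tried using explode but could not get the desired output. Below is my current output: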

    +---------------+----------+------------------+----------+---------+------------+--------------------+
    |account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number|        transactions|
    +---------------+----------+------------------+----------+---------+------------+--------------------+
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[1000, 01/06/2020...|
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[1100, 02/06/2020...|
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[6146, 02/06/2020...|
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[253, 03/06/2020,...|
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[4521, 04/06/2020...|
    |         100000|     12345|             12345|       abc|      xyz|  1234567890|[955, 05/06/2020,...|
    +---------------+----------+------------------+----------+---------+------------+--------------------+

Below is the schema of the DataFrame:

    root
     |-- account_balance: long (nullable = true)
     |-- account_id: long (nullable = true)
     |-- credit_Card_Number: long (nullable = true)
     |-- first_name: string (nullable = true)
     |-- last_name: string (nullable = true)
     |-- phone_number: long (nullable = true)
     |-- transactions: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- amount: long (nullable = true)
     |    |    |-- date: string (nullable = true)
     |    |    |-- shop: string (nullable = true)
     |    |    |-- transaction_code: string (nullable = true)

I want an output that has additional amount, date, shop, and transaction_code columns holding their respective values:

    amount  date        shop    transaction_code
    1000    01/06/2020  amazon  buy
    1100    02/06/2020  amazon  sell
    6146    02/06/2020  ebay    buy
    253     03/06/2020  ebay    buy
    4521    04/06/2020  amazon  buy
    955     05/06/2020  amazon  buy

7kqas0il  #1

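Use explode, then split out the struct fields, and finally drop the newly exploded column and the transactions array column.

Example:

```python
# Only some of the JSON columns are loaded in this example.
df.printSchema()
# root
#  |-- account_balance: long (nullable = true)
#  |-- transactions: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- amount: long (nullable = true)
#  |    |    |-- date: string (nullable = true)

# explode() yields one row per array element in a column named "col" by default;
# "col.*" expands the struct fields into top-level columns, after which the
# helper column and the original array column are dropped.
(
    df.selectExpr("*", "explode(transactions)")
      .select("*", "col.*")
      .drop("col", "transactions")
      .show()
)
# +---------------+------+--------+
# |account_balance|amount|    date|
# +---------------+------+--------+
# |             10|  1000|20200202|
# +---------------+------+--------+
```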
