Selecting a nested column from a PySpark DataFrame created with spark-xml

7y4bm7vi · asked 2021-05-29 · in Hadoop

I am trying to select a nested ArrayType column from a PySpark DataFrame.
I only want to select the items column from this DataFrame, but I cannot see what I am doing wrong here.
The XML:

    <?xml version="1.0" encoding="utf-8"?>
    <shiporder orderid="str1234">
      <orderperson>ABC</orderperson>
      <shipto>
        <name>XYZ</name>
        <address>305, Ram CHowk</address>
        <city>Pune</city>
        <country>IN</country>
      </shipto>
      <items>
        <item>
          <title>Clothing</title>
          <notes>
            <note>Brand:CK</note>
            <note>Size:L</note>
          </notes>
          <quantity>6</quantity>
          <price>208</price>
        </item>
      </items>
    </shiporder>
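
For context, such a DataFrame is typically produced by reading the XML with spark-xml. The snippet below is only a sketch of that step; the rowTag value and the file path are assumptions, since the read code was not shown in the question.

    # Sketch: read the XML with spark-xml (rowTag and path are assumed)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-xml-nested").getOrCreate()

    df = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "shiporder")
        .load("/path/to/shiporder.xml")
    )
    df.printSchema()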

Schema of the DataFrame:

    root
     |-- _orderid: string (nullable = true)
     |-- items: struct (nullable = true)
     |    |-- item: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- notes: struct (nullable = true)
     |    |    |    |    |-- note: array (nullable = true)
     |    |    |    |    |    |-- element: string (containsNull = true)
     |    |    |    |-- price: double (nullable = true)
     |    |    |    |-- quantity: long (nullable = true)
     |    |    |    |-- title: string (nullable = true)
     |-- orderperson: string (nullable = true)
     |-- shipto: struct (nullable = true)
     |    |-- address: string (nullable = true)
     |    |-- city: string (nullable = true)
     |    |-- country: string (nullable = true)
     |    |-- name: string (nullable = true)

    df.show(truncate=False)
    +--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
    |_orderid|items                                                                                        |orderperson  |shipto                         |
    +--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
    |str1234 |[[[[[color:Brown, Size:12]], 82.0, 1, Footwear], [[[Brand:CK, Size:L]], 208.0, 6, Clothing]]]|Vikrant Chand|[305, Giotto, Irvine, US, Amit]|
    +--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+

When I select the items column on its own, it returns null.

    df.select(['items']).show()
    +-----+
    |items|
    +-----+
    | null|
    +-----+

However, selecting the same column together with shipto (another nested column) works around the problem:

    df.select(['items','shipto']).show()
    +--------------------+--------------------+
    |               items|              shipto|
    +--------------------+--------------------+
    |[[[[[color:Brown,...|[305, Giotto, Irv...|
    +--------------------+--------------------+
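
Once the items column can actually be selected (for example via the workaround above, or after the fix mentioned in the answer below), the nested item array can be flattened with explode. This is only an illustrative sketch based on the schema shown earlier:

    # Sketch: one row per item, pulling out the nested fields
    from pyspark.sql import functions as F

    items_df = (
        df.select("_orderid", F.explode("items.item").alias("item"))
          .select(
              "_orderid",
              F.col("item.title").alias("title"),
              F.col("item.quantity").alias("quantity"),
              F.col("item.price").alias("price"),
              F.col("item.notes.note").alias("notes"),
          )
    )
    items_df.show(truncate=False)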

13z8s7eq · 1#

This is a bug in spark-xml that was fixed in 0.4.1; see issue #193.
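
If you pin the fixed version, it can be supplied through the packages configuration when building the session. A minimal sketch, assuming a Scala 2.11 build of Spark (adjust the artifact suffix to match yours):

    # Sketch: put a spark-xml release containing the fix on the classpath
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark-xml-fix")
        .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.4.1")
        .getOrCreate()
    )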
