如何在PySpark中为嵌套的JSON列创建模式?

lndjwyie  于 2024-01-06  发布在  Spark
关注(0)|答案(1)|浏览(153)

我有一个多列的parquet文件,其中有2列是JSON/Struct,但它们的类型是string。可以有任何数量的array_elements。

  1. {
  2. "addressline": [
  3. {
  4. "array_element": "F748DK’8U1P9’2ZLKXE"
  5. },
  6. {
  7. "array_element": "’O’P0BQ04M-"
  8. },
  9. {
  10. "array_element": "’fvrvrWEM-"
  11. }
  12. ],
  13. "telephone": [
  14. {
  15. "array_element": {
  16. "locationtype": "8.PLT",
  17. "countrycode": null,
  18. "phonenumber": "000000000",
  19. "phonetechtype": "1.PTT",
  20. "countryaccesscode": null,
  21. "phoneremark": null
  22. }
  23. }
  24. ]
  25. }

字符串
如何在PySpark中创建一个模式来处理这些列?

ru9i0ody

ru9i0ody1#

把你提供的例子当作字符串,我创建了这个字符串:

  1. from pyspark.sql import functions as F, types as T
  2. df = spark.createDataFrame([('{"addressline":[{"array_element":"F748DK’8U1P9’2ZLKXE"},{"array_element":"’O’P0BQ04M-"},{"array_element":"’fvrvrWEM-"}],"telephone":[{"array_element":{"locationtype":"8.PLT","countrycode":null,"phonenumber":"000000000","phonetechtype":"1.PTT","countryaccesscode":null,"phoneremark":null}}]}',)], ['c1'])

字符串
这是要应用于此列的架构:

  1. schema = T.StructType([
  2. T.StructField('addressline', T.ArrayType(T.StructType([
  3. T.StructField('array_element', T.StringType())
  4. ]))),
  5. T.StructField('telephone', T.ArrayType(T.StructType([
  6. T.StructField('array_element', T.StructType([
  7. T.StructField('locationtype', T.StringType()),
  8. T.StructField('countrycode', T.StringType()),
  9. T.StructField('phonenumber', T.StringType()),
  10. T.StructField('phonetechtype', T.StringType()),
  11. T.StructField('countryaccesscode', T.StringType()),
  12. T.StructField('phoneremark', T.StringType()),
  13. ]))
  14. ])))
  15. ])


from_json函数提供模式的结果:

  1. df = df.withColumn('c1', F.from_json('c1', schema))
  2. df.show()
  3. # +-------------------------------------------------------------------------------------------------------+
  4. # |c1 |
  5. # +-------------------------------------------------------------------------------------------------------+
  6. # |{[{F748DK’8U1P9’2ZLKXE}, {’O’P0BQ04M-}, {’fvrvrWEM-}], [{{8.PLT, null, 000000000, 1.PTT, null, null}}]}|
  7. # +-------------------------------------------------------------------------------------------------------+
  8. df.printSchema()
  9. # root
  10. # |-- c1: struct (nullable = true)
  11. # | |-- addressline: array (nullable = true)
  12. # | | |-- element: struct (containsNull = true)
  13. # | | | |-- array_element: string (nullable = true)
  14. # | |-- telephone: array (nullable = true)
  15. # | | |-- element: struct (containsNull = true)
  16. # | | | |-- array_element: struct (nullable = true)
  17. # | | | | |-- locationtype: string (nullable = true)
  18. # | | | | |-- countrycode: string (nullable = true)
  19. # | | | | |-- phonenumber: string (nullable = true)
  20. # | | | | |-- phonetechtype: string (nullable = true)
  21. # | | | | |-- countryaccesscode: string (nullable = true)
  22. # | | | | |-- phoneremark: string (nullable = true)

展开查看全部

相关问题