Reading a CSV file containing a struct-type column in Spark with Java

slhcrj9b · asked 2021-05-27 · Hadoop

I am writing a test case for a program. For that, I am reading a CSV file containing data in the following format:

account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}

There are several hundred rows like this.

I need to read the second column as a struct, but I get the error struct type expected, string type found. I tried casting to StructType and then got the error "StringType cannot be cast to StructType".

Should I change my CSV? What else can I do?
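
For reference, a minimal sketch (the file path is a placeholder) of reading such a file so that the JSON column arrives as a plain string, ready for the from_json() approach shown in the answers below. Note that a standard CSV parser splits on the unquoted commas inside the JSON, so the field would need to be quoted in the file, or a non-comma delimiter used:

import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Read the second column as a raw string; "/path/to/data.csv" is hypothetical.
// The JSON field must be quoted in the file, or set a different separator with
// .option("delimiter", "|"), so its internal commas are not treated as splits.
val raw = spark.read
  .option("header", "true")
  .schema(new StructType()
    .add("account_number", LongType)
    .add("struct_data", StringType))
  .csv("/path/to/data.csv")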


fd3cxomn · answer #1

If all of the JSON records share the same schema, you can define that schema and use Spark's from_json() to do the job.

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

val df = Seq(
  (123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
  (987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")

// $"key1".string uses the ColumnName DSL: it yields a StructField of StringType
val schema = new StructType()
  .add($"key1".string)
  .add($"key2".string)
  .add($"keyn".string)

val df2 = df.withColumn("st", from_json($"struct_data", schema))

df2.printSchema
df2.show(false)

This snippet produces the following output:

root
 |-- account_number: integer (nullable = false)
 |-- struct_data: string (nullable = true)
 |-- st: struct (nullable = true)
 |    |-- key1: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |    |-- keyn: string (nullable = true)

+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data                                        |st                      |
+--------------+---------------------------------------------------+------------------------+
|123456789     |{"key1":"value","key2":"value2","keyn":"valuen"}   |[value,value2,valuen]   |
|987678909     |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
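
Once the JSON is parsed into the st struct, individual keys can be pulled out with dot notation; a short follow-up on the df2 frame above:

// Select individual fields from the parsed struct column
df2.select($"account_number", $"st.key1", $"st.key2").show(false)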

ijxebb2r · answer #2

Here is my solution in Scala Spark; it may offer some insight for your case.

scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}

scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]

scala> erdf.show()
+-------+-----------------+
| actNum|          strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+

scala> erdf.printSchema
root
 |-- actNum: string (nullable = true)
 |-- strType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key1: string (nullable = true)
 |    |    |-- key2: string (nullable = true)
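
Since strType comes back as an array of structs here, a possible follow-up on erdf (reusing the same explode pattern as above) to reach the individual keys:

scala> erdf.select($"actNum", explode($"strType").as("s")).select($"actNum", $"s.key1", $"s.key2").show()
+-------+------+------+
| actNum|  key1|  key2|
+-------+------+------+
|1234123|value1|value2|
+-------+------+------+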
