Hi, I have JSON data that I'm loading into PySpark; a sample is below.
{
"data": [
["row-r9pv-p86t.ifsp", "00000000-0000-0000-0838-60C2FFCC43AE", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "KINGS", "F", "11"],
["row-7v2v~88z5-44se", "00000000-0000-0000-C8FC-DDD3F9A72DFF", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "SUFFOLK", "F", "6"],
["row-hzc9-4kvv~mbc9", "00000000-0000-0000-562E-D9A0792557FC", 0, 1574264158, null, 1574264158, null, "{ }", "2007", "ZOEY", "MONROE", "F", "6"]
]
}
I'm trying to explode the nested arrays so that each record becomes one DataFrame row. My code looks like this:
from pyspark.sql.functions import explode

df = spark.read.json('data/rows.json', multiLine=True)
temp_df = df.select(explode("data").alias("data"))
temp_df.show(n=3, truncate=False)
Result:
+-----------------------------------------------------------------------------------------------------------------------+
|data |
+-----------------------------------------------------------------------------------------------------------------------+
|[row-r9pv-p86t.ifsp, 00000000-0000-0000-0838-60C2FFCC43AE, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, KINGS, F, 11] |
|[row-7v2v~88z5-44se, 00000000-0000-0000-C8FC-DDD3F9A72DFF, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, SUFFOLK, F, 6]|
|[row-hzc9-4kvv~mbc9, 00000000-0000-0000-562E-D9A0792557FC, 0, 1574264158,, 1574264158,, { }, 2007, ZOEY, MONROE, F, 6] |
+-----------------------------------------------------------------------------------------------------------------------+
from pyspark.sql.functions import flatten

temp_df.printSchema()
temp_df.show(5)
temp_df.select(flatten(temp_df.data)).show(n=10)
So far so good, but when I try the flatten method it fails with: cannot resolve 'flatten('data')' due to data type mismatch: The argument should be an array of arrays, but 'data' is of array<string> type. That makes sense, but I don't know how to flatten this array. Should I write a custom map function to turn each row's array into DataFrame columns?
2 Answers

fsi0uk1n1:
v2 (use with Column and concat):
vcirk6k62:
Answering my own question, in case it helps anyone else.
Read the source data from the file:
Result:
In the DataFrame above, each cell contains an array of strings, but what I need is each element in its own column with a specific data type.