Python DataFrame collect() function

f2uvfpb9 posted on 2021-07-12 in Spark

I am running into a very strange problem when using the collect() function:

data = df.select("node_id", "bin", "type", "jsonObj").collect()

jsonObj looks like this:

[
 {
   "id" : 1,
   "name" : "hello"
 },
 {
   "id" : 2,
   "name" : "world"
 }
]

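Now, when I iterate over the list returned by collect() and print row["jsonObj"], each JSON object comes back wrapped in a string rather than as a plain JSON object: every object in the array is surrounded by single quotes ('). The problem is that when I try to write it to a file, it becomes an array of strings instead of an array of JSON objects: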

['{
   "id" : 1,
   "name" : "hello"
 }',
 '{
   "id" : 2,
   "name" : "world"
 }'
]

Is anyone else facing the same problem? I just want to store jsonObj in the file as-is, not as strings.

node_id  bin  type   jsonObj
1        a    type1  [{"id": 11, "name": "hello"}, {"id": 12, "name": "world"}]

root
 |-- node_id: long (nullable = true)
 |-- bin: string (nullable = true)
 |-- type: string (nullable = true)
 |-- jsonObj: array (nullable = true)
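
The quoted output above is what Python prints for a list whose elements are str values rather than dicts. A minimal sketch of the symptom without Spark, assuming each array element holds one JSON document as text (the name collected_value is hypothetical and stands in for row["jsonObj"]):

```python
import json

# Hypothetical stand-in for row["jsonObj"] after collect(): a list of str
collected_value = ['{"id": 1, "name": "hello"}', '{"id": 2, "name": "world"}']

# Serializing the list directly keeps every element a JSON string,
# so the inner quotes are escaped instead of producing JSON objects.
as_strings = json.dumps(collected_value)

# Reading it back confirms the elements are still strings, not dicts
round_trip = json.loads(as_strings)
element_types = {type(item).__name__ for item in round_trip}

print(as_strings)
print(element_types)
```

If that assumption holds, the strings have to be parsed somewhere, either in Spark or on the driver, before the file will contain real JSON objects.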

JSON python apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/66570611/python-dataframe-collect-function

1 answer

np8igboo #1

You can use from_json to parse the column into an array of structs before writing:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

df2 = df.withColumn(
    "jsonObj",
    F.from_json(
        F.col('jsonObj').cast('string'), 
        ArrayType(StructType([StructField('id', IntegerType()), StructField('name', StringType())]))
    )
)

df2.show(truncate=False)
+-------+---+-----+--------------------------+
|node_id|bin|type |jsonObj                   |
+-------+---+-----+--------------------------+
|1      |a  |type1|[[11, hello], [12, world]]|
+-------+---+-----+--------------------------+

df2.write.json('filepath')

The written output should look like this:

{"node_id":"1","bin":"a","type":"type1","jsonObj":[{"id":11,"name":"hello"},{"id":12,"name":"world"}]}
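
If the rows do have to pass through collect() on the driver, a similar result can be approximated by parsing each element before writing. This is only a minimal sketch, assuming each array element is a JSON string; parse_json_elements is a hypothetical helper, not part of PySpark, and the sample row is a plain dict standing in for a collected Row:

```python
import json


def parse_json_elements(elements):
    """Turn a list of JSON strings into a list of parsed Python objects."""
    # Elements that are already parsed (dict/list) pass through unchanged
    return [json.loads(e) if isinstance(e, str) else e for e in elements]


# Hypothetical stand-in for one collected row (assumed values)
row = {
    "node_id": 1,
    "bin": "a",
    "type": "type1",
    "jsonObj": ['{"id": 11, "name": "hello"}', '{"id": 12, "name": "world"}'],
}

# Replace the string array with parsed objects, then serialize one JSON line
record = dict(row, jsonObj=parse_json_elements(row["jsonObj"]))
line = json.dumps(record, separators=(",", ":"))
print(line)
```

With a real Row, row.asDict() gives an equivalent dict. For large DataFrames the from_json approach is likely the better fit, since it keeps the parsing on the executors instead of pulling every row to the driver.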

Posted 2021-07-12
