Check for a column name in a PySpark DataFrame when a schema is given

Posted by 5fjcxozz on 2022-11-16 in Apache


I have a schema structured as follows:

StructField('results', ArrayType(MapType(StringType(), StringType()), True), True), 
StructField('search_information', MapType(StringType(), StringType()), True), 
StructField('metadata', MapType(StringType(), StringType()), True), 
StructField('parameters', MapType(StringType(), StringType()), True), 
StructField('results_2', MapType(StringType(), StringType()), True),

These columns come from files, and any given file may or may not contain each of them. I read the JSON files as

df = spark.read.json(path, schema=schema)

I need to check whether certain columns exist and apply the necessary transformations.

if "metadata:" in df.schema.simpleString():

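The check above always returns True, because I have defined the schema myself. How can I check whether a column exists in the raw data of the file?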

apache-spark

Source: https://stackoverflow.com/questions/74362834/check-for-a-column-name-in-pyspark-dataframe-when-schema-is-given

1 Answer


3bygqnnd1#

You can read the file without specifying a schema:

df = spark.read.option('multiline', 'true').json('file_name.json')

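Then, to check whether a column exists, you can use one of the following approaches.
First, check the column names that Spark inferred from the raw data, for example through df.columns.
Another way is to use Python tools to check whether the key exists in the JSON: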

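import json

# Load the raw JSON text (assumes the file holds a single top-level object)
with open('file_name.json') as f:
    j = json.load(f)

if "metadata" in j:
    pass  # the key exists in the raw data; apply the transformation here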
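For the Spark-side check, a minimal sketch might look like the following. It relies on the fact that a DataFrame read without a schema only has columns for keys that actually appear in the data, whereas one read with an explicit schema always has every declared column (filled with nulls when absent). The names EXPECTED, missing_columns and read_with_presence are hypothetical helpers for illustration, not part of PySpark, and `spark` is assumed to be an existing SparkSession.

```python
# Column names taken from the schema in the question.
EXPECTED = ["results", "search_information", "metadata", "parameters", "results_2"]


def missing_columns(actual_columns, expected=EXPECTED):
    """Return the expected column names that are absent from actual_columns."""
    actual = set(actual_columns)
    return [name for name in expected if name not in actual]


def read_with_presence(spark, path):
    """Read a JSON file without a schema and report which expected columns are missing.

    `spark` is assumed to be an active SparkSession; without an explicit
    schema, df.columns reflects only the keys found in the raw data.
    """
    df = spark.read.option("multiline", "true").json(path)
    return df, missing_columns(df.columns)
```

With this, something like `df, missing = read_with_presence(spark, 'file_name.json')` followed by `if "metadata" not in missing:` would gate the transformation on the column really being in the file. Note that schema inference costs an extra pass over the data, so this may be slower than reading with the fixed schema.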

Answered 2022-11-16
