Selecting columns not present in the DataFrame

Posted by at0kjp5o on 2021-05-27 in Spark


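I am creating a DataFrame from an XML file. It has some information about a dealer, and each dealer has multiple cars. Each car is a sub-element of the cars element and is represented by a value element, and each cars.value element has various car attributes. So I use the explode function to create one row per car for a dealer, as follows: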

exploded_dealer = df.select('dealer_id',explode('cars.value').alias('a_car'))

Now I want to get the various attributes of cars.value. I do it like this:

car_details_df = exploded_dealer.select('dealer_id','a_car.attribute1','a_car.attribute2')

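This works fine. But sometimes a cars.value element does not have all the attributes I specify in my query. For example, some cars.value elements might have only attribute1, and then running the above code fails with the following error:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2' given input columns: [dealer_id, attribute1];"
How do I ask Spark to execute the same query anyway, but return None for attribute2 when it is not present?
UPDATE: I read my data as follows: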

initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')

exploded_dealer = df.select('financial_data',explode('cars.value').alias('a_car'))

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/40180248/selecting-columns-not-present-in-the-dataframe

1 answer


dxxyhpgq1#

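Since you already make specific assumptions about the schema, the best thing you can do is to define it explicitly with nullable optional fields and use it when importing the data.
Let's say you expect documents similar to: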

<rows>
    <row>
        <id>1</id>
        <objects>
            <object>
                <attribute1>...</attribute1>
                ...
                <attributeN>...</attributeN>
            </object>
        </objects>
    </row>
</rows>

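where attribute1, attribute2, ..., attributeN may not be present in a given batch, but you can define a finite set of choices and their corresponding types. For simplicity, let's say there are only two options: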

{("attribute1", StringType), ("attribute2", LongType)}

You can define the schema as:

from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
  StructField("objects", StructType([
    StructField("object", StructType([
      StructField("attribute1", StringType(), True),
      StructField("attribute2", LongType(), True)
    ]), True)
  ]), True),
  StructField("id", LongType(), True)
])

and use it with the reader:

spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)

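It will be valid for any subset of the attributes (∅, {attribute1}, {attribute2}, {attribute1, attribute2}). At the same time, it is more efficient than depending on schema inference.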

2021-05-27
