pyspark Azure Databricks: schema mismatch when loading incremental XML data using com.databricks.spark.xml (converting a struct to an array)

Posted by eqqqjvef on 2023-04-05 in Spark


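I want to load incremental XML data, but for one field Spark infers the schema inconsistently: when there is a single row it infers a struct, and when there are two rows it infers an array.
Single-row example (Ship is inferred as a struct column here):
<Ships><Ship><ShipID>123</ShipID></Ship></Ships>
Two-row example (Ship is inferred as an array column here):
<Ships><Ship><ShipID>123</ShipID><ShipID>234</ShipID></Ship></Ships>
This causes a schema mismatch.
Could you help me convert the struct to an array, or suggest another solution?
I tried casting, but it did not work.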

pyspark

Source: https://stackoverflow.com/questions/75901478/azure-databricks-schema-mismatch-to-load-incremental-xml-data-using-com-databr

1 Answer


wsewodh21#

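Define the schema for the XML data manually instead of relying on inference, then pass it to the reader when creating the DataFrame.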

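// Schema building blocks (StructType, StructField, ArrayType, StringType)
import org.apache.spark.sql.types._
// Adds the .xml(...) method to the reader when the spark-xml library is used
import com.databricks.spark.xml._

// Declare Ships as an array explicitly, so its type does not depend on
// how many elements a particular file happens to contain
val custom_schema = StructType(Seq(
  StructField("Ships", ArrayType(
    StructType(Seq(
      StructField("ShipID", StringType)
    ))
  ))
))

// Read with the fixed schema; no schema inference takes place
val df = spark.read
  .schema(custom_schema)
  .option("rootTag", "Ships")
  .option("rowTag", "Ship")
  .xml("dbfs:/***/shipfile1.xml")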

This was tested with two cases: case 1 has a single ShipID, and case 2 has two ShipIDs.

Input 1:

<Ships>
    <Ship>
        <ShipID>789</ShipID>
    </Ship>
</Ships>

Output 1:

Ships
789

Input 2:

<Ships>
    <Ship>
        <ShipID>123</ShipID>
    </Ship>
    <Ship>
        <ShipID>234</ShipID>
    </Ship>
</Ships>

Output 2:

Ships
[[123], [234]]

Reference: the Databricks documentation on XML files.
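Since the question is tagged pyspark, the same idea can be sketched in Python. This is only a rough sketch under stated assumptions, not the answerer's tested code: the names `SHIPS_SCHEMA_DDL` and `read_ships` are hypothetical, and it assumes each `<Ships>` element is read as one row (rowTag `Ships`, unlike the rowTag in the Scala snippet) so that `Ship` becomes a column, as the question describes. Declaring that column as an array of structs should keep its type the same whether a file contains one `<Ship>` or several.

```python
# Hypothetical names for illustration: SHIPS_SCHEMA_DDL and read_ships are
# not part of the original answer.

# Assumption: each <Ships> element is read as one row, so <Ship> becomes a
# column. Declaring it as an array keeps the type stable whether a file
# contains one <Ship> or several.
SHIPS_SCHEMA_DDL = "Ship ARRAY<STRUCT<ShipID: STRING>>"


def read_ships(spark, path, row_tag="Ships"):
    """Read an XML file with a fixed schema instead of relying on inference.

    `spark` is an existing SparkSession with the spark-xml library attached.
    """
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)
        .schema(SHIPS_SCHEMA_DDL)  # DataFrameReader.schema accepts a DDL string
        .load(path)
    )
```

In a Databricks notebook, where `spark` is already defined, this would be called as `read_ships(spark, "dbfs:/***/shipfile1.xml")`. Because every incremental file is then read with the same declared schema, the struct-versus-array mismatch should not arise in the first place.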
