PySpark: merging Parquet files failed

Posted by n7taea2i on 2022-12-22 in Spark


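I am trying to merge schemas together. Unfortunately, two of them differ, and I get the error org.apache.spark.SparkException: Failed merging schema. Failed to merge incompatible data types string and double.
I have tried several ways to merge them, but I cannot find a way to fix this error. Does anyone know how to handle it?
Thanks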

```python
from pyspark.sql.functions import input_file_name, lit

# `spark`, `result`, `payload`, and `endpoint` are defined elsewhere in the job.
df = spark.read.format("parquet").load(result.db_path)

# Normalize all column names to lower case.
for column in df.columns:
    df = df.withColumnRenamed(column, column.lower())

# Tag each row with its tenant and the file it was read from.
df = df.withColumn("tenant", lit(payload.tenant)) \
       .withColumn("filename", input_file_name())

write_format = 'delta'
save_path = f'dbfs:_________{endpoint.lower()}/'
db = f'--------'
name = f'{endpoint.lower()}_raas'
table_name = f'{db}.{name}'

if not spark._jsparkSession.catalog().tableExists(db, name):
    # Write the data to its target.
    df.write \
      .format(write_format) \
      .save(save_path)
    # Create the table.
    spark.sql(f"CREATE TABLE {table_name} USING DELTA LOCATION '{save_path}'")
else:
    df.write.format(write_format).mode("overwrite").save(save_path)
```

I expected to be able to merge schemas whose column types differ. Any ideas would be really helpful.

pyspark

Source: https://stackoverflow.com/questions/74853606/merging-parquet-files-failed

1 answer


eqqqjvef1#

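If you try to load everything in a single read, e.g. .format().load(), you will not get far when the files have schemas that are incompatible with one another.
What you can do in this case is *group* the files you know have compatible schemas, so that you can convert them (e.g. String to Double), and finally union them with the remaining files (the second group).
For example, suppose this is your case: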

/path/file1   -> has column COL of type Int
/path/file2   -> has column COL of type String
/path/file3   -> has column COL of type Int

You can read file1 and file3 together, then union the result with file2 after casting its COL from String to Int.
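
The grouping approach can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names (`column_type_of`, `group_paths_by_type`, `load_with_common_type`) are hypothetical, the paths and the column name `COL` come from the example listing, and `spark` is assumed to be an existing SparkSession. Only the pure-Python grouping step runs without a Spark installation.

```python
from collections import defaultdict
from functools import reduce


def column_type_of(spark, path, column):
    """Return the Spark type name (e.g. 'int', 'string') of `column` in one file."""
    return spark.read.parquet(path).schema[column].dataType.simpleString()


def group_paths_by_type(column_types):
    """Turn {path: type_name} into {type_name: [paths]}, keeping input order."""
    groups = defaultdict(list)
    for path, type_name in column_types.items():
        groups[type_name].append(path)
    return dict(groups)


def load_with_common_type(spark, column_types, column, target_type):
    """Read each compatible group separately, cast `column`, then union the groups."""
    # Imported lazily so the grouping helper works without Spark installed.
    from pyspark.sql.functions import col

    frames = []
    for paths in group_paths_by_type(column_types).values():
        # Files in one group agree on the type of `column`, so reading them together is safe.
        df = spark.read.parquet(*paths)
        frames.append(df.withColumn(column, col(column).cast(target_type)))
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)


# The three files from the example, with the type of COL in each.
# With a live session these could be collected via column_type_of(spark, path, "COL").
col_types = {
    "/path/file1": "int",
    "/path/file2": "string",
    "/path/file3": "int",
}
groups = group_paths_by_type(col_types)

# With a SparkSession named `spark` available:
# merged = load_with_common_type(spark, col_types, "COL", "int")
```

Which target type to choose is a judgment call. Casting to a string keeps every value, whereas casting strings to a numeric type gives null or raises an error for values that do not parse, depending on whether ANSI mode is enabled.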

Answered 2022-12-22
