Delta table creates an extra Parquet file?

Posted by xa9qqrwz on 2021-05-29 in Spark

When I run an append or delete on a Delta table, it creates an unnecessary dummy Parquet file.

data = [["Alyssa", "maroon", [8,8,8]]]
df = spark.createDataFrame(data, "name string, favorite_color string, favorite_numbers array<int>")
df.write.format("delta").mode("append").save("/users")

Below are the log entries generated when appending a single record to the Delta table:

{"commitInfo":{"timestamp":1592247860686,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":2,"isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputBytes":"1611","numOutputRows":"1"}}}

{"add":{"path":"part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet","partitionValues":{},"size":548,"modificationTime":1592247860655,"dataChange":true}}
{"add":{"path":"part-00007-494b0d4f-36f7-4c3c-a46f-058865d36113-c000.snappy.parquet","partitionValues":{},"size":1063,"modificationTime":1592247860680,"dataChange":true}}
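
The commit entries above can also be inspected programmatically. The following is a minimal sketch using only the Python standard library; the log lines are copied from this post, and the helper name `added_files` is hypothetical, not part of any Delta API. It lists each file the commit added along with its size, which makes the small 548-byte file easy to spot.

```python
import json

# Commit log lines copied from the post (one JSON action per line).
LOG_LINES = [
    '{"commitInfo":{"timestamp":1592247860686,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":2,"isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputBytes":"1611","numOutputRows":"1"}}}',
    '{"add":{"path":"part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet","partitionValues":{},"size":548,"modificationTime":1592247860655,"dataChange":true}}',
    '{"add":{"path":"part-00007-494b0d4f-36f7-4c3c-a46f-058865d36113-c000.snappy.parquet","partitionValues":{},"size":1063,"modificationTime":1592247860680,"dataChange":true}}',
]


def added_files(log_lines):
    """Return (path, size) for every "add" action; hypothetical helper."""
    files = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between actions
        action = json.loads(line)
        if "add" in action:  # ignore commitInfo and other action types
            files.append((action["add"]["path"], action["add"]["size"]))
    return files


for path, size in added_files(LOG_LINES):
    print(size, path)
```

The two sizes add up to the `numOutputBytes` value of 1611 in the `commitInfo` entry, while `numOutputRows` is 1, so one of the two files holds no rows.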

The unnecessary file, part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet:

spark.read.format("parquet")\
 .load("/users/part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet").show()

+----+--------------+----------------+
|name|favorite_color|favorite_numbers|
+----+--------------+----------------+
+----+--------------+----------------+

Is this intentional behavior in Delta, or a bug?
I tried this code with:
Scala version 2.11.12
Spark version 2.4.5
Delta jar "io.delta:delta-core_2.11:0.6.1"
localhost, open-source Delta Lake
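
One thing that could be tried is sketched below. It rests on an assumption the post does not confirm: that the DataFrame is split across several Spark partitions (`df.rdd.getNumPartitions()` would show how many) and that an empty partition is still written out as a schema-only file. Under that assumption, collapsing the data to a single partition before the append should leave only one file. The helper name `append_as_single_file` is hypothetical.

```python
def append_as_single_file(df, path):
    """Append df to the Delta table at path from a single Spark partition.

    Hypothetical helper; assumes the extra file comes from an empty partition.
    """
    # coalesce(1) merges all partitions into one, so the write task
    # that produces a file is the one that actually holds the rows.
    df.coalesce(1).write.format("delta").mode("append").save(path)


# Usage with the DataFrame from the question:
# append_as_single_file(df, "/users")
```

This is only reasonable for small appends like the one in the question, since a single partition means the write is handled by a single task.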
