Hive bucketed table doing exchange and sort steps in the physical plan

Posted by drnojrws on 2021-06-24 in Hive


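I have two tables that are both clustered on the same column, but when I join them on that clustered column, the execution plan shows both an exchange and a sort step.
Both tables are bucketed on the same column (key_column). Both tables are ORC compressed; table A is partitioned and bucketed, and table B is bucketed on the same column.
I want to avoid the sort and exchange steps in my plan. According to the documentation, bucketed tables should avoid the sort and exchange steps.
I even tried the following Hive properties: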

spark.sql('set spark.sql.orc.filterPushdown=true')
spark.sql('set hive.optimize.bucketmapjoin = true')
spark.sql('set hive.optimize.bucketmapjoin.sortedmerge = true')
spark.sql('set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat')
spark.sql('set hive.optimize.bucketmapjoin = true')
spark.sql('set hive.stats.autogather=true')
spark.sql('set hive.compute.query.using.stats=true')
spark.sql('set hive.optimize.index.filter=true')

I also collected stats for the tables.

Both the sort and the exchange are visible in the physical plan, but Hive bucketed tables should avoid the sort and exchange steps:

[count#1311L])
          +- *Project
             +- *SortMergeJoin [key_column#1079], [key_column#1218],Inner
sort step:                :- *Sort [key_column#1079 ASC NULLS FIRST], false, 0
    exchange step:            :  +- Exchange hashpartitioning(key_column#1079, 200)
                :     +- *Filter isnotnull(key_column#1079)

Expected result: no sort and no exchange

[count#1311L])
              +- *Project
                 +- *SortMergeJoin [key_column#1079], [key_column#1218], Inner
                    :     +- *Filter isnotnull(key_column#1079)


Hive apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/56554080/hive-bucketed-table-doing-exchange-and-sort-step-in-physical-plan

1 answer


w7t8yxp51#

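The bucketing semantics of Hive and Spark are different.
When Spark reads a bucketed table that was created from Hive, it does not honor the Hive bucketing semantics, so the join is planned as if the tables were not bucketed.
To take advantage of Spark's bucketing feature, the tables must be created with Spark.
The open-source design document explains the differences between Hive and Spark bucketing: https://docs.google.com/document/d/1a8idh23rakrkg9yyaeo51f4ago8-xalupkwdshve2fc/edit#heading=h.fbzz4lt51r0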
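As a rough illustration of "create the tables with Spark", here is a minimal sketch using the PySpark `DataFrameWriter.bucketBy` / `sortBy` / `saveAsTable` calls. The helper names `write_bucketed` and `shuffle_operators`, the default bucket count, and the ORC format choice are assumptions made for this example, not something taken from the question. The second helper only scans the text of a physical plan for `Exchange` and `Sort` operators, so a plan can be checked after the tables are rewritten.

```python
import re


def write_bucketed(df, table_name, key="key_column", num_buckets=200):
    """Save a Spark DataFrame as a Spark-bucketed, sorted table.

    `df` is expected to be a pyspark.sql.DataFrame. The table name, key
    column, and bucket count are illustrative assumptions.
    """
    (df.write
       .format("orc")
       .bucketBy(num_buckets, key)   # Spark-native bucketing on the join key
       .sortBy(key)                  # keep rows sorted by the key inside each bucket
       .mode("overwrite")
       .saveAsTable(table_name))     # bucketBy requires saveAsTable, not save()


def shuffle_operators(plan_text):
    """Return which of 'Exchange' and 'Sort' appear as operators in a plan string.

    Whole-word matching keeps 'SortMergeJoin' from being counted as 'Sort'.
    """
    return set(re.findall(r"\b(Exchange|Sort)\b", plan_text))


if __name__ == "__main__":
    # Plan fragment shaped like the one in the question.
    observed = """*SortMergeJoin [key_column#1079], [key_column#1218], Inner
:- *Sort [key_column#1079 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(key_column#1079, 200)
:     +- *Filter isnotnull(key_column#1079)"""
    print(sorted(shuffle_operators(observed)))
```

With a live SparkSession, one would call `write_bucketed` for both sides of the join with the same key and the same bucket count, join the saved tables, and pass the text printed by `explain()` to `shuffle_operators`. Whether the `Sort` also disappears can depend on the Spark version and on how many files each bucket contains, so treat an empty result as something to verify rather than a guarantee.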

Answered 2021-06-24
