spark与hive是否可以将项目阶段推到hivetablescan？

rkue9o1l 于 2021-05-29 发布在 Hadoop

关注(0)|答案(1)|浏览(443)

我正在使用sparksql查询配置单元中以orc格式存储的数据。
当我对提供给的查询运行explain命令时 spark.sql(query) 我看到以下查询计划：

== Physical Plan ==

* Project [col1, col2, col3]

+- *Filter (....)
   +- HiveTableScan [col1, col2, col3, ...col50]

据我所知，它从配置单元中查询所有的50列，只有在spark和afterwords中的过滤只选择实际需要的列。
有没有可能将所需的列直接向下推到Hive中，这样它们就不会一直加载到spark？

hadoop Hive apache-spark apache-spark-sql apache-spark-dataset

来源：https://stackoverflow.com/questions/57819394/spark-with-hive-is-it-possible-to-push-project-phase-to-hivetablescan

1条答案

按热度按时间

elcex8rz1#

检查以下属性是否设置为默认值或false？

spark.sql("SET spark.sql.orc.enabled=true");
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SET spark.sql.orc.filterPushdown=true")

当您的数据分布在hdfs上的不同分区时，这些帮助您避免读取不必要的列，并利用hiveorc表的分区修剪功能。
将上述属性设置为“true”，然后查看您的解释计划显示的内容。
使用spark对orc格式进行分区修剪也可以从中受益，因为它不需要扫描整个表，并且可以限制spark在查询时需要的分区数。它将有助于减少磁盘输入/输出操作。
例如：
我运行下面的语句从hiveorc文件格式表创建一个dataframe，该表在列上进行分区 'country' & 'tran_date' .

df=spark.sql("""select transaction_date,payment_type,city from test_dev_db.transactionmainhistorytable where country ='United Kingdom' and tran_date='2009-06-01' """)

给定的表有几个分区，如果我们查看上面查询的物理计划，我们可以看到它只扫描了一个分区。

== Physical Plan ==

* (1) Project [transaction_date#69, payment_type#72, city#74]

+- *(1) FileScan orc test_dev_db.transactionmainhistorytable[transaction_date#69,payment_type#72,city#74,country#76,tran_date#77] 
Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://host/user/vikct001/dev/hadoop/database/test_dev..., 

* PartitionCount: 1,* PartitionFilters: [isnotnull(country#76), isnotnull(tran_date#77), (country#76 = United Kingdom), (tran_date#77 = 2...,

PushedFilters: [], ReadSchema: struct<transaction_date:timestamp,payment_type:string,city:string>

看到了吗 "PartitionCount: 1" 而且partitionfilters也被设置为不为null。
类似地，如果在查询中指定了任何筛选器，则可以向下推送筛选器。在这里，就像我用city列来过滤数据一样。

df=spark.sql("""select transaction_date,payment_type,city from test_dev_db.transactionmainhistorytable where country ='United Kingdom' and tran_date='2009-06-01' and city='London' """)

== Physical Plan ==

* (1) Project [transaction_date#104, payment_type#107, city#109]

+- *(1) Filter (isnotnull(city#109) && (city#109 = London))
   +- *(1) FileScan orc test_dev_db.transactionmainhistorytable[transaction_date#104,payment_type#107,city#109,country#111,tran_date#112] 
   Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://host/user/vikct001/dev/hadoop/database/test_dev..., 
   PartitionCount: 1, PartitionFilters: [isnotnull(country#111), isnotnull(tran_date#112), (country#111 = United Kingdom), (tran_date#112..., 
   PushedFilters: [IsNotNull(city), EqualTo(city,London)], ReadSchema: struct<transaction_date:timestamp,payment_type:string,city:string>

上面你可以看到 PushedFilters 不为空，并且有需要筛选的具有特定值的国家/地区列。

赞(0）回复(0）举报 2021-05-29

我来回答

spark与hive是否可以将项目阶段推到hivetablescan？

1条答案

相关问题

热门标签

最新问答