Hive map join: Hive picks the bigger table to store in cache

Posted by x4shl7ld on 2021-06-28 in Hive


I have the following properties set.

set hive.auto.convert.join=true;
set hive.optimize.ppd=true;

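Table A has 25 million records. Table B has 44 million records, but the WHERE clause applies a filter to table B, so after filtering the record count drops to 2 million.
Hive does not choose table B for the map join; it picks table A instead, and the 25 million records are cached on all data nodes.
Below is the query used: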

select col1,col2,col3,col4 
    from table_A a 
    join
   table_B c
    on
    a.account_number=c.account_number and c.ins_date between '$date_6' and '$date_cur';

How can I make sure Hive caches table B?
Plan after including the STREAMTABLE hint on the bigger table:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        b
          TableScan
            alias: b
            Statistics: Num rows: 23894045 Data size: 7048743275 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              condition expressions:
                0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
                1 {division} {region}
              keys:
                0 cm_mac_fin (type: string)
                1 mac (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 2599797 Data size: 678547017 Basic stats: COMPLETE Column stats: NONE
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              condition expressions:
                0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
                1 {mac} {division} {region}
              keys:
                0 cm_mac_fin (type: string)
                1 mac (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: int), _col4 (type: date), _col8 (type: string), _col9 (type: string), _col10 (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
      Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Plan after including the MAPJOIN hint on the smaller table:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        b
          TableScan
            alias: b
            Statistics: Num rows: 23894045 Data size: 7048743275 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              condition expressions:
                0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
                1 {division} {region}
              keys:
                0 cm_mac_fin (type: string)
                1 mac (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 2599797 Data size: 678547017 Basic stats: COMPLETE Column stats: NONE
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              condition expressions:
                0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
                1 {mac} {division} {region}
              keys:
                0 cm_mac_fin (type: string)
                1 mac (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: int), _col4 (type: date), _col8 (type: string), _col9 (type: string), _col10 (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
      Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Hive inner-join

Source: https://stackoverflow.com/questions/40224249/hive-map-join-hive-picking-the-bigger-table-to-store-in-cache

2 Answers


of1yzvn41#

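Hive uses several factors internally to determine the cached table and the streamed table for a join:
It converts the query into a map join based on configuration flags (hive.auto.convert.join.noconditionaltask, hive.auto.convert.join.noconditionaltask.size, hive.mapjoin.smalltable.filesize).
The size settings let the user control how large a table may be and still be held in memory.
If n tables participate in the join, then n-1 of them must fit in memory for the map join optimization to take effect.
When n = 2 and hive.auto.convert.join is set to true, Hive performs a map join and caches the table that is smaller than hive.mapjoin.smalltable.filesize; otherwise this parameter has no effect.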
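
For reference, the flags listed above can be set per session. This is only a sketch: the values shown are the stock defaults as I understand them from the Hive configuration documentation, not tuned recommendations, and should be adjusted to the memory actually available to the tasks.

```sql
-- Allow Hive to convert common joins into map joins automatically.
set hive.auto.convert.join=true;
-- Convert directly to a map join (without a conditional backup task)
-- when the combined size of the n-1 small tables is under the limit below.
set hive.auto.convert.join.noconditionaltask=true;
-- Limit in bytes for the combined size of the n-1 small tables (assumed default).
set hive.auto.convert.join.noconditionaltask.size=10000000;
-- Limit in bytes for the small table on the conditional path (assumed default).
set hive.mapjoin.smalltable.filesize=25000000;
```

As far as I know, these limits are compared against the size of the table's input data rather than the number of rows left after filtering, which may be why a filter on the join condition does not change which table is cached.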
In your case, you can specify the cached table explicitly instead of letting Hive determine it:

select /*+MAPJOIN(c)*/ col1,col2,col3,col4 
    from table_A a 
    join
   table_B c
    on
    a.account_number=c.account_number and c.ins_date between '$date_6' and '$date_cur';
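
One caveat worth checking if the hint appears to change nothing: since Hive 0.11 the property hive.ignore.mapjoin.hint defaults to true, so MAPJOIN hints are ignored unless it is switched off. A minimal sketch, assuming the table and column names from the question and a Hive version where this property exists:

```sql
-- Assumption: Hive 0.11 or later, where MAPJOIN hints are ignored by default.
set hive.ignore.mapjoin.hint=false;

-- Use EXPLAIN first to confirm which alias ends up under "Map Local Tables".
explain
select /*+ MAPJOIN(c) */ col1,col2,col3,col4
from table_A a
join table_B c
  on a.account_number=c.account_number
 and c.ins_date between '$date_6' and '$date_cur';
```

Whether automatic conversion (hive.auto.convert.join) also needs to be turned off for the hint to win may depend on the Hive version, so it is worth comparing the plan with and without it.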

Answered 2021-06-28

sr4lhrrt2#

Move the WHERE clause into a CTE before the join.

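WITH b as (
  -- account_number is selected here so that the join condition below can reference it
  SELECT account_number,col1,col2,col3,col4 
  FROM table_B
  WHERE ins_date between '$date_6' and '$date_cur'
)
SELECT col1,col2,col3,col4 
FROM table_A a join b
on a.account_number = b.account_number;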

This way b, on the right side of the join, has only 2 million records, so it is the one loaded into RAM.
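
Whether the filtered side really is the one cached can be checked with EXPLAIN before running the join. The sketch below assumes the table and column names from the question and that the CTE also exposes account_number; it does not guarantee that the optimizer will pick b, it only shows how to verify it.

```sql
-- Inspect the plan: the cached (small) side is the alias listed under
-- "Alias -> Map Local Tables" in the Map Reduce Local Work stage.
EXPLAIN
WITH b AS (
  SELECT account_number, col1, col2, col3, col4
  FROM table_B
  WHERE ins_date BETWEEN '$date_6' AND '$date_cur'
)
SELECT col1, col2, col3, col4
FROM table_A a JOIN b
ON a.account_number = b.account_number;
```

If table_A still shows up as the local table, the size-based settings or an explicit hint are the remaining options to look at.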

Answered 2021-06-28
