Mark a Hive table as replicated/small

mftmpeh8 posted on 2021-06-03 in Hadoop

Can I tell Hive that a certain table is "small", meaning it should be replicated to all nodes and operated on in RAM?

mgdq6dx1

Try the following hint:

/*+ MAPJOIN(small_table) */
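For context, here is a minimal sketch of how the hint sits in a query. The tables fact_sales and dim_store and the column store_id are hypothetical; note that on Hive 0.11 and later the hint is ignored by default, so hive.ignore.mapjoin.hint has to be switched off first:

-- The MAPJOIN hint names the small side (alias d); Hive loads it into an
-- in-memory hashtable on each mapper instead of shuffling both tables.
SET hive.ignore.mapjoin.hint=false;

SELECT /*+ MAPJOIN(d) */ f.store_id, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
GROUP BY f.store_id;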

By the way, there are other options, such as the sort-merge-bucket (SMB) join. However, they require changes to the input tables so that both are bucketed (and sorted) on the same join columns; see the sketch below.
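As a rough sketch of what that table change could look like (all table and column names are hypothetical; the SET switches are Hive's standard bucket-join options for the classic MapReduce execution path):

-- Both tables bucketed and sorted on the join key, with the same bucket count
CREATE TABLE orders_bucketed (customer_id BIGINT, amount DOUBLE)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE customers_bucketed (customer_id BIGINT, name STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

-- Enable the bucket map join and its sort-merge variant
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT o.customer_id, c.name, o.amount
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;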
There is some information on the limitations/capabilities of map joins in the Hortonworks documentation:
Hortonworks documentation on map-side join optimization
For convenience, here is an excerpt about MAPJOINs:

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through.

Local work:
- read records via standard table scan (includes filters and projections) from source on local machine
- build hashtable in memory
- write hashtable to local disk
- upload hashtable to dfs
- add hashtable to distributed cache

Map task:
- read hashtable from local disk (distributed cache) into memory
- match records' keys against hashtable
- combine matches and write to output

No reduce task
Limitations of Current Implementation

The current MAPJOIN implementation has the following limitations:

- The mapjoin operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
- Hints are cumbersome for users to apply correctly and auto conversion doesn't have enough logic to consistently predict if a MAPJOIN will fit into memory or not.
- A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery.... Auto conversion will never produce a single map-only job.
- The hashtable for the mapjoin operator has to be generated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.
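Coming back to the original question: instead of hinting manually, you can let Hive decide which side is "small" via auto conversion. A minimal sketch of the usual session settings (the threshold shown is Hive's documented default, given only for illustration):

-- Convert common joins to map joins automatically
SET hive.auto.convert.join=true;
-- Tables whose on-disk size is below this many bytes are treated as the
-- small side and loaded into an in-memory hashtable (default 25000000)
SET hive.mapjoin.smalltable.filesize=25000000;

Running EXPLAIN on the query then shows whether a Map Join operator actually made it into the plan.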
