Out of memory due to hash maps used in map-side aggregation

Posted by sg3maiej on 2021-06-03 in Hadoop


My Hive query throws this exception.

Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
2013-05-22 12:08:32,634 Stage-1 map = 0%,  reduce = 0%
2013-05-22 12:09:19,984 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201305221200_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201305221200_0001_m_000007 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000003 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000001 (and more) from job job_201305221200_0001

Task with the most failures(4): 
-----
Task ID:
  task_201305221200_0001_m_000001

URL:
  http://ip-10-134-7-119.ap-southeast-1.compute.internal:9100/taskdetails.jsp?jobid=job_201305221200_0001&tipid=task_201305221200_0001_m_000001

Possible error:
  Out of memory due to hash maps used in map-side aggregation.

Solution:
  Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'
-----

Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

    select 
        uri, 
        count(*) as hits 
    from
        iislog
    where 
        substr(cs_cookie,instr(cs_Cookie,'cwc'),30) like '%CWC%'
    and uri like '%.aspx%' 
    and logdate = '2013-02-07' 
    group by uri 
    order by hits Desc;

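I tried this on 8 EMR core instances with 1 large master instance, on 8 GB of data. First I tried an external table (the data location is an S3 path). After that I downloaded the data from S3 to EMR and used a native Hive table. In both cases I got the same error.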

FYI, I am using the regex SerDe to parse the IIS logs.

'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
               WITH SERDEPROPERTIES (
               "input.regex" ="([0-9-]+) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9-]+ [0-9:.]+) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([0-9-]+ [0-9:.]+)",
               "output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s %19$s %20$s %21$s %22$s %23$s %24$s %25$s %26$s %27$s %28$s %29$s %30$s %31$s %32$s")
location 's3://*******';

hadoop Hive amazon-emr hiveql

Source: https://stackoverflow.com/questions/16684712/out-of-memory-due-to-hash-maps-used-in-map-side-aggregation

2 Answers


7gcisfzg1#

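The table's location is irrelevant to Hive.
It would be better if you pasted your query, so one can tell whether your mapper is sorting as well.
Either way, the amount of memory needs to increase. Check how much memory your map tasks are configured to run with (mapred.child…). It should be at least about 1 GB. If that is already large enough, you can:
If the mapper isn't sorting: consider raising the hash aggregation memory % indicated in the log to a higher number.
If the mapper is sorting: just increase the task memory to a larger number.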
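
The checks described above could look roughly like this in a Hive script. This is a sketch rather than a verified fix: it assumes the truncated property name above refers to `mapred.child.java.opts` (the Hadoop 1 setting for task JVM options), and the `-Xmx1024m` and `0.75` values are illustrative only.

```sql
-- SET with a property name and no value prints the current setting.
SET mapred.child.java.opts;
SET hive.map.aggr.hash.percentmemory;

-- Assumption: the task heap is currently below about 1 GB,
-- so raise it for this session (illustrative value).
SET mapred.child.java.opts=-Xmx1024m;

-- Only for the "mapper isn't sorting" case described in this answer:
-- give the map-side aggregation hash table a larger share of memory
-- (illustrative value).
SET hive.map.aggr.hash.percentmemory=0.75;
```

Note that the error message in the question suggests lowering `hive.map.aggr.hash.percentmemory` rather than raising it, so the last line is one option to test, not a guaranteed improvement.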


nlejzf6q2#

Have you tried setting set hive.map.aggr.hash.percentmemory = 0.25; as the error message suggests? You can read more here
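
Applied to the query from the question, that suggestion might look like the following sketch. The property and value come straight from the error message; whether 0.25 is low enough for this data set is not known.

```sql
-- Lower the portion of task memory that the map-side aggregation
-- hash table may use (value taken from the error message).
SET hive.map.aggr.hash.percentmemory=0.25;

-- Re-run the original query in the same session so the setting applies.
SELECT
    uri,
    COUNT(*) AS hits
FROM
    iislog
WHERE
    substr(cs_cookie, instr(cs_cookie, 'cwc'), 30) LIKE '%CWC%'
    AND uri LIKE '%.aspx%'
    AND logdate = '2013-02-07'
GROUP BY uri
ORDER BY hits DESC;
```

`SET` only affects the current session, so it has to be issued before the query each time, or placed at the top of the script that runs it.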
