MapReduce fails when reading a large number of CSV files

sg2wtvxw · published 2021-05-30 in Hadoop

If I run the CSV files through MapReduce individually, I can read them all without a problem. But when I run the job against a folder containing n files, it fails near the end of the map phase (map 99%) with the following error:

INFO mapreduce.Job:  map 99% reduce 0%
INFO mapred.Task: Task:attempt_local1889843460_0001_m_000190_0 is done. And is in the process of committing
INFO mapred.LocalJobRunner: map
INFO mapred.Task: Task 'attempt_local1889843460_0001_m_000190_0' done.
INFO mapred.LocalJobRunner: Finishing task: attempt_local1889843460_0001_m_000190_0
INFO mapred.LocalJobRunner: map task executor complete.
WARN mapred.LocalJobRunner: job_local1889843460_0001
java.lang.Exception: java.lang.ArrayIndexOutOfBoundsException: 6
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 6
    at com.calsoftlabs.mr.analytics.common.ClientTrafficRecordReader.nextKeyValue(ClientTrafficRecordReader.java:49)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Kindly advise.


6tdlim6h1#

A couple of things:
1) Always wrap the logic inside your mapper's map() method (and your reducer's reduce() method) in a try-catch block, so that a single bad record doesn't blow up the entire job.
2) In the catch block, log the invalid input key/value along with the error, or, for development purposes, just write the information to the console. If you are debugging the job, you can set a breakpoint on the first line of the catch block.
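In a real Hadoop job this try-catch sits inside Mapper.map(); the sketch below shows the same defensive pattern in plain Java (no Hadoop dependency) so it runs standalone. The class name SafeCsvParse and the 7-field assumption (the stack trace indexes field 6, so the reader expects at least 7 fields) are illustrative, not taken from the original code.

```java
import java.util.Optional;

public class SafeCsvParse {
    // Assumption: the record reader accesses fields[6], so a valid
    // line must have at least 7 comma-separated fields.
    static final int EXPECTED_FIELDS = 7;

    // Returns the 7th field, or empty if the line is malformed.
    // In a Mapper, the catch block would log and skip the record
    // instead of letting the exception kill the whole job.
    static Optional<String> seventhField(String line) {
        try {
            // -1 keeps trailing empty fields, matching String.split semantics
            String[] fields = line.split(",", -1);
            return Optional.of(fields[6]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.err.println("Skipping malformed record: " + line);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(seventhField("a,b,c,d,e,f,g").orElse("MALFORMED"));
        System.out.println(seventhField("a,b,c").orElse("MALFORMED"));
    }
}
```

The same structure drops straight into map(): wrap the split-and-index logic, and in the catch block write the offending Text value to the log before returning.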
It looks like you have about 190 map tasks, which probably means you have that many small files. My guess is that one of the later files, one you haven't run manually, is causing the problem.
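To pinpoint which of the many small files is malformed without re-running the job, you can scan them locally. This assumes each valid line has at least 7 comma-separated fields (the stack trace indexes field 6) and that a copy of the input sits in a local directory named input/:

```shell
# Print the file, line number, and field count of every line that has
# fewer than 7 comma-separated fields -- the likely cause of the
# ArrayIndexOutOfBoundsException: 6 in the record reader.
awk -F',' 'NF < 7 { print FILENAME ": line " FNR " has " NF " fields" }' input/*.csv
```

For files already on HDFS, the same awk filter can be fed from `hdfs dfs -cat`, though FILENAME is then lost, so checking one file at a time is easier.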
