s3distcp hangs after showing 100%

Posted by nxagd54h on 2021-06-02 in Hadoop

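To work around performance problems on Amazon EMR, I am trying to use s3distcp to copy files from S3 to the EMR cluster for local processing. As a first test, I am using the --groupBy option to collapse them into one (or a few) files.
The job appears to run fine, with map/reduce progress reaching 100%, but at that point the process hangs and never returns. How can I find out what is going on?
The source files are gzipped text files stored in S3, each about 30 KB. This is a vanilla Amazon EMR cluster, and I am running s3distcp from a shell on the master node.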

    hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520 --dest hdfs:////data/click/20140520 --groupBy ".*(20140520).*" --outputCodec lzo
    14/05/21 20:06:32 INFO s3distcp.S3DistCp: Running with args: [Ljava.lang.String;@26f3bbad
    14/05/21 20:06:35 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/output'
    14/05/21 20:06:35 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-west-2b
    14/05/21 20:06:35 INFO s3distcp.S3DistCp: Created AmazonS3Client with conf KeyId AKIAJ5KT6QSV666K6KHA
    14/05/21 20:06:37 INFO s3distcp.FileInfoListing: Opening new file: hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/files/1
    14/05/21 20:06:38 INFO s3distcp.S3DistCp: Created 1 files to copy 2160 files
    14/05/21 20:06:38 INFO mapred.JobClient: Default number of map tasks: null
    14/05/21 20:06:38 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 72
    14/05/21 20:06:38 INFO mapred.JobClient: Default number of reduce tasks: 3
    14/05/21 20:06:39 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
    14/05/21 20:06:39 INFO mapred.JobClient: Setting group to hadoop
    14/05/21 20:06:39 INFO mapred.FileInputFormat: Total input paths to process : 1
    14/05/21 20:06:39 INFO mapred.JobClient: Running job: job_201405211343_0031
    14/05/21 20:06:40 INFO mapred.JobClient: map 0% reduce 0%
    14/05/21 20:06:53 INFO mapred.JobClient: map 1% reduce 0%
    14/05/21 20:06:56 INFO mapred.JobClient: map 4% reduce 0%
    14/05/21 20:06:59 INFO mapred.JobClient: map 36% reduce 0%
    14/05/21 20:07:00 INFO mapred.JobClient: map 44% reduce 0%
    14/05/21 20:07:02 INFO mapred.JobClient: map 54% reduce 0%
    14/05/21 20:07:05 INFO mapred.JobClient: map 86% reduce 0%
    14/05/21 20:07:06 INFO mapred.JobClient: map 94% reduce 0%
    14/05/21 20:07:08 INFO mapred.JobClient: map 100% reduce 10%
    14/05/21 20:07:11 INFO mapred.JobClient: map 100% reduce 19%
    14/05/21 20:07:14 INFO mapred.JobClient: map 100% reduce 27%
    14/05/21 20:07:17 INFO mapred.JobClient: map 100% reduce 29%
    14/05/21 20:07:20 INFO mapred.JobClient: map 100% reduce 100%
    [hangs here]

The job is listed as:

    hadoop@xxx:~$ hadoop job -list
    1 job currently running
    JobId State StartTime UserName Priority SchedulingInfo
    job_201405211343_0031 1 1400702799339 hadoop NORMAL NA

There is nothing in the destination HDFS directory:

    hadoop@xxx:~$ hadoop dfs -ls /data/click/

Any ideas?
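For readers unfamiliar with --groupBy, here is a rough sketch of what the pattern in the command above is expected to do: files whose path matches the regular expression are grouped by the value of the capture group, and each group is concatenated into one output file. The object keys below are made up for illustration, `group_keys` is a hypothetical helper, and using `sed` against the whole key is only an approximation of S3DistCp's own matching.

```shell
# Hypothetical object keys; the real names in the bucket will differ.
keys='click/20140520/part-0001.gz
click/20140520/part-0002.gz
click/20140521/part-0001.gz'

# The pattern passed to --groupBy in the question.
pattern='.*(20140520).*'

# Print the capture-group value for every key that matches the whole
# pattern; keys that do not match produce no output.
group_keys() {
  printf '%s\n' "$1" | sed -nE "s#^${2}\$#\1#p"
}

# Count how many keys fall into each group.
group_keys "$keys" "$pattern" | sort | uniq -c
```

With these sample keys, the two 20140520 keys share the single group value 20140520 and the 20140521 key is skipped, which is why one pattern like this tends to yield one (or a few) output files.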

3zwjbxry #1

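I ran into a similar problem. All I needed to do was add a trailing slash to the source and destination directories. With that change the job completed and printed its statistics, whereas before it hung at 100%.

    hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520/ --dest hdfs:///data/click/20140520/ --groupBy ".*(20140520).*" --outputCodec lzo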
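If the source and destination paths come from variables or a wrapper script, the trailing slash is easy to lose. Below is a minimal sketch of a helper that appends it only when it is missing. `with_trailing_slash` is a hypothetical name, not part of S3DistCp, and the paths are the placeholders from the question. The final command is only printed for review, not executed.

```shell
# Hypothetical helper: return the path with exactly one trailing slash added
# when it does not already end in "/".
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Placeholder paths taken from the question.
SRC=$(with_trailing_slash "s3n://xxx/click/20140520")
DEST=$(with_trailing_slash "hdfs:///data/click/20140520")

# Print the resulting command instead of running it.
printf '%s\n' "hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src $SRC --dest $DEST --groupBy '.*(20140520).*' --outputCodec lzo"
```

Paths that already end in a slash pass through unchanged, so the helper can be applied unconditionally.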

nc1teljy #2

Use s3:// instead of s3n://.

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://xxx/click/20140520 --dest hdfs:///data/click/20140520 --groupBy ".*(20140520).*" --outputCodec lzo
