Running a DistCp Java job with Hadoop

pn9klfpd · posted 2021-05-31 · in Hadoop

I want to copy files from HDFS to an S3 bucket using Java code. My implementation looks like this:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.tools.DistCp;
  import org.apache.hadoop.tools.DistCpOptions;
  import org.apache.hadoop.tools.OptionsParser;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class HdfsToS3Copy {
      private static final Logger logger = LoggerFactory.getLogger(HdfsToS3Copy.class);

      // Connection settings (populated elsewhere for my environment)
      private static String hdfsUrl;
      private static String hdfsUser;
      private static String s3AccessKey;
      private static String s3SecretKey;
      private static String s3EndPoint;
      private static String srcDir;
      private static String dstDir;

      private static void setHadoopConfiguration(Configuration conf) {
          conf.set("fs.defaultFS", hdfsUrl);
          conf.set("fs.s3a.access.key", s3AccessKey);
          conf.set("fs.s3a.secret.key", s3SecretKey);
          conf.set("fs.s3a.endpoint", s3EndPoint);
          conf.set("hadoop.job.ugi", hdfsUser);
          System.setProperty("com.amazonaws.services.s3.enableV4", "true");
      }

      public static void main(String[] args) {
          Configuration conf = new Configuration();
          setHadoopConfiguration(conf);
          try {
              DistCpOptions distCpOptions = OptionsParser.parse(new String[]{srcDir, dstDir});
              DistCp distCp = new DistCp(conf, distCpOptions);
              distCp.execute();
          } catch (Exception e) {
              logger.error("Exception occurred while copying file {}", srcDir, e);
          }
      }
  }

This code runs, but the problem is that it does not launch the DistCp job on the YARN cluster. It starts the LocalJobRunner instead, so copies of large files time out:

  [2020-08-23 21:16:53.759][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 (***.distcp.tmp.attempt_local367303638_0001_m_000000_0)
  [2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Delete path s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 - recursive false
  [2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 (**.distcp.tmp.attempt_local367303638_0001_m_000000_0)
  [2020-08-23 21:16:54.007][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://****
  [2020-08-23 21:16:54.118][LocalJobRunner Map Task Executor #0][ERROR][RetriableCommand:?] Failure in Retriable command: Copying hdfs://***to s3a://***
  com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1135)

Please help me understand how to configure YARN so that the DistCp job runs on the cluster rather than locally.
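For context: a freshly constructed `Configuration` that cannot see the cluster's client config files leaves `mapreduce.framework.name` at its default of `local`, which is why the job falls back to the LocalJobRunner. A hedged sketch of the settings that would direct the submission to YARN (the ResourceManager hostname and config paths below are placeholders for illustration, not values from my cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class YarnSubmissionConf {
    static void pointAtYarn(Configuration conf) {
        // Without this, mapreduce.framework.name defaults to "local"
        // and DistCp's MapReduce job runs in-process via LocalJobRunner.
        conf.set("mapreduce.framework.name", "yarn");
        // Placeholder host - replace with the cluster's ResourceManager.
        conf.set("yarn.resourcemanager.hostname", "rm-host.example.com");

        // Alternatively, load the cluster's client configs directly
        // (hypothetical paths; use wherever your cluster configs live):
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    }
}
```

Is something along these lines sufficient, or are additional YARN settings required for DistCp specifically?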
