Running a DistCp Java job on Hadoop YARN

Problem description

I want to use Java code to copy files that exist in HDFS to an S3 bucket. My Java implementation looks like this:

import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.apache.hadoop.tools.OptionsParser;
import org.apache.hadoop.conf.Configuration;

// hdfsUrl, s3AccessKey, s3SecretKey, s3EndPoint, hdfsUser, srcDir, dstDir and logger
// are static fields of the enclosing class (not shown here).

// Must be static because it is called from the static main method.
private static void setHadoopConfiguration(Configuration conf) {
    conf.set("fs.defaultFS", hdfsUrl);
    conf.set("fs.s3a.access.key", s3AccessKey);
    conf.set("fs.s3a.secret.key", s3SecretKey);
    conf.set("fs.s3a.endpoint", s3EndPoint);
    conf.set("hadoop.job.ugi", hdfsUser);
    // Enable AWS Signature Version 4 in the AWS SDK.
    System.setProperty("com.amazonaws.services.s3.enableV4", "true");
}

public static void main(String[] args) {
    Configuration conf = new Configuration();
    setHadoopConfiguration(conf);
    try {
        // Parse source and destination as if they were distcp command-line arguments.
        DistCpOptions options = OptionsParser.parse(new String[]{srcDir, dstDir});
        DistCp distCp = new DistCp(conf, options);
        distCp.execute();
    } catch (Exception e) {
        logger.info("Exception occurred while copying file {}", srcDir);
        logger.error("Error ", e);
    }
}

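This code runs, but the problem is that it does not launch the DistCp job on the YARN cluster. When I start a copy of a large file, it runs in the local job runner instead: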

[2020-08-23 21:16:53.759][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 (***.distcp.tmp.attempt_local367303638_0001_m_000000_0)
[2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Delete path s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 - recursive false
[2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://*** .distcp.tmp.attempt_local367303638_0001_m_000000_0 (**.distcp.tmp.attempt_local367303638_0001_m_000000_0)
[2020-08-23 21:16:54.007][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://****
[2020-08-23 21:16:54.118][LocalJobRunner Map Task Executor #0][ERROR][RetriableCommand:?] Failure in Retriable command: copying hdfs://*** to s3a://***
com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1135)

Please help me understand how to configure YARN so that the DistCp job runs on the cluster rather than locally.

Solution
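The `LocalJobRunner` entries in the log are what MapReduce produces when `mapreduce.framework.name` is left at its default value, `local`. One possible direction, offered as a sketch rather than a confirmed fix for this setup, is to set that property to `yarn` and tell the client where the ResourceManager is. The helper below only collects the relevant properties using the standard library. The class name `YarnSubmitSettings`, the method `yarnSettings`, and the host `rm.example.internal` are hypothetical, and port 8032 is the stock default for `yarn.resourcemanager.address`, which your cluster may override.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the client-side properties that usually decide
// whether a MapReduce job (DistCp included) is submitted to YARN or run locally.
class YarnSubmitSettings {

    // Default port of yarn.resourcemanager.address; an assumption for your cluster.
    static final int DEFAULT_RM_PORT = 8032;

    static Map<String, String> yarnSettings(String resourceManagerHost) {
        Map<String, String> settings = new LinkedHashMap<>();
        // "local" (the default) selects LocalJobRunner; "yarn" submits to the cluster.
        settings.put("mapreduce.framework.name", "yarn");
        // Where the client should look for the ResourceManager.
        settings.put("yarn.resourcemanager.hostname", resourceManagerHost);
        settings.put("yarn.resourcemanager.address",
                resourceManagerHost + ":" + DEFAULT_RM_PORT);
        return settings;
    }

    public static void main(String[] args) {
        // "rm.example.internal" is a placeholder host, not a value from the question.
        yarnSettings("rm.example.internal")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Inside `setHadoopConfiguration`, each entry could then be applied with `conf.set(key, value)` before the `DistCp` object is constructed. An alternative that avoids hard-coding hosts is to load the cluster's own `yarn-site.xml` and `mapred-site.xml` with `conf.addResource(...)`. In either case the YARN and MapReduce job-client jars probably need to be on the application's classpath, since without them the client may still fall back to the local runner.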
