I am trying to run distcp on a Hadoop cluster using the Hadoop Java libraries, in order to move content from HDFS to a Google Cloud Storage bucket. I am getting the error NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.
Below is my Java code:
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HadoopHelper {

    private static Logger logger = LoggerFactory.getLogger(HadoopHelper.class);
    private static final String FS_DEFAULT_FS = "fs.defaultFS";

    private final Configuration conf;

    public HadoopHelper(String hadoopUrl) {
        conf = new Configuration();
        conf.set(FS_DEFAULT_FS, "hdfs://" + hadoopUrl);
    }

    public void distCP(JsonArray files, String target) {
        try {
            List<Path> srcPaths = new ArrayList<>();
            for (JsonElement file : files) {
                String srcPath = file.getAsString();
                srcPaths.add(new Path(srcPath));
            }

            DistCpOptions options = new DistCpOptions.Builder(
                    srcPaths,
                    new Path("gs://" + target)
            ).build();

            logger.info("Using distcp to copy {} to gs://{}", files, target);

            this.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            this.conf.set("fs.gs.auth.service.account.email", "my-svc-account@my-gcp-project.iam.gserviceaccount.com");
            this.conf.set("fs.gs.auth.service.account.keyfile", "config/my-svc-account-keyfile.p12");
            this.conf.set("fs.gs.project.id", "my-gcp-project");

            DistCp distCp = new DistCp(this.conf, options);
            Job job = distCp.execute();
            job.waitForCompletion(true);
            logger.info("Distcp operation success. Exiting");
        } catch (Exception e) {
            logger.error("Error while trying to execute distcp", e);
            logger.error("Distcp operation failed. Exiting");
            throw new IllegalArgumentException("Distcp failed");
        }
    }

    public void createDirectory() throws IOException {
        FileSystem fileSystem = FileSystem.get(this.conf);
        fileSystem.mkdirs(new Path("/user/newfolder"));
        logger.info("Done");
    }
}
I added the following dependencies to my pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-distcp</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.2.4</version>
</dependency>
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>util</artifactId>
    <version>2.2.4</version>
</dependency>
If I run the distcp command on the cluster itself, like this: hadoop distcp /user gs://my_bucket_name/
the distcp operation works and the content is copied to the Cloud Storage bucket.
1 Answer
Did you add the jar to Hadoop's classpath?
Add the connector jar to Hadoop's classpath: placing the connector jar in the HADOOP_COMMON_LIB_JARS_DIR directory should be sufficient for Hadoop to load it. Alternatively, to be certain the jar is loaded, you can add HADOOP_CLASSPATH=$HADOOP_CLASSPATH:</path/to/gcs-connector.jar> to hadoop-env.sh in the Hadoop configuration directory.
This needs to be done on the DistCp conf (this.conf in your code) before that line of code runs. If it helps, there is a troubleshooting section in the connector documentation.
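The exact snippet the answer refers to is not reproduced above. As a hedged sketch of what "setting things on the conf before DistCp is constructed" could look like, the standalone example below sets the GCS filesystem keys that the gcs-connector documentation names (fs.gs.impl and the AbstractFileSystem binding fs.AbstractFileSystem.gs.impl) before building the DistCp job. The namenode address, project id, source path, and bucket name are placeholders taken from the question, not verified values, and the connector jar still has to be present on the cluster's classpath for this to run without the NoClassDefFoundError.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

import java.util.Collections;

public class GcsDistCpSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder cluster address

        // Register the GCS connector on the job configuration *before* DistCp is built.
        // fs.gs.impl matches what the question already sets; fs.AbstractFileSystem.gs.impl
        // is the AbstractFileSystem binding documented by the gcs-connector project.
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        conf.set("fs.gs.project.id", "my-gcp-project"); // placeholder, from the question

        DistCpOptions options = new DistCpOptions.Builder(
                Collections.singletonList(new Path("/user")), // HDFS source, as in the CLI example
                new Path("gs://my_bucket_name/")              // target bucket, as in the CLI example
        ).build();

        // If the connector jar is not on the classpath at runtime, the
        // NoClassDefFoundError from the question will still surface here.
        DistCp distCp = new DistCp(conf, options);
        Job job = distCp.execute();
        job.waitForCompletion(true);
    }
}

Note that setting these keys only tells Hadoop which class names to load; it does not replace putting the gcs-connector jar on the classpath as described above.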