Spark and temporary AWS credentials: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

dfddblmv · posted 2021-05-27 in Spark

I don't understand how Spark resolves or downloads the packages supplied through the Scala interface.
For my specific case: I want to pass AWS credentials explicitly in order to access some S3 buckets. The Spark cluster runs Spark 2.4.6 on Hadoop 2.9.2; the local environment runs Scala 2.11.12.

import $ivy.`com.amazonaws:aws-java-sdk:1.11.199`
import $ivy.`org.apache.hadoop:hadoop-common:2.9.2`
import $ivy.`org.apache.hadoop:hadoop-aws:2.9.2`
import $ivy.`org.apache.spark::spark-sql:2.4.6`

import org.apache.spark.sql._
import org.apache.spark._

var appName = "read-s3-test"
var accessKeyId = "xxxxxxxxxxxxxx"
var secretAccessKey = "xxxxxxxxxxxxxx"
var sessionToken = "xxxxxxxxxxxxxx"

val conf = new SparkConf()
    .setAppName(appName)
    .setMaster("spark://my-spark-master-svc:7077")
    .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:2.9.2,org.apache.hadoop:hadoop-common:2.9.2")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .set("spark.hadoop.fs.s3a.access.key", accessKeyId)
    .set("spark.hadoop.fs.s3a.secret.key", secretAccessKey)
    .set("spark.hadoop.fs.s3a.session.token", sessionToken)
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
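
For reference, the failing read itself is nothing exotic; any access to an s3a path fails the same way. A minimal sketch (the bucket and prefix below are placeholders, not from my actual setup):

// Hypothetical bucket/prefix; any read against an s3a path hits the error.
val df = spark.read.text("s3a://some-bucket/some-prefix/")
df.show(5)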

This creates a session, but running any read command against an s3a path (such as the sketch above) fails with java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics. When the session is created, the logs also warn that some configuration may not take effect:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/08/05 10:49:52 INFO SparkContext: Running Spark version 2.4.6
20/08/05 10:49:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/05 10:49:52 INFO SparkContext: Submitted application: read-s3-test
20/08/05 10:49:52 INFO SecurityManager: Changing view acls to: root
20/08/05 10:49:52 INFO SecurityManager: Changing modify acls to: root
20/08/05 10:49:52 INFO SecurityManager: Changing view acls groups to: 
20/08/05 10:49:52 INFO SecurityManager: Changing modify acls groups to: 
20/08/05 10:49:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/08/05 10:49:53 INFO Utils: Successfully started service 'sparkDriver' on port 46875.
20/08/05 10:49:53 INFO SparkEnv: Registering MapOutputTracker
20/08/05 10:49:53 INFO SparkEnv: Registering BlockManagerMaster
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/08/05 10:49:53 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b7fad649-d7d6-4b2e-b2c9-f54444e2fd22
20/08/05 10:49:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/08/05 10:49:53 INFO SparkEnv: Registering OutputCommitCoordinator
20/08/05 10:49:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/08/05 10:49:53 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://100.64.32.16:4040
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://my-spark-master-svc:7077...
20/08/05 10:49:53 INFO TransportClientFactory: Successfully created connection to my-spark-master-svc/172.20.99.118:7077 after 39 ms (0 ms spent in bootstraps)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200805104953-0007
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200805104953-0007/0 on worker-20200805082629-100.64.40.6-37063 (100.64.40.6:37063) with 2 core(s)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20200805104953-0007/0 on hostPort 100.64.40.6:37063 with 2 core(s), 1024.0 MB RAM
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200805104953-0007/1 on worker-20200805082549-100.64.8.0-42223 (100.64.8.0:42223) with 2 core(s)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: Granted executor ID app-20200805104953-0007/1 on hostPort 100.64.8.0:42223 with 2 core(s), 1024.0 MB RAM
20/08/05 10:49:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39367.
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200805104953-0007/0 is now RUNNING
20/08/05 10:49:53 INFO NettyBlockTransferService: Server created on 100.64.32.16:39367
20/08/05 10:49:53 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200805104953-0007/1 is now RUNNING
20/08/05 10:49:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/08/05 10:49:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManagerMasterEndpoint: Registering block manager 100.64.32.16:39367 with 366.3 MB RAM, BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 100.64.32.16, 39367, None)
20/08/05 10:49:53 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/08/05 10:49:53 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.

Some remarks:
The exact PySpark equivalent runs fine on the cluster (Python 3.7.6 and PySpark 2.4.4);
Running against a local Spark instead of the cluster also works fine;
To handle the NativeCodeLoader warning I appended the native library path in $SPARK_HOME/conf/spark-env.sh: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native; this neither silenced the warning nor fixed the error above.
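
Given the "some configuration may not take effect" warning, one quick sanity check is to ask the live context which values it actually holds (a minimal sketch; the lookups assume Spark's usual behavior of stripping the spark.hadoop. prefix when populating the Hadoop configuration):

// Keys set as "spark.hadoop.fs.s3a.*" surface here as "fs.s3a.*".
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("fs.s3a.aws.credentials.provider"))
println(hadoopConf.get("fs.s3a.access.key"))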


Answer 1 · 9q78igpj

This looks like a classpath problem: a different Hadoop version is being used at runtime. Can you double-check which Hadoop libraries actually end up among your dependencies?
Actually, we just found this related thread: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
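
StorageStatistics only exists since Hadoop 2.8, so a NoClassDefFoundError for it usually means an older hadoop-common is being picked up. A quick way to see which Hadoop the driver actually resolved (a sketch using standard Hadoop and JVM introspection):

import org.apache.hadoop.util.VersionInfo

// The Hadoop version the running JVM resolved.
println(VersionInfo.getVersion)

// The jar org.apache.hadoop.fs.FileSystem was loaded from
// (getCodeSource can be null for bootstrap classes, but not for Hadoop jars).
println(classOf[org.apache.hadoop.fs.FileSystem]
  .getProtectionDomain.getCodeSource.getLocation)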
