My Spark program runs on a server, serverA, where I am executing the code in a pyspark shell. From this program I am trying to read a CSV file that lives on an HDFS cluster, clusterB, which is set up on another server, serverB, as follows:
spark = SparkSession.builder \
    .master('yarn') \
    .appName("Detector") \
    .config('spark.app.name', 'dummy_App') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.yarn.keytab', '/home/testuser/testuser.keytab') \
    .config('spark.yarn.principal', 'krbtgt/HADOOP.NAME.COM@NAME.COM') \
    .config('spark.executor.instances', '1') \
    .config('hadoop.security.authentication', 'kerberos') \
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://clusterB') \
    .config('spark.yarn.principal', 'testuser@NAME.COM') \
    .getOrCreate()
The file I am trying to read is on clusterB:
(base) testuser@hdptetl:[~] {46} $ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://clusterB 787.3 T 554.5 T 230.7 T 70%
The keytab details I referenced in the Spark config (the keytab path and the KDC realm) live on serverB.
When I try to load the file as:
csv_df = spark.read.format('csv').load('hdfs://botest01/test/mr/wc.txt')
the code fails with an UnknownHostException, as follows:
>>> tdf = spark.read.format('csv').load('hdfs://clusterB/test/mr/wc.txt')
20/07/15 15:40:36 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: clusterB'
Can anyone tell me what mistake I am making here, and how can I fix it?
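One note on the symptom: `java.net.UnknownHostException: clusterB` is typically raised when the client JVM treats `clusterB` as a literal host name, which suggests the HDFS client configuration on serverA does not define `clusterB` as a logical (HA) nameservice. A minimal sketch of the `hdfs-site.xml` entries that would make the name resolvable on the client side is shown below; the NameNode host names `nn1.name.com` and `nn2.name.com` and the HA layout are assumptions for illustration, not details from the question.

<!-- Hypothetical hdfs-site.xml entries on serverA (host names are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>clusterB</value>
</property>
<property>
  <name>dfs.ha.namenodes.clusterB</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.nn1</name>
  <value>nn1.name.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.nn2</name>
  <value>nn2.name.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The same properties can instead be passed to the session with the `spark.hadoop.` prefix (e.g. `.config('spark.hadoop.dfs.nameservices', 'clusterB')`) if editing the client's Hadoop config files is not an option.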