Unable to fix UnknownHostException when reading a CSV file from an HDFS directory

bmp9r5qi · posted 2021-05-27 · Spark

My Spark program runs on one server, serverA, where I execute the code from a pyspark shell. From this program I am trying to read a CSV file that lives on a cluster set up on a different server: server serverB, HDFS cluster clusterB, as follows:

spark = SparkSession.builder.master('yarn').appName("Detector") \
    .config('spark.app.name', 'dummy_App') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.yarn.keytab', '/home/testuser/testuser.keytab') \
    .config('spark.yarn.principal', 'krbtgt/HADOOP.NAME.COM@NAME.COM') \
    .config('spark.executor.instances', '1') \
    .config('hadoop.security.authentication', 'kerberos') \
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://clusterB') \
    .config('spark.yarn.principal', 'testuser@NAME.COM') \
    .getOrCreate()
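
Note: hdfs://clusterB is a logical HDFS nameservice rather than a DNS hostname, so the client on serverA also needs the nameservice defined in its own Hadoop configuration. Below is a minimal sketch of supplying that definition through spark.hadoop.* properties, assuming clusterB is an HA nameservice; the namenode hosts nn1.cluster-b.example.com / nn2.cluster-b.example.com and port 8020 are placeholder assumptions, not values from this setup.

# Minimal sketch (assumptions noted above): make the remote nameservice
# "clusterB" known to this client. Namenode host names are hypothetical.
spark = SparkSession.builder.master('yarn').appName('Detector') \
    .config('spark.hadoop.dfs.nameservices', 'clusterB') \
    .config('spark.hadoop.dfs.ha.namenodes.clusterB', 'nn1,nn2') \
    .config('spark.hadoop.dfs.namenode.rpc-address.clusterB.nn1', 'nn1.cluster-b.example.com:8020') \
    .config('spark.hadoop.dfs.namenode.rpc-address.clusterB.nn2', 'nn2.cluster-b.example.com:8020') \
    .config('spark.hadoop.dfs.client.failover.proxy.provider.clusterB', 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider') \
    .getOrCreate()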

The file I am trying to read resides on the cluster clusterB:

(base) testuser@hdptetl:[~] {46} $ hadoop fs -df -h
Filesystem          Size     Used  Available  Use%
hdfs://clusterB  787.3 T  554.5 T    230.7 T   70%

The keytab details I referenced in the Spark config (the keytab path and the KDC realm) live on server serverB. When I try to load the file as:

csv_df = spark.read.format('csv').load('hdfs://clusterB/test/mr/wc.txt')

the code results in an UnknownHostException, as follows:

>>> tdf = spark.read.format('csv').load('hdfs://clusterB/test/mr/wc.txt')
20/07/15 15:40:36 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: clusterB'
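
This exception usually means the driver's Hadoop configuration can resolve clusterB as neither a hostname nor a defined nameservice. A small diagnostic sketch for checking what the running session actually knows follows; spark.sparkContext._jsc is an internal PySpark handle, not a public API.

# Diagnostic sketch: print what the session's Hadoop configuration knows
# about the clusterB nameservice; None suggests the client-side
# nameservice definition is missing. _jsc is PySpark-internal.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
print(hconf.get('dfs.nameservices'))           # expect: clusterB (possibly among others)
print(hconf.get('dfs.ha.namenodes.clusterB'))  # expect: the namenode ids, e.g. nn1,nn2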

Can anyone tell me what mistake I am making here and how I can correct it?
