Unable to read a file from HDFS using Spark

q9yhzks0 · posted 2021-05-30 in Hadoop

I installed Cloudera CDH 5 using Cloudera Manager.
I can easily run

hadoop fs -ls /input/war-and-peace.txt
hadoop fs -cat /input/war-and-peace.txt

The commands above print the whole text file to the console.
Now I open a Spark shell and run

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
textFile.count

Now I get an error:

Spark context available as sc.

scala> val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
2014-12-14 15:14:57,874 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177621) called with curMem=0, maxMem=278302556
2014-12-14 15:14:57,877 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 173.5 KB, free 265.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = hdfs://input/war-and-peace.txt MappedRDD[1] at textFile at <console>:12

scala> textFile.count
2014-12-14 15:15:21,791 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 0 time(s); maxRetries=45
2014-12-14 15:15:41,905 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 1 time(s); maxRetries=45
[... the same "Retrying connect to server: input/92.242.140.21:8020" message repeats for tries 2 through 26 ...]
2014-12-14 15:24:23,250 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 27 time(s); maxRetries=45
java.net.ConnectException: Call From dn1home/192.168.1.21 to input:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1415)

Why do I get this error? I can read the same file with the hadoop commands.

kqlmhetl #1

I'm also using CDH 5. For me the full path hdfs://nn1home:8020 did not work, for some strange reason, even though most examples show a path like that.
I used this command instead
我用了这个命令

val textFile = sc.textFile("hdfs:/input1/Card_History2016_3rdFloor.csv")

Output of the command above:

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:22

textFile.count

res1: Long = 58973

This works fine for me.

vvppvyoh #2

This will work:

val textFile = sc.textFile("hdfs://localhost:9000/user/input.txt")

Here, localhost:9000 comes from the fs.defaultFS property in the Hadoop core-site.xml configuration file.
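
To avoid hard-coding the authority, the fs.defaultFS value can be combined with an absolute HDFS path using plain java.net.URI resolution. This is a minimal sketch with a hypothetical helper (resolveOnDefaultFs is not a Hadoop API), mimicking what Hadoop itself does when a path has no scheme or authority:

```scala
import java.net.URI

// Hypothetical helper: prepend the cluster's fs.defaultFS scheme and
// authority to an absolute HDFS path.
def resolveOnDefaultFs(defaultFs: String, path: String): String = {
  val base = new URI(defaultFs)
  new URI(base.getScheme, base.getAuthority, path, null, null).toString
}

// Assuming fs.defaultFS is hdfs://localhost:9000, as in this answer.
val full = resolveOnDefaultFs("hdfs://localhost:9000", "/user/input.txt")
// full == "hdfs://localhost:9000/user/input.txt"
```

The resulting string can then be passed straight to sc.textFile.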

gkn4icbw #3

This works for me:

logFile = "hdfs://localhost:9000/sampledata/sample.txt"

3b6akqbq #4

You are not passing a correct URL string.
hdfs:// — the protocol
localhost — the host/IP address (probably different for you, e.g. 127.56.78.4)
54310 — the port number
/input/war-and-peace.txt — the full path to the file you want to load
The final URL should look like this

hdfs://localhost:54310/input/war-and-peace.txt
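
The breakdown above can be checked with java.net.URI, which parses the same components out of the URL (a sketch, using the example URL from this answer):

```scala
import java.net.URI

// Parse the example HDFS URL into the parts described above.
val uri    = new URI("hdfs://localhost:54310/input/war-and-peace.txt")
val scheme = uri.getScheme // "hdfs"       -> protocol
val host   = uri.getHost   // "localhost"  -> NameNode host
val port   = uri.getPort   // 54310        -> NameNode RPC port
val path   = uri.getPath   // "/input/war-and-peace.txt" -> file path
```

If any of these components is wrong (as with the host in the question), the NameNode connection fails.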

f87krz0w #5

If you start Spark with HADOOP_HOME set in spark-env.sh, Spark will know where to look for the HDFS configuration files.
In that case Spark already knows the location of your namenode/datanode, and the following alone works fine to access HDFS files:

sc.textFile("/myhdfsdirectory/myfiletoprocess.txt")

You can create myhdfsdirectory like this:

hdfs dfs -mkdir /myhdfsdirectory

and from the local file system you can move myfiletoprocess.txt into the HDFS directory with:

hdfs dfs -copyFromLocal mylocalfile /myhdfsdirectory/myfiletoprocess.txt

pgvzfuti #6

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
conf.set("fs.defaultFS", "hdfs://hostname:9000")
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://hostname:9000/hdfspath/")
data.saveAsTextFile("C:\\dummy\\")

The code above reads all the HDFS files in the directory and saves them locally in the C:\dummy folder.

lc8prwob #7

Get the fs.defaultFS URL from core-site.xml (/etc/hadoop/conf) and read the file as below. In my case fs.defaultFS is hdfs://quickstart.cloudera:8020

txtfile = sc.textFile('hdfs://quickstart.cloudera:8020/user/cloudera/rddoutput')
txtfile.collect()

falq053o #8

If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path, which in your case would be "nn1home:8020/.."
If you want to keep it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt") with only one /
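
The difference between the two forms can be seen by parsing them as URIs (a sketch; this is exactly why the question's command tried to connect to a host literally named "input"):

```scala
import java.net.URI

// With two slashes, the first path segment is parsed as the NameNode host,
// which matches the error in the question:
// "Retrying connect to server: input/92.242.140.21:8020".
val twoSlashes = new URI("hdfs://input/war-and-peace.txt")
// twoSlashes.getHost == "input"

// With a single slash there is no authority at all, so Hadoop falls back
// to fs.defaultFS and the whole string stays part of the path.
val oneSlash = new URI("hdfs:/input/war-and-peace.txt")
// oneSlash.getHost == null
// oneSlash.getPath == "/input/war-and-peace.txt"
```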

l5tcr1uw #9

This is the solution:

sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")

How did I find out nn1home:8020?
Just search for the file core-site.xml and look for the xml element fs.defaultFS
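
That lookup can also be done programmatically with the JDK's built-in XML parser. A sketch, with the core-site.xml content inlined as a string for illustration (in practice you would parse the real file under $HADOOP_HOME/etc/hadoop):

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Minimal core-site.xml fragment, inlined here instead of read from disk.
val coreSite =
  """<configuration>
    |  <property>
    |    <name>fs.defaultFS</name>
    |    <value>hdfs://nn1home:8020</value>
    |  </property>
    |</configuration>""".stripMargin

val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  .parse(new ByteArrayInputStream(coreSite.getBytes("UTF-8")))

// Scan every <property> and pick the <value> whose <name> is fs.defaultFS.
val props = doc.getElementsByTagName("property")
val defaultFs = (0 until props.getLength).map(props.item).collectFirst {
  case p: Element
    if p.getElementsByTagName("name").item(0).getTextContent == "fs.defaultFS" =>
    p.getElementsByTagName("value").item(0).getTextContent
}
// defaultFs == Some("hdfs://nn1home:8020")
```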

zkure5ic #10

This may be a problem with the file path or URL, or with the HDFS port.
Solution: first open the core-site.xml file from $HADOOP_HOME/etc/hadoop and check the value of the fs.defaultFS property. Let's say the value is hdfs://localhost:9000 and the file's location in HDFS is /home/usr/abc/fileName.txt. Then the file URL will be hdfs://localhost:9000/home/usr/abc/fileName.txt, and the following command reads the file from HDFS:

var result= scontext.textFile("hdfs://localhost:9000/home/usr/abc/fileName.txt", 2)
