Processing multiple files in HDFS via Python

Posted by xriantvc on 2021-05-29 in Hadoop


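I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script, "processxml.py", that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to local first in order to do so?
For example, when I run the script on files in a local directory I have: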

cd /path/to/files

for file in *.xml
do
    python /path/processxml.py "$file" > "/path2/$file"
done

So basically, how do I do the same thing, but this time with the files in HDFS?

hadoop hdfs python scripting

Source: https://stackoverflow.com/questions/35070998/processing-multiple-files-in-hdfs-via-python

2 Answers


wztqucjr1#

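You basically have two options:
1) Use the Hadoop Streaming connector to create a MapReduce job (here you only need the map part). Run the following command from the shell or from inside a shell script: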

hadoop jar <the location of the streamlib> \
        -D mapred.job.name=<name for the job> \
        -input /hdfs/input/dir \
        -output /hdfs/output/dir \
        -file your_script.py \
        -mapper "python your_script.py" \
        -numReduceTasks 0

2) Create a Pig script and ship your Python code with it. Here is a basic example of the script:

input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` SHIP('/path/to/your_script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
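
Note that with both options the script receives its input on standard input, one record per line, instead of a file name argument as in the local loop. A minimal sketch of how your_script.py might be adapted, assuming every input line holds one complete XML element (multi-line XML documents would not arrive whole this way). The names process_record and run are hypothetical, and the tag/text output is only a placeholder for the real processing:

```python
import sys
import xml.etree.ElementTree as ET


def process_record(line):
    """Hypothetical stand-in for the real logic in processxml.py.

    Assumes the line is one complete XML element and returns
    its tag and text separated by a tab.
    """
    elem = ET.fromstring(line)
    return "{}\t{}".format(elem.tag, elem.text or "")


def run(stream_in, stream_out):
    """Read records line by line and write one output line per record."""
    for line in stream_in:
        if line.strip():  # skip blank lines
            stream_out.write(process_record(line) + "\n")


# When used as the mapper, the script would end with:
#     run(sys.stdin, sys.stdout)
```

Passing the streams in as arguments keeps the logic easy to try locally, for example by piping one file into the script before submitting the job.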

2021-05-30

k10s72fa2#

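If you need to process the data in the files, or move/cp/rm/etc. them around the file system, then PySpark (Spark with a Python interface) would be one of the best options (speed, memory).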
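
A rough sketch of what that could look like, assuming each XML file is small enough to be read whole into memory. SparkContext.wholeTextFiles returns one (path, content) pair per file, which fits the one-file-at-a-time style of processxml.py. The names process_xml and run_job and the element-counting logic are hypothetical placeholders, not the original script:

```python
import xml.etree.ElementTree as ET


def process_xml(content):
    """Hypothetical stand-in for processxml.py: count the elements in one document."""
    root = ET.fromstring(content)
    return sum(1 for _ in root.iter())


def run_job(input_dir, output_dir):
    """Process every file under input_dir and save one result line per file."""
    # Imported here so process_xml can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="process-xml")
    try:
        # One (path, content) pair per file in the directory.
        files = sc.wholeTextFiles(input_dir)
        results = files.map(
            lambda kv: "{}\t{}".format(kv[0], process_xml(kv[1]))
        )
        # The output directory must not exist yet.
        results.saveAsTextFile(output_dir)
    finally:
        sc.stop()
```

The job would be started with something like run_job("/hdfs/input/dir", "/hdfs/output/dir") in a script launched through spark-submit; both paths are placeholders.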

2021-05-29