Unexpected arguments error on the command line when running a MapReduce job with Python (mrjob)

wh6knrhe · posted 2021-05-31 · in Hadoop

I'm fairly new to this process. I'm trying to run a simple MapReduce job with Python 3.8 over a CSV file on a local Hadoop cluster (Hadoop version 3.2.1), currently running on Windows 10 (64-bit). The goal is to process a CSV file and get output listing the top 10 salaries, but it doesn't work.
When I enter this command:

    $ python test2.py hdfs:///sample/salary.csv -r hadoop --hadoop-streaming-jar %HADOOP_HOME%/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar
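(For reference, the runner options on that command line can also be kept in an mrjob.conf file so they don't have to be repeated on every invocation. A minimal sketch using mrjob's documented hadoop runner option; the jar path is the expanded form of %HADOOP_HOME% from the log below:

    runners:
      hadoop:
        hadoop_streaming_jar: C:\hdp\hadoop\hadoop-dist\target\hadoop-3.2.1\share\hadoop\tools\lib\hadoop-streaming-3.2.1.jar

With that in place the job would be started with just `python test2.py hdfs:///sample/salary.csv -r hadoop`.)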

The output reports an error:

    No configs found; falling back on auto-configuration
    No configs specified for hadoop runner
    Looking for hadoop binary in C:\hdp\hadoop\hadoop-dist\target\hadoop-3.2.1\bin...
    Found hadoop binary: C:\hdp\hadoop\hadoop-dist\target\hadoop-3.2.1\bin\hadoop.CMD
    Using Hadoop version 3.2.1
    Creating temp directory C:\Users\Name\AppData\Local\Temp\test2.Name.20200813.003240.345552
    uploading working dir files to hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd...
    Copying other local files to hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/
    Running step 1 of 1...
    Found 2 unexpected arguments on the command line [hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh, hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py]
    Try -help for more information
    Streaming Command Failed!
    Attempting to fetch counters from logs...
    Can't fetch history log; missing job ID
    No counters found
    Scanning logs for probable cause of failure...
    Can't fetch history log; missing job ID
    Can't fetch task logs; missing application ID
    Step 1 of 1 failed: Command '['C:\\hdp\\hadoop\\hadoop-dist\\target\\hadoop-3.2.1\\bin\\hadoop.CMD', 'jar', 'C:\\hdp\\hadoop\\hadoop-dist\\target\\hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar', '-files', 'hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py', '-input', 'hdfs:///sample/salary.csv', '-output', 'hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --mapper', '-combiner', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --combiner', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --reducer']' returned non-zero exit status 1.

Here is the error from the output above:

    Found 2 unexpected arguments on the command line [hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh, hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py]
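(Note that the two "unexpected arguments" are exactly the second and third entries of the comma-separated `-files` value mrjob passes to the streaming jar, as shown in the failing command in the log above, which suggests the list is being split into separate arguments somewhere between hadoop.CMD and the jar's option parser. One way to narrow this down is to run a stripped-down streaming command by hand and see whether a quoted `-files` value parses cleanly. This is a diagnostic sketch only, with paths abbreviated; it omits mrjob.zip and setup-wrapper.sh, so the job itself won't complete unless mrjob is installed on the cluster nodes:

    hadoop jar %HADOOP_HOME%\share\hadoop\tools\lib\hadoop-streaming-3.2.1.jar ^
      -files "hdfs:///user/Name/.../wd/test2.py#test2.py" ^
      -input hdfs:///sample/salary.csv ^
      -output hdfs:///user/Name/streaming-test-output ^
      -mapper "python3 test2.py --step-num=0 --mapper" ^
      -reducer "python3 test2.py --step-num=0 --reducer"
)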

This is the Python file, test2.py:

    from mrjob.job import MRJob
    from mrjob.step import MRStep
    import csv

    cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

    class salarymax(MRJob):

        def mapper(self, _, line):
            # Convert each line into a dictionary
            # (next() built-in: csv reader objects have no .next() method on Python 3)
            row = dict(zip(cols, [a.strip() for a in next(csv.reader([line]))]))
            # Yield the salary, stripping the leading '$'
            yield 'salary', (float(row['AnnualSalary'][1:]), line)
            # Yield the gross pay, which may be missing
            try:
                yield 'gross', (float(row['GrossPay'][1:]), line)
            except ValueError:
                self.increment_counter('warn', 'missing gross', 1)

        def reducer(self, key, values):
            topten = []
            # For 'salary' and 'gross' keep a running top 10
            for p in values:
                topten.append(p)
                topten.sort()
                topten = topten[-10:]
            for p in topten:
                yield key, p

        combiner = reducer

    if __name__ == '__main__':
        salarymax.run()
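(As a sanity check that is independent of Hadoop, the same script can be exercised with mrjob's inline runner, either as `python test2.py salary.csv` against a local copy of the file, or programmatically. A minimal sketch; the sample row is made up to match the column list above, and the snippet is a hypothetical helper assumed to sit next to test2.py:

    # test_local.py -- hypothetical helper, not part of the original setup
    from io import BytesIO

    from test2 import salarymax

    # '-' tells mrjob to read input from stdin; --no-conf ignores any mrjob.conf
    job = salarymax(['--no-conf', '-'])
    job.sandbox(stdin=BytesIO(b'John Doe,Clerk,A01,Some Agency,2005-01-01,$35000.00,$36000.00\n'))

    with job.make_runner() as runner:
        runner.run()
        for key, value in job.parse_output(runner.cat_output()):
            print(key, value)

If this runs cleanly, the problem lies in the hadoop runner invocation rather than in the job logic.)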

I have already looked at the Stack Overflow question "How do I run a mrjob with Hadoop streaming in a local Hadoop cluster?", but it did not resolve my error.
I have also checked the setup-wrapper.sh file, since the error points at it, and it looks fine to me.
I don't understand what the error is. Is there a way to fix it?

No answers yet!
