我必须从一些输入数据中得到不同的统计数据。这是我想的Map
label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record
for line in sys.stdin:
line = line.strip()
line2 = line.split('|')
lang = line2[-1]
total_docs_counter +=1
if lang=='-':
no_lang_info +=1
continue
doc_url = line2[0]
lang = lang .split('&')
max_lang = ""
max_percent = 0
for num, ln in enumerate(lang):
if num==0:
continue
tmp = ln.split('-')
lang_id = tmp[0]
# update_lang_statistics(total_lang_record, lang_id)
print "%s\t%s\t1" %(label_1, lang_id)
print "%s\t%s\t1" % (label_2, max_lang)
我为每个输出都提供了一个标签,以便在reducer中检测它属于哪个类别。然后在reducer中通过简单的条件表达式,得到所需的输出。很明显,内环的输出要比外环的输出多。我在本地做了个测试
cat input | ./mapper.py | sort | ./reducer.py
它工作正常,但当我在hadoop中为一些数据运行此作业时,似乎没有发生任何reducer操作,即输出文件包含Map器输出。Map器输出未正确还原。我的Map是非常简单的只是每个类别分别求和。问题出在哪里。有没有其他更好的方法来做这项工作?
这是减速机代码
def counting_reducer(line, temp, count, label):
label,name, freq = line.split("\t")
freq = int(freq)
if name == temp:
count += freq
else:
if temp:
print '%s\t%s\t%s' % (label, temp, count)
count = freq
temp = name
return [temp, count]
label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"
# TMP variables
temp_l1 = None
count_l1 = 0
temp_l2 = None
count_l2 = 0
temp_l3 = None
count_l3 = 0
temp_l4 = None
count_l4 = 0
urdu_bin_record = {}
skip = 0
for line in sys.stdin:
line = line.strip()
if line.startswith(label_1):
temp_l1, count_l1 = counting_reducer(line, temp_l1, count_l1, label_1)
elif line.startswith(label_2):
temp_l2, count_l2 = counting_reducer(line, temp_l2, count_l2, label_2)
else:
print line
暂无答案!
目前还没有任何答案,快来回答吧!