如何在单个hadoop流作业中处理多个统计信息

b09cbbtk  于 2021-06-02  发布在  Hadoop
关注(0)|答案(0)|浏览(240)

我必须从一些输入数据中得到不同的统计数据。这是我想的Map

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record

for line in sys.stdin:
    line = line.strip()
    line2 = line.split('|')
    lang = line2[-1]
    total_docs_counter +=1

    if lang=='-':
        no_lang_info +=1
        continue

    doc_url = line2[0]
    lang = lang .split('&')
    max_lang = ""
    max_percent = 0

    for num, ln in enumerate(lang):
        if num==0:
            continue
        tmp = ln.split('-')
        lang_id = tmp[0]
        # update_lang_statistics(total_lang_record, lang_id)
        print "%s\t%s\t1" %(label_1, lang_id)
    print "%s\t%s\t1" % (label_2, max_lang)

我为每个输出都提供了一个标签,以便在reducer中检测它属于哪个类别。然后在reducer中通过简单的条件表达式,得到所需的输出。很明显,内环的输出要比外环的输出多。我在本地做了个测试

cat input | ./mapper.py | sort | ./reducer.py

它工作正常,但当我在hadoop中为一些数据运行此作业时,似乎没有发生任何reducer操作,即输出文件包含Map器输出。Map器输出未正确还原。我的Map是非常简单的只是每个类别分别求和。问题出在哪里。有没有其他更好的方法来做这项工作?
这是减速机代码

def counting_reducer(line, temp, count, label):
    label,name, freq = line.split("\t")
    freq = int(freq)
    if name == temp:
        count += freq
    else:
        if temp:
            print '%s\t%s\t%s' % (label, temp, count)
        count = freq
        temp = name

    return [temp, count]

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

# TMP variables

temp_l1 = None
count_l1 = 0
temp_l2 = None
count_l2 = 0
temp_l3 = None
count_l3 = 0
temp_l4 = None
count_l4 = 0

urdu_bin_record = {}

skip = 0
for line in sys.stdin:
    line = line.strip()
    if line.startswith(label_1):
        temp_l1, count_l1 = counting_reducer(line, temp_l1, count_l1, label_1)
    elif line.startswith(label_2):
        temp_l2, count_l2 = counting_reducer(line, temp_l2, count_l2, label_2)

    else:
        print line

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题