I am trying to learn how to write Hadoop programs in Python by following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
This is mapper.py:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
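I don't understand the use of yield. read_input produces one line at a time. However, main calls read_input only once, which would correspond to the first line of the file. How do the remaining lines get read?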
1 Answer
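In effect, main runs the body of read_input several times. The single call read_input(sys.stdin) does not read anything yet; it returns a generator, which main stores in data. On each pass through the for loop, that generator is resumed, runs until it reaches the next yield, and the yielded value is assigned to words. Basically, for words in data is shorthand for "ask data for its next value, assign it to words, then execute the loop block", repeated until the generator has no more lines to yield.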
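To make this concrete, here is a minimal sketch (not part of the tutorial) that writes out by hand roughly what the for loop does. It is written for Python 3, the helper names collect_with_for and collect_with_next are made up for illustration, and io.StringIO stands in for sys.stdin so the example can run without piped input.

```python
import io


def read_input(file):
    # Same generator as in mapper.py: yields one list of words per line.
    for line in file:
        yield line.split()


def collect_with_for(file):
    # The ordinary for loop, as used in main().
    result = []
    for words in read_input(file):
        result.append(words)
    return result


def collect_with_next(file):
    # Roughly what the for loop does behind the scenes.
    data = read_input(file)      # called once; no line has been read yet
    iterator = iter(data)        # a generator is its own iterator
    result = []
    while True:
        try:
            words = next(iterator)   # resume read_input until its next yield
        except StopIteration:        # raised once the input is exhausted
            break
        result.append(words)
    return result


# io.StringIO is a stand-in for sys.stdin (an assumption for this demo).
sample = "foo bar\nbaz\n"
print(collect_with_for(io.StringIO(sample)))   # [['foo', 'bar'], ['baz']]
print(collect_with_next(io.StringIO(sample)))  # [['foo', 'bar'], ['baz']]
```

Both functions return the same list: read_input is called only once in each, but next is called once per line, and every call resumes the generator from where the previous yield left off.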