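I'm trying to write a mapper/reducer pair for Hadoop that counts the words in tweets, but I've run into a problem. My input is a JSON file of collected tweet data. I first set the default encoding to UTF-8, but when I run the code I get the following error: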
Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Here is the program's code:
#!/usr/bin/python
import sys
import json
import string
reload(sys)
sys.setdefaultencoding('utf8')
stop_words = ['a',
'about',
'above',
'after',
'again',
'against',
'all',
'am',
'an',
'and',
'any',
'are',
"aren't",
'as',
'at',
'be',
'because',
'been',
'before',
'being',
'below',
'between',
'both',
'but',
'by',
"can't",
'cannot',
'could',
"couldn't",
'did',
"didn't",
'do',
'does',
"doesn't",
'yourselves']
numbers = ["0","1","2","3","4","5","6","7","8","9"]
def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word
def dont_stop(word):
    if word in stop_words or word == "":
        return False
    else:
        return True
# input comes from STDIN (standard input)
for line in sys.stdin:
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        word = clean_word(word)
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if dont_stop(word):
            print '%s\t%s' % (word, 1)
When I don't switch the encoding (i.e., when I comment out reload(sys) and sys.setdefaultencoding()), I get the following error instead:
Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)
I'm not sure how to fix this; any help is appreciated.
1 Answer
See the discussion here: Setting the correct encoding when piping stdout in Python
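Your error comes from trying to print a unicode string to the output. When stdout is piped rather than attached to a terminal, Python 2 falls back to the ascii codec, so a character such as u'\u2026' cannot be encoded. Encoding the string explicitly (for example as UTF-8) before printing it avoids the problem.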
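As an illustration of that idea, here is a minimal sketch of a mapper that encodes each record to UTF-8 itself instead of relying on the default codec. The helper names (format_record, map_line, main) are hypothetical and not from the question, and skipping lines that fail to parse is only an assumption about how the "No JSON object could be decoded" error might be handled; the question does not establish its cause. Stop-word filtering and punctuation cleanup are left out for brevity.

```python
# -*- coding: utf-8 -*-
import json
import sys


def format_record(word, count):
    """Return one tab-delimited mapper record as UTF-8 encoded bytes."""
    return (u"%s\t%d\n" % (word, count)).encode("utf-8")


def map_line(line):
    """Yield encoded (word, 1) records for one line of input."""
    try:
        tweet = json.loads(line)
    except ValueError:
        # Assumption: blank or malformed lines are skipped, not fatal.
        return
    if not isinstance(tweet, dict):
        return
    for word in tweet.get("text", u"").lower().split():
        yield format_record(word, 1)


def main(stdin=None, out=None):
    """Read JSON lines from stdin and write encoded records to stdout."""
    stdin = stdin or sys.stdin
    # Python 3 exposes a byte stream as sys.stdout.buffer; Python 2's
    # sys.stdout already accepts encoded byte strings.
    out = out or getattr(sys.stdout, "buffer", sys.stdout)
    for line in stdin:
        for record in map_line(line):
            out.write(record)
```

Calling main() at the end of the script wires it to standard input and output. Because the bytes are produced explicitly, the output no longer depends on what encoding Python picks for a piped stdout, and the reload(sys) / sys.setdefaultencoding() workaround is not needed.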