
Posted by bhmjp9jg on 2021-05-29 in Hadoop


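I'm trying to write a mapper/reducer pair for Hadoop that counts the words in tweets, but I've run into a problem. My input is a JSON file of collected tweet data. I first set the default encoding to UTF-8, but when I run the code I get the following error: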
Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The code for the program is:


#!/usr/bin/python
import sys
import json
import string
reload(sys)
sys.setdefaultencoding('utf8')
stop_words = ['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 "can't",
 'cannot',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'yourselves']
numbers = ["0","1","2","3","4","5","6","7","8","9"]
def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word
def dont_stop(word):
    # Keep the word only if it is non-empty and not a stop word.
    return word != "" and word not in stop_words
# input comes from STDIN (standard input)
for line in sys.stdin:
############ 
############ 
############ 
############ 
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()
############ 
############ 
############ 
############ 
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        ##################
        ##################
        word = clean_word(word)
        ##################
        ##################
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        ##################
        ##################
        if dont_stop(word):
            print '%s\t%s' % (word, 1)
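
One plausible cause of the ValueError, which the traceback alone does not confirm, is that some line on stdin is blank or is not a complete JSON object. The following is a minimal sketch of a guard for that case, not a confirmed fix. The helper name `extract_text` is hypothetical, and it assumes each usable record is a JSON object on a single line with a `text` field, as the mapper above expects.

```python
import json


def extract_text(line):
    """Return the lower-cased tweet text, or None if the line is unusable.

    Hypothetical helper: assumes one JSON object per line with a 'text' key.
    """
    line = line.strip()
    if not line:
        # A blank line is not valid JSON, so skip it.
        return None
    try:
        record = json.loads(line)
    except ValueError:
        # json.loads raises ValueError (or a subclass) on malformed input.
        return None
    # Valid JSON may still not be an object, or may lack a 'text' field.
    text = record.get('text') if isinstance(record, dict) else None
    if text is None:
        return None
    return text.lower()
```

In the mapper loop, a line for which `extract_text` returns None would simply be skipped with `continue` instead of stopping the whole job.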

When I do not switch the encoding (i.e., with reload(sys) and sys.setdefaultencoding() commented out), I get the following error instead:
Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)
I'm not sure how to solve this; any help would be appreciated.
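
The UnicodeEncodeError message indicates that `print` is encoding a unicode string with the ASCII codec, which is what Python 2 does when stdout is a pipe with no declared encoding. One way around this without changing the default encoding is to encode each output record to UTF-8 bytes explicitly. The sketch below illustrates that idea under the assumption that `word` is a unicode string; `format_record` and `emit` are hypothetical names, not part of the original mapper.

```python
import sys


def format_record(word, count):
    """Build one tab-delimited output record as UTF-8 encoded bytes."""
    return (u'%s\t%d\n' % (word, count)).encode('utf-8')


def emit(word, count, stream=None):
    """Write the record as raw bytes, avoiding any implicit ASCII encoding."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout already accepts byte strings.
        stream = getattr(sys.stdout, 'buffer', sys.stdout)
    stream.write(format_record(word, count))
```

In the mapper, `emit(word, 1)` would take the place of the `print` statement, so a character such as u'\u2026' is written as its three UTF-8 bytes rather than being rejected.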

hadoop mapreduce JSON python Ascii

Source: https://stackoverflow.com/questions/47757464/valueerrorno-json-object-could-be-decoded-using-python-2-6-and-utf-8