
bhmjp9jg, posted on 2021-05-29 in Hadoop

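I am trying to write a mapper/reducer pair for Hadoop that counts the words in tweets, but I have run into a problem. My input is a JSON file of collected tweet data. I first set the default encoding to UTF-8, but when I run the code I get the following error: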
Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Here is the code for the program:

#!/usr/bin/python
import sys
import json
import string

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = ['a',
              'about',
              'above',
              'after',
              'again',
              'against',
              'all',
              'am',
              'an',
              'and',
              'any',
              'are',
              "aren't",
              'as',
              'at',
              'be',
              'because',
              'been',
              'before',
              'being',
              'below',
              'between',
              'both',
              'but',
              'by',
              "can't",
              'cannot',
              'could',
              "couldn't",
              'did',
              "didn't",
              'do',
              'does',
              "doesn't",
              'yourselves']
numbers = ["0","1","2","3","4","5","6","7","8","9"]

def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word

def dont_stop(word):
    if word in stop_words or word == "":
        return False
    else:
        return True

# input comes from STDIN (standard input)
for line in sys.stdin:
    ############
    ############
    ############
    ############
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()
    ############
    ############
    ############
    ############
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        ##################
        ##################
        word = clean_word(word)
        ##################
        ##################
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        ##################
        ##################
        if dont_stop(word):
            print '%s\t%s' % (word, 1)

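When I do not switch the encoding (i.e., when I comment out reload(sys) and sys.setdefaultencoding()), I get the following error instead:
Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)
I am not sure how to solve this. Any help is appreciated.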

qmelpv7a  #1

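See the discussion here: "Setting the correct encoding when piping stdout in Python".
Your error comes from printing a unicode string to the output. When stdout is a pipe rather than a terminal (as it is under Hadoop streaming), Python 2 cannot detect the output encoding and falls back to ASCII, so a character such as u'\u2026' cannot be written. Encoding the string explicitly before printing it, for example with word.encode('utf-8'), avoids the error.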
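
As an illustration of that advice, here is a minimal sketch of a mapper core that encodes every record to UTF-8 itself instead of relying on the default encoding. The function names (format_pair, words_from_tweet, run_mapper) are hypothetical and not from the question, and the stop-word and punctuation filtering is left out for brevity. Skipping lines that do not parse as JSON is only an assumption about the cause of the first traceback (for instance blank lines in the input file); the question does not confirm it.

```python
import json


def format_pair(word, count=1):
    """Return one tab-delimited output record as UTF-8 encoded bytes."""
    return (u"%s\t%d\n" % (word, count)).encode("utf-8")


def words_from_tweet(line):
    """Return the lowercased words of a tweet's 'text' field.

    Returns an empty list when the line is not a JSON object with a
    'text' key. Treating such lines as skippable is an assumption
    about the ValueError in the question (e.g. blank input lines).
    """
    try:
        tweet = json.loads(line)
    except ValueError:
        return []
    if not isinstance(tweet, dict) or "text" not in tweet:
        return []
    return tweet["text"].lower().split()


def run_mapper(lines, out):
    """Write one encoded 'word<TAB>1' record per word to a byte stream."""
    for line in lines:
        for word in words_from_tweet(line):
            out.write(format_pair(word))
```

In a streaming job this could be driven with run_mapper(sys.stdin, getattr(sys.stdout, "buffer", sys.stdout)) after importing sys: Python 3 exposes the byte stream as sys.stdout.buffer, while in Python 2 sys.stdout already accepts byte strings. Because the bytes are produced explicitly, neither reload(sys) nor sys.setdefaultencoding() should be needed.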
