Python MapReduce: empty output file from the mapper

cwxwcias, posted on 2021-06-03 in Hadoop

join2_mapper.py

    #!/usr/bin/env python
    import sys
    shows = []
    for line in sys.stdin:
        line = line.strip()
        key_value = line.split(',')
        if key_value[1] == 'ABC':
            if key_value[1] not in shows:
                shows.append(key_value[0])
        if key_value[1].isdigit() and (key_value[0] in shows):
            print('{0}\t{1}'.format(key_value[0], key_value[1]) )

Sample input

    Hourly_Sports,DEF
    Baked_Games,ABC
    Dumb_Talking,ABC
    Surreal_Talking,DEF
    Cold_Sports,BAT
    Hourly_Talking,XYZ
    Baked_Talking,CNO
    PostModern_Games,ABC
    Loud_Talking,DEF
    Almost_News,BAT
    Hot_Talking,XYZ
    Dumb_News,CNO
    Surreal_News,ABC
    Cold_Talking,DEF
    Hourly_Show,BAT
    Baked_Show,XYZ
    PostModern_Talking,CNO
    Loud_Show,ABC
    Almost_Cooking,DEF
    Hot_News,BAT
    Dumb_Cooking,XYZ
    Surreal_Cooking,CNO
    Cold_News,ABC
    Hourly_Sports,DEF
    Baked_Sports,BAT
    PostModern_Show,XYZ
    Loud_Sports,CNO
    Almost_Games,ABC
    Hot_Cooking,DEF
    Dumb_Games,BAT
    Surreal_Games,XYZ
    Cold_Cooking,CNO
    Hourly_Talking,ABC
    Baked_Talking,DEF
    PostModern_Sports,BAT
    Loud_Talking,XYZ
    Almost_Talking,CNO
    Hot_Games,ABC
    Dumb_Talking,DEF
    Surreal_Talking,BAT
    Cold_Games,XYZ
    Hourly_News,CNO
    Baked_News,ABC
    PostModern_Talking,DEF
    Loud_News,BAT
    Almost_Show,XYZ
    Hot_Talking,CNO
    Dumb_Show,ABC
    Surreal_Show,DEF
    Cold_Talking,BAT
    Hourly_Cooking,XYZ
    Baked_Cooking,CNO
    PostModern_News,ABC
    Loud_Cooking,DEF
    Almost_Sports,BAT
    Hot_Show,XYZ
    Dumb_Sports,CNO
    Surreal_Sports,ABC
    Cold_Show,DEF
    Hourly_Games,BAT
    Baked_Games,XYZ
    PostModern_Cooking,CNO
    Loud_Games,ABC
    Almost_Talking,DEF
    Hot_Sports,BAT
    Dumb_Talking,XYZ
    Surreal_Talking,CNO
    Cold_Sports,ABC
    Hourly_Talking,DEF
    Baked_Talking,BAT
    PostModern_Games,XYZ
    Loud_Talking,CNO
    Almost_News,ABC
    Hot_Talking,DEF
    Dumb_News,BAT
    Surreal_News,XYZ
    Cold_Talking,CNO
    Hourly_Show,ABC
    Almost_Cooking,855
    Baked_Games,991
    Baked_News,579
    Baked_Games,200
    Baked_Games,533
    Cold_News,590
    Hourly_Show,896

Running the mapper locally:

    $ cat j2.txt | python join2_mapper.py
    Baked_Games 991
    Baked_News 579
    Baked_Games 200
    Baked_Games 533
    Cold_News 590
    Hourly_Show 896

Running it as a Hadoop Streaming job:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/join2_data/join2_genchan*.txt \
    -input /user/cloudera/join2_data/join2_gennum*.txt \
    -output /user/cloudera/join2_f1f \
    -mapper /home/cloudera/join2_mapper.py \
    -reducer /home/cloudera/join2_reducer.py \
    -numReduceTasks 0

    Map-Reduce Framework
        Map input records=6600
        Map output records=0
        Input split bytes=759
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=4419
        CPU time spent (ms)=9170
        Physical memory (bytes) snapshot=702300160
        Virtual memory (bytes) snapshot=9022578688
        Total committed heap usage (bytes)=364511232
    File Input Format Counters
        Bytes Read=113055
    File Output Format Counters
        Bytes Written=0

The problem is with the input files. I actually have six input files, as shown below:

    $ hdfs dfs -ls /user/cloudera/join2_data/join2_gen*.txt
    -rw-r--r-- 1 cloudera cloudera 1714 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanA.txt
    -rw-r--r-- 1 cloudera cloudera 3430 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanB.txt
    -rw-r--r-- 1 cloudera cloudera 5152 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanC.txt
    -rw-r--r-- 1 cloudera cloudera 17114 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumA.txt
    -rw-r--r-- 1 cloudera cloudera 34245 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumB.txt
    -rw-r--r-- 1 cloudera cloudera 51400 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumC.txt

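When I concatenate all the files into a single file and run the job, it works and I get the desired result. When the input is split across the six files, I get an empty output file. Please advise.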

3phpmpom #1

Didn't you mean `key_value[0]` rather than `key_value[1]` in `if key_value[1] not in shows`?
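A minimal sketch of the mapper logic with that check applied to the show name, assuming the intent is to record each ABC show only once. The function name `map_lines`, the use of a set instead of a list, and the in-memory `sample` are illustrative choices, not part of the original script.

```python
def map_lines(lines):
    """Yield 'show<TAB>count' for numeric rows whose show was seen with channel ABC."""
    shows = set()  # assumption: a set replaces the original list, so duplicates cannot occur
    for line in lines:
        key_value = line.strip().split(',')
        if len(key_value) != 2:
            continue  # skip blank or malformed rows
        show, value = key_value
        if value == 'ABC':
            shows.add(show)  # membership is tracked on key_value[0], the show name
        elif value.isdigit() and show in shows:
            yield '{0}\t{1}'.format(show, value)


# Small in-memory sample taken from the rows in the question.
sample = ['Baked_Games,ABC', 'Almost_Cooking,DEF', 'Almost_Cooking,855', 'Baked_Games,991']
for record in map_lines(sample):
    print(record)
```

In a streaming job the function would be fed `sys.stdin` instead of `sample`. Note that this only removes duplicate entries; it still depends on the ABC rows arriving in the same mapper process before the numeric rows.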

ogq8wdun #2

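Provide just one `-input` argument and pass it the path to the folder that contains all of the input data, rather than using wildcard patterns. Also drop the reducer if you are not using one (just to remove confusion). I can't say for certain which of the two fixes the problem (I suspect the first), but it should fix it. So: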

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/join2_data/ \
    -output /user/cloudera/join2_f1f \
    -mapper /home/cloudera/join2_mapper.py
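
Independently of the command-line change, one likely reason the six-file run emits nothing is that Hadoop normally starts a separate map task for each input file (split). A `shows` list built while reading a `join2_genchan*` file is then not available to the task that reads a `join2_gennum*` file. A reduce-side join avoids relying on mapper memory. The sketch below is an assumption about how that could look, not the poster's code: `map_lines` and `reduce_lines` are hypothetical names, and the shuffle/sort step is simulated in-process with `sorted` rather than run on a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'show<TAB>value' for ABC rows and numeric rows, keeping no state."""
    for line in lines:
        parts = line.strip().split(',')
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        show, value = parts
        if value == 'ABC' or value.isdigit():
            yield '{0}\t{1}'.format(show, value)


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key; emit counts only for shows that have an ABC row."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in sorted_lines)
    for show, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [value for _, value in group]  # buffer all values for this show
        if 'ABC' in values:
            for value in values:
                if value.isdigit():
                    yield '{0}\t{1}'.format(show, value)


# Simulate two map tasks, one per input file, followed by the shuffle/sort.
genchan = ['Baked_Games,ABC', 'Almost_Cooking,DEF', 'Hourly_Show,ABC']
gennum = ['Almost_Cooking,855', 'Baked_Games,991', 'Baked_Games,200', 'Hourly_Show,896']
mapped = list(map_lines(genchan)) + list(map_lines(gennum))
for record in reduce_lines(sorted(mapped)):
    print(record)
```

On a cluster this approach needs the reducer to actually run, that is, the job must not be submitted with `-numReduceTasks 0`.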
