Why do we use Hadoop MapReduce for data processing? Why not a local machine?

mw3dktmi  posted on 2021-05-27  in  Hadoop

I'm confused. I tried to compute the probability mass of one million random numbers in two ways: with MapReduce on Google Dataproc, and by running a plain Python script in Spyder. The local machine was faster. So why do we use MapReduce at all? Below is the code I used.

#!/usr/bin/env python3
"""Local version: compute and plot the probability mass of 1,000,000 random integers."""
import timeit
start = timeit.default_timer()

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# Generate one million random integers in [1, 100)
x = np.random.randint(low=1, high=100, size=1000000)
counts = Counter(x)
total = sum(counts.values())
d1 = {k: v / total for k, v in counts.items()}
grad = d1.keys()
prob = d1.values()

# `normed` was removed in Matplotlib 3.x; use `density` instead
plt.hist(prob, bins=20, density=True, facecolor='blue', alpha=0.5)
plt.xlabel('Probability')
plt.ylabel('Number Of Students')
plt.title('Histogram of Students Grade')
plt.subplots_adjust(left=0.15)
plt.show()

stop = timeit.default_timer()
print('Time: ', stop - start)
#!/usr/bin/env python3
"""mapper.py"""
import sys

# Read input lines from stdin
for line in sys.stdin:
    # Strip whitespace from both ends and split into tokens;
    # iterating over the raw line would yield single characters
    for probability_mass in line.strip().split():
        print("None\t{}".format(probability_mass))
#!/usr/bin/env python3
"""reducer.py"""
import sys
from collections import defaultdict

counts = defaultdict(float)

# Read the mapper's tab-separated output from stdin
for line in sys.stdin:
    # Remove whitespace from beginning and end of the line
    line = line.strip()
    # Skip empty lines
    if not line:
        continue
    # Parse the key/value pair emitted by mapper.py
    k, v = line.split('\t', 1)
    counts[v] += 1

total = float(sum(counts.values()))
probability_mass = {k: v / total for k, v in counts.items()}
print(probability_mass)
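Before submitting a streaming job, it helps to check the mapper/reducer logic locally. The sketch below simulates the map → shuffle (sort) → reduce flow of the two scripts above in a single Python process; `sample_lines` is made-up illustrative input, not data from the question.

```python
from collections import defaultdict

# Simulated stdin lines, as the mapper would receive them
sample_lines = ["55 12 55 90\n", "12 12 90 55\n"]

# Map phase: emit one (key, value) pair per token, mirroring mapper.py
mapped = []
for line in sample_lines:
    for token in line.strip().split():
        mapped.append(("None", token))

# Shuffle phase: Hadoop sorts pairs by key before invoking the reducer
mapped.sort()

# Reduce phase: count occurrences and normalize, mirroring reducer.py
counts = defaultdict(float)
for _, value in mapped:
    counts[value] += 1
total = float(sum(counts.values()))
probability_mass = {k: v / total for k, v in counts.items()}

print(probability_mass)
```

The same check can be done from a shell with `cat input.txt | ./mapper.py | sort | ./reducer.py`, which is how Hadoop Streaming wires the scripts together.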

guz6ccqo #1

Hadoop is used for storing and processing big data. In Hadoop, data is stored on inexpensive commodity servers that run as a cluster. Its distributed file system allows concurrent processing and fault tolerance. The Hadoop MapReduce programming model lets data be stored and retrieved from those nodes more quickly.
Google Dataproc is Apache Hadoop in the cloud. MapReduce pays off when the volume is too large for a single machine to handle; for small inputs, cluster startup and shuffle overhead dominate the runtime. One million numbers is a small batch.
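For reference, a Hadoop Streaming job like the one in the question would typically be submitted to a Dataproc cluster along these lines. This is a sketch, not a verified command for this setup: the cluster name, region, and bucket paths are placeholders, and the streaming-jar location can vary by Dataproc image version.

```shell
gcloud dataproc jobs submit hadoop \
    --cluster=my-cluster \
    --region=us-central1 \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -- -files mapper.py,reducer.py \
       -mapper mapper.py \
       -reducer reducer.py \
       -input gs://my-bucket/input.txt \
       -output gs://my-bucket/output
```

Submitting a job like this involves provisioning, scheduling, and shuffling across the cluster, which is exactly the overhead that makes the local script faster on a million numbers.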
