为什么我们使用hadoopmapreduce进行数据处理？为什么不在本地机器上呢？

mw3dktmi 于 2021-05-27 发布在 Hadoop

关注(0)|答案(1)|浏览(360)

我很困惑，我试着把概率当成一百万个随机数。我尝试在googledataproc中使用mapreduce和在spyder上运行python脚本两种方法。但是速度越快的是本地机器。那我们为什么要用mapreduce呢？下面是我使用的代码。


# !/usr/bin/env python3
import timeit
start = timeit.default_timer()
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
# Random Number Generating
x = np.random.randint(low=1, high=100, size=1000000)
counts = Counter(x)
total = sum(counts.values())
d1 = {k:v/total for k,v in counts.items()}
grad = d1.keys()
prob = d1.values()
# print(str(grad))
# print(str(prob))
# bins = 20
plt.hist(prob,bins=20, normed=1, facecolor='blue', alpha=0.5)
# plt.plot(bins, hist, 'r--')
plt.xlabel('Probability')
plt.ylabel('Number Of Students')
plt.title('Histogram of Students Grade')
plt.subplots_adjust(left=0.15)
plt.show()
stop = timeit.default_timer()
print('Time: ', stop - start)


# !/usr/bin/env python3
"""mapper.py"""
import sys
# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    #line = line.strip()
    # Split it into tokens
    #tokens = line.split()
    #Get probability_mass values
    for probability_mass in line:
        print("None\t{}".format(probability_mass))
        #print(str(probability_mass)+ '\t1')
        #print('%s\t%s' % (probability_mass, None))


# !/usr/bin/env python3
"""reducer.py"""
import sys
from collections import defaultdict
counts = defaultdict(float)
# Get input from stdin
for line in sys.stdin:
    #Remove spaces from beginning and end of the line
    #line = line.strip()
    # skip empty lines
    if not line:
        continue  
    # parse the input from mapper.py
    k,v = line.split('\t', 1)
    counts[v] += 1
total = (float(sum(counts.values())))
# total = sum(counts.values())
probability_mass = {k:v/total for k,v in counts.items()}
print(probability_mass)

hadoop mapreduce python-3.x google-cloud-platform bigdata

来源：https://stackoverflow.com/questions/59665329/why-we-use-hadoop-mapreduce-for-data-process-why-not-do-on-local-machine

1条答案

按热度按时间

guz6ccqo1#

hadoop用于存储和处理大数据。在hadoop中，数据存储在作为集群运行的廉价商品服务器上。它是一个分布式文件系统，允许并发处理和容错。hadoopmapreduce编程模型用于从其节点更快地存储和检索数据。
googledataproc是云上的apachehadoop。当体积很大时，单机不足以处理map/reduce。100万是小批量。

赞(0）回复(0）举报 2021-05-27

我来回答

为什么我们使用hadoopmapreduce进行数据处理？为什么不在本地机器上呢？

1条答案

相关问题

热门标签

最新问答