MapReduce: calculating the average number of characters

Posted by lmyy7pcs on 2021-07-13 in Hadoop


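I'm new to MapReduce and to coding. I'm trying to write Python code that calculates the average number of characters and the average number of "#" characters per tweet.
Sample data: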
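1469453965000;757570956625870854;rt@lasteven04:la jeune rebecca#kpossi, nageuse, 18 ans à peine devrait être la porte-drapeau du Togo à #rio2016 hyperlink;Twitter for Android
1469453965000;757570957502394369;Over 30 million women footballers in the world. Most of us would trade places with this lot ⚽️ hyperlink;Twitter for iPhone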
Field/column details:

0: epoch_time  1: tweetId  2: tweet  3: device

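Here is the code I have written. I need help calculating the averages in the reducer function; any help or guidance would be greatly appreciated. (Updated based on the answer provided by @onecricketeer.)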

import re
from mrjob.job import MRJob

class Lab3(MRJob):

    def mapper(self,_,line):

        try:
            fields=line.split(";")
            if(len(fields)==4):
                tweet=fields[2]
                tweet_id=fields[0]
                yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
        except:
            pass

    def reduce(self,tweet_id,tweet_info):
        total_tweet_length=0
        total_tweet_hash=0
        count=0
        for v in tweet_info:
            tweet_length,hashes = map(int,v.split())
            tweet_length_sum+= tweet_length
            total_tweet_hash+=hashes
            count+=1

        yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))

if __name__=="__main__":
    Lab3.run()

hadoop mapreduce python-3.x mrjob

Source: https://stackoverflow.com/questions/66791160/mapreduce-in-python-to-calculate-average-characters

1 Answer


euoag5mw1#

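Your mapper needs to yield a key and a value: 2 elements, not 3. Strictly speaking, outputting the average length and the average hashtag count would ideally be separate MapReduce jobs, but in this case you can combine them because you are processing the entire line rather than separate words.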


# you could use the tweetId as the key, too, but would only help if tweets shared ids

yield (None, "{} {}".format(len(tweet), tweet.count('#')))

Note: len(tweet) includes spaces and emojis, which you may want to exclude from what counts as a "character".
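If you do want to leave whitespace and emoji out of the count, something along these lines might work. This is only a sketch: the helper name `count_visible_chars` is made up for illustration, and the choice of which Unicode categories to skip is an assumption about what "character" should mean here, not a standard definition.

```python
import unicodedata

# Assumption: characters in these Unicode general categories are not
# counted. "So" (symbol, other) covers most emoji, "Mn" covers variation
# selectors and combining marks, "Cf" covers zero-width joiners.
SKIPPED_CATEGORIES = ("So", "Mn", "Cf")


def count_visible_chars(tweet):
    """Count characters, ignoring whitespace and the categories above."""
    total = 0
    for ch in tweet:
        if ch.isspace():
            continue
        if unicodedata.category(ch) in SKIPPED_CATEGORIES:
            continue
        total += 1
    return total


if __name__ == "__main__":
    text = "go \u26bd\ufe0f #rio"  # soccer-ball emoji plus variation selector
    print(len(text), count_visible_chars(text))
```

In the mapper you would then emit `count_visible_chars(tweet)` in place of `len(tweet)`. Note that skipping "Mn" also drops combining accents in decomposed text, so adjust the tuple if that matters for your data.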
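I wasn't sure whether `_` is allowed in a function definition. It is a valid Python identifier, and mrjob's own examples use it for the ignored key, so it can stay; a descriptive name works just as well.
Your reduce function is syntactically incorrect. You cannot put a string as a function parameter, nor use += on a variable that has not been defined yet. Also, the average has to be computed by dividing after you have finished summing and counting (so the reducer returns one result in total, not one result per value in the loop).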

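# mrjob only calls a method named "reducer"; one named "reduce" is never invoked
def reducer(self, key, tweet_info):
    total_tweet_length = 0
    total_tweet_hash = 0
    count = 0
    for v in tweet_info:
        # each value is the "<length> <hash count>" string emitted by the mapper
        tweet_length, hashes = map(int, v.split())
        total_tweet_length += tweet_length
        total_tweet_hash += hashes
        count += 1
    # divide once, after the loop, so a single pair of averages is emitted
    yield (total_tweet_length / (1.0 * count), total_tweet_hash / (1.0 * count))  # forcing a floating point output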
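To sanity-check the logic before submitting a job, you could mimic the map and reduce steps in plain Python without mrjob or Hadoop. The sketch below is not mrjob code: `map_line`, `reduce_values`, and `run_locally` are hypothetical helper names, and the sample records are invented for illustration (they only follow the four-field `;`-separated layout from the question).

```python
def map_line(line):
    """Mapper logic: emit one (key, value) pair per well-formed record."""
    fields = line.split(";")
    if len(fields) == 4:
        tweet = fields[2]
        # A single shared key (None) sends every value to the same reducer.
        yield None, "{} {}".format(len(tweet), tweet.count("#"))


def reduce_values(values):
    """Reducer logic: sum and count first, divide once at the end."""
    total_length = 0
    total_hashes = 0
    count = 0
    for value in values:
        length, hashes = map(int, value.split())
        total_length += length
        total_hashes += hashes
        count += 1
    if count == 0:
        return None  # no valid records, so there is no average to report
    return total_length / count, total_hashes / count


def run_locally(lines):
    """Feed every mapper output value into one reducer call."""
    values = [value for line in lines for _key, value in map_line(line)]
    return reduce_values(values)


if __name__ == "__main__":
    # Invented records: epoch_time;tweetId;tweet;device
    sample = [
        "1;100;hello #a #b;web",
        "2;101;hi #c;web",
        "malformed line",  # skipped: it does not split into 4 fields
    ]
    print(run_locally(sample))
```

Keep in mind that a tweet containing its own `;` splits into more than four fields and is silently dropped by the `len(fields) == 4` check, which slightly biases the averages.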

Answered 2021-07-13
