MapReduce in Python: calculating average characters

Problem description

I'm new to map-reduce and to coding, and I'm trying to write Python code that calculates the average number of characters and of "#" symbols in tweets.

Sample data:

1469453965000;757570956625870854;RT @lasteven04: La jeune Rebecca #Kpossi,nageuse,18 ans à peine devrait être la porte-drapeau du #Togoà #Rio2016 hyperlink;Twitter for Android
1469453965000;757570957502394369;Over 30 million women footballers in the world. Most of us would trade places with this lot for #Rio2016 ⚽️ hyperlink;Twitter for iPhone

Field/column details:

 0: epoch_time  1: tweetId  2: tweet  3: device

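This is the code I've written. I need help calculating the averages in the reducer function; any help/guidance would be much appreciated. (Updated based on the answer provided by @oneCricketeer.)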

import re
from mrjob.job import MRJob

class Lab3(MRJob):

    def mapper(self,_,line):

        try:
            fields=line.split(";")
            if(len(fields)==4):
                tweet=fields[2]
                tweet_id=fields[0]
                yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
        except:
            pass

    def reduce(self,tweet_info):
        total_tweet_length=0
        total_tweet_hash=0
        count=0
        for v in tweet_info:
            tweet_length,hashes = map(int,v.split())
            tweet_length_sum+= tweet_length
            total_tweet_hash+=hashes
            count+=1

        yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))


if __name__=="__main__":
    Lab3.run()

Solution

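Your mapper needs to yield a key and a value: 2 elements, not 3. Ideally, the average length and the hashtag count would be computed by separate MapReduce jobs, but in this case you can combine them because you're processing whole lines, not individual words.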

# you could use the tweetId as the key, too, but that would only help if tweets shared ids
yield (None,"{} {}".format(len(tweet),tweet.count('#')))

Note: len(tweet) includes whitespace and emoji, which you may want to exclude from what you count as "characters".
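As a rough illustration of the note above, here is one way to count only non-whitespace characters. The helper name count_visible_chars is hypothetical, and what counts as a "character" is an assumption you would adjust to your needs: this sketch drops whitespace only and still counts emoji (some of which, such as ⚽️, can be more than one code point).

```python
def count_visible_chars(tweet):
    """Count the characters in `tweet`, skipping all whitespace.

    Hypothetical helper for illustration; emoji are still counted.
    """
    return sum(1 for ch in tweet if not ch.isspace())


sample = "Go team #Rio2016"
print(len(sample))                  # every character, spaces included
print(count_visible_chars(sample))  # the same text with whitespace skipped
```

In the mapper you could then yield count_visible_chars(tweet) in place of len(tweet).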

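I wasn't sure whether _ is allowed as a parameter name in a function definition; it is a valid Python identifier, so you can keep it or rename it to something like key.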

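Your reduce function is not syntactically correct. You can't have a string as a function argument, and you can't use += on a variable that hasn't been defined yet. Also, the average has to be computed by dividing after you've accumulated the totals and the count (so, outside the loop: one result per reducer, not one per value).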

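def reducer(self, key, tweet_info):  # mrjob calls a method named reducer; one named reduce is never run
    total_tweet_length = 0
    total_tweet_hash = 0
    count = 0
    for v in tweet_info:
        # each value is "<length> <hash count>", as produced by the mapper
        tweet_length, hashes = map(int, v.split())
        total_tweet_length += tweet_length
        total_tweet_hash += hashes
        count += 1
    # divide once, after the loop: one result per reducer call, not per value
    yield (total_tweet_length / (1.0 * count), total_tweet_hash / (1.0 * count))  # forcing a floating point output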
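To check the logic without a Hadoop or mrjob setup, the mapper and reducer can be mimicked with plain functions. This is only a local sketch under stated assumptions: run_locally and the sample lines are made up for illustration, and since every key is None, all mapper values are handed to a single reducer call.

```python
def mapper(line):
    """Yield (key, value) for a well-formed line; skip anything else."""
    fields = line.split(";")
    if len(fields) == 4:
        tweet = fields[2]
        yield None, "{} {}".format(len(tweet), tweet.count("#"))


def reducer(key, values):
    """Sum lengths and hashtag counts, then divide once after the loop."""
    total_length = 0
    total_hashes = 0
    count = 0
    for v in values:
        tweet_length, hashes = map(int, v.split())
        total_length += tweet_length
        total_hashes += hashes
        count += 1
    yield total_length / count, total_hashes / count


def run_locally(lines):
    """Hypothetical driver: collect mapper output and call the reducer once."""
    values = [value for line in lines for _key, value in mapper(line)]
    if not values:
        return None  # no valid tweets, so there is nothing to average
    return next(reducer(None, values))


lines = [
    "1;100;hello #a #b;web",  # tweet length 11, 2 hashtags
    "2;101;hi #c;web",        # tweet length 5, 1 hashtag
    "broken line",            # ignored: does not have 4 fields
]
print(run_locally(lines))  # (8.0, 1.5)
```

Note that a tweet containing a ";" would split into more than 4 fields and be skipped by this check, just as in the mapper above.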

mrjob python-3.x