How to calculate how many mappers a MapReduce job needs

Problem description

I have been given the following question, which provides this information:

Suppose the program presented in 2a) will be executed on a dataset of 200 million
recorded inspections, collecting 2000 days of data. In total there are 1,000,000 unique
establishments. The total input size is 1 terabyte. The cluster has 100 worker nodes
(all of them idle), and HDFS is configured with a block size of 128MB.
Using that information, provide a reasoned answer to the following questions. State
any assumptions you feel necessary when presenting your answer.

I am asked to answer these questions:

1) How many worker nodes will be involved during the execution of the Map and Reduce
tasks of the job? 
2) How many times does the map method run on each physical worker?
3) How many input splits are processed at each node? 
4) How many times will the reduce method be invoked at each reducer?

Can someone verify whether my answers are correct?

Q1) I am basically working out how many mappers I need. My calculation is 1TB (the input size) divided by the block size (128MB).

1TB / 128MB = 7812.5. Since 7812.5 mappers are needed and we only have 100 worker nodes, all 100 nodes will be used, correct?
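The division above can be checked with a small sketch. It assumes one input split per HDFS block, and it rounds up because a partially filled last block still needs its own split. The function name `num_input_splits` is made up for illustration. Note that the result depends on whether "1 terabyte" and "128MB" are read as decimal or binary units, which is an assumption worth stating in the answer.

```python
def num_input_splits(input_bytes: int, block_bytes: int) -> int:
    """Ceiling division: a partially filled last block still counts as one split."""
    return -(-input_bytes // block_bytes)


# Assumption: decimal units (1 TB = 10**12 bytes, 128 MB = 128 * 10**6 bytes).
splits_decimal = num_input_splits(10**12, 128 * 10**6)

# Assumption: binary units (1 TiB = 2**40 bytes, 128 MiB = 128 * 2**20 bytes).
splits_binary = num_input_splits(2**40, 128 * 2**20)

print(splits_decimal)  # 7813
print(splits_binary)   # 8192
```

Under the decimal reading this gives the 7812.5 figure rounded up to 7813; under the binary reading it gives 8192.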

Q2) From Q1, I found that 7812.5 mappers are needed, so the map method will run 7812.5 (rounded up to 7813) times on each physical worker.

Q3) The number of input splits is the same as the number of mappers, so there will be 7813 splits.
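As a cross-check on Q2 and Q3, here is a rough sketch of the per-node figures rather than the cluster-wide totals. It rests on several assumptions: the 7813 splits from the calculation above, splits spread evenly over all 100 idle nodes, and one input record per recorded inspection. In Hadoop the map method is invoked once per input record, not once per split, so the two questions may have different answers. The function names are hypothetical.

```python
TOTAL_SPLITS = 7813          # from the 1TB / 128MB calculation (decimal units assumed)
WORKER_NODES = 100           # all idle, per the problem statement
TOTAL_RECORDS = 200_000_000  # assumption: one record per recorded inspection


def splits_per_node(total_splits: int, nodes: int) -> tuple[int, int, int]:
    """Return (fewest splits on a node, most splits on a node, nodes holding the most)."""
    base, extra = divmod(total_splits, nodes)
    most = base + 1 if extra else base
    return base, most, extra


def map_calls_per_node(total_records: int, nodes: int) -> int:
    """map() runs once per input record; assume records are spread evenly."""
    return total_records // nodes


print(splits_per_node(TOTAL_SPLITS, WORKER_NODES))      # (78, 79, 13)
print(map_calls_per_node(TOTAL_RECORDS, WORKER_NODES))  # 2000000
```

Under these assumptions each node handles 78 or 79 splits and calls map about 2,000,000 times, while 7813 is the total number of splits across the whole cluster.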

Q4) Since I am told there are 1,000,000 unique values and the default number of reducers is 2, the reduce method will run 500,000 times on each reducer.
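The Q4 arithmetic can be sketched the same way. This assumes the job's intermediate key is the establishment (the program from 2a) is not shown here, so this is a guess), that reduce is invoked once per distinct key, and that the 1,000,000 keys from the problem statement are partitioned evenly. The number of reducers is a job setting, so the sketch takes it as a parameter; `reduce_calls_per_reducer` is a made-up helper name.

```python
def reduce_calls_per_reducer(distinct_keys: int, num_reducers: int) -> tuple[int, int]:
    """Return (min, max) reduce() invocations per reducer for evenly spread keys."""
    base, extra = divmod(distinct_keys, num_reducers)
    most = base + 1 if extra else base
    return base, most


UNIQUE_ESTABLISHMENTS = 1_000_000  # from the problem statement

for reducers in (1, 2, 100):
    print(reducers, reduce_calls_per_reducer(UNIQUE_ESTABLISHMENTS, reducers))
# 1 (1000000, 1000000)
# 2 (500000, 500000)
# 100 (10000, 10000)
```

The 500,000 figure follows only if the job actually runs with 2 reducers, so the answer should state the reducer count it assumes.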

Can someone go through my reasoning and tell me whether I am correct? Thanks.


mrjob