Problem description
I am new to Google Colab and Python.
I have loaded my file from Google Drive and am trying to run a MapReduce job using mrjob.
However, it raises a TypeError.
import sys
sys.argv = ['0']

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol, RawValueProtocol
from mrjob.step import MRStep

# creating an mrjob
class averagerating(MRJob):
    def steps(self):
        return [MRStep(mapper=self.mapper_average_rating, reducer=self.reducer_average_rating)]

    # creating a mapping function
    def mapper_average_rating(self):
        x_teleplay = dfR_new['teleplay_id']
        y_rating = dfR_new.iloc[:, -1:].mean(axis=1)
        average_rate_per_id = dfR_new.groupby(['teleplay_id'])[['rating']].mean()
        yield y_rating, x_teleplay

    # creating a reducer function
    def reducer_average_rating(self, key, values):
        key = average_rate_per_id['teleplay_id']
        values = average_rate_per_id['rating']
        yield key, values
        print(key, values)

# main function
if __name__ == "__main__":
    averagerating.run()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-44-d5d6bd5175a1> in <module>()
27 #main function
28 if __name__ == "__main__":
---> 29 averagerating.run()
7 frames
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in run(cls)
614 """
615 # load options from the command line
--> 616 cls().execute()
617
618 def run_job(self):
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in execute(self)
685
686 else:
--> 687 self.run_job()
688
689 def make_runner(self):
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in run_job(self)
632 stream=log_stream)
633
--> 634 with self.make_runner() as runner:
635 try:
636 runner.run()
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in make_runner(self)
702
703 runner_class = self._runner_class()
--> 704 kwargs = self._runner_kwargs()
705
706 # screen out most false-ish args so that it's readable
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in _runner_kwargs(self)
725 # don't screen out irrelevant opts (see #1898)
726 self._kwargs_from_switches(set(_RUNNER_OPTS)),
--> 727 self._job_kwargs(),
728 )
729
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in _job_kwargs(self)
244 self.jobconf(), self.options.jobconf),
245 libjars=combine_lists(
--> 246 self.libjars(), self.options.libjars),
247 partitioner=self.partitioner(),
248 sort_values=self.sort_values(),
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in libjars(self)
1371 ``--libjars`` option
1372 """
-> 1373 script_dir = os.path.dirname(self.mr_job_script())
1374
1375 paths = []
/usr/lib/python3.7/posixpath.py in dirname(p)
154 def dirname(p):
155 """Returns the directory component of a pathname"""
--> 156 p = os.fspath(p)
157 sep = _get_sep(p)
158 i = p.rfind(sep) + 1
TypeError: expected str, bytes or os.PathLike object, not NoneType
I added sys.argv=['0'] because, if I do not set sys.argv, I get a "list index out of range" error.
Solution
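All of these problems arise because you are running the code in Google Colab, which stores the code and everything else as a Jupyter notebook (.ipynb) and does not save the code in a .py file. When MRJob runs, it creates new processes, and these processes have to read the code again from a .py file. Because a Jupyter notebook is not a .py file, they cannot read it.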
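The last frames of the traceback show the mechanism: mrjob calls os.path.dirname(self.mr_job_script()), and the value it receives is None because there is no script file behind the notebook cell. A minimal sketch that reproduces only that final step with the standard library (script_path is an illustrative name, not something from mrjob):

```python
import os

# In a notebook there is no .py file for the job, so the script path is None.
# (script_path is an illustrative stand-in for the value mrjob passes in.)
script_path = None

try:
    os.path.dirname(script_path)
    error_message = None
except TypeError as exc:
    # Same message as the last line of the traceback in the question.
    error_message = str(exc)
    print("TypeError:", exc)

# With a real file path the same call succeeds.
script_dir = os.path.dirname("/content/script.py")
print(script_dir)
```

Once the job lives in an actual .py file, the path is a string and this step no longer fails.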
As I found in "Run MRJob from IPython notebook", you can put the code in a new cell that begins with %%file script.py to save it to a .py file, and then run it from the next cell as !python script.py your_data_file.txt
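This solves the "expected str, bytes or os.PathLike object, not NoneType" problem, but another problem remains: the script does not know dfR_new, because it runs only the code inside script.py. You would have to create dfR_new in script.py, but I think MRJob should read the data from a file instead.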
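Frankly, I don't understand what you are trying to do with MRJob, because the code makes no sense to me. It looks like the mapper and reducer always use the same values, so they are useless. MRJob reads data from a text file, but it does not send it to the mapper as one string; it sends it line by line, so that each line can be processed by a separate process. Later, the reducer should receive all the processed lines and reduce them to a single result. But you always use the same values from the full DataFrame, so MRJob seems useless here. Simple MRJob jobs are meant for a completely different kind of problem.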
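To make the line-by-line flow concrete, here is a rough plain-Python sketch of what a mapper and reducer for "average rating per teleplay_id" might look like. It does not use mrjob at all: run_locally is a hypothetical helper that imitates the map, group-by-key, and reduce phases in a single process, and the sample data and the "teleplay_id,rating" line layout are assumptions, since the real input file is not shown.

```python
import csv
from collections import defaultdict

# Hypothetical sample input: one "teleplay_id,rating" record per line.
SAMPLE_LINES = [
    "1,8",
    "1,6",
    "2,9",
    "2,10",
    "2,5",
]


def mapper(_, line):
    """Turn one input line into a (teleplay_id, rating) pair."""
    teleplay_id, rating = next(csv.reader([line]))
    yield teleplay_id, float(rating)


def reducer(teleplay_id, ratings):
    """Collapse all ratings for one teleplay_id into their average."""
    ratings = list(ratings)
    yield teleplay_id, sum(ratings) / len(ratings)


def run_locally(lines):
    """Imitate the map, group-by-key, and reduce phases in one process."""
    grouped = defaultdict(list)
    for line in lines:
        # Each line is mapped on its own, without access to the other lines.
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        # The reducer sees every value that was emitted for one key.
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


averages = run_locally(SAMPLE_LINES)
print(averages)
```

In an actual mrjob script, the two functions would become methods of the MRJob subclass (with self as the first parameter), and mrjob would take care of the grouping step between them.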