TypeError: expected str, bytes or os.PathLike object, not NoneType when running mrjob

Problem

I am new to Google Colab and Python.

I have loaded a file from Google Drive and am trying to run a MapReduce job with mrjob.

However, it raises a TypeError:

import sys
sys.argv=['0']

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol,RawValueProtocol
from mrjob.step import MRStep
#creating an mrjob
class averagerating(MRJob):
  def steps(self):
    return [MRStep(mapper=self.mapper_average_rating,reducer=self.reducer_average_rating)]
  #creating a mapping fuction
  def mapper_average_rating(self):
     x_teleplay=dfR_new['teleplay_id']
     y_rating=dfR_new.iloc[:,-1:].mean(axis=1)
     average_rate_per_id=dfR_new.groupby(['teleplay_id'])[['rating']].mean()

     yield y_rating,x_teleplay



  #creating a reducer fuction
  def reducer_average_rating(self,key,values):
     key=average_rate_per_id['teleplay_id']
     values=average_rate_per_id['rating']
     yield key,values
     print(key,values)

 #main function
if __name__ == "__main__":
   averagerating.run()

I would like to know what is wrong with the last line of my code and how I can fix the error.

Added:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-d5d6bd5175a1> in <module>()
     27  #main function
     28 if __name__ == "__main__":
---> 29    averagerating.run()

7 frames
/usr/local/lib/python3.7/dist-packages/mrjob/job.py in run(cls)
    614         """
    615         # load options from the command line
--> 616         cls().execute()
    617
    618     def run_job(self):

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in execute(self)
    685
    686         else:
--> 687             self.run_job()
    688
    689     def make_runner(self):

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in run_job(self)
    632                 stream=log_stream)
    633
--> 634         with self.make_runner() as runner:
    635             try:
    636                 runner.run()

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in make_runner(self)
    702
    703         runner_class = self._runner_class()
--> 704         kwargs = self._runner_kwargs()
    705
    706         # screen out most false-ish args so that it's readable

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in _runner_kwargs(self)
    725             # don't screen out irrelevant opts (see #1898)
    726             self._kwargs_from_switches(set(_RUNNER_OPTS)),
--> 727             self._job_kwargs(),
    728         )
    729

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in _job_kwargs(self)
    244                 self.jobconf(), self.options.jobconf),
    245             libjars=combine_lists(
--> 246                 self.libjars(), self.options.libjars),
    247             partitioner=self.partitioner(),
    248             sort_values=self.sort_values(),

/usr/local/lib/python3.7/dist-packages/mrjob/job.py in libjars(self)
   1371         ``--libjars`` option
   1372         """
-> 1373         script_dir = os.path.dirname(self.mr_job_script())
   1374
   1375         paths = []

/usr/lib/python3.7/posixpath.py in dirname(p)
    154 def dirname(p):
    155     """Returns the directory component of a pathname"""
--> 156     p = os.fspath(p)
    157     sep = _get_sep(p)
    158     i = p.rfind(sep) + 1

TypeError: expected str, bytes or os.PathLike object, not NoneType

I added sys.argv=['0'] because without it I get a "list index out of range" error.

Solution

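The whole problem comes from running this in Google Colab, which stores the code and everything else as a Jupyter notebook (.ipynb) and never saves the code to a .py file.

When MRJob runs the job, it creates new processes, and those processes have to read the code again from a .py file. Because a Jupyter notebook is not a .py file, there is nothing for them to read. In the traceback, mr_job_script() returns None, and that None is what os.path.dirname() rejects.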
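The last frame of the traceback can be reproduced without mrjob at all. The sketch below only assumes that mr_job_script() returned None, as the error message indicates. The helper name describe_dirname_failure and the path /content/script.py are made up for illustration.

```python
import os


def describe_dirname_failure(path):
    """Call os.path.dirname and return the error message, or None on success."""
    try:
        os.path.dirname(path)
    except TypeError as exc:
        return str(exc)
    return None


# None stands in for what mr_job_script() returned in the traceback.
print(describe_dirname_failure(None))

# A real script path (illustrative) is accepted, so no message comes back.
print(describe_dirname_failure("/content/script.py"))
```

The first call reports the same TypeError as in the question. The second returns None because a string path is valid input.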

As I found in "Run MRJob from IPython notebook", you can put the code in a new cell that starts with %%file script.py to save it to a .py file, and then in the next cell run it as !python script.py your_data_file.txt
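If cell magics are not available, the same two steps (save the code to a file, then start a separate Python process on it) can be approximated in plain Python. This is only a sketch. SCRIPT_SOURCE is a hypothetical stand-in rather than the job from the question, and save_and_run and the sample data are made up for illustration.

```python
import pathlib
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the job code; a real script.py would hold the MRJob class.
SCRIPT_SOURCE = '''\
import os
import sys

# A process started from a .py file knows where its own source lives.
with open(sys.argv[1]) as data:
    print(sum(1 for _ in data), "lines read by", os.path.basename(__file__))
'''


def save_and_run(workdir):
    """Write script.py and a small data file, then run the script in a new process."""
    workdir = pathlib.Path(workdir)
    script = workdir / "script.py"
    data = workdir / "your_data_file.txt"
    script.write_text(SCRIPT_SOURCE)      # roughly what %%file script.py does
    data.write_text("1,8\n1,6\n2,9\n")    # made-up sample input
    result = subprocess.run(              # roughly what !python script.py ... does
        [sys.executable, str(script), str(data)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(save_and_run(tmp))
```

The point is that the child process is started from a real file on disk, which is exactly what the notebook cell could not offer.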


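This solves the "expected str, bytes or os.PathLike object, not NoneType" problem, but another problem remains: the script doesn't know dfR_new, because it only runs the code inside script.py. You would have to create dfR_new inside script.py, but I think MRJob should read the data from a file instead.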


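Frankly, I don't understand what you are trying to do with MRJob, because it makes no sense to me. The mapper and the reducer always use the same values, so they serve no purpose. MRJob reads data from a text file, but it doesn't send it to the mapper as one string. It sends it line by line, so each line can be processed by a separate process, and later the reducer should receive all the processed lines and reduce them to one line. But you always use the same values from the full DataFrame, so MRJob seems useless here. Simple MRJob is meant for a completely different kind of problem.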
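To make the line-by-line flow concrete, here is a minimal sketch in plain Python, without mrjob. It assumes each input line looks like teleplay_id,rating, which is a guess at the data layout and is not taken from the question. run_locally is a hypothetical stand-in for what the framework does between the two phases. In an actual mrjob job, mapper and reducer would be methods on the MRJob subclass, with signatures mapper(self, _, line) and reducer(self, key, values).

```python
from collections import defaultdict


def mapper(line):
    """Turn one input line 'teleplay_id,rating' into a (key, value) pair."""
    teleplay_id, rating = line.strip().split(",")
    yield teleplay_id, float(rating)


def reducer(teleplay_id, ratings):
    """Collapse all ratings collected for one teleplay_id into their mean."""
    ratings = list(ratings)
    yield teleplay_id, sum(ratings) / len(ratings)


def run_locally(lines):
    """Stand-in for the framework: map each line, group by key, reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Made-up sample input: two ratings for teleplay 1, one for teleplay 2.
print(run_locally(["1,8", "1,6", "2,9"]))  # {'1': 7.0, '2': 9.0}
```

The mapper only ever sees a single line and the reducer only sees the values for a single key. Neither touches a whole DataFrame, which is why the code in the question does not fit this model.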