Problem description
I am currently building a small test project to learn how to use crontab on Linux (Ubuntu 20.04.2 LTS).
My crontab file looks like this:
* * * * * sh /home/path_to .../crontab_start_spider.sh >> /home/path_to .../log_python_test.log 2>&1
What I want crontab to do is launch a Scrapy project with the shell script below, with the output stored in the file log_python_test.log.
My shell script (the numbers are for reference only):
0 #!/bin/bash
1 cd /home/luc/Documents/computing/tests/learning/morning
2 PATH=$PATH:/usr/local/bin
3 export PATH
4 PATH=$PATH:/home/luc/gen_env/lib/python3.7/site-packages
5 export PATH
6 scrapy crawl meteo
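One thing worth knowing here (it explains the behaviour described below): cron runs jobs with a minimal environment, not your login shell's, so variables like PATH and PYTHONPATH may be missing or trimmed (a typical cron sets little more than HOME, LOGNAME, SHELL and a bare PATH). A small sketch simulating a cron-like scrubbed environment with a child interpreter:

```python
import subprocess
import sys

# Run a child interpreter with an almost-empty environment, similar to
# what cron provides for a job.
out = subprocess.run(
    [sys.executable, "-c", "import os; print(sorted(os.environ))"],
    env={"LANG": "C"},
    capture_output=True,
    text=True,
)
print(out.stdout)  # only what we passed in: no PYTHONPATH, no venv PATH entries
```

A script that works in an interactive shell can therefore fail under cron simply because the child process never sees the environment your shell set up.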
Some of you may be interested in the structure of my Scrapy project, so here it is:
You may also want to see the code I edited in Scrapy.
My spider: meteo.py
import scrapy
from morning.items import MorningItem


class MeteoSpider(scrapy.Spider):
    name = 'meteo'
    allowed_domains = ['meteo.gc.ca']
    start_urls = ['https://www.meteo.gc.ca/city/pages/qc-136_metric_f.html']

    def parse(self, response, **kwargs):
        # Extracting data from page
        condition = response.css('div.col-sm-4:nth-child(1) > dl:nth-child(1) > dd:nth-child(2)::text').get()
        pression = response.css('div.col-sm-4:nth-child(1) > dl:nth-child(1) > dd:nth-child(4)::text').get()
        temperature = response.css('div.brdr-rght-city:nth-child(2) > dl:nth-child(1) > dd:nth-child(2)::text').get()

        # Creating and filling the item
        item = MorningItem()
        item['condition'] = condition
        item['pression'] = pression
        item['temperature'] = temperature
        return item
My item: in items.py
import scrapy


class MorningItem(scrapy.Item):
    condition = scrapy.Field()
    pression = scrapy.Field()
    temperature = scrapy.Field()
My pipeline: in pipelines.py (this default pipeline is uncommented in settings.py)
import logging
from gtts import gTTS
import os
import random
from itemadapter import ItemAdapter


class MorningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Message creation
        messages = ["Bon matin! J'èspère que vous avez bien dormi cette nuit. Voici le topo.",
                    "Bonjour Luc. Un bon petit café et on est parti.",
                    "Saluto amigo. Voici ce que vous devez savoir."]
        message_of_the_day = messages[random.randint(0, len(messages) - 1)]

        # Add meteo to message
        message_of_the_day += f" Voici la météo. La condition: {adapter['condition']}. La pression: " \
                              f"{adapter['pression']} kilo-pascal. La température: {adapter['temperature']} celcius."
        if '-' in adapter['temperature']:
            message_of_the_day += " Vous devriez vous mettre un petit chandail."
        elif len(adapter['temperature']) == 3:
            if int(adapter['temperature'][0:2]) > 19:
                message_of_the_day += " Vous allez être bien en sanDales."

        # Creating mp3
        language = 'fr-ca'
        output = gTTS(text=message_of_the_day, lang=language, slow=False)

        # Prepare output file emplacement and saving
        if os.path.exists("/home/luc/Music/output.mp3"):
            os.remove("/home/luc/Music/output.mp3")
        output.save("/home/luc/Music/output.mp3")

        # Playing mp3 and retrieving the output
        logging.info(f'First command output: {os.system("mpg123 /home/luc/Music/output.mp3")}')
        return item
Running the project from the terminal (scrapy crawl meteo) works without any problem:
WARNING:gtts.lang:'fr-ca' has been deprecated, falling back to 'fr'. This fallback will be removed in a future version.
2021-06-04 12:18:21 [gtts.lang] WARNING: 'fr-ca' has been deprecated, falling back to 'fr'. This fallback will be removed in a future version.
...
stats:
{'downloader/request_bytes': 471,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 14325,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 21.002126,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 6, 4, 16, 18, 41, 658684),
 'item_scraped_count': 1,
 'log_count/DEBUG': 82,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 60342272,
 'memusage/startup': 60342272,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021,20,656558)}
INFO:scrapy.core.engine:Spider closed (finished)
2021-06-04 12:18:41 [scrapy.core.engine] INFO: Spider closed (finished)
Apart from a minor deprecation warning, I consider the crawl successful. The problem appears when it runs from crontab. Here is the output in log_python_test.log:
2021-06-04 12:00:02 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: morning)
2021-06-04 12:00:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 14:51:16) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-5.8.0-53-generic-x86_64-with-debian-bullseye-sid
2021-06-04 12:00:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-06-04 12:00:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'morning', 'NEWSPIDER_MODULE': 'morning.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['morning.spiders']}
2021-06-04 12:00:02 [scrapy.extensions.telnet] INFO: Telnet Password: bf691c25dae7d218
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-06-04 12:00:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2021-06-04 12:00:02 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 87, in crawl
    self.engine = self._create_engine()
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 101, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 50, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/luc/Documents/computing/tests/learning/morning/morning/pipelines.py", line 3, in <module>
    from gtts import gTTS
builtins.ModuleNotFoundError: No module named 'gtts'
2021-06-04 12:00:02 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/luc/.local/lib/python3.7/site-packages/twisted/internet/defer.py", in _gcd_import
  File "<frozen importlib._bootstrap>", in _find_and_load
  File "<frozen importlib._bootstrap>", in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", in _load_unlocked
  File "<frozen importlib._bootstrap_external>", in exec_module
  File "<frozen importlib._bootstrap>", in _call_with_frames_removed
  File "/home/luc/Documents/computing/tests/learning/morning/morning/pipelines.py", in <module>
    from gtts import gTTS
ModuleNotFoundError: No module named 'gtts'
Suddenly, the gtts package can no longer be found. It does not seem to be the only one: an earlier version of my pipelines.py had from mutagen.mp3 import MP3 at the top, and importing it failed the same way.
I thought maybe I had made a mistake installing the gtts package, so I ran pip install gtts to make sure everything was fine, and got:
Requirement already satisfied: gtts in /home/luc/gen_env/lib/python3.7/site-packages (2.2.2)
Requirement already satisfied: six in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (1.15.0)
Requirement already satisfied: requests in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (2.24.0)
Requirement already satisfied: click in /home/luc/gen_env/lib/python3.7/site-packages (from gtts) (7.1.2)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/luc/gen_env/lib/python3.7/site-packages (from requests->gtts) (1.25.10)
gTTS also shows up when I type pip list:
gTTS 2.2.2
I also made sure I installed it into the correct environment. Below are the results of which python and which pip, respectively:
/home/luc/gen_env/bin/python
/home/luc/gen_env/bin/pip
I thought I could fix the problem by adding lines 4 and 5 to my shell script, but it didn't work (the output was the same). I'm fairly sure I need to add some path to PYTHONPATH or something similar, but I'm not sure what I'm doing and I don't want to break anything.
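In hindsight, lines 4 and 5 of my shell script could never fix the import: PATH only tells the shell where to find executables, while Python resolves imports from sys.path, which is seeded from PYTHONPATH when the interpreter starts. A minimal sketch of the distinction (the site-packages path is the one from my setup):

```python
import os
import sys

site_pkgs = "/home/luc/gen_env/lib/python3.7/site-packages"  # my venv's site-packages

# Appending to PATH only changes executable lookup for the shell...
os.environ["PATH"] += os.pathsep + site_pkgs

# ...the import system never consults PATH, so sys.path is unchanged:
print(site_pkgs in sys.path)  # False

# PYTHONPATH (read once at interpreter startup) is what seeds sys.path.
```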
Thanks in advance.
Solution
I found a way to solve my problem. As I suspected, my PYTHONPATH was missing a directory: the one containing the gtts package.
The fix, if you have the same problem:
- Locate the package
I looked at that post
- Add its directory to sys.path (the runtime equivalent of putting it on PYTHONPATH)
Add this code at the top of your script (pipelines.py in my case):
import sys
sys.path.append("/<the_path_to_your_package>")