Google Scholar is blocking me from using search_pubs

Problem description

I am using PyCharm Community Edition 2020.3.2, scholarly 1.0.2, and Tor 1.0.0. I am trying to scrape 700 articles to find their citation counts. Google Scholar is blocking me from using search_pubs (a scholarly function), while another scholarly function, search_author, still works fine. At first, search_pubs worked correctly. I tried this code:

from scholarly import scholarly
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
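
For context, this is the kind of loop I eventually want to run over all 700 titles. I am assuming each search_pubs hit is a plain dict with a 'num_citations' field, which is what scholarly 1.x appears to return; the sample dict below is a made-up stand-in for a real hit while my IP is blocked:

```python
def citation_count(hit):
    """Pull the citation count out of one search_pubs() result dict."""
    return hit.get('num_citations', 0)

titles = [
    'Large Batch Optimization for Deep Learning: Training BERT in 76 minutes',
]

# Stand-in for next(scholarly.search_pubs(title)); the field layout
# ({'bib': {...}, 'num_citations': ...}) is assumed from scholarly 1.x,
# and 1234 is a made-up number just to exercise the loop.
sample_hit = {
    'bib': {'title': titles[0]},
    'num_citations': 1234,
}

counts = {sample_hit['bib']['title']: citation_count(sample_hit)}
print(counts)
```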

After a few runs, it threw the following error:

Traceback (most recent call last):
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-3bbcfb742cb5>", line 1, in <module>
    scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_scholarly.py", line 121, in search_pubs
    return self.__nav.search_publications(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 256, in search_publications
    return _SearchScholarIterator(self, url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 58, in _load_url
    self._soup = self._nav._get_soup(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 200, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 152, in _get_page
    raise Exception("Cannot fetch the page from Google Scholar.")
Exception: Cannot fetch the page from Google Scholar.

Then I found the cause: I have to pass Google's CAPTCHA before I can keep fetching from Google Scholar. Many people suggested I need to use a proxy because my IP has been blocked by Google. I tried changing the proxy with FreeProxies():

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

It didn't work, and PyCharm froze for a long time. Then I installed Tor (pip install Tor) and tried again:

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Tor_External(tor_sock_port=9050, tor_control_port=9051, tor_password="scholarly_password")
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
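
Before blaming scholarly, I also wanted to confirm the Tor SOCKS and control ports were actually listening locally; a quick standard-library check (my own debugging snippet, not part of scholarly):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# If these print False, Tor isn't actually running on the expected ports,
# so Tor_External() cannot work regardless of what scholarly does.
print('SOCKS 9050 listening:', port_open('127.0.0.1', 9050))
print('Control 9051 listening:', port_open('127.0.0.1', 9051))
```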

It didn't work. Then I tried SingleProxy():

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.SingleProxy(https='socks5://127.0.0.1:9050', http='socks5://127.0.0.1:9050')
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

That didn't work either. I have never tried Luminati because I am not familiar with it. If anyone knows a solution, please help!
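
One thing I'm considering in the meantime, independent of which proxy backend I end up using, is throttling the 700 lookups with a generic retry-and-backoff wrapper so the queries slow down as soon as Scholar starts refusing. This is my own sketch, not a scholarly feature; the flaky() function below just simulates a call that fails a couple of times:

```python
import random
import time

def with_backoff(fn, retries=4, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            # base_delay, 2*base_delay, 4*base_delay, ... plus jitter
            # so 700 queries don't all retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

# Demo with a flaky stand-in for scholarly.search_pubs(...):
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise Exception('Cannot fetch the page from Google Scholar.')
    return 'ok'

print(with_backoff(flaky, base_delay=0.1))  # succeeds on the third attempt
```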
