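Scraping multiple wikitables with Python

Question

I am a beginner in Python. I have a task to scrape the info tables from Wikipedia pages. I want to scrape them with the following code: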

import requests
from pandas.io.html import read_html

page = requests.get('https://de.wikipedia.org/wiki/Köln')
wikitables = read_html(page, attrs={"class": "hintergrundfarbe5 float-right toptextcells infoBox"})
print("Extracted {num} wikitables".format(num=len(wikitables)))

wikitables[0]

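However, I get the following error message, apparently because of the special character in the URL (the ö in Köln). Please help me see where to change the program so it can scrape the information.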

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-168-d9bd1e1d7548> in <module>
      2 page = requests.get('https://de.wikipedia.org/wiki/Köln')
      3 Soup = BeautifulSoup(page.content)
----> 4 wikitables = read_html(page,attrs={"class":"hintergrundfarbe5 float-right toptextcells infoBox"})
      5 print("Extracted {num} wikitables".format(num=len(wikitables)))
      6 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io,match,flavor,header,index_col,skiprows,attrs,parse_dates,tupleize_cols,thousands,encoding,decimal,converters,na_values,keep_default_na,displayed_only)
   1092                   decimal=decimal,converters=converters,na_values=na_values,
   1093                   keep_default_na=keep_default_na,
-> 1094                   displayed_only=displayed_only)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor,io,displayed_only,**kwargs)
    914             break
    915     else:
--> 916         raise_with_traceback(retained)
    917 
    918     ret = []

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\compat\__init__.py in raise_with_traceback(exc,traceback)
    418         if traceback == Ellipsis:
    419             _,_,traceback = sys.exc_info()
--> 420         raise exc.with_traceback(traceback)
    421 else:
    422     # this version of raise is a Syntax error in Python 3

TypeError: Cannot read object of type 'Response'

Answer

This has nothing to do with beautiful Köln...
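As a quick check that the URL is not the problem, the sketch below builds the request without sending it and prints the URL that requests would actually use. The ö is percent-encoded as UTF-8 (K%C3%B6ln), so the special character never reaches the server in raw form. The variable name `prepared` is just an illustrative choice.

```python
import requests

# Build the request object without sending it, so no network access is needed.
prepared = requests.Request("GET", "https://de.wikipedia.org/wiki/Köln").prepare()

# requests percent-encodes the non-ASCII character in the path for us.
print(prepared.url)
```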

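You need to change

wikitables = read_html(page,attrs={"..."})

to

wikitables = read_html(page.text,attrs={"..."})

and it should work. read_html expects a URL, a file-like object, or a string of HTML, not a requests Response object, which is why it raises "TypeError: Cannot read object of type 'Response'".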
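For reference, here is a minimal sketch of the whole flow with the fix applied. The helper names `tables_from_html` and `fetch_wikitables`, the User-Agent string, and the small `SAMPLE_HTML` snippet are made up for illustration; only the class string comes from the question, and whether the live page still uses exactly that class is an assumption. Wrapping the text in StringIO is optional on the pandas version shown in the traceback, but newer pandas releases warn when given a raw HTML string, so the sketch uses the file-like form.

```python
from io import StringIO

import pandas as pd
import requests

# Class string copied from the question; the live page may have changed since.
INFOBOX_ATTRS = {"class": "hintergrundfarbe5 float-right toptextcells infoBox"}

# Tiny made-up page used to demonstrate the parsing step offline.
SAMPLE_HTML = """
<table class="hintergrundfarbe5 float-right toptextcells infoBox">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>Bundesland</td><td>Nordrhein-Westfalen</td></tr>
</table>
<table class="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""


def tables_from_html(html, attrs=None):
    """Parse tables from an HTML string (hypothetical helper)."""
    # read_html gets text wrapped as a file-like object, never a Response.
    return pd.read_html(StringIO(html), attrs=attrs)


def fetch_wikitables(url, attrs=None):
    """Download a page and parse its tables (hypothetical helper)."""
    # The User-Agent value is a placeholder; replace it with your own.
    headers = {"User-Agent": "wikitable-example/0.1"}
    page = requests.get(url, headers=headers, timeout=30)
    page.raise_for_status()  # fail early on HTTP errors
    # page.text is the decoded HTML string, which is what read_html can parse.
    return tables_from_html(page.text, attrs=attrs)


# Offline demonstration: only the table with the matching class is returned.
wikitables = tables_from_html(SAMPLE_HTML, INFOBOX_ATTRS)
print("Extracted {num} wikitables".format(num=len(wikitables)))
print(wikitables[0])
```

Calling fetch_wikitables('https://de.wikipedia.org/wiki/Köln', INFOBOX_ATTRS) would run the same parsing against the live page. Note that read_html raises a ValueError when no table matches the given attrs, so a changed class name on the page shows up as an exception rather than an empty list.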