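Problem description
I'm trying to scrape a Wikipedia infobox and put the data into a dictionary, where the first column of the infobox is the key and the second column is the value. I also have to ignore all rows that do not have 2 columns. I'm having trouble understanding how to get the value associated with each key. The Wikipedia page I'm scraping is https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347, and I want to extract the information from the first infobox.
The result should look like this: {"Name": "RMS Titanic", "Owner": "White Star Line", "Operator": "White Star Line", "Port of registry": "Liverpool, UK", "Route": "Southampton to New York City", ...}
Here is what I tried: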
import requests
from bs4 import BeautifulSoup

def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)
    table = bs.find('table', {'class': 'infobox'})
    result = {}
    row_count = 0
    if table is None:
        pass
    else:
        for tr in table.find_all('tr'):
            if tr.find('th'):
                pass
            else:
                row_count += 1
                if row_count > 1:
                    if tr is not None:
                        result[tr.find('td').text.strip()] = tr.find('td').text
    return result

print(get_infobox("https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"))
Any help would be greatly appreciated!
Solution
If you don't need or don't want to use a scraper, you can use the MediaWiki API instead:
https://www.mediawiki.org/wiki/API:Main_page/de
The endpoint for the English Wikipedia is https://en.wikipedia.org/w/api.php
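As a rough illustration of the API route, the sketch below builds an `action=parse` request for the revision from the question (`oldid=981851347`) and returns the rendered HTML of that revision. The function names and the User-Agent string are placeholders invented for this example, and the code uses only the standard library. Treat it as a starting point, not a definitive client.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(oldid):
    """Build an action=parse URL asking for the rendered HTML of one revision."""
    params = {
        "action": "parse",
        "oldid": oldid,
        "prop": "text",
        "format": "json",
        "formatversion": 2,  # with version 2, parse.text is a plain string
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_revision_html(oldid):
    """Fetch the rendered HTML of a revision (performs a network request)."""
    request = urllib.request.Request(
        build_parse_url(oldid),
        # Placeholder User-Agent (an assumption); use one that identifies your script.
        headers={"User-Agent": "infobox-example/0.1 (contact: you@example.com)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]
```

Calling `fetch_revision_html(981851347)` should return an HTML string that contains the infobox table, which can then be handed to an HTML parser such as BeautifulSoup.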
Alternatively, if you want to keep scraping, try this:
import unicodedata

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"

def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text, "html.parser")
    # Collect the text of every <td> cell in the first infobox table.
    table = bs.find('table', {'class': 'infobox'}).find_all("td")
    # Drop the first cell, which is not part of a key/value pair.
    return [t.getText() for t in table][1:]

def parse_results(results):
    # Pair even-indexed cells (keys) with odd-indexed cells (values).
    return {
        row_name.replace(":", ""): unicodedata.normalize("NFKD", row_data).strip()
        for row_name, row_data in zip(results[::2], results[1::2])
    }

print(parse_results(get_infobox(url)))
Output:
{'Name': 'RMS Titanic', 'Owner': 'White Star Line', 'Operator': 'White Star Line', 'Port of registry': 'Liverpool, UK', 'Route': 'Southampton to New York City', 'Ordered': '17 September 1908', 'Builder': 'Harland and Wolff, Belfast', 'Cost': 'GB£1.5 million (£140 million in 2016)', 'Yard number': '401', 'Way number': '400', 'Laid down': '31 March 1909', 'Launched': '31 May 1911', 'Completed': '2 April 1912', 'Maiden voyage': '10 April 1912; 108 years ago (1912-04-10)', 'In service': '10–15 April 1912', 'Out of service': '15 April 1912', 'Identification': 'Official Number 131428[1]\nCode Letters HVMP[2]\n\nRadio call sign "MGY"', 'Fate': "Hit an iceberg 11:40 p.m. (ship's time) 14 April 1912 on her maiden voyage and sank 2 h 40 min later on 15 April 1912; 108 years ago (1912-04-15).", 'Status': 'Wreck', 'Class and type': 'Olympic-class ocean liner', 'Tonnage': '46,328 GRT', 'Displacement': '52,310 tons', 'Length': '882 ft 9 in (269.1 m)', 'Beam': '92 ft 6 in (28.2 m)', 'Height': '175 ft (53.3 m) (keel to top of funnels)', 'Draught': '34 ft 7 in (10.5 m)', 'Depth': '64 ft 6 in (19.7 m)', 'Decks': '9 (A–G)', 'Installed power': '24 double-ended and five single-ended boilers feeding two reciprocating steam engines for the wing propellers, and a low-pressure turbine for the centre propeller;[3] output: 46,000 HP', 'Propulsion': 'Two three-blade wing propellers and one three-blade centre propeller', 'Speed': 'Cruising: 21 kn (39 km/h; 24 mph). Max: 23 kn (43 km/h; 26 mph)', 'Capacity': 'Passengers: 2,435, crew: 892. Total: 3,327 (or 3,547 according to other sources)', 'Notes': 'Lifeboats: 20 (sufficient for 1,178 people)'}
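Pairing every second cell works for this page, but any row with one or three cells would shift all later keys and values. A row-by-row variant that skips rows without exactly two cells, as the question requires, might look like the sketch below. The name `infobox_to_dict` and the whitespace-collapsing step are choices made for this example and are not part of the answer above.

```python
import unicodedata

from bs4 import BeautifulSoup


def clean(cell):
    """Return a cell's text with Unicode normalized and whitespace collapsed."""
    text = unicodedata.normalize("NFKD", cell.get_text())
    return " ".join(text.split())


def infobox_to_dict(html):
    """Map first-column text to second-column text for two-cell infobox rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    result = {}
    if table is None:
        return result
    for tr in table.find_all("tr"):
        # Only direct children count, so cells of nested tables are ignored here.
        cells = tr.find_all(["th", "td"], recursive=False)
        if len(cells) != 2:
            continue  # skip image rows, section headers, and other odd rows
        key = clean(cells[0]).rstrip(":")
        result[key] = clean(cells[1])
    return result
```

It takes an HTML string, so it can be called as `infobox_to_dict(requests.get(url).text)`. Note that collapsing whitespace also removes the embedded newlines seen in values such as 'Identification' above.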