How to web scrape a Wikipedia infobox table?

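Problem description

I am trying to scrape a Wikipedia infobox and put the data into a dictionary, where the first column of the infobox is the key and the second column is the value. I also have to ignore all rows that do not have 2 columns. I am having trouble understanding how to get the value associated with a key. The Wikipedia page I want to scrape is https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347, and I want to extract the information from the first infobox.

The result should look like this: {"Name": "RMS Titanic", "Owner": "White Star Line", "Operator": "White Star Line", "Port of registry": "Liverpool, UK", "Route": "Southampton to New York City", ...}

Here is what I tried: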

    import requests
    from bs4 import BeautifulSoup

    def get_infobox(url):
       response = requests.get(url)
       bs = BeautifulSoup(response.text)

       table = bs.find('table',{'class' :'infobox'})
       result = {}
       row_count = 0
       if table is None:
         pass
       else:
         for tr in table.find_all('tr'):
             if tr.find('th'):
                 pass
             else:
                 row_count += 1
         if row_count > 1:
             if tr is not None:
               result[tr.find('td').text.strip()] = tr.find('td').text
         return result

    print(get_infobox("https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"))

Any help would be greatly appreciated!

Solution

If you don't need or want to use a scraper, you can use the API:

https://www.mediawiki.org/wiki/API:Main_page/de

The endpoint for the English Wikipedia is https://en.wikipedia.org/w/api.php

For example:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Titanic&rvsection=0
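
A minimal sketch of calling that URL from Python might look like the following. The helper names (`build_query_params`, `extract_wikitext`, `fetch_intro_wikitext`) are made up for illustration, and the response layout is an assumption: in the default JSON format the revision text is expected under the `"*"` key. Note that this query returns wikitext (template markup), not HTML, so the infobox fields would still have to be parsed out of the `{{Infobox ...}}` template.

```python
def build_query_params(title):
    """Return the same query parameters as the example URL above."""
    return {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
        "rvsection": "0",  # section 0 is the lead section
    }


def extract_wikitext(payload):
    """Pull the revision text out of a decoded API response, or return None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        for revision in page.get("revisions", []):
            # Assumption: the legacy layout keeps the text under "*";
            # with rvslots=main it sits under slots -> main -> "*".
            if "*" in revision:
                return revision["*"]
            main_slot = revision.get("slots", {}).get("main", {})
            if "*" in main_slot:
                return main_slot["*"]
    return None


def fetch_intro_wikitext(title, endpoint="https://en.wikipedia.org/w/api.php"):
    """Fetch the wikitext of the lead section (performs a network request)."""
    # Imported here so the helpers above work without requests installed.
    import requests

    response = requests.get(endpoint, params=build_query_params(title), timeout=10)
    response.raise_for_status()
    return extract_wikitext(response.json())
```

Calling `fetch_intro_wikitext("Titanic")` should return the lead section, infobox template included, as one string.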


Try this:

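import unicodedata
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"


def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text, "html.parser")

    # Collect the text of every <td> in the first infobox. The first cell is
    # dropped so that the remaining cells alternate label, value.
    cells = bs.find('table', {'class': 'infobox'}).find_all("td")
    return [t.getText() for t in cells][1:]


def parse_results(results):
    # Pair even-indexed cells (labels) with odd-indexed cells (values).
    # NFKD normalization turns non-breaking spaces into ordinary spaces.
    return {
        row_name.replace(":", ""): unicodedata.normalize("NFKD", row_data).strip()
        for row_name, row_data in zip(results[::2], results[1::2])
    }


print(parse_results(get_infobox(url)))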

Output:

{'Name': 'RMS Titanic','Owner': 'White Star Line','Operator': 'White Star Line','Port of registry': 'Liverpool,UK','Route': 'Southampton to New York City','Ordered': '17 September 1908','Builder': 'Harland and Wolff,Belfast','Cost': 'GB£1.5 million (£140 million in 2016)','Yard number': '401','Way number': '400','Laid down': '31 March 1909','Launched': '31 May 1911','Completed': '2 April 1912','Maiden voyage': '10 April 1912; 108 years ago (1912-04-10)','In service': '10–15 April 1912','Out of service': '15 April 1912','Identification': 'Official Number 131428[1]\nCode Letters HVMP[2]\n\nRadio call sign "MGY"','Fate': "Hit an iceberg 11:40 p.m. (ship's time) 14 April 1912 on her maiden voyage and sank 2 h 40 min later on 15 April 1912; 108 years ago (1912-04-15).",'Status': 'Wreck','Class and type': 'Olympic-class ocean liner','Tonnage': '46,328 GRT','Displacement': '52,310 tons','Length': '882 ft 9 in (269.1 m)','Beam': '92 ft 6 in (28.2 m)','Height': '175 ft (53.3 m) (keel to top of funnels)','Draught': '34 ft 7 in (10.5 m)','Depth': '64 ft 6 in (19.7 m)','Decks': '9 (A–G)','Installed power': '24 double-ended and five single-ended boilers feeding two reciprocating steam engines for the wing propellers,and a low-pressure turbine for the centre propeller;[3] output: 46,000 HP','Propulsion': 'Two three-blade wing propellers and one three-blade centre propeller','Speed': 'Cruising: 21 kn (39 km/h; 24 mph). Max: 23 kn (43 km/h; 26 mph)','Capacity': 'Passengers: 2,435,crew: 892. Total: 3,327 (or 3,547 according to other sources)','Notes': 'Lifeboats: 20 (sufficient for 1,178 people)'}
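
The code above pairs up every `<td>` in the table, which only works while the cells strictly alternate between label and value. A row-by-row variant may be closer to what the question asks for (first column as key, second as value, rows without exactly 2 columns ignored). This is only a sketch: `parse_infobox` is a made-up name, it takes the HTML as a string so it can be tried without a network request, and it assumes the label and value are direct `<th>`/`<td>` children of each `<tr>`.

```python
import unicodedata

from bs4 import BeautifulSoup


def _clean(text):
    """Normalize non-breaking spaces and trim surrounding whitespace."""
    return unicodedata.normalize("NFKD", text).strip()


def parse_infobox(html):
    """Return {label: value} for every 2-cell row of the first infobox."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    result = {}
    if table is None:
        return result

    for tr in table.find_all("tr"):
        # Only look at the row's own cells, not cells of nested tables.
        cells = tr.find_all(["th", "td"], recursive=False)
        if len(cells) != 2:
            continue  # skip image rows, section headers, and the like
        key = _clean(cells[0].get_text()).rstrip(":").strip()
        value = _clean(cells[1].get_text())
        if key:
            result[key] = value
    return result
```

With the page from the question it could be called as `parse_infobox(requests.get(url).text)`. If the infobox contains nested tables, their 2-cell rows would be picked up as well, so some filtering may still be needed.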
