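Extracting integers from a paragraph

Problem description

I'm trying to extract only the fee amounts from a paragraph, but I'm running into problems. There are two fee amounts, and I want both of them. The page is http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx and this is my code: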

fees_div = soup.find('div',class_='Fees hiddenContent pad-around-large tabcontent')
if fees_div:
    fees_list = fees_div.find_all('\d+','p')
    course_data['Fees'] = fees_list
    print('fees : ',fees_list)

Solution

Try this:

In [10]: import requests
In [11]: from bs4 import *
In [12]: page = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
In [13]: soup = BeautifulSoup(page.content,'html.parser')
In [14]: [x for x in soup.find('div',class_='Fees hiddenContent pad-around-large tabcontent').text.split() if u"\xA3" in x]
Out[14]: ['£9,250*','£17,320']
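The tokens above still carry the pound sign, the thousands separator and, in one case, a trailing asterisk. If actual integers are wanted, a minimal sketch along the following lines could convert them. The helper name `to_int` is made up for illustration, and the input list is simply the output shown above.

```python
import re

def to_int(token):
    """Drop every non-digit character (e.g. '£', ',', '*') and convert to int."""
    digits = re.sub(r'\D', '', token)
    return int(digits)

tokens = ['£9,250*', '£17,320']  # tokens copied from the output above
fees = [to_int(t) for t in tokens]
print(fees)  # [9250, 17320]
```

This assumes every token contains at least one digit; an empty result from `re.sub` would make `int` raise a `ValueError`.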

import requests
from bs4 import BeautifulSoup
import re

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(r.text,'html.parser')
fees_div = soup.find('div',class_='Fees hiddenContent pad-around-large tabcontent')
m = re.search(r'£[\d,]+',fees_div.select('p:nth-of-type(2)')[0].get_text())
fee1 = m[0]
m = re.search(r'£[\d,]+',fees_div.select('p:nth-of-type(3)')[0].get_text())
fee2 = m[0]
print(fee1,fee2)

Prints:

£9,250 £17,320
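One caveat: `re.search` returns `None` when nothing matches, so `m[0]` raises a `TypeError` if the page layout ever changes. A more defensive variant might look like the sketch below. `first_fee` is a hypothetical helper, and the sample strings are made up rather than taken from the real page.

```python
import re

FEE_PATTERN = re.compile(r'£[\d,]+')

def first_fee(text):
    """Return the first '£...' amount found in text, or None if there is none."""
    m = FEE_PATTERN.search(text)
    return m.group(0) if m else None

print(first_fee('New UK students: £9,250* per year'))  # £9,250
print(first_fee('Fees to be confirmed'))               # None
```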

Update

You can also scrape the page with Selenium, although it offers no advantage in this case. For example (using Chrome):

from selenium import webdriver
from bs4 import BeautifulSoup
import re


options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches',['enable-logging'])
driver = webdriver.Chrome(options=options)

driver.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(driver.page_source,'html.parser')
driver.quit()

Update

Consider doing just the following: skip BeautifulSoup entirely and scan the whole HTML source for the fees with a simple regular-expression `findall`:

import requests
import re

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
print(re.findall(r'£[\d,]+',r.text))

Prints:

['£9,250', '£17,320']

Give this a try:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(r.text,'html.parser')
item = soup.find(id='Panel5').text
fees = re.findall(r"students:[^£]+(.*?)[*\s]",item)
print(fees)

Output:

['£9,250', '£17,320']
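To see how that pattern works without hitting the network, here is a small sketch run against made-up text. The wording of `item` below is an assumption standing in for `soup.find(id='Panel5').text`, not the real page content. The pattern anchors on the literal `students:`, skips everything that is not a `£` with `[^£]+`, lazily captures from the `£` onward with `(.*?)`, and stops at the first asterisk or whitespace character.

```python
import re

# Made-up text standing in for soup.find(id='Panel5').text
item = (
    "New UK/Republic of Ireland students: £9,250* per year\n"
    "New international students: £17,320 per year\n"
)

# Capture the amount that follows each "students:" label
fees = re.findall(r"students:[^£]+(.*?)[*\s]", item)
print(fees)  # ['£9,250', '£17,320']
```

Because the match is tied to the `students:` label, other `£` amounts elsewhere in the text would be ignored, unlike a bare `£[\d,]+` scan of the whole page.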
