Problem Description
I am scraping the Apply Now and Learn More URLs from this page:
https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re
AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']
html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})
for x in address.find_all('a', id='html-link'):
    print(x)
The links I am getting are not valid:
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&intlink=in-amex-cardshop-allcards-apply-AmericanExpressplatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressplatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&intlink=in-amex-cardshop-allcards-apply-AmericanExpressplatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressplatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>
Below is an image of the HTML from which I am trying to get the Apply Now and Learn More URLs.
I would like to know what changes the code needs so that I get the Apply Now and Learn More URLs for all 7 cards at once.
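One thing worth noting about the output above: the Apply Now hrefs are absolute, while the Learn More hrefs are relative, which is why they look invalid when printed on their own. Here is a minimal sketch that normalizes both, reusing the address variable from the code above and assuming the site's /in/ base path (the answers below make the same assumption):

from urllib.parse import urljoin

base = 'https://www.americanexpress.com/in/'  # assumed base path, not confirmed by the page
for x in address.find_all('a', id='html-link'):
    # urljoin keeps absolute hrefs as-is and resolves relative ones against base
    print(urljoin(base, x['href']))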
Solution
You can modify this to use your list and syntax, but it gets the links I believe you want. Note that a plain find_all('a') does not get the desired content on its own, but combining find_all with a string match and href=True, then taking the first link of each kind, does:
import re
import requests
from bs4 import BeautifulSoup

nurl = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
npage = requests.get(nurl)
nsoup = BeautifulSoup(npage.text, "html.parser")

# Take only the first "Apply Now" link; the carousel repeats each one
for link in nsoup.find_all('a', string=re.compile('Apply Now'), href=True)[0:1]:
    print(link.get('href'))

# "Learn More" hrefs are relative, so prepend the site base
for link in nsoup.find_all('a', string=re.compile('Learn'), href=True)[0:1]:
    print('https://www.americanexpress.com/in/' + link.get('href'))
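If you want every Apply Now and Learn More link rather than just the first of each, one variation (my sketch; note the second answer below points out that the static HTML may not contain all seven cards) is to drop the [0:1] slice and deduplicate, since the carousel markup repeats each link:

seen = set()
for link in nsoup.find_all('a', string=re.compile('Apply Now|Learn'), href=True):
    href = link.get('href')
    if href not in seen:  # the carousel repeats each link, so keep one copy
        seen.add(href)
        print(href)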
Alternatively: the URLs you are looking for are not all stored in the HTML. A further request is needed, which returns the information as JSON, and that request also requires a session ID taken from the page. For example:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

# Pull the session ID out of the inline <script> that embeds the page data
for script in soup.find_all('script'):
    if script.contents and "intlUserSessionId" in script.contents[0]:
        json_raw = script.contents[0][script.contents[0].find('{'):]
        json_data = json.loads(json_raw)
        session_id = json_data["pageData"]["pageValues"]["intlUserSessionId"]

# Request the full card list from the card-shop API, passing the session ID
url2 = 'https://acquisition-1.americanexpress.com/api/acquisition/digital/v1/shop/us/cardshop-api/api/v1/intl/content/compare-cards/in/default'
r2 = requests.get(url2, params={'sessionId': session_id})
json_data = r2.json()

for entry in json_data:
    # The first call-to-action group holds the Apply Now link
    cta_group = entry["ctaGroup"][0]
    click_url = cta_group['clickUrl']
    print(f"{cta_group['text']} - {click_url}")
    # learnMore holds the (relative) Learn More link
    learn_more = entry['learnMore']['ctaGroup'][0]
    print(f"{learn_more['text']} - {learn_more['clickUrl']}")
This prints the Apply Now and Learn More links for each card. The Learn More URLs are relative, so the site's base URL needs to be prepended.
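For example, a small extension of the loop above (a sketch, continuing from the json_data variable and assuming the JSON shape shown in the code, with the /in/ base path carried over from the first answer) that collects one absolute Apply/Learn pair per card:

base = 'https://www.americanexpress.com/in/'  # assumed base path
cards = []
for entry in json_data:
    apply_cta = entry["ctaGroup"][0]
    learn_cta = entry["learnMore"]["ctaGroup"][0]
    cards.append({
        "apply": apply_cta["clickUrl"],      # already absolute in the sample output
        "learn": base + learn_cta["clickUrl"],  # relative, so prepend the base
    })
print(len(cards))  # expect 7 if the API returns every card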