BeautifulSoup4 returns None even though the tag exists

Problem description

I am following a tutorial on Python 3 and BeautifulSoup. For the following code

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())
bs = BeautifulSoup(html.read(),'html.parser')
print("\n\n-----H1 content after this-----")
print(bs.h1)

I get:

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet,consectetur adipisicing elit,sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


-----H1 content after this-----
None

Since the h1 tag exists, None is unexpected. I get exactly the same result with print(bs.find("h1")). How can I get the content of the h1 tag?

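Solution

The first html.read() call (inside print) consumes the whole response body, so the second html.read() returns empty bytes and BeautifulSoup parses an empty document. Read the body only once, or fetch the page with requests and parse r.content: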

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content,'html.parser')
    print(soup.find("h1").text)


main("http://pythonscraping.com/pages/page1.html")

Output:

An Interesting Title
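
To see why the code in the question printed None, here is a minimal sketch of the read-once behaviour. It uses io.BytesIO as a hypothetical stand-in for the object returned by urlopen (an assumption made so the example runs without network access); the names page, first, second and body are illustrative only.

```python
import io

# Hypothetical stand-in for the urlopen() response: like an HTTP response
# body, a BytesIO stream is read forward and is empty once consumed.
page = b"<html><body><h1>An Interesting Title</h1></body></html>"
html = io.BytesIO(page)

first = html.read()   # consumes the entire stream
second = html.read()  # nothing is left, so this is empty bytes

print(first == page)  # True
print(second)         # b''

# Fix: read once, keep the bytes in a variable, and reuse that variable
# for both printing and parsing.
html = io.BytesIO(page)
body = html.read()
print(body == page)   # True
```

Applied to the code in the question, this would mean storing html.read() in a variable once and passing that variable to both print() and BeautifulSoup(), instead of calling html.read() twice.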