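Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.
With it you can extract information from a web page without writing regular expressions.
Installation: pip3 install beautifulsoup4
Usage in detail:
Parsers supported by Beautiful Soup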
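Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents | Slow; requires installing the html5lib package (pure Python, no C extension) |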
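The parser names in the table can be tried in order of preference. The sketch below is only an illustration: `make_soup` is a hypothetical helper (it is not part of Beautiful Soup), and the preference order is an assumption. It relies on `FeatureNotFound`, the exception Beautiful Soup raises when the requested parser is not installed, and falls back to the built-in `html.parser`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse markup with the first available parser.

    Returns a (soup, parser_name) tuple so the caller can see which
    parser was actually used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# The <p> tag is never closed; every parser in the table repairs that.
soup, parser = make_soup("<p>unclosed paragraph")
print(parser)
print(soup.p.get_text())
```

Because `html.parser` ships with Python, the loop always has a last resort, so the same script runs whether or not lxml or html5lib is installed.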
Basic usage:
from bs4 import BeautifulSoup

# Below is an incomplete piece of HTML (several tags are never closed)
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''
soup = BeautifulSoup(html, 'lxml')
# Complete the markup (error-tolerant parsing) and pretty-print it
print(soup.prettify())
# Select the title tag and print its content
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Demouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Domouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters,and their name were
   <a class="sister" href="http://examlpe.com/elele" ld="link1">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/lacie" ld="link2">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/title" ld="link3">
    <title>
    </title>
   </a>
   and they lived the bottom of a wall
  </p>
  <p clas="stuy">
   ..
  </p>
 </body>
</html>
The Demouse's story
Tag selectors:
Selecting elements