A Detailed Guide to the BeautifulSoup Parsing Library

BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.

With it, you can extract information from web pages without writing regular expressions.
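As a rough illustration of that point, the sketch below collects every link target from a small HTML string without any regular expression. The HTML snippet and the example.com URLs are made up for this example, and html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only
html = '<p>Links: <a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a></p>'

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; tag["href"] reads the attribute value
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['http://example.com/a', 'http://example.com/b']
```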

Installation: pip3 install beautifulsoup4
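One way to confirm that the installation worked is to import the package and print its version. Note that the package is installed under the name beautifulsoup4 but imported as bs4.

```python
import bs4

# Prints the installed Beautiful Soup version string
print(bs4.__version__)
```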

Usage in detail:

Parsers supported by BeautifulSoup

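Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser currently supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires the external html5lib Python package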
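The differences between the parsers are easiest to see on malformed input. The sketch below feeds the same broken fragment to each HTML parser and prints what comes back. The fragment is made up for illustration. Because lxml and html5lib are optional installs, the code catches FeatureNotFound, which Beautiful Soup raises when a requested parser is missing, and skips that parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up fragment: an opening <a> followed by a stray closing </p>
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # The requested parser is not installed in this environment
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "(not installed)")
```

With html.parser the stray closing tag is dropped and the result is `<a></a>`. The other parsers, when installed, wrap the fragment in a fuller document, so the same input can produce a different tree depending on the parser chosen.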

Basic usage:

import bs4
from bs4 import BeautifulSoup

# Below is a fragment of incomplete HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html, 'lxml')

# Fill in the missing tags, i.e. let the parser repair the markup
print(soup.prettify())

# Select the title tag and print its text
print(soup.title.string)

Output:
<html>
 <head>
  <title>
   The Demouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Domouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters,and their name were
   <a class="sister" href="http://examlpe.com/elele" ld="link1">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/lacie" ld="link2">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/title" ld="link3">
    <title>
    </title>
   </a>
   and they lived the bottom of a wall
  </p>
  <p clas="stuy">
   ..
  </p>
 </body>
</html>
The Demouse's story
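The repair step can also be seen on a much smaller input. The following is a minimal sketch with a made-up fragment whose tags are never closed. It uses html.parser rather than lxml so that it runs without extra installs, which means no html or body wrapper is added.

```python
from bs4 import BeautifulSoup

# Made-up fragment: neither <p> nor <b> is ever closed
fragment = "<p>unclosed <b>bold"

soup = BeautifulSoup(fragment, "html.parser")

# Converting the tree back to a string shows the closing tags that were added
repaired = str(soup)
print(repaired)       # <p>unclosed <b>bold</b></p>
print(soup.b.string)  # bold
```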

Tag selectors:

Selecting elements
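A minimal sketch of selecting elements by tag name, assuming a small made-up document. Accessing a tag name as an attribute of the soup object returns the first matching tag, or None if there is no such tag.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration only
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="title" name="first"><b>Bold heading</b></p>
<p class="story">Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)        # <title>Sample page</title>
print(type(soup.title))  # a bs4 Tag object, not a plain string
print(soup.head)         # the whole <head> element, including <title>

# Only the first <p> is returned, even though the document has two
first_p = soup.p
print(first_p["class"])  # ['title']
print(first_p.b.string)  # Bold heading
```

Because only the first match is returned, this style suits tags that appear once, such as title or head. For repeated tags, find_all is the usual choice.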
