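Problem description
In later operations I need to ignore comments and the doctype, because some characters will be replaced later and after that I can no longer tell comments and the doctype apart from ordinary text.
Minimal example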
#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup, Comment, Doctype

def is_toremove(element):
    return isinstance(element, Comment) or isinstance(element, Doctype)

def test1():
    html = \
    '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''
    soup = BeautifulSoup(html, features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()
    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup

print(test1())
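The expected result is "word1 word2 word3 word4" with the replace computations applied. This works, but I don't think it is very efficient. I would like to do something like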
for txt in soup.findAll(text=not is_toremove()):
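so that the loop only runs over the parts that were not removed.
So my question is: how can I iterate over only the text that is neither a comment nor a doctype?
I also tried using the parent tag:
if not isinstance(txt, Doctype):
or
if txt.parent.name != "[document]":
for example. This did not change anything in my main program.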
Solution
As stated in the comments, if you only want the plain NavigableString objects, you can do the following:
from bs4 import BeautifulSoup, NavigableString

html = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')
for visible_string in soup.find_all(text=is_string_only):
    print(visible_string)
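
This works because Comment and Doctype are subclasses of NavigableString, so an exact type check skips them where isinstance would not. If the goal is to run the replace computations in that same pass, something along these lines might do. It is only a sketch: replace_plain_text is a hypothetical helper name, the sample markup is made up for illustration, and it uses the built-in html.parser and the string= keyword (the newer spelling of text=) rather than lxml.

```python
from bs4 import BeautifulSoup, NavigableString


def is_plain_text(node):
    # Comment and Doctype subclass NavigableString, so compare the exact
    # type instead of using isinstance.
    return type(node) is NavigableString


def replace_plain_text(html, old, new):
    """Hypothetical helper: apply str.replace to plain text nodes only."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list built up front, so replacing nodes while
    # looping over it does not disturb the iteration.
    for node in soup.find_all(string=is_plain_text):
        node.replace_with(node.replace(old, new))
    return soup


# Illustrative input (an assumption, not the asker's real document).
html = "<!DOCTYPE html>\n<p>word1 word2</p>\n<!-- word1 in a comment -->"
result = replace_plain_text(html, "word1", "WORD")
print(result)
```

The comment and the doctype are never visited, so the "word1" inside the comment is left as it is and nothing has to be extracted from the tree first.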