BS4: How to reduce find_all calls to a minimum by ignoring elements instead of extracting them

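Problem description

For later operations I need to ignore comments and the doctype (some characters will be replaced later on, and after that I can no longer tell comments and the doctype apart from ordinary text).

Minimal example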

#!/usr/bin/env python3
import re
from bs4 import BeautifulSoup,Comment,Doctype


def is_toremove(element):
    # True for the node types that should be skipped later on
    return isinstance(element, (Comment, Doctype))


def test1():
    html = \
    '''
    <!DOCTYPE html>
    word1 word2 word3 word4
    <!-- A comment -->
    '''
    soup = BeautifulSoup(html,features="html.parser")
    to_remove = soup.find_all(text=is_toremove)
    for element in to_remove:
        element.extract()

    # some operations needing soup.findAll
    for txt in soup.findAll(text=True):
        # some replace computations
        pass
    return soup
print(test1())

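The expected result is "word1 word2 word3 word4" with the replace computations applied. It works, but I don't think it is very efficient. I would like to do something like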

for txt in soup.findAll(text=not is_toremove()):

so that the loop only works on the parts that are not to be removed.

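So my questions are:

  1. Is there some internal magic that lets you call findAll twice without it being inefficient, or
  2. how can I merge both calls into a single find_all?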

I also tried going through the parent tag, for example

if not isinstance(txt, Doctype):

if txt.parent.name != "[document]":

This did not change anything in my main program.
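A small check, assuming the markup from the minimal example and the html.parser builder, suggests why the parent test cannot separate the pieces: every top-level node, including the wanted text, has the BeautifulSoup object itself as its parent, and that object's name is "[document]". The variable name parent_names is illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
"""

soup = BeautifulSoup(html, features="html.parser")

# Collect the parent name of every string node in the tree.
parent_names = {node.parent.name for node in soup.find_all(string=True)}
print(parent_names)
```

With this input all string nodes share the same parent, so the check rejects the wanted text as well. Other parsers such as lxml may wrap loose text in extra tags, so the result can differ there.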

Solution

As noted in the comments, if you only want the plain NavigableString objects, you can do the following:

from bs4 import BeautifulSoup,NavigableString


html = '''
<!DOCTYPE html>
word1 word2 word3 word4
<!-- A comment -->
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html,'lxml')

for visible_string in soup.find_all(text=is_string_only):
    print(visible_string)
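
If the goal is closer to the "not is_toremove" idea from the question, the predicate can be negated inside a single find_all call, so the tree is searched once instead of twice. The following is a minimal sketch, assuming the html.parser builder and slightly different sample markup; is_kept is a hypothetical helper name, and the upper-casing is only a placeholder for the real replace computations.

```python
from bs4 import BeautifulSoup, Comment, Doctype

html = """<!DOCTYPE html>
<p>word1 word2 <!-- A comment -->word3 word4</p>
"""

def is_kept(node):
    # Comment and Doctype are NavigableString subclasses, so they must be
    # excluded explicitly; every other string node is kept.
    return not isinstance(node, (Comment, Doctype))

soup = BeautifulSoup(html, features="html.parser")

# find_all returns a list, so replacing nodes while looping over it is safe.
for node in soup.find_all(string=is_kept):
    node.replace_with(node.upper())  # placeholder for the real replacement

print(soup)
```

Unlike the type(t) is NavigableString test, this variant also keeps other NavigableString subclasses such as CData. Which behaviour is preferable depends on the documents being processed.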