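Problem description
I have the following code, in which I create a new wrapper tag that holds the content of a data attribute. That content can be plain text, a simple tag, or a more complex tag with its own attributes.
The first two cases (plain text and a simple tag) work, but the complex tag is only copied up to the quote before the class name. What am I doing wrong?
I feel like I need to use re.compile, but I have had no success with it.
Thanks for your help.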
from bs4 import BeautifulSoup
import re

data = """<div data-value="Just Basic Text"></div>
<p>This one contains a simple tag</p><div data-value="<simpletagname></simpletagname>"></div>
<p>This one contains a simple tag with a class and text</p>
<div data-value="<h6 class="myclass">More Basic Text</h6>"></div>"""
soup = BeautifulSoup(data, "lxml")
for div in soup.select('div[data-value]'):
    # insert the wrapper tag after the div
    sup = soup.new_tag('wrappertagname')
    sup.string = div['data-value']
    div.insert_after(sup)
    # replace the div tag with its contents
    div.unwrap()
print(soup.prettify(formatter=None))
Incorrect output (h6)
<html>
<body>
<wrappertagname>
Just Basic Text
</wrappertagname>
<p>
This one contains a simple tag
</p>
<wrappertagname>
<simpletagname></simpletagname>
</wrappertagname>
<p>
This one contains a simple tag with a class and text
</p>
More Basic Text">
<wrappertagname>
<h6 class=
</wrappertagname>
</body>
</html>
Expected output
<html>
<body>
<wrappertagname>
Just Basic Text
</wrappertagname>
<p>
This one contains a simple tag
</p>
<wrappertagname>
<simpletagname></simpletagname>
</wrappertagname>
<p>
This one contains a simple tag with a class and text
</p>
<wrappertagname>
<h6 class="myclass">More Basic Text</h6>
</wrappertagname>
</body>
</html>
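Update
If I replace the inner double quotes with character entities, the output is correct (see the updated input below).
So the original input must not be valid HTML: the unescaped double quotes end the attribute value early, and Beautiful Soup cannot recover the intended value from it.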
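To illustrate the point about the quotes, here is a minimal sketch that uses only the standard library rather than Beautiful Soup, on the assumption that any HTML parser ends a double-quoted attribute value at the next double quote. `DataValueCollector` and `parsed_data_values` are hypothetical helper names introduced for this example. Note that `html.escape` also escapes `<` and `>`, which is more than the update above does, but the parser decodes all of them back.

```python
from html import escape
from html.parser import HTMLParser


class DataValueCollector(HTMLParser):
    """Collect every data-value attribute the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-value":
                self.values.append(value)


def parsed_data_values(markup):
    """Return the data-value attribute values found in markup."""
    collector = DataValueCollector()
    collector.feed(markup)
    collector.close()
    return collector.values


inner = '<h6 class="myclass">More Basic Text</h6>'

# Unescaped: the attribute value stops at the first inner double quote,
# so the parser never sees the full intended value.
raw_markup = '<div data-value="' + inner + '"></div>'
raw_values = parsed_data_values(raw_markup)

# Escaped: html.escape turns " into &quot; (and < > into &lt; &gt;);
# the parser decodes the entities again when it reads the attribute.
escaped_markup = '<div data-value="' + escape(inner, quote=True) + '"></div>'
escaped_values = parsed_data_values(escaped_markup)

print(raw_values)
print(escaped_values)
```

Only the escaped variant gives back the original string, which matches the behaviour described in the update.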
Updated input
from bs4 import BeautifulSoup
import re

data = """<div data-value="Just Basic Text"></div>
<p>This one contains a simple tag</p><div data-value="<simpletagname></simpletagname>"></div>
<p>This one contains a simple tag with a class and text</p>
<div data-value="<h6 class=&quot;myclass&quot;>More Basic Text</h6>"></div>"""
soup = BeautifulSoup(data, "lxml")
for div in soup.select('div[data-value]'):
    # insert the wrapper tag after the div
    sup = soup.new_tag('wrappertagname')
    sup.string = div['data-value']
    div.insert_after(sup)
    # replace the div tag with its contents
    div.unwrap()
print(soup.prettify(formatter=None))
Updated output
<html>
<body>
<wrappertagname>
Just Basic Text
</wrappertagname>
<p>
This one contains a simple tag
</p>
<wrappertagname>
<simpletagname></simpletagname>
</wrappertagname>
<p>
This one contains a simple tag with a class and text
</p>
<wrappertagname>
<h6 class="myclass">More Basic Text</h6>
</wrappertagname>
</body>
</html>
(If anyone has other ideas on how I could avoid writing the character entities by hand, I would greatly appreciate it.)
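One possible way to avoid typing the entities by hand is to escape the attribute values with a regular expression before parsing. The sketch below is not a general solution. It assumes that `data-value` is the only attribute on an empty `<div>`, so each value runs from the opening quote up to the literal text `"></div>`, and that the input has not already been escaped (otherwise `&` would be escaped a second time). `DATA_VALUE` and `escape_data_values` are hypothetical names.

```python
import re
from html import escape

# Assumption: data-value is the only attribute on an empty <div>, so the value
# extends from the opening quote to the first occurrence of "></div>.
DATA_VALUE = re.compile(r'(<div data-value=")(.*?)("></div>)', re.DOTALL)


def escape_data_values(markup):
    """Escape quotes and angle brackets inside each data-value attribute."""
    def fix(match):
        # Keep the surrounding markup, escape only the attribute value.
        return match.group(1) + escape(match.group(2), quote=True) + match.group(3)
    return DATA_VALUE.sub(fix, markup)


data = """<div data-value="Just Basic Text"></div>
<p>This one contains a simple tag</p><div data-value="<simpletagname></simpletagname>"></div>
<p>This one contains a simple tag with a class and text</p>
<div data-value="<h6 class="myclass">More Basic Text</h6>"></div>"""

cleaned = escape_data_values(data)
print(cleaned)
```

The `cleaned` string could then be passed to `BeautifulSoup(cleaned, "lxml")` in place of `data`, with the rest of the loop unchanged. If the real markup has other attributes or nested content inside those divs, the pattern would need to be adapted.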