Using Python's index function on an HTML document

Problem description

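The script below works through an HTML document. First, it replaces certain text with different text; that part works fine. Next, it splits the text between the <font> tags into groups and puts each group in a list ([]). That also works as expected. Then it falls over. Using those groups, it should compare each group's Entity Id..., and if it finds a match/duplicate, I want it to remove the group concerned. Unfortunately, the script does not remove the duplicate groups, and there are plenty of them, in some cases more than 100 duplicates. They are dynamic, so many different Entity Id...xxx values may each have duplicates. I don't want to remove every group that has a duplicate; I want to keep one group for each duplicated Entity Id... Is my problem in this part?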

for group in groups[1:]:
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)

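As I understand it, it uses group[1], the last Entity Id...123456 in the document, as the lookup value against groups[groups.index(group)-1][1], which, through the index function, would cover every Entity Id in the document. But since group[1] is not inside the index function, does it only use the last Entity Id... as the lookup? Do I need to put the if group[1] == part inside index?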
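To make the behaviour of that loop concrete, here is a small sketch that runs it unchanged on toy data. The five groups are made up for illustration (two lines each, with the Entity Id line at index 1, which is an assumption about the real layout). `groups.index(group) - 1` is simply the position of the group immediately before the current one, so each group is only ever compared with its direct predecessor. A duplicate that is not adjacent to an earlier copy survives.

```python
# Toy data (hypothetical): each group is [note_line, entity_line].
groups = [
    ["Note Id..... 1", "Entity Id... 123456"],
    ["Note Id..... 2", "Entity Id... 123456"],  # adjacent duplicate of group 1
    ["Note Id..... 3", "Entity Id... 123451"],
    ["Note Id..... 4", "Entity Id... 123456"],  # non-adjacent duplicate of group 1
    ["Note Id..... 5", "Entity Id... 000001"],
]

# The loop from the question, unchanged.
for group in groups[1:]:
    # Compares only with the group directly before the current one.
    if group[1] == groups[groups.index(group) - 1][1]:
        groups.remove(group)

remaining = [g[1] for g in groups]
print(remaining)
# Group 2 is removed, but group 4 stays because its predecessor is group 3.
```

Keeping a set of the Entity Id values already seen, rather than looking only at the previous group, avoids this.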

from bs4 import BeautifulSoup
import re
import urllib
import os

svdir = ''
filename = 'something.html'


with open(r'something.html','r') as f:
   html_string = f.read()


soup = BeautifulSoup(html_string)

target = soup.find_all(text=re.compile(r'Serial#.........'))
for v in target:                                
    v.replace_with(v.replace('Serial#.........','Note Id.....'))

target1 = soup.find_all(text=re.compile(r'Cust#...........'))
for v in target1:
    v.replace_with(v.replace('Cust#...........','Entity Id...'))

target2 = soup.find_all(text=re.compile(r'Customer Name...'))
for v in target2:
    v.replace_with(v.replace('Customer Name...','Entity Name.'))

target3 = soup.find_all(text=re.compile(r'BILL TO NO NAME.'))
for v in target3:
    v.replace_with(v.replace('BILL TO NO NAME.','Note Detail.'))

target4 = soup.find_all(text=re.compile(r'FIXED DATE......'))
for v in target4:
    v.replace_with(v.replace('FIXED DATE......','Create Date.'))


data = soup.select_one('font')
targets = data.text.replace('Note Id','xxxNote Id').split('xxx')
groups = [target.strip().split('\n') for target in targets[1:]]
for group in groups[1:]:
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)
new_ts = '\n'
for group in groups:
    new_ts += '\n'.join(group)+'\n\n'
data.string.replace_with(new_ts)
    
print(soup)

sv_path = os.path.join(svdir,filename)
fp = open(sv_path,'w')
fp.write(str(soup))   
fp.close()

Here is some sample HTML as a guide to the structure

<font>

##This is a group##
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00

##This is another group##
Serial#......... 765432110987654321
Cust#........... 123456
Customer Name... Nate Diaz
BILL TO NO NAME. Bill To: 000001 - Some other company
FIXED DATE...... 01/01/00

Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00

Serial#......... 7765897411126
Cust#........... 123456
Customer Name... John Giles
BILL TO NO NAME. Bill To: 000123 - Sole trader PTY LTD
FIXED DATE...... 01/01/00

Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>

This is what I hope to achieve

<font>
Note Id..... 123456789101234567
Entity Id... 123456
Entity Name. Joe Rogan
Note Detail. Bill To: 000000 - Some Company
Create Date. 01/01/00

Note Id..... 12345665432112345
Entity Id... 000001
Entity Name. Ned Flanders
Note Detail. Bill To: 000002 - Some other big company
Create Date. 01/01/00

Note Id..... 123456789101234567
Entity Id... 123451
Entity Name. Someone Famous
Note Detail. Bill To: 000012 - My Company
Create Date. 01/01/00
    

</font>

Solution

import re
from bs4 import BeautifulSoup


html_data = '''<font>
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00

Serial#......... 765432110987654321
Cust#........... 123456
Customer Name... Nate Diaz
BILL TO NO NAME. Bill To: 000001 - Some other company
FIXED DATE...... 01/01/00

Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00

Serial#......... 7765897411126
Cust#........... 123456
Customer Name... John Giles
BILL TO NO NAME. Bill To: 000123 - Sole trader PTY LTD
FIXED DATE...... 01/01/00

Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>'''

soup = BeautifulSoup(html_data,'html.parser')

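# Insert a marker before every "Serial#" and split on it, so each piece is one group.
groups = soup.font.text.replace('Serial#','xxxSerial#').split('xxx')

# Keep only the first group seen for each Cust# value.
seen,out = set(),[]
for g in groups:
    # The customer number is the run of digits at the end of the Cust# line.
    m = re.search(r'Cust#.*?(\d+)\s*$',g,flags=re.M)
    if not m:
        continue
    if m.group(1) not in seen:
        seen.add(m.group(1))
        out.append(g.strip())

# Write the surviving groups back into the <font> tag.
soup.find('font').string.replace_with('\n' + '\n\n'.join(out) + '\n')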

print(soup)

Prints:

<font>
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00

Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00

Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>
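The answer above works on the original Serial#/Cust# labels. If the deduplication runs after the labels have been renamed, as in the question's script, the same idea might be wrapped in a helper like the one below. This is only a sketch: `dedupe_groups`, its parameters, and the shortened sample text are hypothetical, and it splits with a zero-width lookahead in `re.split` (which needs Python 3.7 or later) instead of the 'xxx' marker.

```python
import re

def dedupe_groups(text, start_label="Note Id", key_label="Entity Id"):
    """Return the first group found for each key value, in document order."""
    # Split just before every start_label, without inserting a marker string.
    parts = re.split(r"(?=" + re.escape(start_label) + r")", text)
    # The key is the run of digits at the end of the key_label line.
    key_pattern = re.compile(re.escape(key_label) + r".*?(\d+)\s*$", flags=re.M)
    seen, kept = set(), []
    for part in parts:
        match = key_pattern.search(part)
        if match is None:
            continue  # text before the first group, or a group with no key line
        if match.group(1) not in seen:
            seen.add(match.group(1))
            kept.append(part.strip())
    return kept

# Shortened sample (assumption: labels already renamed by the question's script).
sample = """
Note Id..... 123456789101234567
Entity Id... 123456
Entity Name. Joe Rogan

Note Id..... 765432110987654321
Entity Id... 123456
Entity Name. Nate Diaz

Note Id..... 123456789101234567
Entity Id... 123451
Entity Name. Someone Famous

Note Id..... 7765897411126
Entity Id... 123456
Entity Name. John Giles

Note Id..... 12345665432112345
Entity Id... 000001
Entity Name. Mary Mack
"""

kept = dedupe_groups(sample)
print("\n\n".join(kept))
```

In the question's script, the text passed in would be `data.text`, and the joined result could be written back with `data.string.replace_with(...)` as the answer does.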
