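Problem description
The script below goes through an HTML document. First, it replaces certain text with different text. This part works fine. The next job is to split the text between the <font></font> tags into groups and put those groups in a list ([]). This also works as expected. Then it falls over. It is supposed to use these groups to compare each group's Entity Id... and, if a match/duplicate is found, remove the associated group. Unfortunately, the script does not remove the duplicate groups, and there are plenty of them, in some cases more than 100 duplicates. They are dynamic, so there may be many different Entity Id...xxx values that all have duplicates. I do not want to remove every group that has a duplicate; I want to keep one group for each duplicated Entity Id...
Is my problem in this part?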
for group in groups[1:]:
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)
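As I understand it, group[1] is taken as the last Entity Id...123456 in the document and used as the lookup value against groups[groups.index(group)-1][1], which, through the index function, would cover every Entity Id in the document. But since group[1] is not inside the index call, is it only using the last Entity Id... as the lookup value? Do I need to put the if group[1] == comparison inside index?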
from bs4 import BeautifulSoup
import re
import urllib
import os
svdir = ''
filename = 'something.html'
with open(r'something.html','r') as f:
    html_string = f.read()
soup = BeautifulSoup(html_string)
target = soup.find_all(text=re.compile(r'Serial#.........'))
for v in target:
    v.replace_with(v.replace('Serial#.........','Note Id.....'))
target1 = soup.find_all(text=re.compile(r'Cust#...........'))
for v in target1:
    v.replace_with(v.replace('Cust#...........','Entity Id...'))
target2 = soup.find_all(text=re.compile(r'Customer Name...'))
for v in target2:
    v.replace_with(v.replace('Customer Name...','Entity Name.'))
target3 = soup.find_all(text=re.compile(r'BILL TO NO NAME.'))
for v in target3:
    v.replace_with(v.replace('BILL TO NO NAME.','Note Detail.'))
target4 = soup.find_all(text=re.compile(r'FIXED DATE......'))
for v in target4:
    v.replace_with(v.replace('FIXED DATE......','Create Date.'))
data = soup.select_one('font')
targets = data.text.replace('Note Id','xxxNote Id').split('xxx')
groups = [target.strip().split('\n') for target in targets[1:]]
for group in groups[1:]:
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)
new_ts = '\n'
for group in groups:
    new_ts += '\n'.join(group)+'\n\n'
data.string.replace_with(new_ts)
print(soup)
sv_path = os.path.join(svdir,filename)
fp = open(sv_path,'w')
fp.write(str(soup))
fp.close()
Here is some sample HTML as a guide to the structure
<font>
##This is a group##
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00
##This is another group##
Serial#......... 765432110987654321
Cust#........... 123456
Customer Name... Nate Diaz
BILL TO NO NAME. Bill To: 000001 - Some other company
FIXED DATE...... 01/01/00
Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00
Serial#......... 7765897411126
Cust#........... 123456
Customer Name... John Giles
BILL TO NO NAME. Bill To: 000123 - Sole trader PTY LTD
FIXED DATE...... 01/01/00
Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>
This is what I am hoping to achieve
<font>
Note Id..... 123456789101234567
Entity Id... 123456
Entity Name. Joe Rogan
Note Detail. Bill To: 000000 - Some Company
Create Date. 01/01/00
Note Id..... 12345665432112345
Entity Id... 000001
Entity Name. Ned Flanders
Note Detail. Bill To: 000002 - Some other big company
Create Date. 01/01/00
Note Id..... 123456789101234567
Entity Id... 123451
Entity Name. Someone Famous
Note Detail. Bill To: 000012 - My Company
Create Date. 01/01/00
</font>
Solution
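The loop in the question compares each group only with the group directly before it in the list: group[1] is the second line of the current group (its Entity Id line), and groups[groups.index(group)-1][1] is the same line of the preceding group. Duplicates that are not adjacent are therefore never detected. A minimal sketch of that behaviour follows; the sample groups and the function name adjacent_dedupe are made up for illustration and are not part of the original script.

```python
# Each group is a list of lines; index 1 holds the "Entity Id" line.
# The data below is invented purely to illustrate the behaviour.
groups = [
    ["Note Id..... 1", "Entity Id... 123456"],
    ["Note Id..... 2", "Entity Id... 123456"],  # adjacent duplicate
    ["Note Id..... 3", "Entity Id... 123451"],
    ["Note Id..... 4", "Entity Id... 123456"],  # non-adjacent duplicate
]


def adjacent_dedupe(groups):
    """Same loop as in the question, run on a shallow copy of the list."""
    groups = list(groups)
    for group in groups[1:]:
        # Only the immediately preceding group is looked at here.
        if group[1] == groups[groups.index(group) - 1][1]:
            groups.remove(group)
    return groups


for g in adjacent_dedupe(groups):
    print(g)
```

In this sketch the second group is removed, but the fourth one survives because its direct predecessor has a different id. Remembering every id already seen, for example in a set, avoids the problem.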
import re
from bs4 import BeautifulSoup
html_data = '''<font>
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00
Serial#......... 765432110987654321
Cust#........... 123456
Customer Name... Nate Diaz
BILL TO NO NAME. Bill To: 000001 - Some other company
FIXED DATE...... 01/01/00
Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00
Serial#......... 7765897411126
Cust#........... 123456
Customer Name... John Giles
BILL TO NO NAME. Bill To: 000123 - Sole trader PTY LTD
FIXED DATE...... 01/01/00
Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>'''
soup = BeautifulSoup(html_data,'html.parser')
groups = soup.font.text.replace('Serial#','xxxSerial#').split('xxx')
seen,out = set(),[]
for g in groups:
    m = re.search(r'Cust#.*?(\d+)\s*$',g,flags=re.M)
    if not m:
        continue
    if m.group(1) not in seen:
        seen.add(m.group(1))
        out.append(g.strip())
soup.find('font').string.replace_with('\n' + '\n\n'.join(out) + '\n')
print(soup)
Prints:
<font>
Serial#......... 123456789101234567
Cust#........... 123456
Customer Name... Joe Rogan
BILL TO NO NAME. Bill To: 000000 - Some Company
FIXED DATE...... 01/01/00
Serial#......... 123456789101234567
Cust#........... 123451
Customer Name... Someone Famous
BILL TO NO NAME. Bill To: 000012 - My Company
FIXED DATE...... 01/01/00
Serial#......... 12345665432112345
Cust#........... 000001
Customer Name... Mary Mack
BILL TO NO NAME. Bill To: 000245 - Hello.PTY.LTD
FIXED DATE...... 01/01/00
</font>
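The snippet above removes the duplicates but leaves the original labels in place. To also get the renamed labels from the question's desired output, the two steps could be combined roughly as follows. This is only a sketch that works on the plain text of the <font> tag: the function name rename_and_dedupe and the sample text are assumptions for illustration, while the label mapping is copied from the question's script.

```python
import re

# Old label -> new label, copied from the replacements in the question.
LABELS = {
    "Serial#.........": "Note Id.....",
    "Cust#...........": "Entity Id...",
    "Customer Name...": "Entity Name.",
    "BILL TO NO NAME.": "Note Detail.",
    "FIXED DATE......": "Create Date.",
}


def rename_and_dedupe(text):
    """Rename the labels, then keep the first group for each Entity Id."""
    for old, new in LABELS.items():
        text = text.replace(old, new)
    # Split in front of every "Note Id" so that each chunk is one group.
    chunks = re.split(r"(?=Note Id)", text)
    seen, out = set(), []
    for chunk in chunks:
        m = re.search(r"^Entity Id\.+\s*(\d+)\s*$", chunk, flags=re.M)
        if not m:
            continue  # e.g. the text before the first group
        if m.group(1) not in seen:
            seen.add(m.group(1))
            out.append(chunk.strip())
    return "\n" + "\n\n".join(out) + "\n"


# Hypothetical sample text, shortened to three lines per group.
sample = """
Serial#......... 111
Cust#........... 123456
Customer Name... A
Serial#......... 222
Cust#........... 123456
Customer Name... B
Serial#......... 333
Cust#........... 000001
Customer Name... C
"""

print(rename_and_dedupe(sample))
```

With BeautifulSoup, the returned string could be written back with soup.find('font').string.replace_with(...), as in the code above.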