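I am parsing an HTML file and get an array of strings, which I am trying to clean up before putting it into a PDF. At this stage, I want to move every word that starts with @X to the end of its line, so that all the @X tags end up aligned in one column.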
    Hello World @Xabs
    Hello World @Xz
    Hello World @Xss
    Hello World @Xssa
    Hello World @Xqq
    Hello World @Xsasas
The output I want:

    Hello World          @Xabs
    Hello World          @Xz
    Hello World          @Xss
    Hello World          @Xssa
    Hello World          @Xqq
    Hello World          @Xsasas
Any ideas?

What I have so far:
    # encoding=utf8
    import codecs

    from bs4 import BeautifulSoup as soup

    # Open the local HTML file (the path is elided in the original post)
    f = codecs.open(r"C:\...\file.html", 'r', encoding='utf8')
    page = f.read()
    f.close()

    # Parse the HTML
    page_soup = soup(page, "html.parser")

    # Extract the table cells that hold the strings of interest
    a_s = page_soup.find_all("td", {"class": "row_cell"})
    for a in a_s:
        # This strips only the "@X" prefix; it does not move the tag to the end of the line
        result = a.text.replace("@X", "")
        print(result)
Solution

Very similar to @blue_note's answer, but it makes the whole solution more automated:
    import re

    lines = ['Hello World @Xabs',
             'Hello World @Xz',
             'Hello World @Xss',
             'Hello World @Xssa',
             'Hello World @Xqq',
             'Hello World @Xsasas']

    aligned_lines = []
    for line in lines:
        # Find the first word starting with @X
        match = re.findall(r'@X\w+', line)[0]
        # Remove it from its current position
        line = line.replace(match, '')
        # Left-justify the remaining text in a 50-character field, then append the tag
        aligned_lines.append('%-50s %s' % (line, match))

    aligned_lines

    ['Hello World                                        @Xabs',
     'Hello World                                        @Xz',
     'Hello World                                        @Xss',
     'Hello World                                        @Xssa',
     'Hello World                                        @Xqq',
     'Hello World                                        @Xsasas']
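
The fixed width of 50 works for the sample, but it breaks alignment as soon as a line is longer than 50 characters, and `re.findall(...)[0]` raises an `IndexError` on a line with no tag. A possible variation is sketched below. The helper name `align_tags` and its `gap` parameter are hypothetical and not part of the original answer, and the sketch assumes it is acceptable to collapse runs of whitespace left behind once the tag is removed.

```python
import re

TAG_PATTERN = re.compile(r'@X\w+')


def align_tags(lines, gap=2):
    """Move the first @X tag of each line to the end and align all tags in one column.

    Hypothetical helper: the name and the `gap` parameter are illustrative only.
    """
    split_lines = []
    for line in lines:
        match = TAG_PATTERN.search(line)
        # Lines without a tag are kept, with an empty tag
        tag = match.group(0) if match else ''
        # Drop the first tag, then collapse the whitespace it leaves behind
        text = ' '.join(TAG_PATTERN.sub('', line, count=1).split())
        split_lines.append((text, tag))

    # Derive the column width from the longest remaining text instead of hard-coding it
    width = max((len(text) for text, _ in split_lines), default=0)
    # rstrip() removes the padding on lines that have no tag
    return [(text.ljust(width + gap) + tag).rstrip() for text, tag in split_lines]


if __name__ == '__main__':
    sample = ['Hello @Xabs World', 'Hi @Xz', 'No tag here']
    for row in align_tags(sample):
        print(row)
```

Because the padding consists of plain spaces, the tags will only line up visually if the PDF renders the text in a monospaced font.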