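Problem description
I'm lost on how to capture this. I've been using https://regexr.com/ and https://www.logextender.com/, but with no luck. Can anyone help a poor newbie?
I'm trying to capture everything after the " | ".
The HTML is: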
<div class="atw-JobInfo-companyLocation"><span>CG-VAK Softwares USA Inc</span><span> | </span><span>Remote </span></div>
My regex so far is:
(([[\w-.]+[-.][\w-.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+|\s+(\w+)\s+)
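My table looks like the one below. I want to separate the company name from everything after the "|", but I think the best approach is to do that with a regex after the table has been created? I'm stuck.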
+---------------------------------------------------+--------------------------------------+
| Position                                          | Company                              |
+---------------------------------------------------+--------------------------------------+
| Renovation/Construction Underwriter               | Ignite Human Capital | Remote        |
| Scientific Computing                              | CG-VAK Softwares USA Inc | Remote    |
| Data Analytics Engineer                           | Delta Defense LLC | West Bend,WI     |
| Data Analyst - Tableau - Alteryx - Insurance e... | Grapevine Technology | United States |
| Technology Integration Specialist                 | KAGE Innovation | Osceola,WI         |
+---------------------------------------------------+--------------------------------------+
Solution
What's wrong with a simple split?
sample = 'CG-VAK Softwares USA Inc | Remote'
parts = sample.split('|')
if len(parts) == 2:
    print(parts[1].strip())  # prints: Remote
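Since the question asked for a regex, here is a minimal sketch of the same extraction with Python's `re` module. The pattern and the helper name `after_pipe` are illustrative assumptions, not part of the original post; the helper returns whatever follows the first `|`, with surrounding whitespace removed, or `None` when there is no `|`.

```python
import re

def after_pipe(text):
    """Hypothetical helper: return the text after the first '|', or None."""
    # \|   matches the literal pipe
    # \s*  skips whitespace right after it
    # (.*) captures the rest of the line
    match = re.search(r'\|\s*(.*)', text)
    if match is None:
        return None
    return match.group(1).strip()

print(after_pipe('CG-VAK Softwares USA Inc | Remote'))  # prints: Remote
print(after_pipe('Delta Defense LLC | West Bend,WI'))   # prints: West Bend,WI
```

For a single fixed separator like this, the split shown above is probably easier to read; the regex mainly pays off if the separator can vary.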
If you already have the table, you should work with that data. If you are starting from the HTML, an HTML parsing library can build it easily:
from bs4 import BeautifulSoup as BS

def find_span_texts(html):
    """Find spans and return the text they contain.

    The parameter is a bs4.element.Tag, or a partial result from a BS instance.
    The return value is a list of strings, representing the content of the spans.
    """
    return [s.text for s in html.find_all('span')]

html_input = """<div class="atw-JobInfo-companyLocation">
<span>CG-VAK Softwares USA Inc</span>
<span> | </span><span>Remote </span></div>"""

# create soup object
bs = BS(html_input, 'html.parser')
# find divs with the company/location class
divs = bs.find_all('div', {"class": "atw-JobInfo-companyLocation"})
# get the span texts from every div
spanTexts = [find_span_texts(div) for div in divs]
# print(spanTexts)  # [['CG-VAK Softwares USA Inc', ' | ', 'Remote ']]
# get company and location, dropping the pipe span
coLocs = [[c.strip(), l.strip()] for c, pipe, l in spanTexts]
# show result
print(coLocs)  # [['CG-VAK Softwares USA Inc', 'Remote']]
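If the table already exists, the same split can be applied to each row's Company value. The sketch below assumes the rows are available as plain (position, company) tuples; the names `rows`, `split_company`, and `split_rows` are hypothetical, and the two sample rows are copied from the table in the question. It uses `str.partition`, which always returns three parts, so a value without a `|` simply gets an empty location.

```python
# Assumed input shape: (position, company) tuples taken from the table above.
rows = [
    ("Scientific Computing", "CG-VAK Softwares USA Inc | Remote"),
    ("Data Analytics Engineer", "Delta Defense LLC | West Bend,WI"),
]

def split_company(value):
    """Split 'Company | Location' into (company, location)."""
    # partition splits on the first '|' only and never raises
    company, _sep, location = value.partition('|')
    return company.strip(), location.strip()

# Build (position, company, location) tuples
split_rows = [(position, *split_company(company)) for position, company in rows]

for row in split_rows:
    print(row)
# ('Scientific Computing', 'CG-VAK Softwares USA Inc', 'Remote')
# ('Data Analytics Engineer', 'Delta Defense LLC', 'West Bend,WI')
```

If the table lives in a DataFrame rather than a list, the same idea should carry over by applying the split to the Company column.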