How to select from a group

Problem description

I'm lost on how to capture this. I've been using https://regexr.com/ and https://www.logextender.com/, but with no luck. Can anyone help this poor newbie?

I am trying to capture everything after the " |".

The HTML code is:

<div class="atw-JobInfo-companyLocation"><span>CG-VAK Softwares USA Inc</span><span> | </span><span>Remote </span></div>

My regex so far is:

(([[\w-.]+[-.][\w-.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+|\s+(\w+)\s+)

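My table looks like the one below. I want to separate the company name from everything after the |, but I think the best way is to do that with a regex after the table has been created? I'm stuck.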

+---------------------------------------------------+--------------------------------------+
|                     Position                      |               Company                |
+---------------------------------------------------+--------------------------------------+
| Renovation/Construction Underwriter               | Ignite Human Capital | Remote        |
| Scientific Computing                              | CG-VAK Softwares USA Inc | Remote    |
| Data Analytics Engineer                           | Delta Defense LLC | West Bend, WI    |
| Data Analyst - Tableau - Alteryx - Insurance e... | Grapevine Technology | United States |
| Technology Integration Specialist                 | KAGE Innovation | Osceola, WI        |
+---------------------------------------------------+--------------------------------------+

Solution

What's wrong with a simple split?

sample = 'CG-VAK Softwares USA Inc | Remote'

parts = sample.split('|')
if len(parts) == 2:
    print(parts[1].strip()) # prints 'Remote'
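
The same idea extends to every value in the Company column. The following is only a minimal sketch: the three sample values are copied from the table in the question, `split_company` is a hypothetical helper name, and it assumes each value holds at most one '|' separator. It uses `str.partition` rather than `split`, so a value without a '|' yields an empty location instead of needing a length check.

```python
# Sample Company column values copied from the table in the question.
rows = [
    'Ignite Human Capital | Remote',
    'CG-VAK Softwares USA Inc | Remote',
    'Grapevine Technology | United States',
]


def split_company(value):
    """Split 'Company | Location' into a (company, location) tuple.

    Hypothetical helper: location is '' when the value contains no '|'.
    """
    # partition splits on the first '|' only and always returns three parts
    company, _sep, location = value.partition('|')
    return company.strip(), location.strip()


pairs = [split_company(row) for row in rows]
for company, location in pairs:
    print(f'{company!r} -> {location!r}')
```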

If you already have the table, you should work with that data. If you are starting from the HTML, an HTML parsing library can easily create it:

from bs4 import BeautifulSoup as BS

def find_span_texts(html):
    """Find spans and return the text they contain.

    The parameter is a bs4.element.Tag, or a partial result from a BS instance.
    The return value is a list of strings holding the text content of the spans.
    """
    return [s.text for s in html.find_all('span')]


html_input = """<div class="atw-JobInfo-companyLocation">
    <span>CG-VAK Softwares USA Inc</span>
    <span> | </span><span>Remote </span></div>"""

# create soup object
bs = BS(html_input, 'html.parser')

# find divs with the company/location class
divs = bs.find_all('div', {"class": "atw-JobInfo-companyLocation"})

# get the span texts for each div
spanTexts = [find_span_texts(div) for div in divs]
# print(spanTexts) # [['CG-VAK Softwares USA Inc', ' | ', 'Remote ']]

# get company and location (assumes exactly three spans per div)
coLocs = [[c.strip(), l.strip()] for c, pipe, l in spanTexts]

# show result
print(coLocs) # [['CG-VAK Softwares USA Inc', 'Remote']]
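
If a regular expression is still wanted, as the question asked, one capture group after the pipe is enough. This is a rough sketch rather than a recommendation over the split: `AFTER_PIPE` and `after_pipe` are hypothetical names, and it assumes each value is a single line in which the first '|' is the separator.

```python
import re

# Match the first '|', skip whitespace after it, and capture the rest of the line.
AFTER_PIPE = re.compile(r'\|\s*(.*)')


def after_pipe(value):
    """Return the text after the first '|', or None if there is no '|'."""
    match = AFTER_PIPE.search(value)
    return match.group(1).strip() if match else None


print(after_pipe('CG-VAK Softwares USA Inc | Remote'))  # Remote
```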