How to select from a regex group

Problem description

I'm lost on how to grab this. I've been using https://regexr.com/ and https://www.logextender.com/, but nothing has worked. Can anyone help this poor newbie?

I'm trying to grab everything after the " | ".

<div class="atw-JobInfo-companyLocation"><span>CG-VAK Softwares USA Inc</span><span> | </span><span>Remote </span></div>

My regex so far is:

(([[\w-.]+[-.][\w-.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+|\s+(\w+)\s+)

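My table looks like the one below. I want to separate the company name from everything after the |, but I think the best way to do that is with a regex after the table has been created? I'm stuck.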

+---------------------------------------------------+--------------------------------------+
|                     Position                      |               Company                |
+---------------------------------------------------+--------------------------------------+
| Renovation/Construction Underwriter               | Ignite Human Capital | Remote        |
| Scientific Computing                              | CG-VAK Softwares USA Inc | Remote    |
| Data Analytics Engineer                           | Delta Defense LLC | West Bend,WI    |
| Data Analyst - Tableau - Alteryx - Insurance e... | Grapevine Technology | United States |
| Technology Integration Specialist                 | KAGE Innovation | Osceola,WI        |
+---------------------------------------------------+--------------------------------------+

Solution

What's wrong with a simple split?

sample = 'CG-VAK Softwares USA Inc | Remote'

parts = sample.split('|')
if len(parts) == 2:
    print(parts[1].strip()) # prints 'Remote'
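
If a regular expression is still preferred, a single capture group after the pipe should be enough. The following is a minimal sketch rather than a fix of the pattern in the question; the names `PIPE_RE` and `after_pipe` are made up for illustration.

```python
import re

# Match a literal pipe, skip whitespace after it, and capture the rest
# of the string up to its last non-whitespace character.
PIPE_RE = re.compile(r'\|\s*(.*\S)')

def after_pipe(text):
    """Return the text after the first '|', or None if nothing follows a pipe."""
    match = PIPE_RE.search(text)
    return match.group(1) if match else None

print(after_pipe('CG-VAK Softwares USA Inc | Remote'))
print(after_pipe('Delta Defense LLC | West Bend,WI   '))
```

`match.group(1)` is the "select from a group" step: group 0 is the whole match including the pipe, group 1 is only the captured location.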

If you already have the table, you should work with that data. If you are starting from HTML, an HTML parsing library makes it easy to create:

from bs4 import BeautifulSoup as BS

def find_span_texts(html):
    """find spans and return the text they contain

    the parameter is a bs4.element.Tag, or a partial result from a BS instance
    the return is a list of strings, representing the content of the spans
    """
    return [s.text for s in html.find_all('span')]


html_input = """<div class="atw-JobInfo-companyLocation">
    <span>CG-VAK Softwares USA Inc</span>
    <span> | </span><span>Remote </span></div>"""

# create soup object
bs = BS(html_input,'html.parser')

# find divs with information class
divs = bs.find_all('div', {"class": "atw-JobInfo-companyLocation"})

# get span texts from all divs
spanTexts = [find_span_texts(div) for div in divs]
# print(spanTexts) # [['CG-VAK Softwares USA Inc',' | ','Remote ']]

# get company and location
coLocs = [[c.strip(),l.strip()] for c,pipe,l in spanTexts]

# show result
print(coLocs) # [['CG-VAK Softwares USA Inc','Remote']]
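
For the case where the table already exists, the same split can be applied to every value in the Company column. This is a sketch that assumes the values are available as plain strings; the list `companies` is copied from the table in the question, and `split_company` is a hypothetical helper name.

```python
def split_company(value):
    """Split 'Company | Location' into a (company, location) tuple.

    If there is no pipe, the location is returned as an empty string.
    """
    company, _sep, location = value.partition('|')
    return company.strip(), location.strip()

# Values copied from the Company column of the question's table.
companies = [
    'Ignite Human Capital | Remote',
    'CG-VAK Softwares USA Inc | Remote',
    'Delta Defense LLC | West Bend,WI',
    'Grapevine Technology | United States',
    'KAGE Innovation | Osceola,WI',
]

rows = [split_company(value) for value in companies]
for company, location in rows:
    print(f'{company:<28}{location}')
```

`str.partition` splits only on the first pipe, so a location that itself contains a `|` stays in one piece. If the table is a pandas DataFrame (an assumption, since the question does not say), `Series.str.split('|', n=1, expand=True)` should give the same two columns.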

jupyter-notebook python python-3.x regex-group