Extracting a string from a URL using a regular expression in Data Studio

Problem description

I have the following landing pages:

city7900/
cityid=7900
city7900-t40094705.nb1/

and I would like to extract

7900

from all of them in Data Studio.

I tried

REGEXP_EXTRACT(Landing Page,'city([^&]+))

but it only extracts one of them. I also tried

REGEXP_EXTRACT(Landing Page,'city([^&]+)|city([^&]+)(.*?)\\-')

but it only extracts

city7900/
cityid=7900

How can I extract all of them?

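Solution

You can use

REGEXP_EXTRACT(Landing Page,'city[^0-9]*([0-9]+)')

Pattern details:

  • city - the literal string city
  • [^0-9]* - zero or more non-digit characters, which skips over text such as id= when it sits between city and the number
  • ([0-9]+) - capturing group 1: one or more digits. REGEXP_EXTRACT returns the contents of this group, here 7900.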
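
To check the pattern outside Data Studio, it can be tried against the three sample landing pages from the question. The following is a minimal sketch in Python, assuming that Python's re module behaves like Data Studio's regex engine for this simple pattern. The helper name extract_city_id is hypothetical and only serves the illustration.

```python
import re

# Pattern from the answer. Python's re module is only a stand-in here
# for the regex engine that Data Studio uses (an assumption of this sketch).
PATTERN = re.compile(r"city[^0-9]*([0-9]+)")


def extract_city_id(landing_page):
    """Return the first run of digits that follows 'city', or None if nothing matches."""
    match = PATTERN.search(landing_page)
    return match.group(1) if match else None


# The three landing pages given in the question.
samples = ["city7900/", "cityid=7900", "city7900-t40094705.nb1/"]
results = {page: extract_city_id(page) for page in samples}

for page, city_id in results.items():
    print(f"{page} -> {city_id}")
```

Because [0-9]+ stops at the first non-digit character, the trailing -t40094705.nb1/ in the third landing page is not captured.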