Problem description
I have the following landing pages:
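city7900/
cityid=7900
city7900-t40094705.nb1/
and I want to extract 7900 from all of them in Data Studio.
I tried using
REGEXP_EXTRACT(Landing Page,'city([^&]+))
but it only extracts one of them. I then tried
REGEXP_EXTRACT(Landing Page,'city([^&]+)|city([^&]+)(.*?)\\-')
but it only extracts
city7900/
cityid=7900
How can I extract 7900 from all of them?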
Solution
You can use
REGEXP_EXTRACT(Landing Page,'city[^0-9]*([0-9]+)')
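See the regex demo. Details:
- city - a literal string
- [^0-9]* - zero or more non-digit characters
- ([0-9]+) - capturing group 1: one or more digits. REGEXP_EXTRACT returns the text captured by this group, so only the number comes back.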
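To check the pattern outside Data Studio, here is a minimal Python sketch that applies the same expression to the three landing pages from the question. It assumes Python's re module behaves like Data Studio's RE2 engine for this pattern, which uses only syntax common to both. The helper name extract_city_id is hypothetical and used only for illustration.

```python
import re

# Same pattern as in the REGEXP_EXTRACT call above.
CITY_ID = re.compile(r"city[^0-9]*([0-9]+)")


def extract_city_id(landing_page):
    """Return the first run of digits after 'city', or None if nothing matches.

    Hypothetical helper: it mimics REGEXP_EXTRACT by returning capture group 1.
    """
    match = CITY_ID.search(landing_page)
    return match.group(1) if match else None


# The three landing pages from the question.
landing_pages = ["city7900/", "cityid=7900", "city7900-t40094705.nb1/"]

for page in landing_pages:
    print(page, "->", extract_city_id(page))
```

Because [^0-9]* skips any non-digit characters between city and the number, cityid=7900 and city7900/ are both handled by the same expression, and [0-9]+ stops at the first non-digit, so the -t40094705.nb1/ suffix is ignored.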