Regular expression to clean up Amazon links

Problem description

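I'm trying to write a regular expression that cleans up Amazon URLs, but I can't get rid of the middle part.

In the attached example, I want "group 2" to disappear from the final result. Is that possible?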

I'm using the following regular expression: ^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(/[dg]p/)([^/]+)

I would like to get results like these:

https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1 --> https://www.amazon.com/dp/B07P4LVZNL

https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2 --> https://www.amazon.com/dp/B07DXPN7TK

https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2 --> https://www.amazon.es/gp/B07R23QGH6

https://www.amazon.it/dp/B07R23QGH6/ --> https://www.amazon.it/dp/B07R23QGH6/

https://regex101.com/r/AFGk96/1

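Solution

You are over-escaping. In Python's re module, which the code further down uses, the slash has no special meaning and does not need to be escaped, so: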

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(/[dg]p/)([^/]+)

can be written (with a few other simplifications) as

^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+)
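As a quick check of the point about escaping, the sketch below compares an escaped and an unescaped pattern in Python's re module. The two patterns are shortened illustrations (the names escaped and plain are made up here), not the full expressions from the question:

```python
import re

# Same pattern twice: once with every slash escaped, once without.
escaped = re.compile(r'^https?:\/\/([^\/]+)')
plain = re.compile(r'^https?://([^/]+)')

url = 'https://www.amazon.it/dp/B07R23QGH6/'

# Both capture the host, because "\/" and "/" mean the same thing to re.
print(escaped.match(url).group(1))
print(plain.match(url).group(1))
```

Escaping the slash only matters in tools or languages that use / as the pattern delimiter, which may be why the escaped form was needed on regex101.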

If we append .* so that the pattern also consumes the rest of the string, the substitution gives the desired result:

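import re

# Group 1: the host (starting at "www"); group 2: "/dp/<id>" or "/gp/<id>".
# The lazy .*? skips the product-name segment; the trailing .* consumes the rest.
amazon_url_pattern = re.compile(r'^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+).*')

url = 'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1'
# The scheme is matched by a non-capturing group, so re-add it in the replacement.
result = amazon_url_pattern.sub(r'https://\1\2/', url)

print(result)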

This prints:

https://www.amazon.com/dp/B07P4LVZNL/
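To run the same idea over several of the question's URLs, one possible sketch wraps a variant of the pattern in a helper. The function name clean_amazon_url, the choice to capture the scheme together with the host, the [^/?#]+ character class, and the decision to drop the trailing slash (as in the first three expected results of the question) are all assumptions made for this illustration, not part of the original answer:

```python
import re

# Hypothetical variant of the answer's pattern: group 1 keeps the optional
# scheme together with the host, and [^/?#]+ (an assumption) stops the ID at
# a query string or fragment as well as at a slash.
AMAZON_URL = re.compile(r'^((?:https?://)?www[^/]+).*?(/[dg]p/[^/?#]+)')


def clean_amazon_url(url):
    """Return scheme + host + /dp/<id> (or /gp/<id>); unchanged if no match."""
    match = AMAZON_URL.match(url)
    if match is None:
        return url
    return match.group(1) + match.group(2)


urls = [
    'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1',
    'https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2',
    'https://www.amazon.it/dp/B07R23QGH6/',
]

for u in urls:
    print(clean_amazon_url(u))
```

Using match and rebuilding the string from the two groups avoids the trailing .* altogether, and a URL that does not fit the pattern is returned untouched rather than silently altered.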
