(Python 3) - Regex does not return matches

Question

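A bit of background: I am trying to scrape pastebin's archive page and extract only the paste IDs. An ID is 8 characters long, and a sample paste link looks like this: "https://pastebin.com/A8XGWYBu"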

import requests
import re
from bs4 import BeautifulSoup

def get_recent_id():
    
    URL = requests.get('https://pastebin.com/archive',verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"

    soup = BeautifulSoup(URL.content,'html.parser')
    pastes = soup.find_all('a')

    # Works good here
    # prints the necessary things using the regex above
    pastes_findall = re.findall(href_regex,str(pastes))

    try:
        for id,t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

            final = re.findall(get_valid,output)
            print(final)
    except IndexError:
        pass

get_recent_id()

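Where it breaks is the regex inside the try statement. Instead of the information I expect, it returns empty [] lists.

Sample output from the regex inside the try statement: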

[]
[]
[]
[]
...
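
For reference, re.findall with a two-group pattern returns a list of tuples, and an empty list whenever the string does not match. A small sketch using the same get_valid pattern on two made-up strings (the strings are illustrative, not taken from the live page):

```python
import re

# Same pattern as in the question, unchanged
get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

# A string that ends in an 8-character ID yields one (title, id) tuple
matched = re.findall(get_valid, "lab2 -> eRJY9YAb")

# A string whose right-hand side is shorter than 8 characters yields []
unmatched = re.findall(get_valid, "Sign up -> signup")

print(matched)
print(unmatched)
```

So a run of [] lines would be consistent with output strings that do not end in an 8-character ID, though that alone does not pin down what the strings actually contained.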

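I have tested the regex on regex101, and it works fine against the contents of the output variable.

The output I am trying to achieve should contain only the title and the paste ID, and should look like this: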

blood sword v1.0 -> cvWdRuaV
lab2 -> eRJY9YAb
example 210526a -> A2sv2shx
2021-05-26_stats.json -> wjsmucFF
2021-05-25_stats.json -> TsXrW7ex
Flake#5595 (466999758096039936) RD -> q8tHsgMz
Untitled -> akrSbCyT
...

regex101 clearly shows matches in both groups, so I don't understand why I get nothing back from output. If anyone can help, I would really appreciate it!

Thanks!

Solution

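You can get the desired output with fewer lines of code. Make sure your bs4 version is up to date, or at least >= 4.7.0, so that it supports the pseudo CSS selector I use in the script.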

import requests
from bs4 import BeautifulSoup

link = 'https://pastebin.com/archive'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.maintable tr:has(> td > a[href]) > td:nth-of-type(1) > a"):
        title = item.text
        _id = item.get("href").lstrip("/")
        print(title," -> ",_id)

Output at the time of writing (truncated):

new_meta_format  ->  JjMxWDzh
Paste Ping  ->  bH54QCb9
Untitled  ->  EEMQigvX
free checked credit cards  ->  b6LE4e78
Untitled  ->  wJA8Axbb
Untitled  ->  fFFrEJnv
Untitled  ->  A8XGWYBu
Ejercicio01  ->  CqP4grhP
Ejercicio01  ->  nhxM8Tca
Untitled  ->  8Y485jwG
f_get_product_balance_stock_exclude_reserved  ->  hc64MsgH
in_product_balance_stock_reserved  ->  ZGXgRWKQ
My Log File  ->  24TnZK2F
Untitled  ->  tvbwuWkL

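I don't think you need a regex here. You can take the href value of each element in pastes, strip the / characters, and build the output string by appending -> and the text of the a element: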

[i["href"].strip('/') + " -> " + i.get_text() for i in pastes]

The whole function then looks like this:

def get_recent_id():
    URL = requests.get('https://pastebin.com/archive',verify=False)
    soup = BeautifulSoup(URL.content,'html.parser')
    # href=True skips <a> tags without an href, which would raise KeyError below
    pastes = soup.find_all('a', href=True)
    return [i["href"].strip('/') + " -> " + i.get_text() for i in pastes]
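
The list built this way covers every link on the page, navigation links included. A minimal sketch of narrowing it down, assuming (as stated in the question) that a paste ID is exactly 8 letters or digits. The function name format_pastes and the sample (href, text) pairs are hypothetical; in the real script the pairs would come from the a elements:

```python
import re

# Assumption: a paste ID is exactly 8 ASCII letters or digits
PASTE_ID = re.compile(r'[A-Za-z0-9]{8}')

def format_pastes(links):
    """Build 'title -> id' lines from (href, text) pairs, keeping only paste-like hrefs."""
    lines = []
    for href, text in links:
        paste_id = href.strip('/')
        # fullmatch requires the whole string to be an 8-character ID
        if PASTE_ID.fullmatch(paste_id):
            lines.append(f"{text} -> {paste_id}")
    return lines

# Made-up sample data standing in for the scraped links
sample_links = [
    ("/archive", "archive"),
    ("/cvWdRuaV", "blood sword v1.0"),
    ("/login", "LOGIN"),
    ("/eRJY9YAb", "lab2"),
]
formatted = format_pastes(sample_links)
print("\n".join(formatted))
```

This is only a heuristic: any other site path that happens to be 8 alphanumeric characters long would pass the filter too, so restricting the search to the archive table is likely more robust.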

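After playing around with it for a while, I found the answer to my own question.

@Wiktor, your answer is good, but it still returned some results I didn't need.

The final code is as follows: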

def get_recent_id():
    
    URL = requests.get('https://pastebin.com/archive',verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"

    soup = BeautifulSoup(URL.content,'html.parser')
    pastes = soup.find_all('a')
    
    # Works until here
    # prints the necessary things using the regex above
    pastes_findall = re.findall(href_regex,str(pastes))

    try:
        for id,t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'
            final = re.search(get_valid,output)
            
            if final is None:
                pass
            else:
                final = final.group(0)
                print(final)
            
    except IndexError:
        pass

get_recent_id()

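Basically, my local output variable contained some extra things that I did not show in my post. Once I removed them, the code I originally posted worked (I should have tried that sooner...).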

After that I got a "NoneType" error, but a simple if statement fixed that as well.
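
The "NoneType" error arises because re.search returns None when nothing matches, so calling .group() on the result fails. A minimal sketch of the same guard, run on made-up strings instead of the live page. The helper name extract_pairs, the trailing $ anchor, and the character class without + are assumptions added here, not part of the original code (inside [...], + is a literal plus sign, not a quantifier):

```python
import re

# Same shape as get_valid, but anchored at the end and without "+" in the class
GET_VALID = re.compile(r'(.*?) -> ([A-Za-z\d]{8})$')

def extract_pairs(lines):
    """Return (title, paste_id) for each line ending in ' -> <8-character id>'."""
    pairs = []
    for line in lines:
        match = GET_VALID.search(line)
        if match is None:
            # No match: skip instead of calling .group() on None
            continue
        pairs.append((match.group(1), match.group(2)))
    return pairs

# Made-up sample lines; the middle one has no valid ID
sample = ["Untitled -> akrSbCyT", "archive -> archive", "lab2 -> eRJY9YAb"]
for title, paste_id in extract_pairs(sample):
    print(f"{title} -> {paste_id}")
```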

In the end, I got the desired output, shown below:

$ ./tool.py

Paste Ping -> bH54QCb9
Untitled -> EEMQigvX
free checked credit cards -> b6LE4e78
Untitled -> wJA8Axbb
Untitled -> fFFrEJnv
Untitled -> A8XGWYBu
Ejercicio01 -> CqP4grhP
Ejercicio01 -> nhxM8Tca
Untitled -> 8Y485jwG
f_get_product_balance_stock_exclude_reserved -> hc64MsgH
in_product_balance_stock_reserved -> ZGXgRWKQ
My Log File -> 24TnZK2F
Untitled -> tvbwuWkL
Woocommerce Minimum Order Amount -> j35Hg0Ci
...

Thanks for the answers!

pastebin python python-3.x regex web-scraping