Python regex for the pattern "2 digits to 2 digits" (e.g. 26 to 40)

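Question

Please help, regex is blowing my mind.

I am cleaning data in a pandas DataFrame (Python 3).

I have tried many regex combinations found online for extracting numbers, but none of them fits my case. I cannot figure out how to write my own regex for the pattern "2 digits, space, to, space, 2 digits" (example: 26 to 40).

My challenge is to extract the petal count from the pandas column BLOOM (scraped data). Petals are often specified as "dd to dd petals". I know that 2 digits in regex is \d\d or \d{2}, but how do I incorporate the "to" separator? Ideally there would also be a condition that the pattern is followed by the word "petals".

Surely I am not the first person who needs a Python regex for \d\d to \d\d.

Edit:

I realize that the question was a bit confusing without a sample dataframe. Here is one.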

import pandas as pd
import re

# Initialize a list of [NAME, BLOOM] rows
data = [
    ['Evert van Dijk', 'Carmine-pink,salmon-pink streaks,stripes,flecks.  Warm pink,clear carmine pink,rose pink shaded salmon.  Mild fragrance.  Large,very double,in small clusters,high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large,full (26-40 petals),borne mostly solitary bloom form.  Blooms in flushes throughout the season.'],
    ['Evghenya', 'Orange-pink.  75 petals.  Large,very double bloom form.  Blooms in flushes throughout the season.'],
    ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large,'],  # description is truncated in the source
    ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium,double (17-25 petals),cluster-flowered,in small clusters bloom form.  Prolific,once-blooming spring or summer.  Glandular sepals,leafy sepals,long sepals buds.'],
    ['Evita 2', 'White,blush shading.  Mild,wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small,cluster-flowered bloom form.  Blooms in flushes throughout the season.'],
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['NAME', 'BLOOM'])

# Display the dataframe
df

Solution

You can use

df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal',expand=False)

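See the regex demo.

Details

(?<!\d) - a negative lookbehind that ensures there is no digit immediately to the left of the current position
(\d{2}\s+to\s+\d{2}) - Group 1 (the value that str.extract actually returns):
- \d{2} - two digits
- \s+to\s+ - 1 or more whitespace characters, the string to, 1 or more whitespace characters
- \d{2} - two digits
\s*petal - 0 or more whitespace characters followed by petal.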
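
To see what the lookbehind and the trailing petal condition each contribute, here is a small sketch using only the re module. The test strings are made up for illustration and are not taken from the real data set.

```python
import re

# Pattern from the answer above: two digits, "to", two digits, then "petal".
pattern = re.compile(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal')

# Hypothetical test strings (assumptions, not real rows).
samples = [
    'Mild fragrance.  35 to 40 petals.  Average diameter 2.5".',
    'Up to 135 to 40 petals.',    # "35" is preceded by a digit
    '35 to 40 blooms per stem.',  # the range is not followed by "petal"
]

results = []
for text in samples:
    match = pattern.search(text)
    # Group 1 holds only the "dd to dd" part, without the word "petal".
    results.append(match.group(1) if match else None)

print(results)
```

Only the first string should produce a match: the second is rejected by the lookbehind and the third by the missing petal suffix.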

This worked for me:

import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)

Output:

['26 to 40 petals', '16 to 43 petals']

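As mentioned, \d{2} finds 2 digits, \sto\s finds the word 'to' surrounded by single whitespace characters, the second \d{2} finds the next 2 digits, and these are followed by one whitespace character (\s) and the word "petals".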
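
One thing to keep in mind with this version is that \s matches exactly one whitespace character, whereas the scraped text may contain runs of spaces. The sketch below compares it with a more tolerant variant; the sample string and the variable names are assumptions made for illustration.

```python
import re

# Exactly one whitespace character between the parts, plural "petals" only.
strict = re.compile(r'\d{2}\sto\s\d{2}\spetals')
# One or more whitespace characters, and an optional trailing "s".
flexible = re.compile(r'\d{2}\s+to\s+\d{2}\s+petals?')

# Hypothetical text with a double space and a singular "petal".
sample = '26 to 40 petals, 16  to 43 petals, 20 to 25 petal'

strict_hits = strict.findall(sample)
flexible_hits = flexible.findall(sample)

print(strict_hits)
print(flexible_hits)
```

The strict pattern should find only the first range, while the flexible one should find all three.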

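I am posting an answer to show how I solved the problem of extracting petal data from the BLOOM column. I had to use several regular expressions to get all the data I needed. The question covers only one of them.

The sample dataframe looks like this when printed:

I had created those columns before I ran into the problem that led to this post. My initial approach was to grab everything inside brackets.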

# Copy the content of the first bracketed group in BLOOM into a new column PETALS
df['PETALS'] = df['BLOOM'].str.extract(r'(\(.*?)\)', expand=False).str.strip()
# The capture group includes the opening parenthesis, so remove it as a literal string
df['PETALS'] = df['PETALS'].str.replace('(', '', regex=False)

# Copy the content of every bracketed group in BLOOM into a new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'(\(.*?)\)')
df[['NAME', 'BLOOM', 'PETALS', 'ALL_PETALS_BRACKETS']]
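
The bracket pattern above places the escaped opening parenthesis inside the capture group, which is why a separate replace step is needed. The following sketch, using plain re on one of the sample descriptions, compares it with a variant that starts the group after the parenthesis. The variant is a suggestion, not part of the original approach.

```python
import re

text = 'Medium-large,full (26-40 petals),borne mostly solitary bloom form.'

# The group starts before the escaped "(", so the parenthesis is captured too.
with_paren = re.findall(r'(\(.*?)\)', text)
# The group starts after the escaped "(", so only the inner text is captured.
inner_only = re.findall(r'\((.*?)\)', text)

print(with_paren)
print(inner_only)
```

With the second form the result no longer needs the opening parenthesis stripped afterwards.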

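I later realized that this approach only retrieves petal values for some of the rows. Petals can be specified in several ways in the BLOOM column. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals".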

# Solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# My modification, which worked on the main df and not only on the test one.
# Copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# I also came across cases where the pattern is two digits followed by the word "petals".
# Copy the part of column BLOOM that matches the pattern two digits followed by "petals."
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()
df
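
As an alternative sketch, the three formats could probably be captured in one pass with named groups, which str.extract turns into column names. The column names PETALS_MIN and PETALS_MAX, the shortened descriptions, and the combined pattern itself are assumptions for illustration and are not part of the approach above.

```python
import pandas as pd

# Hypothetical mini data set; the descriptions are shortened from the sample rows.
df = pd.DataFrame({
    'NAME': ['Every Good Gift', 'Evghenya', 'Evrathin', 'Evert van Dijk'],
    'BLOOM': [
        'Medium-large,full (26-40 petals),borne mostly solitary bloom form.',
        'Orange-pink.  75 petals.  Large,very double bloom form.',
        'Mild fragrance.  35 to 40 petals.  Average diameter 2.5".',
        'Mild fragrance.  Large,very double,in small clusters.',
    ],
})

# Two digits, an optional upper bound after "to" or "-", then "petal" or "petals".
pattern = (
    r'(?<!\d)(?P<PETALS_MIN>\d{2})'
    r'(?:\s*(?:to|-)\s*(?P<PETALS_MAX>\d{2}))?'
    r'\s*petals?\b'
)

# With named groups, str.extract returns one column per group.
petals = df['BLOOM'].str.extract(pattern)
# A single count such as "75 petals" has no upper bound, so reuse the lower one.
petals['PETALS_MAX'] = petals['PETALS_MAX'].fillna(petals['PETALS_MIN'])
df = df.join(petals)

print(df[['NAME', 'PETALS_MIN', 'PETALS_MAX']])
```

Rows without any petal count keep missing values in both columns, and the extracted values are strings, so a numeric conversion would still be needed before doing arithmetic on them.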

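Because I was after the "2 digits petals" pattern, I had to modify my regex so that it looks for the dot with +\. as in r'(\d{2}\s+petals+\.)'. If the regex is written as r'(\d{2}\s+petals.)', the unescaped . matches any character, so it also catches cases where the word petals is followed by something other than a dot.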
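
The difference between the escaped and the unescaped dot can be checked with a short sketch. The sample text below is put together from fragments of the sample rows and is only illustrative.

```python
import re

# Hypothetical text containing both a "petals." and a "petals)" occurrence.
text = 'Orange-pink.  75 petals.  Medium,double (17-25 petals),cluster-flowered.'

# Escaped dot: a literal "." must follow the word.
escaped = re.findall(r'(\d{2}\s+petals+\.)', text)
# Unescaped dot: any single character may follow the word.
unescaped = re.findall(r'(\d{2}\s+petals.)', text)

print(escaped)
print(unescaped)
```

Note that the + in petals+ only repeats the letter s; it is the escaped \. that restricts the match to a literal dot.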
