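Problem description
Please help, regex is blowing my mind.
I am cleaning data in a pandas dataframe (Python 3).
I have tried many regex combinations found online to grab the numbers, but none of them fits my case. I can't seem to figure out how to write my own regex for the pattern "2 digits, space, to, space, 2 digits" (example: 26 to 40).
My challenge is to extract the petal counts from the pandas column BLOOM (scraped data). Petals are often specified as "dd to dd petals". I know that two digits in regex is \d\d or \d{2}, but how do I incorporate the "to" separator? Ideally there would also be a condition that the pattern is followed by the word "petals".
I am surely not the first person who needs a Python regex for \d\d to \d\d.
EDIT:
I realize the question is a bit confusing without a sample dataframe. Here is one.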
import pandas as pd
import re
# initialize list of lists
data = [
    ['Evert van Dijk','Carmine-pink,salmon-pink streaks,stripes,flecks. Warm pink,clear carmine pink,rose pink shaded salmon. Mild fragrance. Large,very double,in small clusters,high-centered bloom form. Blooms in flushes throughout the season.'],
    ['Every Good Gift','Red. Flowers velvety red. Moderate fragrance. Average diameter 4". Medium-large,full (26-40 petals),borne mostly solitary bloom form. Blooms in flushes throughout the season.'],
    ['Evghenya','Orange-pink. 75 petals. Large,very double bloom form. Blooms in flushes throughout the season.'],
    # The Evita description is cut off after "Large," in the post; the string is closed here so the list parses.
    ['Evita','White or white blend. None to mild fragrance. 35 petals. Large,'],
    ['Evrathin','Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5". Medium,double (17-25 petals),cluster-flowered,in small clusters bloom form. Prolific,once-blooming spring or summer. Glandular sepals,leafy sepals,long sepals buds.'],
    ['Evita 2','White,blush shading. Mild,wild rose fragrance. 20 to 25 petals. Average diameter 1.25". Small,cluster-flowered bloom form. Blooms in flushes throughout the season.'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data,columns = ['NAME','BLOOM'])
# print dataframe.
df
Solution
You can use
df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal',expand=False)
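See the regex demo.
Details
- (?<!\d) - a negative lookbehind that makes sure there is no digit immediately to the left of the current position
- (\d{2}\s+to\s+\d{2}) - Group 1 (the value that str.extract actually returns):
  - \d{2} - two digits
  - \s+to\s+ - one or more whitespace characters, the string to, one or more whitespace characters
  - \d{2} - two digits
- \s*petal - zero or more whitespace characters followed by petal.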
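A minimal sketch of how this pattern behaves, assuming a small hand-made Series (the three strings are illustrative, shortened from the sample data in the question, and the names bloom and petals are made up for the example):

```python
import pandas as pd

# Illustrative strings, shortened from the BLOOM column in the question.
bloom = pd.Series([
    'Mild fragrance. 35 to 40 petals. Average diameter 2.5".',
    'Orange-pink. 75 petals. Large,very double bloom form.',
    'Medium-large,full (26-40 petals),borne mostly solitary bloom form.',
])

pattern = r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal'

# expand=False returns a Series holding Group 1; rows without a match become NaN.
petals = bloom.str.extract(pattern, expand=False)
print(petals.tolist())
```

Only the first string contains the "dd to dd petals" form, so the other two rows come back as NaN.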
This worked for me:
import re
sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)
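Output:
['26 to 40 petals', '16 to 43 petals']
As mentioned, \d{2} finds 2 digits, \sto\s finds the word 'to' surrounded by whitespace, then \d{2} again finds the second 2 digits, followed by a whitespace character (\s) and the word "petals".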
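If the numbers themselves are needed rather than the whole phrase, capturing groups can be added. This is a small sketch on the same sample string; the names whole, pairs and bounds are made up for the example:

```python
import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'

# Without a capturing group, findall returns each whole match.
whole = re.findall(r"\d{2}\sto\s\d{2}\spetals", sample)

# With two capturing groups, findall returns one (low, high) tuple per match.
pairs = re.findall(r"(\d{2})\sto\s(\d{2})\spetals", sample)
bounds = [(int(low), int(high)) for low, high in pairs]

print(whole)
print(bounds)
```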
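I am posting an answer to show how I solved the problem of extracting petal data from the BLOOM column. I had to use several regular expressions to get all the data I needed. The question covers only one of them.
The sample dataframe looks like this when printed:
I had created those columns before running into the problem that led to this post. My initial approach was to grab everything inside the brackets.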
# copy the content of the first pair of brackets in column BLOOM into new column PETALS
df['PETALS'] = df['BLOOM'].str.extract('(\\(.*?)\\)',expand=False).str.strip()
# the captured group still starts with "("; regex=False treats "(" as a literal character
df['PETALS'] = df['PETALS'].str.replace("(","",regex=False)
# copy the content of all pairs of brackets in column BLOOM into new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall('(\\(.*?)\\)')
df[['NAME','BLOOM','PETALS','ALL_PETALS_BRACKETS']]
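I later realized that this approach gets the petal values for only some of the rows. Petals can be specified in several ways in the BLOOM column. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals".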
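As an aside, the three notations described above could in principle be handled by one pattern with an optional upper bound. This is only a sketch under assumptions, not the approach used in this answer: the group names low and high, the variable names, and the shortened test strings are all made up for illustration.

```python
import pandas as pd

# Illustrative strings, shortened from the BLOOM column in the question.
bloom = pd.Series([
    'Medium-large,full (26-40 petals),borne mostly solitary bloom form.',
    'Mild fragrance. 35 to 40 petals. Average diameter 2.5".',
    'Orange-pink. 75 petals. Large,very double bloom form.',
    'Mild fragrance. Large,very double bloom form.',
])

# low: first two-digit number; high: optional second number after "-" or "to".
combined = r'(?<!\d)(?P<low>\d{2})(?:\s*(?:-|to)\s*(?P<high>\d{2}))?\s*petals?'

# Named groups become column names; a group that did not take part is NaN.
counts = bloom.str.extract(combined)
print(counts)
```

str.extract keeps only the first match per row, so a row that mentions petals twice would still need str.extractall or separate patterns.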
# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal',expand=False)
# my modification, which worked on the main df and not only on the test one:
# copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})',expand=False).str.strip()
# I also came across cases where the pattern is two digits followed by the word "petals":
# copy the part of column BLOOM that matches two digits followed by "petals" and a literal dot
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)',expand=False).str.strip()
df
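Since I was after the "2 digits petals" pattern, I had to modify my regex so that it looks for a literal dot, using +\. in r'(\d{2}\s+petals+\.)'. If the regex is written as r'(\d{2}\s+petals.)', it matches the word petals followed by . but also by other characters, because an unescaped . matches any character except a newline.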
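The difference between the escaped and the unescaped dot can be checked on a short string. This is a small sketch; text is a made-up example combining two notations from the BLOOM column:

```python
import re

# Made-up example: one bracketed range and one plain count.
text = 'full (26-40 petals),borne. 75 petals. Large'

# Escaped dot: "petals" must be followed by a literal ".".
escaped = re.findall(r'\d{2}\s+petals+\.', text)

# Unescaped dot: "petals" may be followed by any character, including ")".
unescaped = re.findall(r'\d{2}\s+petals.', text)

print(escaped)
print(unescaped)
```

With the escaped dot only "75 petals." is returned, while the unescaped version also picks up "40 petals)" from the bracketed range.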