Problem description
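import pandas as pd

d = {'col1': ['I called the c. i. a','the house is e. m','this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
# Strip punctuation, then join single letters separated by spaces or '&'.
# regex=True is passed explicitly because pandas 2.0+ defaults to literal matching.
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)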
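The output is, for example, 'I called the cia', but what I want is 'I called the CIA'. So basically I would like the abbreviations to be uppercase. I tried the following, without success: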
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(),'')
or
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)',''.upper())
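A quick check of why neither attempt can change the case of the matched letters: in both, .upper() runs on a plain string before str.replace ever sees the data. The snippet below is only a sketch using the pattern from the question; the variable name pattern is introduced here for illustration.

```python
# The pattern from the question, stored in a variable for inspection.
pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'

# Attempt 1 uppercases the pattern text itself, so \b, \w and \s become
# the negated classes \B, \W and \S: a different regex, not uppercase output.
print(pattern.upper())

# Attempt 2 uppercases the empty replacement string, which is still empty.
print(repr(''.upper()))
```

In other words, the uppercasing has to happen per match, which is what a callable replacement provides.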
Solution
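pandas.Series.str.replace allows its second argument to be a callable, following the requirements for the second argument of re.sub. Using that, you can first uppercase the abbreviations as follows: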
import pandas as pd

def make_upper(m):  # where m is an re.Match object
    return m.group(0).upper()

d = {'col1': ['I called the c. i. a','the house is e. m','this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
# A callable replacement requires regex=True (the default changed in pandas 2.0).
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)
print(df)
Output
col1
0 I called the C. I. A
1 the house is E. M
2 this is an E. U. call!
3 how is the P. O. R going?
Then you can process the result further with the code you already have:
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)
Output
col1
0 I called the CIA
1 the house is EM
2 this is an EU call
3 how is the POR going
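If you run into a case this does not cover, you can refine the pattern I used (r'\b\w\.?\b'). It combines word boundaries (\b) with a literal dot (\.), so it finds any single word character (\w) optionally (?) followed by a dot.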
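To see what that pattern actually selects before refining it, it can be run through re.findall on a few strings. This is a minimal sketch; the helper name matches and the second and third sample strings are made up for illustration and are not from the question.

```python
import re

pattern = r'\b\w\.?\b'

def matches(text):
    """Return every substring the pattern would hand to the callable."""
    return re.findall(pattern, text)

# Every standalone single character is selected, including the pronoun "I".
print(matches('I called the c. i. a'))   # ['I', 'c', 'i', 'a']

# The optional dot is only consumed when a word character follows it
# directly, because the trailing \b needs a boundary after the dot.
print(matches('the e.u summit'))         # ['e.', 'u']

# One-letter words such as the article "a" match as well.
print(matches('a call to the e. u.'))    # ['a', 'e', 'u']
```

One consequence worth keeping in mind: a one-letter word such as "a" would also be uppercased, so text containing such words may need a stricter pattern.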
You need to use a function for the replacement. Try this to uppercase the acronyms and remove their spaces and punctuation:
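def my_replace(match):
    # Drop the dots and spaces inside the matched acronym, then uppercase it.
    text = match.group()
    return text.replace('.', '').replace(' ', '').upper()

# Assign the result back; str.replace returns a new Series.
df['col1'] = df['col1'].str.replace(r'\b[\w](\.\s[\w])+\b[\.]*', my_replace, regex=True)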
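The effect of this single-pass approach can be checked without a DataFrame, assuming str.replace with a callable hands each match to the function the same way re.sub does. The sketch below uses plain re.sub on the sample strings from the question; the names ACRONYM, collapse_acronym, samples and results are illustrative.

```python
import re

# Same pattern as above, with the redundant character-class brackets dropped.
ACRONYM = r'\b\w(\.\s\w)+\b\.*'

def collapse_acronym(match):
    """Remove dots and spaces from the matched acronym and uppercase it."""
    return match.group().replace('.', '').replace(' ', '').upper()

samples = [
    'I called the c. i. a',
    'the house is e. m',
    'this is an e. u. call!',
    'how is the p. o. r going?',
]

results = [re.sub(ACRONYM, collapse_acronym, s) for s in samples]
for line in results:
    print(line)
```

Unlike the two-step version, this leaves other punctuation such as '!' and '?' in place, since only the matched acronym is rewritten.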