Remove the spaces between abbreviation letters in a string column

Problem description

I have a pandas DataFrame as follows:

import pandas as pd
import numpy as np

d = {'col1': ['I called the c. i. a','the house is e. m','this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)

I removed the punctuation and then removed the spaces between the abbreviation letters:

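# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)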

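The output is, for example, "I called the cia", but what I want is "I called the CIA". In other words, I would like the abbreviations to be uppercase. I tried the following, without success: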

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'.upper(),'')

or

df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)',''.upper())
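Neither attempt can work, because `.upper()` is applied to the pattern text or to the empty replacement string, never to the matched characters. A minimal sketch with the standard `re` module (the sample string `text` is an assumption, standing in for one row after punctuation has been removed):

```python
import re

pattern = r'(?<=\b\w)\s*[ &]\s*(?=\w\b)'

# Attempt 1: .upper() rewrites the pattern itself, which flips the escapes:
# \b (word boundary) becomes \B (not a boundary), \w becomes \W, \s becomes \S.
upper_pattern = pattern.upper()
print(upper_pattern)  # (?<=\B\W)\S*[ &]\S*(?=\W\B)

# Attempt 2: upper-casing an empty replacement string still yields an empty string.
print(repr(''.upper()))  # ''

text = 'I called the c i a'  # assumed sample row, punctuation already stripped
print(re.sub(pattern, '', text))        # I called the cia
print(re.sub(upper_pattern, '', text))  # I called the c i a (nothing matches)
```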

Solution

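pandas.Series.str.replace accepts a callable as its second argument, with the same requirements as the second argument of re.sub. Using that, you can first uppercase the abbreviations as follows: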

import pandas as pd
def make_upper(m):  # where m is re.Match object
    return m.group(0).upper()
d = {'col1': ['I called the c. i. a','the house is e. m','this is an e. u. call!','how is the p. o. r going?']}
df = pd.DataFrame(data=d)
# a callable replacement only works with regex=True (not the default on pandas 2.0+)
df['col1'] = df['col1'].str.replace(r'\b\w\.?\b', make_upper, regex=True)
print(df)

Output

                        col1
0       I called the C. I. A
1          the house is E. M
2     this is an E. U. call!
3  how is the P. O. R going?
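Since the callable follows the `re.sub` contract, it can be tried on a single string before touching the DataFrame. A minimal sketch using only the standard library; the `seen` list is a hypothetical helper added here to record what each match contained:

```python
import re

seen = []  # hypothetical helper: collects the text of every match

def make_upper(m):  # m is an re.Match object
    seen.append(m.group(0))
    return m.group(0).upper()

result = re.sub(r'\b\w\.?\b', make_upper, 'the house is e. m')
print(result)  # the house is E. M
print(seen)    # ['e', 'm']
```

Note that the dot is not part of either match here: a dot followed by a space has no word boundary after it, so the optional `\.?` matches nothing and the dots are left for the later punctuation step.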

You can then process the column further with the code you already have:

df['col1'] = df['col1'].str.replace(r'[^\w\s]', '', regex=True)
df['col1'] = df['col1'].str.replace(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', regex=True)
print(df)

Output

                   col1
0      I called the CIA
1       the house is EM
2    this is an EU call
3  how is the POR going

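If you run into cases it does not cover, you can refine the pattern I used (r'\b\w\.?\b'). I used word boundaries and a literal dot (\.), so it finds any single word character (\w) optionally (?) followed by a dot.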
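One case worth checking is a standalone one-letter word such as the article "a", which the pattern above also uppercases. The sketch below compares it with a narrower pattern; `NARROW` and the sample sentence are assumptions made for illustration, not part of the original answer, and `NARROW` would still uppercase a one-letter word that directly follows a sentence-ending dot and a space.

```python
import re

def make_upper(m):  # m is an re.Match object
    return m.group(0).upper()

text = 'this is a test of the c. i. a'  # assumed sample sentence

# Original pattern: every standalone single character is upper-cased,
# including the article "a".
broad = re.sub(r'\b\w\.?\b', make_upper, text)

# Hypothetical narrower pattern: a single character directly followed by a dot,
# or a single character that directly follows a dot and a space.
NARROW = r'\b\w(?=\.)|(?<=\. )\w\b'
narrow = re.sub(NARROW, make_upper, text)

print(broad)   # this is A test of the C. I. A
print(narrow)  # this is a test of the C. I. A
```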

You need to use a function for the replacement. Try this to uppercase the letters and strip the spaces and punctuation from the acronyms:

def my_replace(match):
    match = match.group()
    return match.replace('.','').replace(' ','').upper()

df['col1'] = df['col1'].str.replace(r'\b[\w](\.\s[\w])+\b[\.]*', my_replace, regex=True)
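To check what this single-pass approach produces on the sample rows without a DataFrame, here is a minimal sketch that applies the same pattern and callable with `re.sub`; the names `PATTERN`, `rows` and `cleaned` are introduced only for this illustration.

```python
import re

# Same pattern as above: one letter, then one or more ". letter" groups,
# then any trailing dots.
PATTERN = r'\b[\w](\.\s[\w])+\b[\.]*'

def my_replace(match):
    # Drop the dots and spaces inside the matched abbreviation, then upper-case it.
    return match.group().replace('.', '').replace(' ', '').upper()

rows = [
    'I called the c. i. a',
    'the house is e. m',
    'this is an e. u. call!',
    'how is the p. o. r going?',
]
cleaned = [re.sub(PATTERN, my_replace, row) for row in rows]
for row in cleaned:
    print(row)
# I called the CIA
# the house is EM
# this is an EU call!
# how is the POR going?
```

Unlike the two-step approach, this leaves punctuation outside the abbreviations (the "!" and "?") in place, so a separate punctuation-stripping step is still needed if those should go as well.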
