Best way to clean non-matching records found by an alphanumeric regex using Python? (Explained)

Problem description

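I have a dataset with columns that contain alphanumeric values. I can already filter out the records that do not match, and now I want to clean them: for example, if a non-matching record is 123*abc&, the unwanted characters should be removed so that it becomes 123abc. I have done this, but I don't think my approach is correct, and it merges the data in the final result. I could use a for loop to get the values right, but that would be slow. So I am looking for a simpler way (cleaning column by column). Is that possible?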

import numpy as np
import pandas as pd
data = ['abc123','abc*123&','Abc123','ABC@*&123',np.nan,'123*Abc']
df=pd.DataFrame(data,columns=['a'])
print(df)
       a
0     abc123
1   abc*123&
2     Abc123
3  ABC@*&123
4        NaN
5    123*Abc

Filtering the non-matching records:

# numeric holds the regex pattern (its definition is not shown in the question)
wrong=df[~df['a'].str.contains(numeric,na=True)]
print(wrong)
        a
1   abc*123&
3  ABC@*&123
5    123*Abc
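
Because the pattern variable is not shown, here is a minimal sketch of the filtering step under an assumed pattern. The name `pattern`, its value `^[A-Za-z0-9]+$`, and the intermediate `is_clean` mask are illustrative assumptions, not the asker's actual code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['abc123', 'abc*123&', 'Abc123', 'ABC@*&123', np.nan, '123*Abc']})

# Assumed pattern: the whole value consists of ASCII letters and digits only.
pattern = r'^[A-Za-z0-9]+$'

# na=True treats NaN as a match, so missing values are not flagged as wrong.
is_clean = df['a'].str.contains(pattern, na=True)
wrong = df[~is_clean]
print(wrong)
```

With this assumed pattern, the rows at index 1, 3 and 5 are selected, which agrees with the output shown above.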


wrong_index = wrong.index
result = ''.join(i for i in wrong['a'] if not i.isalpha())  
alphanumeric = [character for character in result if character.isalnum()]
alphanumeric = "".join(alphanumeric)
df['a'].loc[wrong_index]=alphanumeric
print(df)
     a
0   abc123
1   abc123ABC123123Abc
2   Abc123
3   abc123ABC123123Abc
4   NaN
5   abc123ABC123123Abc

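I know why this happens, and it could be fixed with a for loop or by iterating over each row, but that would take a lot of time. Is there a way to clean the values column by column?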

Expected output:

       a
0     abc123
1     abc123
2     Abc123
3     ABC123
4        NaN
5     123Abc

Solution

Plain Python with the built-in regex module re is enough. See this demo on IDEone: using regular-expression to replace list elements

import re

data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc']
# raw string so that \W reaches the regex engine unchanged
cleaned = [re.sub(r'\W','',item) for item in data]
print(cleaned)

The script will output:

['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc']
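
Since the question asks for column-wise cleaning of a DataFrame, the same substitution can probably be expressed with the pandas string accessor. This is a sketch, not part of the original answer: it assumes that only ASCII letters and digits should survive, and it uses the explicit class `[^A-Za-z0-9]` instead of `\W` so that underscores are removed as well.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['abc123', 'abc*123&', 'Abc123', 'ABC@*&123', np.nan, '123*Abc']})

# Vectorised counterpart of re.sub: strip everything except ASCII letters and digits.
# Missing values (NaN) are left as they are by the .str accessor.
df['a'] = df['a'].str.replace(r'[^A-Za-z0-9]', '', regex=True)
print(df)
```

No index bookkeeping is needed here, because values that are already clean pass through the replacement unchanged.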

Explanation

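The re.sub function substitutes within the given string (here: item), like a search and replace.
The search is specified by the regex \W, which matches every non-word character, i.e. anything that is not a letter, a digit or the underscore.
The replacement is specified as an empty string '', which simply deletes whatever was found.
The for loop is written as a list comprehension, an idiomatic (pythonic) way to iterate over the elements of a list.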
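
One detail worth checking against your own data: for str patterns in Python 3, \w and \W are Unicode-aware by default, so accented letters count as word characters. The sketch below compares three variants on a made-up sample string (the sample and the variable names are illustrative assumptions).

```python
import re

# Sample value 'café_123*ñ&', written with escapes to avoid encoding surprises.
text = 'caf\u00e9_123*\u00f1&'

# Default (Unicode) mode: \W keeps accented letters and the underscore.
unicode_clean = re.sub(r'\W', '', text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters are removed too.
ascii_clean = re.sub(r'\W', '', text, flags=re.ASCII)

# An explicit character class drops the underscore as well.
strict_clean = re.sub(r'[^A-Za-z0-9]', '', text)

print(unicode_clean, ascii_clean, strict_clean)
```

Which variant is appropriate depends on whether non-ASCII letters and underscores count as valid characters in your column.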

Filtering to specific parts

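If you want to filter down to certain parts only, such as only the alphabetic or only the numeric characters, you need to combine the PCRE-style metacharacters, as in this demo on IDEone: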

import re

data = ['abc123','123*Abc','123_abc','123 abc']

# remove non-word characters; keeps [A-Za-z0-9_]
alphanumeric_underscore = [re.sub(r'\W','',item) for item in data]
print('alphanumeric_underscore',alphanumeric_underscore)

# remove the underscore as well; keeps [A-Za-z0-9]
alphanumeric = [re.sub(r'[\W_]','',item) for item in data]
print('alphanumeric',alphanumeric)

# filter only digits
numeric = [re.search(r"\d+",item).group(0) for item in data]
print('numeric',numeric)

# filter only alphas
alpha = [re.search(r"[A-Za-z]+",item).group(0) for item in data]
print('alpha',alpha)

It will output:

alphanumeric_underscore ['abc123', '123Abc', '123_abc', '123abc']
alphanumeric ['abc123', '123Abc', '123abc', '123abc']
numeric ['123', '123', '123', '123']
alpha ['abc', 'Abc', 'abc', 'abc']

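The last two filters use the regex search re.search with a regex written as a raw string, such as r"\d+". The call .group(0) returns the matched text, which is how the value is filtered. Note that re.search returns only the first match in each string, and it returns None when nothing matches, in which case .group(0) raises an AttributeError.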
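
When a value contains several separate runs of digits, or none at all, re.search with .group(0) may not be what you want. A minimal sketch of two alternatives follows; the helper name `first_run` and the sample data are hypothetical and are not part of the original demo.

```python
import re

data = ['abc123', '12ab34', 'abc']

def first_run(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# re.search stops at the first run of digits.
first_digits = [first_run(r'\d+', item) for item in data]

# re.findall collects every run; joining them keeps all digits.
all_digits = [''.join(re.findall(r'\d+', item)) for item in data]

print(first_digits)
print(all_digits)
```

For '12ab34' the first approach yields '12' while the second yields '1234', and for 'abc' they yield None and an empty string respectively.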

See also

Official Python 3 documentation: re — Regular expression operations
Official Python 3 documentation: Regular Expression HOWTO
A similar answer: Filtering text in python for numeric values
An interesting question about raw strings: Python regex - r prefix
