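Is there a faster way to replace parts of strings in messy data than str.replace?

Problem description

I want to replace many values in a column of product variants.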

Big Ben Personalized Products AVENGERS – Stark / 2 set                                                2
BigBen Personalized Products Expendables – Statham / 2 set                                            2
BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set                                      2
BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 3 set                                       1
Personalized Toy 5 set                                                                                  1
BIG BEN Personalized  Machine 20.00% Off Auto renew (Versand jeden 3 Monate) Kids Toy / 3 set    1
BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Kids Toy / 5 set            1
BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Adults Toy / 5 set              1
BigBen Personalized Products 20.00% Off Auto renew (Versand jeden 5 Monate) Adults Toy / 5 set                   

Many of the product variants actually have the same value.

I would like to know whether there is a faster approach than the following:

df["product_variant"] = df["product_variant"].str.replace('BigBen Personalized', '', case=False)
df["product_variant"] = df["product_variant"].str.replace('Big Ben Personalized ', '', case=False)
df["product_variant"] = df["product_variant"].str.replace('BigBen Personalized', '', case=False)
df["product_variant"] = df["product_variant"].str.replace('Auto renew', '', case=False)

I expect the data row by row to look more like this:
AVENGERS - Stark (2 set)
Expendables - Statham (2 set)
Adults Toy (5 set)
Toy (5 set)
Kids Toy (3 set)
Kids Toy (5 set)
Adults Toy (5 set)
Kids Toy (5 set)
Adults Toy (3 set)

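Solution

One option is to create a specific pattern with 2 capture groups for these examples.

For most of the items, first match everything up to either just after Products or just before Adults or Kids.

  • Capture in group 1 the part that precedes the /.
  • Capture in group 2 one or more digits followed by set.

Example pattern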

^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*
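To make the two groups concrete, here is a minimal sketch using Python's re module on three of the sample rows (trailing whitespace shortened). The names pattern, samples and captured are only for illustration, and re.IGNORECASE is assumed here in place of an inline (?i) flag.

```python
import re

# Same pattern as above, split over two string literals for readability.
pattern = re.compile(
    r"^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)"
    r"(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*",
    re.IGNORECASE,
)

samples = [
    "Big Ben Personalized Products AVENGERS – Stark / 2 set      2",
    "BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Kids Toy / 5 set   1",
    "Personalized Toy 5 set     1",
]

captured = []
for s in samples:
    m = pattern.match(s)
    # group(1) is the variant name, group(2) is the "<digits> set" part
    captured.append((m.group(1), m.group(2)))
    print(m.group(1), "|", m.group(2))
```

A row that matches neither alternative would make pattern.match return None, so real data may need a guard before calling group.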

In the replacement, use the two capture groups: \1 (\2)

import pandas as pd

items = [
    "Big Ben Personalized Products AVENGERS – Stark / 2 set                                                2",
    "BigBen Personalized Products Expendables – Statham / 2 set                                            2",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set                                      2",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 3 set                                       1",
    "Personalized Toy 5 set                                                                                  1",
    "BIG BEN Personalized  Machine 20.00% Off Auto renew (Versand jeden 3 Monate) Kids Toy / 3 set    1",
    "BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Kids Toy / 5 set            1",
    "BigBen Personalized Toy 20.00% Off Auto renew (Versand jeden 2 Monate) Adults Toy / 5 set              1",
    "BigBen Personalized Products 20.00% Off Auto renew (Versand jeden 5 Monate) Adults Toy / 5 set                   ",
]


df = pd.DataFrame(items,columns=["product_variant"])
df["product_variant"] = df["product_variant"].replace(
    r'(?i)^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*',r'\1 (\2)',regex=True
)
print(df)

Output

                 product_variant
0       AVENGERS – Stark (2 set)
1  Expendables – Statham (2 set)
2             Adults Toy (5 set)
3             Adults Toy (3 set)
4                    Toy (5 set)
5               Kids Toy (3 set)
6               Kids Toy (5 set)
7             Adults Toy (5 set)
8             Adults Toy (5 set)
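
Since the question mentions that many variants share the same value, one further idea (not benchmarked here, so treat it as an assumption rather than a guaranteed speed-up) is to run the regex only once per distinct string and map the results back onto the column. The names rows, uniques, cleaned and mapping below are hypothetical, and the sample rows are shortened copies of the data above.

```python
import pandas as pd

pattern = (
    r"(?i)^(?:big\s*ben personalized (?:products\s+)?(?:.*?(?=Adult|Kids))?|personalized\s+)"
    r"(\w+(?: \w+)*(?: – \w+(?: \w+)*)?)(?: /)? (\d+ set)\b.*"
)

rows = [
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set   2",
    "Personalized Toy 5 set   1",
    "BigBen Personalized Toy 20.00% Off Auto renew Adults Toy / 5 set   2",
    "Personalized Toy 5 set   1",
]
df = pd.DataFrame(rows, columns=["product_variant"])

# Apply the regex to the distinct strings only ...
uniques = pd.Series(df["product_variant"].unique())
cleaned = uniques.replace(pattern, r"\1 (\2)", regex=True)

# ... then look up the cleaned value for every row.
mapping = dict(zip(uniques, cleaned))
df["product_variant"] = df["product_variant"].map(mapping)
print(df)
```

Whether this helps depends on how many duplicates the column really contains; with mostly unique strings the extra mapping step is just overhead.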