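Problem description
I have the DataFrame shown below and want to anonymize it by replacing the unique values of one column. Specifically, I want to replace the surname column ("last") with fake surnames generated by the "faker" library.
The code snippet is below.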
import pandas as pd
from faker import Faker
fake = Faker()
print(fake.first_name())
print(fake.last_name())
# first names serve as the DataFrame index
first = ('Mike','Dorothee','Tom','Bill','Pete','Kate')
last = ('Meyer','Maier','Meyer','Mayer','Meyr','Mair')
job = ('data analyst','programmer','computer scientist','data scientist','accountant','psychiatrist')
# six entries, one per row, so zip() does not drop the last row
language = ('Python','Perl','Java','Java','Cobol','Brainfuck')
df = pd.DataFrame(list(zip(last,job,language)),columns =['last','job','language'],index=first)
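The output I want has the surname column replaced with fake names, but consistently: for example, Meyer should always be replaced by the same fake surname.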
Solution
Collect all unique names, build a dictionary that maps each unique name to a fake name, and map that dictionary onto your column:
import pandas as pd
first = ('Mike','Dorothee','Tom','Bill','Pete','Kate')
last = ('Meyer','Maier','Meyer','Mayer','Meyr','Mair')
job = ('data analyst','programmer','computer scientist','data scientist','accountant','psychiatrist')
# six entries, one per row, so zip() does not drop the last row
language = ('Python','Perl','Java','Java','Cobol','Brainfuck')
df = pd.DataFrame(list(zip(last,job,language)),columns =['last','job','language'],index=first)
print(df)
# get all unique names - this can easily handle a few tens of thousands of names
all_names = set(df["last"])
# create the mapper: you would use fake.last_name() instead of 42 + i
# mapper = {k: fake.last_name() for k in all_names}
# 42 + i is only a placeholder; which number a name gets depends on set iteration order
mapper = {k: 42 + i for i, k in enumerate(all_names)}
# apply it: every occurrence of a name is replaced by the same value
df["last"] = df["last"].map(mapper)
print(df)
Output:
# before
           last                 job   language
Mike      Meyer        data analyst     Python
Dorothee  Maier          programmer       Perl
Tom       Meyer  computer scientist       Java
Bill      Mayer      data scientist       Java
Pete       Meyr          accountant      Cobol
Kate       Mair        psychiatrist  Brainfuck
# after
          last                 job   language
Mike        44        data analyst     Python
Dorothee    43          programmer       Perl
Tom         44  computer scientist       Java
Bill        45      data scientist       Java
Pete        46          accountant      Cobol
Kate        47        psychiatrist  Brainfuck
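One caveat with a random generator such as fake.last_name(): it can return the same surname twice, so two different real names could end up sharing one fake name. The sketch below is one possible way to guard against that, not part of the original answer. The helper build_unique_mapper and the stand-in generator stand_in_last_name are hypothetical names introduced only for illustration; the stand-in exists so the example runs without Faker installed.

```python
from itertools import count
from typing import Callable, Dict, Iterable


def build_unique_mapper(names: Iterable[str],
                        make_fake: Callable[[], str],
                        max_tries: int = 1000) -> Dict[str, str]:
    """Map each distinct real name to a distinct fake name (hypothetical helper)."""
    mapper: Dict[str, str] = {}
    used = set()
    # sorted() gives a stable iteration order, unlike iterating a bare set
    for name in sorted(set(names)):
        for _ in range(max_tries):
            candidate = make_fake()
            if candidate not in used:
                break  # found a fake name nobody else has yet
        else:
            raise RuntimeError(f"no unused fake name found for {name!r}")
        used.add(candidate)
        mapper[name] = candidate
    return mapper


# Stand-in generator (assumption, for illustration only): it returns every
# value twice in a row to exercise the collision check.
# With Faker you would pass Faker().last_name instead.
_counter = count()


def stand_in_last_name() -> str:
    return f"Surname{next(_counter) // 2}"


last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
mapper = build_unique_mapper(last, stand_in_last_name)
anonymized = [mapper[name] for name in last]
print(anonymized)
```

The resulting dictionary can be applied exactly as in the answer, with df["last"] = df["last"].map(mapper). Recent Faker versions also provide a uniqueness proxy, fake.unique.last_name(), which may make the manual retry loop unnecessary.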