Problem description
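import pandas as pd
# One bit position per ServiceSubCodeKey value from 1 up to the maximum key
scale = df.ServiceSubCodeKey.max() + 1
onehot = []
for claimid, ssc in df.groupby('ClaimId').ServiceSubCodeKey:
    ssc_set = set(ssc)  # set membership is faster than scanning a list
    # Build a '0'/'1' string with a '1' at every key present for this claim
    onehot.append([claimid, ''.join('1' if i in ssc_set else '0' for i in range(1, scale))])
onehot = pd.DataFrame(onehot, columns=['ClaimId', 'onehot'])
print(onehot)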
onehot
Out[25]:
ClaimId onehot
0 1902659 0000000000000000000000000000000000000000000000...
1 1902663 0000000000000000000000000000000000000000000000...
2 1902674 0000000000010000000000100000000000000000100000...
3 1904129 0000000000000000000000100000000000000000000000...
4 1904130 0000000000000000000010000000000000000000000000...
... ...
626853 2592904 0000000000000000000000100000000000000000000000...
626854 2592920 0000000000000000000000100000000000000000000000...
626855 2593386 0000000000000000000000000000000000000000000000...
626856 2593387 0000000000000000000000000000000000000000000000...
626857 2593533 0000000000000000000000000000000000000000000000...
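I want each one-hot encoded value to be represented by a unique number, with identical values sharing the same number. How can I do this?
Similarly, I created a hashing step: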
import hashlib
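# Combine the two columns into one integer per row, then hash its string form
hashes1 = df2.apply(lambda x: hashlib.sha1(str(x['ClaimId'] * 1024 + x['SubDiagnosisId']).encode('utf8')).hexdigest(), axis=1)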
# Create a DataFrame from the above Series
df_hash = pd.DataFrame(hashes1,columns=['hash'])
df2 = df2.join(df_hash)
df2
Out[24]:
ClaimId SubDiagnosisId hash
0 2094825 141 ad0334de4a944401aa6c847b06246d553362b45a
1 2259956 155 8b9eb6f311d4a9f98dedb32dae7a2effeaf46fe9
2 2327668 583 ef87b808734992ddfd480a87eb1fe7269111062f
3 1985370 100 7a0907f4818a3edb3414b51c85a85605bc367787
4 2417177 47 24fa886d4e01f5c581ae171ffe5ce1323e3201b0
... ... ...
1063955 1958912 355 de0c5fb7ee479c8b7a174f517349fcb5edea4602
1063956 1994638 163 300c0845403d9936cb80d1afa898452fd11a606c
1063957 2371059 74 87f0c57ac85a169c425f2d31e70011f9bd0db366
1063958 2522719 155 b2c5114e4de1be96959d0425711b926d350fe3f0
1063959 2349207 18 b829ce393ac5c1e5948c3b72f7f000f9737ca005
I also want to assign a unique number to each of these hashes. How can I do this?
Solution
What you are trying to do is called label encoding. You can use scikit-learn (sklearn) for this.
Try this:
# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing
# LabelEncoder maps each distinct label to an integer from 0 to n_classes - 1
label_encoder = preprocessing.LabelEncoder()
# Encode the labels in column 'hash' (df is the DataFrame holding that column, df2 in the question)
df['uniquevalue'] = label_encoder.fit_transform(df['hash'])
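
The same approach should also work for the one-hot strings from the first part of the question. Below is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical, not taken from the question). It also shows `pd.factorize` as a pandas-only alternative, which is not part of the original answer: `LabelEncoder` numbers the distinct values in sorted order, while `pd.factorize` numbers them in order of first appearance. In both cases identical strings receive the same integer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the 'onehot' column from the question
toy = pd.DataFrame({
    "ClaimId": [1, 2, 3, 4],
    "onehot": ["0010", "0100", "0010", "0000"],
})

# LabelEncoder sorts the distinct values and numbers them 0..n-1
encoder = LabelEncoder()
toy["code_sklearn"] = encoder.fit_transform(toy["onehot"])

# pd.factorize numbers the values in order of first appearance instead
codes, uniques = pd.factorize(toy["onehot"])
toy["code_pandas"] = codes

print(toy)

# The original string can be recovered from a code with inverse_transform
print(encoder.inverse_transform([1]))
```

If the codes only need to be unique and stable within one DataFrame, either option should do; `LabelEncoder` is the more convenient choice when the fitted mapping has to be reused on other data later.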