How do I create a signature number for one-hot encoded values and hashlib hashes in Python?

Problem description

I want to assign a number to each one-hot encoded value in a DataFrame:

import pandas as pd
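# One bit position per ServiceSubCodeKey value, from 1 to the maximum key
scale = df.ServiceSubCodeKey.max() + 1
onehot = []
# Build one bit string per claim: '1' where the claim has that ServiceSubCodeKey
for claimid,ssc in df.groupby('ClaimId').ServiceSubCodeKey:
    ssc_list = ssc.to_list()
    onehot.append([claimid,''.join(['1' if i in ssc_list else '0' for i in range(1,scale)])])
onehot = pd.DataFrame(onehot,columns=['ClaimId','onehot'])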
print(onehot)
onehot
Out[25]: 
        ClaimId                                             onehot
0       1902659  0000000000000000000000000000000000000000000000...
1       1902663  0000000000000000000000000000000000000000000000...
2       1902674  0000000000010000000000100000000000000000100000...
3       1904129  0000000000000000000000100000000000000000000000...
4       1904130  0000000000000000000010000000000000000000000000...
        ...                                                ...
626853  2592904  0000000000000000000000100000000000000000000000...
626854  2592920  0000000000000000000000100000000000000000000000...
626855  2593386  0000000000000000000000000000000000000000000000...
626856  2593387  0000000000000000000000000000000000000000000000...
626857  2593533  0000000000000000000000000000000000000000000000...

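I want each distinct one-hot encoded value to map to its own unique number, with a number repeated only when the encoded value itself repeats. How can I do this?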

Similarly, I computed a hash for each row:

import hashlib

# SHA-1 hex digest of ClaimId*1024 + SubDiagnosisId, computed row by row
hashes1 = df2.apply(lambda x:hashlib.sha1(str(x[0]*1024+x[1]).encode('utf8')).hexdigest(),axis=1)

# Create a DataFrame from the above Series
df_hash = pd.DataFrame(hashes1,columns=['hash'])

df2 = df2.join(df_hash)

df2
Out[24]: 
         ClaimId  SubDiagnosisId                                      hash
0        2094825             141  ad0334de4a944401aa6c847b06246d553362b45a
1        2259956             155  8b9eb6f311d4a9f98dedb32dae7a2effeaf46fe9
2        2327668             583  ef87b808734992ddfd480a87eb1fe7269111062f
3        1985370             100  7a0907f4818a3edb3414b51c85a85605bc367787
4        2417177              47  24fa886d4e01f5c581ae171ffe5ce1323e3201b0
         ...             ...                                       ...
1063955  1958912             355  de0c5fb7ee479c8b7a174f517349fcb5edea4602
1063956  1994638             163  300c0845403d9936cb80d1afa898452fd11a606c
1063957  2371059              74  87f0c57ac85a169c425f2d31e70011f9bd0db366
1063958  2522719             155  b2c5114e4de1be96959d0425711b926d350fe3f0
1063959  2349207              18  b829ce393ac5c1e5948c3b72f7f000f9737ca005

I would also like to assign a unique number to each of these hashes. How can I do this?

Solution

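What you are trying to do is called label encoding: every distinct value gets an integer code, and equal values share the same code. You can use scikit-learn (sklearn) for this.

Try this: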

# Import the label encoder
from sklearn import preprocessing

# The LabelEncoder object maps each distinct label to an integer code
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in column 'hash'
# (df is the DataFrame that holds the 'hash' column, df2 in the question)
df['uniquevalue'] = label_encoder.fit_transform(df['hash'])
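
To make the idea concrete, here is a minimal, dependency-free sketch of what label encoding does. It is not scikit-learn's implementation. The helper `label_encode` and the sample bit strings are hypothetical, and the third (ClaimId, SubDiagnosisId) pair is a duplicate added only to show a repeated code. Like `LabelEncoder`, the sketch sorts the distinct values before numbering them, so the codes are deterministic.

```python
import hashlib


def label_encode(values):
    """Return (codes, classes): one integer code per value, equal values share a code.

    Hypothetical helper for illustration only.
    """
    # Sort the distinct values so the same input always yields the same codes.
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


# Hypothetical one-hot bit strings, shortened to four positions.
onehot = ["0000", "0100", "0000", "0011"]
onehot_codes, onehot_classes = label_encode(onehot)

# (ClaimId, SubDiagnosisId) pairs; the third repeats the first on purpose.
pairs = [(2094825, 141), (2259956, 155), (2094825, 141)]
# Same hashing recipe as in the question: SHA-1 of ClaimId*1024 + SubDiagnosisId.
hashes = [
    hashlib.sha1(str(claim_id * 1024 + sub_id).encode("utf8")).hexdigest()
    for claim_id, sub_id in pairs
]
hash_codes, hash_classes = label_encode(hashes)

print(onehot_codes)  # repeated bit strings share a code
print(hash_codes)    # repeated hashes share a code
```

The same call works for the 'onehot' column and for the 'hash' column, because the encoder only cares about which values are equal, not about what they contain.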