Problem description
import pandas as pd

scale = df.ServiceSubCodeKey.max() + 1
onehot = []
for claimid, ssc in df.groupby('ClaimId').ServiceSubCodeKey:
    ssc_list = ssc.to_list()
    onehot.append([claimid, ''.join(['1' if i in ssc_list else '0' for i in range(1, scale)])])
onehot = pd.DataFrame(onehot, columns=['ClaimId', 'onehot'])
print(onehot)

onehot
Out[25]:
        ClaimId                                             onehot
0       1902659  0000000000000000000000000000000000000000000000...
1       1902663  0000000000000000000000000000000000000000000000...
2       1902674  0000000000010000000000100000000000000000100000...
3       1904129  0000000000000000000000100000000000000000000000...
4       1904130  0000000000000000000010000000000000000000000000...
...         ...                                                ...
626853  2592904  0000000000000000000000100000000000000000000000...
626854  2592920  0000000000000000000000100000000000000000000000...
626855  2593386  0000000000000000000000000000000000000000000000...
626856  2593387  0000000000000000000000000000000000000000000000...
626857  2593533  0000000000000000000000000000000000000000000000...
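I would like each one-hot encoded value to be represented by a unique number, with identical values sharing the same number. How can I do that?
Similarly, I created a hashing step: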
import hashlib

hashes1 = df2.apply(lambda x: hashlib.sha1(str(x[0]*1024 + x[1]).encode('utf8')).hexdigest(), axis=1)
# Create a DataFrame from the above Series
df_hash = pd.DataFrame(hashes1, columns=['hash'])
df2 = df2.join(df_hash)

df2
Out[24]:
         ClaimId  SubDiagnosisId                                      hash
0        2094825             141  ad0334de4a944401aa6c847b06246d553362b45a
1        2259956             155  8b9eb6f311d4a9f98dedb32dae7a2effeaf46fe9
2        2327668             583  ef87b808734992ddfd480a87eb1fe7269111062f
3        1985370             100  7a0907f4818a3edb3414b51c85a85605bc367787
4        2417177              47  24fa886d4e01f5c581ae171ffe5ce1323e3201b0
...          ...             ...                                       ...
1063955  1958912             355  de0c5fb7ee479c8b7a174f517349fcb5edea4602
1063956  1994638             163  300c0845403d9936cb80d1afa898452fd11a606c
1063957  2371059              74  87f0c57ac85a169c425f2d31e70011f9bd0db366
1063958  2522719             155  b2c5114e4de1be96959d0425711b926d350fe3f0
1063959  2349207              18  b829ce393ac5c1e5948c3b72f7f000f9737ca005
I would also like to assign a unique number to each of these hashes. How can I do that?
Solution
What you are trying to do is called label encoding. You can use sklearn (scikit-learn) to do it.
Try this:
# Import the label encoder
from sklearn import preprocessing

# The LabelEncoder object maps each distinct label to an integer code.
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in column 'hash' (the question stores that column in df2).
df2['uniquevalue'] = label_encoder.fit_transform(df2['hash'])
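
To make it clearer what label encoding produces, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper name label_encode and the sample strings are made up for illustration. It only mirrors the behaviour that distinct values are sorted and each one gets an integer code, so equal values always share a code.

```python
def label_encode(values):
    """Map each distinct value to an integer code; equal values share a code.

    Hypothetical helper for illustration only, not part of any library.
    """
    # Sort the distinct values so the codes are deterministic.
    classes = sorted(set(values))
    # Build the value -> code lookup table.
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Made-up sample strings standing in for the 'onehot' or 'hash' column.
sample = ["0010", "0001", "0010", "1000"]
codes, classes = label_encode(sample)
print(codes)    # the repeated "0010" gets the same code both times
print(classes)  # position in this list is the code of each value
```

The same approach should work for the one-hot strings in the question: passing the 'onehot' column to fit_transform instead of the 'hash' column gives one integer per distinct string. If sorted codes are not needed, pandas.factorize is an alternative that numbers values in order of first appearance.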