Problem description
I have trained a linear support vector machine (SVM) to classify emails as spam or non-spam based on their words. I first convert an email into processed text with this code:
import re

def processEmail(email):
    email = email.lower()
    #replace HTML tags like <html> with a space
    email = re.sub(r"<[^<>]+>", " ", email)
    #replace numbers with the string "number"
    email = re.sub(r"[0-9]+", "number", email)
    #replace anything that starts with http:// or https:// with "httpaddr"
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    #replace strings with @ in the middle with "emailaddr"
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    #replace $ with "dollar"
    email = re.sub(r"[$]+", "dollar", email)
    #remove >, , and ?
    email = re.sub(r"[>,?]", "", email)
    print("--------------------------------Pre-processed Email------------------------")
    print(email)
    return email
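As a standalone sanity check, substitutions of this kind can be exercised on a made-up sample email (the sample text below is invented for illustration):

```python
import re

# made-up sample email to exercise the substitutions
sample = "Visit http://deal.com NOW! Only $99, contact me@x.com <br>"
s = sample.lower()
s = re.sub(r"<[^<>]+>", " ", s)                       # strip HTML tags
s = re.sub(r"[0-9]+", "number", s)                    # normalize digits
s = re.sub(r"(http|https)://[^\s]*", "httpaddr", s)   # normalize URLs
s = re.sub(r"[^\s]+@[^\s]+", "emailaddr", s)          # normalize email addresses
s = re.sub(r"[$]+", "dollar", s)                      # normalize dollar signs
print(s)
```

Note that the tag pattern needs the closing `>` and the email pattern must not contain spaces, otherwise the substitutions silently match the wrong spans.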
I obtained the vocabulary dictionary of common words for my bag-of-words representation with:
def getVocabDict():
    vocab_txt = open("C:/Users/dynam/Desktop/Coursera AndrewNg/machine-learning-ex6/machine-learning-ex6/ex6/vocab.txt", "r")
    vocab_dict = {}
    for line in vocab_txt:
        (key, val) = line.split()  #default splitting is on whitespace
        vocab_dict.update({key: val})
    return vocab_dict
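For reference, this assumes each line of vocab.txt holds an "index word" pair, so the resulting dictionary maps index strings to words. A tiny self-contained sketch with invented file contents:

```python
import io

# stand-in for vocab.txt: assumed "index word" pairs, one per line
sample = io.StringIO("1\taa\n2\tab\n3\tabil\n")
vocab = {}
for line in sample:
    key, val = line.split()  # split() handles tabs and spaces alike
    vocab[key] = val
print(vocab)  # {'1': 'aa', '2': 'ab', '3': 'abil'}
```

Keeping the index as the key (rather than the word) matters later, because the SVM coefficients will be positional and need to be mapped back through these indices.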
After that, I convert the email into tokens using:
import re
import nltk

def email2Token(Iemail):
    #initialize the Porter stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    email = processEmail(Iemail)
    #split the email into individual words
    tokens = re.split(r"[ @$/#.\-:&*+=\[\]?!(){},'\">_<;%\n]", email)
    print("------------------------Email after splitting into individual words/tokens------------------")
    print(tokens)
    #apply the stemmer to each word
    stemmed_tokens = []
    for token in tokens:
        #use the Porter stemmer to stem the word
        stemmed_token = stemmer.stem(token)
        stemmed_tokens.append(stemmed_token)
        print("---------stemmed token-------------")
        print(stemmed_token)
    return stemmed_tokens
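One detail worth knowing about this split: `re.split` leaves empty strings wherever two delimiters are adjacent, and those empty tokens then get passed to the stemmer too. A small standalone illustration (with an invented input string) that filters them out:

```python
import re

email = "click httpaddr to claim number dollar now!!"
tokens = re.split(r"[ @$/#.\-:&*+=\[\]?!(){},'\">_<;%\n]", email)
# drop the empty strings re.split leaves between adjacent delimiters (e.g. after "now!!")
tokens = [t for t in tokens if t]
print(tokens)  # ['click', 'httpaddr', 'to', 'claim', 'number', 'dollar', 'now']
```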
Then I convert the email into a feature vector, where each element indicates whether the corresponding word from my vocabulary dictionary appears in the email:
import numpy as np

def email2featureVec(Iemail, vocab_dict):
    n = len(vocab_dict)
    emailrec = email2Token(Iemail)
    print("---------The tokens received by the feature vector converter-----------")
    print(emailrec)
    #reverse the vocab mapping so each word can be looked up by its index
    word2index = {word: int(idx) for idx, word in vocab_dict.items()}
    email_feature = np.zeros((n, 1))
    for token in emailrec:
        if token in word2index:
            #vocab.txt indices are 1-based; the feature vector is 0-based
            email_feature[word2index[token] - 1, 0] = 1
    print("--------------------------Email feature vec----------------------------------")
    print(email_feature)
    return email_feature
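A point worth checking in this construction: each word should set the slot given by its vocabulary index, not the slot given by the token's position in the email; otherwise the vector depends on word order and can overrun when an email has more tokens than the vocabulary has entries. A toy sketch with a made-up 4-word vocabulary:

```python
import numpy as np

# made-up vocabulary: index string -> word
vocab_dict = {"1": "buy", "2": "now", "3": "meeting", "4": "free"}
word2index = {w: int(i) for i, w in vocab_dict.items()}

tokens = ["free", "buy", "free", "unknownword"]
feature = np.zeros((len(vocab_dict), 1))
for t in tokens:
    if t in word2index:
        feature[word2index[t] - 1, 0] = 1  # 1-based vocab index -> 0-based row
print(feature.ravel())  # [1. 0. 0. 1.]
```

"buy" and "free" set rows 0 and 3, repeats are idempotent, and out-of-vocabulary tokens are simply skipped.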
Finally, I create a linear SVM model and train it on the training data X with its labels y:
from sklearn import svm

#creating an instance of an SVM with C = 0.1
linear_svm = svm.SVC(C=0.1, kernel="linear")
#fitting the SVM to our X matrix given labels y
linear_svm.fit(X, y.flatten())
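For a binary linear `SVC`, `coef_` has shape `(1, n_features)`, i.e. one weight per vocabulary word, which is what makes the "most important words" question answerable. A toy illustration with made-up data:

```python
import numpy as np
from sklearn import svm

# toy data: 6 "emails" over a 4-word vocabulary, labels 1 = spam
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = svm.SVC(C=0.1, kernel="linear")
clf.fit(X, y)
print(clf.coef_.shape)  # (1, 4): one weight per vocabulary word
```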
Now I would like to know how to get the 15 most important words for classifying spam. I suspect I have to use the coefficients to find them, but my coefficients look like:
for i in linear_svm.coef_:
for j in i:
print(j)
0.007932077307221794
0.015633235616866917
0.055464916277558125
-0.013416103446075411
-0.06619756700850743
0.03659516600411697
0.18337597875664702
-0.02488628335729145 and so on ........
I tried using:
sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
for i in sorted_arr:
print(vocab_dict[(i)])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-9027571acfa4> in <module>()
1 sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
2 for i in sorted_arr:
----> 3 print(vocab_dict[(i)])
KeyError: 0.5006137361746403
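The KeyError arises because `np.sort` returns the coefficient *values* themselves, while `vocab_dict` is keyed by vocabulary *index* strings. `np.argsort` keeps the column indices instead, so they can be mapped back to words. A minimal sketch with a made-up 5-word vocabulary (for the real model, `coef` would be `linear_svm.coef_` and the slice `[:15]`):

```python
import numpy as np

# toy stand-ins for the trained model's coefficients and the vocabulary
coef = np.array([[0.3, -0.2, 0.9, 0.1, 0.7]])
vocab_dict = {"1": "free", "2": "hello", "3": "winner", "4": "meeting", "5": "dollar"}

# np.argsort returns indices sorted by value; np.sort discards them
top_idx = np.argsort(coef.flatten())[::-1][:3]  # the 3 most positive weights
# vocab indices in vocab.txt are 1-based, so shift by one before the lookup
top_words = [vocab_dict[str(i + 1)] for i in top_idx]
print(top_words)  # ['winner', 'dollar', 'free']
```

The most positive coefficients push the decision toward the spam class (assuming spam is labeled 1), so these are the words that most strongly indicate spam.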