Problem description
I have trained a linear support vector machine (SVM) to classify emails as spam or non-spam based on their words. I first convert an email into processed text with this code:
import re

def processEmail(email):
    email = email.lower()
    #replace HTML tags like <html> with a space
    email = re.sub(r"<[^<>]+>", " ", email)
    #replace numbers with the string "number"
    email = re.sub(r"[0-9]+", "number", email)
    #replace anything that starts with http:// or https:// with "httpaddr"
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    #replace strings with @ in the middle with "emailaddr"
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    #replace $ with "dollar"
    email = re.sub(r"[$]+", "dollar", email)
    #remove >, , and ?
    email = re.sub(r"[>,?]", "", email)
    print("--------------------------------Pre-processed Email------------------------")
    print(email)
    return email
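As a standalone sanity check, substitutions of this kind can be exercised on a made-up sample email (the sample text below is invented for illustration):

```python
import re

# made-up sample email to exercise the substitutions
sample = "Visit http://deal.com NOW! Only $99, contact me@x.com <br>"
s = sample.lower()
s = re.sub(r"<[^<>]+>", " ", s)                       # strip HTML tags
s = re.sub(r"[0-9]+", "number", s)                    # normalize digits
s = re.sub(r"(http|https)://[^\s]*", "httpaddr", s)   # normalize URLs
s = re.sub(r"[^\s]+@[^\s]+", "emailaddr", s)          # normalize email addresses
s = re.sub(r"[$]+", "dollar", s)                      # normalize dollar signs
print(s)
```

Note that the tag pattern needs the closing `>` and the email pattern must not contain spaces, otherwise the substitutions silently match the wrong spans.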
I obtained the vocabulary dictionary of common words for my bag-of-words representation with:
def getVocabDict():
    vocab_txt = open("C:/Users/dynam/Desktop/Coursera AndrewNg/machine-learning-ex6/machine-learning-ex6/ex6/vocab.txt", "r")
    vocab_dict = {}
    for line in vocab_txt:
        (key, val) = line.split()  #default splitting is on whitespace
        vocab_dict.update({key: val})
    return vocab_dict
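For reference, this assumes each line of vocab.txt holds an "index word" pair, so the resulting dictionary maps index strings to words. A tiny self-contained sketch with invented file contents:

```python
import io

# stand-in for vocab.txt: assumed "index word" pairs, one per line
sample = io.StringIO("1\taa\n2\tab\n3\tabil\n")
vocab = {}
for line in sample:
    key, val = line.split()  # split() handles tabs and spaces alike
    vocab[key] = val
print(vocab)  # {'1': 'aa', '2': 'ab', '3': 'abil'}
```

Keeping the index as the key (rather than the word) matters later, because the SVM coefficients will be positional and need to be mapped back through these indices.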
After that, I convert the email into tokens using:
import re
import nltk

def email2Token(Iemail):
    #initialize the Porter stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    email = processEmail(Iemail)
    #split the email into individual words
    tokens = re.split(r"[ @$/#.\-:&*+=\[\]?!(){},'\">_<;%\n]", email)
    print("------------------------Email after splitting into individual words/tokens------------------")
    print(tokens)
    #apply the stemmer to each word
    stemmed_tokens = []
    for token in tokens:
        #use the Porter stemmer to stem the word
        stemmed_token = stemmer.stem(token)
        stemmed_tokens.append(stemmed_token)
        print("---------stemmed token-------------")
        print(stemmed_token)
    return stemmed_tokens
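One detail worth knowing about this split: `re.split` leaves empty strings wherever two delimiters are adjacent, and those empty tokens then get passed to the stemmer too. A small standalone illustration (with an invented input string) that filters them out:

```python
import re

email = "click httpaddr to claim number dollar now!!"
tokens = re.split(r"[ @$/#.\-:&*+=\[\]?!(){},'\">_<;%\n]", email)
# drop the empty strings re.split leaves between adjacent delimiters (e.g. after "now!!")
tokens = [t for t in tokens if t]
print(tokens)  # ['click', 'httpaddr', 'to', 'claim', 'number', 'dollar', 'now']
```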
Then I convert the email into a feature vector, where each element indicates whether the corresponding word from my vocabulary dictionary appears in the email:
import numpy as np

def email2featureVec(Iemail, vocab_dict):
    n = len(vocab_dict)
    emailrec = email2Token(Iemail)
    print("---------The tokens received by the feature vector converter-----------")
    print(emailrec)
    #reverse the vocab mapping so each word can be looked up by its index
    word2index = {word: int(idx) for idx, word in vocab_dict.items()}
    email_feature = np.zeros((n, 1))
    for token in emailrec:
        if token in word2index:
            #vocab.txt indices are 1-based; the feature vector is 0-based
            email_feature[word2index[token] - 1, 0] = 1
    print("--------------------------Email feature vec----------------------------------")
    print(email_feature)
    return email_feature
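A point worth checking in this construction: each word should set the slot given by its vocabulary index, not the slot given by the token's position in the email; otherwise the vector depends on word order and can overrun when an email has more tokens than the vocabulary has entries. A toy sketch with a made-up 4-word vocabulary:

```python
import numpy as np

# made-up vocabulary: index string -> word
vocab_dict = {"1": "buy", "2": "now", "3": "meeting", "4": "free"}
word2index = {w: int(i) for i, w in vocab_dict.items()}

tokens = ["free", "buy", "free", "unknownword"]
feature = np.zeros((len(vocab_dict), 1))
for t in tokens:
    if t in word2index:
        feature[word2index[t] - 1, 0] = 1  # 1-based vocab index -> 0-based row
print(feature.ravel())  # [1. 0. 0. 1.]
```

"buy" and "free" set rows 0 and 3, repeats are idempotent, and out-of-vocabulary tokens are simply skipped.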
Finally, I create a linear SVM model and train it on the training data X with its labels y:
from sklearn import svm

#creating an instance of an SVM with C = 0.1
linear_svm = svm.SVC(C=0.1, kernel="linear")
#fitting the SVM to our X matrix given labels y
linear_svm.fit(X, y.flatten())
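For a binary linear `SVC`, `coef_` has shape `(1, n_features)`, i.e. one weight per vocabulary word, which is what makes the "most important words" question answerable. A toy illustration with made-up data:

```python
import numpy as np
from sklearn import svm

# toy data: 6 "emails" over a 4-word vocabulary, labels 1 = spam
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = svm.SVC(C=0.1, kernel="linear")
clf.fit(X, y)
print(clf.coef_.shape)  # (1, 4): one weight per vocabulary word
```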
Now I would like to know how to get the 15 most important words for classifying spam. I suspect I have to use the coefficients to find them, but my coefficients look like:
for i in linear_svm.coef_:
for j in i:
print(j)
0.007932077307221794
0.015633235616866917
0.055464916277558125
-0.013416103446075411
-0.06619756700850743
0.03659516600411697
0.18337597875664702
-0.02488628335729145 and so on ........
I tried using:
sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
for i in sorted_arr:
print(vocab_dict[(i)])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-9027571acfa4> in <module>()
1 sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
2 for i in sorted_arr:
----> 3 print(vocab_dict[(i)])
KeyError: 0.5006137361746403
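The KeyError arises because `np.sort` returns the coefficient *values* themselves, while `vocab_dict` is keyed by vocabulary *index* strings. `np.argsort` keeps the column indices instead, so they can be mapped back to words. A minimal sketch with a made-up 5-word vocabulary (for the real model, `coef` would be `linear_svm.coef_` and the slice `[:15]`):

```python
import numpy as np

# toy stand-ins for the trained model's coefficients and the vocabulary
coef = np.array([[0.3, -0.2, 0.9, 0.1, 0.7]])
vocab_dict = {"1": "free", "2": "hello", "3": "winner", "4": "meeting", "5": "dollar"}

# np.argsort returns indices sorted by value; np.sort discards them
top_idx = np.argsort(coef.flatten())[::-1][:3]  # the 3 most positive weights
# vocab indices in vocab.txt are 1-based, so shift by one before the lookup
top_words = [vocab_dict[str(i + 1)] for i in top_idx]
print(top_words)  # ['winner', 'dollar', 'free']
```

The most positive coefficients push the decision toward the spam class (assuming spam is labeled 1), so these are the words that most strongly indicate spam.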