How to remove punctuation and irrelevant words with stopwords in text mining

Problem description

The libraries I am using are:

      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk

I have the following DataFrame:

     df = pd.DataFrame({
         'Send': [
             'Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
             'The Golgi apparatus is responsible for transporting,modifying,and packaging proteins',
             'Non-foliated Metamorphic rocks do not have a platy or sheet-like structure.',
             'The process of Metamorphism does not melt the rocks.',
         ],
         'Class': ['biology', 'biology', 'geography', 'geography'],
     })

     print(df)

                              Send                           Class
         Golgi body,membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated Metamorphic rocks do not have a p...  geography
         The process of Metamorphism does not melt the ...  geography

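I want to write a function that cleans the data in the 'Send' column. I want to:

  1. Remove punctuation;
  2. Remove stopwords;
  3. Return a new DataFrame whose 'Send' column contains only the "clean" words.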

I tried to write the following function:

      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, the result is not quite what I expected. When I run:

        Text_Process(df['Send'])

The output is:

       ['Golgi','body,','membrane-bound','organelle','eukaryotic','cells','(cells','clearly','defined','nuclei).The','Golgi','apparatus','responsible','transporting,','modifying,','packaging','proteinsNon-foliated','Metamorphic','rocks','platy','sheet-like','structure.The','process','Metamorphism','melt','rocks.']

I would like the output to be a DataFrame with the modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying packaging proteins',
                                   'Non foliated Metamorphic rocks platy sheet like structure',
                                   'process Metamorphism melt rocks'],
                          'Class': ['biology','biology','geography','geography']})

In other words, the result should be a DataFrame whose 'Send' column has no punctuation and no irrelevant words.

Thank you.

Solution
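
The attempt in the question fails because `Text_Process` receives the whole column rather than a single string. Iterating over a column yields entire cell values, so the `char not in string.punctuation` test never removes anything, and `''.join` then glues the rows together (hence tokens such as `nuclei).The`). Here is a minimal sketch of that behaviour, using a plain list as a stand-in for `df['Send']` (an assumption to keep the example dependency-free; iterating a pandas Series yields its values in the same way):

```python
import string

# Stand-in for df['Send']: iterating a pandas Series yields its cell values,
# just as iterating this list does.
rows = ['The process of Metamorphism does not melt the rocks.',
        'Golgi body,membrane-bound organelle']

# Each `item` is a whole sentence, which is never a substring of
# string.punctuation, so nothing is filtered out.
kept = [item for item in rows if item not in string.punctuation]
joined = ''.join(kept)  # rows are glued together without a separator

# Applied to a single string, the same comprehension iterates characters.
per_char = ''.join(ch for ch in rows[0] if ch not in string.punctuation)

print(joined)
print(per_char)
```

So the function has to be applied to each cell separately rather than to the column as a whole.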

Here is a script that cleans the column. Note that you may want to add more words to the stopword set to suit your needs.

import pandas as pd
import string
import re
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

df = pd.DataFrame({
    'Send': [
        'Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
        'The Golgi apparatus is responsible for transporting,modifying,and packaging proteins',
        'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
        'The process of metamorphism does not melt the rocks.',
    ],
    'Class': ['biology', 'biology', 'geography', 'geography'],
})

# Translation table that deletes any punctuation left after splitting
# (for example underscores, which \W does not match)
table = str.maketrans('', '', string.punctuation)

# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(mess):
    # Split on runs of non-word characters
    words = re.split(r'\W+', mess)
    nopunc = [w.translate(table) for w in words]
    # Drop empty tokens and stopwords, then join back into one string
    return ' '.join(word for word in nopunc if word and word.lower() not in stop_words)

# Clean each cell of the 'Send' column
df['Send'] = df['Send'].apply(text_process)

print(df)

Output:

                                                                                 Send      Class
0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
1               Golgi apparatus responsible transporting modifying packaging proteins    biology
2                          Non foliated metamorphic rocks platy sheet like structure   geography
3                                                    process metamorphism melt rocks   geography
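
As for adding more words to the stopword set, the following is one possible sketch. To run without the NLTK corpus it uses a small hard-coded set, `BASE_STOPWORDS`, as a hypothetical stand-in for `stopwords.words('english')`. `EXTRA_STOPWORDS`, `STOP_SET`, and `clean` are also illustrative names, not part of the script above.

```python
import re

# Hypothetical base set standing in for stopwords.words('english'),
# so the sketch runs without the NLTK corpus installed.
BASE_STOPWORDS = {'the', 'is', 'for', 'and', 'of', 'do', 'does',
                  'not', 'a', 'or', 'have', 'with'}

# Illustrative domain-specific additions.
EXTRA_STOPWORDS = {'process'}

# Build the combined set once; membership tests on a set are fast.
STOP_SET = BASE_STOPWORDS | EXTRA_STOPWORDS

def clean(text, stop_set=STOP_SET):
    """Split on non-word characters, then drop empty tokens and stopwords."""
    tokens = re.split(r'\W+', text)  # may yield '' at the start or end
    return ' '.join(t for t in tokens if t and t.lower() not in stop_set)

print(clean('The process of metamorphism does not melt the rocks.'))
# → metamorphism melt rocks
```

With NLTK available, the base set could instead be built from `set(stopwords.words('english'))` and extended in the same way.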