Stemming the words in a column

Problem description

I need to stem the words in the following data:

   D            Words
0   2020-06-19  excellent
1   2020-06-19  make
2   2020-06-19  many
3   2020-06-19  game
4   2020-06-19  play
... ... ...
3042607 2020-07-28  praised
3042608 2020-07-28  playing
3042609 2020-07-28  made
3042610 2020-07-28  terms
3042611 2020-07-28  bad

I tried the following with PorterStemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
for w in df.Words:
    print(w, " : ", ps.stem(w))

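But I did not get the output I wanted (the stems). I need to keep the date (D) information, so in the end I should have a similar dataset, only with stems in place of the original words. I would like to run the stemmer over the Words column to get something like the following: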

   D            Words
0   2020-06-19  excellent
1   2020-06-19  make
2   2020-06-19  many
3   2020-06-19  game
4   2020-06-19  play
... ... ...
3042607 2020-07-28  praise
3042608 2020-07-28  play
3042609 2020-07-28  make
3042610 2020-07-28  terms
3042611 2020-07-28  bad

Any hints would be welcome.

Solution

When I run your code

ps = PorterStemmer() 
for w in df.Words: 
    print(w," : ",ps.stem(w))

it correctly prints the word : stem pairs (at least according to PorterStemmer).
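The caveat "according to PorterStemmer" matters here. The Porter algorithm strips suffixes by rule, so a stem does not have to be a dictionary word, and an irregular form such as "made" is not mapped to "make". A minimal sketch to check this on a few words from the question (the `words` list and the `stems` dict are illustrative names, not part of the original code):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# A few words taken from the question's sample data.
words = ["praised", "playing", "made", "terms", "bad"]

# Map each word to its Porter stem.
stems = {w: ps.stem(w) for w in words}

for word, stem in stems.items():
    print(word, ":", stem)
```

If the goal is really dictionary forms such as "make" for "made", a lemmatizer (for example NLTK's WordNetLemmatizer, which needs the WordNet corpus downloaded) is probably closer to what the expected output asks for than a stemmer.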

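If you want the stems as a column in your dataframe, you need to create a new column by applying the ps.stem function over the whole Words column, like this: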

df['stem'] = df.Words.apply(ps.stem)

This turns your dataframe into the following:

    D           Words     stem
0   2020-06-19  excellent excel
1   2020-06-19  make      make
2   2020-06-19  many      mani
3   2020-06-19  game      game
4   2020-06-19  play      play

So now you can use the stem column for any further analysis without dropping the rest of the data.
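With roughly three million rows, many words will repeat, so it may be cheaper to stem each distinct word once and then look the result up per row. The following is a sketch under that assumption, not a measured optimization. The small DataFrame is a stand-in for the question's data, and `stem_by_word` is a hypothetical helper name:

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Small stand-in for the question's DataFrame (same column names: D and Words).
df = pd.DataFrame({
    "D": ["2020-06-19"] * 3 + ["2020-07-28"] * 3,
    "Words": ["excellent", "many", "play", "playing", "play", "many"],
})

ps = PorterStemmer()

# Stem each distinct word once, then look the stem up for every row.
stem_by_word = {w: ps.stem(w) for w in df["Words"].unique()}
df["stem"] = df["Words"].map(stem_by_word)

print(df)
```

The D and Words columns are left untouched, so the date information stays aligned with each stem. To overwrite the words instead of adding a column, assign the result back to `df["Words"]`.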