Problem description
The libraries I am using are:
import pandas as pd
import string
from nltk.corpus import stopwords
import nltk
I have the following DataFrame:
df = pd.DataFrame({'Send': [
    'Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
    'The Golgi apparatus is responsible for transporting,modifying,and packaging proteins',
    'Non-foliated Metamorphic rocks do not have a platy or sheet-like structure.',
    'The process of Metamorphism does not melt the rocks.'],
    'Class': ['biology', 'biology', 'geography', 'geography']})
print(df)
Send Class
Golgi body,membrane-bound organelle of eukary... biology
The Golgi apparatus is responsible for transpo... biology
Non-foliated Metamorphic rocks do not have a p... geography
The process of Metamorphism does not melt the ... geography
I am trying to develop the following function:
def Text_Process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
However, the return value is not quite what I expect. When I run:
Text_Process(df['Send'])
The output is:
['Golgi','body,','membrane-bound','organelle','eukaryotic','cells','(cells','clearly','defined','nuclei).The','Golgi','apparatus','responsible','transporting,','modifying,','packaging','proteinsNon-foliated','Metamorphic','rocks','platy','sheet-like','structure.The','process','Metamorphism','melt','rocks.']
df = pd.DataFrame({'Send': [
    'Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
    'Golgi apparatus responsible transporting modifying packaging proteins',
    'Non foliated Metamorphic rocks platy sheet like structure',
    'process Metamorphism melt rocks'],
    'Class': ['biology', 'biology', 'geography', 'geography']})
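I want the output to be a DataFrame like this, with the Send column stripped of punctuation and irrelevant words.
Thanks.
Solution
Here is a script that cleans the column. Note that you may want to add more words to the stop word set to meet your requirements.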
import pandas as pd
import string
import re
from nltk.corpus import stopwords
df = pd.DataFrame(
{'Send': ['Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).','The Golgi apparatus is responsible for transporting,modifying,and packaging proteins','Non-foliated metamorphic rocks do not have a platy or sheet-like structure.','The process of metamorphism does not melt the rocks.'],'Class': ['biology','biology','geography','geography']})
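# Translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(mess):
    # Split on runs of non-word characters and drop the empty strings
    # that re.split leaves at the start or end of a sentence
    words = [w for w in re.split(r'\W+', mess) if w]
    nopunc = [w.translate(table) for w in words]
    return ' '.join(word for word in nopunc if word and word.lower() not in stop_words)

# Apply the function to each string in the column, not to the whole Series
df['Send'] = df['Send'].apply(text_process)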
print(df)
Output:
Send Class
0 Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei biology
1 Golgi apparatus responsible transporting modifying packaging proteins biology
2 Non foliated metamorphic rocks platy sheet like structure geography
3 process metamorphism melt rocks geography
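The note above about adding more words to the stop word set could look roughly like the sketch below. To keep it runnable without downloading NLTK data, BASE_STOP_WORDS is a small stand-in for set(stopwords.words('english')). EXTRA_STOP_WORDS and clean_text are hypothetical names, and the extra words are only an illustration of domain-specific terms you might want to drop.

```python
import re

# Small stand-in for set(stopwords.words('english')); the real NLTK list is much longer.
BASE_STOP_WORDS = {'the', 'of', 'is', 'for', 'and', 'with',
                   'do', 'does', 'not', 'have', 'a', 'or'}

# Hypothetical additions: words that are irrelevant for your particular data.
EXTRA_STOP_WORDS = {'process', 'responsible'}


def clean_text(text, stop_words):
    """Split on non-word characters, then drop empty tokens and stop words."""
    tokens = [t for t in re.split(r'\W+', text) if t]
    return ' '.join(t for t in tokens if t.lower() not in stop_words)


# Set union gives the extended stop word set without modifying the base set.
stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS

sentence = 'The process of metamorphism does not melt the rocks.'
print(clean_text(sentence, BASE_STOP_WORDS))  # process metamorphism melt rocks
print(clean_text(sentence, stop_words))       # metamorphism melt rocks
```

With NLTK installed and the corpus downloaded, the same union should work on the real list, for example set(stopwords.words('english')) | EXTRA_STOP_WORDS.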