Problem description
The code below is what I currently have. It works, but it turns words like "didn't" into "didn" and "t". I would like it either to remove the apostrophe so the word comes out as "didnt", or to keep it whole as "didn't", although that might cause problems later with TfidfVectorizer?
Is there a way to achieve this without too much hassle?
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize a single review string"""
    return ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word))
                     for word in word_tokenize(review)])

review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)
Solution
You can replace the "'" character with the empty string "" before lemmatizing, like this:
>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'","")
>>> x
'didnt cant wont'
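If the reviews also contain curly (Unicode) apostrophes, a plain replace("'", "") misses them. A minimal sketch of a normalization helper, assuming the name strip_apostrophes (not part of the original code):

```python
import re

def strip_apostrophes(text):
    """Remove straight (') and curly (U+2019) apostrophes before tokenizing."""
    return re.sub(r"['\u2019]", "", text)

print(strip_apostrophes("didn't can\u2019t won't"))  # didnt cant wont
```

Apply this to each review string before calling word_tokenize, so "didn't" reaches the tokenizer as "didnt" and is no longer split.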
Alternatively, you can use TweetTokenizer instead of word_tokenize:
from nltk.tokenize import TweetTokenizer
text = "didn't can't won't how are you"
tokenizer = TweetTokenizer()
tokenizer.tokenize(text)
# output:
["didn't", "can't", "won't", 'how', 'are', 'you']