Is there any way to stop my WordNetLemmatizer from lemmatizing contractions such as "didn't" or "can't"?

Problem Description

The code below is what I currently have. It works, but it turns words like "didn't" into "didn" and "t". I would prefer it to strip the apostrophe so the word comes out as "didnt", or simply keep it as "didn't", although that might cause problems later with TfidfVectorizer?

Is there any way to achieve this without too much hassle?

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemmatize_review(review):
    """Lemmatize a single review string"""
    lemmatized_review = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(review)])
    return lemmatized_review

review_data['Lemmatized_Review'] = review_data['Review'].apply(lemmatize_review)

Solution

You can replace the "'" character with an empty string before you go on to lemmatize, like this:

>>> word = "didn't can't won't"
>>> word
"didn't can't won't"
>>> x = word.replace("'","")
>>> x
'didnt cant wont'
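
If you want to fold this into the lemmatize_review function from the question, a minimal sketch could look like the following (it assumes the lemmatizer, get_wordnet_pos and word_tokenize from your code are already defined and imported):

def lemmatize_review(review):
    """Lemmatize a single review string, stripping apostrophes first"""
    # Remove apostrophes so "didn't" becomes "didnt" before tokenization,
    # which stops the tokenizer from splitting the contraction apart
    review = review.replace("'", "")
    return ' '.join(lemmatizer.lemmatize(word, get_wordnet_pos(word))
                    for word in word_tokenize(review))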

Alternatively, you can use TweetTokenizer instead of word_tokenize:

from nltk.tokenize import TweetTokenizer

text = "didn't can't won't how are you"
tokenizer = TweetTokenizer()

tokenizer.tokenize(text)
# output:
["didn't", "can't", "won't", 'how', 'are', 'you']