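Problem description
I have tested this code on a single sentence, and I would like to adapt it so that I can lemmatize an entire column, where each row contains words without punctuation, for example: deportivas calcetin hombres deportivas shoes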
import nltk
import pandas as pd
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# WordNet data for the lemmatizer; word_tokenize and pos_tag also need their
# own tokenizer and tagger resources (exact names depend on the NLTK version)
nltk.download('wordnet')

df = pd.read_excel(r'C:\Test2\test.xlsx')

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
sentence = 'FINAL_KEYWORDS'  # placeholder, overwritten below

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    # Fall back to NOUN for tags that WordNet does not cover
    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
Let's assume the column is df['keywords']. Could you help me use a lambda function to lemmatize the whole column?
Thanks a lot
Solution
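Here you go:
- Use apply to run a function on every sentence in the column.
- Pass it a lambda expression that takes sentence as input and applies the function you wrote, in the same way as in your print statement.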
As lemmatized keywords (each row becomes a list of words):
# Lemmatize a Sentence with the appropriate POS tag
df['keywords'] = df['keywords'].apply(lambda sentence: [lemmatizer.lemmatize(w,get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
As a lemmatized sentence (keywords joined with ' '):
# Lemmatize a Sentence with the appropriate POS tag
df['keywords'] = df['keywords'].apply(lambda sentence: ' '.join([lemmatizer.lemmatize(w,get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)]))
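One thing to watch for when the data comes from Excel: empty cells usually arrive as NaN, and a tokenizer will fail on them. Below is a minimal sketch of the same apply pattern with a guard for non-string cells. The helper name lemmatize_column, its parameters, and the toy dictionary lemmatizer are all hypothetical and only there so the snippet runs without NLTK data; in your case you would pass lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w)) as the word function and nltk.word_tokenize as the tokenizer.

```python
import pandas as pd


def lemmatize_column(series, lemmatize_word, tokenize=str.split, join=True):
    """Apply a per-word lemmatizer to every cell of a text column.

    Hypothetical helper: non-string cells (e.g. NaN from empty Excel
    cells) are returned unchanged instead of being tokenized.
    """
    def lemmatize_cell(cell):
        if not isinstance(cell, str):
            return cell
        lemmas = [lemmatize_word(w) for w in tokenize(cell)]
        # join=True gives one string per row, join=False a list of words
        return " ".join(lemmas) if join else lemmas

    return series.apply(lemmatize_cell)


# Toy stand-in for the WordNet lemmatizer (assumption, for illustration only)
TOY_LEMMAS = {"bats": "bat", "feet": "foot", "shoes": "shoe"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


df = pd.DataFrame({"keywords": ["striped bats feet", None, "deportivas shoes"]})
df["lemmas"] = lemmatize_column(df["keywords"], toy_lemmatize)
print(df)
```

Writing the result to a new column (lemmas here) rather than overwriting keywords is a design choice that keeps the original text around for comparison; assigning back to df['keywords'] works the same way.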