Problem description
I want to write POS rules with a regular expression tagger to fix the tagging shown below. My code:
import nltk
from nltk import word_tokenize, UnigramTagger
from nltk.corpus import treebank
# download missing packages
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('treebank')
sent1 = "My aunt's can opener can open a drum"
tokens1 = word_tokenize(sent1)
tag1 = nltk.pos_tag(tokens1)
print(tokens1)
print(tag1)
>> output : ['My', 'aunt', "'s", 'can', 'opener', 'can', 'open', 'a', 'drum']
[('My', 'PRP$'), ('aunt', 'NN'), ("'s", 'POS'), ('can', 'MD'), ('opener', 'VB'), ..., ('a', 'DT'), ('drum', 'NN')]
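# NOTE: the tags of the first two patterns were lost in the source; 'NN' is assumed here.
patterns = [(r'\w*er\b', 'NN'), (r'.*', 'NN'), (r'(?=<\'s).*', 'NN')]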
default_tagger = nltk.RegexpTagger(patterns)
train_sentences = treebank.tagged_sents()
tagger1 = UnigramTagger(train_sentences,backoff=default_tagger)
tagger1_2=nltk.BigramTagger(train_sentences,backoff=tagger1)
tagger1_3=nltk.TrigramTagger(train_sentences,backoff=tagger1_2)
tagged1_true = tagger1_3.tag(tokens1)
print(tagged1_true)
>> output : [('My','NN')]
I need to fix the first "can" so that it is tagged "NN".
Solution
No working solution to this problem has been posted yet.
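One possible direction, offered as a sketch rather than a confirmed fix: as far as I know, nltk.RegexpTagger matches each pattern against a single token string, so a lookbehind such as (?<='s) has no previous token to look at and cannot encode "the word after 's". A context-dependent rule can instead be applied as a post-processing step on the tagger's output. The helper name retag_after_possessive and the sample tags for the second "can" and "open" are assumptions made for illustration; they are not part of NLTK or of the original question's output.

```python
import re

# A lookbehind for 's can never succeed when the regex is matched
# against the bare token "can": there is no text before position 0.
lookbehind = re.compile(r"(?<='s).*")
lookbehind_result = lookbehind.match("can")
print(lookbehind_result)


def retag_after_possessive(tagged, word="can", new_tag="NN"):
    """Hypothetical helper (not an NLTK function).

    Returns a copy of `tagged` in which `word` is given `new_tag`
    whenever the token directly before it carries the tag 'POS'
    (the possessive marker "'s" in the Penn Treebank tag set).
    """
    fixed = []
    for i, (token, tag) in enumerate(tagged):
        # Look at the tag of the previous token, if there is one.
        prev_tag = tagged[i - 1][1] if i > 0 else None
        if token.lower() == word and prev_tag == "POS":
            tag = new_tag
        fixed.append((token, tag))
    return fixed


# Sample tagger output. The tags for the second "can" and for "open"
# are assumed here, because they are not legible in the question.
tagged = [
    ("My", "PRP$"), ("aunt", "NN"), ("'s", "POS"),
    ("can", "MD"), ("opener", "VB"),
    ("can", "MD"), ("open", "VB"),
    ("a", "DT"), ("drum", "NN"),
]

fixed = retag_after_possessive(tagged)
print(fixed)
```

In this sketch only the "can" that directly follows the possessive marker is retagged; the second "can" keeps its modal tag because the token before it is not tagged POS. The same function could be called on the result of tagger1_3.tag(tokens1) or nltk.pos_tag(tokens1). Fixing "opener" would need a separate rule.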