Using NLTK

Problem description

I have a corpus stored in .txt files, in the following format:

Mike NNP B-PERSON
Noah NNP I-PERSON
eats VB O
donuts NN O
Sarah NNP B-PERSON
larsson NNP I-PERSON
comes VB O
from IN O
Stockholm NN B-GPE

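I want to read the file to train a POS tagger (using only the words and their POS tags), the same way I would read a bracketed file that is already in tree format. I tried looping over the corpus to convert it into a string of this form: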

(NNP Mike) (NNP Noah) (VB eats) (NN donuts) (NNP Sarah) (NNP larsson) (VB comes) (IN from) (NN Stockholm)

However, when I call the tagged_sents() function on it, I get this error:

'str' object has no attribute 'tagged_sents'

How can I read the corpus correctly? Any suggestions? Thanks.

Solution

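Try using ConllChunkCorpusReader. I believe it would look something like the snippet below, assuming your corpus text files are in the root directory of your project. Update chunk_types to match your data.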

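from nltk.corpus.reader import ConllChunkCorpusReader
# Chunk types used in the corpus (B-PERSON / I-PERSON, B-GPE)
chunk_types = ('PERSON', 'GPE')
# fileids is a regular expression, not a shell glob, so '*.txt' would not compile
corpusReader = ConllChunkCorpusReader('./', r'.*\.txt', chunk_types)
# Prints a list of (word, POS tag, IOB tag) triples
print(corpusReader.iob_words())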

This gives you a list of (word, POS tag, IOB tag) tuples, which you can iterate over to collect the POS tags:

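# iob_words() yields flat triples; iob_sents() yields whole sentences and cannot be unpacked this way
pos_tags = [postag for token, postag, label in corpusReader.iob_words()]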
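Since the goal is to train a POS tagger, it may be simpler to ask the reader for (word, POS tag) pairs directly with tagged_sents(), which is the method the question tried to call on a plain string. The following is only a minimal sketch under a few assumptions: the sample text, the temporary corpus.txt file, and the load_tagged_sents helper are made up for illustration, and the blank line between the two sentences is added here because the reader treats blank lines as sentence boundaries (the excerpt in the question shows none, so it would be read as one long sentence). UnigramTagger is just one possible tagger to train.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader
from nltk.tag import UnigramTagger

# Sample data in the question's three-column format. The blank line between
# the two sentences is an assumption: the reader splits sentences on blank
# lines, and the excerpt in the question does not show any.
SAMPLE = """\
Mike NNP B-PERSON
Noah NNP I-PERSON
eats VB O
donuts NN O

Sarah NNP B-PERSON
larsson NNP I-PERSON
comes VB O
from IN O
Stockholm NN B-GPE
"""


def load_tagged_sents(text):
    """Hypothetical helper: write text to a temporary corpus file and
    return its sentences as lists of (word, POS tag) pairs."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "corpus.txt"), "w", encoding="utf8") as f:
            f.write(text)
        reader = ConllChunkCorpusReader(root, r".*\.txt", ("PERSON", "GPE"))
        # Materialize the lazy corpus view before the directory is removed.
        return [list(sent) for sent in reader.tagged_sents()]


train_sents = load_tagged_sents(SAMPLE)

# Train a simple tagger on the (word, POS tag) sentences.
tagger = UnigramTagger(train_sents)

print(train_sents[0])
print(tagger.tag(["Sarah", "eats", "donuts"]))
```

With a real corpus you would point the reader at your own directory instead of a temporary one, and words the tagger has never seen will come back tagged as None unless you give it a backoff tagger.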

Reference: https://www.geeksforgeeks.org/nlp-customization-using-tagged-corpus-reader/

corpus nltk pos-tagger