Unable to convert ORTH to a string in spaCy

Problem description

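I can get the frequency of each word, but only as numeric IDs, not as the words themselves. Can you suggest how I can map the IDs back to the words?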

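import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')
doc = nlp("apple is the man good orange apple orange banana")

# count_by is a method of the Doc object; it returns {hash ID: count}
print(doc.count_by(ORTH))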

{8566208034543834098: 2, 3411606890003347522: 1, 7425985699627899538: 1, 3104811030673030468: 1, 5711639017775284443: 1, 2208928596161743350: 2, 2525716904149915114: 1}

For example, how do I map 8566208034543834098 to "apple"?

Solution

Use Counter instead of count_by to get the token/word counts:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

doc = nlp("apple is the man good orange apple orange banana")
word_freq = Counter([tok.text.lower() for tok in doc])
print(word_freq)

Output:

Counter({'apple': 2, 'orange': 2, 'is': 1, 'the': 1, 'man': 1, 'good': 1, 'banana': 1})

To convert an ORTH ID to a string:

print(doc.vocab[8566208034543834098].text)

Output:

apple
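The lookup works in both directions through the vocabulary's string store. The following is a minimal sketch, assuming a blank English pipeline (spacy.blank("en")) in place of en_core_web_sm, since only the tokenizer and the vocab are needed here; the variable names are illustrative.

```python
import spacy

# Assumption: a blank pipeline is enough, because only tokenization
# and the shared vocabulary are involved in the lookup.
nlp = spacy.blank("en")
doc = nlp("apple is the man good orange apple orange banana")

first = doc[0]
orth_id = first.orth                        # integer hash of the token's exact text
text_from_id = nlp.vocab.strings[orth_id]   # hash -> string
id_from_text = nlp.vocab.strings["apple"]   # string -> hash

print(orth_id, text_from_id, id_from_text == orth_id)
```

Note that a hash can only be resolved to a string if that string has already been seen by the same vocab; looking up an unknown hash raises an error.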

You can also keep using count_by and map the IDs back to strings, as follows:

from spacy.attrs import ORTH

doc = nlp("apple is the man good orange apple orange banana")
counts = doc.count_by(ORTH)
# Look up each hash ID in the string store to recover the word
print({nlp.vocab.strings[word_id]: count for word_id, count in counts.items()})

Output:

{'apple': 2, 'is': 1, 'the': 1, 'man': 1, 'good': 1, 'orange': 2, 'banana': 1}
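The two approaches above can be combined into one small helper. This is only a sketch: count_words is a hypothetical name, not part of spaCy, and it assumes a blank English pipeline. Passing LOWER instead of ORTH gives case-insensitive counts, which is roughly what the Counter version achieves with tok.text.lower().

```python
from collections import Counter

import spacy
from spacy.attrs import LOWER, ORTH


def count_words(doc, attr=ORTH):
    """Hypothetical helper: count_by results keyed by strings, not hash IDs."""
    strings = doc.vocab.strings
    return Counter({strings[key]: n for key, n in doc.count_by(attr).items()})


# Assumption: a blank pipeline suffices, since only tokenization is needed.
nlp = spacy.blank("en")
doc = nlp("Apple is the man good orange apple orange banana")

exact = count_words(doc)            # "Apple" and "apple" are counted separately
folded = count_words(doc, LOWER)    # counts on the lowercased form
print(exact)
print(folded.most_common(2))
```

Returning a Counter keeps conveniences such as most_common available while still relying on count_by for the counting itself.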
