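Unable to convert ORTH to a string in spaCy

Problem description

I have the frequency of each word, but only as numeric IDs, not the words themselves. Can you suggest how to map the IDs back to the words?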

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')
doc = nlp("apple is the man good orange apple orange banana")

print(doc.count_by(ORTH))

{8566208034543834098: 2, 3411606890003347522: 1, 7425985699627899538: 1, 3104811030673030468: 1, 5711639017775284443: 1, 2208928596161743350: 2, 2525716904149915114: 1}

For example, how do I map 8566208034543834098 to "apple"?

Solution

Use Counter instead of count_by to get the token/word counts:

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

doc = nlp("apple is the man good orange apple orange banana")
word_freq = Counter([tok.text.lower() for tok in doc])
print(word_freq)

Output:

Counter({'apple': 2, 'orange': 2, 'is': 1, 'the': 1, 'man': 1, 'good': 1, 'banana': 1})

To convert an ORTH ID to a string, look it up in the vocab:

print(doc.vocab[8566208034543834098].text)

Output:

apple
Alternatively, you can keep count_by and map each ID back to its string:

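from spacy.attrs import ORTH

doc = nlp("apple is the man good orange apple orange banana")
# count_by returns {hash ID: count}; the keys are hashes, not strings
counts = doc.count_by(ORTH)
# Look each hash up in the StringStore to recover the word
print({nlp.vocab.strings[word_id]: count for word_id, count in counts.items()})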

Output:

{'apple': 2, 'is': 1, 'the': 1, 'man': 1, 'good': 1, 'orange': 2, 'banana': 1}
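
If you want to try the mapping without downloading a model, here is a minimal sketch. It assumes spaCy is installed and uses spacy.blank("en") (tokenizer only) in place of en_core_web_sm. The helper name counts_by_word is made up for this example and is not part of spaCy.

```python
import spacy
from spacy.attrs import ORTH

# Assumption: a blank English pipeline (tokenizer only) is enough for counting.
nlp = spacy.blank("en")


def counts_by_word(doc):
    """Hypothetical helper: turn count_by's {hash: count} into {word: count}."""
    return {
        doc.vocab.strings[orth_id]: n
        for orth_id, n in doc.count_by(ORTH).items()
    }


doc = nlp("apple is the man good orange apple orange banana")
word_counts = counts_by_word(doc)
print(word_counts)

# The StringStore works in both directions: string -> hash and hash -> string.
apple_id = nlp.vocab.strings["apple"]
print(apple_id, nlp.vocab.strings[apple_id])
```

Looking up a hash only works for strings the vocab has already seen, which is always the case for IDs that come out of count_by on a processed doc.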