Problem description
I have the word frequencies, but not the words themselves. Can you suggest how to map the IDs back to the words?
import spacy
from spacy.attrs import ORTH
nlp = spacy.load('en_core_web_sm')
doc = nlp("apple is the man good orange apple orange banana")
print(doc.count_by(ORTH))  # count_by is a method of the Doc object
{8566208034543834098:2,3411606890003347522:1,7425985699627899538:1,3104811030673030468:1,5711639017775284443:1,2208928596161743350:2,2525716904149915114:1}
For example, how do I map 8566208034543834098 back to "apple"?
Solution
Use Counter instead of count_by to get the token/word counts:
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
doc = nlp("apple is the man good orange apple orange banana")
word_freq = Counter([tok.text.lower() for tok in doc])
print(word_freq)
Output:
Counter({'apple': 2,'orange': 2,'is': 1,'the': 1,'man': 1,'good': 1,'banana': 1})
To convert an orth ID to a string:
print(doc.vocab[8566208034543834098].text)
Output:
apple
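The lookup works in both directions through the vocabulary's string store. The following is a minimal sketch, assuming spaCy v2 or v3 is installed; it uses spacy.blank("en") instead of en_core_web_sm only because the tokenizer and vocab are all that is needed here, and the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm,
# since only tokenization and the vocab are required for this lookup.
nlp = spacy.blank("en")
doc = nlp("apple is the man good orange apple orange banana")

apple_id = doc[0].orth                    # integer hash of the token's verbatim text
apple_text = nlp.vocab.strings[apple_id]  # hash -> string
round_trip = nlp.vocab.strings["apple"]   # string -> hash

print(apple_id, apple_text, round_trip == apple_id)
```

A hash can only be resolved to text if the string store has already seen that string, so do the lookup with the same vocab that processed the document.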
Alternatively, you can use count_by as follows:
from spacy.attrs import ORTH
doc = nlp("apple is the man good orange apple orange banana")
counts = doc.count_by(ORTH)
# Map each ORTH hash back to its string via the vocab's string store
print({nlp.vocab.strings[word_id]: count for word_id, count in counts.items()})
Output (key order may vary):
{'apple': 2, 'is': 1, 'the': 1, 'man': 1, 'good': 1, 'orange': 2, 'banana': 1}
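One difference between the two approaches: the Counter version lowercases each token, while count_by(ORTH) counts the verbatim text, so "Apple" and "apple" would be counted separately. A minimal sketch of a case-insensitive variant, assuming spaCy v2 or v3 and using the LOWER attribute; the sample sentence and variable names are made up for illustration:

```python
import spacy
from spacy.attrs import LOWER

# Assumption: a blank English pipeline is sufficient, as only tokenization is needed.
nlp = spacy.blank("en")
doc = nlp("Apple is good apple Orange orange banana")

# Count tokens by the hash of their lowercase form
counts = doc.count_by(LOWER)

# Resolve each hash to its string
word_freq = {nlp.vocab.strings[word_id]: n for word_id, n in counts.items()}

# Sort by descending frequency, then alphabetically for a stable order
ranked = sorted(word_freq.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```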