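Problem description
My bigram language model works fine when the input is a single word, but when I give my trigram model two words it behaves strangely and predicts "unknown" as the next word. My code: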
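def get_unigram_probability(word):
    # Relative frequency of a single word; unseen words get probability 0.
    if word not in unigram:
        return 0
    return unigram[word] / total_words

def get_bigram_probability(words):
    # words is a (previous_word, word) tuple; estimates P(word | previous_word).
    if words not in bigram:
        return 0
    return bigram[words] / unigram[words[0]]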
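V = len(vocabulary)

def get_trigram_probability(words):
    # words is a (w1, w2, w3) tuple; add-one smoothed estimate of P(w3 | w1, w2).
    if words not in trigram:
        return 0
    # Parentheses are required: without them only 1 / bigram[...] is divided.
    return (trigram[words] + 1) / (bigram[words[:2]] + V)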
Next-word prediction for the bigram model:
def find_next_word_bigram(words):
    candidate_list = []
    # Score every vocabulary word as a continuation of the last input word
    for word in vocabulary:
        p2 = get_bigram_probability((words[-1], word))
        candidate_list.append((word, p2))
    # Sort so that the most probable candidates come first
    candidate_list.sort(key=lambda x: x[1], reverse=True)
    # print(candidate_list)
    return candidate_list[0]
For the trigram model:
def find_next_word_trigram(words):
    candidate_list = []
    # Score every vocabulary word as a continuation of the last two input words
    for word in vocabulary:
        p3 = get_trigram_probability((words[-2], words[-1], word)) if len(words) >= 3 else 0
        candidate_list.append((word, p3))
    # Sort so that the most probable candidates come first
    candidate_list.sort(key=lambda x: x[1], reverse=True)
    # print(candidate_list)
    return candidate_list[0]
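I just want to know where in the code I should make changes so that the trigram model can predict the next word from a two-word input.
Solution
When building the trigrams, use a special BOS (beginning of sentence) token so that short sequences can be handled. Basically, prepend BOS twice to every sentence, like this: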
I like cheese
BOS BOS I like cheese
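A minimal sketch of how the padded counts could be collected, assuming sentences are whitespace-tokenized strings and the marker is spelled "BOS". The helper name `count_ngrams` is hypothetical, and `collections.Counter` stands in for whatever dictionaries the original code uses for `unigram`, `bigram`, and `trigram`:

```python
from collections import Counter

BOS = "BOS"  # assumed spelling of the beginning-of-sentence marker


def count_ngrams(sentences):
    """Count unigrams, bigrams and trigrams over BOS-padded sentences."""
    unigram, bigram, trigram = Counter(), Counter(), Counter()
    for sentence in sentences:
        # Two BOS tokens give even the first real word a full two-word context.
        tokens = [BOS, BOS] + sentence.split()
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
        trigram.update(zip(tokens, tokens[1:], tokens[2:]))
    return unigram, bigram, trigram


unigram, bigram, trigram = count_ngrams(["I like cheese", "I like tea"])
print(trigram[(BOS, BOS, "I")])       # how often a sentence starts with "I"
print(trigram[("I", "like", "cheese")])
```

Note that BOS now also shows up in the unigram counts, so it probably should not be counted in `total_words` or offered as a prediction candidate.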
This way, when you take input from the user, you can prepend BOS BOS to it and complete even short sequences.
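Padding also explains the symptom in the question: with a two-word input, `len(words) >= 3` is false, so every candidate scores 0 and the sort simply returns whichever vocabulary word comes first. Once BOS BOS is prepended, the last two tokens always exist and no length check is needed. The following is a rough sketch under that assumption, not a drop-in replacement; the toy sentences and the `Counter`-based counts are made up for illustration, and the add-one smoothing is written with explicit parentheses:

```python
from collections import Counter

BOS = "BOS"  # assumed spelling of the beginning-of-sentence marker

# Toy counts standing in for the question's bigram, trigram and vocabulary.
sentences = ["I like cheese", "I like tea", "I like cheese"]
bigram, trigram, vocabulary = Counter(), Counter(), set()
for sentence in sentences:
    words = sentence.split()
    tokens = [BOS, BOS] + words
    vocabulary.update(words)  # BOS is deliberately kept out of the candidates
    bigram.update(zip(tokens, tokens[1:]))
    trigram.update(zip(tokens, tokens[1:], tokens[2:]))
V = len(vocabulary)


def get_trigram_probability(words):
    # Add-one smoothing; a Counter returns 0 for unseen keys.
    return (trigram[words] + 1) / (bigram[words[:2]] + V)


def find_next_word_trigram(words):
    # Pad the user input, then keep the last two tokens as context.
    w1, w2 = ([BOS, BOS] + list(words))[-2:]
    candidates = [(w, get_trigram_probability((w1, w2, w)))
                  for w in sorted(vocabulary)]
    # max() keeps the first of several equally likely candidates.
    return max(candidates, key=lambda pair: pair[1])


print(find_next_word_trigram(["I", "like"]))  # two-word input
print(find_next_word_trigram(["I"]))          # one-word input, context (BOS, "I")
```

With these toy counts the two-word input "I like" is completed with "cheese" and the one-word input "I" with "like".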