How does a trigram model predict the next word when the input is two words long?

Problem description

My bigram language model works fine when the input is a single word, but when I give my trigram model two words, it behaves strangely and predicts the next word as "unkNown". My code:

def get_unigram_probability(word):
  if word not in unigram:
      return 0
  return unigram[word] / total_words
    
def get_bigram_probability(words):
  if words not in bigram:
      return 0
  return bigram[words] / unigram[words[0]]
    
V = len(vocabulary)

def get_trigram_probability(words):
  if words not in trigram:
      return 0
  # add-one smoothing: parentheses are needed so the division applies to the whole sums
  return (trigram[words] + 1) / (bigram[words[:2]] + V)

Next-word prediction with the bigram model:

def find_next_word_bigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p2 = get_bigram_probability((words[-1],word))
    candidate_list.append((word,p2))
    
  # sort the list so that the most probable words come first
  candidate_list.sort(key=lambda x: x[1],reverse=True)
  # print(candidate_list)
  return candidate_list[0]

And with the trigram model:

def find_next_word_trigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p3 = get_trigram_probability((words[-2],words[-1],word)) if len(words) >= 3 else 0
    candidate_list.append((word,p3))
    
  # sort the list so that the most probable words come first
  candidate_list.sort(key=lambda x: x[1],reverse=True)
  # print(candidate_list)
  return candidate_list[0]

I just want to know where in the code I need to make changes so that the trigram model can predict the next word given a two-word input.

Solution

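When building the trigrams, use a special BOS (beginning of sentence) token so that short sequences can be handled. Essentially, prepend BOS twice to every sentence, like this: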

I like cheese
BOS BOS I like cheese

This way, when you take input from the user, you can prepend BOS BOS to it and complete even short sequences.
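
The padding idea can be sketched roughly as follows. This is only an illustration under a few assumptions, not the asker's actual setup: the helper build_counts, the BOS constant, and passing the count tables as arguments are all made up for the example, and the counts are stored in collections.Counter objects so that unseen n-grams count as 0. Note also that in the question's find_next_word_trigram, the check len(words) >= 3 gives every candidate a probability of 0 when the input has only two words, so the result is whatever word happens to come first; taking the last two tokens of the padded input avoids that check entirely.

```python
from collections import Counter

BOS = "BOS"  # assumed sentence-start marker; any token absent from the corpus works


def build_counts(sentences):
    """Count bigrams and trigrams after prepending two BOS tokens to each sentence."""
    bigram, trigram, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.split()
        tokens = [BOS, BOS] + words
        vocabulary.update(words)
        bigram.update(zip(tokens, tokens[1:]))
        trigram.update(zip(tokens, tokens[1:], tokens[2:]))
    return bigram, trigram, vocabulary


def find_next_word_trigram(words, bigram, trigram, vocabulary):
    """Return the (word, probability) pair with the highest add-one smoothed probability."""
    # Pad the input the same way as the training data, so that
    # zero, one, or two input words all yield a two-token context.
    context = tuple(([BOS, BOS] + list(words))[-2:])
    V = len(vocabulary)

    def prob(word):
        # Counter returns 0 for unseen n-grams, so no membership test is needed.
        return (trigram[context + (word,)] + 1) / (bigram[context] + V)

    # Iterate in sorted order so that ties are broken deterministically.
    best = max(sorted(vocabulary), key=prob)
    return best, prob(best)
```

With counts built from a small corpus such as ["I like cheese", "I like cheese", "I like wine"], calling find_next_word_trigram(["I", "like"], ...) picks "cheese", and a one-word input such as ["I"] is completed through the context (BOS, I).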

google-colaboratory n-gram nlp python trigram
