Why does tf.random.log_uniform_candidate_sampler return the true class?

Problem description

I am reading the TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/text/word2vec#define_loss_function_and_compile_model

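In this tutorial, negative sampling is done with tf.random.log_uniform_candidate_sampler. Given a context class (the true class), the goal is to draw negative classes from the whole vocabulary. As I understand it, the negative classes must differ from the given context class. However, I found that the context class can appear among the negative classes sampled by tf.random.log_uniform_candidate_sampler. Here is the code: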

import tensorflow as tf
SEED = 42 

# encode the words
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
vocab,index = {},1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token 
for token in tokens:
  if token not in vocab: 
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)
inverse_vocab = {index: token for token,index in vocab.items()}
print(inverse_vocab)


# use (hot, the) as the (target, context) pair
target_word,context_word = 6,1
print("target: {},context: {}".format(inverse_vocab[target_word],inverse_vocab[context_word]))


# negative sampling
# Set the number of negative samples per positive context. 
num_ns = 4

context_class = tf.reshape(tf.constant(context_word,dtype="int64"),(1,1))
negative_sampling_candidates,_,_ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,# class that should be sampled as 'positive'
    num_true=1,# each positive skip-gram has 1 positive context class
    num_sampled=num_ns,# number of negative context words to sample
    unique=True,# all the negative samples should be unique
    range_max=vocab_size,# pick index of the samples from [0, vocab_size)
    seed=SEED,# seed for reproducibility
    name="negative_sampling" # name of this operation
)
print("negative samples' index",negative_sampling_candidates)
print("negative samples: ",[inverse_vocab[index.numpy()] for index in negative_sampling_candidates])
# "the" may appear among the negative samples; if it does not, run this several times.

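"the" is the context class of the word "hot", so why can it show up among the sampled negative classes? In addition, the target word "hot" can also be sampled as a negative class. Am I misunderstanding something?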

Solution

You are right. This is a mistake on TensorFlow's side. See the bug report at https://github.com/tensorflow/tensorflow/issues/49490
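To illustrate why the context class can show up, here is a minimal pure-Python sketch, not TensorFlow's actual implementation. It assumes the base distribution stated in the TensorFlow documentation for this sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1), which does not depend on true_classes at all, so low indices such as 1 ("the") are drawn often. The helpers log_uniform_probs and sample_negatives are hypothetical names introduced only for this example; sample_negatives shows one possible workaround, which is to reject the true class while drawing.

```python
import math
import random


def log_uniform_probs(range_max):
    """Log-uniform (Zipfian) probability of each class in [0, range_max).

    Assumed formula, taken from the sampler's documentation:
    P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)
    """
    return [
        (math.log(c + 2) - math.log(c + 1)) / math.log(range_max + 1)
        for c in range(range_max)
    ]


def sample_negatives(true_class, num_sampled, range_max, rng):
    """Draw unique negatives from the log-uniform distribution,
    rejecting the true class (hypothetical workaround, not a TF API)."""
    if num_sampled > range_max - 1:
        raise ValueError("not enough classes left after excluding true_class")
    probs = log_uniform_probs(range_max)
    classes = list(range(range_max))
    negatives = []
    while len(negatives) < num_sampled:
        candidate = rng.choices(classes, weights=probs, k=1)[0]
        # Skip the true class and anything already drawn.
        if candidate != true_class and candidate not in negatives:
            negatives.append(candidate)
    return negatives


# Same setup as in the question: 8 vocabulary entries, context class 1 ("the").
probs = log_uniform_probs(8)
print("P(class 1) =", probs[1])  # nonzero, whatever the true class is

rng = random.Random(42)
negatives = sample_negatives(true_class=1, num_sampled=4, range_max=8, rng=rng)
print("negatives without the true class:", negatives)
```

The same idea can be applied to the tensor returned by tf.random.log_uniform_candidate_sampler by sampling a few extra candidates and filtering out the context class afterwards. Note that this sketch still allows index 0 (the padding token) and the target word to be drawn; exclude them the same way if that matters for your use case.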