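Problem description
I am reading the TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/text/word2vec#define_loss_function_and_compile_model
In this tutorial, negative sampling is done with tf.random.log_uniform_candidate_sampler. Given a context class (the true class), the goal is to draw negative classes from the whole vocabulary. As I understand it, the negative classes must differ from the given context class. However, I found that the context class can appear among the negative classes sampled by tf.random.log_uniform_candidate_sampler. Here is the code: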
import tensorflow as tf

SEED = 42

# encode the words
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab_size = len(vocab)
print(vocab)
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

# use (hot, the) as a (target, context) pair
target_word, context_word = 6, 1
print("target: {}, context: {}".format(inverse_vocab[target_word], inverse_vocab[context_word]))

# negative sampling
# Set the number of negative samples per positive context.
num_ns = 4
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print("negative samples' index", negative_sampling_candidates)
print("negative samples: ", [inverse_vocab[index.numpy()] for index in negative_sampling_candidates])
# "the" will show up in the negative samples; if not, run it several times.
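"the" is the context class of the word "hot", so why can it appear among the sampled negative classes? In addition, the target word "hot" can also be sampled as a negative class. Am I misunderstanding something?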
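Solution
You are right. TensorFlow made a mistake here. The sampler does not remove true_classes from its draws; as far as its documentation describes, it uses them only to compute the expected counts it returns, so the context word (and the target word) can come back as a "negative". See the bug report at https://github.com/tensorflow/tensorflow/issues/49490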
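To make the expected behaviour concrete, here is a minimal pure-Python sketch of negative sampling that rejects unwanted classes. It is not TensorFlow's implementation and not the tutorial's code. The helper names log_uniform_probs and sample_negatives are hypothetical, and the only thing taken from TensorFlow is the documented log-uniform formula P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1). The example assumes the vocabulary from the question (8 ids, "the" = 1, "hot" = 6).

```python
import math
import random


def log_uniform_probs(range_max):
    """Probability of each class id k in [0, range_max) under the
    log-uniform (Zipfian) distribution documented for the TF sampler."""
    norm = math.log(range_max + 1)
    return [(math.log(k + 2) - math.log(k + 1)) / norm for k in range(range_max)]


def sample_negatives(excluded, num_sampled, range_max, rng):
    """Hypothetical helper: draw `num_sampled` unique class ids from the
    log-uniform distribution, rejecting every id listed in `excluded`."""
    allowed = [k for k in range(range_max) if k not in excluded]
    if num_sampled > len(allowed):
        raise ValueError("not enough classes left to sample from")
    probs = log_uniform_probs(range_max)
    negatives = []
    while len(negatives) < num_sampled:
        candidate = rng.choices(range(range_max), weights=probs, k=1)[0]
        if candidate in excluded or candidate in negatives:
            continue  # rejection step: skip excluded ids and duplicates
        negatives.append(candidate)
    return negatives


# Assumed setup from the question: context "the" = 1, target "hot" = 6.
rng = random.Random(42)
negatives = sample_negatives(excluded={1, 6}, num_sampled=4, range_max=8, rng=rng)
print(negatives)  # four distinct ids, never 1 or 6
```

Low ids are much more likely under this distribution, which is why "the" (id 1) shows up so often in the question's output. Note that id 0, the padding token, is the single most likely draw; if that is unwanted, it can be added to excluded as well.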