Problem Description
I am working on an NLP task in Keras: translating English sentences into German. But the model isn't learning anything... However, as soon as I removed the softmax from the last layer, it started working! Is this a bug in Keras, or is something else going on?
import tensorflow as tf
from tensorflow.keras.layers import Embedding, GRU, Dense, Concatenate
from tensorflow.keras.optimizers import Adam

optimizer = Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out positions where the target is the padding token (0)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

EPOCHS = 20
batch_size = 64
batch_per_epoch = int(train_x1.shape[0] / batch_size)

embed_dim = 256
units = 1024
attention_units = 10

encoder_embed = Embedding(english_vocab_size, embed_dim)
decoder_embed = Embedding(german_vocab_size, embed_dim)

encoder = GRU(units, return_sequences=True, return_state=True,
              recurrent_initializer='glorot_uniform')
# return_sequences/return_state are required here too: the loop below
# unpacks (dec_output, dec_hidden) and indexes dec_output.shape[2]
decoder = GRU(units, return_sequences=True, return_state=True,
              recurrent_initializer='glorot_uniform')

dense = Dense(german_vocab_size)

attention1 = Dense(attention_units)
attention2 = Dense(attention_units)
attention3 = Dense(1)

def train_step(english_input, german_target):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(encoder_embed(english_input))
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims(
            [german_tokenizer.word_index['startseq']] * batch_size, 1)

        for i in range(1, german_target.shape[1]):
            # Bahdanau (additive) attention over the encoder outputs
            attention_weights = attention1(enc_output) + attention2(
                tf.expand_dims(dec_hidden, axis=1))
            attention_weights = tf.nn.tanh(attention_weights)
            attention_weights = attention3(attention_weights)
            attention_weights = tf.nn.softmax(attention_weights, axis=1)

            context_vector = tf.reduce_sum(enc_output * attention_weights, axis=1)
            context_vector = tf.expand_dims(context_vector, axis=1)

            x = decoder_embed(dec_input)
            x = Concatenate(axis=-1)([x, context_vector])

            dec_output, dec_hidden = decoder(x)
            output = tf.reshape(dec_output, (-1, dec_output.shape[2]))
            prediction = dense(output)  # raw logits; the loss applies softmax itself

            loss += loss_function(german_target[:, i], prediction)

            # Teacher forcing: feed the ground-truth token, not the prediction
            dec_input = tf.expand_dims(german_target[:, i], 1)

    batch_loss = (loss / int(german_target.shape[1]))
    variables = (encoder_embed.trainable_variables + decoder_embed.trainable_variables
                 + encoder.trainable_variables + decoder.trainable_variables
                 + dense.trainable_variables + attention1.trainable_variables
                 + attention2.trainable_variables + attention3.trainable_variables)
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
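For context, here is a minimal loop that could drive train_step. This is only a sketch: the German target array name train_y1 and the simple slicing batcher are assumptions, since the original post only references train_x1.

for epoch in range(EPOCHS):
    total_loss = 0.0
    for batch in range(batch_per_epoch):
        start = batch * batch_size
        english_batch = train_x1[start:start + batch_size]   # English inputs
        german_batch = train_y1[start:start + batch_size]    # assumed target array
        total_loss += float(train_step(english_batch, german_batch))
    print('Epoch {}: loss {:.4f}'.format(epoch + 1, total_loss / batch_per_epoch))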
Code Summary
The code takes English sentences and German sentences as input (the German sentences are fed in as decoder input to implement teacher forcing) and predicts the translated German sentence.
The loss function is SparseCategoricalCrossentropy, but it masks out the loss at positions whose target token is 0. For example, take the sentence 'StartSeq This is Stackoverflow 0 0 0 0 0 EndSeq' (the sentence is zero-padded so that all input sentences have the same length). We compute the loss for every real word, but not for the 0s. Doing this trains the model better.
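As a quick check of the masking logic, here is a toy batch at a single timestep; the 10-word vocabulary and the random logits are illustrative assumptions, not from the original post:

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

real = tf.constant([5, 3, 0, 0])            # targets at one timestep; 0 = padding
pred = tf.random.uniform((4, 10))           # logits over a toy 10-word vocabulary

mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), tf.float32)
per_token = loss_object(real, pred) * mask  # losses at padded positions become 0
print(mask.numpy())                         # [1. 1. 0. 0.]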
Note: this model implements Bahdanau attention.
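For reference, the additive score computed by attention1, attention2, and attention3 in the code corresponds to the standard Bahdanau formulation, where $h_j$ is an encoder output, $s_{t-1}$ is the previous decoder state, attention1 plays the role of $W_1$, attention2 of $W_2$, and attention3 of $v_a$:

$$e_{t,j} = v_a^\top \tanh(W_1 h_j + W_2 s_{t-1}), \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_k \exp(e_{t,k})}$$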
Question
When I apply a softmax to the predictions in the last layer, the model learns nothing. But it learns correctly without the softmax. Why does this happen?
Solution
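The most likely cause lies in the loss definition above: SparseCategoricalCrossentropy is constructed with from_logits=True, which means the loss itself applies a softmax to the model's output. If the last Dense layer also applies a softmax, softmax gets applied twice; the doubly-squashed distribution is nearly uniform, its gradients are tiny, and the model appears not to learn. Either feed raw logits to a from_logits=True loss (as the posted code does once the softmax is removed), or keep the softmax in the model and set from_logits=False. A small sketch of both options, with a numeric illustration of the double-softmax squashing (the example logits are made up):

import tensorflow as tf

# Option 1: last layer outputs raw logits; the loss applies softmax internally.
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

# Option 2: last layer ends with softmax; the loss must not apply it again.
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none')

# Why mixing them fails: applying softmax twice flattens the distribution.
logits = tf.constant([[4.0, 1.0, 0.0]])
probs = tf.nn.softmax(logits)        # ~[0.94, 0.05, 0.02] -- confident
double = tf.nn.softmax(probs)        # ~[0.55, 0.23, 0.22] -- nearly uniform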