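Results get worse after every policy gradient optimization iteration

Problem description

I am trying to implement the image captioning code found here: https://github.com/chenxinpeng/Optimization_of_image_description_metrics_using_policy_gradient_methods/blob/master/image_caption.py

It performs image captioning by applying policy gradient to a combination of metrics.

The function I am having trouble with is this one: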

  def SGD_update(self,batch_num_images=1000):
    images = tf.placeholder(tf.float32,[batch_num_images,self.feats_dim])
    images_embed = tf.matmul(images,self.encode_img_W) + self.encode_img_b
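    # One reward and one baseline value per image and per LSTM step (indexed as [:, i] below)
    Q_rewards = tf.placeholder(tf.float32,[batch_num_images,self.lstm_step])
    Baselines = tf.placeholder(tf.float32,[batch_num_images,self.lstm_step])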

    state = self.lstm.zero_state(batch_size=batch_num_images,dtype=tf.float32)

    loss = 0.0

    with tf.variable_scope("LSTM"):
        tf.get_variable_scope().reuse_variables()
        output,state = self.lstm(images_embed,state)

        with tf.device("/cpu:0"):
            current_emb = tf.nn.embedding_lookup(self.Wemb,tf.ones([batch_num_images],dtype=tf.int64))

        for i in range(0,self.lstm_step):
            output,state = self.lstm(current_emb,state)

            logit_words = tf.matmul(output,self.embed_word_W) + self.embed_word_b
            logit_words_softmax = tf.nn.softmax(logit_words)
            max_prob_word = tf.argmax(logit_words_softmax,1)
            max_prob = tf.reduce_max(logit_words_softmax,1)

            current_rewards = Q_rewards[:,i] - Baselines[:,i]
            
            loss = loss + tf.reduce_sum(-tf.log(max_prob) * current_rewards)
            
            with tf.device("/cpu:0"):
                current_emb = tf.nn.embedding_lookup(self.Wemb,max_prob_word)
                #current_emb = tf.expand_dims(current_emb,0)

    return images,Q_rewards,Baselines,loss,max_prob,current_rewards,logit_words

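I print the reward (a linear combination of BLEU-1, BLEU-2, BLEU-3, and BLEU-4) at every iteration, and I noticed that it decreases each time.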
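
As a rough illustration of the reward described above, here is a minimal sketch that linearly combines 1-gram to 4-gram scores. The function names (`ngram_counts`, `ngram_precision`, `bleu_reward`), the equal weights, the single reference, and the use of clipped n-gram precision without a brevity penalty are all assumptions made for illustration; the linked repository may compute its metrics differently.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate has fewer than n tokens
    ref = ngram_counts(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu_reward(candidate, reference, weights=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of 1- to 4-gram precisions (weights are assumed)."""
    return sum(w * ngram_precision(candidate, reference, n)
               for n, w in enumerate(weights, start=1))

reference = "two dogs are playing in the grass".split()
short_caption = "two dogs".split()
short_reward = bleu_reward(short_caption, reference)
full_reward = bleu_reward(reference, reference)
```

Under these assumptions, the two-word caption still earns half of the maximum reward, because its unigram and bigram precisions are both perfect and nothing in this simplified score penalizes brevity.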

I need help figuring out the possible cause. It could be my loss function, a bug in my code (I am not familiar with TensorFlow), or something else.
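
For comparison, the textbook REINFORCE surrogate loss multiplies the advantage (reward minus baseline) by the log-probability of the word that was actually sampled when the reward was collected, which is not necessarily the most probable word. The following is a minimal pure-Python sketch of that sum for one caption; `reinforce_loss` and its arguments are hypothetical names, and the code is not taken from the linked repository. Whether this difference explains the falling reward is not established here.

```python
import math

def reinforce_loss(step_probs, sampled_words, q_rewards, baselines):
    """Sum over time steps of -log pi(a_t) * (Q_t - b_t) for one caption.

    step_probs[t]    : list of word probabilities at step t (sums to 1)
    sampled_words[t] : index of the word actually sampled at step t
    q_rewards[t]     : estimated return Q for that step
    baselines[t]     : baseline value subtracted from Q
    """
    loss = 0.0
    for probs, word, q, b in zip(step_probs, sampled_words, q_rewards, baselines):
        advantage = q - b
        loss += -math.log(probs[word]) * advantage
    return loss

# Toy example: a 3-word vocabulary and a 2-step caption (all values made up).
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
loss = reinforce_loss(probs, sampled_words=[1, 2],
                      q_rewards=[1.0, 1.0], baselines=[0.5, 0.5])
```

With a positive advantage, minimizing this loss raises the probability of the sampled word; with a negative advantage it lowers it.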

One thing I noticed is that the model prefers shorter captions and makes them very generic, e.g. "two dogs".


reinforcement-learning tensorflow