How do I get the target Q-values for BipedalWalker-v3 in OpenAI Gym reinforcement learning?

Problem description

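I am new to reinforcement learning, and I am trying to solve BipedalWalker-v3 using deep Q-learning. However, I found that env.action_space.sample() returns a NumPy array with 4 elements, and I am not sure how to add the rewards and multiply by (1 - done_list). I tried copying my code from another project.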

In the Lunar Lander case, env.action_space.sample() returns an integer.

This is how I update the model for Lunar Lander:

def update_model(self):
        random_sample = random.sample(self.replay_buffer,self.batch_size)
        
        states,actions,rewards,next_states,done_list = self.get_attributes_from_sample(random_sample)
        # How do I fix the below target for BipedalWalker
        targets = rewards + self.gamma * (np.max(self.model.predict_on_batch(next_states),axis=1)) * (1 - done_list)
        
        target_vec = self.model.predict_on_batch(states) # shape = (64,4)
        indexes = np.array([i for i in range(self.batch_size)])
        target_vec[[indexes],[actions]] = targets

        self.model.fit(states,target_vec,epochs=1,verbose=0)

This works very well in the LunarLander environment.
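To make the update above concrete, here is a minimal NumPy sketch of the same target computation on a toy batch. The Q-values are made-up numbers standing in for `self.model.predict_on_batch(...)`, and `dqn_targets` is a hypothetical helper name, not part of the original project.

```python
import numpy as np

def dqn_targets(q_states, q_next_states, actions, rewards, done_list, gamma):
    """Build the training target matrix for a discrete-action DQN update.

    q_states, q_next_states: arrays of shape (batch, n_actions), standing in
    for model.predict_on_batch(states) and model.predict_on_batch(next_states).
    actions: integer action indices of shape (batch,).
    """
    # Bellman target: r + gamma * max_a' Q(s', a'), zeroed on terminal steps.
    targets = rewards + gamma * np.max(q_next_states, axis=1) * (1 - done_list)
    # Only the Q-value of the action actually taken is overwritten.
    target_vec = q_states.copy()
    target_vec[np.arange(len(actions)), actions] = targets
    return target_vec

# Toy batch of 2 transitions with 4 discrete actions (made-up numbers).
q_states = np.array([[0.1, 0.2, 0.3, 0.4],
                     [1.0, 2.0, 3.0, 4.0]])
q_next_states = np.array([[0.5, 1.5, 0.0, 1.0],
                          [9.0, 8.0, 7.0, 6.0]])
actions = np.array([2, 0])
rewards = np.array([1.0, -1.0])
done_list = np.array([0, 1])  # the second transition ended the episode

target_vec = dqn_targets(q_states, q_next_states, actions, rewards, done_list, gamma=0.5)
print(target_vec)
```

The key point is that `actions` is used as an integer column index into the Q-value matrix, which is only possible because each action is a single integer.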

I need to implement this in my BipedalWalker project, which can be found here: LunarLander

However, even after 1000 episodes, the model does not produce any good results.

Here is the same method for BipedalWalker:

def update_model(self):
        # replay_buffer size Check
        if len(self.replay_buffer) < self.batch_size or self.counter != 0:
            return

        # Early Stopping
        if np.mean(self.rewards_list[-10:]) > 180:
            return

        # take a random sample:
        random_sample = random.sample(self.replay_buffer,self.batch_size)
        # Extract the attributes from sample
        states,actions,rewards,next_states,done_list = self.get_attributes_from_sample(random_sample)
        targets = np.tile(rewards,(self.num_action_space,1)).T + np.multiply(np.tile((1 - done_list),(self.action_space.sample().size,1)).T,np.multiply(self.gamma,self.model.predict_on_batch(next_states)))
        # print(targets.shape) = (64,4)
        target_vec = self.model.predict_on_batch(states) # shape = (64,4)
        indexes = np.array([i for i in range(self.batch_size)])
        target_vec = targets

        self.model.fit(states,target_vec,epochs=1,verbose=0)

Solution

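Using DQN to solve an environment with a continuous action space (such as BipedalWalker-v3) is a bad idea, because DQN relies on an iterative optimization over (state, action) pairs: its target takes a maximum over the Q-values of all actions, which only works when the actions form a small discrete set with one network output per action. In BipedalWalker-v3 each action is a vector of 4 continuous values, so the 4 network outputs cannot be treated as Q-values of 4 separate actions, and there is no direct way to compute that maximum. I suggest switching to an algorithm designed for continuous actions, such as TD3, SAC, or PPO.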
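To illustrate how the target changes for continuous actions: actor-critic methods such as TD3 replace the maximum over discrete actions with a separate actor network that proposes the next action, which a critic then scores. Below is a simplified NumPy sketch of that target. It is closer to DDPG than to full TD3, which additionally uses two critics and target-policy noise. `toy_actor` and `toy_critic` are hypothetical stand-ins for trained target networks with random weights, and the 24-dimensional state and 4-dimensional action sizes are assumed for BipedalWalker-v3.

```python
import numpy as np

STATE_DIM, ACTION_DIM = 24, 4  # assumed BipedalWalker-v3 sizes

rng = np.random.default_rng(0)
W_ACTOR = rng.normal(size=(STATE_DIM, ACTION_DIM))      # random toy weights
W_CRITIC = rng.normal(size=(STATE_DIM + ACTION_DIM,))   # random toy weights

def toy_actor(states):
    """Hypothetical target actor: maps states to actions in [-1, 1]."""
    return np.tanh(states @ W_ACTOR)

def toy_critic(states, actions):
    """Hypothetical target critic: one scalar Q-value per (state, action) pair."""
    return np.concatenate([states, actions], axis=1) @ W_CRITIC

def continuous_targets(rewards, next_states, done_list, gamma):
    # The actor proposes a' = mu(s'), replacing the max over discrete actions.
    next_actions = toy_actor(next_states)
    next_q = toy_critic(next_states, next_actions)
    # Same Bellman form as in DQN: one scalar target per transition.
    return rewards + gamma * next_q * (1 - done_list)

# Random toy batch standing in for a replay-buffer sample.
batch_size = 64
next_states = rng.normal(size=(batch_size, STATE_DIM))
rewards = rng.normal(size=batch_size)
done_list = rng.integers(0, 2, size=batch_size)

targets = continuous_targets(rewards, next_states, done_list, gamma=0.99)
print(targets.shape)
```

In this scheme the target is one scalar per transition and is used to train the critic only; the actor is trained separately to maximize the critic's output. In practice an existing implementation of TD3, SAC, or PPO is likely a better starting point than writing this by hand.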

deep-learning openai-gym python reinforcement-learning tensorflow