Problem description
How can I plot the reward values per iteration for this Multi-Armed Bandits with Per-Arm Features example from TensorFlow (complete code included)?
The tutorial includes a regret metric with a plot:
def _all_rewards(observation, hidden_param):
  """Outputs rewards for all actions, given an observation."""
  hidden_param = tf.cast(hidden_param, dtype=tf.float32)
  global_obs = observation['global']
  per_arm_obs = observation['per_arm']
  num_actions = tf.shape(per_arm_obs)[1]
  tiled_global = tf.tile(
      tf.expand_dims(global_obs, axis=1), [1, num_actions, 1])
  concatenated = tf.concat([tiled_global, per_arm_obs], axis=-1)
  rewards = tf.linalg.matvec(concatenated, hidden_param)
  return rewards

def optimal_reward(observation):
  """Outputs the maximum expected reward for every element in the batch."""
  return tf.reduce_max(_all_rewards(observation, reward_param), axis=1)

regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward)

num_iterations = 40  # @param
steps_per_loop = 1  # @param

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=steps_per_loop)

observers = [replay_buffer.add_batch, regret_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=per_arm_tf_env,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * BATCH_SIZE,
    observers=observers)

regret_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

plt.plot(regret_values)
plt.title('Regret of LinUCB on the Linear per-arm environment')
plt.xlabel('Number of Iterations')
_ = plt.ylabel('Average Regret')
Ultimately I would like a plot like that one, but showing the rewards increasing over the iterations; how can I modify the code to do that?
Solution
No verified solution has been posted for this question yet.
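One plausible approach, sketched below as a minimal modification of the training loop above rather than a confirmed fix: replay_buffer.gather_all() returns the batch of Trajectory objects that agent.train() consumes, and that Trajectory carries the rewards the policy actually observed in its reward field, so the per-iteration average reward can be logged from the same batch before the buffer is cleared. The sketch assumes the tutorial's eager-mode setup and reuses its variables (driver, agent, replay_buffer, regret_metric, num_iterations).

reward_values = []
regret_values = []

for _ in range(num_iterations):
  driver.run()
  # Pull this iteration's collected experience once and reuse it for
  # both reward logging and the training step.
  experience = replay_buffer.gather_all()
  # Average the observed reward over the batch collected this iteration.
  reward_values.append(tf.reduce_mean(experience.reward).numpy())
  loss_info = agent.train(experience)
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

plt.plot(reward_values)
plt.title('Average reward of LinUCB on the linear per-arm environment')
plt.xlabel('Number of Iterations')
_ = plt.ylabel('Average Reward')

If a smoothed curve is preferred, tf_agents.metrics.tf_metrics.AverageReturnMetric could instead be appended to the observers list and polled with result() each iteration, mirroring how regret_metric is handled. Either way, as the agent learns, the reward curve should trend upward while the regret curve trends downward.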