ML agent not learning a relatively "simple" task

Problem Description

I'm trying to create a simple ML agent (a ball) that learns to move toward a target and collide with it.

Unfortunately, the agent doesn't seem to be learning; it just keeps moving around in what look like random directions. After 5M steps, the mean reward is still -1.

Any suggestions on what I'm doing wrong?

[TensorFlow cumulative reward graph]

My observations:

/// <summary>
/// Observations:
/// 1: Distance to nearest target
/// 3: Vector to nearest target
/// 3: Target position
/// 3: Agent position
/// 1: Agent velocity X
/// 1: Agent velocity Y
/// 12 observations in total
/// </summary>
/// <param name="sensor"></param>
public override void CollectObservations(VectorSensor sensor)
{
    //If the nearest target is null, observe an empty array and return early
    if (target == null)
    {
        sensor.AddObservation(new float[12]);
        return;
    }

    float distanceToTarget = Vector3.Distance(target.transform.position, this.transform.position);

    //Distance to nearest target (1 observation)
    sensor.AddObservation(distanceToTarget);

    //Vector to nearest target (3 observations)
    Vector3 toTarget = target.transform.position - this.transform.position;
    sensor.AddObservation(toTarget.normalized);

    //Target position (3 observations)
    sensor.AddObservation(target.transform.localPosition);

    //Agent position (3 observations)
    sensor.AddObservation(this.transform.localPosition);

    //Agent velocities (2 observations)
    sensor.AddObservation(rigidbody.velocity.x);
    sensor.AddObservation(rigidbody.velocity.y);
}

My YAML configuration:

behaviors:
  PlayerAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512 #128
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2 #0.2
      lambd: 0.99
      num_epoch: 3 #3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 32 #256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 64
        learning_rate: 3.0e-4
    #keep_checkpoints: 5
    #checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    framework: tensorflow

[Unity Inspector component config]

Rewards (everything relevant from the agent script):

private void Update()
{
    //If the agent falls off the screen, give a negative reward and end the episode
    if (this.transform.position.y < 0)
    {
        AddReward(-1.0f);
        EndEpisode();
    }

    if (target != null)
    {
        Debug.DrawLine(this.transform.position, target.transform.position, Color.green);
    }
}

private void OnCollisionEnter(Collision collidedObj)
{
    //If the agent collides with the goal, provide a reward
    if (collidedObj.gameObject.CompareTag("Goal"))
    {
        AddReward(1.0f);
        Destroy(target);
        EndEpisode();
    }
}

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target
        envController.PlaceTarget();
        target = envController.ProvideTarget();
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    //Apply a tiny negative reward every step to encourage action
    if (this.MaxStep > 0) AddReward(-1f / this.MaxStep);
}

Solution

How difficult is your environment? If the target is rarely reached, the agent has nothing to learn from. In that case you need to add an intrinsic or shaped reward for moving in the right direction, so the agent can learn even though the extrinsic reward is sparse; see the sketch below.
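A minimal sketch of such a shaped reward, built on the OnActionReceived from the question (the previousDistance field and the 0.01 scale are illustrative choices, not part of the original code; target, rigidbody, moveSpeed, and envController are assumed to be the same fields the script already has):

//Hedged sketch: dense, distance-based reward shaping. The agent gets a
//learning signal every step, even before it ever touches the goal.
private float previousDistance;

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target, then reset the progress baseline
        envController.PlaceTarget();
        target = envController.ProvideTarget();
        previousDistance = Vector3.Distance(target.transform.position, this.transform.position);
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    //Shaping reward: positive when the agent moved closer to the target
    //this step, negative when it moved away. The 0.01 scale keeps the
    //shaping small relative to the +1 for actually reaching the goal.
    float currentDistance = Vector3.Distance(target.transform.position, this.transform.position);
    AddReward(0.01f * (previousDistance - currentDistance));
    previousDistance = currentDistance;
}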

Reward hacking could also become a problem with the way you have designed the rewards. If the agent cannot find the target to collect the larger reward, the most effective strategy is to fall off the platform as quickly as possible, so that it stops incurring the small penalty at every time step.
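One way to close that loophole, sketched under the assumption that the -1/MaxStep step penalty is kept (StepCount and MaxStep come from the ML-Agents Agent base class): also charge the fall with the step penalties the agent would be skipping, so bailing out early is never cheaper than staying on the platform.

//Hedged sketch: make falling at least as expensive as surviving the
//whole episode, so "jump off early" is never the reward-maximizing policy.
private void Update()
{
    if (this.transform.position.y < 0)
    {
        //Fall penalty, plus the per-step penalties the agent skips by
        //ending the episode early (assumes the -1/MaxStep step penalty).
        float skippedPenalty = MaxStep > 0 ? (float)(MaxStep - StepCount) / MaxStep : 0f;
        AddReward(-1.0f - skippedPenalty);
        EndEpisode();
    }
}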
