Ray autoscaling works with one instance type but not another

Problem description

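I am trying to scale my application on AWS EC2 using Ray. If I run my code with 'InstanceType: t2.micro' and a custom AMI, it works fine and autoscales up to my max_workers. (The AMI is Ubuntu 18.04 with Ray 1.3.0 installed.) In my code I have a Ray actor decorated with @ray.remote(num_cpu=1) that just computes pi for a while (10 seconds or so). When I change to 'InstanceType: p2.xlarge', it does not scale at all. I tried adding a resources section so that the autoscaler knows the instance has 4 CPUs, but that did not help (see the YAML below). I don't understand why it will not scale for p2.xlarge when it does for t2.micro. Any suggestions?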
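A minimal sketch of a workload of this kind, for readers who want to reproduce the setup. This is not the original actor code: the class name `PiEstimator` and the method `estimate` are hypothetical, and the sketch is plain Python so it runs without a cluster. With Ray, the class would carry the `@ray.remote(...)` decorator; Ray's documented keyword for CPU requirements is `num_cpus`, so the `num_cpu=1` quoted above may be a typo in the question.

```python
import time


class PiEstimator:
    """Hypothetical stand-in for the actor described above (no Ray needed)."""

    def estimate(self, seconds: float) -> float:
        """Sum the Leibniz series for roughly `seconds` of wall-clock time."""
        deadline = time.monotonic() + seconds
        total, k = 0.0, 0
        while True:
            # Add terms in batches so the clock is not read on every term.
            for _ in range(10_000):
                total += (-1.0) ** k / (2 * k + 1)
                k += 1
            if time.monotonic() >= deadline:
                break
        return 4.0 * total


if __name__ == "__main__":
    # A short budget for a quick local check; the question used about 10 s.
    print(PiEstimator().estimate(seconds=0.2))
```

Each call keeps one core busy for the requested time, which is what makes a queue of such actors a convenient way to generate pending CPU demand.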

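I also tried to get more debugging information out of the autoscaler. I added 'export RAY_BACKEND_LOG_LEVEL=debug' to the ray start commands, but as far as I can tell this did not produce any additional debug output.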

Here is the output from monitor.log, which tells me it cannot find a node type that works:

======== Autoscaler status: 2021-06-15 23:57:13.225737 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 4.0/4.0 cpu
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:K80
 0.00/35.759 GiB memory
 0.00/17.879 GiB object_store_memory

Demands:
 {'cpu': 1.0}: 95+ pending tasks/actors
2021-06-15 23:57:18,588 WARNING resource_demand_scheduler.py:713 -- The autoscaler Could not find a node type to satisfy the request: [{'cpu': 1.0},{'cpu': 1.0},{'cpu': 1.0},..... {'cpu': 1.0}]. If this request is related to placement groups the resource request will resolve itself,otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
2021-06-15 23:57:18,699 INFO autoscaler.py:309 --
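
When comparing status snapshots like the one above, it can help to pull the Usage section into a dictionary. The following is a small sketch using only the standard library. The helper names `parse_usage` and `saturated` are hypothetical, and the line format is assumed from this one log excerpt rather than from any documented Ray output format.

```python
import re

# Usage and Demands lines copied from the monitor.log excerpt above.
STATUS = """\
Usage:
 4.0/4.0 cpu
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:K80
 0.00/35.759 GiB memory
 0.00/17.879 GiB object_store_memory

Demands:
 {'cpu': 1.0}: 95+ pending tasks/actors
"""

# "<used>/<total> [GiB] <name>", as seen in the excerpt.
USAGE_LINE = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(?:GiB\s+)?(\S+)\s*$")


def parse_usage(status: str) -> dict:
    """Return {resource_name: (used, total)} from the Usage section."""
    usage = {}
    in_usage = False
    for line in status.splitlines():
        stripped = line.strip()
        if stripped == "Usage:":
            in_usage = True
            continue
        if in_usage:
            if not stripped:  # a blank line ends the section
                break
            match = USAGE_LINE.match(line)
            if match:
                used, total, name = match.groups()
                usage[name] = (float(used), float(total))
    return usage


def saturated(usage: dict) -> list:
    """Names of resources whose capacity is fully used."""
    return [n for n, (used, total) in usage.items() if total > 0 and used >= total]


usage = parse_usage(STATUS)
print(saturated(usage))  # ['cpu']
```

For this snapshot the only saturated resource is the one spelled `cpu`, the same name that appears in the pending demands.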

Here is my scaling.yaml:

cluster_name: scaling-test10
max_workers: 12
upscaling_speed: 4.0
idle_timeout_minutes: 10

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1c,us-east-1d,us-east-1e
  cache_stopped_nodes: True

available_node_types:
  ray.head.default:
    min_workers: 0
    max_workers: 0
    resources: {"cpu": 4,"GPU": 1}
    node_config:
      InstanceType: p2.xlarge
      ImageId: ami-05c6b7aac78a6e921
  ray.worker.default:
    min_workers: 0
    max_workers: 12
    resources: {"cpu": 4,"GPU": 1}
    node_config:
      InstanceType: p2.xlarge
      ImageId: ami-05c6b7aac78a6e921

auth:
  ssh_user: ubuntu

head_node_type: ray.head.default

setup_commands:
  - pip install -U ray==1.3.0

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
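
One way to sanity-check the `resources` entries in a config like this is to model how a demand might be matched against a node type. The sketch below is a deliberately simplified model and not Ray's actual scheduler code. The function names `fits` and `feasible_types` are hypothetical, and treating resource names as exact, case-sensitive dictionary keys is an assumption of the sketch. Ray's built-in resource names are conventionally written `CPU` and `GPU`, so both spellings are tried.

```python
# Resource dicts copied from available_node_types in the YAML above.
NODE_TYPES = {
    "ray.head.default": {"cpu": 4, "GPU": 1},
    "ray.worker.default": {"cpu": 4, "GPU": 1},
}


def fits(demand: dict, resources: dict) -> bool:
    """True if every requested name exists with at least the requested amount."""
    return all(resources.get(name, 0) >= amount for name, amount in demand.items())


def feasible_types(demand: dict, node_types: dict) -> list:
    """Node types whose declared resources could hold one such request."""
    return [name for name, res in node_types.items() if fits(demand, res)]


# The demand exactly as it appears in monitor.log.
print(feasible_types({"cpu": 1.0}, NODE_TYPES))
# The same demand spelled with the conventional upper-case name.
print(feasible_types({"CPU": 1.0}, NODE_TYPES))
```

Under this naive model the lower-case demand fits both node types and the upper-case one fits neither, so the spelling of resource names in the YAML and in the actor decorator is worth checking alongside the autoscaler's own logs.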


amazon-ec2 ray