Problem description
I don't know where this is coming from or why this error occurs:
The cluster starts fine with the yaml, but this error shows up when I look at the logs.
Does it still work despite the error? And how can I check the print output of my docker image?
Ray doesn't seem to have any working example to follow. I'm trying to launch the simplest possible AWS docker cluster as a proof of concept.
ray exec /home/user/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Fetched IP: xxxxxxxxx
Warning: Permanently added 'xxxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 854, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 390, in <module>
redis_password=args.redis_password)
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 111, in __init__
self.load_metrics)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 76, in __init__
self.reset(errors_fatal=True)
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 490, in reset
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 452, in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
==> /tmp/ray/session_latest/logs/monitor.log <==
==> /tmp/ray/session_latest/logs/monitor.out <==
Shared connection to 18.130.107.42 closed.
Error: Command Failed:
ssh -tt -i /home/joe/.ssh/aws_ubuntu_test.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ff32489f9/8dbdda48fb/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@xxxxxxxx bash --login -c -i ''"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it my_simple_docker_container /bin/bash -c '"'"'"'"'"'"'"'"'bash --login -c -i '"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"''"'"'"'"'"'"'"'"' )'"'"''
(base) xxxxx:~/RAY_AWS_DOCKER/3xxxxx/aws_docker_simple$ ray exec /home/xxxxxxxxx/unit/aws_docker_simple/simple.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xxxxxx
Warning: Permanently added 'xxxxxxxx' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", in reset
self.available_node_types = self.config["available_node_types"]
KeyError: 'available_node_types'
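The final KeyError is the real failure: the autoscaler reads the parsed cluster yaml with a bare dict lookup, so a config written for a different Ray version kills the monitor outright instead of degrading gracefully. A minimal sketch of that lookup (the dict below is illustrative, not the real parsed config):

```python
# Sketch of the failing lookup in the autoscaler's reset(): the parsed
# cluster yaml has no "available_node_types" key, so the bare [] access
# raises KeyError and takes the monitor process down with it.
config = {"cluster_name": "simple", "max_workers": 2}

try:
    node_types = config["available_node_types"]
except KeyError as err:
    print(f"monitor crashes here: missing key {err}")
    # → monitor crashes here: missing key 'available_node_types'
```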
Dockerfile:
FROM continuumio/miniconda3:4.7.10
# Only the last CMD in a Dockerfile takes effect, so create the folder
# at build time with RUN and keep CMD for the runtime message.
RUN mkdir hello_folder
CMD ["echo", "Hello StackOverflow!"]
yaml:
cluster_name: simple
min_workers: 0
max_workers: 2
docker:
    image: "xxxxxx/simple"
    container_name: "my_simple_docker_container"
    pull_before_run: True
idle_timeout_minutes: 5
initialization_commands:
    # - curl https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh --output anaconda.sh
    # - bash anaconda.sh
    # - conda install python=3.8
    - sudo apt-get update
    - sudo apt-get upgrade -y
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install discord
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
    file_mounts_sync_continuously: False
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxxfd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot
file_mounts: {
    # "/path1/on/remote/machine": "/path1/on/local/machine",
    # "/path2/on/remote/machine": "/path2/on/local/machine",
}
setup_commands:
    - conda install python=3.7
    - conda create --name ray
    - conda activate ray
    - conda install --name ray pip
    - pip install --upgrade pip
    - pip install discord
    - pip install ray
head_setup_commands:
    - pip install boto3==1.4.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Solution
This is caused by the Ray version. For example, if you run pip install ray==1.0, it works.
A better fix is to make sure the Ray version on the head node is the same as the one on your local machine.
You can check the version with:
ray --version
both locally and on the cluster, which you can reach with:
ray attach config.yaml
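Rather than installing a matching version by hand on the head node, you can pin Ray in the cluster yaml's setup_commands so every node gets the same version as your local machine (a sketch; 1.0 below is only an example, substitute whatever ray --version reports locally):

```yaml
setup_commands:
    - pip install --upgrade pip
    - pip install ray==1.0   # example pin; use the version `ray --version` prints locally
```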