SageMaker in a local Jupyter notebook: cannot use the AWS-hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")

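Problem description

Running SageMaker in a local Jupyter notebook (using VS Code) works fine, except that training an XGBoost model with the AWS-hosted container (container name 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3) results in errors.

Jupyter notebook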

import os

import sagemaker

session = sagemaker.LocalSession()

# Load and prepare the training and validation data
# (data_dir and prefix are assumed to be defined in this omitted step)
...

# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

# Look up the AWS-hosted XGBoost image for the session's region
region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)

role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'

xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output',
    sagemaker_session=session)

xgb_estimator.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective='reg:squarederror',
    early_stopping_rounds=10,
    num_round=200)

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

Docker Container KeyError

algo-1-tfcvc_1  | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1  | ERROR:sagemaker-containers:framework error: 
algo-1-tfcvc_1  | Traceback (most recent call last):
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1  |     entrypoint()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1  |     train(framework.training_env())
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1  |     run_algorithm_mode()
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1  |     checkpoint_config=checkpoint_config
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1  |     validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1  |     channel_obj.validate(value)
algo-1-tfcvc_1  |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1  |     if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1  | KeyError: 'S3DistributionType'
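
The last frame of the traceback indexes each channel's configuration by three keys at once, so a channel that lacks the distribution-type key fails before any support check happens. The following is a minimal sketch of that failure, not the container's actual code: the constant values, the `SUPPORTED` set, and the two example channel dictionaries are assumptions made for illustration, including the assumption that the locally generated channel configuration carries no 'S3DistributionType' entry.

```python
# Illustrative only: the constant names mirror the traceback; their string
# values are assumed rather than copied from the container source.
CONTENT_TYPE = "ContentType"
TRAINING_INPUT_MODE = "TrainingInputMode"
S3_DIST_TYPE = "S3DistributionType"

# Hypothetical set of supported (content type, input mode, distribution) tuples.
SUPPORTED = {
    ("csv", "File", "FullyReplicated"),
    ("csv", "File", "ShardedByS3Key"),
}


def validate_channel(value):
    """Index the channel config the same way the failing line does."""
    key = (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])
    return key in SUPPORTED


# Assumed shape for a hosted training job: all three keys are present.
hosted_channel = {
    "ContentType": "csv",
    "TrainingInputMode": "File",
    "S3DistributionType": "FullyReplicated",
}

# Assumed shape for local mode: the distribution type is missing.
local_channel = {"ContentType": "csv", "TrainingInputMode": "File"}

print(validate_channel(hosted_channel))

try:
    validate_channel(local_channel)
except KeyError as err:
    # The dictionary lookup raises before the membership test runs.
    print(f"KeyError: {err}")
```

Under these assumptions the hosted-style channel passes and the local-style channel raises the same KeyError seen in the container log.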

Local PC RuntimeError

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

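If the Jupyter notebook is run in the Amazon cloud SageMaker environment (rather than on a local PC), there are no errors. Note that when running on the cloud notebook, the session is initialized as: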

session = sagemaker.Session()

There appears to be an issue with how LocalSession() works with the hosted Docker container.

Solution

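When SageMaker is run from a local Jupyter notebook with a LocalSession, it expects the Docker container to run on the local machine as well.

The key to making SageMaker (running in a local notebook) use the AWS-hosted Docker container is to omit the LocalSession object when initializing the Estimator.

Wrong

xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output',
    sagemaker_session=session)

Correct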

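# Same arguments as before, but without sagemaker_session=session
xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output')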
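
One way to keep this choice explicit in a notebook is to build the keyword arguments in a single place and add `sagemaker_session` only on request. This is a rough sketch under stated assumptions: `build_estimator_kwargs` is a hypothetical helper, not part of the SageMaker SDK, and the bucket and prefix in the example path are placeholders.

```python
def build_estimator_kwargs(instance_type, output_path, session=None):
    """Collect Estimator keyword arguments (hypothetical helper, not SDK code)."""
    kwargs = {
        "train_instance_count": 1,
        "train_instance_type": instance_type,
        "output_path": output_path,
    }
    # Forward a session only when the caller passes one explicitly; leaving it
    # out lets the estimator fall back to its own default session.
    if session is not None:
        kwargs["sagemaker_session"] = session
    return kwargs


# Placeholder output path; substitute a real bucket and prefix.
hosted_kwargs = build_estimator_kwargs(
    "ml.m4.xlarge", "s3://example-bucket/example-prefix/output"
)
print("sagemaker_session" in hosted_kwargs)
```

The resulting dictionary could then be unpacked into the call from the answer, for example `sagemaker.estimator.Estimator(container, role, **hosted_kwargs)`.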

Additional information

The SageMaker Python SDK source code provides the following helpful hints:

File: sagemaker/local/local_session.py

class LocalSagemakerClient(object):
    """A SageMakerClient that implements the API calls locally.

    Used for doing local training and hosting local endpoints. It still needs access to
    a boto client to interact with S3 but it won't perform any SageMaker call.
    ...

File: sagemaker/estimator.py

class EstimatorBase(with_metaclass(ABCMeta, object)):
    """Handle end-to-end Amazon SageMaker training and deployment tasks.

    For introduction to model training and deployment, see
    http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

    Subclasses must define a way to determine what image to use for training, what hyperparameters to use, and how to create an appropriate predictor instance.
    """

    def __init__(self, role, train_instance_count, train_instance_type, train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File', output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
        """Initialize an ``EstimatorBase`` instance.

        Args:
            role (str): An AWS IAM role (either name or full ARN). ...
            
        ...

            sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
                Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
                using the default AWS configuration chain.
        """
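
The docstring's last sentence describes a default-argument fallback: an explicit session is used as given, and a missing one is replaced by a default. A minimal sketch of that pattern follows, using hypothetical stand-in classes (`HostedSessionStub`, `LocalSessionStub`, `EstimatorSketch`) rather than the SDK's own; the SDK's real implementation may differ in detail.

```python
class HostedSessionStub:
    """Hypothetical stand-in for a session that calls the SageMaker service."""
    runs_locally = False


class LocalSessionStub:
    """Hypothetical stand-in for a session that runs containers locally."""
    runs_locally = True


class EstimatorSketch:
    """Toy estimator showing only the session fallback, nothing else."""

    def __init__(self, sagemaker_session=None):
        # Use the caller's session if given; otherwise create a default one.
        if sagemaker_session is None:
            sagemaker_session = HostedSessionStub()
        self.sagemaker_session = sagemaker_session


default_estimator = EstimatorSketch()
local_estimator = EstimatorSketch(sagemaker_session=LocalSessionStub())

print(default_estimator.sagemaker_session.runs_locally)
print(local_estimator.sagemaker_session.runs_locally)
```

In this sketch, omitting the argument yields the hosted-style session, which matches the behavior the answer relies on when it drops `sagemaker_session=session`.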