Problem description
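I have a CloudFormation template with 6 resources for an AWS Batch POC.
Each AWS::IAM::Role has the managed policy "arn:aws:iam::aws:policy/AdministratorAccess" attached (to rule out permission problems).
The roles used are the three AWS::IAM::Role resources in the template below.
But even with the "arn:aws:iam::aws:policy/AdministratorAccess" policy, when I run a job I get: "CannotPullContainerError: Error response from daemon: Get https://********.dkr.ecr.eu-west-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)".
Disclaimer: everything is FARGATE (compute environment and jobs), not EC2.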
AWSTemplateFormatVersion: '2010-09-09'
Description: Creates a POC AWS Batch environment.
Parameters:
  Environment:
    Type: String
    Description: 'Environment Name'
    Default: TEST
  Subnets:
    Type: List<AWS::EC2::Subnet::Id>
    Description: 'List of subnets to boot into'
  ImageName:
    Type: String
    Description: 'Name and tag of Process Container Image'
    Default: 'upload:6.0.0'
Resources:
  BatchServiceRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchServiceRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: 'Allow'
            Principal:
              Service: 'batch.amazonaws.com'
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchContainerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchContainerRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: 'Allow'
            Principal:
              Service:
                - 'ecs-tasks.amazonaws.com'
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchJobRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchJobRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: 'Allow'
            Principal:
              Service: 'ecs-tasks.amazonaws.com'
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchCompute:
    Type: "AWS::Batch::ComputeEnvironment"
    Properties:
      ComputeEnvironmentName: DemoContentInput
      ComputeResources:
        MaxvCpus: 256
        SecurityGroupIds:
          - sg-0b33333333333333
        Subnets: !Ref Subnets
        Type: FARGATE
      ServiceRole: !Ref BatchServiceRole
      State: ENABLED
      Type: MANAGED
  Queue:
    Type: "AWS::Batch::JobQueue"
    DependsOn: BatchCompute
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment: DemoContentInput
          Order: 1
      Priority: 1
      State: "ENABLED"
      JobQueueName: DemoContentInput
  ContentInputJob:
    Type: "AWS::Batch::JobDefinition"
    Properties:
      Type: Container
      ContainerProperties:
        Command:
          - -v
          - process
          - new-file
          - -o
          - s3://contents/{content_id}/{content_id}.mp4
        Environment:
          - Name: SECRETS
            Value: !Join [':', ['{{resolve:secretsmanager:common.secrets:SecretString:aws_access_key_id}}', '{{resolve:secretsmanager:common.secrets:SecretString:aws_secret_access_key}}']]
          - Name: APPLICATION
            Value: upload
          - Name: API_KEY
            Value: '{{resolve:secretsmanager:common.secrets:SecretString:fluzo.api_key}}'
          - Name: CLIENT
            Value: upload-container
          - Name: ENVIRONMENT
            Value: !Ref Environment
          - Name: SETTINGS
            Value: !Join [':', ['{{resolve:secretsmanager:common.secrets:SecretString:aws_secret_access_key}}', 'upload-container']]
        ExecutionRoleArn: 'arn:aws:iam::**********:role/DemoBatchJobRole'
        Image: !Join ['', [!Ref 'AWS::AccountId', '.dkr.ecr.', !Ref 'AWS::Region', '.amazonaws.com/', !Ref ImageName]]
        JobRoleArn: !Ref BatchContainerRole
        ResourceRequirements:
          - Type: VCPU
            Value: 1
          - Type: MEMORY
            Value: 2048
      JobDefinitionName: DemoContentInput
      PlatformCapabilities:
        - FARGATE
      RetryStrategy:
        Attempts: 1
      Timeout:
        AttemptDurationSeconds: 600
In AWS::Batch::JobDefinition > ContainerProperties > ExecutionRoleArn I hardcoded the ARN, because if I write !Ref BatchJobRole I get an error. But that is not the point of this question.
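On that side issue: for an AWS::IAM::Role, !Ref returns the role name, not its ARN, which is probably why !Ref BatchJobRole was rejected for ExecutionRoleArn. Fn::GetAtt with the Arn attribute returns the ARN and also makes the job definition depend on the role, so no hardcoding is needed. A minimal sketch, assuming the role's logical ID is BatchJobRole as in the template:

```yaml
# Fragment of ContentInputJob > Properties > ContainerProperties (assumed placement).
# !Ref on an AWS::IAM::Role yields the role name; !GetAtt <LogicalId>.Arn yields the ARN.
ExecutionRoleArn: !GetAtt BatchJobRole.Arn
# The same reasoning applies to the job role, which also expects an ARN.
JobRoleArn: !GetAtt BatchContainerRole.Arn
```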
The question is: how do I avoid "CannotPullContainerError: Error response from daemon: Get https://********.dkr.ecr.eu-west-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" when I run a job?
Solution
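It sounds like you cannot reach the internet from inside your subnets, so the job cannot reach the ECR endpoint to pull the image.
Make sure that:
- An internet gateway is attached to your VPC (if there isn't one, create one, even if you only use a NAT gateway for egress).
- The route table associated with your subnets has a default route (0.0.0.0/0) to the internet gateway, or to a NAT gateway with an attached Elastic IP.
- The attached security group has rules allowing outbound internet traffic (0.0.0.0/0) on the ports and protocols you need (for example 80/HTTP, 443/HTTPS).
- The network access control list (network ACL) associated with the subnets has rules allowing both outbound traffic to and inbound traffic from the internet (network ACLs are stateless, so return traffic must be allowed explicitly).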
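The first three checks can be expressed as CloudFormation resources. This is a minimal sketch under stated assumptions, not a drop-in fix: VpcId and PublicSubnetId are hypothetical parameters (PublicSubnetId standing for one of the subnets passed in Subnets), and it takes the internet-gateway route; the NAT-gateway variant would instead set the default route's NatGatewayId to an AWS::EC2::NatGateway that lives in a public subnet. Because the jobs run on Fargate, one extra detail applies when the route goes straight to an internet gateway: the task itself needs a public IP, and Batch Fargate job definitions only assign one when NetworkConfiguration.AssignPublicIp is ENABLED (the default is DISABLED).

```yaml
# Sketch only: VpcId and PublicSubnetId are hypothetical parameters,
# not part of the original template.
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  PublicSubnetId:
    Type: AWS::EC2::Subnet::Id
Resources:
  # Check 1: an internet gateway attached to the VPC.
  InternetGateway:
    Type: 'AWS::EC2::InternetGateway'
  GatewayAttachment:
    Type: 'AWS::EC2::VPCGatewayAttachment'
    Properties:
      VpcId: !Ref VpcId
      InternetGatewayId: !Ref InternetGateway
  # Check 2: the subnet's route table sends 0.0.0.0/0 to the gateway.
  PublicRouteTable:
    Type: 'AWS::EC2::RouteTable'
    Properties:
      VpcId: !Ref VpcId
  DefaultRoute:
    Type: 'AWS::EC2::Route'
    DependsOn: GatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway
  PublicSubnetRouteTableAssociation:
    Type: 'AWS::EC2::SubnetRouteTableAssociation'
    Properties:
      SubnetId: !Ref PublicSubnetId
      RouteTableId: !Ref PublicRouteTable
  # Check 3: outbound HTTPS, which the ECR request in the error message uses.
  BatchSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: 'Outbound HTTPS for pulling images from ECR'
      VpcId: !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
# Fargate behind an internet-gateway route: the job definition also needs a public IP,
# under ContentInputJob > Properties > ContainerProperties:
#   NetworkConfiguration:
#     AssignPublicIp: ENABLED
```

If the subnet is already associated with another route table, adding the default route to that table is simpler than re-associating it. The new group would replace sg-0b33333333333333 in SecurityGroupIds via !Ref BatchSecurityGroup, which returns the group ID for a VPC security group. The fourth check is not shown: a VPC's default network ACL already allows all traffic in both directions, so it only matters if a custom ACL is in place.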
References:
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-connect-internet-gateway/