AWS EMR: adding a Spark application as a step

Problem description

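I want to add a Spark application as a step using the AWS CLI, but I can't find a working command. The official AWS documentation (https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html) lists six examples, and none of them is for Spark. I can configure the step through the AWS console UI and it runs fine, but for efficiency I'd like to be able to do it through the AWS CLI.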

The closest I have come up with is this command:

aws emr add-steps --cluster-id j-cluster-id --steps  Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2

But it produces the following step configuration in AWS EMR:

JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue

The step then fails.

The correct configuration (set up through the AWS console UI, and completing successfully) looks like this:

JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue

Any help is much appreciated!

Solution

This seems to work for me. I am adding a Spark application to the cluster as a step named My step name. Suppose you name the file step-addition.sh; its contents are as follows:

#!/bin/bash
set -x  # echo each command as it runs, for debugging

# Positional arguments: cluster ID, then the date range passed to the job
clusterId=$1
startDate=$2
endDate=$3

# With Type=Spark, EMR hands Args to spark-submit in order:
# spark-submit options first, the application JAR last.
# Each Spark property needs its own --conf flag.
aws emr add-steps --cluster-id "$clusterId" --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]

You can then run the script like this:

bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
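
Applied to the values from the question, the same pattern might look like the sketch below. It rests on the behaviour visible in the question's output: for a Spark step, EMR runs command-runner.jar with spark-submit followed by whatever is in Args, so the JAR and main class have to go inside Args rather than in Jar= and MainClass=. The function name build_spark_step and the DRY_RUN variable are made up for this illustration, and the two application arguments are copied from the question's CLI attempt without the leading dashes, so adjust them to whatever the application actually expects. By default the script only prints the step so it can be checked before anything is submitted.

```shell
#!/bin/bash
# Sketch only: build_spark_step and DRY_RUN are hypothetical names, not part of the AWS CLI.

# Build the shorthand value for --steps.
# $1 = application JAR, $2 = main class, remaining arguments = application arguments.
build_spark_step() {
  local jar=$1 main_class=$2
  shift 2
  # Order matters: spark-submit options, then the JAR, then the application's own arguments.
  local args="--deploy-mode,cluster,--class,${main_class},${jar}"
  local a
  for a in "$@"; do
    args="${args},${a}"
  done
  printf 'Type=Spark,Name=SPARK APP,ActionOnFailure=CONTINUE,Args=[%s]' "$args"
}

# Values taken from the question.
step=$(build_spark_step \
  "s3://my-test/RandomJava-1.0-SNAPSHOT.jar" \
  "sparkExamples.JavaParquetExample1" \
  "s3://my-test/my-file_0000_part_00.parquet" "my-test")

# Always show what would be submitted.
echo "$step"

# Submit only when DRY_RUN=0 is set explicitly; the first script argument is the cluster ID.
if [ "${DRY_RUN:-1}" = "0" ]; then
  aws emr add-steps --cluster-id "${1:-}" --steps "$step" --profile my-test --region us-west-2
fi
```

Running it with no environment set prints the step definition; running it as DRY_RUN=0 bash script.sh j-cluster-id would submit it. Because the shorthand syntax separates list items with commas, an argument that itself contains a comma would need a different approach, such as passing the steps as JSON.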

amazon-emr amazon-web-services apache-spark aws-cli aws-emr