The HDInsight/Spark activity in Azure Data Factory v2 has no option to specify the --files parameter for spark-submit

Problem description

I have created an HDInsight cluster (v4, Spark 2.4) in Azure and want to run a Spark .NET application on this cluster via an Azure Data Factory v2 activity. In the Spark activity you can specify the path to the jar, the --class parameter, and arguments to pass to the Spark application; at run time, the arguments are automatically prefixed with "--args". But it is also necessary to be able to set "--files", because that flag tells spark-submit which files need to be deployed to the worker nodes. In this case it is used to distribute the DLLs that contain the UDF definitions, and Spark cannot run the job without them. Since UDFs are a key component of Spark applications, I would expect this to be possible.

(Screenshot: Spark Activity setup)
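In case the screenshot does not render, here is a minimal sketch (untested, with placeholder activity and linked-service names) of how this job maps onto the HDInsightSpark activity's typeProperties. Note that the schema offers rootPath, entryFilePath, className, and arguments, but nothing that maps onto --files:

{
    "name": "SparkNetJob",
    "type": "HDInsightSpark",
    "typeProperties": {
        "rootPath": "xxx/SparkJobs",
        "entryFilePath": "microsoft-spark-2.4.x-0.12.1.jar",
        "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
        "arguments": [
            "wasbs://xxx@yyy.blob.core.windows.net/SparkJobs/publish.zip",
            "mySparkApp"
        ],
        "sparkJobLinkedService": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        }
    },
    "linkedServiceName": {
        "referenceName": "HDInsight",
        "type": "LinkedServiceReference"
    }
}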

If I SSH into the cluster and run the spark-submit command directly with the --files parameter, the Spark application runs correctly, because the files are distributed to the worker nodes:

spark-submit --deploy-mode cluster --master yarn \
  --files wasbs://xxx@yyy.blob.core.windows.net/SparkJobs/mySparkApp.dll \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  wasbs://xxx@yyy.blob.core.windows.net/SparkJobs/microsoft-spark-2.4.x-0.12.1.jar \
  wasbs://xxx@yyy.blob.core.windows.net/SparkJobs/publish.zip mySparkApp

These are the guides I followed:

  1. https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries
  2. https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/hdinsight-deploy-methods
  3. https://docs.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment

Solution

You can pass arguments/parameters to a PySpark script in Azure Data Factory as shown below:


Code:

{
    "name": "SparkActivity",
    "properties": {
        "activities": [
            {
                "name": "Spark1",
                "type": "HDInsightSpark",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "rootPath": "adftutorial/spark/script",
                    "entryFilePath": "WordCount_Spark.py",
                    "arguments": [
                        "--input-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/data",
                        "--output-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/results"
                    ],
                    "sparkJobLinkedService": {
                        "referenceName": "AzureBlobStorage1",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "HDInsight",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
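Beyond plain arguments, the HDInsightSpark activity's typeProperties also accepts a sparkConfig map of Spark configuration properties. The following is an untested sketch, not a confirmed fix: on YARN, the spark.yarn.dist.files property is generally the configuration-level equivalent of the --files flag, so adding a fragment like this inside the activity's typeProperties (with the DLL path taken from the question) might achieve the same distribution:

    "sparkConfig": {
        "spark.yarn.dist.files": "wasbs://xxx@yyy.blob.core.windows.net/SparkJobs/mySparkApp.dll"
    }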

(Screenshots: how to pass parameters in the ADF authoring UI)

Some examples of passing arguments in Azure Data Factory (note that this example uses the older ADF v1 pipeline format, where a generic HDInsightMapReduce activity launches the Spark job via a launcher jar):

{
    "name": "SparkSubmit",
    "properties": {
        "description": "Submit a spark job",
        "activities": [
            {
                "type": "HDInsightMapReduce",
                "typeProperties": {
                    "className": "com.adf.spark.SparkJob",
                    "jarFilePath": "libs/spark-adf-job-bin.jar",
                    "jarLinkedService": "StorageLinkedService",
                    "arguments": [
                        "--jarFile",
                        "libs/sparkdemoapp_2.10-1.0.jar",
                        "--jars",
                        "/usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
                        "--mainClass",
                        "com.adf.spark.demo.Demo",
                        "--master",
                        "yarn-cluster",
                        "--driverMemory",
                        "2g",
                        "--driverExtraClasspath",
                        "/usr/lib/hdinsight-logging/*",
                        "--executorCores",
                        "1",
                        "--executorMemory",
                        "4g",
                        "--sparkHome",
                        "/usr/hdp/current/spark-client",
                        "--connectionString",
                        "DefaultEndpointsProtocol=https;AccountName=<YOUR_ACCOUNT>;AccountKey=<YOUR_KEY>",
                        "input=wasb://input@<YOUR_ACCOUNT>.blob.core.windows.net/data",
                        "output=wasb://output@<YOUR_ACCOUNT>.blob.core.windows.net/results"
                    ]
                },
                "inputs": [
                    {
                        "name": "input"
                    }
                ],
                "outputs": [
                    {
                        "name": "output"
                    }
                ],
                "policy": {
                    "executionPriorityOrder": "OldestFirst",
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Spark Launcher",
                "description": "Submits a Spark Job",
                "linkedServiceName": "HDInsightLinkedService"
            }
        ],
        "start": "2015-11-16T00:00:01Z",
        "end": "2015-11-16T23:59:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
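For reference, the launcher invocation above corresponds roughly to the following spark-submit command (a sketch reconstructed from the arguments list, with the launcher's option names translated to their spark-submit equivalents):

spark-submit --master yarn-cluster \
  --class com.adf.spark.demo.Demo \
  --driver-memory 2g \
  --driver-class-path "/usr/lib/hdinsight-logging/*" \
  --executor-cores 1 \
  --executor-memory 4g \
  --jars /usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar \
  libs/sparkdemoapp_2.10-1.0.jar \
  input=wasb://input@<YOUR_ACCOUNT>.blob.core.windows.net/data \
  output=wasb://output@<YOUR_ACCOUNT>.blob.core.windows.net/results

Because the launcher jar builds the spark-submit call itself, this pattern is one place where a --files flag could in principle be injected, although whether com.adf.spark.SparkJob exposes an option for it is not documented here.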

Have you tried uploading the files to the files folder under the storage referenced by sparkJobLinkedService? According to the Spark activity documentation (see the link below), "all files under the files folder are uploaded and placed on the executor working directory", so I uploaded publish.zip to the files folder, and after that my Spark .NET job appears to run.

For example, for the ADF Spark activity, microsoft-spark-2-4_2.11-1.0.0.jar is stored under /binary/spark/ in my storage account, following the layout described here:

https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-spark

Then publish.zip is uploaded to /binary/spark/files/.

(Screenshot: spark activity)

(Screenshot: storage folder)
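In other words, the storage layout that gets the UDF assemblies onto the executors looks like this, with the Spark activity's rootPath presumably pointing at binary/spark:

binary/spark/
├── microsoft-spark-2-4_2.11-1.0.0.jar
└── files/
    └── publish.zip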
