Unable to run spark-submit on a Spark standalone cluster

Problem description

I am building a Spark standalone cluster using the following docker-compose file:

---
# ----------------------------------------------------------------------------------------
# -- Docs: https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker --
# ----------------------------------------------------------------------------------------
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: andreper/jupyterlab:3.0.0-spark-3.0.0
    container_name: jupyterlab
    ports:
      - 8888:8888
      - 4040:4040
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: andreper/spark-master:3.0.0
    container_name: spark-master
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: andreper/spark-worker:3.0.0
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8081:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: andreper/spark-worker:3.0.0
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8082:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master

I followed this guide: https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445

The GitHub repository can be found here: https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker

I can run the cluster, and I can run code inside the Jupyter container and connect to the Spark master node without problems.

The problems start when I want to run Spark code with spark-submit. I can't really understand how the cluster works. When I'm inside the Jupyter container I can see the scripts I've created right away, but I can't find them in the spark-master container. If I check the docker-compose.yml, the volumes section says the folder where the scripts are stored is:

volumes:
  - shared-workspace:/opt/workspace

But I can't find this folder in any of the Spark containers.
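A minimal way to check whether the volume is actually shared, assuming the mount paths from the compose file above (the marker file name below is just an example of mine): run this inside the jupyterlab container, then look for the same file under /opt/workspace in spark-master.

# Run inside the jupyterlab container: write a marker file into the shared volume.
# If the volume is wired up, the same file should also appear at /opt/workspace
# inside spark-master, spark-worker-1 and spark-worker-2.
from pathlib import Path

marker = Path("/opt/workspace/volume-check.txt")  # example file name
marker.write_text("written from the jupyterlab container\n")
print(sorted(p.name for p in Path("/opt/workspace").iterdir()))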

When I run spark-submit, I run it from inside the Jupyter container, where I have all the scripts I'm working with. But when I write the command spark-submit --master spark://spark-master:7077 <PATH to my python script>, I have a doubt about the path to the Python script: is it the path where the script lives inside the Jupyter container, or inside the spark-master container?

I can run the spark-submit command without specifying a master, in which case it runs locally, and then it runs inside the Jupyter container without problems.
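For comparison, omitting --master (with nothing else configured) effectively means a local master, where the driver and the executors all run inside whichever container launched them, so script paths are resolved there. A minimal sketch of that local setup, with an app name I made up:

# Local mode: driver and executors share one JVM in the launching container,
# which is why paths inside the Jupyter container resolve without trouble.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # roughly what happens when --master is omitted
    .appName("pyspark-4-local")  # example name
    .getOrCreate()
)
spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"]).show()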

Here is the Python code I am executing:

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from os.path import expanduser, join, abspath

# warehouse_location was not defined in the original snippet; the log below
# shows it resolving to /opt/workspace/spark-warehouse on the shared volume
warehouse_location = abspath("/opt/workspace/spark-warehouse")

sparkConf = SparkConf()
sparkConf.setMaster("spark://spark-master:7077")
sparkConf.setAppName("pyspark-4")
sparkConf.set("spark.executor.memory", "2g")
sparkConf.set("spark.driver.memory", "2g")
sparkConf.set("spark.executor.cores", "1")
sparkConf.set("spark.driver.cores", "1")
sparkConf.set("spark.dynamicAllocation.enabled", "false")
sparkConf.set("spark.shuffle.service.enabled", "false")
sparkConf.set("spark.sql.warehouse.dir", warehouse_location)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()  # was getorCreate()
sc = spark.sparkContext

df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"],  # add your column names here
)
df.show()  # show() already prints the DataFrame; wrapping it in print() just prints None

But when I specify the master with --master spark://spark-master:7077 and give the path the script has inside the Jupyter container:

spark-submit --master spark://spark-master:7077 test.py

and this is the log I get:

21/06/06 21:32:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/06/06 21:32:08 INFO SparkContext: Running Spark version 3.0.0
21/06/06 21:32:09 INFO ResourceUtils: ==============================================================
21/06/06 21:32:09 INFO ResourceUtils: Resources for spark.driver:

21/06/06 21:32:09 INFO ResourceUtils: ==============================================================
21/06/06 21:32:09 INFO SparkContext: Submitted application: pyspark-4
21/06/06 21:32:09 INFO SecurityManager: Changing view acls to: root
21/06/06 21:32:09 INFO SecurityManager: Changing modify acls to: root
21/06/06 21:32:09 INFO SecurityManager: Changing view acls groups to: 
21/06/06 21:32:09 INFO SecurityManager: Changing modify acls groups to: 
21/06/06 21:32:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/06/06 21:32:12 INFO Utils: Successfully started service 'sparkDriver' on port 45627.
21/06/06 21:32:12 INFO SparkEnv: Registering MapOutputTracker
21/06/06 21:32:13 INFO SparkEnv: Registering BlockManagerMaster
21/06/06 21:32:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/06/06 21:32:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/06/06 21:32:13 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/06/06 21:32:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-5a81855c-3160-49a5-b9f9-9cdfe6e5ca62
21/06/06 21:32:14 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/06/06 21:32:14 INFO SparkEnv: Registering OutputCommitCoordinator
21/06/06 21:32:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/06/06 21:32:16 INFO SparkUI: Bound SparkUI to 0.0.0.0,and started at http://3b232f9ed93b:4040
21/06/06 21:32:19 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
21/06/06 21:32:20 INFO TransportClientFactory: Successfully created connection to spark-master/172.21.0.5:7077 after 284 ms (0 ms spent in bootstraps)
21/06/06 21:32:23 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20210606213223-0000
21/06/06 21:32:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46539.
21/06/06 21:32:23 INFO NettyBlockTransferService: Server created on 3b232f9ed93b:46539
21/06/06 21:32:23 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/06/06 21:32:23 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver,3b232f9ed93b,46539,None)
21/06/06 21:32:23 INFO BlockManagerMasterEndpoint: Registering block manager 3b232f9ed93b:46539 with 366.3 MiB RAM,BlockManagerId(driver,None)
21/06/06 21:32:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver,None)
21/06/06 21:32:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver,None)
21/06/06 21:32:25 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
21/06/06 21:32:29 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/opt/workspace/spark-warehouse').
21/06/06 21:32:29 INFO SharedState: Warehouse path is '/opt/workspace/spark-warehouse'.
ESTOY AQUI¿¿
21/06/06 21:33:09 INFO CodeGenerator: Code generated in 1925.0009 ms
21/06/06 21:33:09 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
21/06/06 21:33:09 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
21/06/06 21:33:09 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
21/06/06 21:33:09 INFO DAGScheduler: Parents of final stage: List()
21/06/06 21:33:09 INFO DAGScheduler: Missing parents: List()
21/06/06 21:33:10 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0),which has no missing parents
21/06/06 21:33:10 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KiB,free 366.3 MiB)
21/06/06 21:33:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.9 KiB,free 366.3 MiB)
21/06/06 21:33:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 3b232f9ed93b:46539 (size: 5.9 KiB,free: 366.3 MiB)
21/06/06 21:33:11 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1200
21/06/06 21:33:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
21/06/06 21:33:11 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
21/06/06 21:33:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/06/06 21:33:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/06/06 21:33:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/06/06 21:34:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/06/06 21:34:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

When I execute the same code in a Jupyter notebook, it works fine.

Is this because the path I have to point the script at is the path where the script lives on the spark-master node? Or am I mixing things up here?
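One detail that may matter, independent of the path question: the compose file above caps each worker at SPARK_WORKER_CORES=1 and SPARK_WORKER_MEMORY=512m, while the script requests 2g for spark.executor.memory, and "Initial job has not accepted any resources" is exactly the warning that appears when no worker can grant a request. A minimal sketch of a configuration sized to fit those workers (whether this is the actual cause here is my assumption):

# Sketch: keep the executor request within what each worker advertises
# (SPARK_WORKER_CORES=1, SPARK_WORKER_MEMORY=512m in the compose file above).
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster("spark://spark-master:7077")
conf.setAppName("pyspark-4-sized-to-workers")  # example name
conf.set("spark.executor.memory", "512m")  # was 2g, more than a 512m worker can offer
conf.set("spark.executor.cores", "1")

spark = SparkSession.builder.config(conf=conf).getOrCreate()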
