dask-yarn script fails with distributed.scheduler.KilledWorker and empty workers

Problem description

I am trying to run the following code to test dask-yarn on an HDInsight Spark cluster with two head nodes and two worker nodes, each with 4 cores and 16 GB of RAM. The dataset has 100K records and is under 50 MB.

import os
os.environ['ARROW_LIBHDFS_DIR'] = '/usr/hdp/4.1.4.0/'  # so pyarrow can locate libhdfs

from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.dataframe as dd

# deploy scheduler and workers on YARN from a packed conda environment
cluster = YarnCluster(environment='conf/conda_envs/dask_yarn.tar.gz')

cluster.scale(1)  # request a single worker

client = Client(cluster)

path = 'hdfs:///samples/data_100K_dask_casted/data_100K_dask_casted'

df = dd.read_parquet(path)
print(df.count().compute())

However, every time I get the same exception:

WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
21/04/29 23:52:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/04/29 23:52:56 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
21/04/29 23:52:57 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1,rm2]
21/04/29 23:52:57 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.16:10200
21/04/29 23:52:58 INFO skein.Driver: Driver started, listening on 32935
21/04/29 23:52:59 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/4.1.4.0/0/resource-types.xml
21/04/29 23:53:00 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1,rm2]...
21/04/29 23:53:00 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
21/04/29 23:53:00 INFO skein.Driver: Uploading application resources to hdfs://mycluster/user/sshuser/.skein/application_1619736095085_0008
21/04/29 23:53:05 INFO skein.Driver: Submitting application...
21/04/29 23:53:05 INFO impl.YarnClientImpl: Submitted application application_1619736095085_0008
/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/fsspec/implementations/hdfs.py:49: FutureWarning: pyarrow.hdfs.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
  pahdfs = HadoopFileSystem(
21/04/29 23:53:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/04/29 23:53:26 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Traceback (most recent call last):
  File "cluster_dask_test.py", line 17, in <module>
    print(df.count().compute())
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 2666, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 1975, in gather
    return self.sync(
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
    return sync(
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 1840, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ("('dataframe-count-chunk-read-parquet-dataframe-count-agg-c6aec6008fdc93d0dd64a09d3e5956c3', 0)", <Worker 'tcp://10.0.0.4:45891', name: dask.worker_0, memory: 0, processing: 1>)
Exception ignored in: <function YarnCluster.__del__ at 0x7f08a4c12550>
Traceback (most recent call last):
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask_yarn/core.py", line 788, in __del__
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask_yarn/core.py", line 780, in close
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask_yarn/core.py", line 771, in shutdown
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/utils.py", line 465, in stop
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/utils.py", line 480, in _stop_unlocked
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/utils.py", line 489, in _real_stop
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 321, in close
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 140, in close
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/asyncio/unix_events.py", line 58, in close
  File "/home/sshuser/miniconda3/envs/dask_yarn/lib/python3.8/asyncio/selector_events.py", line 89, in close
RuntimeError: Cannot close a running event loop
distributed.core - WARNING - rpc object <rpc to 'tcp://10.0.0.4:39065', 1 comms> deleted with 1 open comms

The YARN logs contain the following:

WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
21/04/30 00:17:52 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1,rm2]
21/04/30 00:17:52 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.16:10200
21/04/30 00:17:52 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1,rm2]...
21/04/30 00:17:52 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
21/04/30 00:17:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
21/04/30 00:17:53 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e03_1619736095085_0009_01_000001 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:application.master.log
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:3316
LogContents:
21/04/30 00:16:50 INFO skein.ApplicationMaster: Starting Skein version 0.8.1
21/04/30 00:16:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/04/30 00:16:50 INFO skein.ApplicationMaster: Running as user sshuser
21/04/30 00:16:51 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/4.1.4.0/0/resource-types.xml
21/04/30 00:16:51 INFO skein.ApplicationMaster: Application specification successfully loaded
21/04/30 00:16:51 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
21/04/30 00:16:51 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1,rm2]
21/04/30 00:16:51 INFO skein.ApplicationMaster: gRPC server started at wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net:42071
21/04/30 00:16:52 INFO skein.ApplicationMaster: WebUI server started at wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net:36771
21/04/30 00:16:52 INFO skein.ApplicationMaster: Registering application with resource manager
21/04/30 00:16:52 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1,rm2]...
21/04/30 00:16:52 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
21/04/30 00:16:52 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1,rm2]
21/04/30 00:16:52 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.16:10200
21/04/30 00:16:52 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1,rm2]...
21/04/30 00:16:52 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
21/04/30 00:16:52 INFO skein.ApplicationMaster: Initializing service 'dask.worker'.
21/04/30 00:16:52 INFO skein.ApplicationMaster: Initializing service 'dask.scheduler'.
21/04/30 00:16:52 INFO skein.ApplicationMaster: REQUESTED: dask.scheduler_0
21/04/30 00:16:53 INFO skein.ApplicationMaster: Starting container_e03_1619736095085_0009_01_000002...
21/04/30 00:16:53 INFO skein.ApplicationMaster: RUNNING: dask.scheduler_0 on container_e03_1619736095085_0009_01_000002
21/04/30 00:17:02 INFO skein.ApplicationMaster: Scaling service 'dask.worker' to 1 instances, a delta of 1.
21/04/30 00:17:02 INFO skein.ApplicationMaster: REQUESTED: dask.worker_0
21/04/30 00:17:04 INFO skein.ApplicationMaster: Starting container_e03_1619736095085_0009_01_000003...
21/04/30 00:17:04 INFO skein.ApplicationMaster: RUNNING: dask.worker_0 on container_e03_1619736095085_0009_01_000003
21/04/30 00:17:15 INFO skein.ApplicationMaster: Shutting down: Shutdown requested by user.
21/04/30 00:17:15 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
21/04/30 00:17:15 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
21/04/30 00:17:15 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
21/04/30 00:17:15 INFO skein.ApplicationMaster: Deleted application directory hdfs://mycluster/user/sshuser/.skein/application_1619736095085_0009
21/04/30 00:17:15 INFO skein.ApplicationMaster: WebUI server shut down
21/04/30 00:17:15 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

Container: container_e03_1619736095085_0009_01_000001 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:directory.info
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:1735
LogContents:
ls -l:
total 24
-rw-r--r-- 1 yarn hadoop   74 Apr 30 00:16 container_tokens
-rwx------ 1 yarn hadoop  707 Apr 30 00:16 default_container_executor_session.sh
-rwx------ 1 yarn hadoop  762 Apr 30 00:16 default_container_executor.sh
-rwx------ 1 yarn hadoop 4763 Apr 30 00:16 launch_container.sh
drwx--x--- 2 yarn hadoop 4096 Apr 30 00:16 tmp
find -L . -maxdepth 5 -ls:
  5111810      4 drwx--x---   3 yarn     hadoop       4096 Apr 30 00:16 .
  5111832      4 -rw-r--r--   1 yarn     hadoop         16 Apr 30 00:16 ./.default_container_executor.sh.crc
  5111813   7660 -r-x------   1 yarn     hadoop    7842343 Apr 30 00:16 ./.skein.jar
  5111825      4 -rw-r--r--   1 yarn     hadoop         74 Apr 30 00:16 ./container_tokens
  5111830      4 -rw-r--r--   1 yarn     hadoop         16 Apr 30 00:16 ./.default_container_executor_session.sh.crc
  5111819      4 -r-x------   1 yarn     hadoop       1013 Apr 30 00:16 ./.skein.crt
  5111816      4 -r-x------   1 yarn     hadoop       1704 Apr 30 00:16 ./.skein.pem
  5111826      4 -rw-r--r--   1 yarn     hadoop         12 Apr 30 00:16 ./.container_tokens.crc
  5111822      4 -r-x------   1 yarn     hadoop       1981 Apr 30 00:16 ./.skein.proto
  5111828      4 -rw-r--r--   1 yarn     hadoop         48 Apr 30 00:16 ./.launch_container.sh.crc
  5111824      4 drwx--x---   2 yarn     hadoop       4096 Apr 30 00:16 ./tmp
  5111827      8 -rwx------   1 yarn     hadoop       4763 Apr 30 00:16 ./launch_container.sh
  5111829      4 -rwx------   1 yarn     hadoop        707 Apr 30 00:16 ./default_container_executor_session.sh
  5111831      4 -rwx------   1 yarn     hadoop        762 Apr 30 00:16 ./default_container_executor.sh
broken symlinks(find -L . -maxdepth 5 -type l -ls):

End of LogType:directory.info
*******************************************************************************

Container: container_e03_1619736095085_0009_01_000001 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:launch_container.sh
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:4763
LogContents:
#!/bin/bash

set -o pipefail -e
export PRELAUNCH_OUT="/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/prelaunch.out"
exec >"${PRELAUNCH_OUT}"
export PRELAUNCH_ERR="/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/prelaunch.err"
exec 2>"${PRELAUNCH_ERR}"
echo "Setting up env variables"
export JAVA_HOME=${JAVA_HOME:-"/usr/lib/jvm/zulu-8-azure-amd64"}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/usr/hdp/4.1.4.0/hadoop/conf"}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/4.1.4.0/hadoop-yarn"}
export HADOOP_HOME=${HADOOP_HOME:-"/usr/hdp/4.1.4.0/hadoop"}
export PATH=${PATH:-"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/var/lib/ambari-agent"}
export HADOOP_TOKEN_FILE_LOCATION="/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/container_tokens"
export CONTAINER_ID="container_e03_1619736095085_0009_01_000001"
export NM_PORT="30050"
export NM_HOST="wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net"
export NM_HTTP_PORT="30060"
export LOCAL_DIRS="/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009"
export LOCAL_USER_DIRS="/mnt/resource/hadoop/yarn/local/usercache/sshuser/"
export LOG_DIRS="/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001"
export USER="sshuser"
export LOGNAME="sshuser"
export HOME="/home/"
export PWD="/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001"
export JVM_PID="$$"
export MALLOC_ARENA_MAX="4"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
export NM_AUX_SERVICE_spark2_shuffle=""
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1619736095085_0009"
export SKEIN_APPLICATION_ID="application_1619736095085_0009"
export CLASSPATH="$CLASSPATH:./*:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*"
export LANG="en_US.UTF-8"
export APP_SUBMIT_TIME_ENV="1619741809940"
export HADOOP_USER_NAME="sshuser"
echo "Setting up job resources"
ln -sf "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/filecache/13/.skein.proto" ".skein.proto"
ln -sf "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/filecache/12/.skein.crt" ".skein.crt"
ln -sf "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/filecache/11/.skein.pem" ".skein.pem"
ln -sf "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/filecache/10/skein.jar" ".skein.jar"
echo "copying debugging information"
# Creating copy of launch script
cp "launch_container.sh" "/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/launch_container.sh"
chmod 640 "/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
ls -l 1>>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
find -L . -maxdepth 5 -ls 1>>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/directory.info"
echo "Launching container"
exec /bin/bash -c "$JAVA_HOME/bin/java -Xmx128M -Dskein.log.level=INFO -Dskein.log.directory=/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001 com.anaconda.skein.ApplicationMaster hdfs://mycluster/user/sshuser/.skein/application_1619736095085_0009 >/mnt/resource/hadoop/yarn/log/application_1619736095085_0009/container_e03_1619736095085_0009_01_000001/application.master.log 2>&1"

End of LogType:launch_container.sh
************************************************************************************


End of LogType:prelaunch.err
******************************************************************************

Container: container_e03_1619736095085_0009_01_000001 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:100
LogContents:
Setting up env variables
Setting up job resources
copying debugging information
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e03_1619736095085_0009_01_000002 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:dask.scheduler.log
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:3506
LogContents:
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://10.0.0.12:38447
distributed.scheduler - INFO -   dashboard at:                    :38653
distributed.scheduler - INFO - Receive client connection: Client-63068e40-a949-11eb-8158-002248a55a09
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.0.0.4:36545', processing: 1>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.0.4:36545
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove worker <Worker 'tcp://10.0.0.4:36545', processing: 1>
distributed.core - INFO - Removing comms to tcp://10.0.0.4:36545
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.0.0.4:36545', processing: 1>
distributed.core - INFO - Removing comms to tcp://10.0.0.4:36545
distributed.scheduler - INFO - Task ('dataframe-count-chunk-read-parquet-dataframe-count-agg-c6aec6008fdc93d0dd64a09d3e5956c3', 0) marked as Failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Remove client Client-63068e40-a949-11eb-8158-002248a55a09
distributed.scheduler - INFO - Remove client Client-63068e40-a949-11eb-8158-002248a55a09
distributed.scheduler - INFO - Close client connection: Client-63068e40-a949-11eb-8158-002248a55a09
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.0.0.4:36545', processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, processing: 0>
distributed.core - INFO - Removing comms to tcp://10.0.0.4:36545
distributed.scheduler - INFO - Lost all workers

End of LogType:dask.scheduler.log
***********************************************************************************

Container: container_e03_1619736095085_0009_01_000002 on wn0-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619741837044
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:directory.info
LogLastModifiedTime:Fri Apr 30 00:17:17 +0000 2021
LogLength:1563718
LogContents:
ls -l:
total 28
-rw-r--r-- 1 yarn hadoop    7 Apr 30 00:17 container_tokens
-rwx------ 1 yarn hadoop  707 Apr 30 00:17 default_container_executor_session.sh
-rwx------ 1 yarn hadoop  762 Apr 30 00:17 default_container_executor.sh
lrwxrwxrwx 1 yarn hadoop  119 Apr 30 00:17 environment -> /mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1619736095085_0009/filecache/14/dask_yarn.tar.gz
-rwx------ 1 yarn hadoop 4319 Apr 30 00:17 launch_container.sh
drwx--x--- 2 yarn hadoop 4096 Apr 30 00:17 tmp
find -L . -maxdepth 5 -ls:
  5118907      4 drwx--x---   3 yarn     hadoop       4096 Apr 30 00:17 .
  5119737      4 -rw-r--r--   1 yarn     hadoop         16 Apr 30 00:17 ./.default_container_executor.sh.crc
  5119727      4 -r-x------   1 yarn     hadoop         90 Apr 30 00:17 ./.skein.sh
  5119730      4 -rw-r--r--   1 yarn     hadoop          7 Apr 30 00:17 ./container_tokens
  5119735      4 -rw-r--r--   1 yarn     hadoop         16 Apr 30 00:17 ./.default_container_executor_session.sh.crc
  5111819      4 -r-x------   1 yarn     hadoop       1013 Apr 30 00:16 ./.skein.crt
  5111816      4 -r-x------   1 yarn     hadoop       1704 Apr 30 00:16 ./.skein.pem
  5111840      4 drwx------  13 yarn     hadoop       4096 Apr 30 00:16 ./environment
  5375401     20 drwx------  20 yarn     hadoop      20480 Apr 30 00:16 ./environment/lib
  5383552    176 -r-x------   1 yarn     hadoop     179043 Oct  8  2017 ./environment/lib/libgsasl.so.7
  5375593      4 -r-x------   1 yarn     hadoop       1854 Jan 21 14:16 ./environment/lib/libboost_exception.a
  5383774   3876 -r-x------   1 yarn     hadoop    3964976 Nov 20 17:37 ./environment/lib/libicui18n.so.68
  5383777     76 -r-x------   1 yarn     hadoop      73976 Nov 20 17:37 ./environment/lib/libicuio.so
  5383461     44 -r-x------   1 yarn     hadoop      44880 Nov 18 00:14 ./environment/lib/libkrad.so.0.0
  5382664    124 -r-x------   1 yarn     hadoop     125072 Jun  2  2020 ./environment/lib/libyaml.so

############################### MANY MORE LINES LIKE THIS ###############################

  5777726     24 -r-x------   1 yarn     hadoop      23022 Nov 20 17:37 ./environment/bin/icu-config
  5777329     24 -r-x------   1 yarn     hadoop      22424 Apr 20 13:17 ./environment/bin/hmac256
  5777951      4 -r-x------   1 yarn     hadoop       1013 Apr 29 23:53 ./.skein.crt
  5777954      4 -rw-r--r--   1 yarn     hadoop         12 Apr 29 23:53 ./.container_tokens.crc
broken symlinks(find -L . -maxdepth 5 -type l -ls):

End of LogType:directory.info
*******************************************************************************


End of LogType:prelaunch.err
******************************************************************************

Container: container_e03_1619736095085_0008_01_000002 on wn1-rita-t.ue1j5l1mq4befkisvpaptyjfbd.jx.internal.cloudapp.net_30050_1619740411569
LogAggregationType: AGGREGATED
===========================================================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Thu Apr 29 23:53:31 +0000 2021
LogLength:100
LogContents:
Setting up env variables
Setting up job resources
copying debugging information
Launching container

End of LogType:prelaunch.out
******************************************************************************

YARN allocates resources for Spark without issue, and there are plenty of resources available to allocate.

I have tried different worker, core and memory configurations, as well as different operations such as head, sum and count. I also increased the number of retries to 300 in case my tasks were being blacklisted, and even killed the cluster and started over, with the same result.
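For reference, a minimal sketch of the retry bump above: as far as I can tell, the relevant knob is distributed's `allowed-failures` setting, whose default of 3 matches the "3 workers died while trying to run it" message in the scheduler log; setting it before creating the Client raises how many worker deaths a task survives before being marked as failed.

```python
import dask

# Sketch: raise distributed's allowed-failures limit so a task is not
# marked as failed (KilledWorker) until many more workers have died
# while running it. The default of 3 matches the
# "3 workers died while trying to run it" line in the scheduler log.
dask.config.set({"distributed.scheduler.allowed-failures": 300})
print(dask.config.get("distributed.scheduler.allowed-failures"))  # 300
```

(This has to be set before the `Client` is created for the scheduler to pick it up.)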

I also tried setting up the workers manually, replicating the environment on each node by following the steps included here, and that deployment succeeded.
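For context, the manual setup that did work was along these lines (a sketch assuming the standard dask CLI; the scheduler hostname is a placeholder and the thread/memory values are illustrative, not the exact ones from the guide):

```shell
# on one node, start the scheduler
dask-scheduler --port 8786

# on each worker node, inside the same replicated conda environment
dask-worker tcp://<scheduler-host>:8786 --nthreads 4 --memory-limit 12GB
```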

Any ideas on this would be appreciated. Thanks!
