Hortonworks Hadoop NN and RM heap stuck at high utilization while no applications are running? java.io.IOException: No space left on device

Problem description

Some Spark jobs recently launched from a Hadoop (HDP-3.1.0.0) client node have been throwing

Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

errors, and now the NN and RM heaps appear stuck at high utilization (e.g. 80-95%), even though the RM/YARN UI shows no pending or running applications.

On the Ambari dashboard I see:

(screenshot: Ambari dashboard showing high NN and RM heap utilization)

Yet in the RM UI, nothing appears to be running:

(screenshots: RM UI showing no running or pending applications)

The error I see in the recently failed Spark jobs is...

[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 8.558 s
[2021-02-11 22:05:20,982] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: Job 7 finished: csv at NativeMethodAccessorImpl.java:0, took 8.561029 s
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Job null committed.
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Finished processing stats for job null.
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - writing to local FS staging area
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:262)
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at java.io.DataOutputStream.write(DataOutputStream.java:107)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:352)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:441)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:305)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:369)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - Caused by: java.io.IOException: No space left on device
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.writeBytes(Native Method)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.write(FileOutputStream.java:326)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:260)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     ... 29 more
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - Traceback (most recent call last):
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -   File "/home/airflow/projects/hph_etl_airflow/common_prep.py", line 112, in <module>
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -     assert get.returncode == 0, "ERROR: Failed to copy to local dir"
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - AssertionError: ERROR: Failed to copy to local dir
[2021-02-11 22:05:24,034] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Invoking stop() from shutdown hook
[2021-02-11 22:05:24,040] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO AbstractConnector: Stopped Spark@599cff94{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
[2021-02-11 22:05:24,048] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkUI: Stopped Spark web UI at http://airflowetl.ucera.local:4041
[2021-02-11 22:05:24,092] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Interrupting monitor thread
[2021-02-11 22:05:24,106] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Shutting down all executors
[2021-02-11 22:05:24,107] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - (serviceOption=None,
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  services=List(),
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  started=false)
[2021-02-11 22:05:24,115] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Stopped
[2021-02-11 22:05:24,123] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[2021-02-11 22:05:24,154] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MemoryStore: MemoryStore cleared
[2021-02-11 22:05:24,155] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManager: BlockManager stopped
[2021-02-11 22:05:24,157] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManagerMaster: BlockManagerMaster stopped
[2021-02-11 22:05:24,162] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[2021-02-11 22:05:24,173] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Successfully stopped SparkContext
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Shutdown hook called
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-57e1dfa3-26e8-490b-b7ca-94bce93e36d7
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506/pyspark-225760d8-f365-49fe-8333-6d0df3cb99bd
[2021-02-11 22:05:24,646] {bash_operator.py:132} INFO - Command exited with return code 1
[2021-02-11 22:05:24,663] {taskinstance.py:1088} ERROR - Bash command failed

Note: I could not do any more debugging, because I restarted the cluster via Ambari (some daily jobs depend on it, so it could not be left in that state), which brought the NN and RM heaps back down to 10% and 25% respectively.

Does anyone know what could be going on here? Is there anywhere else I can (still) look for further debugging info?

Workaround

Running df -h and du -h -d1 /some/paths/of/interest on the machine making the Spark calls (going only by the "writing to local FS staging area" and "No space left on device" messages in the error; running df -h across all the hadoop nodes showed that the client node launching the Spark jobs was the only one with high disk utilization), I found that the machine calling the Spark jobs had only 1GB of disk space left (due to unrelated issues), which ultimately caused this error for some of them. I have since fixed that, but am not sure whether it is related (since my understanding is that Spark does the actual processing on other nodes in the cluster).
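For reference, the same kind of disk survey can be scripted (a minimal sketch in Python, since the failing driver here is a Python script; the paths checked are examples, not the actual staging dirs):

```python
import shutil

# Report free space on candidate staging mounts, similar to `df -h`.
# The paths are examples; substitute the dirs your jobs actually write to.
for path in ("/", "/tmp"):
    usage = shutil.disk_usage(path)
    free_gib = usage.free / 2**30
    pct_used = usage.used / usage.total * 100
    print(f"{path}: {free_gib:.1f} GiB free ({pct_used:.0f}% used)")
```

Running something like this periodically (or before each staging write) on the client node would have flagged the nearly full disk before the jobs started failing.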

I suspect this was the problem, but it would be very helpful for future debugging (and for a better actual answer to this post) if someone with more experience could explain what is going on under the surface here. E.g.

  1. Why would a lack of free disk space on one of the cluster nodes (in this case the client node) cause the RM heap to remain at such high utilization, even when no jobs are reported running in the RM UI?
  2. Why would low disk space on the local machine affect the Spark jobs at all (my understanding being that Spark does the actual processing on other nodes in the cluster)?

If low disk space on the local machine calling the spark jobs was indeed the problem, then this question can probably be marked as a duplicate of the one answered here: https://stackoverflow.com/a/18365738/8236733
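Judging by the assert get.returncode == 0 line in common_prep.py's traceback, the script shells out to copy results to a local dir. A hypothetical guarded version (the function name, paths, and 1 GiB threshold are all invented for illustration) could fail with a clear message before the copy starts, instead of dying mid-write:

```python
import shutil
import subprocess

def copy_to_local(hdfs_path, local_dir, min_free_bytes=1 << 30):
    """Refuse to stage into local_dir unless min_free_bytes are free.

    Hypothetical wrapper around `hadoop fs -get`; the 1 GiB default
    threshold is an arbitrary example value.
    """
    free = shutil.disk_usage(local_dir).free
    if free < min_free_bytes:
        raise OSError(
            f"only {free} bytes free under {local_dir}; "
            f"refusing to copy {hdfs_path}"
        )
    get = subprocess.run(["hadoop", "fs", "-get", hdfs_path, local_dir])
    assert get.returncode == 0, "ERROR: Failed to copy to local dir"
```

This fails fast on the client node before touching HDFS, which also avoids leaving a partially written file in the local staging area.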
