Hortonworks Hadoop NN and RM heaps stuck at high utilization, but no applications running? java.io.IOException: No space left on device

Problem description

Some Spark jobs recently launched from a Hadoop (HDP-3.1.0.0) client node started throwing a number of

Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device

errors, and now I see the NN and RM heaps appear to be stuck at high utilization levels (e.g. 80-95%), despite there being no pending or running jobs in the RM / YARN UI.

On the Ambari dashboard I see:

[screenshot: Ambari dashboard showing high NN and RM heap usage]

However, in the RM UI, nothing appears to be running:

[screenshots: RM UI showing no pending or running applications]

The errors I see in the recently failed Spark jobs look like...

[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 8.558 s
[2021-02-11 22:05:20,982] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: Job 7 finished: csv at NativeMethodAccessorImpl.java:0, took 8.561029 s
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Job null committed.
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Finished processing stats for job null.
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - writing to local FS staging area
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:262)
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at java.io.DataOutputStream.write(DataOutputStream.java:107)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:352)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:441)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:305)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:369)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - Caused by: java.io.IOException: No space left on device
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.writeBytes(Native Method)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.write(FileOutputStream.java:326)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:260)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     ... 29 more
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - Traceback (most recent call last):
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -   File "/home/airflow/projects/hph_etl_airflow/common_prep.py", line 112, in <module>
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -     assert get.returncode == 0, "ERROR: failed to copy to local dir"
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - AssertionError: ERROR: failed to copy to local dir
[2021-02-11 22:05:24,034] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Invoking stop() from shutdown hook
[2021-02-11 22:05:24,040] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO AbstractConnector: Stopped Spark@599cff94{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
[2021-02-11 22:05:24,048] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkUI: Stopped Spark web UI at http://airflowetl.ucera.local:4041
[2021-02-11 22:05:24,092] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Interrupting monitor thread
[2021-02-11 22:05:24,106] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Shutting down all executors
[2021-02-11 22:05:24,107] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - (serviceOption=None,
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  services=List(),
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  started=false)
[2021-02-11 22:05:24,115] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Stopped
[2021-02-11 22:05:24,123] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[2021-02-11 22:05:24,154] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MemoryStore: MemoryStore cleared
[2021-02-11 22:05:24,155] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManager: BlockManager stopped
[2021-02-11 22:05:24,157] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManagerMaster: BlockManagerMaster stopped
[2021-02-11 22:05:24,162] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[2021-02-11 22:05:24,173] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Successfully stopped SparkContext
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Shutdown hook called
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-57e1dfa3-26e8-490b-b7ca-94bce93e36d7
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506/pyspark-225760d8-f365-49fe-8333-6d0df3cb99bd
[2021-02-11 22:05:24,646] {bash_operator.py:132} INFO - Command exited with return code 1
[2021-02-11 22:05:24,663] {taskinstance.py:1088} ERROR - Bash command failed
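
For context, the step that fails is a copy of job output from HDFS down to a local staging directory on the client node, driven from the Python script in the traceback (the FsShell frames in the stack trace point at an "hdfs dfs -get"-style copy). It boils down to something like the following; the paths and the exact command are placeholders, only the "get" subprocess result and the assertion are taken from the traceback above:

import subprocess

# simplified version of the copy-to-local step around common_prep.py line 112;
# the HDFS and local paths here are placeholders, not the real ones
hdfs_output_dir = "/tmp/etl/output"            # placeholder HDFS path
local_staging_dir = "/home/airflow/staging"    # placeholder local path

print("\nwriting to local FS staging area\n")
get = subprocess.run(["hdfs", "dfs", "-get", hdfs_output_dir, local_staging_dir])

# this is the assertion that fails: the copy exits non-zero because the *local*
# disk fills up (the stack trace goes through RawLocalFileSystem, i.e. the
# client node's filesystem, not HDFS)
assert get.returncode == 0, "ERROR: failed to copy to local dir"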

Note: I was not able to do any more debugging, because the cluster has since been restarted via Ambari (it is needed for some daily tasks, so it could not be left in that state), which brought the NN and RM heap usage back down to 10% and 25% respectively.

Does anyone know what could be going on here? Is there anywhere else I could (still) check for further debugging info?
