Cause of "Application attempt ... doesn't exist in ApplicationMasterService cache"? Pregel: the effect of maxIterations on a cluster running a non-convergent algorithm

Problem description

I am trying to run my own Pregel method on a relatively small graph (250k vertices, 1.5M edges). The algorithm I use may well be non-convergent, so in most cases the maxIterations setting effectively acts as a hard stop that ends the computation.
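
For context, the call shape looks roughly like this. This is a minimal sketch: the input path, vertex values, update rule, and messages below are placeholders for illustration, not my actual algorithm.

    import org.apache.spark.graphx._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pregel-hard-stop").getOrCreate()
    val sc = spark.sparkContext

    // Toy stand-in for the real 250k-vertex / 1.5M-edge graph (hypothetical path).
    val graph: Graph[Double, Int] =
      GraphLoader.edgeListFile(sc, "hdfs:///tmp/edges.txt")
        .mapVertices((_, _) => 1.0)

    // With a (most likely) non-convergent algorithm, maxIterations is what ends the run.
    val result = graph.pregel(
      initialMsg = 0.0,
      maxIterations = 100,
      activeDirection = EdgeDirection.Either
    )(
      vprog = (_, value, msg) => value + msg,              // placeholder vertex update
      sendMsg = t => Iterator((t.dstId, t.srcAttr * 0.5)), // placeholder message
      mergeMsg = _ + _
    )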

I am using AWS EMR with Apache Spark, with m5.2xlarge instances for all nodes and EMR managed scaling enabled. Initially the cluster runs 1 master node and 4 worker nodes, scalable up to 8 workers.

For the same cluster setup I gradually increased maxIterations from 100 to 500 in steps of 100 [100, 200, 300, 400, 500], as sketched below. I assumed that a setup sufficient for 100 iterations would also be sufficient for any other count, since unused memory would be released between runs.
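
In practice each value ran as its own EMR step; the sweep is equivalent to something like this (a sketch reusing the placeholder graph and operators from above):

    // Hypothetical sweep: one independent Pregel run per maxIterations value.
    for (maxIter <- Seq(100, 200, 300, 400, 500)) {
      val result = graph.pregel(0.0, maxIter, EdgeDirection.Either)(
        (_, value, msg) => value + msg,
        t => Iterator((t.dstId, t.srcAttr * 0.5)),
        _ + _)
      result.vertices.count() // force the computation
      result.unpersist()      // my assumption: cached data is released between runs
    }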

However, when I ran this set of jobs with maxIterations increasing from 100 to 500, I found that every job with maxIterations > 100 was terminated with a step error. I checked the Spark logs for the cause, and this is what I got:

Start of the log:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__364046395941885636.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for TERM
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for HUP
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for INT
21/02/13 21:23:24 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing view acls groups to: 
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls groups to: 
21/02/13 21:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn,hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn,hadoop); groups with modify permissions: Set()
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 INFO ApplicationMaster: Preparing Local resources
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1613251201422_0001_000001
21/02/13 21:23:25 INFO ApplicationMaster: Starting the user application in a separate Thread
21/02/13 21:23:25 INFO ApplicationMaster: Waiting for spark context initialization...
21/02/13 21:23:25 INFO SparkContext: Running Spark version 2.4.7-amzn-0
21/02/13 21:23:25 INFO SparkContext: Submitted application: Read JDBC Datasites2
21/02/13 21:23:25 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing view acls groups to: 
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls groups to: 
21/02/13 21:23:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn,hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn,hadoop); groups with modify permissions: Set()
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:25 INFO Utils: Successfully started service 'sparkDriver' on port 41117.
21/02/13 21:23:25 INFO SparkEnv: Registering MapOutputTracker
21/02/13 21:23:25 INFO SparkEnv: Registering BlockManagerMaster
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-bc544c91-1a59-41f3-890f-faaa392bea09
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-14e3f36f-6d3f-4ffe-a28c-fa3f81f0c5c9
21/02/13 21:23:26 INFO MemoryStore: MemoryStore started with capacity 1008.9 MB
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:26 INFO SparkEnv: Registering OutputCommitCoordinator
21/02/13 21:23:26 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs,/jobs/json,/jobs/job,/jobs/job/json,/stages,/stages/json,/stages/stage,/stages/stage/json,/stages/pool,/stages/pool/json,/storage,/storage/json,/storage/rdd,/storage/rdd/json,/environment,/environment/json,/executors,/executors/json,/executors/threadDump,/executors/threadDump/json,/static,/,/api,/jobs/job/kill,/stages/stage/kill.
21/02/13 21:23:26 INFO Utils: Successfully started service 'SparkUI' on port 43659.
21/02/13 21:23:26 INFO SparkUI: Bound SparkUI to 0.0.0.0,and started at http://ip-172-31-21-88.ec2.internal:43659
21/02/13 21:23:26 INFO YarnClusterScheduler: Created YarnClusterScheduler
21/02/13 21:23:26 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1613251201422_0001 and attemptId Some(appattempt_1613251201422_0001_000001)
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34665.
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:27 INFO RMProxy: Connecting to ResourceManager at ip-172-31-29-
  command:
    LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH\" \ 
      {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx4743m \ 
      '-verbose:gc' \ 
      '-XX:+PrintGCDetails' \ 
      '-XX:+PrintGCDateStamps' \ 
      '-XX:OnOutOfMemoryError=kill -9 %p' \ 
      '-XX:+UseParallelGC' \ 
      '-XX:InitiatingHeapOccupancyPercent=70' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.history.ui.port=18080' \ 
      '-Dspark.ui.port=0' \ 
      '-Dspark.driver.port=41117' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@ip-172-31-21-88.ec2.internal:41117 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      2 \ 
      --app-id \ 
      application_1613251201422_0001 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __app__.jar -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/force-pregel.jar" } size: 27378 timestamp: 1613251399566 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_libs__364046395941885636.zip" } size: 239655683 timestamp: 1613251397751 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_conf__.zip" } size: 274365 timestamp: 1613251399776 type: ARCHIVE visibility: PRIVATE
    hive-site.xml -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/hive-site.xml" } size: 2137 timestamp: 1613251399631 type: FILE visibility: PRIVATE 

===============================================================================
    21/02/13 21:23:27 INFO Configuration: resource-types.xml not found
    21/02/13 21:23:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
    21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = memory-mb,units = Mi,type = COUNTABLE
    21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = vcores,units =,type = COUNTABLE
    21/02/13 21:23:27 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
    21/02/13 21:23:27 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@ip-172-31-21-88.ec2.internal:41117)
    21/02/13 21:23:27 INFO YarnAllocator: Will request up to 100 executor container(s),each with <memory:5632,max memory:2147483647,vCores:2,max vCores:2147483647>
    21/02/13 21:23:27 INFO YarnAllocator: Submitted 100 unlocalized container requests.
    21/02/13 21:23:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000,initial allocation : 200) intervals
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /sql/json.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /sql/execution.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /sql/execution/json.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
    21/02/13 21:23:27 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000002 on host ip-172-31-21-88.ec2.internal for executor with ID 1 with resources <memory:5632,max memory:12288,vCores:1,max vCores:8>
    21/02/13 21:23:27 INFO YarnAllocator: Launching executor with 4742m of heap (plus 890m overhead) and 2 cores
    21/02/13 21:23:27 INFO YarnAllocator: Received 1 containers from YARN,launching executors on 1 of them.
    21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000004 on host ip-172-31-25-102.ec2.internal for executor with ID 2 with resources <memory:11264,vCores:2>
    21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
    21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000006 on host ip-172-31-28-143.ec2.internal for executor with ID 3 with resources <memory:11264,vCores:2>
    21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
    21/02/13 21:23:28 INFO YarnAllocator: Received 2 containers from YARN,launching executors on 2 of them.
    21/02/13 21:23:30 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.21.88:53634) with ID 1
    21/02/13 21:23:30 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
    21/02/13 21:23:30 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-88.ec2.internal:45667 with 2.3 GB RAM,BlockManagerId(1,ip-172-31-21-88.ec2.internal,45667,None)

Then come about 2 MB of similar output, after which it ends with:

21/02/13 21:28:25 INFO TaskSetManager: Finished task 199.0 in stage 37207.0 (TID 93528) in 8 ms on ip-172-31-25-102.ec2.internal (executor 2) (158/200)

21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_31 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 252.3 KB,free: 2.1 GB)
21/02/13 21:28:25 ERROR ApplicationMaster: Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy23.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:300)
    at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:279)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$allocationThreadImpl(ApplicationMaster.scala:541)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:607)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
    at org.apache.hadoop.ipc.Client.call(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1394)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy22.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 13 more
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_30 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.8 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 40.0 in stage 37207.0 (TID 93533,executor 1,partition 40,PROCESS_LOCAL,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 31.0 in stage 37207.0 (TID 93532) in 16 ms on ip-172-31-21-88.ec2.internal (executor 1) (162/200)
21/02/13 21:28:25 INFO ApplicationMaster: Final app status: Failed,exitCode: 12,(reason: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 41.0 in stage 37207.0 (TID 93534,partition 41,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 30.0 in stage 37207.0 (TID 93531) in 22 ms on ip-172-31-21-88.ec2.internal (executor 1) (163/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_40 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 234.2 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 48.0 in stage 37207.0 (TID 93535,partition 48,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 40.0 in stage 37207.0 (TID 93533) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (164/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_41 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 233.4 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 51.0 in stage 37207.0 (TID 93536,partition 51,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 41.0 in stage 37207.0 (TID 93534) in 15 ms on ip-172-31-21-88.ec2.internal (executor 1) (165/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_48 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 235.1 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 57.0 in stage 37207.0 (TID 93537,partition 57,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 48.0 in stage 37207.0 (TID 93535) in 11 ms on ip-172-31-21-88.ec2.internal (executor 1) (166/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_57 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 232.2 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_51 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.2 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 61.0 in stage 37207.0 (TID 93538,partition 61,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 57.0 in stage 37207.0 (TID 93537) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (167/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 63.0 in stage 37207.0 (TID 93539,partition 63,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 51.0 in stage 37207.0 (TID 93536) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (168/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_61 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 228.6 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 67.0 in stage 37207.0 (TID 93540,partition 67,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 61.0 in stage 37207.0 (TID 93538) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (169/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_63 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 238.3 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 71.0 in stage 37207.0 (TID 93541,partition 71,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 63.0 in stage 37207.0 (TID 93539) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (170/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_67 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 247.2 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_71 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 243.6 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 77.0 in stage 37207.0 (TID 93542,partition 77,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 67.0 in stage 37207.0 (TID 93540) in 18 ms on ip-172-31-21-88.ec2.internal (executor 1) (171/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 79.0 in stage 37207.0 (TID 93543,partition 79,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 71.0 in stage 37207.0 (TID 93541) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (172/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_79 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 253.6 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_77 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 222.5 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 86.0 in stage 37207.0 (TID 93544,partition 86,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 79.0 in stage 37207.0 (TID 93543) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (173/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 87.0 in stage 37207.0 (TID 93545,partition 87,19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 77.0 in stage 37207.0 (TID 93542) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (174/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_86 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 254.5 KB,free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_87 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 267.1 KB,free: 2.1 GB)
  • Am I right that Pregel failed to complete 200 or more iterations because of an OutOfMemory error on some of the cluster nodes?
  • If so, how does Pregel work such that 100 iterations do not trigger it but 200 or 300 do? My understanding before this problem was that Pregel, like many other iterative methods, only "stores" the previous and the current iteration's values; the values change from iteration to iteration, but their number does not grow, meaning it is still a graph with 250k vertices and 1.5M edges, and only the messages valid for the current iteration are added to the heap (see the simplified sketch of the GraphX Pregel loop after this list).
  • Throughout the whole log I cannot find anything about running out of memory; as can be seen, every node still had several GB of free memory right before the job was killed.
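
For reference on the second point, the main loop of GraphX's Pregel looks roughly like the sketch below. It is a simplified paraphrase, not the actual source: it uses the public aggregateMessages API instead of the internal mapReduceTriplets, ignores active-direction filtering, and omits the periodic checkpointing that Spark 2.2+ can enable via spark.graphx.pregel.checkpointInterval (disabled by default).

    import org.apache.spark.graphx._
    import scala.reflect.ClassTag

    // Simplified paraphrase of org.apache.spark.graphx.Pregel#apply.
    def pregelSketch[VD: ClassTag, ED: ClassTag, A: ClassTag](
        graph: Graph[VD, ED], initialMsg: A, maxIterations: Int)(
        vprog: (VertexId, VD, A) => VD,
        sendMsg: EdgeContext[VD, ED, A] => Unit, // aggregateMessages-style send function
        mergeMsg: (A, A) => A): Graph[VD, ED] = {

      var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
      var messages = g.aggregateMessages(sendMsg, mergeMsg).cache()
      var activeMessages = messages.count()
      var prevG: Graph[VD, ED] = null
      var i = 0
      while (activeMessages > 0 && i < maxIterations) {
        prevG = g
        g = g.joinVertices(messages)(vprog).cache() // a new cached graph every iteration
        val oldMessages = messages
        messages = g.aggregateMessages(sendMsg, mergeMsg).cache()
        activeMessages = messages.count()           // materialize before dropping old data
        oldMessages.unpersist(blocking = false)     // only prev + current stay cached
        prevG.unpersistVertices(blocking = false)
        prevG.edges.unpersist(blocking = false)
        i += 1
      }
      g
    }

If this paraphrase is right, only the previous and current graph are kept cached, which matches the understanding above; but each iteration still creates a new graph and a new message RDD, so the RDD lineage and the shuffle files retained on local disk keep growing with the iteration count even while the cached data stays roughly constant.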

Solution

No working solution to this problem has been found yet.
