节点死亡时Docker Swarm无法调度容器

问题描述

编辑

请注意此问题。我发现我的一个正在使用Docker.DotNet的服务正在终止标记Shutdown的服务。我已更正了该错误,并重新获得了对Docker和Docker Swarm的信任。 谢谢卡洛斯的帮助。我的错,我的错。抱歉!

我在docker-compose文件上配置了13个服务,并以Swarm模式运行,其中有一个管理器和两个工作器节点。

enter image description here

然后我通过耗尽它来使其中一个工作节点不可用

docker node update --availability drain ****-v3-6by7ddst

我注意到的是,在耗尽节点上运行的所有服务均已删除,并且未计划到可用节点上。 可用的worker和manager节点仍有大量资源。.这些服务被简单地删除了。我现在只有9个服务

enter image description here

enter image description here

在日志中,我看到类似波纹管的内容,但是重复使用不同的服务ID

level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
level=error msg="Error getting service u68b1fofzb3nefpnasctpywav: service u68b1fofzb3nefpnasctpywav not found"
level=warning msg="rmServiceBinding 021eda460c5744fd4d499475e5aa0f1cfbe5df479e5b21389ed1b501a93b47e1 possible transient state ok:false entries:0 set:false "

然后,出于调试目的,我将节点设置回可用状态 docker node update --availability active ****-v3-6by7ddst

然后,我尝试将一些服务平衡到新可用的节点。这就是结果。

enter image description here

我在日志中遇到相同的错误

level=error msg="Error getting service ****_frontend: service ****_frontend not found"
level=warning msg="rmServiceBinding 6bb220c0a95b30cdb3ff7b577c7e9dec7ad6383b34aff85e1685e94e7486e3ea possible transient state ok:false entries:0 set:false "
msg="Error getting service l29wlucttul75pzqo2sgr0u9e: service l29wlucttul75pzqo2sgr0u9e not found"

在我的docker-compose文件中,我正在像这样配置所有服务。重启策略是任意的。

  frontend:
    image: {FRONTEND_IMAGE}
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.docker.lbswarm=true"
        - "traefik.http.routers.frontend.rule=Host(`${FRONTEND_HOST}`)"
        - "traefik.http.routers.frontend.entrypoints=websecure"
        - "traefik.http.routers.frontend.tls.certresolver=myhttpchallenge"
        - "traefik.http.services.frontend.loadbalancer.server.port=80"
        - "traefik.docker.network=ingress"
      replicas: 1
      resources:
        limits:
          memory: ${FRONTEND_LIMITS_MEMORY}
          cpus: ${FRONTEND_LIMITS_cpuS}
        reservations:
          memory: ${FRONTEND_RESERVATION_MEMORY}
          cpus: ${FRONTEND_RESERVATION_cpuS}
      restart_policy:
        condition: any
    networks:
      - ingress

在不同节点上重新创建服务时,某些操作失败,即使只有一个管理者/工人节点,我也得到相同的结果。

其余的似乎工作正常。举例来说,如果我扩展服务,它将很好地工作。

enter image description here

新编辑

再做一次测试。

  • 这次我只有traefik和前端两个服务。
  • traefik的一个实例
  • 4个前端实例
  • 两个节点(一个经理和一个工人)
  • 耗尽的工作程序节点和在耗尽的节点上运行的前端实例已移至管理器节点
  • 激活工作节点
  • 在管理器节点上杀死了docker service update cords_frontend --force和两个前端实例,并使其在工作节点上运行。

因此,对于只有两个服务的测试,一切正常。

对服务和堆栈的数量是否有任何限制?

任何线索为什么会这样?

谢谢

雨果

解决方法

我相信您可能会遇到资源预留问题。您提到可用节点有很多资源,但是保留的工作方式是,如果服务不能保留指定的资源,则不会调度服务,非常重要的一点是要注意,这与服务有多少资源无关实际使用。这意味着,如果您指定保留,则基本上是在说服务将保留该数量的资源,而这些资源不可用于其他服务。因此,如果您所有的服务都具有相似的保留,那么您可能会遇到这样的情况,即使该节点显示了可用资源,这些资源实际上仍由现有服务保留。因此,我建议您删除保留部分并尝试查看是否确实在发生这种情况。

,

所以我还在为此苦苦挣扎。

回顾一下,我在docker stack上有13个以swarm模式运行的服务,其中有两个节点(manager + worker)。每个节点具有4个核心和8GB RAM(Ubuntu 18.04,docker 19.03.12)。 如果某个节点死亡,或者我耗尽了一个节点,则在该节点上运行的所有服务都会死亡,并标记为Removed。如果我只运行docker service update front_end --force,该服务也会消失,并标记为已删除。

另一个重要的细节是,如果我总结了13个服务中所有的保留内存和内核,最终得到1.9个内核和4GB RAM,那么每个节点的资源就会更少。

我在容器,服务或堆栈日志上看不到任何内存不足。另外,通过使用htop工具,我可以看到内存使用情况在Manager节点上使用647MB / 7.79GB,在worker节点上使用2GB / 7.79GB。

这是我到目前为止尝试过的:

  • 将13个服务分为两个不同的堆栈。没有运气。
  • 从撰写文件中删除了所有保留标签。没有运气。
  • 尝试在3个节点上运行。没有运气。
  • 我看到此警告WARNING: No swap limit support,所以我在两个节点enter link description here上遵循了本文档中的建议。没有运气。
  • 将两个节点资源都升级到8核和16BG RAM。没有运气。
  • 尝试一次启动每个服务,我注意到10个或更多服务开始表现不佳。就是说,如果我有多达9个服务在运行,一切都会正常,然后我看到上面描述的行为。

此外,我启用了docker的调试模式以查看发生了什么。这是输出。

如果我运行docker service update front_end --force并且front_end服务死亡,则这是输出形式docker events

service update k6a7go4uhexb4b1u1fp98dtke (name=frontend)
service update k6a7go4uhexb4b1u1fp98dtke (name=frontend,updatestate.new=updating)
service remove k6a7go4uhexb4b1u1fp98dtke (name=frontend)

journalctl -fu docker.service上的日志

level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="form data: {\"EndpointSpec\":{\"Mode\":\"vip\"},\"Labels\":{\"com.docker.stack.image\":\"registry.gitlab.com/devteam/.frontend:1.0.5\",\"com.docker.stack.namespace\":\"\",\"traefik.docker.lbswarm\":\"true\",\"traefik.docker.network\":\"net\",\"traefik.enable\":\"true\",\"traefik.http.routers.frontend.entrypoints\":\"websecure\",\"traefik.http.routers.frontend.rule\":\"Host(`www.frontend.website`)\",\"traefik.http.routers.frontend.tls.certresolver\":\"myhttpchallenge\",\"traefik.http.services.frontend.loadbalancer.server.port\":\"80\"},\"Mode\":{\"Replicated\":{\"Replicas\":1}},\"Name\":\"frontend\",\"TaskTemplate\":{\"ContainerSpec\":{\"Image\":\"registry.gitlab.com/fdevteam/frontend:1.0.5@sha256:e9a0d88bc14848c3b40c3d2905842313bbc648c1bbf09305f8935f9eb23f289a\",\"Isolation\":\"default\",\"Labels\":{\"com.docker.stack.namespace\":\"f\"},\"Privileges\":{\"CredentialSpec\":null,\"SELinuxContext\":null}},\"ForceUpdate\":1,\"Networks\":[{\"Aliases\":[\"frontend\"],\"Target\":\"w7aqg3stebnmk5c5pbhgslh2d\"}],\"Placement\":{\"Platforms\":[{\"Architecture\":\"amd64\",\"OS\":\"linux\"}]},\"Resources\":{},\"RestartPolicy\":{\"Condition\":\"any\",\"MaxAttempts\":0},\"Runtime\":\"container\"}}"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent UPD 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend nnlg81dsspnj6oxip4iqwwjc3 10.0.1.73 10.0.1.74 [] [frontend] [e661c9f39097] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 p:0xc004a1f880 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{nnlg81dsspnj6oxip4iqwwjc3 } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend rm_service:false suppress:false sAliases:[frontend] tAliases:[e661c9f39097]"
level=debug msg="delContainerNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend.1.lv7tjjaev45pvn0f7qtppb21r"
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend.1.lv7tjjaev45pvn0f7qtppb21r,10.0.1.74,<nil>,true) rmServiceBinding sid:6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 "
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(tasks.frontend,false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=error msg="Error getting service frontend: service frontend not found"
level=debug msg="handleEpTableEvent DEL 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 R:{frontend.1.lv7tjjaev45pvn0f7qtppb21r frontend nnlg81dsspnj6oxip4iqwwjc3 10.0.1.73 10.0.1.74 [] [frontend] [e661c9f39097] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 p:0xc004a1f880 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{nnlg81dsspnj6oxip4iqwwjc3 } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend rm_service:true suppress:false sAliases:[frontend] tAliases:[e661c9f39097]"
level=debug msg="delContainerNameResolution 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 frontend.1.lv7tjjaev45pvn0f7qtppb21r"
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend.1.lv7tjjaev45pvn0f7qtppb21r,false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745 (w7aqg3s).deleteSvcRecords(frontend,10.0.1.73,false) rmServiceBinding sid:nnlg81dsspnj6oxip4iqwwjc3 "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend 6b20c2924ec1eafa20c27d572019207551819b10a2c4f8d0574f2e142274c745"

如果服务没有消失(9个或更少的服务就是这种情况),则输出为:

service update n1wh16ru879699cpv3topcanc (name=frontend)
service update n1wh16ru879699cpv3topcanc (name=frontend,updatestate.new=updating)
service update n1wh16ru879699cpv3topcanc (name=frontend,updatestate.new=completed,updatestate.old=updating)

journalctl -fu docker.service上的日志

level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="form data: {\"EndpointSpec\":{\"Mode\":\"vip\"},\"TaskTemplate\":{\"ContainerSpec\":{\"Image\":\"registry.gitlab.com/devteam/.frontend:1.0.5@sha256:e9a0d88bc14848c3b40c3d2905842313bbc648c1bbf09305f8935f9eb23f289a\",\"Labels\":{\"com.docker.stack.namespace\":\"\"},\"ForceUpdate\":3,\"Runtime\":\"container\"}}"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent UPD e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.46 [] [frontend] [f986fe859440] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 p:0xc005e1fa00 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{n1wh16ru879699cpv3topcanc } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend rm_service:false suppress:false sAliases:[frontend] tAliases:[f986fe859440]"
level=debug msg="delContainerNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo"
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo,10.0.1.46,true) rmServiceBinding sid:e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 "
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(tasks.frontend,false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent DEL e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 R:{frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.46 [] [frontend] [f986fe859440] true}"
level=debug msg="rmServiceBinding from handleEpTableEvent START for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 p:0xc005e1fa00 nid:w7aqg3stebnmk5c5pbhgslh2d sKey:{n1wh16ru879699cpv3topcanc } deleteSvc:true"
level=debug msg="deleteEndpointNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend rm_service:true suppress:false sAliases:[frontend] tAliases:[f986fe859440]"
level=debug msg="delContainerNameResolution e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo"
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend.1.zeq4jz8kzle4c7vtzx5ofbrqo,false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98 (w7aqg3s).deleteSvcRecords(frontend,10.0.1.32,false) rmServiceBinding sid:n1wh16ru879699cpv3topcanc "
level=debug msg="rmServiceBinding from handleEpTableEvent END for frontend e21b861c447ffd78bd2014744c13a146accd4600412c12b8cccfe3f3af4f0b98"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/tasks?filters=%7B%22_up-to-date%22%3A%7B%22true%22%3Atrue%7D%2C%22service%22%3A%7B%22frontend%22%3Atrue%7D%7D"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
level=debug msg="handleEpTableEvent ADD 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 R:{frontend.1.1v9ggahd87x2ydlkna0qx7jmz frontend n1wh16ru879699cpv3topcanc 10.0.1.32 10.0.1.47 [] [frontend] [3671840709bb] false}"
level=debug msg="addServiceBinding from handleEpTableEvent START for frontend 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 p:0xc004a1ed80 nid:w7aqg3stebnmk5c5pbhgslh2d skey:{n1wh16ru879699cpv3topcanc }"
level=debug msg="addEndpointNameResolution 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 frontend add_service:true sAliases:[frontend] tAliases:[3671840709bb]"
level=debug msg="addContainerNameResolution 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 frontend.1.1v9ggahd87x2ydlkna0qx7jmz"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(frontend.1.1v9ggahd87x2ydlkna0qx7jmz,10.0.1.47,true) addServiceBinding sid:521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(tasks.frontend,false) addServiceBinding sid:n1wh16ru879699cpv3topcanc"
level=debug msg="521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0 (w7aqg3s).addSvcRecords(frontend,false) addServiceBinding sid:n1wh16ru879699cpv3topcanc"
level=debug msg="addServiceBinding from handleEpTableEvent END for frontend 521ffeee31efe056900fb5a1fe73007c179594e964f703625cf3272eb14983c0"
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService
...
level=debug msg="Calling GET /v1.40/services/frontend?insertDefaults=false"
level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = service frontend not found" rpc=/docker.swarmkit.v1.Control/GetService