Calico `Socket error: bind: Address not available` after Gitlab kube runner job

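Problem description

I have a Kubernetes cluster on AWS deployed with Kops, not EKS (too expensive for now). The CNI is Calico. I have 4 nodes:

  • 1 master, t3a.medium
  • 2 small workers, t2.micro
  • 1 bigger worker, t3a.medium

The nodes are labelled.

I have a Gitlab Kubernetes runner installed with this Helm chart, and it is configured to run on the big worker. It is also configured to spawn its job pods on the big worker.

It works fine, but after a few days I noticed that sometimes (if not every time) the first job of a pipeline causes an error in the calico-node pod running on the big node. The error is: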

bird: Mesh_10_1_1_136: Socket error: bind: Address not available
bird: Mesh_10_1_1_212: Socket error: bind: Address not available
bird: Mesh_10_1_1_14: Socket error: bind: Address not available

10.1.1.136, 10.1.1.212 and 10.1.1.14 are the IPs of the master node and the 2 smaller nodes. The big node's IP never appears.
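
For anyone checking the same thing on their own cluster, here is a minimal sketch (not part of the original post) that pulls the peer IPs out of the bird error lines. It assumes the protocol names follow the `Mesh_<ip with underscores>` pattern seen in the lines above; the function names are hypothetical.

```python
import re

# Matches the error lines quoted above, capturing the protocol name.
BIND_ERROR = re.compile(
    r"bird: (Mesh_[\d_]+): Socket error: bind: Address not available"
)


def peer_ip_from_mesh_name(name):
    """Turn a protocol name such as 'Mesh_10_1_1_136' into '10.1.1.136'."""
    match = re.fullmatch(r"Mesh_(\d+)_(\d+)_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"not a mesh protocol name: {name!r}")
    return ".".join(match.groups())


def failing_peers(log_text):
    """Return the peer IPs whose mesh session reported a bind error."""
    peers = []
    for line in log_text.splitlines():
        match = BIND_ERROR.search(line)
        if match:
            peers.append(peer_ip_from_mesh_name(match.group(1)))
    return peers
```

Feeding it the output of `kubectl logs` for the calico-node pod on the big node should list the three other nodes and never the big node itself, which matches what is described above.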

So my questions are:

  • What is happening?
  • What can I do to prevent this calico pod from erroring?

Thanks a lot in advance. Cheers.

[Edit] I found these lines in the calico-node logs:

2021-05-24 22:43:26.881 [INFO][49] monitor-addresses/startup.go 576: Node IPv4 changed, will check for conflicts
2021-05-24 22:43:26.907 [WARNING][49] monitor-addresses/startup.go 1107: IPv4 address has changed. This could happen if there are multiple nodes with the same name. node="ip-10-1-1-198.eu-west-1.compute.internal" original="10.1.1.198" updated="192.168.0.1"
2021-05-24 22:43:26.936 [INFO][45] confd/client.go 877: Recompute BGP peerings: HostBGPConfig(node=ip-10-1-1-198.eu-west-1.compute.internal; name=ip_addr_v4) updated; HostBGPConfig(node=ip-10-1-1-198.eu-west-1.compute.internal; name=network_v4) updated
2021-05-24 22:43:26.937 [INFO][53] felix/int_dataplane.go 1325: Received *proto.HostMetadataUpdate update from calculation graph msg=hostname:"ip-10-1-1-198.eu-west-1.compute.internal" ipv4_addr:"192.168.0.1"
2021-05-24 22:43:26.937 [INFO][53] felix/int_dataplane.go 1453: Applying dataplane updates
2021-05-24 22:43:26.937 [INFO][53] felix/ipip_mgr.go 222: All-hosts IP set out-of sync, refreshing it.
2021-05-24 22:43:26.937 [INFO][53] felix/ipsets.go 119: queueing IP set for creation family="inet" setID="all-hosts-net" setType="hash:net"
2021-05-24 22:43:26.941 [INFO][49] monitor-addresses/startup.go 308: Updated node IP addresses
2021-05-24 22:43:26.951 [INFO][53] felix/ipsets.go 749: Doing full IP set rewrite family="inet" numMembersInPendingReplace=4 setID="all-hosts-net"
bird: Mesh_10_1_1_212: Received: Peer de-configured
bird: Mesh_10_1_1_212: State changed to stop
bird: Mesh_10_1_1_212: State changed to down
bird: Mesh_10_1_1_212: Starting
bird: Mesh_10_1_1_212: State changed to start
2021-05-24 22:43:26.971 [INFO][53] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=33.685021
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Reconfigured
2021-05-24 22:43:26.984 [INFO][45] confd/resource.go 277: Target config /etc/calico/confd/config/bird6.cfg has been updated due to change in key: /calico/bgp/v1/host
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Restarting protocol Mesh_10_1_1_136
bird: Mesh_10_1_1_136: Shutting down
bird: Mesh_10_1_1_136: State changed to stop
bird: Restarting protocol Mesh_10_1_1_14
bird: Mesh_10_1_1_14: Shutting down
bird: Mesh_10_1_1_14: State changed to stop
bird: Restarting protocol Mesh_10_1_1_212
bird: Mesh_10_1_1_212: Shutting down
bird: Mesh_10_1_1_212: State changed to stop
bird: Mesh_10_1_1_212: State changed to down
bird: Mesh_10_1_1_212: Initializing
bird: Mesh_10_1_1_212: Starting
bird: Mesh_10_1_1_212: State changed to start
bird: Mesh_10_1_1_136: State changed to down
bird: Mesh_10_1_1_136: Initializing
bird: Mesh_10_1_1_136: Starting
bird: Mesh_10_1_1_136: State changed to start
bird: Mesh_10_1_1_14: State changed to down
bird: Mesh_10_1_1_14: Initializing
bird: Mesh_10_1_1_14: Starting
bird: Mesh_10_1_1_14: State changed to start
bird: Reconfigured
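
The key line in this log seems to be the warning where the node's IPv4 address goes from 10.1.1.198 to 192.168.0.1. One possible reading, which is a guess and not confirmed anywhere in this post, is that calico-node picked up an address from a network created by the docker-compose job, and that bird then fails to bind once that address is gone. The sketch below only helps confirm the symptom: it extracts address changes from the log and flags any updated address outside the node subnet. The subnet `10.1.1.0/24` is an assumption based on the node IPs quoted above, and the function names are hypothetical.

```python
import ipaddress
import re

# Matches the original="..." updated="..." pair in the startup.go warning.
ADDRESS_CHANGE = re.compile(
    r'original="(?P<original>[^"]+)"\s+updated="(?P<updated>[^"]+)"'
)


def find_address_changes(log_text):
    """Return (original, updated) pairs for every address-change line."""
    changes = []
    for line in log_text.splitlines():
        match = ADDRESS_CHANGE.search(line)
        if match:
            changes.append((match.group("original"), match.group("updated")))
    return changes


def outside_subnet(address, subnet):
    """True if the address does not belong to the given subnet."""
    return ipaddress.ip_address(address) not in ipaddress.ip_network(subnet)


def suspicious_changes(log_text, node_subnet="10.1.1.0/24"):
    """Address changes whose new value is outside the assumed node subnet."""
    return [
        (original, updated)
        for original, updated in find_address_changes(log_text)
        if outside_subnet(updated, node_subnet)
    ]
```

If this flags a change right when a pipeline starts, it would be worth checking how calico-node autodetects the node IP (the `IP_AUTODETECTION_METHOD` setting) and which networks the job creates on the host.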

[Edit 2] The Gitlab step that triggers this error is:

integrationtesting:
  tags:
    - kubernetes
  image: docker/compose:alpine-1.29.2
  stage: tests
  before_script:
    - echo "NPM_TOKEN=$NPM_TOKEN" > test_integ/dependencies/.env
    - docker-compose -f test_integ/dependencies/docker-compose.yaml up --build -d
  script:
    - docker-compose -f test_integ/dependencies/tester-compose.yaml up --build --abort-on-container-exit --exit-code-from tester
  after_script:
    - docker-compose -f test_integ/dependencies/docker-compose.yaml -f test_integ/dependencies/tester-compose.yaml down

With docker-compose.yaml:

version: "3.9"
networks:
  testinteg:
    name: testinteg
services:
  mongosrv:
    container_name: "mongosrv"
    image: mongo
    networks:
      - testinteg
  users:
    container_name: "users"
    build:
      context: "../.."
      dockerfile: "Dockerfile"
      target: run
      args:
        NPM_TOKEN: "${NPM_TOKEN}"
      network: host
    environment:
      NODE_ENV: "dev"
      PORT: 80
      LOG_LEVEL: "debug"
      LOG_FORMAT: "splat,simple"
      PASSWORD_JWT_SECRET: "anothersecurestring"
      PASSWORD_JWT_TTL: "30s"
      SSL_ENABLED: "false"
      MOCK_DB: "false"
      MONGO_DB: "users"
      MONGO_HOST: "mongosrv"
    depends_on:
      - mongosrv
    networks:
      - testinteg

And tester-compose.yaml:

version: "3.9"
networks:
  testinteg:
    name: testinteg
services:
  tester:
    container_name: "tester"
    build:
      context: "../.."
      dockerfile: "Dockerfile"
      target: testinteg
      args:
        NPM_TOKEN: "${NPM_TOKEN}"
      network: host
    environment:
      MSHOST: users
      MSPORT: 80
    volumes:
      - ../tests:/app/test_integ/tests
    networks:
      - testinteg

Final info: the Dockerfile runs npm ci against a private npm registry on JFrog Artifactory. Without the network: host option in the build section, the registry's domain cannot be resolved (a Docker-in-Docker issue).
