Problem description
I am running a web service backend application in Kubernetes (GKE). It is used only by our frontend web application, and there are typically sequences of dozens of requests from the same user (ClientIP). The app is configured to run at least 2 instances ("minReplicas: 2").
The problem:
From the logs I can see situations where one pod is overloaded (receiving many requests) while the other one is idle. Both pods are in the Ready state.
What I tried: I added a custom readiness health check that returns "Unhealthy" when there are too many open connections. But even after the health check starts returning "Unhealthy", the load balancer keeps sending more requests to the same pod while the second (healthy) pod stays idle.
Here is an excerpt from service.yaml:
kind: Service
metadata:
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
sessionAffinity is not specified, so I expect it to default to "None".
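For completeness, this is how it would look if set explicitly (a sketch only, it is not in my actual manifest; ClientIP is shown just for contrast with the default):

spec:
  sessionAffinity: None        # the default: no affinity, any pod may receive a connection
  # sessionAffinity: ClientIP  # would instead pin all connections from one client IP to a single pod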
My questions: What am I doing wrong? Does the readiness health check have any effect on the load balancer at all? How can I control how requests are distributed?
Additional information:
Cluster creation:
gcloud container --project %PROJECT% clusters create %CLUSTER%
--zone "us-east1-b" --release-channel "stable" --machine-type "n1-standard-2"
--disk-type "pd-ssd" --disk-size "20" --metadata disable-legacy-endpoints=true
--scopes "storage-rw" --num-nodes "1" --enable-stackdriver-kubernetes
--enable-ip-alias --network "xxx" --subnetwork "xxx"
--cluster-secondary-range-name "xxx" --services-secondary-range-name "xxx"
--no-enable-master-authorized-networks
Node pool:
gcloud container node-pools create XXX --project %PROJECT% --zone="us-east1-b"
--cluster=%CLUSTER% --machine-type=c2-standard-4 --max-pods-per-node=16
--num-nodes=1 --disk-type="pd-ssd" --disk-size="10" --scopes="storage-full"
--enable-autoscaling --min-nodes=1 --max-nodes=30
Service:
apiVersion: v1
kind: Service
metadata:
  name: XXX
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
  labels:
    app: XXX
    version: v0.1
spec:
  selector:
    app: XXX
    version: v0.1
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
HPA:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: XXX
spec:
  scaleTargetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: XXX
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 40
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: XXX
  labels:
    app: XXX
    version: v0.1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: XXX
      version: v0.1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: XXX
        version: v0.1
    spec:
      containers:
        - image: XXX
          name: XXX
          imagePullPolicy: Always
          resources:
            requests:
              memory: "10Gi"
              cpu: "3200m"
            limits:
              memory: "10Gi"
              cpu: "3600m"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 8
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
      nodeSelector:
        cloud.google.com/gke-nodepool: XXX
Solution
Posting this community wiki answer to expand on the comment I made about the reproduction steps.
I replicated your setup but was unable to reproduce the issue you are running into: the requests were distributed evenly. Since I used a plain nginx image, all tests showed roughly 50% usage/balance per pod (based on the containers' logs and their CPU usage). Could you check whether the same happens in your setup with an nginx image?
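For reference, the per-pod CPU usage mentioned above can be sampled with kubectl top (a sketch; it assumes the metrics server that GKE enables by default):

$ kubectl top pods -l app=nginx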
Reproduction steps I followed:
- Run the following script, which creates the network, subnet, and cluster, and adds a node pool:
project_id="INSERT_PROJECT_ID_HERE"
zone="us-east1-b"
region="us-east1"
gcloud compute networks create vpc-network --project=$project_id --subnet-mode=auto --mtu=1460 --bgp-routing-mode=regional
gcloud compute firewall-rules create vpc-network-allow-icmp --project=$project_id --network=projects/$project_id/global/networks/vpc-network --description=Allows\ ICMP\ connections\ from\ any\ source\ to\ any\ instance\ on\ the\ network. --direction=INGRESS --priority=65534 --source-ranges=0.0.0.0/0 --action=ALLOW --rules=icmp
gcloud compute firewall-rules create vpc-network-allow-internal --project=$project_id --network=projects/$project_id/global/networks/vpc-network --description=Allows\ connections\ from\ any\ source\ in\ the\ network\ IP\ range\ to\ any\ instance\ on\ the\ network\ using\ all\ protocols. --direction=INGRESS --priority=65534 --source-ranges=10.128.0.0/9 --action=ALLOW --rules=all
gcloud compute firewall-rules create vpc-network-allow-rdp --project=$project_id --network=projects/$project_id/global/networks/vpc-network --description=Allows\ RDP\ connections\ from\ any\ source\ to\ any\ instance\ on\ the\ network\ using\ port\ 3389. --direction=INGRESS --priority=65534 --source-ranges=0.0.0.0/0 --action=ALLOW --rules=tcp:3389
gcloud compute firewall-rules create vpc-network-allow-ssh --project=$project_id --network=projects/$project_id/global/networks/vpc-network --description=Allows\ TCP\ connections\ from\ any\ source\ to\ any\ instance\ on\ the\ network\ using\ port\ 22. --direction=INGRESS --priority=65534 --source-ranges=0.0.0.0/0 --action=ALLOW --rules=tcp:22
gcloud compute networks subnets update vpc-network --region=$region --add-secondary-ranges=service-range=10.1.0.0/16,pods-range=10.2.0.0/16
gcloud container --project $project_id clusters create cluster --zone $zone --release-channel "stable" --machine-type "n1-standard-2" --disk-type "pd-ssd" --disk-size "20" --metadata disable-legacy-endpoints=true --scopes "storage-rw" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "vpc-network" --subnetwork "vpc-network" --cluster-secondary-range-name "pods-range" --services-secondary-range-name "service-range" --no-enable-master-authorized-networks
gcloud container node-pools create second-pool --project $project_id --zone=$zone --cluster=cluster --machine-type=n1-standard-4 --max-pods-per-node=16 --num-nodes=1 --disk-type="pd-ssd" --disk-size="10" --scopes="storage-full" --enable-autoscaling --min-nodes=1 --max-nodes=5
gcloud container clusters get-credentials cluster --zone=$zone --project=$project_id
# n1-standard-4 used rather than c2-standard-4
- Schedule the workload on the cluster with the following manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
          imagePullPolicy: Always
          resources:
            requests:
              memory: "10Gi"
              cpu: "3200m"
            limits:
              memory: "10Gi"
              cpu: "3200m"
      nodeSelector:
        cloud.google.com/gke-nodepool: second-pool
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
  labels:
    app: nginx
spec:
  selector:
    app: nginx
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-cluster-default-pool-XYZ Ready <none> 3h25m v1.18.17-gke.1901
gke-cluster-second-pool-one Ready <none> 83m v1.18.17-gke.1901
gke-cluster-second-pool-two Ready <none> 83m v1.18.17-gke.1901
gke-cluster-second-pool-three Ready <none> 167m v1.18.17-gke.1901
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7db7cf7c77-4ttqb 1/1 Running 0 85m 10.2.1.6 gke-cluster-second-pool-three <none> <none>
nginx-7db7cf7c77-dtwc8 1/1 Running 0 85m 10.2.1.34 gke-cluster-second-pool-two <none> <none>
nginx-7db7cf7c77-r6wv2 1/1 Running 0 85m 10.2.1.66 gke-cluster-second-pool-one <none> <none>
Testing was done from a VM in the same region that has access to the internal load balancer.
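The INTERNAL_LB_IP_ADDRESS used below is the address assigned to the nginx Service; a sketch of looking it up:

$ kubectl get service nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'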
Tool/command used:
$ ab -n 100000 http://INTERNAL_LB_IP_ADDRESS/
The logs showed the number of requests handled by each pod accordingly (a sketch of counting them follows the table):
Name | Number of requests
--- | ---
nginx-7db7cf7c77-4ttqb | ~33454
nginx-7db7cf7c77-dtwc8 | ~33208
nginx-7db7cf7c77-r6wv2 | ~33338
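The per-pod counts above were taken from the nginx access logs; a sketch of counting them for the pods listed earlier:

$ kubectl logs nginx-7db7cf7c77-4ttqb | grep -c 'GET /'
$ kubectl logs nginx-7db7cf7c77-dtwc8 | grep -c 'GET /'
$ kubectl logs nginx-7db7cf7c77-r6wv2 | grep -c 'GET /'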
With an internal load balancer, traffic should be distributed evenly between the backends (by default it uses the CONNECTION balancing mode).
There can be many reasons for uneven traffic distribution:
- A replica of the application is not in a Ready state.
- A Node is in an unhealthy state.
- The application is keeping connections open (see the sketch after this list).
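Because the internal load balancer balances connections rather than individual requests, an application or client that keeps connections open can make per-pod request counts look uneven even though connections themselves are spread evenly. A sketch of observing this with the same ab tool (-k enables HTTP keep-alive, so many requests reuse a few TCP connections):

$ ab -n 100000 http://INTERNAL_LB_IP_ADDRESS/      # a new connection per request: requests spread across pods
$ ab -n 100000 -k http://INTERNAL_LB_IP_ADDRESS/   # keep-alive: requests reuse connections and can favor one pod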
It could be useful to check whether the same thing happens in a different scenario (different cluster, different image, etc.).
It is also a good idea to check the details of the Pods serving the Service in the Cloud Console:
- Cloud Console (Web UI) -> Kubernetes Engine -> Services & Ingress -> SERVICE_NAME -> Serving pods
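The same information is available from the command line; a sketch (SERVICE_NAME is the placeholder used above):

$ kubectl get endpoints SERVICE_NAME      # pod IPs currently registered as Ready backends of the Service
$ kubectl describe service SERVICE_NAME   # selector, endpoints, session affinity and load balancer details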