Out-of-order samples: prometheus and cadvisor

Problem description

Here is the prometheus configmap configuration from my kube cluster.

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
    - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_service_name,__meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'kubernetes-nodes'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
  - job_name: kube-state-metrics
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
    - targets:
      - kube-state-metrics.kube-system.svc.cluster.local:8080
  - job_name: kubernetes-cadvisor
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics/cadvisor
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    follow_redirects: true
    relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    kubernetes_sd_configs:
    - role: node
      follow_redirects: true
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

On top of the cluster I have a Prometheus federation instance that federates the in-cluster prometheus.
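For reference, the federation job on the top-level Prometheus looks roughly like the sketch below; the target address and the match[] selector here are illustrative placeholders, not my exact values:

- job_name: 'federate'
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job=~".+"}'   # pull every series exposed by the in-cluster Prometheus
  static_configs:
  - targets:
    - 'prometheus.monitoring.svc.cluster.local:9090'   # placeholder in-cluster address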

Everything works fine, but on the in-cluster prometheus I get these log messages (debug level enabled).

Extract:

level=debug ts=2021-06-27T11:09:32.130Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_current{container=\"\",device=\"/dev/shm\",id=\"/\",image=\"\",name=\"\",namespace=\"\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.130Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_current{container=\"\",device=\"/run\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/36edd81cdc0bf2f5213054cf0ee4b6bc86328ec4473b879e3049ee0113a32728/shm\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/5918a77ba85e2430cc0f434cde296c80f3f21f25739f73a9a7cf4296c0b2ad4d/shm\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/5fa3d0a4389e13f8a96a8ff74e22172dd6ffb5c92e5e692a17e3a346660b49c5/shm\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.131Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_current{container=\"\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/bc60ea15258c46d3c4cca2e9b28ed608ca89b26ce5b14f2bdb6313d87d762e3b/shm\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/bcb7f220797f48216cb0017b0cce1398ef0a9d377f66fa1f8a742241f9133567/shm\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/cc4d8fe8e9051bfddc228dda7d272be7514c9cb93f4d1b2d98c9d632c63dfc8a/shm\",device=\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d9033c81c8cb4f0419b4f5ac7f1c14e0d1bb706820f46dc5a98b9db7944b2b08/shm\",device=\"/run/lock\",device=\"/run/user/0\",device=\"/run/user/1001\",device=\"/sys/fs/cgroup\",device=\"/var/lib/kubelet/pods/01a67077-3639-41e4-9708-0a3bf1fe5acf/volumes/kubernetes.io~secret/flannel-token-ts79t\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.133Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_current{container=\"\",device=\"overlay_0-48\",device=\"overlay_0-53\",device=\"overlay_0-56\",device=\"overlay_0-62\",device=\"overlay_0-68\",device=\"overlay_0-79\",device=\"overlay_0-93\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.134Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_time_seconds_total{container=\"\",device=\"/dev/mapper/debian--vg-root\",device=\"/dev/sda1\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.135Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_time_seconds_total{container=\"\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.136Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_time_seconds_total{container=\"\",device=\"/var/lib/kubelet/pods/0c3c9be9-cc89-4e6e-93f4-e87c9356ad42/volumes/kubernetes.io~secret/kube-proxy-token-59vcr\",device=\"/var/lib/kubelet/pods/c05ffb01-231a-43d4-9941-e959ba521f52/volumes/kubernetes.io~secret/x509-certificate-exporter-node-token-m6w4w\",device=\"/var/lib/kubelet/pods/d57f59ba-6ef6-4cd5-84cf-c1e3a2f79433/volumes/kubernetes.io~secret/default-token-zwtjf\",device=\"overlay_0-115\",device=\"overlay_0-121\",device=\"overlay_0-145\",device=\"overlay_0-151\",device=\"overlay_0-157\",pod=\"\"}"
level=debug ts=2021-06-27T11:09:32.137Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_fs_io_time_seconds_total{container=\"\",device=\"overlay_0-164\",device=\"overlay_0-165\",device=\"overlay_0-188\",device=\"overlay_0-44\",pod=\"\"}"
level=warn ts=2021-06-27T11:09:32.149Z caller=scrape.go:1467 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Error on ingesting out-of-order samples" num_dropped=303
level=debug ts=2021-06-27T11:09:47.098Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_load_average_10s{container=\"\",id=\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e4fae43df4163b63617776dc1321fe0.slice\",namespace=\"kube-system\",pod=\"kube-controller-manager-master2\"}"
level=debug ts=2021-06-27T11:09:47.098Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_load_average_10s{container=\"\",id=\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod97a02ba4a6b5572917c3b834d347981b.slice\",pod=\"etcd-master2\"}"
level=debug ts=2021-06-27T11:09:47.099Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_system_seconds_total{container=\"\",pod=\"kube-controller-manager-master2\"}"
level=debug ts=2021-06-27T11:09:47.099Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_system_seconds_total{container=\"\",pod=\"etcd-master2\"}"
level=debug ts=2021-06-27T11:09:47.099Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_user_seconds_total{container=\"\",pod=\"kube-controller-manager-master2\"}"
level=debug ts=2021-06-27T11:09:47.100Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_cpu_user_seconds_total{container=\"\",pod=\"etcd-master2\"}"
level=debug ts=2021-06-27T11:09:47.100Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_file_descriptors{container=\"\",pod=\"kube-controller-manager-master2\"}"
level=debug ts=2021-06-27T11:09:47.100Z caller=scrape.go:1511 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://10.10.10.61:10250/metrics/cadvisor msg="Out of order sample" series="container_file_descriptors{container=\"\",pod=\"etcd-master2\"}"

I suspect Cadvisor is duplicating the metrics, but I don't see the duplicates.

Kubernetes: 1.20, Prometheus: 2.27.1

Solution

You can try adding an arbitrary label to these metrics to distinguish which node they belong to:

- job_name: kubernetes-cadvisor
  [...]
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: node_name
  [...]

I doubt it will solve everything, but it should help with the device-related series such as /run/lock, /run/user/0, /dev/shm, /dev/sda1, /dev/mapper/debian.+, overlay_0-[0-9]+, ... since those are most likely present on all of your nodes.
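For reference, with that rule merged into the kubernetes-cadvisor job from your question, the job would look roughly like this (only the node_name rule is new; everything else is copied from your config):

- job_name: kubernetes-cadvisor
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics/cadvisor
  scheme: https
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  follow_redirects: true
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - action: replace                                 # added: tag every series with its node
    source_labels: [__meta_kubernetes_node_name]
    target_label: node_name
  kubernetes_sd_configs:
  - role: node
    follow_redirects: true

Each node's series then carries its own node_name value, so the device-related series from different nodes no longer share an identical label set.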

Once that's in place, let us know which ones are still showing up.