Node Exporter

Contents
- Basic CPU / Mem / Disk Info
- Basic CPU / Mem / Disk Gauge
- Basic CPU / Mem Graph
- Basic Net / Disk Info
- CPU Memory Net Disk
- Memory Detail Meminfo (/proc/meminfo)
- Memory Detail Vmstat
- Memory Detail Vmstat Counters
- System Detail
- Disk Detail (/proc/diskstats)
- FileSystem Detail (/proc/filesystems)
- Network Traffic Detail (/proc/net/dev)
- Network Sockstat (/proc/net/sockstat)
- Network Netstat TCP (/proc/net/snmp)
- Network Netstat TCP Linux MIPs
- Network Netstat UDP (/proc/net/snmp)
- Network Netstat ICMP (/proc/net/snmp)
Basic CPU / Mem / Disk Info

1. CPU Cores: number of CPU cores. The shell check is cat /proc/cpuinfo | grep "cpu cores" | uniq; the query itself counts distinct values of the cpu label, that is, logical CPUs.
   type: Singlestat, Unit: short
   metrics: count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))

2. Total RAM: total memory size. Shell check: cat /proc/meminfo | grep MemTotal
   type: Singlestat, Unit: bytes
   metrics: node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"}

3. Total SWAP: total size of the swap area. Shell check: cat /proc/swaps
   type: Singlestat, Unit: bytes
   metrics: node_memory_SwapTotal_bytes{instance=~"$node:$port",job=~"$job"}

4. Total RootFS: total size of the root filesystem.
   type: Singlestat, Unit: bytes
   metrics: node_filesystem_size_bytes{instance=~"$node:$port",job=~"$job",mountpoint="/",fstype!="rootfs"}

5. System Load (1m avg): system load over the last minute, the first column of cat /proc/loadavg. On a single-core CPU, a load below 1 means no task is waiting, a load of 1 means the system has no spare capacity to run more processes, and a load above 1 means processes are queuing for resources.
   type: Singlestat, Unit: short
   metrics: node_load1{instance=~"$node:$port",job=~"$job"}

6. Uptime: how long the system has been running.
   type: Singlestat, Unit: seconds (s)
   metrics: node_time_seconds{instance=~"$node:$port",job=~"$job"} - node_boot_time_seconds{instance=~"$node:$port",job=~"$job"}
   node_time_seconds: current system time
   node_boot_time_seconds: system boot time
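As a rough illustration of what the CPU Cores and Uptime queries compute, here is a minimal Python sketch. The function names and the sample label sets are hypothetical and only mimic the PromQL logic; they are not part of Node Exporter or Prometheus.

```python
from typing import Iterable, Mapping


def count_cpus(series_labels: Iterable[Mapping[str, str]]) -> int:
    """Mimic count(count(...) by (cpu)): count distinct values of the cpu label."""
    return len({labels["cpu"] for labels in series_labels})


def uptime_seconds(node_time_seconds: float, node_boot_time_seconds: float) -> float:
    """Mimic node_time_seconds - node_boot_time_seconds."""
    return node_time_seconds - node_boot_time_seconds


# Hypothetical label sets for node_cpu_seconds_total: 2 CPUs x 2 modes.
samples = [
    {"cpu": "0", "mode": "idle"},
    {"cpu": "0", "mode": "user"},
    {"cpu": "1", "mode": "idle"},
    {"cpu": "1", "mode": "user"},
]

print(count_cpus(samples))
print(uptime_seconds(1_700_000_600.0, 1_700_000_000.0))
```

The inner count groups the per-mode series by cpu, and the outer count then counts those groups, which is why a set of cpu label values is enough here.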
Basic CPU / Mem / Disk Gauge

1. CPU Busy: share of time that all CPU cores together spend busy.
   type: Singlestat, Unit: percent (0-100), max: 100%
   Formula: (number of cores - idle rate summed over all cores) / number of cores * 100. Note that irate(...[5m]) takes the last two samples inside the 5-minute window; it is not a 5-minute average.
   metrics: (((count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))

2. Used RAM Memory: memory usage, comparable to free -m.
   type: Singlestat, Unit: percent (0-100)
   Used memory including Buffers and Cached:
   metrics: ((node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_MemFree_bytes{instance=~"$node:$port",job=~"$job"}) / (node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"} )) * 100
   node_memory_MemFree_bytes: free memory
   Used memory excluding Buffers and Cached:
   metrics: 100 - ((node_memory_MemAvailable_bytes{instance=~"$node:$port",job=~"$job"} * 100) / node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"})
   MemAvailable is roughly Free + Buffers + Cached minus the part that cannot be reclaimed. The non-reclaimable part includes shared memory segments, tmpfs, ramfs, and so on.

3. Used SWAP: swap usage.
   type: Singlestat, Unit: percent (0-100)
   metrics: ((node_memory_SwapTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_SwapFree_bytes{instance=~"$node:$port",job=~"$job"}) / (node_memory_SwapTotal_bytes{instance=~"$node:$port",job=~"$job"} )) * 100
   node_memory_SwapFree_bytes: free swap space

4. Used Root FS: root filesystem usage.
   type: Singlestat, Unit: percent (0-100)
   metrics: 100 - ((node_filesystem_avail_bytes{instance=~"$node:$port",job=~"$job",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance=~"$node:$port",job=~"$job",mountpoint="/",fstype!="rootfs"})
   node_filesystem_avail_bytes: space available on the filesystem

5. CPU System Load (1m avg): 1-minute load average as a percentage of the number of CPU cores.
   type: Singlestat, Unit: percent (0-100)
   metrics: avg(node_load1{instance=~"$node:$port",job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)) * 100
   node_load1: system load over the last minute

6. CPU System Load (5m avg): 5-minute load average as a percentage of the number of CPU cores.
   type: Singlestat, Unit: percent (0-100)
   metrics: avg(node_load5{instance=~"$node:$port",job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)) * 100
   node_load5: system load over the last 5 minutes
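The gauge formulas above reduce to simple arithmetic. The following Python sketch reproduces three of them with made-up input values; the function names are hypothetical and exist only for this illustration.

```python
from typing import Sequence


def cpu_busy_percent(idle_rate_per_cpu: Sequence[float]) -> float:
    """(cores - summed idle rate) * 100 / cores, as in the CPU Busy gauge."""
    cores = len(idle_rate_per_cpu)
    return (cores - sum(idle_rate_per_cpu)) * 100 / cores


def used_ram_percent(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """100 - MemAvailable * 100 / MemTotal (usage excluding buffers and cache)."""
    return 100 - mem_available_bytes * 100 / mem_total_bytes


def load_percent(load_average: float, cores: int) -> float:
    """Load average expressed as a percentage of the core count."""
    return load_average / cores * 100


# Assumed values: four cores with idle rates in CPU-seconds per second.
print(cpu_busy_percent([0.5, 0.75, 1.0, 0.75]))
print(used_ram_percent(8000, 2000))
print(load_percent(2.0, 4))
```

A load percentage above 100 simply means the load average exceeds the number of cores, so the value is not capped the way the CPU Busy gauge is.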
Basic CPU / Mem Graph

1. CPU Basic: basic CPU information from /proc/stat.
   type: Graph, Unit: short
   Busy System - share of time the CPU spends in kernel mode
   metrics: sum by (instance)(rate(node_cpu_seconds_total{mode="system",instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Busy User - share of time the CPU spends in user mode
   metrics: sum by (instance)(rate(node_cpu_seconds_total{mode='user',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Busy Iowait - share of time the CPU spends waiting for I/O
   metrics: sum by (instance)(rate(node_cpu_seconds_total{mode='iowait',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Busy IRQs - share of time the CPU spends servicing interrupts (irq and softirq)
   metrics: sum by (instance)(rate(node_cpu_seconds_total{mode=~".*irq",instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Idle - share of time the CPU is idle
   metrics: sum by (mode)(rate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Busy Other - share of time the CPU spends in any other state (not system, user, iowait, idle, or interrupt)
   metrics: sum (rate(node_cpu_seconds_total{mode!='idle',mode!='user',mode!='system',mode!='iowait',mode!='irq',mode!='softirq',instance=~"$node:$port",job=~"$job"}[5m])) * 100

2. Memory Basic: basic memory information.
   type: Graph, Unit: short
   RAM Total - total memory size
   metrics: node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"}
   RAM Used - memory in use (total memory - free memory - memory held by Buffers and Cached)
   metrics: node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_MemFree_bytes{instance=~"$node:$port",job=~"$job"} - (node_memory_Cached_bytes{instance=~"$node:$port",job=~"$job"} + node_memory_Buffers_bytes{instance=~"$node:$port",job=~"$job"})
   RAM Cache + Buffer - memory held by Cached and Buffers
   metrics: node_memory_Cached_bytes{instance=~"$node:$port",job=~"$job"} + node_memory_Buffers_bytes{instance=~"$node:$port",job=~"$job"}
   RAM Free - free memory
   metrics: node_memory_MemFree_bytes{instance=~"$node:$port",job=~"$job"}
   SWAP Used - swap space in use (total swap size - free swap size)
   metrics: (node_memory_SwapTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_SwapFree_bytes{instance=~"$node:$port",job=~"$job"})
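The per-mode CPU percentages come from differences between counter readings. A minimal sketch, assuming two hypothetical readings of node_cpu_seconds_total for a single core taken 60 seconds apart (real rate() also handles counter resets and extrapolation, which this ignores):

```python
from typing import Dict


def mode_percentages(
    first: Dict[str, float], second: Dict[str, float], interval_seconds: float
) -> Dict[str, float]:
    """Percent of the interval spent in each mode, from two counter readings."""
    return {
        mode: (second[mode] - first[mode]) * 100 / interval_seconds
        for mode in second
    }


# Hypothetical cumulative CPU seconds per mode for one core.
first = {"user": 100.0, "system": 50.0, "idle": 800.0, "iowait": 10.0}
second = {"user": 115.0, "system": 56.0, "idle": 836.0, "iowait": 13.0}

percentages = mode_percentages(first, second, 60)
for mode, value in percentages.items():
    print(mode, value)
```

With one core the percentages add up to 100. With several cores, summing over cores as the panel does gives values up to 100 times the core count.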
Basic Net / Disk Info

1. Network Traffic Basic: basic network traffic for each interface.
   type: Graph, Unit: bytes
   recv {{device}} - bytes received on each network interface, for example:
     recv lo: local loopback interface
     recv eth0: Ethernet interface
     recv docker0: docker0 network interface
   metrics: rate(node_network_receive_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])
   trans {{device}} - bytes transmitted on each network interface
   metrics: rate(node_network_transmit_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])

2. Disk Space Used Basic: percentage of disk space used on every mounted filesystem.
   type: Graph, Unit: percent (0-100)
   metrics: 100 - ((node_filesystem_avail_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'} * 100) / node_filesystem_size_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'})
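To make the Disk Space Used Basic query concrete, the sketch below applies the same arithmetic to a hand-written list of filesystems. The dictionary keys and sample values are assumptions made for the example, not Node Exporter output.

```python
from typing import Dict, List, Union

Filesystem = Dict[str, Union[str, float]]


def disk_used_percent(filesystems: List[Filesystem]) -> Dict[str, float]:
    """100 - avail * 100 / size per mountpoint, skipping device "rootfs"."""
    result = {}
    for fs in filesystems:
        # device!~'rootfs' is a fully anchored regex, so it excludes exactly "rootfs".
        if fs["device"] == "rootfs":
            continue
        result[fs["mountpoint"]] = 100 - fs["avail_bytes"] * 100 / fs["size_bytes"]
    return result


# Hypothetical filesystems; the rootfs entry duplicates "/" and is filtered out.
filesystems = [
    {"device": "rootfs", "mountpoint": "/", "size_bytes": 1000, "avail_bytes": 400},
    {"device": "/dev/sda1", "mountpoint": "/", "size_bytes": 1000, "avail_bytes": 400},
    {"device": "/dev/sdb1", "mountpoint": "/data", "size_bytes": 2000, "avail_bytes": 500},
]

usage = disk_used_percent(filesystems)
print(usage)
```

Filtering out rootfs avoids plotting the root filesystem twice on hosts that report both the rootfs pseudo-device and the real block device.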
CPU Memory Net Disk

1. CPU
   type: Graph, Unit: short, max: "100", min: "0", Label: Percentage
   System - share of CPU time spent running processes in kernel mode
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode="system",instance=~"$node:$port",job=~"$job"}[5m])) * 100
   User - share of CPU time spent running normal processes in user mode
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='user',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Nice - share of CPU time spent running niced processes in user mode
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='nice',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Idle - share of CPU time spent idle
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Iowait - share of CPU time spent waiting for I/O
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='iowait',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Irq - share of CPU time spent servicing hardware interrupts
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='irq',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Softirq - share of CPU time spent servicing soft interrupts
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='softirq',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Steal - when running inside a VM, share of this VM's CPU time taken by other VMs
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='steal',instance=~"$node:$port",job=~"$job"}[5m])) * 100
   Guest - share of CPU time spent running guest VMs
   metrics: sum by (mode)(irate(node_cpu_seconds_total{mode='guest',instance=~"$node:$port",job=~"$job"}[5m])) * 100

2. Memory Stack: memory breakdown from /proc/meminfo.
   type: Graph, Unit: bytes, min: "0", Label: Bytes
   Apps - memory used by user-space applications
   metrics: node_memory_MemTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_MemFree_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_Buffers_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_Cached_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_Slab_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_PageTables_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_SwapCached_bytes{instance=~"$node:$port",job=~"$job"}
   PageTables - memory used to map between virtual and physical memory addresses
   metrics: node_memory_PageTables_bytes{instance=~"$node:$port",job=~"$job"}
   SwapCache - memory that tracks pages fetched back from swap but not yet modified
   metrics: node_memory_SwapCached_bytes{instance=~"$node:$port",job=~"$job"}
   Slab - memory the kernel uses to cache data structures for its own use (inode and dentry caches, for example)
   metrics: node_memory_Slab_bytes{instance=~"$node:$port",job=~"$job"}
   Cache - cache of frequently accessed file data
   metrics: node_memory_Cached_bytes{instance=~"$node:$port",job=~"$job"}
   Buffers - block device (for example hard disk) cache
   metrics: node_memory_Buffers_bytes{instance=~"$node:$port",job=~"$job"}
   Unused - memory not in use
   metrics: node_memory_MemFree_bytes{instance=~"$node:$port",job=~"$job"}
   Swap - swap space in use
   metrics: (node_memory_SwapTotal_bytes{instance=~"$node:$port",job=~"$job"} - node_memory_SwapFree_bytes{instance=~"$node:$port",job=~"$job"})
   Hardware Corrupted - amount of memory the kernel has identified as corrupted or not working
   metrics: node_memory_HardwareCorrupted_bytes{instance=~"$node:$port",job=~"$job"}

3. Network Traffic: transfer rate of each network interface.
   type: Graph, Unit: bytes/sec, Label: Bytes out(-)/in(+)
   {{device}} - Receive: receive rate of each interface
   metrics: irate(node_network_receive_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])
   {{device}} - Transmit: transmit rate of each interface
   metrics: irate(node_network_transmit_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])

4. Disk Space Used: disk space used on every mounted filesystem.
   type: Graph, Unit: bytes, min: "0", Label: Bytes
   metrics: node_filesystem_size_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'} - node_filesystem_avail_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'}

5. Disk IOps: disk reads and writes.
   type: Graph, Unit: I/O ops/sec (iops), Label: IO read(-)/write(+)
   {{device}} - Reads completed: read operations per second (irate over a 5m window)
   metrics: irate(node_disk_reads_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z]*[a-z]"}[5m])
   {{device}} - Writes completed: write operations per second (irate over a 5m window)
   metrics: irate(node_disk_writes_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z]*[a-z]"}[5m])

6. I/O Usage Read / Write
   type: Graph, Unit: bytes, Label: Bytes read(-)/write(+)
   Bytes read successfully per second (irate over a 5m window)
   metrics: irate(node_disk_read_bytes_total{instance=~"$node:$port",job=~"$job",device=~"[a-z]*[a-z]"}[5m])
   Bytes written successfully per second (irate over a 5m window)
   metrics: irate(node_disk_written_bytes_total{instance=~"$node:$port",job=~"$job",device=~"[a-z]*[a-z]"}[5m])

7. I/O Usage Times: time spent doing I/O. The panel unit is ms, although the underlying metric counts seconds.
   type: Graph, Unit: ms, Label: Milliseconds
   metrics: irate(node_disk_io_time_seconds_total{instance=~"$node:$port",job=~"$job",device=~"[a-z]*[a-z]"} [5m])
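The panels in this section use irate while the basic graphs earlier use rate. The following simplified Python sketch shows how the two differ on the same samples. It deliberately leaves out the counter-reset handling and extrapolation that Prometheus applies, and the sample values are invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample in the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples in the window."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical counter: steady 10 units/s, then a burst in the final minute.
window = [(0, 0.0), (60, 600.0), (120, 1200.0), (180, 1800.0), (240, 6600.0)]

print(simple_rate(window))
print(simple_irate(window))
```

The burst dominates the irate result but is smoothed in the rate result, which is why irate graphs look spikier and react faster, while rate is usually the safer choice for alerting.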
Memory Detail Meminfo (/proc/meminfo)

1. Memory Active / Inactive
   type: Graph, Unit: bytes, Label: Bytes
   Inactive - memory used less recently and reclaimed first (/proc/meminfo Inactive)
   metrics: node_memory_Inactive_bytes{instance=~"$node:$port",job=~"$job"}
   Active - memory used frequently and recently, usually not reclaimed unless absolutely necessary (/proc/meminfo Active)
   metrics: node_memory_Active_bytes{instance=~"$node:$port",job=~"$job"}

2. Memory Committed
   type: Graph, Unit: bytes, Label: Bytes
   Committed_AS - memory currently allocated on the system, including memory allocated but not yet used (/proc/meminfo Committed_AS)
   metrics: node_memory_Committed_AS_bytes{instance=~"$node:$port",job=~"$job"}
   CommitLimit - memory currently available for allocation on the system (/proc/meminfo CommitLimit)
   metrics: node_memory_CommitLimit_bytes{instance=~"$node:$port",job=~"$job"}

3. Memory Active / Inactive Detail
   type: Graph, Unit: bytes, Label: Bytes
   Inactive_file - file-backed pages on the LRU list that have not been accessed for a long time (LRU_INACTIVE_FILE)
   metrics: node_memory_Inactive_file_bytes{instance=~"$node:$port",job=~"$job"}
   Inactive_anon - anonymous pages and swap cache, including tmpfs, not accessed for a long time (LRU_INACTIVE_ANON)
   metrics: node_memory_Inactive_anon_bytes{instance=~"$node:$port",job=~"$job"}
   Active_file - file-backed pages on the LRU list that were accessed recently (LRU_ACTIVE_FILE)
   metrics: node_memory_Active_file_bytes{instance=~"$node:$port",job=~"$job"}
   Active_anon - anonymous pages and swap cache, including tmpfs, accessed recently (LRU_ACTIVE_ANON)
   metrics: node_memory_Active_anon_bytes{instance=~"$node:$port",job=~"$job"}

4. Memory Writeback and Dirty
   type: Graph, Unit: bytes, Label: Bytes
   Writeback - cache pages that are actively being written back to disk (/proc/meminfo Writeback)
   metrics: node_memory_Writeback_bytes{instance=~"$node:$port",job=~"$job"}
   WritebackTmp - memory used by FUSE for temporary writeback buffers (/proc/meminfo WritebackTmp)
   metrics: node_memory_WritebackTmp_bytes{instance=~"$node:$port",job=~"$job"}
   Dirty - amount of data waiting to be written back to disk (/proc/meminfo Dirty)
   metrics: node_memory_Dirty_bytes{instance=~"$node:$port",job=~"$job"}

5. Memory Shared and Mapped
   type: Graph, Unit: bytes, Label: Bytes
   Mapped - memory used by mapped cache pages (/proc/meminfo Mapped)
   metrics: node_memory_Mapped_bytes{instance=~"$node:$port",job=~"$job"}
   Shmem - shared memory (/proc/meminfo Shmem)
   metrics: node_memory_Shmem_bytes{instance=~"$node:$port",job=~"$job"}

6. Memory Slab
   type: Graph, Unit: bytes, Label: Bytes
   SUnreclaim - the part of slab-allocated memory that cannot be reclaimed (/proc/meminfo SUnreclaim)
   metrics: node_memory_SUnreclaim_bytes{instance=~"$node:$port",job=~"$job"}
   SReclaimable - the part of slab-allocated memory that can be reclaimed (/proc/meminfo SReclaimable)
   metrics: node_memory_SReclaimable_bytes{instance=~"$node:$port",job=~"$job"}

7. Memory Vmalloc
   type: Graph, Unit: bytes, Label: Bytes
   VmallocChunk - largest logically contiguous block that vmalloc can allocate (/proc/meminfo VmallocChunk)
   metrics: node_memory_VmallocChunk_bytes{instance=~"$node:$port",job=~"$job"}
   VmallocTotal - total memory available to vmalloc (/proc/meminfo VmallocTotal)
   metrics: node_memory_VmallocTotal_bytes{instance=~"$node:$port",job=~"$job"}
   VmallocUsed - total memory used by vmalloc (/proc/meminfo VmallocUsed)
   metrics: node_memory_VmallocUsed_bytes{instance=~"$node:$port",job=~"$job"}

8. Memory Bounce (/proc/meminfo Bounce)
   type: Graph, Unit: bytes, Label: Bytes
   Bounce - memory used by bounce buffers
   metrics: node_memory_Bounce_bytes{instance=~"$node:$port",job=~"$job"}

9. Memory Anonymous
   type: Graph, Unit: bytes, Label: Bytes
   AnonHugePages - memory used by anonymous huge pages (/proc/meminfo AnonHugePages)
   metrics: node_memory_AnonHugePages_bytes{instance=~"$node:$port",job=~"$job"}
   AnonPages - anonymous memory pages of user processes (/proc/meminfo AnonPages)
   metrics: node_memory_AnonPages_bytes{instance=~"$node:$port",job=~"$job"}

10. Memory Kernel (/proc/meminfo KernelStack)
   type: Graph, Unit: bytes, Label: Bytes
   KernelStack - kernel stack size (resident memory, not reclaimable)
   metrics: node_memory_KernelStack_bytes{instance=~"$node:$port",job=~"$job"}

11. Memory HugePages Counter
   type: Graph, Unit: short, Label: Pages
   HugePages_Free - number of free huge pages currently in the system (/proc/meminfo HugePages_Free)
   metrics: node_memory_HugePages_Free{instance=~"$node:$port",job=~"$job"}
   HugePages_Rsvd - number of huge pages currently reserved: pages a program has requested from the system but that have not actually been allocated yet because the program has not read or written them (/proc/meminfo HugePages_Rsvd)
   metrics: node_memory_HugePages_Rsvd{instance=~"$node:$port",job=~"$job"}
   HugePages_Surp - number of huge pages beyond the configured persistent huge page count (/proc/meminfo HugePages_Surp)
   metrics: node_memory_HugePages_Surp{instance=~"$node:$port",job=~"$job"}

12. Memory HugePages Size
   type: Graph, Unit: bytes, Label: Bytes
   HugePages - total number of huge pages currently in the system (/proc/meminfo HugePages_Total)
   metrics: node_memory_HugePages_Total{instance=~"$node:$port",job=~"$job"}
   Hugepagesize - size of each huge page (/proc/meminfo Hugepagesize)
   metrics: node_memory_Hugepagesize_bytes{instance=~"$node:$port",job=~"$job"}

13. Memory DirectMap (/proc/meminfo DirectMap)
   type: Graph, Unit: bytes, Label: Bytes
   DirectMap1G - amount of memory mapped with 1G pages
   metrics: node_memory_DirectMap1G_bytes{instance=~"$node:$port",job=~"$job"}
   DirectMap2M - amount of memory mapped with 2M pages
   metrics: node_memory_DirectMap2M_bytes{instance=~"$node:$port",job=~"$job"}
   DirectMap4K - amount of memory mapped with 4kB pages
   metrics: node_memory_DirectMap4k_bytes{instance=~"$node:$port",job=~"$job"}

14. Memory Unevictable and Mlocked
   type: Graph, Unit: bytes, Label: Bytes
   Unevictable - memory that cannot be reclaimed (/proc/meminfo Unevictable)
   metrics: node_memory_Unevictable_bytes{instance=~"$node:$port",job=~"$job"}
   Mlocked - memory locked with the mlock() system call (/proc/meminfo Mlocked)
   metrics: node_memory_Mlocked_bytes{instance=~"$node:$port",job=~"$job"}

15. Memory NFS (/proc/meminfo NFS_Unstable)
   type: Graph, Unit: bytes, Label: Bytes
   NFS Unstable - cache pages sent to the NFS server but not yet written to disk
   metrics: node_memory_NFS_Unstable_bytes{instance=~"$node:$port",job=~"$job"}
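The metric names in this section follow the /proc/meminfo field names closely. The Python sketch below approximates that mapping: values with a kB unit are converted to bytes and get a _bytes suffix, values without a unit keep the plain name. This is an approximation written for illustration, not Node Exporter's actual code, and the sample input is made up.

```python
import re
from typing import Dict


def meminfo_to_metrics(text: str) -> Dict[str, float]:
    """Turn /proc/meminfo-style text into node_memory_* style names and values."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # "Active(anon):" becomes "Active_anon".
        key = re.sub(r"\((.*)\)", r"_\1", parts[0].rstrip(":"))
        name = f"node_memory_{key}"
        value = float(parts[1])
        if len(parts) == 3 and parts[2] == "kB":
            value *= 1024
            name += "_bytes"
        metrics[name] = value
    return metrics


# Hypothetical excerpt of /proc/meminfo.
sample = """MemTotal:       16384 kB
Active(anon):    2048 kB
HugePages_Free:        0
"""

metrics = meminfo_to_metrics(sample)
for name, value in metrics.items():
    print(name, value)
```

This explains why HugePages_Free, HugePages_Rsvd and HugePages_Surp appear without a _bytes suffix: they are page counts with no unit in /proc/meminfo.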
Memory Detail Vmstat

Each irate(...[5m]) below is a per-second rate computed from the last two samples inside a 5-minute window. The name after /proc/vmstat is the field the metric is read from.

1. Memory Pages In / Out
   type: Graph, Unit: short, Label: Pages
   Pagesin - rate at which data is read from disk into physical memory (/proc/vmstat pgpgin)
   metrics: irate(node_vmstat_pgpgin{instance=~"$node:$port",job=~"$job"}[5m])
   Pagesout - rate at which data is written from physical memory to disk (/proc/vmstat pgpgout)
   metrics: irate(node_vmstat_pgpgout{instance=~"$node:$port",job=~"$job"}[5m])

2. Memory Pages Swap In / Out
   type: Graph, Unit: short, Label: Pages
   Pswpin - rate at which data is loaded from the swap area on disk into memory (/proc/vmstat pswpin)
   metrics: irate(node_vmstat_pswpin{instance=~"$node:$port",job=~"$job"}[5m])
   Pswpout - rate at which data is moved from memory to the swap area on disk (/proc/vmstat pswpout)
   metrics: irate(node_vmstat_pswpout{instance=~"$node:$port",job=~"$job"}[5m])

3. Memory Page Operations
   type: Graph, Unit: short, Label: Pages
   Pgdeactivate - pages deactivated per second (/proc/vmstat pgdeactivate)
   metrics: irate(node_vmstat_pgdeactivate{instance=~"$node:$port",job=~"$job"}[5m])
   Pgfree - pages freed per second (/proc/vmstat pgfree)
   metrics: irate(node_vmstat_pgfree{instance=~"$node:$port",job=~"$job"}[5m])
   Pgactivate - pages activated per second (/proc/vmstat pgactivate)
   metrics: irate(node_vmstat_pgactivate{instance=~"$node:$port",job=~"$job"}[5m])

4. Memory Page Faults
   type: Graph, Unit: short, Label: Faults
   Pgfault - major and minor page faults per second (/proc/vmstat pgfault)
   metrics: irate(node_vmstat_pgfault{instance=~"$node:$port",job=~"$job"}[5m])
   Pgmajfault - major page faults per second (/proc/vmstat pgmajfault)
   metrics: irate(node_vmstat_pgmajfault{instance=~"$node:$port",job=~"$job"}[5m])
   Pgminfault - minor page faults per second, derived as pgfault minus pgmajfault
   metrics: irate(node_vmstat_pgfault{instance=~"$node:$port",job=~"$job"}[5m]) - irate(node_vmstat_pgmajfault{instance=~"$node:$port",job=~"$job"}[5m])

5. Memory Pages Reclaimed
   type: Graph, Unit: short, Label: Pages
   Kswapd_inodesteal - pages reclaimed by kswapd through inode freeing (/proc/vmstat kswapd_inodesteal)
   metrics: irate(node_vmstat_kswapd_inodesteal{instance=~"$node:$port",job=~"$job"}[5m])
   Pginodesteal - pages reclaimed through inode freeing (/proc/vmstat pginodesteal)
   metrics: irate(node_vmstat_pginodesteal{instance=~"$node:$port",job=~"$job"}[5m])

6. Memory Calls Reclaimed
   type: Graph, Unit: short, Label: Calls
   Pageoutrun - kswapd calls to page reclaim per second (/proc/vmstat pageoutrun)
   metrics: irate(node_vmstat_pageoutrun{instance=~"$node:$port",job=~"$job"}[5m])
   Allocstall - direct reclaim calls per second (/proc/vmstat allocstall)
   metrics: irate(node_vmstat_allocstall{instance=~"$node:$port",job=~"$job"}[5m])
   Zone_reclaim_failed - failed zone reclaims per second (/proc/vmstat zone_reclaim_failed)
   metrics: irate(node_vmstat_zone_reclaim_failed{instance=~"$node:$port",job=~"$job"}[5m])

7. Memory Page Rotate (/proc/vmstat pgrotated)
   type: Graph, Unit: short, Label: Pages
   Pgrotated - pages rotated per second
   metrics: irate(node_vmstat_pgrotated{instance=~"$node:$port",job=~"$job"}[5m])

8. Memory Page Drop
   type: Graph, Unit: short, Label: Calls
   Drop_pagecache - number of calls to drop the page cache; the query plots the raw counter, not a rate (/proc/vmstat drop_pagecache)
   metrics: node_vmstat_drop_pagecache{instance=~"$node:$port",job=~"$job"}
   Drop_slab - number of calls to drop the slab cache; the query plots the raw counter, not a rate (/proc/vmstat drop_slab)
   metrics: node_vmstat_drop_slab{instance=~"$node:$port",job=~"$job"}

9. Memory Scan Slab (/proc/vmstat slabs_scanned)
   type: Graph, Unit: short
   Slabs_scanned - slab objects scanned per second
   metrics: irate(node_vmstat_slabs_scanned{instance=~"$node:$port",job=~"$job"}[5m])

10. Memory Unevictable Pages
   type: Graph, Unit: short, Label: Pages
   Unevictable_pgs_cleared - unevictable pages cleared
   metrics: irate(node_vmstat_unevictable_pgs_cleared{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_culled - unevictable pages culled
   metrics: irate(node_vmstat_unevictable_pgs_culled{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_mlocked - unevictable pages mlocked
   metrics: irate(node_vmstat_unevictable_pgs_mlocked{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_munlocked - unevictable pages munlocked
   metrics: irate(node_vmstat_unevictable_pgs_munlocked{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_rescued - unevictable pages rescued
   metrics: irate(node_vmstat_unevictable_pgs_rescued{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_scanned - unevictable pages scanned
   metrics: irate(node_vmstat_unevictable_pgs_scanned{instance=~"$node:$port",job=~"$job"}[5m])
   Unevictable_pgs_stranded - unevictable pages stranded
   metrics: irate(node_vmstat_unevictable_pgs_stranded{instance=~"$node:$port",job=~"$job"}[5m])

11. Memory Page Allocation
   type: Graph, Unit: short, Label: Pages
   Pgalloc_dma - pages allocated per second in the DMA zone (/proc/vmstat pgalloc_dma)
   metrics: irate(node_vmstat_pgalloc_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgalloc_dma32 - pages allocated per second in the DMA32 zone (/proc/vmstat pgalloc_dma32)
   metrics: irate(node_vmstat_pgalloc_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgalloc_movable - pages allocated per second in the movable zone (/proc/vmstat pgalloc_movable)
   metrics: irate(node_vmstat_pgalloc_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgalloc_normal - pages allocated per second in the normal zone (/proc/vmstat pgalloc_normal)
   metrics: irate(node_vmstat_pgalloc_normal{instance=~"$node:$port",job=~"$job"}[5m])

12. Memory Page Refill
   type: Graph, Unit: short, Label: Pages
   Pgrefill_dma - pages refilled per second in the DMA zone (/proc/vmstat pgrefill_dma)
   metrics: irate(node_vmstat_pgrefill_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgrefill_dma32 - pages refilled per second in the DMA32 zone (/proc/vmstat pgrefill_dma32)
   metrics: irate(node_vmstat_pgrefill_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgrefill_movable - pages refilled per second in the movable zone (/proc/vmstat pgrefill_movable)
   metrics: irate(node_vmstat_pgrefill_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgrefill_normal - pages refilled per second in the normal zone (/proc/vmstat pgrefill_normal)
   metrics: irate(node_vmstat_pgrefill_normal{instance=~"$node:$port",job=~"$job"}[5m])

13. Memory Page Steal Direct
   type: Graph, Unit: short, Label: Pages
   Pgsteal_direct_dma - pages in the DMA zone reclaimed directly for other use, per second (/proc/vmstat pgsteal_direct_dma)
   metrics: irate(node_vmstat_pgsteal_direct_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_direct_dma32 - pages in the DMA32 zone reclaimed directly for other use, per second (/proc/vmstat pgsteal_direct_dma32)
   metrics: irate(node_vmstat_pgsteal_direct_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_direct_movable - pages in the movable zone reclaimed directly for other use, per second (/proc/vmstat pgsteal_direct_movable)
   metrics: irate(node_vmstat_pgsteal_direct_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_direct_normal - pages in the normal zone reclaimed directly for other use, per second (/proc/vmstat pgsteal_direct_normal)
   metrics: irate(node_vmstat_pgsteal_direct_normal{instance=~"$node:$port",job=~"$job"}[5m])

14. Memory Page Steal Kswapd
   type: Graph, Unit: short, Label: Pages
   Pgsteal_kswapd_dma - pages in the DMA zone reclaimed by the kswapd background process for other use, per second (/proc/vmstat pgsteal_kswapd_dma)
   metrics: irate(node_vmstat_pgsteal_kswapd_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_kswapd_dma32 - pages in the DMA32 zone reclaimed by kswapd for other use, per second (/proc/vmstat pgsteal_kswapd_dma32)
   metrics: irate(node_vmstat_pgsteal_kswapd_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_kswapd_movable - pages in the movable zone reclaimed by kswapd for other use, per second (/proc/vmstat pgsteal_kswapd_movable)
   metrics: irate(node_vmstat_pgsteal_kswapd_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgsteal_kswapd_normal - pages in the normal zone reclaimed by kswapd for other use, per second (/proc/vmstat pgsteal_kswapd_normal)
   metrics: irate(node_vmstat_pgsteal_kswapd_normal{instance=~"$node:$port",job=~"$job"}[5m])

15. Memory Scan Direct
   type: Graph, Unit: short, Label: Pages
   Pgscan_direct_dma - pages in the DMA zone scanned by direct reclaim, per second (/proc/vmstat pgscan_direct_dma)
   metrics: irate(node_vmstat_pgscan_direct_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_direct_dma32 - pages in the DMA32 zone scanned by direct reclaim, per second (/proc/vmstat pgscan_direct_dma32)
   metrics: irate(node_vmstat_pgscan_direct_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_direct_movable - pages in the movable zone scanned by direct reclaim, per second (/proc/vmstat pgscan_direct_movable)
   metrics: irate(node_vmstat_pgscan_direct_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_direct_normal - pages in the normal zone scanned by direct reclaim, per second (/proc/vmstat pgscan_direct_normal)
   metrics: irate(node_vmstat_pgscan_direct_normal{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_direct_throttle - times per second that direct reclaim was throttled (/proc/vmstat pgscan_direct_throttle)
   metrics: irate(node_vmstat_pgscan_direct_throttle{instance=~"$node:$port",job=~"$job"}[5m])

16. Memory Scan Kswapd
   type: Graph, Unit: short, Label: Pages
   Pgscan_kswapd_dma - pages in the DMA zone scanned by the kswapd background process, per second (/proc/vmstat pgscan_kswapd_dma)
   metrics: irate(node_vmstat_pgscan_kswapd_dma{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_kswapd_dma32 - pages in the DMA32 zone scanned by kswapd, per second (/proc/vmstat pgscan_kswapd_dma32)
   metrics: irate(node_vmstat_pgscan_kswapd_dma32{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_kswapd_movable - pages in the movable zone scanned by kswapd, per second (/proc/vmstat pgscan_kswapd_movable)
   metrics: irate(node_vmstat_pgscan_kswapd_movable{instance=~"$node:$port",job=~"$job"}[5m])
   Pgscan_kswapd_normal - pages in the normal zone scanned by kswapd, per second (/proc/vmstat pgscan_kswapd_normal)
   metrics: irate(node_vmstat_pgscan_kswapd_normal{instance=~"$node:$port",job=~"$job"}[5m])

17. Memory Page Compact
   type: Graph, Unit: short, Label: Pages
   Compact_free_scanned - pages scanned by the compaction daemon while looking for free pages (/proc/vmstat compact_free_scanned)
   metrics: irate(node_vmstat_compact_free_scanned{instance=~"$node:$port",job=~"$job"}[5m])
   Compact_isolated - pages isolated for memory compaction (/proc/vmstat compact_isolated)
   metrics: irate(node_vmstat_compact_isolated{instance=~"$node:$port",job=~"$job"}[5m])
   Compact_migrate_scanned - pages scanned by the compaction daemon for migration (/proc/vmstat compact_migrate_scanned)
   metrics: irate(node_vmstat_compact_migrate_scanned{instance=~"$node:$port",job=~"$job"}[5m])

18. Memory Compactions: memory compaction events.
   type: Graph, Unit: short, Label: Compactions
   Compact_fail - failed compactions for high-order allocations, per second (/proc/vmstat compact_fail)
   metrics: irate(node_vmstat_compact_fail{instance=~"$node:$port",job=~"$job"}[5m])
   Compact_stall - times per second a process stalled to run memory compaction (/proc/vmstat compact_stall)
   metrics: irate(node_vmstat_compact_stall{instance=~"$node:$port",job=~"$job"}[5m])
   Compact_success - successful compactions for high-order allocations, per second (/proc/vmstat compact_success)
   metrics: irate(node_vmstat_compact_success{instance=~"$node:$port",job=~"$job"}[5m])

19. Memory Kswapd Watermark
   type: Graph, Unit: short, Label: Counter
   Kswapd_high_wmark_hit_quickly - number of times free memory reached the high watermark quickly (/proc/vmstat kswapd_high_wmark_hit_quickly)
   metrics: node_vmstat_kswapd_high_wmark_hit_quickly{instance=~"$node:$port",job=~"$job"}
   Kswapd_low_wmark_hit_quickly - number of times free memory reached the low watermark quickly (/proc/vmstat kswapd_low_wmark_hit_quickly)
   metrics: node_vmstat_kswapd_low_wmark_hit_quickly{instance=~"$node:$port",job=~"$job"}

20. Memory Buddy Alloc
   type: Graph, Unit: short, Label: Allocations
   Htlb_buddy_alloc_fail - number of failed buddy allocations for hugetlb (/proc/vmstat htlb_buddy_alloc_fail)
   metrics: node_vmstat_htlb_buddy_alloc_fail{instance=~"$node:$port",job=~"$job"}
   Htlb_buddy_alloc_success - number of successful buddy allocations for hugetlb (/proc/vmstat htlb_buddy_alloc_success)
   metrics: node_vmstat_htlb_buddy_alloc_success{instance=~"$node:$port",job=~"$job"}

21. Memory Numa Allocations
   type: Graph, Unit: short, Label: Allocations
   Numa_foreign - allocations intended for this node that were served from another node (/proc/vmstat numa_foreign)
   metrics: irate(node_vmstat_numa_foreign{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_hit - allocations intended for this node that were served from this node (/proc/vmstat numa_hit)
   metrics: irate(node_vmstat_numa_hit{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_interleave - interleaved allocations that were served from this node (/proc/vmstat numa_interleave)
   metrics: irate(node_vmstat_numa_interleave{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_local - allocations on this node by programs running on this node (/proc/vmstat numa_local)
   metrics: irate(node_vmstat_numa_local{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_miss - allocations intended for another node that were served from this node (/proc/vmstat numa_miss)
   metrics: irate(node_vmstat_numa_miss{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_other - allocations on this node by programs running on another node (/proc/vmstat numa_other)
   metrics: irate(node_vmstat_numa_other{instance=~"$node:$port",job=~"$job"}[5m])

22. Memory Numa Page Migrations
   type: Graph, Unit: short, Label: Pages
   Numa_pages_migrated - pages migrated between NUMA nodes (/proc/vmstat numa_pages_migrated)
   metrics: irate(node_vmstat_numa_pages_migrated{instance=~"$node:$port",job=~"$job"}[5m])
   Pgmigrate_fail - pages that failed to migrate (/proc/vmstat pgmigrate_fail)
   metrics: irate(node_vmstat_pgmigrate_fail{instance=~"$node:$port",job=~"$job"}[5m])
   Pgmigrate_success - pages migrated successfully (/proc/vmstat pgmigrate_success)
   metrics: irate(node_vmstat_pgmigrate_success{instance=~"$node:$port",job=~"$job"}[5m])

23. Memory Numa Hints
   type: Graph, Unit: short, Label: Hints
   Numa_hint_faults - NUMA hint faults trapped
   metrics: irate(node_vmstat_numa_hint_faults{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_hint_faults_local - hinting faults to local nodes
   metrics: irate(node_vmstat_numa_hint_faults_local{instance=~"$node:$port",job=~"$job"}[5m])

24. Memory Numa Table Updates
   type: Graph, Unit: short, Label: Updates
   Numa_pte_updates - NUMA page table entry updates
   metrics: irate(node_vmstat_numa_pte_updates{instance=~"$node:$port",job=~"$job"}[5m])
   Numa_huge_pte_updates - NUMA huge page table entry updates
   metrics: irate(node_vmstat_numa_huge_pte_updates{instance=~"$node:$port",job=~"$job"}[5m])

25. Memory THP Splits
   type: Graph, Unit: short, Label: Splits
   Thp_split - huge pages split into multiple regular pages (/proc/vmstat thp_split)
   metrics: irate(node_vmstat_thp_split{instance=~"$node:$port",job=~"$job"}[5m])

26. Memory Workingset
   type: Graph, Unit: short, Label: Counter
   Workingset_activate - page activations to form the working set
   metrics: irate(node_vmstat_workingset_activate{instance=~"$node:$port",job=~"$job"}[5m])
   Workingset_nodereclaim - NUMA node working set page reclaims
   metrics: irate(node_vmstat_workingset_nodereclaim{instance=~"$node:$port",job=~"$job"}[5m])
   Workingset_refault - refaults of previously evicted pages
   metrics: irate(node_vmstat_workingset_refault{instance=~"$node:$port",job=~"$job"}[5m])

27. Memory THP Allocations
   type: Graph, Unit: short, Label: Allocations
   Thp_collapse_alloc - transparent huge page collapse allocations
   metrics: irate(node_vmstat_thp_collapse_alloc{instance=~"$node:$port",job=~"$job"}[5m])
   Thp_collapse_alloc_failed - transparent huge page collapse allocation failures
   metrics: irate(node_vmstat_thp_collapse_alloc_failed{instance=~"$node:$port",job=~"$job"}[5m])
   Thp_zero_page_alloc - transparent huge page zeroed page allocations
   metrics: irate(node_vmstat_thp_zero_page_alloc{instance=~"$node:$port",job=~"$job"}[5m])
   Thp_zero_page_alloc_failed - transparent huge page zeroed page allocation failures
   metrics: irate(node_vmstat_thp_zero_page_alloc_failed{instance=~"$node:$port",job=~"$job"}[5m])
   Thp_fault_alloc - transparent huge page fault allocations
   metrics: irate(node_vmstat_thp_fault_alloc{instance=~"$node:$port",job=~"$job"}[5m])
   Thp_fault_fallback - transparent huge page fault fallbacks
   metrics: irate(node_vmstat_thp_fault_fallback{instance=~"$node:$port",job=~"$job"}[5m])
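The minor page fault series is not read directly from /proc/vmstat; it is derived as total faults minus major faults. A minimal Python sketch of that derivation, assuming two hypothetical /proc/vmstat readings taken 60 seconds apart (the helper names are invented for this example):

```python
from typing import Dict


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse /proc/vmstat-style "name value" lines into a dictionary."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split()
        counters[name] = int(value)
    return counters


def minor_fault_rate(
    previous: Dict[str, int], current: Dict[str, int], interval_seconds: float
) -> float:
    """Per-second minor faults: rate of pgfault minus rate of pgmajfault."""
    total = (current["pgfault"] - previous["pgfault"]) / interval_seconds
    major = (current["pgmajfault"] - previous["pgmajfault"]) / interval_seconds
    return total - major


# Hypothetical readings 60 seconds apart.
previous = parse_vmstat("pgfault 1000\npgmajfault 10\n")
current = parse_vmstat("pgfault 1600\npgmajfault 40\n")

print(minor_fault_rate(previous, current, 60))
```

Major faults require reading the page from disk, so a rising major fault rate is usually the more important signal of the two.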
Memory Detail Vmstat Counters
1. Memory Page Active type: Graph Unit: short Label: Pages
Active_anon - anonymous memory pages that were used recently (/proc/vmstat nr_active_anon) metrics: node_vmstat_nr_active_anon{instance=~"$node:$port",job=~"$job"}
Active_file - file-backed memory pages that were used recently (/proc/vmstat nr_active_file) metrics: node_vmstat_nr_active_file{instance=~"$node:$port",job=~"$job"}
2. Memory Page Reclaimed / Unreclaimed type: Graph Unit: short Label: Pages
Reclaimable - reclaimable slab pages (/proc/vmstat nr_slab_reclaimable) metrics: node_vmstat_nr_slab_reclaimable{instance=~"$node:$port",job=~"$job"}
Unreclaimable - unreclaimable slab pages (/proc/vmstat nr_slab_unreclaimable) metrics: node_vmstat_nr_slab_unreclaimable{instance=~"$node:$port",job=~"$job"}
3. Memory Page Inactive type: Graph Unit: short Label: Pages
Inactive_anon - anonymous pages in every zone of every NUMA node that have not been accessed for a long time (/proc/vmstat nr_inactive_anon) metrics: node_vmstat_nr_inactive_anon{instance=~"$node:$port",job=~"$job"}
Inactive_file - file-backed pages in every zone of every NUMA node that have not been accessed for a long time (/proc/vmstat nr_inactive_file) metrics: node_vmstat_nr_inactive_file{instance=~"$node:$port",job=~"$job"}
4. Memory Page Dirty / Bounce type: Graph Unit: short Label: Pages
Dirty - dirty pages (/proc/vmstat nr_dirty) metrics: node_vmstat_nr_dirty{instance=~"$node:$port",job=~"$job"}
Bounce - bounce buffer pages (/proc/vmstat nr_bounce) metrics: node_vmstat_nr_bounce{instance=~"$node:$port",job=~"$job"}
5. Memory Page Free / Written type: Graph Unit: short Label: Pages
Free_pages - free pages (/proc/vmstat nr_free_pages) metrics: node_vmstat_nr_free_pages{instance=~"$node:$port",job=~"$job"}
Written - pages written out in every zone of every NUMA node (/proc/vmstat nr_written) metrics: node_vmstat_nr_written{instance=~"$node:$port",job=~"$job"}
6. Memory Page Shmem / Mapped type: Graph Unit: short Label: Pages
Shmem - shared memory pages (/proc/vmstat nr_shmem) metrics: node_vmstat_nr_shmem{instance=~"$node:$port",job=~"$job"}
Mapped - mapped page-cache pages in every zone of every NUMA node (/proc/vmstat nr_mapped) metrics: node_vmstat_nr_mapped{instance=~"$node:$port",job=~"$job"}
7. Memory Page Unevictable / mlock type: Graph Unit: short Label: Pages
Unevictable - pages that cannot be evicted (/proc/vmstat nr_unevictable) metrics: node_vmstat_nr_unevictable{instance=~"$node:$port",job=~"$job"}
mlock - pages locked by the mlock() system call (/proc/vmstat nr_mlock) metrics: node_vmstat_nr_mlock{instance=~"$node:$port",job=~"$job"}
8. Memory Page Writeback type: Graph Unit: short Label: Pages
Writeback - pages under writeback (/proc/vmstat nr_writeback) metrics: node_vmstat_nr_writeback{instance=~"$node:$port",job=~"$job"}
Writeback_temp - pages under temporary writeback (/proc/vmstat nr_writeback_temp) metrics: node_vmstat_nr_writeback_temp{instance=~"$node:$port",job=~"$job"}
9. Memory Page Kernel_stack type: Graph Unit: short Label: Pages
Kernel_stack - kernel stack pages (/proc/vmstat nr_kernel_stack) metrics: node_vmstat_nr_kernel_stack{instance=~"$node:$port",job=~"$job"}
10. Memory Page Dirty Threshold type: Graph Unit: short Label: Pages
Dirty_background_threshold - dirty-page threshold at which background writeback starts (/proc/vmstat nr_dirty_background_threshold) metrics: node_vmstat_nr_dirty_background_threshold{instance=~"$node:$port",job=~"$job"}
Dirty_threshold - dirty-page limit threshold (/proc/vmstat nr_dirty_threshold) metrics: node_vmstat_nr_dirty_threshold{instance=~"$node:$port",job=~"$job"}
11. Memory Page File_pages type: Graph Unit: short Label: Pages
File_pages - file cache pages in every zone of every NUMA node (/proc/vmstat nr_file_pages) metrics: node_vmstat_nr_file_pages{instance=~"$node:$port",job=~"$job"}
12. Memory Page Page_table_pages type: Graph Unit: short Label: Pages
Page_table_pages - pages used for page tables in every zone of every NUMA node (/proc/vmstat nr_page_table_pages) metrics: node_vmstat_nr_page_table_pages{instance=~"$node:$port",job=~"$job"}
13. Memory Page Unstable / Dirtied type: Graph Unit: short Label: Pages
Unstable - pages in the unstable state in every zone of every NUMA node (/proc/vmstat nr_unstable) metrics: node_vmstat_nr_unstable{instance=~"$node:$port",job=~"$job"}
Dirtied - pages that have been dirtied in every zone of every NUMA node (/proc/vmstat nr_dirtied) metrics: node_vmstat_nr_dirtied{instance=~"$node:$port",job=~"$job"}
14. Memory Page Isolated type: Graph Unit: short Label: Pages
Isolated_anon - isolated anonymous pages in every zone of every NUMA node (/proc/vmstat nr_isolated_anon) metrics: node_vmstat_nr_isolated_anon{instance=~"$node:$port",job=~"$job"}
Isolated_file - isolated file-backed pages in every zone of every NUMA node (/proc/vmstat nr_isolated_file) metrics: node_vmstat_nr_isolated_file{instance=~"$node:$port",job=~"$job"}
15. Memory Page Alloc_batch type: Graph Unit: short Label: Pages
Alloc_batch - pages in every zone of every NUMA node that were allocated to other zones because of low memory (/proc/vmstat nr_alloc_batch) metrics: node_vmstat_nr_alloc_batch{instance=~"$node:$port",job=~"$job"}
16. Memory Page Misc type: Graph Unit: short Label: Pages
Free_cma - free contiguous memory allocator (CMA) pages in every zone of every NUMA node (/proc/vmstat nr_free_cma) metrics: node_vmstat_nr_free_cma{instance=~"$node:$port",job=~"$job"}
Vmscan_write - pages written out by LRU memory reclaim (/proc/vmstat nr_vmscan_write) metrics: node_vmstat_nr_vmscan_write{instance=~"$node:$port",job=~"$job"}
Immediate_reclaim - pages in every zone of every NUMA node that are reclaimed first once writeback finishes (/proc/vmstat nr_vmscan_immediate_reclaim) metrics: node_vmstat_nr_vmscan_immediate_reclaim{instance=~"$node:$port",job=~"$job"}
17. Memory Page Anon type: Graph Unit: short Label: Pages
Anon_pages - anonymous mapped pages in every zone of every NUMA node (/proc/vmstat nr_anon_pages) metrics: node_vmstat_nr_anon_pages{instance=~"$node:$port",job=~"$job"}
Anon_transparent_hugepages - transparent huge pages (THP) in every zone of every NUMA node (/proc/vmstat nr_anon_transparent_hugepages) metrics: node_vmstat_nr_anon_transparent_hugepages{instance=~"$node:$port",job=~"$job"}
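The panels above plot raw page counts (Unit: short), which are hard to compare with the byte-based memory panels. A minimal sketch of the conversion follows, assuming the counter really is reported in pages and that the page size is 4 KiB when none is given explicitly; `pages_to_bytes` is a hypothetical helper name.

```python
import os


def pages_to_bytes(pages, page_size=None):
    """Convert a /proc/vmstat page count to bytes.

    page_size defaults to the page size of the machine running this code,
    which is only correct when it matches the monitored host.
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size


if __name__ == "__main__":
    # 256 pages of 4096 bytes each.
    print(pages_to_bytes(256, page_size=4096))
```

In a dashboard the same effect would come from multiplying the query by the page size and switching the panel unit to bytes.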
System Detail
1. Context Switches / Interrupts type: Graph Unit: short Label: Counter
Context switches - per-second rate of CPU context switches (irate over a 5m window) metrics: irate(node_context_switches_total{instance=~"$node:$port",job=~"$job"}[5m])
Interrupts - per-second rate of interrupts serviced (irate over a 5m window) metrics: irate(node_intr_total{instance=~"$node:$port",job=~"$job"}[5m])
2. System Load type: Graph Unit: short Label: Load
Load 1m - average system load over 1 minute metrics: node_load1{instance=~"$node:$port",job=~"$job"}
Load 5m - average system load over 5 minutes metrics: node_load5{instance=~"$node:$port",job=~"$job"}
Load 15m - average system load over 15 minutes metrics: node_load15{instance=~"$node:$port",job=~"$job"}
3. Interrupts Detail /proc/interrupts type: Graph Unit: short Label: Counter
{{ type }} - {{ info }} - per-second rate of each interrupt in the system's interrupt list, by interrupt number (irate over a 5m window) metrics: irate(node_interrupts_total{instance=~"$node:$port",job=~"$job"}[5m])
4. File Descriptors type: Graph Unit: short Label: Descriptors
Maximum open file descriptors - maximum number of open file descriptors (process_* metrics describe the exporter process itself) metrics: process_max_fds{instance=~"$node:$port",job=~"$job"}
Open file descriptors - number of open file descriptors metrics: process_open_fds{instance=~"$node:$port",job=~"$job"}
5. Entropy type: Graph Unit: short Label: Entropy
Entropy available to random number generators metrics: node_entropy_available_bits{instance=~"$node:$port",job=~"$job"}
6. Processes State type: Graph Unit: short Label: Processes
Processes blocked - number of tasks currently blocked (/proc/stat procs_blocked) metrics: node_procs_blocked{instance=~"$node:$port",job=~"$job"}
Processes in runnable state - number of tasks currently on the run queue (/proc/stat procs_running) metrics: node_procs_running{instance=~"$node:$port",job=~"$job"}
7. Processes Forks type: Graph Unit: short Label: Forks / sec
Processes forks second - number of processes created per second metrics: rate(node_forks_total{instance=~"$node:$port",job=~"$job"}[5m])
8. Processes Memory type: Graph Unit: bytes Label: Bytes
Virtual memory size of the process: metrics: process_virtual_memory_bytes{instance=~"$node:$port",job=~"$job"}
Resident memory size of the process: metrics: process_resident_memory_bytes{instance=~"$node:$port",job=~"$job"}
9. Time Synchronized Status type: Graph Unit: short Label: Counter
Whether the clock is synchronized to a reliable server: metrics: node_timex_sync_status{instance=~"$node:$port",job=~"$job"}
Local clock frequency adjustment: metrics: node_timex_frequency_adjustment_ratio{instance=~"$node:$port",job=~"$job"}
10. Time Synchronized Drift type: Graph Unit: seconds Label: Seconds
Estimated error (seconds): metrics: node_timex_estimated_error_seconds{instance=~"$node:$port",job=~"$job"}
Time offset between the local system and the reference clock: metrics: node_timex_offset_seconds{instance=~"$node:$port",job=~"$job"}
Maximum error (seconds): metrics: node_timex_maxerror_seconds{instance=~"$node:$port",job=~"$job"}
11. Hardware temperature monitor type: Graph Unit: Celsius Label: Temperature
{{ chip }} {{ sensor }} temp metrics: node_hwmon_temp_celsius{instance=~"$node:$port",job=~"$job"}
{{ chip }} {{ sensor }} Critical Alarm metrics: node_hwmon_temp_crit_alarm_celsius{instance=~"$node:$port",job=~"$job"}
{{ chip }} {{ sensor }} Critical metrics: node_hwmon_temp_crit_celsius{instance=~"$node:$port",job=~"$job"}
{{ chip }} {{ sensor }} Critical Historical metrics: node_hwmon_temp_crit_hyst_celsius{instance=~"$node:$port",job=~"$job"}
{{ chip }} {{ sensor }} Max metrics: node_hwmon_temp_max_celsius{instance=~"$node:$port",job=~"$job"}
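Several panels above use `irate(...[5m])` while Processes Forks uses `rate(...[5m])`. The difference matters when reading the graphs: `irate` looks only at the last two samples inside the window, whereas `rate` averages over the whole window. The sketch below illustrates both on a list of `(timestamp, value)` samples. It is a simplification under stated assumptions: `simple_rate` omits the extrapolation that Prometheus applies, and both function names are hypothetical.

```python
def irate(samples):
    """Per-second rate from the last two samples of a counter.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0
    if delta < 0:
        # The counter went backwards, so treat it as a reset to zero.
        delta = v1
    return delta / (t1 - t0)


def simple_rate(samples):
    """Average per-second increase over the whole window (no extrapolation)."""
    if len(samples) < 2:
        return None
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total / (samples[-1][0] - samples[0][0])


if __name__ == "__main__":
    window = [(0, 100), (15, 130), (30, 190), (45, 200)]
    print(irate(window), simple_rate(window))
```

With these samples the two results differ noticeably, which is why an `irate` series looks spikier than a `rate` series over the same window.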
Disk Detail /proc/diskstats
1. Disk IOps Completed type: Graph Unit: I/O ops/sec (iops) Label: IO read(-)/write(+)
{{device}} - Reads completed: reads completed per second on each disk device metrics: irate(node_disk_reads_completed_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Writes completed: writes completed per second on each disk device metrics: irate(node_disk_writes_completed_total{instance=~"$node:$port",job=~"$job"}[5m])
2. Disk R/W Data type: Graph Unit: bytes/sec Label: Bytes read(-)/write(+)
{{device}} - Read bytes: bytes read per second from each disk device metrics: irate(node_disk_read_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Written bytes: bytes written per second to each disk device metrics: irate(node_disk_written_bytes_total{instance=~"$node:$port",job=~"$job"}[5m])
3. Disk R/W Time type: Graph Unit: Milliseconds (ms) Label: Millisec. read(-)/write(+)
{{device}} - Read time ms: time spent on reads on each disk device (the underlying metric is in seconds) metrics: irate(node_disk_read_time_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Write time ms: time spent on writes on each disk device (the underlying metric is in seconds) metrics: irate(node_disk_write_time_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])
4. Disk IOs Weighted type: Graph Unit: Milliseconds (ms) Label: Milliseconds
{{device}} - IO time weighted: weighted time spent on I/O operations on each disk device metrics: irate(node_disk_io_time_weighted_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])
5. Disk R/W Merged type: Graph Unit: I/O ops/sec (iops) Label: I/Os
{{device}} - Read merged: merged reads completed per second on each disk device metrics: irate(node_disk_reads_merged_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Write merged: merged writes completed per second on each disk device metrics: irate(node_disk_writes_merged_total{instance=~"$node:$port",job=~"$job"}[5m])
6. Milliseconds Spent Doing I/Os type: Graph Unit: Milliseconds (ms) Label: Milliseconds
{{device}} - IO time ms: time spent on I/O operations on each disk device metrics: irate(node_disk_io_time_seconds_total{instance=~"$node:$port",job=~"$job"}[5m])
7. Disk IOs Current in Progress type: Graph Unit: I/O ops/sec (iops) Label: I/Os
{{device}} - IO Now: I/O requests currently in progress on each disk device (node_disk_io_now is a gauge of in-flight requests) metrics: irate(node_disk_io_now{instance=~"$node:$port",job=~"$job"}[5m])
8. Open Error File type: Graph Unit: short Label: Errors
Textfile scrape error (1 = true): set to 1 when a file could not be opened or read metrics: node_textfile_scrape_error{instance=~"$node:$port",job=~"$job"}
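The R/W Time and IOps panels can be combined into an average latency per operation: the increase in `node_disk_read_time_seconds_total` divided by the increase in `node_disk_reads_completed_total` over the same interval. This combination is not one of the dashboard panels described above; the sketch below only shows the arithmetic, and `avg_latency_ms` is a hypothetical helper.

```python
def avg_latency_ms(time_seconds_delta, ops_delta):
    """Average time per I/O operation, in milliseconds.

    time_seconds_delta: increase of a *_time_seconds_total counter.
    ops_delta: increase of the matching *_completed_total counter.
    """
    if ops_delta <= 0:
        # No completed operations in the interval, so no latency to report.
        return 0.0
    return time_seconds_delta / ops_delta * 1000.0


if __name__ == "__main__":
    # 0.5 s spent on 100 reads in the interval.
    print(avg_latency_ms(0.5, 100))
```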
FileSystem Detail /proc/filesystems
1. Filesystem space available type: Graph Unit: bytes Label: Bytes
{{mountpoint}} - space available to unprivileged users on the mounted filesystem metrics: node_filesystem_avail_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
{{mountpoint}} - free space on the mounted filesystem metrics: node_filesystem_free_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
{{mountpoint}} - total size of the mounted filesystem metrics: node_filesystem_size_bytes{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
2. File Nodes Free type: Graph Unit: short Label: File Nodes
{{mountpoint}} - number of free file nodes (inodes) on the mounted filesystem metrics: node_filesystem_files_free{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
3. File Descriptor type: Graph Unit: short Label: Files
Maximum number of open file descriptors: metrics: node_filefd_maximum{instance=~"$node:$port",job=~"$job"}
Number of open file descriptors: metrics: node_filefd_allocated{instance=~"$node:$port",job=~"$job"}
4. File Nodes Size type: Graph Unit: short Label: File Nodes
{{mountpoint}} - File nodes total: total number of file nodes (inodes) on the mounted filesystem metrics: node_filesystem_files{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
5. Filesystem in ReadOnly type: Graph Unit: short Label: Read Only
{{mountpoint}} - ReadOnly: filesystems mounted read-only metrics: node_filesystem_readonly{instance=~"$node:$port",job=~"$job",device!~'rootfs'}
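The gauge section earlier in this document computes root filesystem usage as `100 - avail * 100 / size`. The same formula applies to any mountpoint and, with `node_filesystem_files_free` and `node_filesystem_files`, to inode usage. A minimal sketch, with `used_percent` as a hypothetical helper name:

```python
def used_percent(available, total):
    """Usage percentage from an 'available' value and a 'total' value.

    Works for bytes (avail_bytes / size_bytes) and for inodes
    (files_free / files) alike.
    """
    if total <= 0:
        # Pseudo filesystems can report a size of zero.
        return 0.0
    return 100.0 - (available * 100.0) / total


if __name__ == "__main__":
    print(used_percent(25 * 1024**3, 100 * 1024**3))  # space
    print(used_percent(900_000, 1_000_000))           # inodes
```

Using `avail` rather than `free` for space gives the usage an unprivileged user actually sees, since `free` also counts blocks reserved for root.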
Network Traffic Detail /proc/net/dev
1. Network Traffic by Packets type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive: packets received per second on each interface metrics: irate(node_network_receive_packets_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Transmit: packets sent per second on each interface metrics: irate(node_network_transmit_packets_total{instance=~"$node:$port",job=~"$job"}[5m])
2. Network Traffic Errors type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive errors: receive errors detected per second on each interface metrics: irate(node_network_receive_errs_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Transmit errors: transmit errors detected per second on each interface metrics: irate(node_network_transmit_errs_total{instance=~"$node:$port",job=~"$job"}[5m])
3. Network Traffic Drop type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive drop: received packets dropped per second on each interface metrics: irate(node_network_receive_drop_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Transmit drop: outgoing packets dropped per second on each interface metrics: irate(node_network_transmit_drop_total{instance=~"$node:$port",job=~"$job"}[5m])
4. Network Traffic Compressed type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive compressed: compressed packets received per second on each interface metrics: irate(node_network_receive_compressed_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Transmit compressed: compressed packets sent per second on each interface metrics: irate(node_network_transmit_compressed_total{instance=~"$node:$port",job=~"$job"}[5m])
5. Network Traffic Multicast type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive multicast: multicast packets received per second on each interface metrics: irate(node_network_receive_multicast_total{instance=~"$node:$port",job=~"$job"}[5m])
6. Network Traffic Fifo type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive fifo: receive FIFO events per second on each interface metrics: irate(node_network_receive_fifo_total{instance=~"$node:$port",job=~"$job"}[5m])
{{device}} - Transmit fifo: transmit FIFO events per second on each interface metrics: irate(node_network_transmit_fifo_total{instance=~"$node:$port",job=~"$job"}[5m])
7. Network Traffic Frame type: Graph Unit: packets/sec Label: Packets out (-) / in (+)
{{device}} - Receive frame: frame events per second on the receive side of each interface metrics: irate(node_network_receive_frame_total{instance=~"$node:$port",job=~"$job"}[5m])
8. Network Traffic Carrier type: Graph Unit: short Label: Counter
{{device}} - Statistic transmit_carrier: carrier losses detected on each interface metrics: irate(node_network_transmit_carrier_total{instance=~"$node:$port",job=~"$job"}[5m])
9. Network Traffic Colls type: Graph Unit: short Label: Counter
{{device}} - Transmit colls: collisions detected on each interface metrics: irate(node_network_transmit_colls_total{instance=~"$node:$port",job=~"$job"}[5m])
10. NF Conntrack type: Graph Unit: short Label: Entries
NF conntrack entries: number of tracked connections metrics: node_nf_conntrack_entries{instance=~"$node:$port",job=~"$job"}
NF conntrack limit: maximum number of tracked connections metrics: node_nf_conntrack_entries_limit{instance=~"$node:$port",job=~"$job"}
11. ARP Entries type: Graph Unit: short Label: Entries
{{ device }} - ARP entries: number of ARP table entries on each interface metrics: node_arp_entries{instance=~"$node:$port",job=~"$job"}
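The metric suffixes in this section (`errs`, `drop`, `fifo`, `frame`, `colls`, `carrier`, ...) are the column names of /proc/net/dev. The sketch below maps one sample of that file to the metric names used by the panels. The sample text and the `parse_net_dev` helper are illustrative assumptions, not the exporter's actual implementation.

```python
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed"]

SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0: 5000 50 1 2 0 0 0 3 7000 70 0 0 0 4 5 0
"""


def parse_net_dev(text):
    """Return {(metric_name, device): value} for every interface line."""
    metrics = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        device, rest = line.split(":", 1)
        device = device.strip()
        values = [int(v) for v in rest.split()]
        for name, value in zip(RX_FIELDS, values[:8]):
            metrics[("node_network_receive_%s_total" % name, device)] = value
        for name, value in zip(TX_FIELDS, values[8:16]):
            metrics[("node_network_transmit_%s_total" % name, device)] = value
    return metrics


if __name__ == "__main__":
    parsed = parse_net_dev(SAMPLE)
    print(parsed[("node_network_receive_multicast_total", "eth0")])
```

The column layout also explains why multicast and frame have only a receive series and colls and carrier only a transmit series in the panels above.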
Network Sockstat /proc/net/sockstat
1. Sockstat TCP type: Graph Unit: short Label: Sockets
TCP_alloc - number of allocated TCP sockets (established, sk_buff already obtained) metrics: node_sockstat_TCP_alloc{instance=~"$node:$port",job=~"$job"}
TCP_inuse - number of TCP sockets in use (listening) metrics: node_sockstat_TCP_inuse{instance=~"$node:$port",job=~"$job"}
TCP_mem - TCP socket buffer usage (in pages) metrics: node_sockstat_TCP_mem{instance=~"$node:$port",job=~"$job"}
TCP_orphan - number of orphaned TCP connections that belong to no process (unused sockets waiting to be destroyed) metrics: node_sockstat_TCP_orphan{instance=~"$node:$port",job=~"$job"}
TCP_tw - number of TCP connections waiting to close (TIME_WAIT) metrics: node_sockstat_TCP_tw{instance=~"$node:$port",job=~"$job"}
2. Sockstat UDP type: Graph Unit: short Label: Sockets
UDPLITE_inuse - number of UDP-Lite sockets in use metrics: node_sockstat_UDPLITE_inuse{instance=~"$node:$port",job=~"$job"}
UDP_inuse - number of UDP sockets in use metrics: node_sockstat_UDP_inuse{instance=~"$node:$port",job=~"$job"}
UDP_mem - UDP socket buffer usage (in pages) metrics: node_sockstat_UDP_mem{instance=~"$node:$port",job=~"$job"}
3. Sockstat Used type: Graph Unit: short Label: Sockets
Sockets_used - total number of sockets in use across all protocols metrics: node_sockstat_sockets_used{instance=~"$node:$port",job=~"$job"}
4. Sockstat Memory Size type: Graph Unit: bytes Label: Bytes
TCP_mem_bytes - TCP socket buffer usage in bytes metrics: node_sockstat_TCP_mem_bytes{instance=~"$node:$port",job=~"$job"}
UDP_mem_bytes - UDP socket buffer usage in bytes metrics: node_sockstat_UDP_mem_bytes{instance=~"$node:$port",job=~"$job"}
5. Sockstat FRAG / RAW type: Graph Unit: short Label: Sockets
FRAG_inuse - number of Frag sockets in use metrics: node_sockstat_FRAG_inuse{instance=~"$node:$port",job=~"$job"}
FRAG_memory - Frag buffer usage metrics: node_sockstat_FRAG_memory{instance=~"$node:$port",job=~"$job"}
RAW_inuse - number of Raw sockets in use metrics: node_sockstat_RAW_inuse{instance=~"$node:$port",job=~"$job"}
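The metric names in this section follow the layout of /proc/net/sockstat: each line is a protocol followed by key/value pairs, which become `node_sockstat_<protocol>_<key>`. The sketch below parses a made-up sample and also derives the `*_mem_bytes` values by multiplying `mem` (a page count) by an assumed 4096-byte page size. `parse_sockstat` is a hypothetical helper, not the exporter's code.

```python
SAMPLE = """\
sockets: used 296
TCP: inuse 12 orphan 0 tw 3 alloc 14 mem 2
UDP: inuse 4 mem 1
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"""


def parse_sockstat(text, page_size=4096):
    """Return {metric_name: value} for a /proc/net/sockstat dump."""
    metrics = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        tokens = rest.split()
        # Tokens alternate between a key and its integer value.
        for key, value in zip(tokens[0::2], tokens[1::2]):
            metrics["node_sockstat_%s_%s" % (proto, key)] = int(value)
            if key == "mem":
                # mem is a page count; assumed page size converts it to bytes.
                metrics["node_sockstat_%s_mem_bytes" % proto] = int(value) * page_size
    return metrics


if __name__ == "__main__":
    for name, value in sorted(parse_sockstat(SAMPLE).items()):
        print(name, value)
```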
Network Netstat /proc/net/netstat
1. Netstat IP In / Out type: Graph Unit: short Label: Datagrams out (-) / in (+)
InReceives - IP datagrams received metrics: irate(node_netstat_Ip_InReceives{instance=~"$node:$port",job=~"$job"}[5m])
DefaultTTL - default time-to-live of IP datagrams metrics: irate(node_netstat_Ip_DefaultTTL{instance=~"$node:$port",job=~"$job"}[5m])
InDelivers - IP datagrams delivered metrics: irate(node_netstat_Ip_InDelivers{instance=~"$node:$port",job=~"$job"}[5m])
OutRequests - IP datagrams sent metrics: irate(node_netstat_Ip_OutRequests{instance=~"$node:$port",job=~"$job"}[5m])
2. Netstat IP In / Out type: Graph Unit: short Label: Octets out (-) / in (+)
InOctets - octets of IP datagrams received metrics: irate(node_netstat_IpExt_InOctets{instance=~"$node:$port",job=~"$job"}[5m])
OutOctets - octets of IP datagrams sent metrics: irate(node_netstat_IpExt_OutOctets{instance=~"$node:$port",job=~"$job"}[5m])
3. Netstat IP Bcast type: Graph Unit: short Label: Datagrams out (-) / in (+)
InBcastPkts - IP broadcast datagrams received metrics: irate(node_netstat_IpExt_InBcastPkts{instance=~"$node:$port",job=~"$job"}[5m])
OutBcastPkts - IP broadcast datagrams sent metrics: irate(node_netstat_IpExt_OutBcastPkts{instance=~"$node:$port",job=~"$job"}[5m])
4. Netstat IP Bcast Octets type: Graph Unit: short Label: Octets out (-) / in (+)
InBcastOctets - octets of IP broadcast datagrams received metrics: irate(node_netstat_IpExt_InBcastOctets{instance=~"$node:$port",job=~"$job"}[5m])
OutBcastOctets - octets of IP broadcast datagrams sent metrics: irate(node_netstat_IpExt_OutBcastOctets{instance=~"$node:$port",job=~"$job"}[5m])
5. Netstat IP Mcast type: Graph Unit: short Label: Datagrams out (-) / in (+)
InMcastPkts - IP multicast datagrams received metrics: irate(node_netstat_IpExt_InMcastPkts{instance=~"$node:$port",job=~"$job"}[5m])
OutMcastPkts - IP multicast datagrams sent metrics: irate(node_netstat_IpExt_OutMcastPkts{instance=~"$node:$port",job=~"$job"}[5m])
6. Netstat IP Mcast Octets type: Graph Unit: short Label: Octets out (-) / in (+)
InMcastOctets - octets of IP multicast datagrams received metrics: irate(node_netstat_IpExt_InMcastOctets{instance=~"$node:$port",job=~"$job"}[5m])
OutMcastOctets - octets of IP multicast datagrams sent metrics: irate(node_netstat_IpExt_OutMcastOctets{instance=~"$node:$port",job=~"$job"}[5m])
7. Netstat IP Forwarding type: Graph Unit: short Label: Datagrams
ForwDatagrams - IP datagrams forwarded metrics: irate(node_netstat_Ip_ForwDatagrams{instance=~"$node:$port",job=~"$job"}[5m])
Forwarding - IP forwarding metrics: irate(node_netstat_Ip_Forwarding{instance=~"$node:$port",job=~"$job"}[5m])
8. Netstat IP Fragmented type: Graph Unit: short Label: Datagrams
FragCreates - IP fragments created metrics: irate(node_netstat_Ip_FragCreates{instance=~"$node:$port",job=~"$job"}[5m])
FragFails - IP fragmentation failures metrics: irate(node_netstat_Ip_FragFails{instance=~"$node:$port",job=~"$job"}[5m])
FragOKs - IP datagrams fragmented successfully metrics: irate(node_netstat_Ip_FragOKs{instance=~"$node:$port",job=~"$job"}[5m])
9. Netstat IP ECT / CEP type: Graph Unit: short Label: Datagrams
InCEPkts - IP datagrams received with the Congestion Experienced (CE) codepoint metrics: irate(node_netstat_IpExt_InCEPkts{instance=~"$node:$port",job=~"$job"}[5m])
InECT0Pkts - IP datagrams received with the ECT(0) codepoint metrics: irate(node_netstat_IpExt_InECT0Pkts{instance=~"$node:$port",job=~"$job"}[5m])
InECT1Pkts - IP datagrams received with the ECT(1) codepoint metrics: irate(node_netstat_IpExt_InECT1Pkts{instance=~"$node:$port",job=~"$job"}[5m])
InNoECTPkts - IP datagrams received without ECT metrics: irate(node_netstat_IpExt_InNoECTPkts{instance=~"$node:$port",job=~"$job"}[5m])
10. Netstat IP Reassembled type: Graph Unit: short Label: Datagrams
ReasmFails - IP reassembly failures metrics: irate(node_netstat_Ip_ReasmFails{instance=~"$node:$port",job=~"$job"}[5m])
ReasmOKs - IP datagrams reassembled successfully metrics: irate(node_netstat_Ip_ReasmOKs{instance=~"$node:$port",job=~"$job"}[5m])
ReasmReqds - IP fragments that required reassembly metrics: irate(node_netstat_Ip_ReasmReqds{instance=~"$node:$port",job=~"$job"}[5m])
ReasmTimeout - IP reassembly timeouts metrics: irate(node_netstat_Ip_ReasmTimeout{instance=~"$node:$port",job=~"$job"}[5m])
11. Netstat IP Errors / Discards type: Graph Unit: short Label: Datagrams out (-) / in (+)
InDiscards - received IP datagrams that were discarded metrics: irate(node_netstat_Ip_InDiscards{instance=~"$node:$port",job=~"$job"}[5m])
InHdrErrors - received IP datagrams discarded because of header errors metrics: irate(node_netstat_Ip_InHdrErrors{instance=~"$node:$port",job=~"$job"}[5m])
InUnknownProtos - IP datagrams discarded because of an unknown protocol metrics: irate(node_netstat_Ip_InUnknownProtos{instance=~"$node:$port",job=~"$job"}[5m])
OutDiscards - outgoing IP datagrams that were discarded metrics: irate(node_netstat_Ip_OutDiscards{instance=~"$node:$port",job=~"$job"}[5m])
OutNoRoutes - IP datagrams discarded because no output route was found metrics: irate(node_netstat_Ip_OutNoRoutes{instance=~"$node:$port",job=~"$job"}[5m])
InNoRoutes - IP datagrams discarded because the forwarding path had no route metrics: irate(node_netstat_IpExt_InNoRoutes{instance=~"$node:$port",job=~"$job"}[5m])
InCsumErrors - IP datagrams with checksum errors metrics: irate(node_netstat_IpExt_InCsumErrors{instance=~"$node:$port",job=~"$job"}[5m])
InTruncatedPkts - IP datagrams discarded because the frame did not carry enough data metrics: irate(node_netstat_IpExt_InTruncatedPkts{instance=~"$node:$port",job=~"$job"}[5m])
InAddrErrors - IP datagrams discarded because of address errors metrics: irate(node_netstat_Ip_InAddrErrors{instance=~"$node:$port",job=~"$job"}[5m])
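The `node_netstat_<Protocol>_<Name>` pattern used above mirrors the layout of /proc/net/netstat and /proc/net/snmp, where each protocol has a header line of names followed by a line of values. The sketch below parses a shortened, made-up sample; `parse_netstat` is a hypothetical helper. It also shows that `Forwarding` and `DefaultTTL` hold settings rather than growing counters, so an `irate` over them would normally stay at zero.

```python
SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InDiscards OutRequests OutNoRoutes
Ip: 2 64 1000 3 900 1
IpExt: InNoRoutes InOctets OutOctets InCsumErrors
IpExt: 0 123456 654321 0
"""


def parse_netstat(text):
    """Return {metric_name: value} from header/value line pairs."""
    lines = [line for line in text.splitlines() if line.strip()]
    metrics = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            raise ValueError("header and value lines do not match: %r" % header)
        for name, number in zip(names.split(), numbers.split()):
            metrics["node_netstat_%s_%s" % (proto, name)] = int(number)
    return metrics


if __name__ == "__main__":
    parsed = parse_netstat(SAMPLE)
    print(parsed["node_netstat_Ip_DefaultTTL"])
    print(parsed["node_netstat_IpExt_InOctets"])
```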
Network Netstat TCP /proc/net/snmp
1. TCP Segments type: Graph Unit: short Label: Segments out (-) / in (+)
InCsumErrors - segments received with checksum errors (irate over a 5m window) metrics: irate(node_netstat_Tcp_InCsumErrors{instance=~"$node:$port",job=~"$job"}[5m])
InErrs - segments received in error, for example with a bad checksum (irate over a 5m window) metrics: irate(node_netstat_Tcp_InErrs{instance=~"$node:$port",job=~"$job"}[5m])
InSegs - all segments received, including those received in error and those on currently established connections (irate over a 5m window) metrics: irate(node_netstat_Tcp_InSegs{instance=~"$node:$port",job=~"$job"}[5m])
OutRsts - segments sent with the RST flag set (irate over a 5m window) metrics: irate(node_netstat_Tcp_OutRsts{instance=~"$node:$port",job=~"$job"}[5m])
OutSegs - segments sent, including those on current connections but excluding retransmitted segments (irate over a 5m window) metrics: irate(node_netstat_Tcp_OutSegs{instance=~"$node:$port",job=~"$job"}[5m])
RetransSegs - segments retransmitted (irate over a 5m window) metrics: irate(node_netstat_Tcp_RetransSegs{instance=~"$node:$port",job=~"$job"}[5m])
2. TCP Connections type: Graph Unit: short Label: Connections
CurrEstab - number of TCP connections currently in the ESTABLISHED or CLOSE-WAIT state metrics: node_netstat_Tcp_CurrEstab{instance=~"$node:$port",job=~"$job"}
MaxConn - limit on the total number of TCP connections the entity can support metrics: node_netstat_Tcp_MaxConn{instance=~"$node:$port",job=~"$job"}
3. TCP Retransmission type: Graph Unit: milliseconds Label: Milliseconds
RtoAlgorithm - algorithm used to determine the TCP retransmission timeout (an identifier, not a duration) metrics: node_netstat_Tcp_RtoAlgorithm{instance=~"$node:$port",job=~"$job"}
RtoMax - maximum retransmission timeout allowed by TCP, in milliseconds metrics: node_netstat_Tcp_RtoMax{instance=~"$node:$port",job=~"$job"}
RtoMin - minimum retransmission timeout allowed by TCP, in milliseconds metrics: node_netstat_Tcp_RtoMin{instance=~"$node:$port",job=~"$job"}
4. TCP Segments type: Graph Unit: short Label: Connections
ActiveOpens - TCP connections that moved directly from CLOSED to SYN-SENT (irate over a 5m window) metrics: irate(node_netstat_Tcp_ActiveOpens{instance=~"$node:$port",job=~"$job"}[5m])
AttemptFails - TCP connections that moved from SYN-SENT or SYN-RCVD to CLOSED (irate over a 5m window) metrics: irate(node_netstat_Tcp_AttemptFails{instance=~"$node:$port",job=~"$job"}[5m])
EstabResets - TCP connections that moved directly from ESTABLISHED or CLOSE-WAIT to CLOSED (irate over a 5m window) metrics: irate(node_netstat_Tcp_EstabResets{instance=~"$node:$port",job=~"$job"}[5m])
PassiveOpens - TCP connections that moved directly from LISTEN to SYN-RCVD (irate over a 5m window) metrics: irate(node_netstat_Tcp_PassiveOpens{instance=~"$node:$port",job=~"$job"}[5m])
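A common way to read the TCP Segments panel is as a retransmission ratio: the RetransSegs rate divided by the OutSegs rate. This ratio is not one of the panels described above; the sketch below only shows the arithmetic, and `retrans_ratio` is a hypothetical helper name.

```python
def retrans_ratio(retrans_per_sec, out_per_sec):
    """Fraction of sent segments that were retransmissions.

    Both arguments are per-second rates, for example the values of the
    RetransSegs and OutSegs series at the same timestamp.
    """
    if out_per_sec <= 0:
        # Nothing was sent, so there is nothing to compare against.
        return 0.0
    return retrans_per_sec / out_per_sec


if __name__ == "__main__":
    # 5 retransmitted segments/s against 1000 sent segments/s.
    print("%.2f%%" % (retrans_ratio(5, 1000) * 100))
```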
Network Netstat TCP Linux MIB
1. TCP Aborts / Timeouts type: Graph Unit: short Label: Connections
TCPAbortOnClose - connections aborted because the user closed the socket metrics: irate(node_netstat_TcpExt_TCPAbortOnClose{instance=~"$node:$port",job=~"$job"}[5m])
TCPAbortOnData - connections aborted because of unexpected data metrics: irate(node_netstat_TcpExt_TCPAbortOnData{instance=~"$node:$port",job=~"$job"}[5m])
TCPAbortOnLinger - connections aborted in the linger state after close metrics: irate(node_netstat_TcpExt_TCPAbortOnLinger{instance=~"$node:$port",job=~"$job"}[5m])
TCPAbortOnMemory - connections aborted because of memory pressure metrics: irate(node_netstat_TcpExt_TCPAbortOnMemory{instance=~"$node:$port",job=~"$job"}[5m])
TCPAbortOnTimeout - connections aborted because of a timeout metrics: irate(node_netstat_TcpExt_TCPAbortOnTimeout{instance=~"$node:$port",job=~"$job"}[5m])
TCPAbortFailed - connections aborted without sending an RST because memory was insufficient metrics: irate(node_netstat_TcpExt_TCPAbortFailed{instance=~"$node:$port",job=~"$job"}[5m])
TCPTimeouts - other TCP connection timeouts metrics: irate(node_netstat_TcpExt_TCPTimeouts{instance=~"$node:$port",job=~"$job"}[5m])
2. TCP Delayed ACK type: Graph Unit: short Label: Counter
DelayedACKLocked - delayed ACKs that were delayed further because the socket was locked metrics: irate(node_netstat_TcpExt_DelayedACKLocked{instance=~"$node:$port",job=~"$job"}[5m])
DelayedACKLost - times quick ACK mode was activated metrics: irate(node_netstat_TcpExt_DelayedACKLost{instance=~"$node:$port",job=~"$job"}[5m])
DelayedACKs - delayed ACKs sent metrics: irate(node_netstat_TcpExt_DelayedACKs{instance=~"$node:$port",job=~"$job"}[5m])
3. TCP SynCookie / Challenge type: Graph Unit: short Label: Counter out (-) / in (+)
SyncookiesFailed - invalid SYN cookies received metrics: irate(node_netstat_TcpExt_SyncookiesFailed{instance=~"$node:$port",job=~"$job"}[5m])
SyncookiesRecv - SYN cookies received metrics: irate(node_netstat_TcpExt_SyncookiesRecv{instance=~"$node:$port",job=~"$job"}[5m])
SyncookiesSent - SYN cookies sent metrics: irate(node_netstat_TcpExt_SyncookiesSent{instance=~"$node:$port",job=~"$job"}[5m])
SynChallenge - SYN challenges sent metrics: irate(node_netstat_TcpExt_TCPSYNChallenge{instance=~"$node:$port",job=~"$job"}[5m])
TCPChallengeACK - challenge ACKs sent metrics: irate(node_netstat_TcpExt_TCPChallengeACK{instance=~"$node:$port",job=~"$job"}[5m])
4. TCP LOSS type: Graph Unit: short Label: Counter
TCPLossFailures - timeouts while in the Loss state metrics: irate(node_netstat_TcpExt_TCPLossFailures{instance=~"$node:$port",job=~"$job"}[5m])
TCPLossProbeRecovery - recoveries triggered by TCP loss probes metrics: irate(node_netstat_TcpExt_TCPLossProbeRecovery{instance=~"$node:$port",job=~"$job"}[5m])
TCPLossProbes - TCP loss probes sent metrics: irate(node_netstat_TcpExt_TCPLossProbes{instance=~"$node:$port",job=~"$job"}[5m])
TCPLossUndo - congestion windows recovered without slow start after a partial ACK metrics: irate(node_netstat_TcpExt_TCPLossUndo{instance=~"$node:$port",job=~"$job"}[5m])
TCPLostRetransmit - retransmitted TCP packets that were lost again metrics: irate(node_netstat_TcpExt_TCPLostRetransmit{instance=~"$node:$port",job=~"$job"}[5m])
5. TCP DROPS type: Graph Unit: short Label: Counter
ListenDrops - connections dropped from the listen queue metrics: irate(node_netstat_TcpExt_ListenDrops{instance=~"$node:$port",job=~"$job"}[5m])
LockDroppedIcmps - ICMP packets dropped because the socket was locked metrics: irate(node_netstat_TcpExt_LockDroppedIcmps{instance=~"$node:$port",job=~"$job"}[5m])
TCPDeferAcceptDrop - ACK frames dropped by sockets in the SYN_RECV state metrics: irate(node_netstat_TcpExt_TCPDeferAcceptDrop{instance=~"$node:$port",job=~"$job"}[5m])
TCPBacklogDrop - TCP packets dropped because the socket receive queue was full metrics: irate(node_netstat_TcpExt_TCPBacklogDrop{instance=~"$node:$port",job=~"$job"}[5m])
OutOfWindowIcmps - ICMP packets dropped because they were out of window metrics: irate(node_netstat_TcpExt_OutOfWindowIcmps{instance=~"$node:$port",job=~"$job"}[5m])
TCPMinTTLDrop - TCP packets dropped by the minimum TTL check metrics: irate(node_netstat_TcpExt_TCPMinTTLDrop{instance=~"$node:$port",job=~"$job"}[5m])
6. TCP Retrans type: Graph Unit: short Label: Counter
TCPForwardRetrans - forward retransmits of lost packets metrics: irate(node_netstat_TcpExt_TCPForwardRetrans{instance=~"$node:$port",job=~"$job"}[5m])
TCPSlowStartRetrans - lost packets retransmitted after slow start metrics: irate(node_netstat_TcpExt_TCPSlowStartRetrans{instance=~"$node:$port",job=~"$job"}[5m])
TCPSynRetrans - SYN and SYN/ACK retransmits, counted separately from fast and timeout retransmits metrics: irate(node_netstat_TcpExt_TCPSynRetrans{instance=~"$node:$port",job=~"$job"}[5m])
TCPSpuriousRTOs - spurious TCP RTOs metrics: irate(node_netstat_TcpExt_TCPSpuriousRTOs{instance=~"$node:$port",job=~"$job"}[5m])
TCPSpuriousRtxHostQueues - times detected that the fast clone is not yet freed in tcp_transmit_skb() metrics: irate(node_netstat_TcpExt_TCPSpuriousRtxHostQueues{instance=~"$node:$port",job=~"$job"}[5m])
TCPFullUndo - retransmits that undid the CWND reduction metrics: irate(node_netstat_TcpExt_TCPFullUndo{instance=~"$node:$port",job=~"$job"}[5m])
TCPRetransFail - failed calls to tcp_retransmit_skb() metrics: irate(node_netstat_TcpExt_TCPRetransFail{instance=~"$node:$port",job=~"$job"}[5m])
TCPPartialUndo - congestion windows partially recovered using the Hoe heuristic metrics: irate(node_netstat_TcpExt_TCPPartialUndo{instance=~"$node:$port",job=~"$job"}[5m])
7. TCP Pruned type: Graph Unit: short Label: Counter
PruneCalled - packets removed from the receive queue because of socket buffer overrun metrics: irate(node_netstat_TcpExt_PruneCalled{instance=~"$node:$port",job=~"$job"}[5m])
RcvPruned - packets removed from the receive queue metrics: irate(node_netstat_TcpExt_RcvPruned{instance=~"$node:$port",job=~"$job"}[5m])
OfoPruned - packets removed from the out-of-order queue because of socket buffer overrun metrics: irate(node_netstat_TcpExt_OfoPruned{instance=~"$node:$port",job=~"$job"}[5m])
8. TCP Direct copy type: Graph Unit: short Label: Counter
TCPDirectCopyFromBacklog - packets received directly from the backlog queue metrics: irate(node_netstat_TcpExt_TCPDirectCopyFromBacklog{instance=~"$node:$port",job=~"$job"}[5m])
TCPDirectCopyFromPrequeue - packets received directly from the TCP prequeue metrics: irate(node_netstat_TcpExt_TCPDirectCopyFromPrequeue{instance=~"$node:$port",job=~"$job"}[5m])
9. TCP TimeWait type: Graph Unit: short Label: Counter
TW - TCP sockets that finished TIME_WAIT in the fast timer metrics: irate(node_netstat_TcpExt_TW{instance=~"$node:$port",job=~"$job"}[5m])
TWKilled - TCP sockets that finished TIME_WAIT in the slow timer metrics: irate(node_netstat_TcpExt_TWKilled{instance=~"$node:$port",job=~"$job"}[5m])
TWRecycled - TIME_WAIT sockets recycled by timestamp metrics: irate(node_netstat_TcpExt_TWRecycled{instance=~"$node:$port",job=~"$job"}[5m])
TCPTimeWaitOverflow - TIME_WAIT overflows metrics: irate(node_netstat_TcpExt_TCPTimeWaitOverflow{instance=~"$node:$port",job=~"$job"}[5m])
10. TCP PAWS type: Graph Unit: short Label: Counter
PAWSActive - active connections rejected because of the TCP timestamp check (PAWS) metrics: irate(node_netstat_TcpExt_PAWSActive{instance=~"$node:$port",job=~"$job"}[5m])
PAWSEstab - packets rejected on established connections because of the TCP timestamp check (PAWS) metrics: irate(node_netstat_TcpExt_PAWSEstab{instance=~"$node:$port",job=~"$job"}[5m])
PAWSPassive - passive connections rejected because of the TCP timestamp check (PAWS) metrics: irate(node_netstat_TcpExt_PAWSPassive{instance=~"$node:$port",job=~"$job"}[5m])
11. 
TCP SACK type: Graph Unit: short Label: Counter TcpsackRecovery - 使用 Sack 恢复丢失的包 metrics: irate(node_netstat_TcpExt_TcpsackRecovery{instance=~"$node:$port",job=~"$job"}[5m]) TcpsackRecoveryFail - 使用 Sack 恢复丢失的包失败 metrics: irate(node_netstat_TcpExt_TcpsackRecoveryFail{instance=~"$node:$port",job=~"$job"}[5m]) TcpsackShiftFallback metrics: irate(node_netstat_TcpExt_TcpsackShiftFallback{instance=~"$node:$port",job=~"$job"}[5m]) TcpsackShifted metrics: irate(node_netstat_TcpExt_TcpsackShifted{instance=~"$node:$port",job=~"$job"}[5m]) Tcpsackdiscard metrics: irate(node_netstat_TcpExt_TcpsACKdiscard{instance=~"$node:$port",job=~"$job"}[5m]) TcpsackFailures metrics: irate(node_netstat_TcpExt_TcpsackFailures{instance=~"$node:$port",job=~"$job"}[5m]) TcpsackMerged metrics: irate(node_netstat_TcpExt_TcpsackMerged{instance=~"$node:$port",job=~"$job"}[5m]) TcpsACKReneging metrics: irate(node_netstat_TcpExt_TcpsACKReneging{instance=~"$node:$port",job=~"$job"}[5m]) TcpsACKReorder metrics: irate(node_netstat_TcpExt_TcpsACKReorder{instance=~"$node:$port",job=~"$job"}[5m]) 12. TCP DSACK type: Graph Unit: short Label: Counter TCPDSACKIgnoredOld - 在重新传输时丢弃具有重复 SACK 的数据包 metrics: irate(node_netstat_TcpExt_TCPDSACKIgnoredOld{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKOfoRecv - 接收到无序的 DSACK 数据包 metrics: irate(node_netstat_TcpExt_TCPDSACKOfoRecv{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKOfoSent - 发送的无序的 DSACK 数据包 metrics: irate(node_netstat_TcpExt_TCPDSACKOfoSent{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKOldSent - 发送的旧 DSACKs 数据包 metrics: irate(node_netstat_TcpExt_TCPDSACKOldSent{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKRecv - 接收的 DSACK 数据包 metrics: irate(node_netstat_TcpExt_TCPDSACKRecv{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKUndo metrics: irate(node_netstat_TcpExt_TCPDSACKUndo{instance=~"$node:$port",job=~"$job"}[5m]) TCPDSACKIgnorednoUndo metrics: irate(node_netstat_TcpExt_TCPDSACKIgnorednoUndo{instance=~"$node:$port",job=~"$job"}[5m]) 13. 
TCP FastOpen / FastRetrans type: Graph Unit: short Label: Counter TCPFastOpenActive - 成功的出站 TFO 连接 metrics: irate(node_netstat_TcpExt_TCPFastOpenActive{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastOpenActiveFail - 收到的 SYN-ACK 数据包未确认 SYN 数据包中发送的数据,并导致无 SYN 数据的重传 metrics: irate(node_netstat_TcpExt_TCPFastOpenActiveFail{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastOpenCookieReqd - 请求设置 TFO 但没有 cookie 的入站 SYN 数据包 metrics: irate(node_netstat_TcpExt_TCPFastOpenCookieReqd{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastOpenListenOverflow - TFO 监听队列溢出 metrics: irate(node_netstat_TcpExt_TCPFastOpenListenOverflow{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastOpenPassive - 成功的入站 TFO 连接 metrics: irate(node_netstat_TcpExt_TCPFastOpenPassive{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastOpenPassiveFail - 带有TFO cookie 的无效的入站 SYN 数据包 metrics: irate(node_netstat_TcpExt_TCPFastOpenPassiveFail{instance=~"$node:$port",job=~"$job"}[5m]) TCPFastRetrans - 丢失快速重传的数据包 metrics: irate(node_netstat_TcpExt_TCPFastRetrans{instance=~"$node:$port",job=~"$job"}[5m]) 14. TCP HP type: Graph Unit: short Label: Counter TCPHPAcks - 接收到的不包含数据的 Acks metrics: irate(node_netstat_TcpExt_TCPHPAcks{instance=~"$node:$port",job=~"$job"}[5m]) TCPHPHits - HP 数据包 metrics: irate(node_netstat_TcpExt_TCPHPHits{instance=~"$node:$port",job=~"$job"}[5m]) TCPHPHitsToUser metrics: irate(node_netstat_TcpExt_TCPHPHitsToUser{instance=~"$node:$port",job=~"$job"}[5m]) 15. TCP ZeroWindow type: Graph Unit: short Label: Counter TCPToZeroWindowAdv metrics: irate(node_netstat_TcpExt_TCPToZeroWindowAdv{instance=~"$node:$port",job=~"$job"}[5m]) TCPWantZeroWindowAdv metrics: irate(node_netstat_TcpExt_TCPWantZeroWindowAdv{instance=~"$node:$port",job=~"$job"}[5m]) TCPFromZeroWindowAdv metrics: irate(node_netstat_TcpExt_TCPFromZeroWindowAdv{instance=~"$node:$port",job=~"$job"}[5m]) 16. 
TCP Reorder type: Graph Unit: short Label: Counter
TCPFACKReorder - incremented when the reordering value needs updating and FACK is in use metrics: irate(node_netstat_TcpExt_TCPFACKReorder{instance=~"$node:$port",job=~"$job"}[5m])
TCPTSReorder - incremented when the reordering value needs updating after a partial ACK metrics: irate(node_netstat_TcpExt_TCPTSReorder{instance=~"$node:$port",job=~"$job"}[5m])
17. TCP Reno type: Graph Unit: short Label: Counter
TCPRenoFailures - timeouts after a Reno fast retransmit metrics: irate(node_netstat_TcpExt_TCPRenoFailures{instance=~"$node:$port",job=~"$job"}[5m])
TCPRenoRecovery metrics: irate(node_netstat_TcpExt_TCPRenoRecovery{instance=~"$node:$port",job=~"$job"}[5m])
TCPRenoRecoveryFail metrics: irate(node_netstat_TcpExt_TCPRenoRecoveryFail{instance=~"$node:$port",job=~"$job"}[5m])
TCPRenoReorder metrics: irate(node_netstat_TcpExt_TCPRenoReorder{instance=~"$node:$port",job=~"$job"}[5m])
18. TCP ReqQ type: Graph Unit: short Label: Counter
TCPReqQFullDoCookies metrics: irate(node_netstat_TcpExt_TCPReqQFullDoCookies{instance=~"$node:$port",job=~"$job"}[5m])
TCPReqQFullDrop metrics: irate(node_netstat_TcpExt_TCPReqQFullDrop{instance=~"$node:$port",job=~"$job"}[5m])
19. TCP Out of order type: Graph Unit: short Label: Counter
TCPOFODrop - packets queued in the out-of-order (OFO) queue but dropped because the socket rcvbuf limit was reached metrics: irate(node_netstat_TcpExt_TCPOFODrop{instance=~"$node:$port",job=~"$job"}[5m])
TCPOFOMerge - packets in the OFO queue merged with other packets metrics: irate(node_netstat_TcpExt_TCPOFOMerge{instance=~"$node:$port",job=~"$job"}[5m])
TCPOFOQueue - packets placed in the OFO queue metrics: irate(node_netstat_TcpExt_TCPOFOQueue{instance=~"$node:$port",job=~"$job"}[5m])
20. TCP MD5 type: Graph Unit: short Label: Counter
TCPMD5NotFound - a packet with the MD5 option was expected, but the packet carried none metrics: irate(node_netstat_TcpExt_TCPMD5NotFound{instance=~"$node:$port",job=~"$job"}[5m])
TCPMD5Unexpected - no MD5 option was expected, but the packet carried one metrics: irate(node_netstat_TcpExt_TCPMD5Unexpected{instance=~"$node:$port",job=~"$job"}[5m])
21.
TCP Prequeued type: Graph Unit: short Label: Counter
TCPPrequeued metrics: irate(node_netstat_TcpExt_TCPPrequeued{instance=~"$node:$port",job=~"$job"}[5m])
TCPPrequeueDropped - packets dropped from the prequeue metrics: irate(node_netstat_TcpExt_TCPPrequeueDropped{instance=~"$node:$port",job=~"$job"}[5m])
22. TCP Rcv type: Graph Unit: short Label: Counter
TCPRcvCoalesce - packets coalesced in the receive queue metrics: irate(node_netstat_TcpExt_TCPRcvCoalesce{instance=~"$node:$port",job=~"$job"}[5m])
TCPRcvCollapsed - packets collapsed in the receive queue because the socket buffer was low metrics: irate(node_netstat_TcpExt_TCPRcvCollapsed{instance=~"$node:$port",job=~"$job"}[5m])
23. TCP Original Data type: Graph Unit: short Label: Counter
TCPOrigDataSent - outgoing packets carrying original (non-retransmitted) data metrics: irate(node_netstat_TcpExt_TCPOrigDataSent{instance=~"$node:$port",job=~"$job"}[5m])
24. TCP Filters type: Graph Unit: short Label: Counter
ArpFilter - ARP packets filtered metrics: irate(node_netstat_TcpExt_ArpFilter{instance=~"$node:$port",job=~"$job"}[5m])
IPReversePathFilter - packets arriving from a network that is not directly connected metrics: irate(node_netstat_TcpExt_IPReversePathFilter{instance=~"$node:$port",job=~"$job"}[5m])
25. TCP Pure ACK type: Graph Unit: short Label: Counter
TCPPureAcks - ACKs received that carry no data payload metrics: irate(node_netstat_TcpExt_TCPPureAcks{instance=~"$node:$port",job=~"$job"}[5m])
26. TCP Auto Corking type: Graph Unit: short Label: Counter
TCPAutoCorking - TCP automatic corking metrics: irate(node_netstat_TcpExt_TCPAutoCorking{instance=~"$node:$port",job=~"$job"}[5m])
27.
TCP Issues type: Graph Unit: short Label: Counter
BusyPollRxPackets - packets picked up by low-latency (busy-polling) applications metrics: irate(node_netstat_TcpExt_BusyPollRxPackets{instance=~"$node:$port",job=~"$job"}[5m])
EmbryonicRsts - resets received for embryonic SYN_RECV sockets metrics: irate(node_netstat_TcpExt_EmbryonicRsts{instance=~"$node:$port",job=~"$job"}[5m])
ListenOverflows - listen socket queue overflows metrics: irate(node_netstat_TcpExt_ListenOverflows{instance=~"$node:$port",job=~"$job"}[5m])
TCPSchedulerFailed metrics: irate(node_netstat_TcpExt_TCPSchedulerFailed{instance=~"$node:$port",job=~"$job"}[5m])
TCPMemoryPressures metrics: irate(node_netstat_TcpExt_TCPMemoryPressures{instance=~"$node:$port",job=~"$job"}[5m])
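Every TCP panel above wraps a counter in `irate(...[5m])`. As a rough illustration of what that returns, the sketch below computes a per-second rate from the last two samples inside the lookback window and treats a decrease as a counter reset, following the documented behaviour of PromQL's `irate`. It is a simplification, not Prometheus code: the function name, the `Sample` alias, and the sample values are assumptions made for this example, and samples are assumed to be sorted by timestamp.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample], now: float, window: float = 300.0) -> Optional[float]:
    """Per-second rate from the last two samples within the lookback window."""
    # Keep only samples that fall inside (now - window, now].
    in_window = [s for s in samples if now - window < s[0] <= now]
    if len(in_window) < 2:
        return None  # two samples are needed to form a rate
    (t_prev, v_prev), (t_last, v_last) = in_window[-2], in_window[-1]
    if t_last <= t_prev:
        return None  # no elapsed time between the two samples
    # A counter that went down is assumed to have restarted from zero.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical scrapes of a TcpExt counter taken every 15 seconds.
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 190.0)]
print(irate(scrapes, now=30.0))  # (190 - 130) / 15
```

Because only the last two samples matter, the `[5m]` range acts as a lookback limit rather than an averaging period; `rate()` is the function that averages over the whole window.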
Network Netstat UDP /proc/net/snmp
1. UDP In / Out type: Graph Unit: short Label: Datagrams out (-) / in (+)
InDatagrams - per-second rate of UDP datagrams received (irate over a 5m window) metrics: irate(node_netstat_Udp_InDatagrams{instance=~"$node:$port",job=~"$job"}[5m])
OutDatagrams - per-second rate of UDP datagrams sent (irate over a 5m window) metrics: irate(node_netstat_Udp_OutDatagrams{instance=~"$node:$port",job=~"$job"}[5m])
2. UDP Errors type: Graph Unit: short Label: Datagrams out (-) / in (+)
InCsumErrors - per-second rate of UDP datagrams with checksum errors (irate over a 5m window) metrics: irate(node_netstat_Udp_InCsumErrors{instance=~"$node:$port",job=~"$job"}[5m])
InErrors - per-second rate of inbound UDP datagrams that could not be delivered to the application for reasons other than no listener on the destination port (irate over a 5m window) metrics: irate(node_netstat_Udp_InErrors{instance=~"$node:$port",job=~"$job"}[5m])
RcvbufErrors - per-second rate of UDP datagrams lost to receive buffer overflow (irate over a 5m window) metrics: irate(node_netstat_Udp_RcvbufErrors{instance=~"$node:$port",job=~"$job"}[5m])
SndbufErrors - per-second rate of UDP datagrams lost to send buffer overflow (irate over a 5m window) metrics: irate(node_netstat_Udp_SndbufErrors{instance=~"$node:$port",job=~"$job"}[5m])
NoPorts - per-second rate of UDP datagrams received for a port with no listener (irate over a 5m window) metrics: irate(node_netstat_Udp_NoPorts{instance=~"$node:$port",job=~"$job"}[5m])
3. UDP Lite In / Out type: Graph Unit: short Label: Datagrams out (-) / in (+)
InDatagrams - per-second rate of UDP-Lite datagrams received (irate over a 5m window) metrics: irate(node_netstat_UdpLite_InDatagrams{instance=~"$node:$port",job=~"$job"}[5m])
OutDatagrams - per-second rate of UDP-Lite datagrams sent (irate over a 5m window) metrics: irate(node_netstat_UdpLite_OutDatagrams{instance=~"$node:$port",job=~"$job"}[5m])
4.
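UDP Lite Errors type: Graph Unit: short Label: Datagrams out (-) / in (+)
InCsumErrors - per-second rate of UDP-Lite datagrams with checksum errors (irate over a 5m window) metrics: irate(node_netstat_UdpLite_InCsumErrors{instance=~"$node:$port",job=~"$job"}[5m])
InErrors - per-second rate of inbound UDP-Lite datagrams that could not be delivered to the application for reasons other than no listener on the destination port (irate over a 5m window) metrics: irate(node_netstat_UdpLite_InErrors{instance=~"$node:$port",job=~"$job"}[5m])
RcvbufErrors - per-second rate of UDP-Lite datagrams lost to receive buffer overflow (irate over a 5m window) metrics: irate(node_netstat_UdpLite_RcvbufErrors{instance=~"$node:$port",job=~"$job"}[5m])
SndbufErrors - per-second rate of UDP-Lite datagrams lost to send buffer overflow (irate over a 5m window) metrics: irate(node_netstat_UdpLite_SndbufErrors{instance=~"$node:$port",job=~"$job"}[5m])
NoPorts - per-second rate of UDP-Lite datagrams received for a port with no listener (irate over a 5m window) metrics: irate(node_netstat_UdpLite_NoPorts{instance=~"$node:$port",job=~"$job"}[5m])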
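The UDP and UDP-Lite metric names above follow the pattern node_netstat_<Protocol>_<Field>, where protocol and field come from /proc/net/snmp. That file lists each protocol as a pair of lines: one with field names and one with values. The sketch below shows one way to map such text onto those metric names. It is an illustration only, not node_exporter's actual (Go) implementation; `parse_snmp` and `SAMPLE_SNMP` are hypothetical names and the numbers are made up.

```python
from typing import Dict

# Illustrative excerpt in the /proc/net/snmp layout; the values are invented.
SAMPLE_SNMP = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 1200 4 2 1100 2 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
"""


def parse_snmp(text: str) -> Dict[str, float]:
    """Map header/value line pairs to node_netstat_<Protocol>_<Field> names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    metrics: Dict[str, float] = {}
    # Lines come in pairs: field names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            raise ValueError(f"mismatched line pair: {proto!r} vs {value_proto!r}")
        for name, number in zip(names.split(), numbers.split()):
            metrics[f"node_netstat_{proto}_{name}"] = float(number)
    return metrics


parsed = parse_snmp(SAMPLE_SNMP)
print(parsed["node_netstat_Udp_NoPorts"])
```

On a Linux host the same function could be fed the contents of /proc/net/snmp to check which counters exist before adding a panel for them.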
Network Netstat ICMP /proc/net/snmp
1. ICMP Errors 1 type: Graph Unit: short Label: Messages out (-) / in (+)
InErrors - ICMP messages received with errors (for example, bad ICMP checksum or bad length) metrics: irate(node_netstat_Icmp_InErrors{instance=~"$node:$port",job=~"$job"}[5m])
OutErrors - ICMP messages not sent because of errors within ICMP (for example, a lack of buffers) metrics: irate(node_netstat_Icmp_OutErrors{instance=~"$node:$port",job=~"$job"}[5m])
InDestUnreachs - Destination Unreachable messages received metrics: irate(node_netstat_Icmp_InDestUnreachs{instance=~"$node:$port",job=~"$job"}[5m])
OutDestUnreachs - Destination Unreachable messages sent metrics: irate(node_netstat_Icmp_OutDestUnreachs{instance=~"$node:$port",job=~"$job"}[5m])
InType11 - Time Exceeded messages (ICMP type 11) metrics: irate(node_netstat_IcmpMsg_InType11{instance=~"$node:$port",job=~"$job"}[5m])
2. ICMP Errors 2 type: Graph Unit: short Label: Messages out (-) / in (+)
InCsumErrors - ICMP messages with checksum errors metrics: irate(node_netstat_Icmp_InCsumErrors{instance=~"$node:$port",job=~"$job"}[5m])
InTimeExcds - Time Exceeded messages received metrics: irate(node_netstat_Icmp_InTimeExcds{instance=~"$node:$port",job=~"$job"}[5m])
OutTimeExcds - Time Exceeded messages sent metrics: irate(node_netstat_Icmp_OutTimeExcds{instance=~"$node:$port",job=~"$job"}[5m])
InParmProbs - Parameter Problem messages received metrics: irate(node_netstat_Icmp_InParmProbs{instance=~"$node:$port",job=~"$job"}[5m])
OutParmProbs - Parameter Problem messages sent metrics: irate(node_netstat_Icmp_OutParmProbs{instance=~"$node:$port",job=~"$job"}[5m])
InSrcQuenchs - Source Quench messages received metrics: irate(node_netstat_Icmp_InSrcQuenchs{instance=~"$node:$port",job=~"$job"}[5m])
OutSrcQuenchs - Source Quench messages sent metrics: irate(node_netstat_Icmp_OutSrcQuenchs{instance=~"$node:$port",job=~"$job"}[5m])
3.
ICMP In / Out - Messages / Redirects type: Graph Unit: short Label: Messages out (-) / in (+)
InMsgs - ICMP messages received metrics: irate(node_netstat_Icmp_InMsgs{instance=~"$node:$port",job=~"$job"}[5m])
InRedirects - ICMP Redirect messages received metrics: irate(node_netstat_Icmp_InRedirects{instance=~"$node:$port",job=~"$job"}[5m])
OutMsgs - ICMP messages sent metrics: irate(node_netstat_Icmp_OutMsgs{instance=~"$node:$port",job=~"$job"}[5m])
OutRedirects - ICMP Redirect messages sent metrics: irate(node_netstat_Icmp_OutRedirects{instance=~"$node:$port",job=~"$job"}[5m])
4. ICMP Timestamps type: Graph Unit: short Label: Messages out (-) / in (+)
InTimestampReps - Timestamp Reply messages received metrics: irate(node_netstat_Icmp_InTimestampReps{instance=~"$node:$port",job=~"$job"}[5m])
InTimestamps - Timestamp (request) messages received metrics: irate(node_netstat_Icmp_InTimestamps{instance=~"$node:$port",job=~"$job"}[5m])
OutTimestampReps - Timestamp Reply messages sent metrics: irate(node_netstat_Icmp_OutTimestampReps{instance=~"$node:$port",job=~"$job"}[5m])
OutTimestamps - Timestamp (request) messages sent metrics: irate(node_netstat_Icmp_OutTimestamps{instance=~"$node:$port",job=~"$job"}[5m])
5. ICMP Echos type: Graph Unit: short Label: Messages out (-) / in (+)
InEchoReps - Echo Reply messages received metrics: irate(node_netstat_Icmp_InEchoReps{instance=~"$node:$port",job=~"$job"}[5m])
InEchos - Echo (request) messages received metrics: irate(node_netstat_Icmp_InEchos{instance=~"$node:$port",job=~"$job"}[5m])
OutEchoReps - Echo Reply messages sent metrics: irate(node_netstat_Icmp_OutEchoReps{instance=~"$node:$port",job=~"$job"}[5m])
OutEchos - Echo (request) messages sent metrics: irate(node_netstat_Icmp_OutEchos{instance=~"$node:$port",job=~"$job"}[5m])
6.
ICMP Masks type: Graph Unit: short Label: Messages out (-) / in (+)
InAddrMaskReps - Address Mask Reply messages received metrics: irate(node_netstat_Icmp_InAddrMaskReps{instance=~"$node:$port",job=~"$job"}[5m])
InAddrMasks - Address Mask (request) messages received metrics: irate(node_netstat_Icmp_InAddrMasks{instance=~"$node:$port",job=~"$job"}[5m])
OutAddrMaskReps - Address Mask Reply messages sent metrics: irate(node_netstat_Icmp_OutAddrMaskReps{instance=~"$node:$port",job=~"$job"}[5m])
OutAddrMasks - Address Mask (request) messages sent metrics: irate(node_netstat_Icmp_OutAddrMasks{instance=~"$node:$port",job=~"$job"}[5m])
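The ICMP panels repeat one expression shape with only the protocol and field changing, so a copy-and-paste slip in a field name is easy to make. A small generator can reduce that risk. The sketch below builds the expression string used throughout this section and classifies a field as inbound or outbound by its name prefix. The helper names `irate_expr` and `direction` are hypothetical, and reading the "out (-) / in (+)" label as "outbound series are drawn below zero" is an assumption about the panel layout.

```python
def irate_expr(protocol: str, field: str, window: str = "5m") -> str:
    """Build the dashboard-style PromQL expression for one netstat counter."""
    # Selector copied from the expressions in this section; $node, $port and
    # $job are dashboard template variables, not Python placeholders.
    selector = '{instance=~"$node:$port",job=~"$job"}'
    return f"irate(node_netstat_{protocol}_{field}{selector}[{window}])"


def direction(field: str) -> str:
    """Classify a counter as 'in', 'out' or 'other' from its name prefix."""
    if field.startswith("Out"):
        return "out"
    if field.startswith("In"):
        return "in"
    return "other"


# Regenerate the four ICMP Echos targets from their field names.
for name in ["InEchoReps", "InEchos", "OutEchoReps", "OutEchos"]:
    print(direction(name), irate_expr("Icmp", name))
```

Generating each target from a single field name keeps the legend and the metric in step, so a legend such as InEchos cannot end up paired with a different counter.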
Node Exporter
1. Node Exporter Scrape Time type: Graph Unit: seconds Label: Seconds
{{collector}} - scrape duration of each collector metrics: node_scrape_collector_duration_seconds{instance=~"$node:$port",job=~"$job"}
2. Node Exporter Scrape Success type: Graph Unit: short Label: Counter
{{collector}} - whether each collector's scrape succeeded (1) or failed (0) metrics: node_scrape_collector_success{instance=~"$node:$port",job=~"$job"}
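Since node_scrape_collector_success reports one series per collector, a quick check for failing collectors is to look for series whose value is 0. The sketch below does this against text in the Prometheus exposition format. The sample lines and collector names are invented for illustration, and `failed_collectors` is a hypothetical helper, not part of node_exporter.

```python
import re
from typing import List

# Illustrative exposition-format lines; collector names and values are made up.
SAMPLE_EXPOSITION = """\
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="mdadm"} 0
node_scrape_collector_success{collector="netstat"} 1
"""

# Matches: node_scrape_collector_success{collector="<name>"} <value>
_LINE = re.compile(r'^node_scrape_collector_success\{collector="([^"]+)"\}\s+(\S+)$')


def failed_collectors(text: str) -> List[str]:
    """Return the sorted names of collectors whose success value is 0."""
    failed = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match and float(match.group(2)) == 0.0:
            failed.append(match.group(1))
    return sorted(failed)


print(failed_collectors(SAMPLE_EXPOSITION))
```

The same check can be written in PromQL as a filter on the panel's metric, for example by comparing the series to 0, which may be the more natural form for an alert rule.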