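Prometheus histogram metric is inaccurate

Problem description

We have Nginx logs that Logstash processes into Elasticsearch (ES). We can plot response-time graphs from that data, but we can only retain the logs for a limited time. Hence Prometheus: using the logstash-output-prometheus plugin, I send the request_time value as a histogram metric.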

        prometheus {
          timer => {
            HTTP_Request_duration_seconds => {
              description => "HTTP request_time from Nginx logs"
              value => "%{[request_time]}"
              type => "histogram"
              buckets => [ 0.005,0.01,0.025,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.5,2,3,4,5,10,60,120,300,600 ]
              labels => {
                api => "%{api}"
                method => "%{method}"
                status => "%{status_agg}"
                path => "%{uri_name}"
                host => "%{host}"
              }
            }
          }
        }

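When I compare the 95th percentile values from Logstash/ES and from Prometheus, they are sometimes very different.

For example,

In the image above, the bottom graph is populated from Logstash/ES. The value is about 10 ms for 2 hosts and about 11 ms for the remaining 4 hosts.

(ES query: type:"Nginx" AND host:"prod-lb" AND uri_name:"/api/status" AND method:"GET" AND status:"200")

Prometheus shows about 10 ms for 2 hosts (✅), but around 18 ms for the remaining 4 hosts!

(Prometheus query: histogram_quantile(0.95,sum by(le,path,host,method,status) (rate(HTTP_Request_duration_seconds_bucket{path="/api/status",host=~".*lb.*"}[2m]))))

(Note that the bucket boundaries near these values are 5 ms, 10 ms and 25 ms.)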
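To see why the bucket boundaries matter, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies inside the bucket containing the requested rank. It is an illustration under stated assumptions, not the actual Prometheus code: the function names, the two synthetic hosts, and the FINER bucket layout are all hypothetical and not taken from the real logs.

```python
import math


def cumulative_counts(upper_bounds, observations):
    """Count observations <= each upper bound (the 'le' semantics)."""
    return [sum(1 for v in observations if v <= ub) for ub in upper_bounds]


def estimate_quantile(q, upper_bounds, observations):
    """Estimate a quantile from bucket counts by linear interpolation."""
    total = len(observations)
    if total == 0:
        return float("nan")
    counts = cumulative_counts(upper_bounds, observations)
    rank = q * total
    for i, count in enumerate(counts):
        if count >= rank:
            break
    else:
        # Rank lies beyond the last finite bucket: report its upper bound.
        return upper_bounds[-1]
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    below = counts[i - 1] if i > 0 else 0
    # Assume observations are spread evenly between lower and upper bound.
    return lower + (upper_bounds[i] - lower) * (rank - below) / (counts[i] - below)


def exact_quantile(q, observations):
    """Nearest-rank quantile on the raw values, for comparison."""
    ordered = sorted(observations)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]


# Bucket bounds near the values in question (seconds).
COARSE = [0.005, 0.01, 0.025]
# Hypothetical finer layout between 10 ms and 25 ms.
FINER = [0.005, 0.01, 0.0125, 0.015, 0.02, 0.025]

# Synthetic samples (assumption): host_a keeps its p95 below 10 ms,
# host_b has its p95 just above 10 ms.
host_a = [0.008] * 960 + [0.011] * 40
host_b = [0.009] * 900 + [0.011] * 100

for name, samples in (("host_a", host_a), ("host_b", host_b)):
    print(
        name,
        f"exact={exact_quantile(0.95, samples) * 1000:.2f} ms",
        f"coarse={estimate_quantile(0.95, COARSE, samples) * 1000:.2f} ms",
        f"finer={estimate_quantile(0.95, FINER, samples) * 1000:.2f} ms",
    )
```

In this synthetic case host_b has a true 95th percentile of 11 ms, but the rank lands in the 10 ms to 25 ms bucket, so the interpolated estimate comes out at roughly 17.5 ms. With the finer hypothetical buckets the same data gives roughly 11.25 ms. This suggests the reported value can sit anywhere inside the bucket that contains the rank, so the possible error is bounded by that bucket's width.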

Is this a big deviation?

How can I make it more accurate?

Is there any other way to retain response times more accurately?


httpresponse logstash prometheus