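The `diagnostics-prometheus` plugin exposes diagnostics metrics. It listens for trusted diagnostic events and the gateway stability events emitted by core, then renders a Prometheus text endpoint at `/api/diagnostics/prometheus`. The content type is `text/plain; version=0.0.4; charset=utf-8`, the standard Prometheus exposition format.
For traces, logs, OTLP push, and OpenTelemetry GenAI semantic attributes, see OpenTelemetry export.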
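Quick start
`diagnostics.enabled: true` is required. Without it, the plugin still registers the HTTP route, but no diagnostic events flow into the exporter, so the response is empty.
Exported metrics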
| Metric | Type | Labels |
|---|---|---|
| openclaw_run_completed_total | counter | channel, model, outcome, provider, trigger |
| openclaw_run_duration_seconds | histogram | channel, model, outcome, provider, trigger |
| openclaw_model_call_total | counter | api, error_category, model, outcome, provider, transport |
| openclaw_model_call_duration_seconds | histogram | api, error_category, model, outcome, provider, transport |
| openclaw_model_failover_total | counter | from_model, from_provider, lane, reason, suspended, to_model, to_provider |
| openclaw_model_tokens_total | counter | agent, channel, model, provider, token_type |
| openclaw_gen_ai_client_token_usage | histogram | model, provider, token_type |
| openclaw_model_cost_usd_total | counter | agent, channel, model, provider |
| openclaw_skill_used_total | counter | activation, agent, skill, source |
| openclaw_tool_execution_total | counter | error_category, outcome, params_kind, tool, tool_owner, tool_source |
| openclaw_tool_execution_duration_seconds | histogram | error_category, outcome, params_kind, tool, tool_owner, tool_source |
| openclaw_tool_execution_blocked_total | counter | denied_reason, params_kind, tool, tool_owner, tool_source |
| openclaw_harness_run_total | counter | channel, error_category, harness, model, outcome, phase, plugin, provider |
| openclaw_harness_run_duration_seconds | histogram | channel, error_category, harness, model, outcome, phase, plugin, provider |
| openclaw_webhook_received_total | counter | channel, webhook |
| openclaw_webhook_error_total | counter | channel, webhook |
| openclaw_webhook_duration_seconds | histogram | channel, webhook |
| openclaw_message_received_total | counter | channel, source |
| openclaw_message_dispatch_started_total | counter | channel, source |
| openclaw_message_dispatch_completed_total | counter | channel, outcome, reason, source |
| openclaw_message_dispatch_duration_seconds | histogram | channel, outcome, reason, source |
| openclaw_message_processed_total | counter | channel, outcome, reason |
| openclaw_message_processed_duration_seconds | histogram | channel, outcome, reason |
| openclaw_message_delivery_started_total | counter | channel, delivery_kind |
| openclaw_message_delivery_total | counter | channel, delivery_kind, error_category, outcome |
| openclaw_message_delivery_duration_seconds | histogram | channel, delivery_kind, error_category, outcome |
| openclaw_talk_event_total | counter | brain, event_type, mode, provider, transport |
| openclaw_talk_event_duration_seconds | histogram | brain, event_type, mode, provider, transport |
| openclaw_talk_audio_bytes | histogram | brain, event_type, mode, provider, transport |
| openclaw_queue_lane_size | gauge | lane |
| openclaw_queue_lane_wait_seconds | histogram | lane |
| openclaw_session_state_total | counter | reason, state |
| openclaw_session_queue_depth | gauge | state |
| openclaw_session_turn_created_total | counter | agent, channel, trigger |
| openclaw_session_stuck_total | counter | reason, state |
| openclaw_session_stuck_age_seconds | histogram | reason, state |
| openclaw_session_recovery_total | counter | action, active_work_kind, state, status |
| openclaw_session_recovery_age_seconds | histogram | action, active_work_kind, state, status |
| openclaw_liveness_warning_total | counter | reason |
| openclaw_liveness_sessions | gauge | state |
| openclaw_liveness_event_loop_delay_p99_seconds | histogram | reason |
| openclaw_liveness_event_loop_delay_max_seconds | histogram | reason |
| openclaw_liveness_event_loop_utilization_ratio | histogram | reason |
| openclaw_liveness_cpu_core_ratio | histogram | reason |
| openclaw_payload_large_total | counter | action, channel, plugin, reason, surface |
| openclaw_payload_large_bytes | histogram | action, channel, plugin, reason, surface |
| openclaw_memory_bytes | gauge | kind |
| openclaw_memory_rss_bytes | histogram | none |
| openclaw_memory_pressure_total | counter | level, reason |
| openclaw_telemetry_exporter_total | counter | exporter, reason, signal, status |
| openclaw_prometheus_series_dropped_total | counter | none |
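To make the metric and label layout above concrete, here is a minimal Python sketch that reads a scrape body in the Prometheus text format and sums one counter by one label. The sample lines and their label values (`main`, `cli`, `m1`, `p1`) are invented for illustration and are not real OpenClaw output; the parser is deliberately small and does not unescape label values.

```python
import re
from collections import defaultdict

# `name{labels} value [timestamp]`; the label block is optional.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One `key="value"` pair; escaped characters stay escaped in the result.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for every sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def sum_by_label(text, metric, label):
    """Sum the values of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical scrape body, for illustration only.
SAMPLE = """\
# TYPE openclaw_model_tokens_total counter
openclaw_model_tokens_total{agent="main",channel="cli",model="m1",provider="p1",token_type="input"} 120
openclaw_model_tokens_total{agent="main",channel="cli",model="m1",provider="p1",token_type="output"} 30
openclaw_model_tokens_total{agent="main",channel="cli",model="m2",provider="p2",token_type="input"} 50
openclaw_queue_lane_size{lane="default"} 2
"""

if __name__ == "__main__":
    print(sum_by_label(SAMPLE, "openclaw_model_tokens_total", "provider"))
```

In a real deployment Prometheus does this aggregation itself; the sketch is only meant to show how the label sets in the table map onto exposition lines.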
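Label policy
Bounded, low-cardinality labels
Prometheus labels are kept bounded and low-cardinality. The exporter does not emit raw diagnostic identifiers such as `runId`, `sessionKey`, `sessionId`, `callId`, `toolCallId`, message IDs, chat IDs, or provider request IDs. Label values are redacted and must conform to OpenClaw's low-cardinality character policy. Values that do not conform are replaced with `unknown`, `other`, or `none`, depending on the metric. Labels that look like scoped agent session keys are also replaced with `unknown`.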
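The replacement behaviour described above can be pictured with a short Python sketch. This is not the exporter's code: the allowed character set, the 64-character limit, and the `agent:<id>:<rest>` shape used to recognise a session key are all assumptions made up for illustration, since the text does not spell out the actual policy.

```python
import re

# ASSUMPTION: placeholder character policy and length limit; the real
# OpenClaw policy is not specified in this document.
LOW_CARDINALITY_RE = re.compile(r"[A-Za-z0-9_.\-]{1,64}")

# ASSUMPTION: hypothetical shape of a scoped agent session key.
SESSION_KEY_RE = re.compile(r"agent:[^:]+:.+")


def sanitize_label(value, fallback="unknown"):
    """Return a safe label value, or a fixed placeholder.

    `fallback` stands in for the per-metric choice between
    "unknown", "other", and "none".
    """
    if not value:
        return fallback
    # Session-key-like values always collapse to "unknown".
    if SESSION_KEY_RE.fullmatch(value):
        return "unknown"
    if not LOW_CARDINALITY_RE.fullmatch(value):
        return fallback
    return value


if __name__ == "__main__":
    for raw in ["openai", "agent:main:telegram:12345", "a b/c", ""]:
        print(repr(raw), "->", sanitize_label(raw, fallback="other"))
```

The point of collapsing to a fixed placeholder, rather than truncating or hashing, is that the number of distinct label values stays bounded no matter what the upstream event carries.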
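Series cap and overflow counting
The exporter caps the total number of time series held in memory at 2048, counted across counters, gauges, and histograms combined. New series beyond the cap are dropped, and each drop increments `openclaw_prometheus_series_dropped_total` by one. Treat this counter as a hard signal that an upstream attribute is leaking high-cardinality values. The exporter never lifts the cap automatically; if the counter keeps climbing, fix the source rather than disabling the cap.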
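A minimal Python sketch of the cap-and-drop behaviour follows. The class name `CappedCounterStore` and its methods are hypothetical, only counters are modelled, and the drop tally is kept outside the capped map; whether the real exporter counts its own overflow counter against the 2048 limit is not stated here, so treat that detail as an assumption.

```python
MAX_SERIES = 2048  # cap stated in the text


class CappedCounterStore:
    """Hypothetical in-memory counter store with a hard series cap."""

    def __init__(self, max_series=MAX_SERIES):
        self.max_series = max_series
        self.series = {}   # (name, sorted label items) -> value
        self.dropped = 0   # plays the role of the *_series_dropped_total counter

    def inc(self, name, labels, amount=1.0):
        """Increment a series; return False if a new series was dropped."""
        key = (name, tuple(sorted(labels.items())))
        if key not in self.series:
            if len(self.series) >= self.max_series:
                # Existing series keep updating; only NEW ones are refused.
                self.dropped += 1
                return False
            self.series[key] = 0.0
        self.series[key] += amount
        return True


if __name__ == "__main__":
    store = CappedCounterStore(max_series=2)
    store.inc("openclaw_tool_execution_total", {"tool": "a"})
    store.inc("openclaw_tool_execution_total", {"tool": "b"})
    store.inc("openclaw_tool_execution_total", {"tool": "c"})  # dropped
    print(len(store.series), store.dropped)
```

Note that already-admitted series continue to receive updates after the cap is hit, which is why a rising drop count points at newly appearing label values rather than at overall traffic.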
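What never appears in Prometheus output
- prompt text, response text, tool inputs, tool outputs, system prompts
- Talk transcripts, audio payloads, call IDs, room IDs, handoff tokens, turn IDs, and raw session IDs
- raw provider request IDs (spans use a bounded hash, and only where applicable; metrics never include them)
- session keys and session IDs
- hostnames, file paths, secret values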
PromQL examples
Choosing between Prometheus and OpenTelemetry export
OpenClaw supports the two independently. You can run either one, both, or neither.
- diagnostics-prometheus
  - Pull model: Prometheus scrapes `/api/diagnostics/prometheus`.
  - No external collector is required.
  - Authenticated through normal Gateway auth.
  - Metrics only (no traces or logs).
  - A good fit for stacks already standardized on Prometheus + Grafana.
- diagnostics-otel
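Troubleshooting
Empty response body
- Check that the config sets `diagnostics.enabled: true`.
- Confirm the plugin is enabled and loaded with `openclaw plugins list --enabled`.
- Generate some traffic; counters and histograms only emit lines after at least one event has occurred.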
401 / Unauthorized
The endpoint requires the Gateway operator scope (`auth: "gateway"` with `gatewayRuntimeScopeSurface: "trusted-operator"`). Use the same token or password that Prometheus uses to access any other Gateway operator route. There is no public, unauthenticated mode.
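When debugging a 401 it can help to reproduce the scrape by hand. The Python sketch below only builds the request; it assumes the Gateway accepts the operator token as an `Authorization: Bearer` header, which this page does not confirm, and the base URL and token shown are placeholders.

```python
import urllib.request

METRICS_PATH = "/api/diagnostics/prometheus"


def build_scrape_request(base_url, token):
    """Build a GET request for the metrics endpoint.

    ASSUMPTION: bearer-token auth. Check your Gateway auth settings for
    the scheme that actually applies (token or password).
    """
    url = base_url.rstrip("/") + METRICS_PATH
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "text/plain",
        },
    )


if __name__ == "__main__":
    # Placeholder host and token; replace with your own values.
    req = build_scrape_request("http://localhost:8080/", "example-token")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req, timeout=5)` raises `urllib.error.HTTPError` with `code == 401` when the credentials are rejected, which separates an auth problem from an empty but successful response.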
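`openclaw_prometheus_series_dropped_total` keeps rising
A new attribute is pushing past the 2048-series cap. Check recent metrics for an unexpectedly high-cardinality label and fix it at the source. The exporter intentionally drops new series rather than silently rewriting labels.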
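One way to look for the offending label is to count distinct values per metric and label in a saved scrape body. The Python sketch below does that under simple assumptions: the sample text is invented, label values are not unescaped, and only series that were admitted before the cap was reached show up in a scrape, so the result is a hint rather than proof.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+\S+')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def label_cardinality(text):
    """Return [(metric, label, distinct_values)], highest count first."""
    values = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:  # label-less samples cannot be the culprit
            continue
        name, raw_labels = match.groups()
        for label, value in LABEL_RE.findall(raw_labels):
            if label == "le":  # histogram bucket bound, not an attribute
                continue
            values[(name, label)].add(value)
    ranked = sorted(
        ((len(seen), name, label) for (name, label), seen in values.items()),
        reverse=True,
    )
    return [(name, label, count) for count, name, label in ranked]


# Hypothetical scrape body, for illustration only.
SAMPLE = """\
openclaw_tool_execution_total{tool="t1",outcome="ok"} 1
openclaw_tool_execution_total{tool="t2",outcome="ok"} 1
openclaw_tool_execution_total{tool="t3",outcome="ok"} 4
"""

if __name__ == "__main__":
    for metric, label, count in label_cardinality(SAMPLE):
        print(f"{metric} {label}: {count} distinct values")
```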
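Prometheus shows stale series after a restart
The plugin keeps state in memory only. After a Gateway restart, counters reset to zero and gauges start again from their next reported value. Use PromQL `rate()` and `increase()` to handle the resets correctly.
Related
- Diagnostics export: local diagnostics zip for support bundles
- Health and readiness: `/healthz` and `/readyz` probes
- Logging: file-based logging
- OpenTelemetry export: OTLP push for traces, metrics, and logs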