Monitoring and Observability

vibeD exposes Prometheus metrics, health endpoints, and structured logs for production observability.

Prometheus Metrics

vibeD exposes metrics at /metrics on port 8080. This endpoint is always open (no authentication required) to allow Prometheus scraping without credential management.

When metrics.enabled: true (the default), the Helm chart adds standard Prometheus annotations to the pod:

prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"

Most Prometheus installations with annotation-based discovery will scrape vibeD automatically.
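For ad-hoc checks without a Prometheus server, the /metrics response can be read directly, since it uses the standard Prometheus text exposition format. The following is a minimal sketch of such a reader, not part of vibeD: the helper names (parse_metrics, total) and the sample values and HELP text are made up for illustration, and label-value escapes are left undecoded.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative payload only; real values come from GET /metrics on port 8080.
SAMPLE = """\
# HELP vibed_deploys_total Total deployments
# TYPE vibed_deploys_total counter
vibed_deploys_total{status="success",target="kubernetes"} 42
vibed_deploys_total{status="error",target="kubernetes"} 3
vibed_sse_connections_active 2
"""


def parse_metrics(text):
    """Return {(name, frozenset(label pairs)): value} for every sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples[(name, frozenset(labels.items()))] = float(value)
    return samples


def total(samples, name, **labels):
    """Sum all series of `name` whose labels include the given pairs."""
    wanted = set(labels.items())
    return sum(
        value
        for (sample_name, sample_labels), value in samples.items()
        if sample_name == name and wanted <= sample_labels
    )


samples = parse_metrics(SAMPLE)
print(total(samples, "vibed_deploys_total", status="error"))
```

For anything beyond a quick check, a real Prometheus server or an established client library is the better tool.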

Available Metrics

Deployment Metrics

Metric	Type	Labels	Description
`vibed_deploys_total`	Counter	`status`, `target`	Total deployments
`vibed_deploy_duration_seconds`	Histogram	`status`, `target`	Deploy duration (buckets: 1s, 2s, 5s, 10s, 30s, 60s)
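Because deploy durations are recorded in fixed buckets, any percentile derived from them is an estimate interpolated within a bucket. The sketch below shows the idea behind Prometheus' histogram_quantile() in simplified form, using the bucket bounds listed above. The counts are invented, and the function is an illustration rather than the exact Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies above the highest finite bound, so the
                # best available answer is that bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts per bucket (1s, 2s, 5s, 10s, 30s, 60s, +Inf); made-up data.
DEPLOY_BUCKETS = [
    (1, 10), (2, 30), (5, 60), (10, 90), (30, 98), (60, 100), (math.inf, 100),
]

print(histogram_quantile(0.5, DEPLOY_BUCKETS))
```

One consequence is that a deploy taking longer than 60s lands in the +Inf bucket, so a P99 reported as exactly 60 may in fact be higher.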

Artifact Metrics

Metric	Type	Labels	Description
`vibed_artifacts_active`	Gauge	`target`	Currently active artifacts by deployment target
`vibed_deletes_total`	Counter	`status`	Total artifact deletions

MCP Tool Metrics

Metric	Type	Labels	Description
`vibed_mcp_tool_calls_total`	Counter	`tool`, `status`	MCP tool invocations
`vibed_mcp_tool_call_duration_seconds`	Histogram	`tool`	MCP tool call duration (default Prometheus buckets)

Garbage Collector Metrics

Metric	Type	Labels	Description
`vibed_gc_resources_cleaned_total`	Counter	`type`	Total resources cleaned by garbage collector

The type label values are: job, configmap, deployment, service, and sandbox.

The GC runs periodically (default: every 1 hour) and removes orphaned Kubernetes resources whose artifact no longer exists in the store. See Configuration Reference for GC settings.

Warm Pool Metrics

Emitted by the warm pools that back the deploy path.

Metric	Type	Labels	Description
`vibed_pool_runners_idle`	Gauge	`language`	Warm idle runner pods available to claim
`vibed_pool_claims_total`	Counter	`language`, `source`	Runner claims, by `source` (`warm` pool hit vs `cold` on-demand)
`vibed_pool_claim_duration_seconds`	Histogram	`language`, `source`	Time to obtain a runner
`vibed_pool_runners_created_total`	Counter	`language`, `status`	Runner pods created, by outcome (`ready` vs `failed` warm-up)

A healthy pool keeps vibed_pool_runners_idle at the configured pool size and serves most claims from the warm source; a rising cold claim rate means the pool is being drained faster than it replenishes.
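The health rule above can be expressed as a small check. This is a sketch under stated assumptions: the inputs are the current vibed_pool_runners_idle value, the configured pool size, and the increase of vibed_pool_claims_total per source over some window. The function names and the 0.8 warm-ratio threshold are hypothetical and not vibeD defaults.

```python
def warm_hit_ratio(claims):
    """Fraction of claims served from the warm pool, or None without claims.

    `claims` maps the `source` label ("warm" or "cold") to the number of
    claims observed in the window.
    """
    total = sum(claims.values())
    if total == 0:
        return None
    return claims.get("warm", 0) / total


def pool_is_healthy(idle, pool_size, claims, min_warm_ratio=0.8):
    """True when the pool is full and most claims were warm hits."""
    ratio = warm_hit_ratio(claims)
    served_warm = ratio is None or ratio >= min_warm_ratio
    return idle >= pool_size and served_warm


print(pool_is_healthy(idle=3, pool_size=3, claims={"warm": 90, "cold": 10}))
```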

HTTP API Metrics

Metric	Type	Labels	Description
`vibed_http_requests_total`	Counter	`method`, `path`, `status_code`	HTTP API requests
`vibed_http_request_duration_seconds`	Histogram	`method`, `path`	HTTP request duration (default Prometheus buckets)

HTTP paths are normalized to prevent high cardinality (e.g., /api/artifacts/:id instead of individual artifact IDs).
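To illustrate what normalization means for the path label, here is a minimal sketch that collapses artifact IDs into the :id placeholder. It is not vibeD's actual middleware: the function name is hypothetical, and only the /api/artifacts/:id pattern mentioned above is handled.

```python
import re

# Matches the artifact ID segment directly after /api/artifacts/.
_ARTIFACT_ID = re.compile(r"^/api/artifacts/[^/]+")


def normalize_path(path):
    """Replace a concrete artifact ID with the :id placeholder."""
    return _ARTIFACT_ID.sub("/api/artifacts/:id", path, count=1)


print(normalize_path("/api/artifacts/abc123"))
```

Without this step every artifact would create its own time series for each HTTP metric, which is what drives cardinality up.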

SSE Metrics

Metric	Type	Labels	Description
`vibed_sse_connections_active`	Gauge	-	Number of active Server-Sent Events connections

The SSE endpoint (GET /api/events) streams real-time artifact lifecycle events to connected dashboard clients. This gauge tracks how many clients are currently connected.
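A client other than the dashboard can consume the same stream, since Server-Sent Events is a simple line-based text format. The sketch below parses already-received stream text into (event, data) pairs. The event name and JSON payload in the example are hypothetical, because the payload schema of /api/events is not described here, and only the event and data fields are handled.

```python
def parse_sse(stream_text):
    """Parse Server-Sent Events text into a list of (event, data) tuples."""
    events = []
    event_type, data_lines = "message", []
    for line in stream_text.splitlines():
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
    return events


# Hypothetical stream content for illustration.
EXAMPLE = 'event: artifact.deployed\ndata: {"id": "abc123"}\n\n: keep-alive\n\ndata: ping\n\n'
print(parse_sse(EXAMPLE))
```

Each open stream counts toward vibed_sse_connections_active for as long as the connection is held.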

Rate Limiting Metrics

Metric	Type	Labels	Description
`vibed_http_rate_limited_total`	Counter	`client_type`	HTTP requests rejected by rate limiting

The client_type label is apikey when the client is authenticated or ip when identified by IP address. See Configuration Reference for rate limit settings.

Governance Metrics

Metric	Type	Labels	Description
`vibed_quota_rejections_total`	Counter	`scope`	Deploys rejected by quota, by the ceiling that tripped (`owner` or `department`)
`vibed_audit_events_total`	Counter	`action`, `outcome`	Audit events recorded, by action (`deploy`/`delete`/`rollback`) and outcome (`ok`/`denied`/`error`)

These are served on the main server's /metrics (:8080). The bring-your-own base-image validator emits one more, but on the controller's metrics endpoint (:8081):

Metric	Type	Labels	Description
`vibed_template_validation`	Gauge	`template`, `result`	Per-slot base-image validation state; alert on `vibed_template_validation{result="invalid"} == 1`

Label Values

Label	Possible Values
`status`	`success`, `error` (`ready`, `failed` on `vibed_pool_runners_created_total`)
`language`	`nodejs`, `python`, `go`, `static`
`target`	`sandbox`, `kubernetes`
`tool`	`deploy_artifact`, `update_artifact`, `list_artifacts`, `get_artifact_status`, `get_artifact_logs`, `delete_artifact`, `list_deployment_targets`

Scraping with Prometheus

Annotation-Based Discovery (Default)

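If your Prometheus is configured for annotation-based pod discovery, vibeD is scraped automatically and no additional configuration is needed. Prometheus Operator setups such as kube-prometheus-stack do not act on these annotations by default; use a ServiceMonitor or PodMonitor instead.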

ServiceMonitor (Prometheus Operator)

For explicit scrape configuration with the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vibed
  namespace: vibed-system
  labels:
    release: prometheus    # Must match your Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vibed
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

PodMonitor (Alternative)

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vibed
  namespace: vibed-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vibed
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 30s

Example Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vibed-alerts
  namespace: vibed-system
spec:
  groups:
    - name: vibed.rules
      rules:
        - alert: VibeDHighDeployFailureRate
          expr: rate(vibed_deploys_total{status="error"}[5m]) > 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Deploy failure rate is elevated"

        - alert: VibeDHighArtifactCount
          expr: sum(vibed_artifacts_active) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active artifacts ({{ $value }})"

        - alert: VibeDGCHighCleanupRate
          expr: increase(vibed_gc_resources_cleaned_total[1h]) > 10
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GC is cleaning many orphaned resources ({{ $value }}/hr)"

Health Endpoints

vibeD exposes two health endpoints that are always open (no authentication required):

Endpoint	Purpose	Used By
`/healthz`	Liveness probe	Kubernetes restarts the pod if this fails
`/readyz`	Readiness probe	Kubernetes removes the pod from service if this fails

Both return JSON responses:

// GET /healthz
{
  "status": "ok",
  "uptime": "2h30m15s"
}

// GET /readyz
{
  "status": "ready",
  "components": {
    "store": "ok",
    "kubernetes": "ok"
  }
}

The Helm chart configures these probes with sensible defaults:

Probe	Initial Delay	Period	Timeout
Liveness (`/healthz`)	5s	30s	3s
Readiness (`/readyz`)	3s	10s	3s
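External checks (a CI smoke test, for example) can apply the same logic as the readiness probe by inspecting the /readyz body. A minimal sketch, assuming the response shape shown above: the function name is hypothetical, and treating any component value other than "ok" as not ready is an assumption, since only the healthy response is documented.

```python
import json


def is_ready(body):
    """Return True if a /readyz body reports ready with all components ok."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    components = payload.get("components", {})
    if not isinstance(components, dict):
        return False
    return payload.get("status") == "ready" and all(
        state == "ok" for state in components.values()
    )


body = '{"status": "ready", "components": {"store": "ok", "kubernetes": "ok"}}'
print(is_ready(body))
```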

Grafana Dashboard

The testbed/observability stack ships a ready-made vibeD Overview dashboard (testbed/observability/dashboards/vibed-overview.json), which make install-observability installs automatically. You can also build your own from the metrics above. Recommended panels:

Deploy Success Rate - sum(rate(vibed_deploys_total{status="success"}[5m])) / sum(rate(vibed_deploys_total[5m]))
Active Artifacts - sum by (target) (vibed_artifacts_active)
MCP Tool Usage - sum by (tool) (rate(vibed_mcp_tool_calls_total[5m]))
HTTP Request Rate - sum by (status_code) (rate(vibed_http_requests_total[5m]))
HTTP Latency P99 - histogram_quantile(0.99, sum by (le) (rate(vibed_http_request_duration_seconds_bucket[5m])))
GC Cleanup Rate - sum by (type) (rate(vibed_gc_resources_cleaned_total[1h]))
SSE Connections - vibed_sse_connections_active
Idle Runners - sum by (language) (vibed_pool_runners_idle)
Runner Claims (warm vs cold) - sum by (source) (rate(vibed_pool_claims_total[5m]))
Claim Latency P99 - histogram_quantile(0.99, sum by (le) (rate(vibed_pool_claim_duration_seconds_bucket[5m])))

Distributed Tracing (OpenTelemetry)

vibeD supports OpenTelemetry distributed tracing, providing end-to-end visibility into the deploy pipeline. Each deploy produces a trace with child spans for build, push, and deploy steps.

Enabling Tracing

tracing:
  enabled: true
  endpoint: "http://jaeger:4317"   # OTLP gRPC endpoint
  sampleRate: 1.0                  # 1.0 = sample all, 0.1 = 10%

Or via environment variables:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317   # Also enables tracing
export VIBED_TRACING_SAMPLE_RATE=1.0

Exporters

Configuration	Behavior
`endpoint` set	Sends traces via OTLP gRPC to the specified collector (Jaeger, Tempo, etc.)
`endpoint` empty	Prints traces to stdout in pretty-print format (development mode)
`enabled: false`	No-op tracer, zero overhead
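The table above amounts to a three-way decision. The sketch below restates it as code, purely for illustration: the function name and return strings are hypothetical, tracing is assumed to be off when enabled is absent, and letting OTEL_EXPORTER_OTLP_ENDPOINT take precedence over the configured endpoint is an assumption, since the precedence is not specified here.

```python
def resolve_exporter(tracing_config, env):
    """Pick an exporter kind from the tracing config and environment.

    Returns "noop", "otlp-grpc", or "stdout".
    """
    env_endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    # Assumption: the environment variable wins over the config file.
    endpoint = env_endpoint or tracing_config.get("endpoint", "")
    # Setting the environment variable also enables tracing.
    enabled = bool(tracing_config.get("enabled", False)) or bool(env_endpoint)
    if not enabled:
        return "noop"
    return "otlp-grpc" if endpoint else "stdout"


print(resolve_exporter({"enabled": True, "endpoint": "http://jaeger:4317"}, {}))
```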

Trace Structure

A deploy operation produces spans like:

orchestrator.Deploy (root)
  +-- builder.Build
  +-- deployer.Deploy

Update and rollback operations are similarly instrumented. HTTP requests are traced via the otelhttp middleware, which extracts and injects traceparent headers.
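When debugging trace propagation it helps to know what a traceparent header carries. The sketch below parses the version 00 layout defined by the W3C Trace Context specification (version, trace ID, parent span ID, flags). It is a standalone illustration with a hypothetical function name, not the otelhttp implementation.

```python
import re

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a traceparent header, returning None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

The trace_id value is what to search for in Jaeger or Tempo when correlating a client request with a deploy trace.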

Viewing Traces

Any OpenTelemetry-compatible backend works: Jaeger, Grafana Tempo, Datadog, Honeycomb, or New Relic. For the dev setup the testbed/observability chart bundles Tempo (plus Loki for logs and Prometheus for metrics); use the stdout exporter for quick debugging without a backend.
