Monitoring and Observability

vibeD exposes Prometheus metrics, health endpoints, and structured logs for production observability.

Prometheus Metrics

vibeD exposes metrics at /metrics on port 8080. This endpoint is always open (no authentication required) to allow Prometheus scraping without credential management.

When metrics.enabled: true (the default), the Helm chart adds standard Prometheus annotations to the pod:

prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"

Most Prometheus installations with annotation-based discovery will scrape vibeD automatically.
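
For installations that do not already have annotation-based discovery, the following is a sketch of the kind of scrape job that honors these annotations. It uses standard Prometheus kubernetes_sd_configs relabeling; the job name is arbitrary and nothing here is specific to the vibeD chart.

```yaml
# prometheus.yml fragment (illustrative; job name is an arbitrary choice)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path (/metrics for vibeD)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Point the scrape address at prometheus.io/port (8080 for vibeD)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: $1:$2
      # Carry namespace and pod name over as target labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```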

Available Metrics

Deployment Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_deploys_total | Counter | status, target | Total deployments |
| vibed_deploy_duration_seconds | Histogram | status, target | Deploy duration (buckets: 1s, 2s, 5s, 10s, 30s, 60s) |
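
As a sketch of how these two metrics combine, the rule group below precomputes a per-target success ratio and a p95 deploy duration. The group name and the recorded series names are hypothetical, chosen only for illustration. Because the largest finite bucket is 60s, histogram_quantile cannot report a value above 60 for this histogram.

```yaml
# Prometheus rule file fragment (record names are illustrative assumptions)
groups:
  - name: vibed.deploy.recording
    rules:
      # Fraction of deploys that succeeded over the last 5 minutes, per target
      - record: vibed:deploy_success_ratio:rate5m
        expr: |
          sum by (target) (rate(vibed_deploys_total{status="success"}[5m]))
          /
          sum by (target) (rate(vibed_deploys_total[5m]))
      # 95th percentile deploy duration, per target (capped at the 60s bucket)
      - record: vibed:deploy_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, target) (rate(vibed_deploy_duration_seconds_bucket[5m]))
          )
```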

Artifact Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_artifacts_active | Gauge | target | Currently active artifacts by deployment target |
| vibed_deletes_total | Counter | status | Total artifact deletions |

MCP Tool Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_mcp_tool_calls_total | Counter | tool, status | MCP tool invocations |
| vibed_mcp_tool_call_duration_seconds | Histogram | tool | MCP tool call duration (default Prometheus buckets) |

Garbage Collector Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_gc_resources_cleaned_total | Counter | type | Total resources cleaned by garbage collector |

The type label values are: job, configmap, deployment, service, and sandbox.

The GC runs periodically (default: every 1 hour) and removes orphaned Kubernetes resources whose artifact no longer exists in the store. See Configuration Reference for GC settings.

Warm Pool Metrics

Emitted by the warm pools that back the deploy path.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_pool_runners_idle | Gauge | language | Warm idle runner pods available to claim |
| vibed_pool_claims_total | Counter | language, source | Runner claims, by source (warm pool hit vs cold on-demand) |
| vibed_pool_claim_duration_seconds | Histogram | language, source | Time to obtain a runner |
| vibed_pool_runners_created_total | Counter | language, status | Runner pods created, by outcome (ready vs failed warm-up) |

A healthy pool keeps vibed_pool_runners_idle at the configured pool size and serves most claims from the warm source; a rising cold claim rate means the pool is being drained faster than it replenishes.
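
The two signals described above could be turned into alerts roughly as follows. This is a sketch under assumptions: the source label is assumed to take the value cold for on-demand claims (the exact label strings are not documented here), and the thresholds and durations are placeholders to tune.

```yaml
# Prometheus rule file fragment (label value "cold" and thresholds are assumptions)
groups:
  - name: vibed.pool.alerts
    rules:
      # No warm runners left for a language for a sustained period
      - alert: VibeDPoolNoIdleRunners
        expr: min by (language) (vibed_pool_runners_idle) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No idle warm runners for {{ $labels.language }}"
      # More than half of recent claims were served cold
      - alert: VibeDPoolHighColdClaimRatio
        expr: |
          sum by (language) (rate(vibed_pool_claims_total{source="cold"}[10m]))
          /
          sum by (language) (rate(vibed_pool_claims_total[10m]))
          > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Warm pool for {{ $labels.language }} is draining faster than it replenishes"
```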

HTTP API Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_http_requests_total | Counter | method, path, status_code | HTTP API requests |
| vibed_http_request_duration_seconds | Histogram | method, path | HTTP request duration (default Prometheus buckets) |

HTTP paths are normalized to prevent high cardinality (e.g., /api/artifacts/:id instead of individual artifact IDs).
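
Because paths are normalized, per-route error ratios stay cheap to compute. The rule below is a minimal sketch that assumes status_code carries the numeric HTTP status as a string (for example "500"); the 5% threshold is a placeholder.

```yaml
# Prometheus rule file fragment (status_code format and threshold are assumptions)
groups:
  - name: vibed.http.alerts
    rules:
      # Share of requests answered with a 5xx status, per normalized path
      - alert: VibeDHighHTTPErrorRatio
        expr: |
          sum by (path) (rate(vibed_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (path) (rate(vibed_http_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of requests to {{ $labels.path }} are failing"
```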

SSE Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_sse_connections_active | Gauge | - | Number of active Server-Sent Events connections |

The SSE endpoint (GET /api/events) streams real-time artifact lifecycle events to connected dashboard clients. This gauge tracks how many clients are currently connected.

Rate Limiting Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_http_rate_limited_total | Counter | client_type | HTTP requests rejected by rate limiting |

The client_type label is apikey when the client is authenticated or ip when identified by IP address. See Configuration Reference for rate limit settings.
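
A sustained rejection rate usually means either a misbehaving client or limits that are set too low. One possible alert, split by client_type so API-key and IP-identified traffic can be told apart; the threshold of one rejection per second is an arbitrary placeholder.

```yaml
# Prometheus rule file fragment (threshold is a placeholder)
groups:
  - name: vibed.ratelimit.alerts
    rules:
      - alert: VibeDSustainedRateLimiting
        # Rejections per second over 5 minutes, per client_type (apikey or ip)
        expr: sum by (client_type) (rate(vibed_http_rate_limited_total[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Requests from {{ $labels.client_type }} clients are being rate limited"
```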

Governance Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_quota_rejections_total | Counter | scope | Deploys rejected by quota, by the ceiling that tripped (owner or department) |
| vibed_audit_events_total | Counter | action, outcome | Audit events recorded, by action (deploy/delete/rollback) and outcome (ok/denied/error) |

All metrics listed so far are served on the main server's /metrics endpoint (:8080). The bring-your-own base-image validator emits one more metric, which is served on the controller's metrics endpoint (:8081) instead:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vibed_template_validation | Gauge | template, result | Per-slot base-image validation state; alert on vibed_template_validation{result="invalid"} == 1 |
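
The governance metrics lend themselves to alerts like the sketch below. The first rule uses the expression recommended in the table; the other two are illustrative, with placeholder windows and thresholds. The template rule only fires if Prometheus also scrapes the controller endpoint on :8081, which the pod annotations for :8080 do not cover.

```yaml
# Prometheus rule file fragment (windows and thresholds are placeholders)
groups:
  - name: vibed.governance.alerts
    rules:
      # A base-image slot is currently failing validation
      - alert: VibeDTemplateInvalid
        expr: vibed_template_validation{result="invalid"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Base-image template {{ $labels.template }} failed validation"
      # Any deploy rejected by an owner or department quota recently
      - alert: VibeDQuotaRejections
        expr: sum by (scope) (increase(vibed_quota_rejections_total[15m])) > 0
        labels:
          severity: info
        annotations:
          summary: "Deploys are hitting the {{ $labels.scope }} quota"
      # Burst of denied actions in the audit trail
      - alert: VibeDAuditDeniedBurst
        expr: sum by (action) (increase(vibed_audit_events_total{outcome="denied"}[15m])) > 5
        labels:
          severity: warning
        annotations:
          summary: "Repeated denied {{ $labels.action }} attempts"
```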

Label Values

| Label | Possible Values |
| --- | --- |
| status | success, error |
| language | nodejs, python, go, static |
| target | sandbox, kubernetes |
| tool | deploy_artifact, update_artifact, list_artifacts, get_artifact_status, get_artifact_logs, delete_artifact, list_deployment_targets |

Scraping with Prometheus

Annotation-Based Discovery (Default)

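If your Prometheus is configured for annotation-based pod discovery (a kubernetes_sd_configs job that keeps pods annotated with prometheus.io/scrape: "true"), vibeD is scraped automatically and no additional configuration is needed. Prometheus Operator setups such as kube-prometheus-stack generally discover targets through ServiceMonitor and PodMonitor resources rather than pod annotations, so use one of the resources below in that case.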

ServiceMonitor (Prometheus Operator)

For explicit scrape configuration with the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vibed
  namespace: vibed-system
  labels:
    release: prometheus # Must match your Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vibed
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

PodMonitor (Alternative)

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vibed
  namespace: vibed-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vibed
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 30s

Example Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vibed-alerts
  namespace: vibed-system
spec:
  groups:
    - name: vibed.rules
      rules:
        - alert: VibeDHighDeployFailureRate
          # Failed deploys per second, evaluated per target
          expr: rate(vibed_deploys_total{status="error"}[5m]) > 0.1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Deploy failure rate is elevated"

        - alert: VibeDHighArtifactCount
          expr: sum(vibed_artifacts_active) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of active artifacts ({{ $value }})"

        - alert: VibeDGCHighCleanupRate
          # increase() yields a per-hour count; rate() would be per second
          expr: increase(vibed_gc_resources_cleaned_total[1h]) > 10
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GC is cleaning many orphaned resources ({{ $value }}/hr)"

Health Endpoints

vibeD exposes two health endpoints that are always open (no authentication required):

| Endpoint | Purpose | Used By |
| --- | --- | --- |
| /healthz | Liveness probe | Kubernetes restarts the pod if this fails |
| /readyz | Readiness probe | Kubernetes removes the pod from service if this fails |

Both return JSON responses:

// GET /healthz
{
  "status": "ok",
  "uptime": "2h30m15s"
}

// GET /readyz
{
  "status": "ready",
  "components": {
    "store": "ok",
    "kubernetes": "ok"
  }
}

The Helm chart configures these probes with sensible defaults:

| Probe | Initial Delay | Period | Timeout |
| --- | --- | --- | --- |
| Liveness (/healthz) | 5s | 30s | 3s |
| Readiness (/readyz) | 3s | 10s | 3s |
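
Expressed as a Kubernetes container spec, those defaults would look roughly like this. It is a sketch rather than the chart's actual template: it assumes the health endpoints are served on the same port 8080 as /metrics, and it leaves failure thresholds at the Kubernetes defaults.

```yaml
# Container spec fragment (port 8080 for health endpoints is an assumption)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 10
  timeoutSeconds: 3
```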

Grafana Dashboard

The testbed/observability stack ships a ready-made vibeD Overview dashboard (testbed/observability/dashboards/vibed-overview.json), which make install-observability installs automatically. You can also build your own from the metrics above. Recommended panels:

  • Deploy Success Rate - sum(rate(vibed_deploys_total{status="success"}[5m])) / sum(rate(vibed_deploys_total[5m]))
  • Active Artifacts - sum by (target) (vibed_artifacts_active)
  • MCP Tool Usage - sum by (tool) (rate(vibed_mcp_tool_calls_total[5m]))
  • HTTP Request Rate - sum by (status_code) (rate(vibed_http_requests_total[5m]))
  • HTTP Latency P99 - histogram_quantile(0.99, sum by (le) (rate(vibed_http_request_duration_seconds_bucket[5m])))
  • GC Cleanup Rate - sum by (type) (rate(vibed_gc_resources_cleaned_total[1h]))
  • SSE Connections - vibed_sse_connections_active
  • Idle Runners - sum by (language) (vibed_pool_runners_idle)
  • Runner Claims (warm vs cold) - sum by (source) (rate(vibed_pool_claims_total[5m]))
  • Claim Latency P99 - histogram_quantile(0.99, sum by (le) (rate(vibed_pool_claim_duration_seconds_bucket[5m])))

Distributed Tracing (OpenTelemetry)

vibeD supports OpenTelemetry distributed tracing, providing end-to-end visibility into the deploy pipeline. Each deploy produces a trace with child spans for build, push, and deploy steps.

Enabling Tracing

tracing:
  enabled: true
  endpoint: "http://jaeger:4317" # OTLP gRPC endpoint
  sampleRate: 1.0 # 1.0 = sample all, 0.1 = 10%

Or via environment variables:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317   # Also enables tracing
export VIBED_TRACING_SAMPLE_RATE=1.0

Exporters

| Configuration | Behavior |
| --- | --- |
| endpoint set | Sends traces via OTLP gRPC to the specified collector (Jaeger, Tempo, etc.) |
| endpoint empty | Prints traces to stdout in pretty-print format (development mode) |
| enabled: false | No-op tracer, zero overhead |
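
When the endpoint points at an OpenTelemetry Collector rather than directly at a backend, a minimal Collector configuration could look like the sketch below. It is not part of vibeD; the backend address tempo:4317 and the plaintext TLS setting are assumptions for a development cluster.

```yaml
# OpenTelemetry Collector config sketch (backend address is hypothetical)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # vibeD's tracing.endpoint would target this port
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo:4317         # assumed trace backend reachable in-cluster
    tls:
      insecure: true             # plaintext gRPC; suitable for dev setups only
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```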

Trace Structure

A deploy operation produces spans like:

orchestrator.Deploy (root)
+-- builder.Build
+-- deployer.Deploy

Update and rollback operations are similarly instrumented. HTTP requests are traced via the otelhttp middleware, which extracts and injects traceparent headers.

Viewing Traces

Any OpenTelemetry-compatible backend works: Jaeger, Grafana Tempo, Datadog, Honeycomb, or New Relic. For the dev setup, the testbed/observability chart bundles Tempo (plus Loki for logs and Prometheus for metrics). For quick debugging without a backend, use the stdout exporter.