Observability

Three pillars: logs, metrics, and traces. A single pane of glass for platform health.


Architecture

Services ──► OpenTelemetry Collector ──►
    ├──► CloudWatch Logs (raw logs)
    ├──► CloudWatch Metrics (aggregates)
    ├──► X-Ray (distributed traces)
    └──► Grafana (dashboards)
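
The fan-out above can be sketched as a Collector pipeline config. This is a minimal illustration, not the deployed config: the exporter names (`awscloudwatchlogs`, `awsemf`, `awsxray`) come from opentelemetry-collector-contrib, and the region and log group values are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  awscloudwatchlogs:
    region: us-east-1              # placeholder
    log_group_name: /platform/app  # placeholder
    log_stream_name: default
  awsemf:                          # CloudWatch Metrics via Embedded Metric Format
    region: us-east-1
  awsxray:
    region: us-east-1

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [awscloudwatchlogs]
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [awsemf]
```

Grafana typically queries CloudWatch and X-Ray as data sources rather than receiving data from the Collector directly.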

Logging

Source        Destination         Retention   Format
Application   CloudWatch Logs     30 days     Structured JSON
Audit         PostgreSQL          7 years     Immutable rows
Access        CloudFront + ALB    90 days     Apache-style
Error         Sentry (optional)   90 days     Stack traces

Log format:

{
  "timestamp": "2026-05-15T10:00:00Z",
  "level": "INFO",
  "service": "nexus-backend",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Document extraction completed",
  "attributes": {
    "document_id": "doc_789",
    "user_id": "usr_456",
    "duration_ms": 45000
  }
}
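
A formatter producing this shape can be built on the standard library alone. This is an illustrative sketch, not the platform's actual logger: the `JsonFormatter` class and the convention of passing `trace_id`, `span_id`, and `attributes` via `extra=` are assumptions.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render log records as structured JSON in the schema shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "service": self.service,
            # Trace context is attached per record by the caller via extra=.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("nexus-backend")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="nexus-backend"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Document extraction completed",
    extra={
        "trace_id": "abc123",
        "span_id": "def456",
        "attributes": {"document_id": "doc_789", "duration_ms": 45000},
    },
)
```

Keys passed through `extra=` must not collide with built-in `LogRecord` attribute names; `trace_id`, `span_id`, and `attributes` are safe.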

Metrics

Category       Metric                       Alert Threshold
Availability   /health uptime               < 99.5% for 5 min
Latency        p95 API response time        > 2s for 5 min
Throughput     Documents processed / min    Drop > 50%
Errors         5xx rate                     > 1% for 5 min
Queue          Celery queue depth           > 100 jobs
Cost           LLM token spend / hour       > 2× baseline
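
The "for 5 min" thresholds amount to a rolling-window check over per-minute aggregates. A sketch of how the p95 latency rule could be evaluated (the function names and nearest-rank percentile method are illustrative, not the alerting engine's actual logic):

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def latency_alert(p95_by_minute: list[float],
                  threshold_s: float = 2.0,
                  window_min: int = 5) -> bool:
    """Fire only when p95 exceeds the threshold for the entire window."""
    recent = p95_by_minute[-window_min:]
    return len(recent) == window_min and all(v > threshold_s for v in recent)
```

Requiring the full window to breach avoids paging on a single slow minute.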

Dashboards:

  • Platform Overview: Uptime, latency, error rate, traffic
  • Nexus Pipeline: Queue depth, processing time, accuracy
  • Zen Chat: Sessions, messages, token usage, guardrail triggers
  • Cost: Per-tenant spend, per-model usage, cache hit rate

Tracing

OpenTelemetry traces span service boundaries:

Trace: chat_session_abc
├── Span: frontend_request (50ms)
├── Span: studio_auth (15ms)
├── Span: zen_process (1200ms)
│   ├── Span: supervisor_classify (200ms)
│   ├── Span: kb_retrieval (150ms)
│   └── Span: llm_invoke (800ms)
└── Span: response_stream (300ms)

Trace IDs are propagated via the W3C Trace Context traceparent header and included in all log entries.
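
In Trace Context version 00, the header has the form `00-<32-hex trace id>-<16-hex span id>-<2-hex flags>`. A minimal sketch of parsing the incoming header and minting the outgoing one (the helper names are illustrative; real services would normally delegate this to the OpenTelemetry propagator):

```python
import re
import secrets

# version 00: 00-<trace-id>-<parent-span-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id) or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    return m.group(1), m.group(2)


def child_traceparent(header: str) -> str:
    """Keep the trace ID, mint a fresh span ID for the outgoing request."""
    parsed = parse_traceparent(header)
    trace_id = parsed[0] if parsed else secrets.token_hex(16)  # start a new trace
    return f"00-{trace_id}-{secrets.token_hex(8)}-01"
```

Because the trace ID survives each hop, every span in the tree above can be joined back to `chat_session_abc` and matched against the `trace_id` field in the logs.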


Alerting

Severity        Channel                      Response
P1 (critical)   PagerDuty + Slack + Email    On-call engineer
P2 (major)      Slack + Email                Team lead
P3 (minor)      Email                        Next business day
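
The severity matrix reduces to a small routing table. A sketch of how alerts might be fanned out (the `ROUTES` table and `build_alert` helper are illustrative, not the actual alert router):

```python
# Notification channels per severity, mirroring the matrix above.
ROUTES = {
    "P1": ["pagerduty", "slack", "email"],
    "P2": ["slack", "email"],
    "P3": ["email"],
}


def channels_for(severity: str) -> list[str]:
    """Channels for a severity; unknown severities fall back to P3 handling."""
    return ROUTES.get(severity, ROUTES["P3"])


def build_alert(severity: str, title: str, runbook_url: str) -> dict:
    """Assemble an alert payload; every alert carries its runbook link."""
    return {
        "severity": severity,
        "title": title,
        "channels": channels_for(severity),
        "runbook": runbook_url,
    }
```

Falling back to the quietest tier for unknown severities keeps a misconfigured alert from paging the on-call engineer.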

Runbook links: Every alert includes a link to the relevant runbook.