Observability

Three pillars: logs, metrics, and traces. A single pane of glass for platform health.


Architecture

Services ──► OpenTelemetry Collector ──►
    ├──► CloudWatch Logs (raw logs)
    ├──► CloudWatch Metrics (aggregates)
    ├──► X-Ray (distributed traces)
    └──► Grafana (dashboards)
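
The fan-out above can be sketched as a Collector pipeline config. This is a minimal illustration, not the deployed config: the exporter names (`awscloudwatchlogs`, `awsemf`, `awsxray`) come from opentelemetry-collector-contrib, and the region and log group values are placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  awscloudwatchlogs:
    region: us-east-1              # placeholder
    log_group_name: /platform/app  # placeholder
    log_stream_name: default
  awsemf:                          # CloudWatch Metrics via Embedded Metric Format
    region: us-east-1
  awsxray:
    region: us-east-1

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [awscloudwatchlogs]
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [awsemf]
```

Grafana typically queries CloudWatch and X-Ray as data sources rather than receiving data from the Collector directly.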

Logging

Source        Destination         Retention   Format
Application   CloudWatch Logs     30 days     Structured JSON
Audit         PostgreSQL          7 years     Immutable rows
Access        CloudFront + ALB    90 days     Apache-style
Error         Sentry (optional)   90 days     Stack traces

Log format:

{
  "timestamp": "2026-05-15T10:00:00Z",
  "level": "INFO",
  "service": "nexus-backend",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Document extraction completed",
  "attributes": {
    "document_id": "doc_789",
    "user_id": "usr_456",
    "duration_ms": 45000
  }
}
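
A formatter producing this shape can be built on the standard library alone. This is an illustrative sketch, not the platform's actual logger: the `JsonFormatter` class and the convention of passing `trace_id`, `span_id`, and `attributes` via `extra=` are assumptions.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render log records as structured JSON in the schema shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "service": self.service,
            # Trace context is attached per record by the caller via extra=.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("nexus-backend")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="nexus-backend"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Document extraction completed",
    extra={
        "trace_id": "abc123",
        "span_id": "def456",
        "attributes": {"document_id": "doc_789", "duration_ms": 45000},
    },
)
```

Keys passed through `extra=` must not collide with built-in `LogRecord` attribute names; `trace_id`, `span_id`, and `attributes` are safe.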

Metrics

Category       Metric                       Alert Threshold
Availability   /health uptime               < 99.5% for 5 min
Latency        p95 API response time        > 2s for 5 min
Throughput     Documents processed / min    Drop > 50%
Errors         5xx rate                     > 1% for 5 min
Queue          Celery queue depth           > 100 jobs
Cost           LLM token spend / hour       > 2× baseline
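
The "for 5 min" thresholds amount to a rolling-window check over per-minute aggregates. A sketch of how the p95 latency rule could be evaluated (the function names and nearest-rank percentile method are illustrative, not the alerting engine's actual logic):

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def latency_alert(p95_by_minute: list[float],
                  threshold_s: float = 2.0,
                  window_min: int = 5) -> bool:
    """Fire only when p95 exceeds the threshold for the entire window."""
    recent = p95_by_minute[-window_min:]
    return len(recent) == window_min and all(v > threshold_s for v in recent)
```

Requiring the full window to breach avoids paging on a single slow minute.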

Dashboards:

  • Platform Overview: Uptime, latency, error rate, traffic
  • Nexus Pipeline: Queue depth, processing time, accuracy
  • Zen Chat: Sessions, messages, token usage, guardrail triggers
  • Cost: Per-tenant spend, per-model usage, cache hit rate

Tracing

OpenTelemetry traces span service boundaries:

Trace: chat_session_abc
├── Span: frontend_request (50ms)
├── Span: studio_auth (15ms)
├── Span: zen_process (1200ms)
│   ├── Span: supervisor_classify (200ms)
│   ├── Span: kb_retrieval (150ms)
│   └── Span: llm_invoke (800ms)
└── Span: response_stream (300ms)

Trace IDs are propagated via the W3C Trace Context traceparent header and included in all log entries.
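
In Trace Context version 00, the header has the form `00-<32-hex trace id>-<16-hex span id>-<2-hex flags>`. A minimal sketch of parsing the incoming header and minting the outgoing one (the helper names are illustrative; real services would normally delegate this to the OpenTelemetry propagator):

```python
import re
import secrets

# version 00: 00-<trace-id>-<parent-span-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id) or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    return m.group(1), m.group(2)


def child_traceparent(header: str) -> str:
    """Keep the trace ID, mint a fresh span ID for the outgoing request."""
    parsed = parse_traceparent(header)
    trace_id = parsed[0] if parsed else secrets.token_hex(16)  # start a new trace
    return f"00-{trace_id}-{secrets.token_hex(8)}-01"
```

Because the trace ID survives each hop, every span in the tree above can be joined back to `chat_session_abc` and matched against the `trace_id` field in the logs.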


Alerting

Severity        Channel                      Response
P1 (critical)   PagerDuty + Slack + Email    On-call engineer
P2 (major)      Slack + Email                Team lead
P3 (minor)      Email                        Next business day
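
The severity matrix reduces to a small routing table. A sketch of how alerts might be fanned out (the `ROUTES` table and `build_alert` helper are illustrative, not the actual alert router):

```python
# Notification channels per severity, mirroring the matrix above.
ROUTES = {
    "P1": ["pagerduty", "slack", "email"],
    "P2": ["slack", "email"],
    "P3": ["email"],
}


def channels_for(severity: str) -> list[str]:
    """Channels for a severity; unknown severities fall back to P3 handling."""
    return ROUTES.get(severity, ROUTES["P3"])


def build_alert(severity: str, title: str, runbook_url: str) -> dict:
    """Assemble an alert payload; every alert carries its runbook link."""
    return {
        "severity": severity,
        "title": title,
        "channels": channels_for(severity),
        "runbook": runbook_url,
    }
```

Falling back to the quietest tier for unknown severities keeps a misconfigured alert from paging the on-call engineer.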

Runbook links: Every alert includes a link to the relevant runbook.