Observability
Logs, metrics, traces, dashboards, and alerting across the Sentinel platform.
Three pillars: logs, metrics, and traces. One pane of glass for platform health.
Architecture
Services ──► OpenTelemetry Collector ──►
├──► CloudWatch Logs (raw logs)
├──► CloudWatch Metrics (aggregates)
├──► X-Ray (distributed traces)
└──► Grafana (dashboards)
Logging
| Source | Destination | Retention | Format |
|---|---|---|---|
| Application | CloudWatch Logs | 30 days | Structured JSON |
| Audit | PostgreSQL | 7 years | Immutable rows |
| Access | CloudFront + ALB | 90 days | Apache-style |
| Error | Sentry (optional) | 90 days | Stack traces |
Log format:
{
  "timestamp": "2026-05-15T10:00:00Z",
  "level": "INFO",
  "service": "nexus-backend",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Document extraction completed",
  "attributes": {
    "document_id": "doc_789",
    "user_id": "usr_456",
    "duration_ms": 45000
  }
}
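As an illustration of how a service might emit this shape, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class and `build_logger` helper are hypothetical names, not part of the platform, and the sketch assumes callers pass `trace_id`, `span_id`, and `attributes` through the `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render a log record in the structured JSON shape shown above."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # These three fields arrive via `extra=`; they fall back to
            # None / {} when the caller does not supply them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
        }
        # default=str keeps logging from failing on non-JSON attribute values.
        return json.dumps(payload, default=str)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("nexus-backend")
    log.info(
        "Document extraction completed",
        extra={
            "trace_id": "abc123",
            "span_id": "def456",
            "attributes": {"document_id": "doc_789", "duration_ms": 45000},
        },
    )
```

Keeping the trace fields at the top level, rather than inside `attributes`, makes it straightforward to filter logs by `trace_id` when correlating them with a trace.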
Metrics
| Category | Metric | Alert Threshold |
|---|---|---|
| Availability | /health uptime | < 99.5% for 5 min |
| Latency | p95 API response time | > 2s for 5 min |
| Throughput | Documents processed / min | Drop > 50% |
| Errors | 5xx rate | > 1% for 5 min |
| Queue | Celery queue depth | > 100 jobs |
| Cost | LLM token spend / hour | > 2× baseline |
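Several thresholds above are sustained conditions ("for 5 min") rather than single-sample checks. The sketch below shows one way such a rule could be evaluated; `Sample` and `breached_for` are hypothetical names, and in practice the alerting backend would normally perform this evaluation itself.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since the epoch
    value: float      # e.g. 5xx rate as a fraction, where 0.01 means 1%


def breached_for(
    samples: Sequence[Sample],
    threshold: float,
    window_s: float,
    now: float,
    compare: Callable[[float, float], bool] = operator.gt,
) -> bool:
    """Return True if every sample in the last `window_s` seconds breaches.

    Pass compare=operator.lt for "below threshold" rules such as uptime.
    """
    window_start = now - window_s
    in_window = [s for s in samples if window_start <= s.timestamp <= now]
    if not in_window:
        return False
    # Require history reaching back to the window start, so that a single
    # fresh spike cannot fire a "for 5 min" alert on its own.
    has_full_window = min(s.timestamp for s in samples) <= window_start
    return has_full_window and all(compare(s.value, threshold) for s in in_window)


if __name__ == "__main__":
    # One sample per minute for five minutes, all at a 2% error rate.
    history = [Sample(timestamp=t, value=0.02) for t in range(0, 301, 60)]
    print(breached_for(history, threshold=0.01, window_s=300, now=300))
```

Requiring every sample in the window to breach is the strictest reading of "for 5 min"; averaging over the window is a common alternative that trades sensitivity for fewer missed alerts.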
Dashboards:
- Platform Overview: Uptime, latency, error rate, traffic
- Nexus Pipeline: Queue depth, processing time, accuracy
- Zen Chat: Sessions, messages, token usage, guardrail triggers
- Cost: Per-tenant spend, per-model usage, cache hit rate
Tracing
OpenTelemetry traces follow a request across services:
Trace: chat_session_abc
├── Span: frontend_request (50ms)
├── Span: studio_auth (15ms)
├── Span: zen_process (1200ms)
│ ├── Span: supervisor_classify (200ms)
│ ├── Span: kb_retrieval (150ms)
│ └── Span: llm_invoke (800ms)
└── Span: response_stream (300ms)
Trace IDs are propagated via the traceparent header (W3C Trace Context) and included in all logs.
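To make the propagation concrete, the sketch below parses an incoming traceparent value and builds the value a service would send downstream. The header layout (`version-traceid-parentid-flags`, lowercase hex) comes from the W3C Trace Context format; the `TraceContext`, `parse_traceparent`, and `child_traceparent` names are hypothetical, and the parser is simplified in that it accepts only the four-field layout. In a real service the OpenTelemetry SDK handles this automatically.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header value, returning None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid in the Trace Context format.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for a downstream call: same trace, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(ctx.trace_id)            # value to log as trace_id
        print(child_traceparent(ctx))  # value to send to the next service
```

The trace ID stays constant along the whole request path while each hop gets a fresh span ID, which is what allows the spans in the tree above to be stitched into a single trace.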
Alerting
| Severity | Channel | Response |
|---|---|---|
| P1 (critical) | PagerDuty + Slack + Email | On-call engineer |
| P2 (major) | Slack + Email | Team lead |
| P3 (minor) | | Next business day |
Runbook links: Every alert includes a link to the relevant runbook.
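The routing table and the runbook requirement could be expressed as data plus a small dispatch function, as in the sketch below. The `ROUTING` mapping, the `Alert` class, the `route` function, and the example runbook URL are all hypothetical. No channel is listed for P3 above, so the sketch leaves it empty rather than guessing.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

ROUTING: Dict[str, Dict[str, Union[List[str], str]]] = {
    "P1": {"channels": ["PagerDuty", "Slack", "Email"], "response": "On-call engineer"},
    "P2": {"channels": ["Slack", "Email"], "response": "Team lead"},
    # The table does not name a channel for P3, so none is assumed here.
    "P3": {"channels": [], "response": "Next business day"},
}


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str     # "P1", "P2", or "P3"
    runbook_url: str  # every alert must link to its runbook


def route(alert: Alert) -> Dict[str, Union[List[str], str]]:
    """Return the notification plan for an alert, enforcing the runbook link."""
    if not alert.runbook_url:
        raise ValueError(f"alert {alert.name!r} has no runbook link")
    if alert.severity not in ROUTING:
        raise ValueError(f"unknown severity {alert.severity!r}")
    rule = ROUTING[alert.severity]
    return {
        "alert": alert.name,
        "channels": list(rule["channels"]),
        "response": rule["response"],
        "runbook_url": alert.runbook_url,
    }


if __name__ == "__main__":
    # Hypothetical alert name and URL, for illustration only.
    plan = route(Alert("high-5xx-rate", "P1", "https://runbooks.example.com/high-5xx-rate"))
    print(plan["channels"], plan["response"])
```

Rejecting alerts without a runbook URL at routing time is one way to make the "every alert includes a link" rule enforceable rather than a convention.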