Incident Response

Structured response to platform incidents with clear severity levels and communication paths.


Severity Levels

| Level | Criteria                                      | Response Time | Escalation          |
|-------|-----------------------------------------------|---------------|---------------------|
| SEV-1 | Platform down, data breach, security incident | 15 min        | CTO + Legal         |
| SEV-2 | Major feature broken, partial outage          | 1 hour        | Engineering Manager |
| SEV-3 | Degraded performance, non-critical bug        | 4 hours       | Team Lead           |
| SEV-4 | Cosmetic issue, minor inconvenience           | 24 hours      | On-call engineer    |
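
The severity levels above could be encoded as a small lookup so that paging or dashboard tooling reads the same numbers as this page. The following is a minimal sketch, not an existing tool: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `response_deadline` are hypothetical, and it assumes the response-time clock starts at detection, which this page does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as listed in the table; lower number = more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: timedelta  # maximum time to first response
    escalation: str           # who gets escalated to


# Values copied from the severity table.
POLICIES = {
    Severity.SEV1: SeverityPolicy(timedelta(minutes=15), "CTO + Legal"),
    Severity.SEV2: SeverityPolicy(timedelta(hours=1), "Engineering Manager"),
    Severity.SEV3: SeverityPolicy(timedelta(hours=4), "Team Lead"),
    Severity.SEV4: SeverityPolicy(timedelta(hours=24), "On-call engineer"),
}


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Latest acceptable first-response time.

    Assumption: the clock starts when the incident is detected.
    """
    return detected_at + POLICIES[severity].response_time
```

For example, `response_deadline(Severity.SEV2, detected_at)` returns a time one hour after `detected_at`, and `POLICIES[Severity.SEV2].escalation` gives the escalation contact for that level.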

Response Protocol

1. Detect

  • Monitoring alert (CloudWatch, PagerDuty)
  • Customer report (Zendesk, Slack)
  • Automated health check failure

2. Triage

  • Assign severity
  • Create incident channel: #incident-YYYY-MM-DD-sev-N
  • Notify on-call engineer
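
Generating the channel name rather than typing it by hand helps keep it consistent with the `#incident-YYYY-MM-DD-sev-N` pattern. A minimal sketch follows; the function name `incident_channel_name` is hypothetical, and actually creating the channel in a chat tool is out of scope here.

```python
from datetime import date


def incident_channel_name(opened_on: date, severity: int) -> str:
    """Build a channel name of the form #incident-YYYY-MM-DD-sev-N."""
    if severity not in (1, 2, 3, 4):
        raise ValueError(f"severity must be 1-4, got {severity!r}")
    # The format spec works for both date and datetime inputs.
    return f"#incident-{opened_on:%Y-%m-%d}-sev-{severity}"
```

Called as `incident_channel_name(date(2024, 3, 5), 2)`, this yields `#incident-2024-03-05-sev-2`.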

3. Mitigate

  • Apply workaround or rollback
  • Communicate status to customers (status page)
  • Document timeline

4. Resolve

  • Deploy fix
  • Verify health checks
  • Close incident channel

5. Post-Mortem

  • Within 48 hours for SEV-1/2
  • Within 1 week for SEV-3
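
The post-mortem windows above can be turned into a due date for tracking. This is a sketch under two assumptions that this page does not confirm: the window is counted from the time the incident is resolved, and SEV-4 incidents have no required post-mortem because no window is listed for them. The names `POST_MORTEM_WINDOWS` and `post_mortem_due` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Windows as listed above; SEV-4 is absent because no window is given for it.
POST_MORTEM_WINDOWS = {
    1: timedelta(hours=48),
    2: timedelta(hours=48),
    3: timedelta(weeks=1),
}


def post_mortem_due(severity: int, resolved_at: datetime) -> Optional[datetime]:
    """Return the post-mortem due time, or None if no window is defined.

    Assumption: the window starts at resolution time.
    """
    window = POST_MORTEM_WINDOWS.get(severity)
    if window is None:
        return None
    return resolved_at + window
```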

Post-Mortem Template

# Incident Post-Mortem: [Title]

## Summary
- Date: YYYY-MM-DD HH:MM IST
- Duration: X hours Y minutes
- Severity: SEV-N
- Impact: [Number of users, features affected]

## Timeline
- HH:MM - Detection
- HH:MM - Triage
- HH:MM - Mitigation applied
- HH:MM - Resolution

## Root Cause
[What happened and why]

## Lessons Learned
[What went well, what didn't]

## Action Items
- [ ] [Owner] [Due date] [Action]
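
The `Duration` field in the template can be derived from the timeline instead of being calculated by hand. A minimal sketch, assuming duration is measured from the Detection entry to the Resolution entry; the function name `format_duration` is hypothetical.

```python
from datetime import datetime


def format_duration(detected_at: datetime, resolved_at: datetime) -> str:
    """Format elapsed time as 'X hours Y minutes' for the Summary section.

    Assumption: duration runs from detection to resolution.
    Seconds are truncated, not rounded.
    """
    if resolved_at < detected_at:
        raise ValueError("resolved_at must not be earlier than detected_at")
    total_minutes = int((resolved_at - detected_at).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"
```

For an incident detected at 09:30 and resolved at 12:05 on the same day, this returns `2 hours 35 minutes`.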