Developers
Incident Response
Severity levels, communication protocols, and post-mortem templates.
Incident Response
Structured response to platform incidents with clear severity levels and communication paths.
Severity Levels
| Level | Criteria | Response Time | Escalation |
|---|---|---|---|
| SEV-1 | Platform down, data breach, security incident | 15 min | CTO + Legal |
| SEV-2 | Major feature broken, partial outage | 1 hour | Engineering Manager |
| SEV-3 | Degraded performance, non-critical bug | 4 hours | Team Lead |
| SEV-4 | Cosmetic issue, minor inconvenience | 24 hours | On-call engineer |
Response Protocol
1. Detect
- Monitoring alert (CloudWatch, PagerDuty)
- Customer report (Zendesk, Slack)
- Automated health check failure
2. Triage
- Assign severity
- Create incident channel:
#incident-YYYY-MM-DD-sev-N - Notify on-call engineer
3. Mitigate
- Apply workaround or rollback
- Communicate status to customers (status page)
- Document timeline
4. Resolve
- Deploy fix
- Verify health checks
- Close incident channel
5. Post-Mortem
- Within 48 hours for SEV-1/2
- Within 1 week for SEV-3
Post-Mortem Template
# Incident Post-Mortem: [Title]
## Summary
- Date: YYYY-MM-DD HH:MM IST
- Duration: X hours Y minutes
- Severity: SEV-N
- Impact: [Number of users, features affected]
## Timeline
- HH:MM - Detection
- HH:MM - Triage
- HH:MM - Mitigation applied
- HH:MM - Resolution
## Root Cause
[What happened and why]
## Lessons Learned
[What went well, what didn't]
## Action Items
- [ ] [Owner] [Due date] [Action]