Incident Response

Structured response to platform incidents with clear severity levels and communication paths.

Severity Levels

Level	Criteria	Response Time	Escalation
SEV-1	Platform down, data breach, security incident	15 min	CTO + Legal
SEV-2	Major feature broken, partial outage	1 hour	Engineering Manager
SEV-3	Degraded performance, non-critical bug	4 hours	Team Lead
SEV-4	Cosmetic issue, minor inconvenience	24 hours	On-call engineer

Response Protocol

1. Detect

Monitoring alert (CloudWatch, PagerDuty)
Customer report (Zendesk, Slack)
Automated health check failure

2. Triage

Assign severity
Create incident channel: #incident-YYYY-MM-DD-sev-N
Notify on-call engineer

3. Mitigate

Apply workaround or rollback
Communicate status to customers (status page)
Document timeline

4. Resolve

Deploy fix
Verify health checks
Close incident channel

5. Post-Mortem

Within 48 hours for SEV-1/2
Within 1 week for SEV-3

Post-Mortem Template

# Incident Post-Mortem: [Title]

## Summary
- Date: YYYY-MM-DD HH:MM IST
- Duration: X hours Y minutes
- Severity: SEV-N
- Impact: [Number of users, features affected]

## Timeline
- HH:MM - Detection
- HH:MM - Triage
- HH:MM - Mitigation applied
- HH:MM - Resolution

## Root Cause
[What happened and why]

## Lessons Learned
[What went well, what didn't]

## Action Items
- [ ] [Owner] [Due date] [Action]

Incident Response

Incident Response

Severity Levels

Response Protocol

1. Detect

2. Triage

3. Mitigate

4. Resolve

5. Post-Mortem

Post-Mortem Template

Related