Incident response
Severity levels
| Level | Meaning | Response time |
|---|---|---|
| SEV1 | Customer-facing outage; data loss risk; security breach | 5 min |
| SEV2 | Degraded service; one or more features broken; elevated error rate | 15 min |
| SEV3 | Internal-only issue; workarounds exist; no customer impact | Next business day |
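
The response times in the table can be encoded as a small lookup, so that tooling and people work from the same numbers. This is a minimal Python sketch, not an existing tool: the names `RESPONSE_MINUTES` and `response_deadline` are hypothetical, and "next business day" is assumed to mean the next Monday to Friday at 09:00, with holidays ignored.

```python
from datetime import datetime, timedelta

# Response times from the severity table. None stands for "next business day".
RESPONSE_MINUTES = {"SEV1": 5, "SEV2": 15, "SEV3": None}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the time by which the incident must have a response."""
    minutes = RESPONSE_MINUTES[severity]  # raises KeyError for an unknown level
    if minutes is not None:
        return opened_at + timedelta(minutes=minutes)
    # Assumption: next business day = next Mon-Fri at 09:00, holidays ignored.
    day = opened_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=9, minute=0, second=0, microsecond=0)
```

For example, `response_deadline("SEV2", datetime(2026, 4, 18, 14, 23))` gives 14:38 on the same day.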
- Acknowledge: PagerDuty `ack` within 5 min
- Create a thread in `#incidents`: title `[SEV?] <short description>`
- Assign a commander: the person running the incident (usually you, the primary)
- Start a status doc: for SEV1/2, create a status doc in Google Docs
- Mitigate: get the bleeding stopped before rooting around for root cause
- Update status: every 15 min for SEV1, every 30 min for SEV2, in `#incidents` and the status doc
- Resolve: declare it resolved in PagerDuty; keep monitoring for 30 min
- Postmortem: within 48 hours for SEV1/2. See Postmortem template below.
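
The timing rules in these steps are simple enough to compute instead of remembering. This sketch shows the thread title format, the status update cadence, and the postmortem deadline. All function names are hypothetical, and it assumes the 48 hours count from the moment the incident is resolved, which the steps do not state.

```python
from datetime import datetime, timedelta
from typing import Optional

# Status update cadence in minutes; no cadence is stated for SEV3.
UPDATE_INTERVAL_MIN = {"SEV1": 15, "SEV2": 30}


def thread_title(severity: str, description: str) -> str:
    """Format the #incidents thread title as "[SEV?] <short description>"."""
    return f"[{severity}] {description}"


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if none is required."""
    interval = UPDATE_INTERVAL_MIN.get(severity)
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


def postmortem_due(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the postmortem deadline for SEV1/2, or None for other levels.

    Assumption: the 48-hour window starts at resolution time.
    """
    if severity not in ("SEV1", "SEV2"):
        return None
    return resolved_at + timedelta(hours=48)
```

A commander could call `next_update_due` after each post to `#incidents` to know when the next one is expected.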
Status doc template
    # [SEV2] Pipeline blocks spiking - 2026-04-18 14:23 CET

    ## What's happening
    Pipeline block rate on prod jumped from ~0.2% baseline to 18% at 14:15 CET.
    Affects all agents using the OpenAI provider.

    ## Impact
    - Affected customers: ~12 active (filtering audit trail by `provider=openai`)
    - User-visible: chat completions returning 403 with reason "policy_denied"

    ## Current status
    INVESTIGATING - @jens rolling out hotfix to detect_injection threshold

    ## Timeline (newest first)
    - 14:34 - hotfix PR merged, deploy to prod in flight
    - 14:29 - identified regression in detect_injection (commit abc123)
    - 14:23 - page fired, primary acknowledged, investigation started
    - 14:15 - block rate spike begins (per Grafana)

Postmortem template
    # Postmortem: [SEV?] <short description>

    **Date:** YYYY-MM-DD
    **Duration:** HH:MM → HH:MM (X min)
    **Author:** @handle
    **Reviewers:** @handle, @handle

    ## Summary
    <2-3 sentences: what broke, who was affected, how we fixed it>

    ## Impact
    - Customers affected: <count / names>
    - User-visible symptom:
    - Data integrity: <none / suspected / confirmed>
    - Revenue impact: <none / estimated>

    ## Timeline
    <copy from the status doc>

    ## Root cause
    <what actually caused it, explained so a new hire would understand>

    ## What went well
    - …

    ## What went poorly
    - …

    ## Action items
    | # | Action | Owner | Due |
    |---|---|---|---|
    | 1 | <specific, testable> | @handle | YYYY-MM-DD |

Postmortems are blameless. Fix the system, not the person.
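
Two parts of the postmortem template are mechanical: the Duration line and the action items table. The sketch below renders both from structured data. The helper names are hypothetical, and the sketch assumes the incident starts and ends on the same day, since the template shows only clock times.

```python
from datetime import datetime


def duration_line(start: datetime, end: datetime) -> str:
    """Render the Duration field as "HH:MM → HH:MM (X min)"."""
    minutes = int((end - start).total_seconds() // 60)
    return f"**Duration:** {start:%H:%M} → {end:%H:%M} ({minutes} min)"


def action_items_table(items: list[tuple[str, str, str]]) -> str:
    """Render (action, owner, due) tuples as the template's Markdown table."""
    rows = ["| # | Action | Owner | Due |", "|---|---|---|---|"]
    for number, (action, owner, due) in enumerate(items, start=1):
        rows.append(f"| {number} | {action} | {owner} | {due} |")
    return "\n".join(rows)
```

Generating the table keeps the numbering consistent, but the content of each row (a specific, testable action with an owner and a due date) still has to come from the review itself.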