
Incident response

| Level | Meaning | Response time |
|---|---|---|
| SEV1 | Customer-facing outage; data loss risk; security breach | 5 min |
| SEV2 | Degraded service; one or more features broken; elevated error rate | 15 min |
| SEV3 | Internal-only issue; workarounds exist; no customer impact | Next business day |

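
The response times in the severity table can be turned into a concrete deadline. The following is a minimal sketch, not part of any existing tooling: the function names are hypothetical, business days are assumed to be Monday to Friday with no holiday calendar, and "next business day" is assumed to mean by the end of that day.

```python
from datetime import datetime, time, timedelta

# Response times from the severity table, in minutes.
# SEV3 is "next business day" and is handled separately below.
RESPONSE_MINUTES = {"SEV1": 5, "SEV2": 15}


def next_business_day(day):
    """Return the first Mon-Fri date after `day` (holidays are ignored)."""
    day += timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def response_deadline(level, opened_at):
    """Latest time a responder should pick up an incident opened at `opened_at`."""
    if level in RESPONSE_MINUTES:
        return opened_at + timedelta(minutes=RESPONSE_MINUTES[level])
    if level == "SEV3":
        # Assumption: "next business day" means by the end of that day.
        return datetime.combine(next_business_day(opened_at.date()), time(23, 59))
    raise ValueError(f"unknown severity level: {level!r}")
```

For example, `response_deadline("SEV2", datetime(2026, 4, 18, 14, 23))` gives 14:38 on the same day, while a SEV3 opened on a Saturday is due by the end of the following Monday.
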
  1. Acknowledge — PagerDuty ack within 5 min
  2. Create a thread in #incidents — title [SEV?] <short description>
  3. Assign a commander — the person running the incident (usually you, the primary)
  4. Start a status doc — for SEV1/2, create one in Google Docs
  5. Mitigate — get the bleeding stopped before rooting around for root cause
  6. Update status — every 15 min for SEV1, every 30 min for SEV2, in #incidents and the status doc
  7. Resolve — declare it resolved in PagerDuty; keep monitoring for 30 min
  8. Postmortem — within 48 hours for SEV1/2. See Postmortem template below.
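
The timing rules in the steps above (thread title, update cadence, post-resolve monitoring, postmortem deadline) can be sketched as small helpers. All names here are hypothetical, and the 48-hour postmortem window is assumed to start at resolution, which the steps do not state explicitly.

```python
from datetime import datetime, timedelta

UPDATE_EVERY_MIN = {"SEV1": 15, "SEV2": 30}  # step 6: status update cadence
MONITOR_AFTER_RESOLVE_MIN = 30               # step 7: keep watching after resolve
POSTMORTEM_WITHIN_HOURS = 48                 # step 8: SEV1/2 only


def thread_title(level, description):
    """Step 2: title for the #incidents thread, e.g. '[SEV2] <short description>'."""
    return f"[{level}] {description}"


def next_update_due(level, last_update):
    """Step 6: when the next status update is due, or None if no cadence applies."""
    minutes = UPDATE_EVERY_MIN.get(level)
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


def monitoring_ends(resolved_at):
    """Step 7: end of the post-resolve monitoring window."""
    return resolved_at + timedelta(minutes=MONITOR_AFTER_RESOLVE_MIN)


def postmortem_due(level, resolved_at):
    """Step 8: postmortem deadline (assumed to count from resolution)."""
    if level not in ("SEV1", "SEV2"):
        return None
    return resolved_at + timedelta(hours=POSTMORTEM_WITHIN_HOURS)
```

Keeping the numbers in module-level constants means a change to the policy only has to be made in one place.
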
# [SEV2] Pipeline blocks spiking — 2026-04-18 14:23 CET
## What's happening
Pipeline block rate on prod jumped from ~0.2% baseline to 18% at 14:15 CET.
Affects all agents using the OpenAI provider.
## Impact
- Affected customers: ~12 active (filtering audit trail by `provider=openai`)
- User-visible: chat completions returning 403 with reason "policy_denied"
## Current status
INVESTIGATING — @jens rolling out hotfix to detect_injection threshold
## Timeline (newest first)
- 14:34 — hotfix PR merged, deploy to prod in flight
- 14:29 — identified regression in detect_injection (commit abc123)
- 14:23 — page fired, primary acknowledged, investigation started
- 14:15 — block rate spike begins (per Grafana)
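
The newest-first timeline in the example status doc can be kept in order mechanically. This is a minimal sketch with hypothetical helper names: it sorts timestamped entries newest first, formats them like the timeline lines above, and computes the gap between the first signal and the page.

```python
from datetime import datetime


def render_timeline(entries):
    """Format (datetime, text) pairs as bullet lines, newest first."""
    ordered = sorted(entries, key=lambda entry: entry[0], reverse=True)
    # \u2014 is the dash separator used by the timeline lines in the status doc.
    return [f"- {when:%H:%M} \u2014 {text}" for when, text in ordered]


def minutes_between(earlier, later):
    """Whole minutes elapsed between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


# Entries taken from the example status doc above.
entries = [
    (datetime(2026, 4, 18, 14, 15), "block rate spike begins (per Grafana)"),
    (datetime(2026, 4, 18, 14, 23), "page fired, primary acknowledged, investigation started"),
    (datetime(2026, 4, 18, 14, 29), "identified regression in detect_injection (commit abc123)"),
    (datetime(2026, 4, 18, 14, 34), "hotfix PR merged, deploy to prod in flight"),
]
lines = render_timeline(entries)
detection_lag = minutes_between(entries[0][0], entries[1][0])  # spike start to page
```

With the example data, the spike started at 14:15 and the page fired at 14:23, so `detection_lag` is 8 minutes.
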
# Postmortem — [SEV?] <short description>
**Date:** YYYY-MM-DD
**Duration:** HH:MM → HH:MM (X min)
**Author:** @handle
**Reviewers:** @handle, @handle
## Summary
<2-3 sentences — what broke, who was affected, how we fixed it>
## Impact
- Customers affected: <count / names>
- User-visible symptom: <what users saw, e.g. error code or message>
- Data integrity: <none / suspected / confirmed>
- Revenue impact: <none / estimated>
## Timeline
<copy from the status doc>
## Root cause
<what actually caused it, explained so a new hire would understand>
## What went well
- …
## What went poorly
- …
## Action items
| # | Action | Owner | Due |
|---|---|---|---|
| 1 | <specific, testable> | @handle | YYYY-MM-DD |

Postmortems are blameless. Fix the system, not the person.
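
The **Duration** line of the postmortem template can be filled in from the incident's start and end timestamps. This is a minimal sketch; `duration_line` is a hypothetical helper, and it assumes both timestamps are in the same time zone.

```python
from datetime import datetime


def duration_line(start, end):
    """Build the template's Duration line: 'HH:MM -> HH:MM (X min)' with an arrow."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    minutes = int((end - start).total_seconds() // 60)
    # \u2192 is the arrow used in the postmortem template.
    return f"**Duration:** {start:%H:%M} \u2192 {end:%H:%M} ({minutes} min)"
```

Because it works on full `datetime` values rather than clock times, an incident that runs past midnight still gets the correct minute count. The resolution time is not part of the example status doc, so any end timestamp passed in when trying this out is illustrative.
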