Incident response
We're a small team. There's no on-call rotation, no incident commander, no 5-minute ack SLA. When something breaks, whoever notices first runs the triage. This page tells you what to do.
The 10-minute triage
When you see the 5xx rate climbing, a customer report, or the uptime pager firing:
1. Is the app actually down?
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
      https://eu.tappass.ai/api/health/live
Run it five times. If it's green and fast every time, you're chasing a false positive → Cold-start / uptime alert.
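Repeating the probe by hand is easy to miscount under pressure. A minimal sketch of a helper that runs it several times and summarizes the result; the function names `probe_once` and `summarize_probes`, the 10-second timeout, and the 1-second "slow" cutoff are assumptions for illustration, not part of our tooling:

```shell
# Hypothetical helpers; names, timeout and slow cutoff are illustrative assumptions.
HEALTH_URL="${HEALTH_URL:-https://eu.tappass.ai/api/health/live}"
SLOW_SECONDS="${SLOW_SECONDS:-1.0}"   # assumed cutoff for "fast"; tune to taste

# Emit one "<http_code> <seconds>s" line, same format as the manual curl above.
probe_once() {
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" --max-time 10 "$HEALTH_URL"
}

# Read probe lines on stdin; report whether every probe was a fast 200.
# Exit status: 0 healthy, 1 at least one bad probe, 2 no input at all.
summarize_probes() {
  awk -v slow="$SLOW_SECONDS" '
    NF < 2 { next }
    { total++; t = $2; sub(/s$/, "", t)
      if ($1 != "200" || t + 0 > slow + 0) bad++ }
    END {
      if (total == 0) { print "NO DATA"; exit 2 }
      if (bad > 0)    { printf "SUSPECT %d/%d probes failed or slow\n", bad, total; exit 1 }
      printf "HEALTHY %d/%d probes ok\n", total, total
    }'
}
```

For example, `for i in 1 2 3 4 5; do probe_once; sleep 1; done | summarize_probes`. Anything other than a `HEALTHY` line means carry on to step 2.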
2. What does prod's log look like right now?
    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (severity>=ERROR OR httpRequest.status>=500 OR textPayload=~"Memory limit|OOM")' \
      --project=tappass-prod --limit=30 --freshness=10m \
      --format='value(timestamp,severity,httpRequest.status,httpRequest.requestUrl,textPayload,jsonPayload.event,jsonPayload.error)'
Look for the dominant pattern:
| What you see | Go to |
|---|---|
| `Memory limit of X MiB exceeded` | OOM / crashloop |
| 503s with empty body + long latency | Instance saturation — scale bump, then Deploy core server if the latest deploy introduced it |
| DB connection errors / `OperationalError` | Check Cloud SQL instance state in console |
| OPA timeouts (`opa_authz_unavailable_denied`) | OPA sidecar health; see also OOM / crashloop — OPA too |
| Spike correlates with a recent deploy | Roll back Cloud Run |
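When thirty log lines scroll past, counting beats eyeballing. A rough sketch that tallies the three text signatures from the table; `tally_patterns` and its category labels are hypothetical names, and the 503 and deploy-correlation rows still need a human look:

```shell
# Hypothetical helper: count log lines on stdin per text signature from the table.
# The labels (oom, db, opa) are illustrative; none of our tools emit them.
tally_patterns() {
  awk '
    /Memory limit of/              { n["oom"]++ }
    /OperationalError/             { n["db"]++ }
    /opa_authz_unavailable_denied/ { n["opa"]++ }
    END { for (k in n) printf "%d %s\n", n[k], k }
  ' | sort -rn
}
```

Pipe the output of the `gcloud logging read` command from step 2 into `tally_patterns`; the highest count comes first, and no output means none of the three signatures matched.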
3. Stop the bleeding before rooting around
If the last deploy is the likely culprit, roll back first, then diagnose. A 30-second traffic flip beats 30 minutes of log-reading under pressure.
    git log --oneline -5    # what shipped recently
    gcloud run revisions list --service=tappass \
      --project=tappass-prod --region=europe-west1 --limit=5
Pick the prior healthy revision → Roll back Cloud Run.
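The authoritative steps live in Roll back Cloud Run. As a hedged sketch, the flip usually amounts to pinning 100% of traffic to the chosen revision with `gcloud run services update-traffic`. The wrapper `rollback_cmd` below is hypothetical, and it only prints the command so you can read it before running it:

```shell
# Hypothetical helper: print (do not run) the rollback command for a revision.
# Service, project and region are copied from the listing command above.
rollback_cmd() {
  rev="${1:-}"
  if [ -z "$rev" ]; then
    echo "usage: rollback_cmd REVISION_NAME" >&2
    return 2
  fi
  printf 'gcloud run services update-traffic tappass \\\n'
  printf '  --to-revisions=%s=100 \\\n' "$rev"
  printf '  --project=tappass-prod --region=europe-west1\n'
}
```

Call it with a revision name taken from the listing (`rollback_cmd REVISION_NAME`), check the printed command against the runbook, then paste it.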
4. Say something
Post in #incidents (or wherever the team currently coordinates) with:
- What broke (one sentence)
- What you're trying right now
- What other people should not do (e.g. "don't deploy")
Update every 15 min or when state changes. Silence is worse than "still digging."
5. Keep monitoring after you think it's fixed
Latency and error rate should stay flat for ≥ 30 min before you close it. Cloud Run scales instances lazily, so a "fixed" state can revert when the scheduler evicts your warm replacement.
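One way to make "flat for ≥ 30 min" concrete is to probe once a minute and only close once the last 30 probes in a row were healthy. A minimal sketch, assuming probe lines in the `<http_code> <seconds>s` format printed by the curl command in step 1; `trailing_ok_streak` is a hypothetical helper, and it checks status codes only, not latency:

```shell
# Hypothetical helper: read probe lines (oldest first) on stdin and print how
# many of the most recent ones in a row were HTTP 200. Any non-200 resets the count.
trailing_ok_streak() {
  awk 'NF { if ($1 == "200") streak++; else streak = 0 } END { print streak + 0 }'
}
```

Append one probe line per minute to a scratch file (say `probes.log`, a made-up name) and run `trailing_ok_streak < probes.log`; the number is how many minutes you have been clean, and a single bad probe restarts the clock.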
Customer-reported incidents
If a customer reports an issue before our monitoring catches it:
- Treat as real until proven otherwise. Don't dismiss on no-repro.
- Run the same triage above.
- Reply to the customer within 1 hour with: "we see the report, we're investigating, will follow up by HH:MM."
- If they're on an Enterprise plan with a contractual SLA, check Support SLAs for the actual commitment.
After it's over
Write it down. Not a formal postmortem template with root-cause-five-whys, just a Slack thread (or a short note in #incidents) covering:
- What broke — one paragraph.
- What fixed it — the actual commit or config change.
- What we'd do differently — one bullet, honest.
- Is there a new runbook worth writing? — if yes, draft it.
Every real incident this week made a runbook better. Keep that flywheel turning.
What we're deliberately not doing
- No PagerDuty / Opsgenie. Cloud Monitoring emails + the #incidents Slack channel are the whole alerting stack. If we grow past three on-call-able people, revisit.
- No SEV1/2/3 tiering. Either it's customer-visible or it isn't.
- No incident commander role. Whoever's unblocked runs it.
- No postmortem template. A short thread with honest lessons beats a filled-in form nobody reads.
Re-evaluate these when team size or contract obligations force the issue.
Also see
- Roll back Cloud Run — usually step 1.
- OOM / crashloop — the most common real-outage cause this month.
- Cold-start / uptime alert — when the pager lies.
- Monitoring & alerts — the actual tools we have (no Grafana / Alertmanager fiction).