Incident response

We're a small team. There's no on-call rotation, no incident commander, no 5-minute ack SLA. When something breaks, whoever notices first runs the triage. This page tells you what to do.

The 10-minute triage

When you see the 5xx rate climbing, get a customer report, or the uptime pager fires:

1. Is the app actually down?

curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
  https://eu.tappass.ai/api/health/live

Run it five times. If it returns 200 quickly every time, you're likely chasing a false positive → Cold-start / uptime alert.
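The five-run check can be scripted. This is a minimal sketch, not an official tool: the helper names `classify` and `probe_five` are hypothetical, and the 1-second "slow" threshold is an assumed value rather than a documented SLO.

```shell
# Probe the liveness endpoint five times and label each result.
URL="https://eu.tappass.ai/api/health/live"
SLOW_THRESHOLD="${SLOW_THRESHOLD:-1.0}"   # seconds; assumed, tune to taste

# classify <http_code> <seconds> -> prints ok | slow | fail
classify() {
  if [ "$1" != "200" ]; then
    echo fail
  elif awk -v t="$2" -v max="$SLOW_THRESHOLD" 'BEGIN { exit !(t > max) }'; then
    echo slow
  else
    echo ok
  fi
}

# Runs the same curl as above five times, one second apart.
probe_five() {
  for i in 1 2 3 4 5; do
    out=$(curl -s -o /dev/null -m 10 -w "%{http_code} %{time_total}" "$URL" || true)
    code=${out%% *}
    secs=${out##* }
    echo "$i: $code ${secs}s $(classify "$code" "$secs")"
    sleep 1
  done
}
```

Paste the block into your shell and run `probe_five`. Five `ok` lines point to a false positive; any `fail` or `slow` means carry on with the triage.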

2. What does prod's log look like right now?

# Errors, 5xx responses, and OOM messages from the last 10 minutes
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (severity>=ERROR OR httpRequest.status>=500 OR textPayload=~"Memory limit|OOM")' \
  --project=tappass-prod --limit=30 --freshness=10m \
  --format='value(timestamp,severity,httpRequest.status,httpRequest.requestUrl,textPayload,jsonPayload.event,jsonPayload.error)'

Look for the dominant pattern:

What you see	Go to
`Memory limit of X MiB exceeded`	OOM / crashloop
503s with empty body + long latency	Instance saturation — scale bump, then Deploy core server if the latest deploy introduced it
DB connection errors / `OperationalError`	Check Cloud SQL instance state in console
OPA timeouts (`opa_authz_unavailable_denied`)	OPA sidecar health; see also OOM / crashloop — OPA too
Spike correlates with a recent deploy	Roll back Cloud Run
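To spot the dominant pattern without eyeballing 30 lines, the log output can be piped through a small tally. This is a sketch under assumptions: `tally_patterns` is a hypothetical helper, the category labels are illustrative, and it only covers the three table rows that have a literal string to match on.

```shell
# tally_patterns: reads log lines on stdin, prints "<count> <category>",
# most frequent first. Match strings are taken from the table above.
tally_patterns() {
  awk '
    /Memory limit of .* exceeded/  { n["oom"]++;   next }
    /OperationalError/             { n["db"]++;    next }
    /opa_authz_unavailable_denied/ { n["opa"]++;   next }
                                   { n["other"]++ }
    END { for (k in n) print n[k], k }
  ' | sort -rn
}
```

Append `| tally_patterns` to the `gcloud logging read` command from this step. A large `other` count means the pattern is not one of the three matched here, so read the raw lines.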

3. Stop the bleeding before rooting around

If the last deploy is the likely culprit, roll back first, then diagnose. A 30-second traffic flip beats 30 minutes of log-reading under pressure.

git log --oneline -5    # what shipped recently
gcloud run revisions list --service=tappass \
  --project=tappass-prod --region=europe-west1 --limit=5

Pick the prior healthy revision → Roll back Cloud Run.
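For orientation, the traffic flip usually comes down to one `gcloud run services update-traffic` call. The sketch below only prints the command so you can read it before running it. `rollback_cmd` is a hypothetical helper, and the Roll back Cloud Run runbook remains the authoritative procedure.

```shell
# rollback_cmd <revision-name>
# Prints (does not run) the command that sends 100% of traffic to a revision.
rollback_cmd() {
  if [ -z "$1" ]; then
    echo "usage: rollback_cmd <revision-name>" >&2
    return 1
  fi
  echo gcloud run services update-traffic tappass \
    --to-revisions="$1=100" \
    --project=tappass-prod --region=europe-west1
}
```

Call it with a revision name from the `revisions list` output above, check the printed command, then paste it to execute.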

4. Say something

Post in #incidents (or wherever the team currently coordinates) with:

What broke (one sentence)
What you're trying right now
What other people should not do (e.g. "don't deploy")

Update every 15 min or when state changes. Silence is worse than "still digging."

5. Keep monitoring after you think it's fixed

Latency + error rate should stay flat for ≥ 30 min before you close it. Cloud Run scales instances lazily — a "fixed" state can revert when the scheduler evicts your warm replacement.
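A simple way to keep watching is a loop that re-runs a check for the full window and stops at the first failure. This is a sketch with hypothetical helpers (`watch_flat`, `health_ok`). It only probes the liveness endpoint, so it complements the latency and error-rate graphs rather than replacing them.

```shell
# watch_flat <checks> <interval_seconds> <command...>
# Runs the command <checks> times, sleeping between runs.
# Returns 1 at the first failing run, 0 if every run succeeds.
watch_flat() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i/$checks failed at $(date -u +%H:%M:%SZ)" >&2
      return 1
    fi
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $checks checks passed"
}

# Succeeds only when the liveness endpoint answers 200 within 10 seconds.
health_ok() {
  code=$(curl -s -o /dev/null -m 10 -w "%{http_code}" \
    https://eu.tappass.ai/api/health/live || true)
  [ "$code" = "200" ]
}
```

`watch_flat 30 60 health_ok` checks once a minute for roughly half an hour. If it exits early, the incident is not over.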

Customer-reported incidents

If a customer reports an issue before our monitoring catches it:

Treat it as real until proven otherwise. Don't dismiss it just because you can't reproduce it.
Run the same triage above.
Reply to the customer within 1 hour with: "we see the report, we're investigating, will follow up by HH:MM."
If they're on an Enterprise plan with a contractual SLA, check Support SLAs for the actual commitment.

After it's over

Write it down. Not a formal postmortem template with root-cause-five-whys — just a Slack thread (or a short note in #incidents) covering:

What broke — one paragraph.
What fixed it — the actual commit or config change.
What we'd do differently — one bullet, honest.
Is there a new runbook worth writing? — if yes, draft it.

Every real incident this week made a runbook better. Keep that flywheel turning.

What we're deliberately not doing

No PagerDuty / Opsgenie. Cloud Monitoring emails + the #incidents Slack channel are the whole alerting stack. If we grow past three on-call-able people, revisit.
No SEV1/2/3 tiering. Either it's customer-visible or it isn't.
No incident commander role. Whoever's unblocked runs it.
No postmortem template. A short thread with honest lessons beats a filled-in form nobody reads.

Re-evaluate these when team size or contract obligations force the issue.

Also see

Roll back Cloud Run — usually step 1.
OOM / crashloop — the most common real-outage cause this month.
Cold-start / uptime alert — when the pager lies.
Monitoring & alerts — the actual tools we have (no Grafana / Alertmanager fiction).