
Incident response

We're a small team. There's no on-call rotation, no incident commander, no 5-minute ack SLA. When something breaks, whoever notices first runs the triage. This page tells you what to do.

When you see 5xx rate climbing, a customer report, or the uptime pager firing:

curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
https://eu.tappass.ai/api/health/live

Run it five times. If it's green and fast every time, you're probably chasing a false positive: go to Cold-start / uptime alert.
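The "run it five times" check can be wrapped in a small loop. This is a minimal sketch, not part of the existing tooling: the function name `probe_health`, the 10-second timeout, and the assumption that "green" means HTTP 200 are all illustrative.

```shell
# Hypothetical helper: probe the liveness endpoint N times (default 5).
# Prints "<http_code> <time_total>s" per attempt and returns non-zero
# if any attempt is not a 200 (assumed here to be what "green" means).
probe_health() {
  ph_url="${1:-https://eu.tappass.ai/api/health/live}"
  ph_attempts="${2:-5}"
  ph_failures=0
  ph_i=1
  while [ "$ph_i" -le "$ph_attempts" ]; do
    # --max-time stops a hung instance from stalling the loop (10s is an assumed cutoff).
    ph_result=$(curl -s -o /dev/null --max-time 10 \
      -w "%{http_code} %{time_total}s" "$ph_url") || true
    echo "$ph_result"
    case "$ph_result" in
      "200 "*) ;;                                  # healthy response
      *) ph_failures=$((ph_failures + 1)) ;;       # anything else counts as a failure
    esac
    ph_i=$((ph_i + 1))
  done
  [ "$ph_failures" -eq 0 ]
}
```

Calling `probe_health` with no arguments hits the production endpoint five times. The exit status only tells you whether every attempt returned 200; you still need to read the timings yourself to judge "fast".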

2. What does prod's log look like right now?

gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (severity>=ERROR OR httpRequest.status>=500 OR textPayload=~"Memory limit|OOM")' \
--project=tappass-prod --limit=30 --freshness=10m \
--format='value(timestamp,severity,httpRequest.status,httpRequest.requestUrl,textPayload,jsonPayload.event,jsonPayload.error)'

Look for the dominant pattern:

| What you see | Go to |
| --- | --- |
| Memory limit of X MiB exceeded | OOM / crashloop |
| 503s with empty body + long latency | Instance saturation — scale bump, then Deploy core server if the latest deploy introduced it |
| DB connection errors / OperationalError | Check Cloud SQL instance state in console |
| OPA timeouts (opa_authz_unavailable_denied) | OPA sidecar health; see also OOM / crashloop — OPA too |
| Spike correlates with a recent deploy | Roll back Cloud Run |
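When 30 log lines scroll past, a quick tally can make the dominant pattern easier to spot. The sketch below counts lines on stdin that match three of the signatures from the table. The function name `tally_patterns` and the short labels (`oom`, `db`, `opa`) are hypothetical, and the regexes are only as good as the strings quoted above.

```shell
# Hypothetical helper: count log lines on stdin per known failure signature.
# Usage (assumes the gcloud logging read command from this step):
#   gcloud logging read '...' --project=tappass-prod --limit=30 --freshness=10m \
#     --format='value(textPayload,jsonPayload.event,jsonPayload.error)' | tally_patterns
tally_patterns() {
  awk '
    /Memory limit of .* exceeded/   { n["oom"]++ }   # OOM / crashloop
    /OperationalError/              { n["db"]++ }    # DB connection errors
    /opa_authz_unavailable_denied/  { n["opa"]++ }   # OPA timeouts
    END {
      # Unset counters print as 0 under %d.
      printf "oom=%d db=%d opa=%d\n", n["oom"], n["db"], n["opa"]
    }
  '
}
```

The 503-with-empty-body and deploy-correlation rows are left out on purpose: they depend on status codes and timing rather than a text signature, so they still need a human look at the raw output.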

3. Stop the bleeding before rooting around


If the last deploy is the likely culprit, roll back first, then diagnose. A 30-second traffic flip beats 30 minutes of log-reading under pressure.

git log --oneline -5 # what shipped recently
gcloud run revisions list --service=tappass \
--project=tappass-prod --region=europe-west1 --limit=5

Pick the prior healthy revision → Roll back Cloud Run.
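For orientation, the traffic flip itself might look roughly like this. It is a sketch only and the Roll back Cloud Run runbook remains the authoritative procedure: the function name `rollback_to` is hypothetical, and the service, project, and region are copied from the listing command above.

```shell
# Hypothetical helper: route 100% of traffic to a named Cloud Run revision.
# Service, project, and region are assumed to match the listing command above.
rollback_to() {
  rb_revision="${1:-}"
  if [ -z "$rb_revision" ]; then
    echo "usage: rollback_to <revision-name>" >&2
    return 2
  fi
  # update-traffic only moves traffic; it does not delete the bad revision.
  gcloud run services update-traffic tappass \
    --project=tappass-prod --region=europe-west1 \
    --to-revisions="${rb_revision}=100"
}
```

Pass the revision name exactly as it appears in the `gcloud run revisions list` output. Because the newer revision stays in place, the same command with that revision's name flips traffic back.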

Post in #incidents (or wherever the team currently coordinates) with:

  • What broke (one sentence)
  • What you're trying right now
  • What other people should not do (e.g. "don't deploy")

Update every 15 min or when state changes. Silence is worse than "still digging."

5. Keep monitoring after you think it's fixed


Latency + error rate should stay flat for ≥ 30 min before you close it. Cloud Run scales instances lazily — a "fixed" state can revert when the scheduler evicts your warm replacement.
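One way to avoid forgetting the watch window is to leave a probe loop running in a spare terminal. The sketch below is illustrative: `watch_health` is a hypothetical name, the defaults (30 checks, 60 seconds apart) only approximate the 30-minute window, and it watches the liveness endpoint alone, so it complements the latency and error-rate dashboards rather than replacing them.

```shell
# Hypothetical helper: probe the liveness endpoint CHECKS times, INTERVAL seconds
# apart, and stop at the first response that is not HTTP 200.
watch_health() {
  wh_url="${1:-https://eu.tappass.ai/api/health/live}"
  wh_checks="${2:-30}"
  wh_interval="${3:-60}"
  wh_i=1
  while [ "$wh_i" -le "$wh_checks" ]; do
    wh_code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$wh_url") || true
    echo "$(date -u +%H:%M:%S) check ${wh_i}/${wh_checks}: ${wh_code}"
    if [ "$wh_code" != "200" ]; then
      echo "regression detected, reopen the incident" >&2
      return 1
    fi
    # No need to wait after the final check.
    if [ "$wh_i" -lt "$wh_checks" ]; then
      sleep "$wh_interval"
    fi
    wh_i=$((wh_i + 1))
  done
  return 0
}
```

Run `watch_health` with no arguments for the default window. A non-zero exit means the "fixed" state reverted at some point during the watch.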

If a customer reports an issue before our monitoring catches it:

  1. Treat as real until proven otherwise. Don't dismiss on no-repro.
  2. Run the same triage above.
  3. Reply to the customer within 1 hour with: "we see the report, we're investigating, will follow up by HH:MM."
  4. If they're on an Enterprise plan with a contractual SLA, check Support SLAs for the actual commitment.

Write it down. Not a formal postmortem template with root-cause-five-whys — just a Slack thread (or a short note in #incidents) covering:

  • What broke — one paragraph.
  • What fixed it — the actual commit or config change.
  • What we'd do differently — one bullet, honest.
  • Is there a new runbook worth writing? — if yes, draft it.

Every real incident this week made a runbook better. Keep that flywheel turning.

  • No PagerDuty / Opsgenie. Cloud Monitoring emails + the #incidents Slack channel are the whole alerting stack. If we grow past three on-call-able people, revisit.
  • No SEV1/2/3 tiering. Either it's customer-visible or it isn't.
  • No incident commander role. Whoever's unblocked runs it.
  • No postmortem template. A short thread with honest lessons beats a filled-in form nobody reads.

Re-evaluate these when team size or contract obligations force the issue.