Cold-start / uptime alert
The Cloud Monitoring uptime check probes /api/health/live every
300s with a 30s timeout from three regions. When it fails, the
tappass downtime alert policy fires.
Most alerts resolve themselves within 90s. This runbook tells you which ones don't.
Pager payload
    Condition type  Uptime Monitoring fired
    Policy          tappass downtime alert (app.tappass.ai)
    Project         tappass-prod
    Condition       Health check failing
    metric          monitoring.googleapis.com/uptime_check/check_passed

First 60 seconds
1. Is the user-facing service actually down?
    for i in 1 2 3 4 5; do
      curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
        https://eu.tappass.ai/api/health/live
    done

| Result | Meaning |
|---|---|
| All 200 under 500ms | False positive: service recovered. Go to Cold-start false positive. |
| Mix of 200 and slow (>10s) or 5xx | Real outage or degradation. Go to Real outage. |
| All 5xx or timeouts | Full outage. Declare SEV1; see Incident response. |
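The table above can also be expressed as a small classifier, which may help if you script this first check. This is a minimal sketch, not part of the runbook's tooling: the name `classify_probes`, the `(status, seconds)` sample format, and the convention that status `0` means a timeout are all assumptions. Cases the table does not name (for example, all 200s that take between 500ms and 10s) are treated here as degradation, which is also an assumption.

```python
from typing import List, Tuple

# (HTTP status, total seconds); status 0 stands for a timeout (assumed convention)
Sample = Tuple[int, float]


def classify_probes(samples: List[Sample]) -> str:
    """Map curl samples onto the three rows of the result table."""
    if not samples:
        raise ValueError("need at least one sample")
    failed = [s for s in samples if s[0] == 0 or s[0] >= 500]
    fast_ok = [s for s in samples if s[0] == 200 and s[1] < 0.5]
    if len(fast_ok) == len(samples):
        return "false-positive"
    if len(failed) == len(samples):
        return "full-outage"
    # Mixed results, or anything the table does not name: assume degradation.
    return "degraded"
```

Erring toward "degraded" for unlisted cases keeps the sketch conservative: it only reports a false positive when every sample is a fast 200.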
2. Check the alert history
Recent alerts often cluster around a specific revision rollover:

    gcloud logging read 'resource.type="uptime_url"' \
      --project=tappass-prod --limit=10 --freshness=30m --format=json | \
      python3 -c "import sys,json;[print(e.get('timestamp','')[:19], e.get('labels',{}).get('activity_type_name','')) for e in json.load(sys.stdin)]"

Open + AutoResolve pairs close together = flapping. Back-to-back opens without resolves = sustained outage.
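The flapping-versus-sustained rule can be sketched in Python for anyone who wants to automate the read. This is illustrative only: the function name `classify_history`, the 120-second window for "close together", and the requirement of at least two quick pairs before calling it flapping are assumptions, since the runbook does not give exact thresholds. The input is assumed to be the `(timestamp, activity type)` pairs printed by the one-liner, with the activity types spelled `Open` and `AutoResolve`.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (ISO timestamp truncated to seconds, activity type), as printed by the one-liner
Event = Tuple[str, str]


def classify_history(
    events: List[Event],
    window: timedelta = timedelta(seconds=120),  # assumed meaning of "close together"
    min_pairs: int = 2,                          # assumed: one pair alone is not flapping
) -> str:
    ordered = sorted((datetime.fromisoformat(ts), kind) for ts, kind in events)
    quick_pairs = 0
    for (t1, k1), (t2, k2) in zip(ordered, ordered[1:]):
        if k1 == "Open" and k2 == "Open":
            # Back-to-back opens with no resolve in between.
            return "sustained"
        if k1 == "Open" and k2 == "AutoResolve" and t2 - t1 <= window:
            quick_pairs += 1
    return "flapping" if quick_pairs >= min_pairs else "unclear"
```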
Cold-start false positive
Signature
- /api/health/live is healthy right now.
- The alert fired once and auto-resolved within 90s.
- A Cloud Run revision rollover happened in the last few minutes.
Diagnose
    # Did a revision roll in the last 30 min?
    gcloud run revisions list --service=tappass \
      --project=tappass-prod --region=europe-west1 --limit=5 \
      --format='table(name,active,status.conditions[0].lastTransitionTime.date("%Y-%m-%d %H:%M"))'

    # Find the slow /health/live probe that tripped the alert
    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.requestUrl=~"health/live" AND httpRequest.userAgent=~"UptimeChecks"' \
      --project=tappass-prod --limit=20 --freshness=30m \
      --format='value(timestamp,httpRequest.latency)' | \
      awk '$2 ~ /[1-9][0-9]*\.|^[1-9]s/'

A latency > 10s from the UptimeChecks user-agent, correlated with a
revision lastTransitionTime a few minutes earlier, means the probe
hit a cold instance during startup (FastAPI + OPA sidecar + VPC
connector come up together → 8–12s on a warm build, longer on first
import of a large image).
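If you prefer to apply the 10s threshold explicitly rather than eyeballing the output, the filtering step could look like the sketch below. It assumes each row is a timestamp and a latency such as `12.345s` separated by whitespace, which is how `--format='value(...)'` is expected to print them; the function name `slow_probes` is hypothetical.

```python
from typing import List, Tuple


def slow_probes(lines: List[str], threshold_s: float = 10.0) -> List[Tuple[str, float]]:
    """Return (timestamp, latency in seconds) for rows slower than the threshold."""
    slow = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].endswith("s"):
            continue  # skip blank or unexpected rows
        try:
            latency = float(parts[1][:-1])  # strip the trailing "s"
        except ValueError:
            continue
        if latency > threshold_s:
            slow.append((parts[0], latency))
    return slow
```

Any timestamps it returns can then be compared by hand against the revision `lastTransitionTime` values from the first command.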
Action
Nothing. The 30s timeout was tuned specifically to absorb this; if it trips anyway, it is a one-off on an exceptionally slow cold start. Acknowledge the alert in PagerDuty and move on.
If the same cold-start pattern fires the alert three times in a
week, bump min_instances from 1 → 2 in
deploy/terraform/environments/gcp-prod/main.tf — trades ~$170/month
for no cold-start exposure. See the comment block in that file for
the cost trade-off already documented.
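The "three times in a week" rule is easy to misjudge from memory, so here is a minimal sketch of the check over a list of alert timestamps. It interprets "a week" as any rolling 7-day window, which is an assumption, and the function name `should_raise_min_instances` is hypothetical. Only alerts already confirmed as cold-start false positives should be passed in.

```python
from datetime import datetime, timedelta
from typing import List


def should_raise_min_instances(
    alert_times: List[datetime],
    limit: int = 3,
    window: timedelta = timedelta(days=7),  # assumed: rolling window, not calendar week
) -> bool:
    times = sorted(alert_times)
    # Check every run of `limit` consecutive alerts: if the first and last
    # are within the window, the threshold has been reached.
    for i in range(len(times) - limit + 1):
        if times[i + limit - 1] - times[i] <= window:
            return True
    return False
```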
Real outage
Signature
- /api/health/live from your terminal is slow or failing.
- 5xx rate in prod logs is elevated.
- Alert has not auto-resolved.
Triage
    # What's the current failure mode?
    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (httpRequest.status>=500 OR severity=ERROR OR textPayload=~"Memory limit|OOM")' \
      --project=tappass-prod --limit=30 --freshness=10m \
      --format='value(timestamp,severity,httpRequest.status,textPayload,jsonPayload.event,jsonPayload.error)'

Common patterns and where to go:
| Pattern | Likely cause | Runbook |
|---|---|---|
| Memory limit of X MiB exceeded | Container OOM | OOM / crashloop |
| httpRequest.status=503 with no app log + very long latency | Cloud Run queue timeout (instance saturation) | Deploy core server → bump max_instances or roll back the bad deploy |
| psycopg2.OperationalError or connection refused | Cloud SQL unreachable | Restore from backup + check Cloud SQL instance state |
| opa_authz_unavailable_denied spiking | OPA sidecar crashlooping | Check OPA container logs; OOM / crashloop for OPA memory |
| Recent revision + elevated 5xx | Bad deploy | Roll back Cloud Run |
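The text-based rows of the table can be matched mechanically against log lines. The sketch below is illustrative and covers only the three patterns that appear as literal strings; the 503 and bad-deploy rows depend on structured fields and revision timing, so they are left out. The regular expressions and the name `likely_cause` are assumptions derived from the table wording, not from the service's actual log format.

```python
import re
from typing import Optional

# (pattern, likely cause) pairs taken from the table; first match wins.
PATTERNS = [
    (re.compile(r"Memory limit of \S+ MiB exceeded"), "Container OOM"),
    (re.compile(r"psycopg2\.OperationalError|connection refused", re.IGNORECASE),
     "Cloud SQL unreachable"),
    (re.compile(r"opa_authz_unavailable_denied"), "OPA sidecar crashlooping"),
]


def likely_cause(log_line: str) -> Optional[str]:
    """Return the table's likely cause for a log line, or None if nothing matches."""
    for pattern, cause in PATTERNS:
        if pattern.search(log_line):
            return cause
    return None
```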
Declare an incident
If the outage lasts more than 5 minutes or affects /api broadly,
treat it as SEV1/SEV2 per Incident response.
Rolling back is usually step 1; look for the root cause once users
are no longer affected.
After action
If this was a real outage (not a false positive):
- Post a timeline to #incidents.
- Open a postmortem doc within 48 hours; see the template in Incident response.
- If it was cold-start-adjacent, evaluate whether the uptime-check timeout needs another bump (it's already at 30s as of 2026-04-22).
Tuning history
- 2026-04-22: timeout bumped 10s → 30s after perf-tuning rollovers kept cold-starting new instances past the 10s probe budget. See commit 02ccb7c and monitoring.tf.
- 2026-04-23: min_instances dropped 2 → 1 for cost. A single warm instance absorbs most cold-start risk; raise back to 2 if the false-positive rate climbs.