Cold-start / uptime alert

The Cloud Monitoring uptime check probes /api/health/live every 300s with a 30s timeout from three regions. When it fails, the tappass downtime alert policy fires.

Most alerts resolve themselves within 90s. This runbook tells you which ones don't.

| Field | Value |
| --- | --- |
| Condition type | Uptime Monitoring fired |
| Policy | tappass downtime alert (app.tappass.ai) |
| Project | tappass-prod |
| Condition | Health check failing (metric monitoring.googleapis.com/uptime_check/check_passed) |

1. Is the user-facing service actually down?

```sh
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    https://eu.tappass.ai/api/health/live
done
```
| Result | Meaning |
| --- | --- |
| All 200 under 500ms | False positive: service recovered. Go to Cold-start false positive. |
| Mix of 200 and slow (>10s) or 5xx | Real outage or degradation. Go to Real outage. |
| All 5xx or timeouts | Real outage, full. Declare SEV1, see Incident response. |
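
The result table above can be read as a small decision function. The following is a minimal sketch, not part of the runbook tooling: `classify_probes` and its return labels are hypothetical names, each sample is assumed to be the `(http_code, time_total)` pair printed by the curl loop, and a timeout is represented as `None` or as curl's `000` status. The table does not say what to do with 200s slower than 500ms, so the sketch treats them as degradation; that is an assumption.

```python
from typing import List, Optional, Tuple

# One probe sample: (HTTP status, or None on timeout; total time in seconds).
Sample = Tuple[Optional[int], float]

FAST_S = 0.5  # "under 500ms" from the table


def classify_probes(samples: List[Sample]) -> str:
    """Map a batch of probe samples to one of the table's three outcomes."""
    if not samples:
        raise ValueError("need at least one sample")

    def failed(status: Optional[int]) -> bool:
        # A timeout (None) or curl's 000 status counts as a failure, like a 5xx.
        return status is None or status == 0 or status >= 500

    if all(failed(status) for status, _ in samples):
        return "full_outage"
    if all(status == 200 and elapsed < FAST_S for status, elapsed in samples):
        return "false_positive"
    # Assumption: everything in between (mixed results, or 200s slower than
    # 500ms) is treated as a real outage or degradation.
    return "real_outage"
```

For example, five samples of `(200, 0.12)` classify as `false_positive`, while a mix of `(200, 0.1)` and `(503, 12.0)` classifies as `real_outage`.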

Recent alerts often cluster around a specific revision rollover:

```sh
gcloud logging read 'resource.type="uptime_url"' \
  --project=tappass-prod --limit=10 --freshness=30m --format=json | \
  python3 -c "import sys,json;[print(e.get('timestamp','')[:19], e.get('labels',{}).get('activity_type_name','')) for e in json.load(sys.stdin)]"
```

Open + AutoResolve pairs close together = flapping. Back-to-back opens without resolves = sustained outage.
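
The flapping-versus-sustained rule above can be sketched as a function over the `(timestamp, activity type)` pairs that the log query prints. This is illustrative only: `classify_alert_activity` is a hypothetical helper, the activity names are assumed to be exactly `Open` and `AutoResolve`, and the two-minute threshold for "close together" is an assumed value, not one taken from the alert policy.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[str, str]  # (ISO-8601 timestamp, activity type)


def classify_alert_activity(
    events: List[Event],
    close_together: timedelta = timedelta(minutes=2),  # assumed threshold
) -> str:
    """Return "sustained", "flapping", or "inconclusive" for alert events."""
    ordered = sorted((datetime.fromisoformat(ts), kind) for ts, kind in events)
    open_since = None
    quick_pairs = 0
    for when, kind in ordered:
        if kind == "Open":
            if open_since is not None:
                # A second open arrived before the first one resolved.
                return "sustained"
            open_since = when
        elif kind == "AutoResolve" and open_since is not None:
            if when - open_since <= close_together:
                quick_pairs += 1
            open_since = None
    # Interpretation (assumption): any quickly resolved open counts as flapping.
    return "flapping" if quick_pairs else "inconclusive"
```

A single unresolved open, or an open that took longer than the threshold to resolve, falls through to `inconclusive` and still needs a human look.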

Signs of a cold-start false positive:

  • /api/health/live is healthy right now.
  • The alert fired once and auto-resolved within 90s.
  • A Cloud Run revision rollover happened in the last few minutes.
```sh
# Did a revision roll in the last 30 min?
gcloud run revisions list --service=tappass \
  --project=tappass-prod --region=europe-west1 --limit=5 \
  --format='table(name,active,status.conditions[0].lastTransitionTime.date("%Y-%m-%d %H:%M"))'

# Find the slow /health/live probe that tripped the alert.
# The awk filter keeps only probes whose latency is 1s or longer
# (it drops values such as 0.042s and keeps 1.5s, 10s, 12.3s).
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.requestUrl=~"health/live" AND httpRequest.userAgent=~"UptimeChecks"' \
  --project=tappass-prod --limit=20 --freshness=30m \
  --format='value(timestamp,httpRequest.latency)' | \
  awk '$2 ~ /^[1-9][0-9]*[.s]/'
```

A latency > 10s from the UptimeChecks user-agent, correlated with a revision lastTransitionTime a few minutes earlier, means the probe hit a cold instance during startup (FastAPI + OPA sidecar + VPC connector come up together → 8–12s on a warm build, longer on first import of a large image).
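
The correlation described above (a slow probe shortly after a revision transition) can be expressed as a simple check. This is a sketch under stated assumptions: `parse_latency` and `looks_like_cold_start` are hypothetical helpers, the latency is assumed to arrive as a duration string such as `12.345s`, and the ten-minute window standing in for "a few minutes earlier" is an assumed value.

```python
from datetime import datetime, timedelta


def parse_latency(value: str) -> float:
    """Convert a duration string such as "12.345s" to seconds."""
    return float(value.strip().rstrip("s"))


def looks_like_cold_start(
    latency: str,
    probe_time: datetime,
    transition_time: datetime,
    slow_s: float = 10.0,                      # "> 10s" from the runbook
    window: timedelta = timedelta(minutes=10),  # assumed "a few minutes"
) -> bool:
    """True if a slow probe landed shortly after a revision transition."""
    slow = parse_latency(latency) > slow_s
    delay = probe_time - transition_time
    # The probe must come after the transition and within the window.
    return slow and timedelta(0) <= delay <= window
```

A `False` result does not rule out a cold start; it only means this particular probe and revision pair does not match the pattern.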

No action is needed. The 30s timeout was tuned specifically to absorb this, so if it trips anyway it is a one-off on an exceptionally slow cold start. Acknowledge the alert in PagerDuty and move on.

If the same cold-start pattern fires the alert three times in a week, bump min_instances from 1 → 2 in deploy/terraform/environments/gcp-prod/main.tf — trades ~$170/month for no cold-start exposure. See the comment block in that file for the cost trade-off already documented.
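
The "three times in a week" rule above can be checked mechanically if you keep a list of the dates on which the cold-start pattern fired. A minimal sketch, assuming a rolling 7-day window (the runbook does not say whether "a week" means rolling or calendar); `should_raise_min_instances` is a hypothetical name.

```python
from datetime import date, timedelta
from typing import Iterable


def should_raise_min_instances(
    alert_days: Iterable[date],
    today: date,
    threshold: int = 3,    # "three times" from the runbook
    window_days: int = 7,  # assumption: rolling 7-day window
) -> bool:
    """True if cold-start alerts in the window reach the threshold."""
    cutoff = today - timedelta(days=window_days)
    # Count alerts strictly after the cutoff, up to and including today.
    recent = [d for d in alert_days if cutoff < d <= today]
    return len(recent) >= threshold
```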

Signs of a real outage:

  • /api/health/live from your terminal is slow or failing.
  • 5xx rate in prod logs is elevated.
  • Alert has not auto-resolved.
```sh
# What's the current failure mode?
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (httpRequest.status>=500 OR severity=ERROR OR textPayload=~"Memory limit|OOM")' \
  --project=tappass-prod --limit=30 --freshness=10m \
  --format='value(timestamp,severity,httpRequest.status,textPayload,jsonPayload.event,jsonPayload.error)'
```

Common patterns and where to go:

| Pattern | Likely cause | Runbook |
| --- | --- | --- |
| Memory limit of X MiB exceeded | Container OOM | OOM / crashloop |
| httpRequest.status=503 with no app log + very long latency | Cloud Run queue timeout (instance saturation) | Deploy core server → bump max_instances or roll back the bad deploy |
| psycopg2.OperationalError or connection refused | Cloud SQL unreachable | Restore from backup + check Cloud SQL instance state |
| opa_authz_unavailable_denied spiking | OPA sidecar crashlooping | Check OPA container logs; OOM / crashloop for OPA memory |
| Recent revision + elevated 5xx | Bad deploy | Roll back Cloud Run |
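
The text-matchable rows of the pattern table above can be turned into a first-pass triage helper. This is a sketch only: `ROUTES` and `route` are hypothetical names, the regexes are assumptions derived from the table's wording, and the two rows that depend on structured fields (503 with long latency, recent revision with elevated 5xx) are deliberately left out because a single log line cannot establish them.

```python
import re
from typing import List, Optional, Tuple

# (pattern, runbook) pairs taken from the text-matchable table rows.
ROUTES: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"Memory limit of .* MiB exceeded"), "OOM / crashloop"),
    (
        re.compile(r"psycopg2\.OperationalError|connection refused", re.IGNORECASE),
        "Restore from backup + check Cloud SQL instance state",
    ),
    (
        re.compile(r"opa_authz_unavailable_denied"),
        "Check OPA container logs; OOM / crashloop",
    ),
]


def route(log_line: str) -> Optional[str]:
    """Return the runbook for the first matching pattern, or None."""
    for pattern, runbook in ROUTES:
        if pattern.search(log_line):
            return runbook
    return None
```

A `None` result means no known text pattern matched, not that the service is healthy; fall back to the full table in that case.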

If the outage lasts more than 5 minutes or affects /api broadly, treat it as SEV1/SEV2 per Incident response. Rolling back is usually step 1; do the root-cause analysis once users are no longer affected.

If this was a real outage (not a false positive):

  1. Post a timeline to #incidents.
  2. Open a postmortem doc within 48 hours — see the template in Incident response.
  3. If it was cold-start-adjacent, evaluate whether the uptime-check timeout needs another bump (it's already at 30s as of 2026-04-22).

Change history

  • 2026-04-22: timeout bumped 10s → 30s after perf-tuning rollovers kept cold-starting new instances past the 10s probe budget. See commit 02ccb7c and monitoring.tf.
  • 2026-04-23: min_instances dropped 2 → 1 for cost. Single warm instance absorbs most cold-start risk; raise back to 2 if the false-positive rate climbs.