Cold-start / uptime alert

The Cloud Monitoring uptime check probes /api/health/live every 300s with a 30s timeout from three regions. When it fails, the tappass downtime alert policy fires.

Most alerts resolve themselves within 90s. This runbook tells you which ones don't.

Pager payload

Condition type       Uptime Monitoring fired
Policy               tappass downtime alert (app.tappass.ai)
Project              tappass-prod
Condition            Health check failing
Metric               monitoring.googleapis.com/uptime_check/check_passed

First 60 seconds

1. Is the user-facing service actually down?

for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    https://eu.tappass.ai/api/health/live
done

Result	Meaning
All `200` under 500ms	False positive — service recovered. Go to Cold-start false positive.
Mix of `200` and slow (>10s) or `5xx`	Real outage or degradation — go to Real outage.
All `5xx` or timeouts	Real outage, full — declare SEV1, see Incident response.
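
The table can be applied mechanically to the curl output. The following is a minimal sketch, not part of the official tooling: `classify_probes` is a hypothetical helper name, and it makes two assumptions of its own. It treats any non-200 response as a failure (curl reports `000` on a timeout or connection error), and it reports `200` responses between 500ms and 10s as inconclusive because the table does not cover them.

```shell
# classify_probes (hypothetical helper): reads "<http_code> <seconds>s" lines,
# i.e. the curl -w format from step 1, and prints one verdict per the table.
classify_probes() {
  awk '
    {
      code = $1
      t = $2; sub(/s$/, "", t); t += 0
      n++
      if (code != "200")   failed++    # 5xx, other non-200, or 000 from curl
      else if (t > 10)     slow++      # 200 but slower than 10s
      else if (t < 0.5)    fast_ok++   # 200 under 500ms
      # 200 between 0.5s and 10s is counted in n only (not covered by the table)
    }
    END {
      if (n == 0)                       print "no-data"
      else if (fast_ok == n)            print "false-positive"
      else if (failed == n)             print "full-outage"
      else if (failed > 0 || slow > 0)  print "degraded"
      else                              print "inconclusive"
    }'
}

# Usage (reuses the curl loop from step 1):
#   for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
#       https://eu.tappass.ai/api/health/live
#   done | classify_probes
```

A verdict of `inconclusive` means the probes succeeded but were slower than the false-positive row allows; rerun the loop before deciding.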

2. Check the alert history

Recent alerts often cluster around a specific revision rollover:

gcloud logging read 'resource.type="uptime_url"' \
  --project=tappass-prod --limit=10 --freshness=30m --format=json | \
  python3 -c "import sys,json;[print(e.get('timestamp','')[:19], e.get('labels',{}).get('activity_type_name','')) for e in json.load(sys.stdin)]"

Open + AutoResolve pairs close together = flapping. Back-to-back opens without resolves = sustained outage.
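
The flapping versus sustained rule can be sketched as a small filter over the two-column output of the command above. This is an illustration under assumptions: `classify_history` is a hypothetical name, and the exact label values `Open` and `AutoResolve` are taken from the wording of this runbook, so check them against the real `activity_type_name` values before relying on it. How close together the pairs are is still left to the reader; the sketch only looks at ordering.

```shell
# classify_history (hypothetical helper): reads "<timestamp> <activity>" lines
# and applies the rule of thumb from the text.
# Assumes activity values are exactly "Open" and "AutoResolve".
classify_history() {
  awk '
    $2 == "Open"        { opens++; if (prev == "Open") back_to_back = 1 }
    $2 == "AutoResolve" { resolves++ }
    { prev = $2 }
    END {
      if (back_to_back)                   print "sustained"
      else if (opens > 1 && resolves > 0) print "flapping"
      else                                print "unclear"
    }'
}
```

Piping the output of the `gcloud logging read ... | python3 -c ...` command into `classify_history` gives a one-word summary; `unclear` covers a single alert or an empty history.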

Cold-start false positive

Signature

/api/health/live is healthy right now.
The alert fired once, auto-resolved within 90s.
A Cloud Run revision rollover happened in the last few minutes.

Diagnose

# Did a revision roll in the last 30 min?
gcloud run revisions list --service=tappass \
  --project=tappass-prod --region=europe-west1 --limit=5 \
  --format='table(name,active,status.conditions[0].lastTransitionTime.date("%Y-%m-%d %H:%M"))'

# Find the slow /health/live probe that tripped the alert
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.requestUrl=~"health/live" AND httpRequest.userAgent=~"UptimeChecks"' \
  --project=tappass-prod --limit=20 --freshness=30m \
  --format='value(timestamp,httpRequest.latency)' | \
  awk '$2 ~ /^[1-9][0-9]*(\.|s)/'  # keep latencies of 1s or more

A latency > 10s from the UptimeChecks user-agent, correlated with a revision lastTransitionTime a few minutes earlier, means the probe hit a cold instance during startup. FastAPI, the OPA sidecar, and the VPC connector come up together, which takes 8–12s on a warm build and longer on first import of a large image.
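
The awk filter in the command above keeps any probe of 1s or more. To apply the 10s threshold from this paragraph directly, a numeric comparison is easier to read than a regex. A minimal sketch, assuming the `value(timestamp,httpRequest.latency)` output is one `<timestamp><TAB><seconds>s` pair per line; `slow_probes` is a hypothetical helper name.

```shell
# slow_probes (hypothetical helper): reads "<timestamp>\t<latency>s" lines and
# prints those whose latency exceeds a threshold in seconds (default 10).
slow_probes() {
  awk -v limit="${1:-10}" '
    {
      lat = $2; sub(/s$/, "", lat)      # strip the trailing "s" unit
      if (lat + 0 > limit + 0) print $1, lat "s"
    }'
}
```

Replacing the final `awk` stage of the probe query with `slow_probes` (or `slow_probes 5` for a lower threshold) lists only the probes worth comparing against the revision lastTransitionTime.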

Action

Nothing. The 30s timeout was tuned specifically to absorb this — if it's tripping anyway it's a one-off on an exceptionally slow cold start. Acknowledge the alert in PagerDuty and move on.

If the same cold-start pattern fires the alert three times in a week, bump min_instances from 1 → 2 in deploy/terraform/environments/gcp-prod/main.tf. This trades ~$170/month for no cold-start exposure; the comment block in that file documents the cost trade-off.

Real outage

Signature

/api/health/live from your terminal is slow or failing.
5xx rate in prod logs is elevated.
Alert has not auto-resolved.

Triage

# What's the current failure mode?
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND (httpRequest.status>=500 OR severity=ERROR OR textPayload=~"Memory limit|OOM")' \
  --project=tappass-prod --limit=30 --freshness=10m \
  --format='value(timestamp,severity,httpRequest.status,textPayload,jsonPayload.event,jsonPayload.error)'

Common patterns and where to go:

Pattern	Likely cause	Runbook
`Memory limit of X MiB exceeded`	Container OOM	OOM / crashloop
`httpRequest.status=503` with no app log + very long latency	Cloud Run queue timeout (instance saturation)	Deploy core server → bump `max_instances` or roll back the bad deploy
`psycopg2.OperationalError` or `connection refused`	Cloud SQL unreachable	Restore from backup + check Cloud SQL instance state
`opa_authz_unavailable_denied` spiking	OPA sidecar crashlooping	Check OPA container logs; OOM / crashloop for OPA memory
Recent revision + elevated 5xx	Bad deploy	Roll back Cloud Run
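
The text-matchable rows of this table can be sketched as a lookup over a single log line. This is an illustration only: `triage_hint` is a hypothetical helper name, and the two rows that depend on request metadata (503 with no app log, recent revision plus elevated 5xx) cannot be detected from one line of text, so they fall through to the default branch.

```shell
# triage_hint (hypothetical helper): maps one log line to the runbook named in
# the table above. Only the rows identifiable from the log text are covered.
triage_hint() {
  case "$1" in
    *"Memory limit of"*"exceeded"*)
      echo "OOM / crashloop" ;;
    *psycopg2.OperationalError*|*"connection refused"*)
      echo "Restore from backup + check Cloud SQL instance state" ;;
    *opa_authz_unavailable_denied*)
      echo "Check OPA container logs; OOM / crashloop for OPA memory" ;;
    *)
      echo "no text match: check for 503s without app logs or a recent revision" ;;
  esac
}
```

For example, `triage_hint "$(gcloud logging read ... --limit=1 --format='value(textPayload)')"` would print the runbook to open for the most recent matching entry; the elided query is the triage filter shown earlier in this section.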

Declare an incident

If the outage lasts more than 5 minutes or affects /api broadly, treat it as SEV1/SEV2 per Incident response. Rolling back is usually step 1; find the root cause once users are no longer affected.

After action

If this was a real outage (not a false positive):

Post a timeline to #incidents.
Open a postmortem doc within 48 hours — see the template in Incident response.
If it was cold-start-adjacent, evaluate whether the uptime-check timeout needs another bump (it's already at 30s as of 2026-04-22).

Tuning history

2026-04-22: timeout bumped 10s → 30s after perf-tuning rollovers kept cold-starting new instances past the 10s probe budget. See commit 02ccb7c and monitoring.tf.
2026-04-23: min_instances dropped 2 → 1 for cost. Single warm instance absorbs most cold-start risk; raise back to 2 if the false-positive rate climbs.
