OOM / crashloop

When the tappass container runs out of memory, Cloud Run prints a specific error to the platform logs and either fails the new revision outright or silently evicts and restarts the instance under load. This runbook covers the whole loop: identify → roll back if serving traffic → fix memory → redeploy → lock into terraform.

Recent precedent: on 2026-04-22 we shipped --workers 2 to uvicorn (to use the cpu=2 budget). Prod OOMed at the old 2Gi memory cap because each worker forks a full Python process and prod carries a large in-memory state. Fix: bumped to 4Gi, tracked in terraform as "cost(prod): bump memory 2Gi → 4Gi".

Detect

Platform-level OOM logs

The definitive signal is a line from Cloud Run itself (not the app):

gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
  --project=tappass-prod --limit=10 --freshness=30m \
  --format='value(timestamp,resource.labels.revision_name,textPayload)'

Example line:

Memory limit of 3072 MiB exceeded with 3211 MiB used.

The 3072 MiB is the pod-total limit (tappass container + OPA sidecar), not the tappass container on its own. A 2Gi + 1Gi config has a 3Gi pod ceiling.
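To read an OOM line quickly, a small helper can pull out the limit, the usage, and the overshoot. This is a minimal sketch that assumes the message keeps the shape of the example line above; `parse_oom_line` is a hypothetical name, not part of any existing tooling.

```shell
# Parse a Cloud Run OOM line into "limit used overshoot" (all in MiB).
# Assumes the message shape shown in the example line; prints nothing otherwise.
parse_oom_line() {
  printf '%s\n' "$1" | sed -nE \
    's/.*Memory limit of ([0-9]+) MiB exceeded with ([0-9]+) MiB used.*/\1 \2/p' |
  while read -r limit used; do
    echo "$limit $used $((used - limit))"
  done
}

parse_oom_line 'Memory limit of 3072 MiB exceeded with 3211 MiB used.'
# prints: 3072 3211 139

# Pod ceiling for a 2Gi tappass + 1Gi OPA config, in MiB.
echo $(( (2 + 1) * 1024 ))
# prints: 3072
```

The first number matching the sum of both container limits confirms the line reports the pod total rather than the tappass container alone.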

Revision-level status

gcloud run revisions describe <revision-name> \
  --project=tappass-prod --region=europe-west1 \
  --format='value(status.conditions[0].status,status.conditions[0].message)'

An OOMing revision typically either reports "The user-provided container ran out of memory." (startup OOM), or reports True (ready) while still emitting "Memory limit ... exceeded" messages under load.

Triage

Case A — fresh revision OOMs on startup, traffic still on old one

You deployed with --no-traffic and the new revision is failing to start. Prod is unaffected. Proceed to Fix memory and redeploy; no rollback is needed.

Case B — OOM revision is serving traffic

Users are hitting 503s. Roll back first, diagnose after: follow the Roll back Cloud Run runbook, then come back here to fix the memory before the next deploy.

Case C — intermittent OOMs on a historically-stable revision

Something changed that's not the memory limit — typically:

A background worker is accumulating state without bounds (health_score_worker, cache dicts that never evict).
A particular endpoint buffers a large response or file.
A memory leak in a library version bump.

Check for a recent deploy correlating with the OOM start. If yes, roll back and root-cause. If no, profile memory (see below) before bumping — the bump might just delay the inevitable.
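One way to check the correlation is to put revision creation times next to the earliest OOM line. The sketch below assumes the same project, region, and service name used elsewhere in this runbook; the function names `recent_revisions` and `first_oom` are hypothetical helpers.

```shell
# List the most recent tappass revisions with creation times, newest first.
# Optional first argument: how many revisions to show (default 5).
recent_revisions() {
  gcloud run revisions list \
    --service=tappass \
    --project=tappass-prod --region=europe-west1 \
    --sort-by='~metadata.creationTimestamp' --limit="${1:-5}" \
    --format='value(metadata.name,metadata.creationTimestamp)'
}

# Show the oldest "Memory limit" line inside the lookback window.
# Optional first argument: lookback window (default 7d).
first_oom() {
  gcloud logging read \
    'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
    --project=tappass-prod --order=asc --limit=1 --freshness="${1:-7d}" \
    --format='value(timestamp,resource.labels.revision_name)'
}
```

Run `recent_revisions` and then `first_oom`: if the first OOM timestamp falls shortly after a revision's creation time, that deploy is the likely trigger.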

Fix memory

Pick the new limit

Rules of thumb for the tappass container:

Scenario	Suggested tappass memory
staging, `--workers 1`	2Gi
staging, `--workers 2`	2Gi (staging data is small)
prod, `--workers 1`	2Gi
prod, `--workers 2`	4Gi
prod with KMS envelope encryption under heavy load	4Gi

Always leave ≥25% headroom over observed peak RSS. Cloud Run meters against the pod (tappass + OPA), so if you bump tappass to 4Gi and OPA stays at 1Gi, your pod cap becomes 5Gi.
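The headroom rule and the pod-cap arithmetic can be worked through with shell integer math. This is a rough sketch: `min_limit_gi` and `pod_cap_gi` are hypothetical helper names, and the 3072 MiB peak is an assumption taken from the "2 × ~1.5Gi" estimate in the precedent above.

```shell
# Smallest whole-Gi limit that leaves at least 25% headroom over a peak RSS (MiB).
min_limit_gi() {
  peak_mib=$1
  needed_mib=$(( (peak_mib * 125 + 99) / 100 ))   # peak plus 25%, rounded up
  echo $(( (needed_mib + 1023) / 1024 ))          # round up to a whole Gi
}

# Pod cap is the sum of the container limits (tappass + OPA), in Gi.
pod_cap_gi() {
  echo $(( $1 + $2 ))
}

min_limit_gi 3072   # two workers at ~1.5Gi each; prints 4
pod_cap_gi 4 1      # tappass 4Gi + OPA 1Gi; prints 5
```

A 3072 MiB peak needs 3840 MiB with headroom, which rounds up to the 4Gi row in the table.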

Apply live (fastest)

gcloud run services update tappass \
  --project=tappass-prod --region=europe-west1 \
  --container=tappass --cpu=2 --memory=4Gi \
  --container=opa --cpu=1 --memory=1Gi

This creates a new revision with the bumped sizing. Wait for it to reach Ready, then cut traffic if it's not already there:

gcloud run services describe tappass --project=tappass-prod --region=europe-west1 \
  --format='value(status.latestReadyRevisionName,status.latestCreatedRevisionName)'

# If they differ, cut traffic explicitly:
gcloud run services update-traffic tappass --project=tappass-prod --region=europe-west1 \
  --to-revisions="<latestReadyRevisionName>=100"

Verify the fix

# Health endpoint should return 200 with steady latency
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    https://eu.tappass.ai/api/health/live
done

# No OOM lines on the new revision
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND resource.labels.revision_name="<new-revision>" AND textPayload=~"Memory limit"' \
  --project=tappass-prod --limit=5 --freshness=15m \
  --format='value(timestamp)'
# Expect empty

Lock it into terraform

Live fixes drift if nobody updates terraform. Persist every memory bump the same day:

module "tappass" {
  …
  memory_limit     = "4Gi"   # tappass container
  opa_memory_limit = "1Gi"   # OPA sidecar
  …
}
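Once the terraform values match what was applied live, a plan should come back empty. The sketch below is one possible check; it assumes it is run from the terraform root that contains the tappass module, and `check_drift` is a hypothetical helper name. It relies on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending.

```shell
# Report whether the terraform config still differs from what is live.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  case $rc in
    0) echo "in sync" ;;
    2) echo "drift: terraform and live config differ"; return 2 ;;
    *) echo "terraform plan failed"; return 1 ;;
  esac
}
```

If `check_drift` reports drift after the edit, the memory values in terraform probably do not match the live revision yet.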

Commit with the reason in the message — a future-you hunting for why the bill jumped needs to find the explanation immediately:

fix(prod): bump memory 2Gi → 4Gi for uvicorn --workers 2

Each worker forks a full Python process — prod carries 9 providers
+ the full agent registry in memory, so 2 × ~1.5Gi overshot the old
2Gi cap and OOMed on rollout.

Common traps

Trap	Why it bites
Bumping only memory while keeping `--workers 2` after a failed deploy	The fix works, but staging still tests `--workers 2` at the old memory — you won't catch a regression until it OOMs in prod again
Reducing `--workers` back to 1 to "save memory"	Undoes the cpu=2 throughput win — prefer bumping memory
Assuming OOM = memory leak	Often it's just a new revision with fatter imports, a new provider, or more orgs
Raising memory without re-measuring observed peak and re-applying headroom	You'll OOM again the next time peak shifts (e.g. a tenant onboards a new provider)

Also see

Roll back Cloud Run — when the OOMing revision is serving traffic right now.
Deploy core server — how to ship the rebuilt image once the memory is right.
Cold-start / uptime alert — OOMs and cold starts both fire the uptime alert; this helps you tell them apart.