OOM / crashloop
When the tappass container runs out of memory, Cloud Run prints a specific error to the platform logs and either fails the new revision outright or silently evicts and restarts the instance under load. This runbook covers the whole loop: identify → rollback if serving traffic → fix memory → redeploy → lock into terraform.
Recent precedent: on 2026-04-22 we shipped --workers 2 to
uvicorn (to use the cpu=2 budget). Prod OOMed at the old 2Gi
memory cap because each worker forks a full Python process and prod
carries a large in-memory state. Fix: bumped to 4Gi, tracked in
terraform cost(prod): bump memory 2Gi → 4Gi.
Detect
Platform-level OOM logs
The definitive signal is a line from Cloud Run itself (not the app):
    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
      --project=tappass-prod --limit=10 --freshness=30m \
      --format='value(timestamp,resource.labels.revision_name,textPayload)'
Example line:
    Memory limit of 3072 MiB exceeded with 3211 MiB used.
The 3072 MiB is the pod-total limit (tappass container +
OPA sidecar), not the tappass container on its own. A 2Gi + 1Gi
config has a 3Gi pod ceiling.
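To see how far past the ceiling an instance went, the two figures can be pulled out of the log line. The following is a minimal POSIX shell sketch; `parse_oom_line` is a hypothetical helper name, and it assumes the message wording matches the example line above.

```shell
# Hypothetical helper: extract limit and usage (MiB) from a Cloud Run OOM line.
# Assumes the wording "Memory limit of N MiB exceeded with M MiB used".
parse_oom_line() {
  # sed prints "LIMIT USED" only when the line matches the expected wording.
  pair=$(printf '%s\n' "$1" |
    sed -n 's/.*Memory limit of \([0-9][0-9]*\) MiB exceeded with \([0-9][0-9]*\) MiB used.*/\1 \2/p')
  [ -n "$pair" ] || return 1
  limit=${pair% *}   # text before the space
  used=${pair#* }    # text after the space
  echo "limit_mib=$limit used_mib=$used over_mib=$((used - limit))"
}

parse_oom_line "Memory limit of 3072 MiB exceeded with 3211 MiB used."
```

For the example line this reports an overshoot of 139 MiB against the 3072 MiB pod ceiling. A small overshoot on a pod that is already at its ceiling says little about how much headroom the next limit needs, so treat it as a lower bound only.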
Revision-level status
    gcloud run revisions describe <revision-name> \
      --project=tappass-prod --region=europe-west1 \
      --format='value(status.conditions[0].status,status.conditions[0].message)'
An OOMing revision will typically either say
The user-provided container ran out of memory. (startup OOM) or
report True (ready) but still emit the "Memory limit exceeded"
messages under load.
Triage
Case A: fresh revision OOMs on startup, traffic still on old one
You deployed with --no-traffic and the new revision is failing to
start. Prod is unaffected. Proceed to Fix memory and
redeploy; no rollback needed.
Case B: OOM revision is serving traffic
Users are hitting 503s. Roll back first, diagnose after. Follow Roll back Cloud Run. Then come back here to fix the memory before the next deploy.
Case C: intermittent OOMs on a historically-stable revision
Something other than the memory limit changed, typically:
- A background worker is accumulating state without bounds (health_score_worker, cache dicts that never evict).
- A particular endpoint buffers a large response or file.
- A memory leak in a library version bump.
Check for a recent deploy correlating with the OOM start. If yes, roll back and root-cause. If no, profile memory (see below) before bumping — the bump might just delay the inevitable.
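When checking whether a deploy lines up with the OOM start, it can help to group the platform OOM lines by revision. The following is a minimal sketch, assuming the tab-separated timestamp, revision name, textPayload output that the `--format='value(...)'` query in Detect produces. `oom_summary` is a hypothetical helper, and timestamps are compared as plain strings, which assumes they all share one UTC format.

```shell
# Hypothetical helper: summarise OOM lines per revision.
# Reads tab-separated "timestamp<TAB>revision<TAB>textPayload" on stdin and
# prints "revision<TAB>oom_count<TAB>earliest_oom_timestamp", sorted by revision.
oom_summary() {
  awk -F '\t' '
    /Memory limit/ {
      count[$2]++
      # Same-format UTC timestamps order correctly as strings.
      if (!($2 in first) || $1 < first[$2]) first[$2] = $1
    }
    END {
      for (rev in count) printf "%s\t%d\t%s\n", rev, count[rev], first[rev]
    }
  ' | sort
}

# Usage (assumption: run with the same filter as the Detect query):
#   gcloud logging read '<filter from Detect>' --project=tappass-prod \
#     --freshness=7d \
#     --format='value(timestamp,resource.labels.revision_name,textPayload)' | oom_summary
```

If every OOM belongs to one recently created revision, that points at the deploy. OOMs spread across revisions that were stable before point at load or data growth instead.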
Fix memory
Pick the new limit
Rules of thumb for the tappass container:
| Scenario | Suggested tappass memory |
|---|---|
| staging, --workers 1 | 2Gi |
| staging, --workers 2 | 2Gi (staging data is small) |
| prod, --workers 1 | 2Gi |
| prod, --workers 2 | 4Gi |
| prod with KMS envelope encryption under heavy load | 4Gi |
Always leave ≥25% headroom over observed peak RSS. Cloud Run meters against the pod (tappass + OPA), so if you bump tappass to 4Gi and OPA stays at 1Gi, your pod cap becomes 5Gi.
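The headroom rule can be worked through as arithmetic. The sketch below is illustrative only: `suggest_memory_gi` is a hypothetical helper, the ~1.5Gi (1536 MiB) per-worker figure is the estimate from the 2026-04-22 incident, multiplying it by the worker count is a simplification (forked workers may share some pages), and rounding up to a whole Gi is an assumption made for readability.

```shell
# Hypothetical helper: suggest a tappass memory limit in whole Gi.
# Args: worker count, observed peak RSS per worker (MiB), headroom percent (default 25).
suggest_memory_gi() {
  workers=$1
  per_worker_mib=$2
  headroom_pct=${3:-25}
  peak_mib=$((workers * per_worker_mib))
  # Integer ceiling of peak * (100 + headroom) / 100.
  needed_mib=$(( (peak_mib * (100 + headroom_pct) + 99) / 100 ))
  # Round up to a whole Gi (1024 MiB).
  echo $(( (needed_mib + 1023) / 1024 ))
}

tappass_gi=$(suggest_memory_gi 2 1536)   # 2 workers at ~1.5Gi each
opa_gi=1                                 # OPA sidecar stays at 1Gi
echo "tappass=${tappass_gi}Gi pod_cap=$((tappass_gi + opa_gi))Gi"
```

With these inputs the peak is 3072 MiB, the 25% headroom brings it to 3840 MiB, and the result rounds up to 4Gi for tappass and a 5Gi pod cap, which matches the prod `--workers 2` row in the table.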
Apply live (fastest)
    gcloud run services update tappass \
      --project=tappass-prod --region=europe-west1 \
      --container=tappass --cpu=2 --memory=4Gi \
      --container=opa --cpu=1 --memory=1Gi
This creates a new revision with the bumped sizing. Wait for it to
reach Ready, then cut traffic if it's not already there:
    gcloud run services describe tappass --project=tappass-prod --region=europe-west1 \
      --format='value(status.latestReadyRevisionName,status.latestCreatedRevisionName)'
    # If they differ, cut traffic explicitly:
    gcloud run services update-traffic tappass --project=tappass-prod --region=europe-west1 \
      --to-revisions="<latestReadyRevisionName>=100"
Verify the fix
    # Latency smooth
    for i in 1 2 3 4 5; do
      curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
        https://eu.tappass.ai/api/health/live
    done
    # No new OOMs on the new revision only
    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND resource.labels.revision_name="<new-revision>" AND textPayload=~"Memory limit"' \
      --project=tappass-prod --limit=5 --freshness=15m \
      --format='value(timestamp)'
    # Expect empty
Lock it into terraform
Live fixes drift if nobody updates terraform. Persist every memory bump the same day:
    module "tappass" {
      …
      memory_limit     = "4Gi" # tappass container
      opa_memory_limit = "1Gi" # OPA sidecar
      …
    }
Commit with the reason in the message. A future you hunting for why the bill jumped needs to find the explanation immediately:
    fix(prod): bump memory 2Gi → 4Gi for uvicorn --workers 2

    Each worker forks a full Python process; prod carries 9 providers
    + the full agent registry in memory, so 2 × ~1.5Gi overshot the old
    2Gi cap and OOMed on rollout.
Common traps
| Trap | Why it bites |
|---|---|
| Bumping only memory while keeping --workers 2 after a failed deploy | The fix works, but staging still tests --workers 2 at the old memory, so you won't catch a regression until it OOMs in prod again |
| Reducing --workers back to 1 to "save memory" | Undoes the cpu=2 throughput win; prefer bumping memory |
| Assuming OOM = memory leak | Often it's just a new revision with fatter imports, a new provider, or more orgs |
| Raising memory without re-measuring observed peak and re-applying headroom | You'll OOM again the next time peak shifts (e.g. a tenant onboards a new provider) |
Also see
- Roll back Cloud Run: when the OOMing revision is serving traffic right now.
- Deploy core server: how to ship the rebuilt image once the memory is right.
- Cold-start / uptime alert: OOMs and cold starts both fire the uptime alert; this helps you tell them apart.