OOM / crashloop

When the tappass container runs out of memory, Cloud Run prints a specific error to the platform logs and either fails the new revision outright or silently evicts and restarts the instance under load. This runbook covers the whole loop: identify → roll back if serving traffic → fix memory → redeploy → lock into terraform.

Recent precedent: on 2026-04-22 we shipped --workers 2 to uvicorn (to use the cpu=2 budget). Prod OOMed at the old 2Gi memory cap because each worker is a full Python process and prod carries large in-memory state. Fix: bumped the tappass container to 4Gi, tracked in terraform as cost(prod): bump memory 2Gi → 4Gi.
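
The arithmetic behind that incident can be sketched in a few lines. The ~1.5 GiB per-worker figure is the estimate quoted in the commit message further down this runbook. The function name and the assumption that workers share no memory are illustrative simplifications, not measured behavior.

```python
def workers_fit(workers: int, per_worker_gib: float, cap_gib: float) -> bool:
    """Return True if the estimated total footprint stays within the cap.

    Assumes every uvicorn worker holds its own full copy of the in-memory
    state (a pessimistic simplification: no pages shared between workers).
    """
    return workers * per_worker_gib <= cap_gib


# One worker at ~1.5 GiB fits the old 2 GiB cap; two workers do not.
print(workers_fit(1, 1.5, 2.0))  # True
print(workers_fit(2, 1.5, 2.0))  # False
print(workers_fit(2, 1.5, 4.0))  # True
```

Running this kind of check before changing --workers would have flagged the 2Gi cap as too small for two workers.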

The definitive signal is a line from Cloud Run itself (not the app):

gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
--project=tappass-prod --limit=10 --freshness=30m \
--format='value(timestamp,resource.labels.revision_name,textPayload)'

Example line:

Memory limit of 3072 MiB exceeded with 3211 MiB used.

The 3072 MiB is the pod-total limit (tappass container + OPA sidecar), not the tappass container on its own. A 2Gi + 1Gi config has a 3Gi pod ceiling.
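
To make the pod-total reading concrete, here is a minimal sketch that parses the example line and checks it against the sum of the container limits. The regular expression is written against the single example line above and is an assumption about the general message format. The helper names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Pattern derived from the example line; other message variants may exist.
OOM_LINE = re.compile(r"Memory limit of (\d+) MiB exceeded with (\d+) MiB used")


def parse_oom_line(text: str) -> Optional[Tuple[int, int]]:
    """Return (limit_mib, used_mib) if the text is an OOM line, else None."""
    match = OOM_LINE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pod_ceiling_mib(container_limits_gib: List[int]) -> int:
    """Pod ceiling in MiB: the sum of all container limits given in Gi."""
    return sum(container_limits_gib) * 1024


limit, used = parse_oom_line(
    "Memory limit of 3072 MiB exceeded with 3211 MiB used."
)
print(limit, used, used - limit)         # 3072 3211 139
print(pod_ceiling_mib([2, 1]) == limit)  # True: 2Gi tappass + 1Gi OPA
```

The overshoot here is 139 MiB above the 3Gi pod ceiling, which matches the 2Gi + 1Gi configuration rather than the 2Gi tappass limit alone.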

Next, check the status of the affected revision:

gcloud run revisions describe <revision-name> \
--project=tappass-prod --region=europe-west1 \
--format='value(status.conditions[0].status,status.conditions[0].message)'

An OOMing revision typically does one of two things: it reports The user-provided container ran out of memory. (startup OOM), or it reports True (ready) but still emits the "Memory limit exceeded" messages under load.

Case A — fresh revision OOMs on startup, traffic still on old one

You deployed with --no-traffic and the new revision is failing to start. Prod is unaffected. Proceed to Fix memory and redeploy; no rollback is needed.

Case B — OOM revision is serving traffic

Users are hitting 503s. Roll back first, diagnose after. Follow Roll back Cloud Run. Then come back here to fix the memory before the next deploy.

Case C — intermittent OOMs on a historically-stable revision

Something changed that's not the memory limit — typically:

  • A background worker is accumulating state without bounds (health_score_worker, cache dicts that never evict).
  • A particular endpoint buffers a large response or file.
  • A memory leak in a library version bump.

Check for a recent deploy correlating with the OOM start. If yes, roll back and root-cause. If no, profile memory (see below) before bumping — the bump might just delay the inevitable.
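
One way to profile a suspected accumulation is Python's standard tracemalloc module: take two snapshots and diff them by source line. The sketch below is a self-contained illustration, not the tappass profiling setup. The leaky_cache dict is a hypothetical stand-in for an unbounded cache, and in a real service the two snapshots would be taken minutes apart under load.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocations grew the most."""
    stats = after.compare_to(before, "lineno")  # sorted by |size_diff|, largest first
    return [stat for stat in stats if stat.size_diff > 0][:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical stand-in for a leak: a dict that only ever grows.
leaky_cache = {i: bytes(1024) for i in range(2000)}

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Each entry names a file and line number plus the bytes gained there.
for stat in top_growth(before, after):
    print(stat)
```

If the same line keeps leading the list across successive snapshots, that is a stronger hint of unbounded growth than a single large allocation.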

Rules of thumb for the tappass container:

| Scenario | Suggested tappass memory |
| --- | --- |
| staging, --workers 1 | 2Gi |
| staging, --workers 2 | 2Gi (staging data is small) |
| prod, --workers 1 | 2Gi |
| prod, --workers 2 | 4Gi |
| prod with KMS envelope encryption under heavy load | 4Gi |

Always leave ≥25% headroom over observed peak RSS. Cloud Run meters against the pod (tappass + OPA), so if you bump tappass to 4Gi and OPA stays at 1Gi, your pod cap becomes 5Gi.
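
The headroom rule and the pod-cap arithmetic can be sketched as follows. The 3000 MiB peak is a made-up input for illustration, and rounding up to a whole Gi is an assumption about how limits are chosen here, not a Cloud Run requirement.

```python
import math


def recommended_limit_gib(peak_rss_mib: float, headroom: float = 0.25) -> int:
    """Smallest whole-Gi limit that leaves `headroom` above observed peak RSS."""
    return math.ceil(peak_rss_mib * (1 + headroom) / 1024)


def pod_cap_gib(tappass_gib: int, opa_gib: int) -> int:
    """Pod ceiling that Cloud Run meters against: tappass plus the OPA sidecar."""
    return tappass_gib + opa_gib


# Hypothetical observed peak of 3000 MiB: 3000 * 1.25 = 3750 MiB, rounded up to 4Gi.
print(recommended_limit_gib(3000))  # 4
print(pod_cap_gib(4, 1))            # 5
```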

Apply the new limits to both containers in one update:

gcloud run services update tappass \
--project=tappass-prod --region=europe-west1 \
--container=tappass --cpu=2 --memory=4Gi \
--container=opa --cpu=1 --memory=1Gi

This creates a new revision with the bumped sizing. Wait for it to reach Ready, then cut traffic if it's not already there:

gcloud run services describe tappass --project=tappass-prod --region=europe-west1 \
--format='value(status.latestReadyRevisionName,status.latestCreatedRevisionName)'
# The two names match once the new revision is Ready. If traffic is not on it yet, cut over explicitly:
gcloud run services update-traffic tappass --project=tappass-prod --region=europe-west1 \
--to-revisions="<latestReadyRevisionName>=100"

Then confirm that latency is steady and that the new revision logs no further OOMs:

# Latency smooth
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    https://eu.tappass.ai/api/health/live
done
# No new OOMs on the new revision only
gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND resource.labels.revision_name="<new-revision>" AND textPayload=~"Memory limit"' \
--project=tappass-prod --limit=5 --freshness=15m \
--format='value(timestamp)'
# Expect empty

Live fixes drift if nobody updates terraform. Persist every memory bump the same day:

deploy/terraform/environments/gcp-prod/main.tf
module "tappass" {
  memory_limit     = "4Gi" # tappass container
  opa_memory_limit = "1Gi" # OPA sidecar
}
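
A drift check between the terraform values and the live service could look like the sketch below. It assumes the service description follows the Knative-style layout (spec.template.spec.containers[].resources.limits.memory), as produced by gcloud run services describe with --format=json. The sample dict is hand-written for illustration, so verify the shape against real output before relying on it. find_drift is a hypothetical helper.

```python
def find_drift(declared: dict, service: dict) -> dict:
    """Map container name -> (declared, live) for every memory mismatch."""
    live = {
        c.get("name"): c.get("resources", {}).get("limits", {}).get("memory")
        for c in service["spec"]["template"]["spec"]["containers"]
    }
    return {
        name: (want, live.get(name))
        for name, want in declared.items()
        if live.get(name) != want
    }


# Hand-written sample in the assumed shape of the live service description.
service = {"spec": {"template": {"spec": {"containers": [
    {"name": "tappass", "resources": {"limits": {"cpu": "2", "memory": "4Gi"}}},
    {"name": "opa", "resources": {"limits": {"cpu": "1", "memory": "1Gi"}}},
]}}}}

print(find_drift({"tappass": "4Gi", "opa": "1Gi"}, service))  # {}
print(find_drift({"tappass": "2Gi", "opa": "1Gi"}, service))  # {'tappass': ('2Gi', '4Gi')}
```

An empty result means terraform and the live revision agree. Anything else is a live fix that has not been persisted yet.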

Commit with the reason in the message — a future-you hunting for why the bill jumped needs to find the explanation immediately:

fix(prod): bump memory 2Gi → 4Gi for uvicorn --workers 2
Each worker forks a full Python process — prod carries 9 providers
+ the full agent registry in memory, so 2 × ~1.5Gi overshot the old
2Gi cap and OOMed on rollout.

| Trap | Why it bites |
| --- | --- |
| Bumping only memory while keeping --workers 2 after a failed deploy | The fix works, but staging still tests --workers 2 at the old memory, so you won't catch a regression until it OOMs in prod again |
| Reducing --workers back to 1 to "save memory" | Undoes the cpu=2 throughput win; prefer bumping memory |
| Assuming OOM = memory leak | Often it's just a new revision with fatter imports, a new provider, or more orgs |
| Raising memory without also raising the observed-peak + headroom | You'll OOM again the next time peak shifts (e.g. a tenant onboards a new provider) |