Centralized OPA + per-entity gate locks — Concept

Status: Concept / architecture draft. Not yet a feature spec.

Date: 2026-05-04

Origin: End of the PR-#219 → #226 series that closed every deterministic policy-enforcement gap. The 7-PR sequence got Tests 4/10/11 to 100% reliability, but Tests 6/7/8 still occasionally flake (~1 cluster failure per 3-run batch). The remaining cause is structural (per-instance OPA sidecars with cross-instance synchronisation), not anything a smaller patch can fix. This concept proposes the architectural shift the user approved at the end of that conversation.


Replace the "OPA sidecar per Cloud Run instance + KV-version-gate cross-instance sync" architecture with one shared OPA cluster that every tappass instance queries directly. Eliminate the cross-instance state-divergence bug class entirely. As a transitional measure for any code that still cares about per-instance refresh ordering, switch the version gate's single asyncio.Lock to per-entity locks so a slow refresh on one entity stops blocking refreshes on others.

The 7-PR series proved the per-instance-OPA-with-version-gate model is correctable but not stable. Every fix exposed a smaller race. The architecture itself is the issue.


           ┌──────────┐
           │ Postgres │  (source of truth)
           └─────┬────┘
     ┌───────────┼───────────┐
     ↓           ↓           ↓
┌──────────┐┌──────────┐┌──────────┐
│ tappass A││ tappass B││ tappass C│
│ + OPA A  ││ + OPA B  ││ + OPA C  │
└────┬─────┘└────┬─────┘└────┬─────┘
     │           │           │
     └───────────┼───────────┘
           ┌──────────┐
           │  Redis   │  (KV: per-entity version stamps)
           └──────────┘

Each tappass Cloud Run instance runs an OPA sidecar in the same container. Three instances = three OPAs = three copies of data.tappass.policy_data. The PolicyVersionGate keeps them in sync via Redis-published version stamps.

What works: the gate is correct. PR #219–#226 closed every deterministic gap (event-loop binding, store cache invalidation, refresh timeout, bootstrap contention, cold-start priming).

What still hurts:

  • N OPAs to keep in sync. Every PUT must propagate to N instances. The gate does this lazily on read. There is always a window where an instance has stale data.
  • Cross-instance refresh contention. When the gate refreshes, it pushes to local OPA. If multiple instances refresh the same entity simultaneously (mid-burst Cloud Run scale), they all hit OPA's single-threaded transaction processor in series.
  • Rollout-window staleness. During a deploy, the old revision keeps serving while the new one warms. Old revisions don't have the latest gate fixes. Every release introduces a small flake window.
  • Operational surface area. PolicyVersionGate, store invalidation, periodic reconciler, bootstrap reconciler, gate priming — five interlocking pieces, all to solve the same underlying problem: keeping N OPAs consistent.

What the cost actually is: in the first ~10 seconds after a PUT, ~15% of policy_floor reads see stale data because they land on an instance that has not refreshed yet. Production impact is small (the gate retries, so the very next read sees fresh data), but it is a real correctness window that we cannot fully close without changing the architecture.


           ┌──────────┐
           │ Postgres │  (still source of truth)
           └─────┬────┘
     ┌───────────┼───────────┐
     ↓           ↓           ↓
┌──────────┐┌──────────┐┌──────────┐
│ tappass A││ tappass B││ tappass C│
└────┬─────┘└────┬─────┘└────┬─────┘
     │           │           │
     └───────────┼───────────┘
       ┌──────────────────┐
       │ tappass-opa      │  (Cloud Run service)
       │ (1+ replicas)    │
       └──────────────────┘

A single Cloud Run service tappass-opa, internal-ingress-only (only reachable from tappass instances). Every tappass instance points its OPA client at the service URL instead of localhost:8181. Writes go to one place. Reads see exactly one truth.

| Component | Status after migration |
| --- | --- |
| PolicyVersionGate | Delete entirely |
| KV (Redis) policy version stamps | Delete entirely |
| OrgPolicyStore.invalidate / ProjectStore.invalidate | Keep (still useful for non-OPA paths like dashboard reads) |
| Periodic reconciler (reconcile_loop.py) | Keep: backstop for OPA restart |
| Bootstrap reconciler (batched) | Keep: seeds OPA on cold start |
| Per-write push (push_org_policy / push_project) | Keep: now goes to the centralized URL |
| _applied per-instance tracking | Delete entirely |
| Gate priming, refresh timeout, cache invalidation in gate | Delete entirely |

Net code change: +1 small Terraform module, and roughly 600 fewer lines of gate machinery and tests.

A single OPA = a single point of failure unless we replicate.

  • Cloud Run with min_instances ≥ 2 gives us multi-replica out of the box.

  • Replica consistency: OPA replicas need to share the data document. Three options:

    | Option | Mechanism | Trade-off |
    | --- | --- | --- |
    | A. Single writer + bundle pull | Writes go to one designated replica; others pull every N seconds via OPA's Bundle API | Simple. Eventual consistency, ~1-5s lag between replicas. |
    | B. Sticky-session per tenant | Cloud Run routes by org_id; one tenant always hits one replica | No sync needed but loses load-balancing freedom. |
    | C. Single replica + Cloud Run auto-recovery | Just min=1, max=1; rely on Cloud Run to restart on failure | Simplest. Brief outage on restart (~5s). |

    Recommendation: start with C (single replica, Cloud Run auto-recovery). Our policy_data write rate is single-digit per minute. Restarts are rare. Add multi-replica later if/when SLO demands it.

| Path | Today (sidecar) | Centralized | Delta |
| --- | --- | --- | --- |
| OPA decision query | ~30ms (localhost) | ~35-40ms (intra-VPC) | +5-10ms |
| Policy write (PUT) | ~600ms (single entity) | ~600ms (same) | none |
| Bootstrap reconcile | ~1s (batched, PR #224) | ~1s (same) | none |

The 5-10ms decision-query overhead is well below the 30s pipeline-step ceiling. We currently spend that much on cache-staleness retries — it's a worthwhile trade.

  • tappass-opa is internal-ingress only (ingress = "internal" in Terraform).
  • mTLS between tappass and tappass-opa via SPIFFE.
  • Same OPA bundle (Rego rules) baked into the new service's image.
  • Same signing config. No new attack surface.

Each step is a deploy that's safely revertable.

| Phase | Change | Risk |
| --- | --- | --- |
| 1. Build | New tappass-opa Cloud Run service, internal-ingress, deployed alongside existing sidecars. Not yet wired up. | Zero: additive. |
| 2. Env switch | Add TAPPASS_OPA_URL env var. Default localhost:8181 (sidecar). When set, tappass uses centralized URL. | Zero: opt-in. |
| 3. Staging | Set env var on staging, point at tappass-opa. Run deep e2e for 24h. | Low: staging only. |
| 4. Prod canary | One prod instance opt-in via env var. Watch latency + error budget for 24h. | Low: single instance. |
| 5. Prod rollout | Flip env var on all prod instances. Sidecars still running (unused). | Low: quick rollback. |
| 6. Cleanup | Remove sidecars from container spec. Remove PolicyVersionGate, KV bumps, gate-priming code. | Zero once Phase 5 is stable. |

Phase 6 is when the residual flakes go to zero.
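The Phase 2 opt-in switch could look roughly like the sketch below. It is illustrative only, not the tappass client: `resolve_opa_base_url` and `decision_request` are hypothetical helpers, and the `http://` scheme on the default and the `tappass/allow` decision path in the usage note are assumptions. Only the `TAPPASS_OPA_URL` name and the `localhost:8181` default come from the table above. The request shape follows OPA's Data API (`POST /v1/data/<path>` with an `input` document).

```python
import json
import os
from typing import Mapping, Optional

# Default from the Phase 2 row: the local sidecar. The http:// scheme is an assumption.
DEFAULT_OPA_URL = "http://localhost:8181"


def resolve_opa_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return TAPPASS_OPA_URL if it is set and non-empty, else the sidecar URL."""
    env = os.environ if env is None else env
    url = env.get("TAPPASS_OPA_URL", "").strip()
    return (url or DEFAULT_OPA_URL).rstrip("/")


def decision_request(base_url: str, path: str, input_doc: dict) -> tuple[str, bytes]:
    """Build the URL and JSON body for a POST to OPA's Data API."""
    url = f"{base_url}/v1/data/{path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return url, body
```

With this shape, `decision_request(resolve_opa_base_url(), "tappass/allow", {"org_id": "org_a"})` targets the sidecar until the variable is set, and unsetting it again is the rollback path that Phase 5 relies on.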

  • Postgres as durable source of truth.
  • Per-write push_org_policy / push_project after every PUT — these now hit the centralized OPA, get processed in OPA's transaction order, and that's it.
  • Periodic reconciler — still a backstop in case OPA restarts mid-incident.
  • Bootstrap reconciler — still seeds OPA on cold start (just one cold start now, on the OPA service, not N).

The migration is correct iff: after Phase 6, the deep e2e is 3/3 green on Tests 4, 6, 7, 8, 10, 11, 12 across 10 consecutive runs. Specifically:

  • Test 7's 20-call burst: 0 leaks, deterministically.
  • Test 8 post-reset: always allows, deterministically.
  • No policy_version_gate_* events in Cloud Logging (because that code is gone).
  • opa_decision p99 latency stays under 100ms.
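The criteria above can be expressed as a mechanical check. The sketch below assumes the e2e harness can report, per run, the fields in `RunResult`; that class, its field names, and `meets_criteria` are hypothetical, and the nearest-rank percentile is one of several reasonable p99 definitions.

```python
from dataclasses import dataclass


def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


@dataclass
class RunResult:
    burst_leaks: int              # Test 7: leaked calls in the 20-call burst
    post_reset_allowed: bool      # Test 8: request allowed after the reset
    gate_log_events: int          # count of policy_version_gate_* log events
    opa_decision_ms: list[float]  # opa_decision latency samples for the run


def meets_criteria(runs: list[RunResult], required_runs: int = 10) -> bool:
    """True only if each of the last `required_runs` runs passes every check."""
    if len(runs) < required_runs:
        return False
    return all(
        r.burst_leaks == 0
        and r.post_reset_allowed
        and r.gate_log_events == 0
        and p99(r.opa_decision_ms) < 100.0
        for r in runs[-required_runs:]
    )
```

A single leaked call or a single run with p99 at or above 100ms fails the whole window, which matches the "deterministically" wording in the list.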

4. Proposal B — per-entity locks (transitional)

If we don't do the full migration immediately, we can address the smaller "cross-entity blocking" issue right now with a one-file change.

class PolicyVersionGate:
    def __init__(self, ...):
        self._lock = asyncio.Lock()  # ONE lock for the whole gate

    async def _ensure(self, ..., kind, ent_id, ...):
        async with self._lock:
            # refresh THIS entity
            ...

A slow refresh on org_a blocks every other entity's refresh. With per-instance OPA contention, a single stuck entity can stall the entire gate for 10s (the refresh timeout), which puts every other tenant inside the blast radius.

class PolicyVersionGate:
    def __init__(self, ...):
        self._locks: dict[str, asyncio.Lock] = {}
        self._locks_lock = asyncio.Lock()  # for managing the dict

    async def _get_lock(self, key: str) -> asyncio.Lock:
        async with self._locks_lock:
            lock = self._locks.get(key)
            if lock is None:
                lock = asyncio.Lock()
                self._locks[key] = lock
            return lock

    async def _ensure(self, ..., kind, ent_id, ...):
        local_key = f"{kind}:{ent_id}"
        lock = await self._get_lock(local_key)
        async with lock:
            # refresh THIS entity
            ...

One lock per (kind, ent_id) tuple. Refreshes for org_a and org_b proceed in parallel. Concurrent refreshes for the same entity still serialise (correct — we don't want duplicate work).
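The parallel-versus-serial behaviour can be checked with a small simulation. This is a sketch under assumptions, not the real gate: `EntityLocks`, `refresh`, `timed`, and the 0.2s sleep standing in for a push to OPA are all hypothetical. The lock lookup here is synchronous with no guarding lock, which is only safe when every caller runs on a single event loop.

```python
import asyncio
import time


class EntityLocks:
    """Hypothetical stand-in for the gate's per-entity lock table."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = {}

    def get(self, kind: str, ent_id: str) -> asyncio.Lock:
        # No await between lookup and insert, so no other task can interleave.
        key = f"{kind}:{ent_id}"
        lock = self._locks.get(key)
        if lock is None:
            lock = self._locks[key] = asyncio.Lock()
        return lock


async def refresh(locks: EntityLocks, kind: str, ent_id: str, seconds: float) -> None:
    async with locks.get(kind, ent_id):
        await asyncio.sleep(seconds)  # simulated push to OPA


async def timed(*coros) -> float:
    """Run the coroutines concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*coros)
    return time.perf_counter() - start


async def demo() -> tuple[float, float]:
    locks = EntityLocks()
    different = await timed(
        refresh(locks, "org", "org_a", 0.2),
        refresh(locks, "org", "org_b", 0.2),
    )
    same = await timed(
        refresh(locks, "org", "org_a", 0.2),
        refresh(locks, "org", "org_a", 0.2),
    )
    return different, same


if __name__ == "__main__":
    different, same = asyncio.run(demo())
    print(f"different entities: {different:.2f}s, same entity: {same:.2f}s")
```

On an idle machine the two refreshes for different entities should take about one sleep interval in total, while the two refreshes for the same entity should take about two.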

Each asyncio.Lock is ~200 bytes. With 1000 active orgs+projects per instance, that's ~200KB of locks. Acceptable.

For high-cardinality cleanup: an LRU eviction (drop locks not touched in 1h) keeps the dict bounded. Not strictly needed at our scale.
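A minimal sketch of that cleanup, assuming idle-time eviction with a one-hour threshold. `ExpiringLocks` and its method names are hypothetical, and the injectable `clock` exists only to make the behaviour testable. A lock that is currently held is never dropped, since evicting it would let a second caller create a fresh lock for the same entity.

```python
import asyncio
import time
from typing import Callable


class ExpiringLocks:
    """Hypothetical per-key lock table with idle-time eviction."""

    def __init__(self, max_idle_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_idle_s = max_idle_s
        self._clock = clock
        self._locks: dict[str, asyncio.Lock] = {}
        self._last_used: dict[str, float] = {}

    def get(self, key: str) -> asyncio.Lock:
        lock = self._locks.get(key)
        if lock is None:
            lock = self._locks[key] = asyncio.Lock()
        self._last_used[key] = self._clock()  # every lookup counts as a touch
        return lock

    def evict_idle(self) -> int:
        """Drop locks idle longer than max_idle_s; skip any lock that is held."""
        cutoff = self._clock() - self._max_idle_s
        stale = [k for k, t in self._last_used.items()
                 if t < cutoff and not self._locks[k].locked()]
        for k in stale:
            del self._locks[k]
            del self._last_used[k]
        return len(stale)

    def __len__(self) -> int:
        return len(self._locks)
```

`evict_idle()` could be called from the periodic reconciler or a timer. The idle threshold should stay far above the 10s refresh timeout, so that a lock with waiters queued on it cannot look idle at the moment it is released.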

4.4 Why this is transitional, not the final answer

Per-entity locks reduce blast radius but don't fix the underlying problem: N OPAs need to be kept in sync. Proposal A removes the N-OPAs entirely, which makes per-entity locks moot (the gate goes away).

If we're shipping Proposal A in the next 2-4 weeks, skip B. If A is more like 1-3 months out, ship B as a bridge — it's a small change with measurable impact on tail latency.


  1. Ship Proposal B (per-entity locks) first — small PR, eliminates cross-entity blocking immediately, reduces blast radius. ~50 LOC + tests.

  2. Plan and ship Proposal A (centralized OPA) as a 3-week project:

    • Week 1: Build + Phase 1 (deploy alongside sidecars).
    • Week 2: Phase 2-4 (env switch + staging + prod canary).
    • Week 3: Phase 5-6 (full rollout + sidecar removal).
  3. Delete the gate code in Phase 6. This is the win — the residual flake bug class is structurally gone, and the operational surface area drops from 5 interlocking pieces to 1 clean service.


  • Do we keep the periodic reconciler in the centralized model? Probably yes — as a backstop in case the OPA service restarts and loses its data document. Frequency can drop from 60s → 5min.
  • Can we replicate OPA writes to two regions? Yes via OPA bundles, but only worth it if we have a multi-region SLO. Currently single-region (europe-west1).
  • How does this interact with the per-tenant Rego module work? Same way — those modules upload via PUT /v1/policies/<id> and would now hit the centralized OPA. No conceptual change.
  • What about ephemeral / dev mode? The TAPPASS_ALLOW_EPHEMERAL flag still works — when set, tappass skips OPA queries entirely. Centralized OPA doesn't change this.

  • It is not a request to ship right now. It's a design proposal for the user to react to.
  • It is not a replacement for any of the PR-#219 → #226 fixes. Those are real and stay merged. They make the current architecture as good as it can be — Proposal A is what makes it actually bullet-proof.
  • It is not a database migration. Postgres remains the source of truth and its schema doesn't change.