Centralized OPA + per-entity gate locks — Concept
Status: Concept / architecture draft. Not yet a feature spec.

Date: 2026-05-04

Origin: End of the PR-#219 → #226 series that closed every deterministic policy-enforcement gap. The 7-PR sequence got Tests 4/10/11 to 100% reliability, but Tests 6/7/8 still occasionally flake (~1 cluster failure per 3-run batch). The remaining cause is structural (per-instance OPA sidecars with cross-instance synchronisation), not anything a smaller patch can fix. This concept proposes the architectural shift the user approved at the end of that conversation.
1. The thesis
Replace the "OPA sidecar per Cloud Run instance + KV-version-gate cross-instance sync" architecture with one shared OPA cluster that every tappass instance queries directly. Eliminate the cross-instance state-divergence bug class entirely. As a transitional measure for any code that still cares about per-instance refresh ordering, switch the version gate's single asyncio.Lock to per-entity locks so a slow refresh on one entity stops blocking refreshes on others.
The 7-PR series proved the per-instance-OPA-with-version-gate model is correctable but not stable. Every fix exposed a smaller race. The architecture itself is the issue.
2. What we have today
               ┌──────────┐
               │ Postgres │  (source of truth)
               └─────┬────┘
                     │
         ┌───────────┼───────────┐
         ↓           ↓           ↓
    ┌──────────┐┌──────────┐┌──────────┐
    │ tappass A││ tappass B││ tappass C│
    │ + OPA A  ││ + OPA B  ││ + OPA C  │
    └────┬─────┘└────┬─────┘└────┬─────┘
         │           │           │
         └───────────┼───────────┘
                     ↓
                ┌──────────┐
                │  Redis   │  (KV: per-entity version stamps)
                └──────────┘

Each tappass Cloud Run instance runs an OPA sidecar in the same container. Three instances = three OPAs = three copies of data.tappass.policy_data. The PolicyVersionGate keeps them in sync via Redis-published version stamps.
What works: the gate is correct. PR #219–#226 closed every deterministic gap (event-loop binding, store cache invalidation, refresh timeout, bootstrap contention, cold-start priming).
What still hurts:
- N OPAs to keep in sync. Every PUT must propagate to N instances. The gate does this lazily on read. There is always a window where an instance has stale data.
- Cross-instance refresh contention. When the gate refreshes, it pushes to local OPA. If multiple instances refresh the same entity simultaneously (mid-burst Cloud Run scale), they all hit OPA's single-threaded transaction processor in series.
- Rollout-window staleness. During a deploy, the old revision keeps serving while the new one warms. Old revisions don't have the latest gate fixes. Every release introduces a small flake window.
- Operational surface area. PolicyVersionGate, store invalidation, periodic reconciler, bootstrap reconciler, gate priming — five interlocking pieces, all to solve the same underlying problem: keeping N OPAs consistent.
What the cost actually is: ~15% of policy_floor reads after a PUT, in the first ~10 seconds, on the wrong instance, see stale data. Production impact is small (the very next read sees fresh data — the gate retries) but it's a real correctness window we cannot fully close without changing the architecture.
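To make the staleness window concrete, here is a toy model of a lazy, read-time version gate. It is a sketch under assumptions, not the real PolicyVersionGate: plain dicts stand in for Postgres and Redis, and `ToyInstance`, `put_policy`, `read_gated` and `read_ungated` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins: plain dicts play the roles of Postgres and Redis.
source_of_truth: dict[str, str] = {}   # entity -> policy (Postgres)
version_stamps: dict[str, int] = {}    # entity -> version (Redis KV)


def put_policy(entity: str, policy: str) -> None:
    """Write path: persist the policy, then bump the shared version stamp."""
    source_of_truth[entity] = policy
    version_stamps[entity] = version_stamps.get(entity, 0) + 1


@dataclass
class ToyInstance:
    """One tappass instance with its own local OPA copy (illustrative only)."""
    local_data: dict[str, str] = field(default_factory=dict)
    applied: dict[str, int] = field(default_factory=dict)

    def read_ungated(self, entity: str) -> Optional[str]:
        # What the local OPA answers if nothing has refreshed it yet.
        return self.local_data.get(entity)

    def read_gated(self, entity: str) -> Optional[str]:
        # Lazy gate: refresh the local copy only when the shared stamp moved.
        wanted = version_stamps.get(entity, 0)
        if self.applied.get(entity, 0) < wanted:
            self.local_data[entity] = source_of_truth[entity]
            self.applied[entity] = wanted
        return self.local_data.get(entity)


a, b = ToyInstance(), ToyInstance()
put_policy("org_a", "deny")
a.read_gated("org_a")            # instance A catches up on its first read
stale = b.read_ungated("org_a")  # instance B still holds its old (empty) copy
fresh = b.read_gated("org_a")    # B's first gated read refreshes it
```

In this model each instance converges only when it next reads through the gate, which is the window the list above describes. With one shared OPA there is no per-instance copy left to converge.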
3. Proposal A — centralized OPA cluster
3.1 Architecture

               ┌──────────┐
               │ Postgres │  (still source of truth)
               └─────┬────┘
                     │
         ┌───────────┼───────────┐
         ↓           ↓           ↓
    ┌──────────┐┌──────────┐┌──────────┐
    │ tappass A││ tappass B││ tappass C│
    └────┬─────┘└────┬─────┘└────┬─────┘
         │           │           │
         └───────────┼───────────┘
                     ↓
            ┌──────────────────┐
            │   tappass-opa    │  (Cloud Run service)
            │  (1+ replicas)   │
            └──────────────────┘

A single Cloud Run service tappass-opa, internal-ingress-only (only reachable from tappass instances). Every tappass instance points its OPA client at the service URL instead of localhost:8181. Writes go to one place. Reads see exactly one truth.
3.2 What this eliminates
| Component | Status after migration |
|---|---|
| `PolicyVersionGate` | Delete entirely |
| KV (Redis) policy version stamps | Delete entirely |
| `OrgPolicyStore.invalidate` / `ProjectStore.invalidate` | Keep (still useful for non-OPA paths like dashboard reads) |
| Periodic reconciler (`reconcile_loop.py`) | Keep (backstop for OPA restart) |
| Bootstrap reconciler (batched) | Keep (seeds OPA on cold start) |
| Per-write push (`push_org_policy` / `push_project`) | Keep (now goes to the centralized URL) |
| `_applied` per-instance tracking | Delete entirely |
| Gate priming, refresh timeout, cache invalidation in gate | Delete entirely |
Net code change: +1 small Terraform module, –~600 lines of gate machinery + tests.
3.3 Availability
A single OPA is a single point of failure unless we replicate.

- Cloud Run with min_instances ≥ 2 gives us multi-replica out of the box.
- Replica consistency: OPA replicas need to share the `data` document. Three options:

| Option | Mechanism | Trade-off |
|---|---|---|
| A. Single writer + bundle pull | Writes go to one designated replica; others pull every N seconds via OPA's Bundle API | Simple. Eventual consistency, ~1-5s lag between replicas. |
| B. Sticky-session per tenant | Cloud Run routes by org_id; one tenant always hits one replica | No sync needed but loses load-balancing freedom. |
| C. Single replica + Cloud Run auto-recovery | Just min=1, max=1; rely on Cloud Run to restart on failure | Simplest. Brief outage on restart (~5s). |

Recommendation: start with C (single replica, Cloud Run auto-recovery). Our policy_data write rate is single-digit per minute. Restarts are rare. Add multi-replica later if/when SLO demands it.
3.4 Latency
| Path | Today (sidecar) | Centralized | Delta |
|---|---|---|---|
| OPA decision query | ~30ms (localhost) | ~35-40ms (intra-VPC) | +5-10ms |
| Policy write (PUT) | ~600ms (single entity) | ~600ms (same) | none |
| Bootstrap reconcile | ~1s (batched, PR #224) | ~1s (same) | none |
The 5-10ms decision-query overhead is well below the 30s pipeline-step ceiling. We currently spend that much on cache-staleness retries — it's a worthwhile trade.
3.5 Security
- `tappass-opa` is internal-ingress only (`ingress = "internal"` in Terraform).
- mTLS between tappass and tappass-opa via SPIFFE.
- Same OPA bundle (Rego rules) baked into the new service's image.
- Same signing config. No new attack surface.
3.6 Migration path
Each step is a deploy that can be safely reverted.
| Phase | Change | Risk |
|---|---|---|
| 1. Build | New tappass-opa Cloud Run service, internal-ingress, deployed alongside existing sidecars. Not yet wired up. | Zero — additive. |
| 2. Env switch | Add TAPPASS_OPA_URL env var. Default localhost:8181 (sidecar). When set, tappass uses centralized URL. | Zero — opt-in. |
| 3. Staging | Set env var on staging, point at tappass-opa. Run deep e2e for 24h. | Low — staging only. |
| 4. Prod canary | One prod instance opt-in via env var. Watch latency + error budget for 24h. | Low — single instance. |
| 5. Prod rollout | Flip env var on all prod instances. Sidecars still running (unused). | Low — quick rollback. |
| 6. Cleanup | Remove sidecars from container spec. Remove PolicyVersionGate, KV bumps, gate-priming code. | Zero once Phase 5 is stable. |
Phase 6 is when the residual flakes go to zero.
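The Phase 2 env switch can be sketched as a small resolver. This is a minimal illustration, assuming the variable holds a full base URL and that the sidecar default is plain HTTP; `resolve_opa_url` and `DEFAULT_OPA_URL` are hypothetical names, and only `TAPPASS_OPA_URL` and `localhost:8181` come from the table above.

```python
import os
from typing import Mapping, Optional

# Assumption: the sidecar listens on plain HTTP at localhost:8181.
DEFAULT_OPA_URL = "http://localhost:8181"


def resolve_opa_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OPA base URL: the centralized service if set, else the sidecar."""
    env = os.environ if env is None else env
    configured = env.get("TAPPASS_OPA_URL", "").strip()
    # An unset or empty variable falls back to the sidecar, so the switch is opt-in.
    return (configured or DEFAULT_OPA_URL).rstrip("/")
```

Because an empty value falls back to the sidecar, rolling back Phases 3 to 5 amounts to clearing the variable and redeploying.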
3.7 What we keep doing the same
- Postgres as durable source of truth.
- Per-write `push_org_policy` / `push_project` after every PUT. These now hit the centralized OPA, get processed in OPA's transaction order, and that's it.
- Periodic reconciler: still a backstop in case OPA restarts mid-incident.
- Bootstrap reconciler: still seeds OPA on cold start (just one cold start now, on the OPA service, not N).
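As an illustration of the per-write push, the sketch below builds the HTTP request such a push could send to OPA's Data API (`PUT /v1/data/<path>` creates or overwrites a document). The path layout under `tappass/policy_data`, the payload shape and the name `build_push_request` are assumptions for illustration; the real `push_org_policy` / `push_project` may differ.

```python
import json
import urllib.parse
import urllib.request


def build_push_request(opa_url: str, kind: str, ent_id: str, policy: dict) -> urllib.request.Request:
    """Build (but do not send) a Data API write for one entity's policy."""
    # Assumption: entities live at data.tappass.policy_data.<kind>.<ent_id>.
    safe_id = urllib.parse.quote(ent_id, safe="")
    path = f"/v1/data/tappass/policy_data/{kind}/{safe_id}"
    return urllib.request.Request(
        url=opa_url.rstrip("/") + path,
        data=json.dumps(policy).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_push_request("http://localhost:8181", "org", "org_a", {"floor": "strict"})
# Sending is left out so the sketch runs without a live OPA:
# urllib.request.urlopen(req, timeout=5)
```

Only the base URL changes between the sidecar and the centralized service, which is why the push code survives the migration.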
3.8 Validation plan
The migration is correct iff, after Phase 6, the deep e2e is 3/3 green on Tests 4, 6, 7, 8, 10, 11, 12 across 10 consecutive runs. Specifically:

- Test 7's 20-call burst: 0 leaks, deterministically.
- Test 8 post-reset: always allows, deterministically.
- No `policy_version_gate_*` events in Cloud Logging (because that code is gone).
- `opa_decision` p99 latency stays under 100ms.
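The acceptance criteria above can be expressed as a single check. This is a rough sketch: how run results, log counts and latency samples are collected is not specified here, so the inputs and the names `p99` and `migration_validated` are hypothetical. The percentile uses the nearest-rank method.

```python
from typing import Sequence


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def migration_validated(
    run_results: Sequence[bool],    # one entry per deep e2e run, True = all green
    gate_event_count: int,          # policy_version_gate_* events seen in logs
    latencies_ms: Sequence[float],  # opa_decision latency samples
    required_runs: int = 10,
    p99_budget_ms: float = 100.0,
) -> bool:
    last = list(run_results)[-required_runs:]
    return (
        len(last) == required_runs
        and all(last)                     # 10 consecutive green runs
        and gate_event_count == 0         # the gate code is gone
        and p99(latencies_ms) < p99_budget_ms
    )
```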
4. Proposal B — per-entity locks (transitional)
If we don't do the full migration immediately, we can address the smaller "cross-entity blocking" issue right now with a one-file change.
4.1 Today's lock
Section titled “4.1 Today's lock”class PolicyVersionGate: def __init__(self, ...): self._lock = asyncio.Lock() # ONE lock for the whole gate
async def _ensure(self, ..., kind, ent_id, ...): async with self._lock: # refresh THIS entityA slow refresh on org_a blocks every other entity's refresh. With per-instance OPA contention, that means a stuck entity can stall the entire gate for 10s (the refresh timeout). Other tenants are blast-radius.
4.2 Proposed lock
Section titled “4.2 Proposed lock”class PolicyVersionGate: def __init__(self, ...): self._locks: dict[str, asyncio.Lock] = {} self._locks_lock = asyncio.Lock() # for managing the dict
async def _get_lock(self, key: str) -> asyncio.Lock: async with self._locks_lock: lock = self._locks.get(key) if lock is None: lock = asyncio.Lock() self._locks[key] = lock return lock
async def _ensure(self, ..., kind, ent_id, ...): local_key = f"{kind}:{ent_id}" lock = await self._get_lock(local_key) async with lock: # refresh THIS entityOne lock per (kind, ent_id) tuple. Refreshes for org_a and org_b proceed in parallel. Concurrent refreshes for the same entity still serialise (correct — we don't want duplicate work).
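The behaviour described above can be checked with a small runnable demo. It is a sketch, not the gate itself: `PerEntityLocks` and `refresh` are hypothetical names, and `asyncio.sleep` stands in for the push to OPA. One simplification is deliberate: the lookup contains no `await`, so on a single event loop it cannot interleave with another coroutine, and the demo therefore omits the extra lock around the dict.

```python
import asyncio


class PerEntityLocks:
    """Hypothetical helper: one asyncio.Lock per "kind:ent_id" key."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = {}

    def get(self, kind: str, ent_id: str) -> asyncio.Lock:
        # No await between lookup and insert, so this runs atomically
        # with respect to other coroutines on the same event loop.
        key = f"{kind}:{ent_id}"
        lock = self._locks.get(key)
        if lock is None:
            lock = self._locks[key] = asyncio.Lock()
        return lock


async def refresh(locks: PerEntityLocks, kind: str, ent_id: str, seconds: float, log: list) -> None:
    async with locks.get(kind, ent_id):
        await asyncio.sleep(seconds)  # stands in for the refresh work
        log.append(ent_id)


async def main() -> list:
    locks, log = PerEntityLocks(), []
    await asyncio.gather(
        refresh(locks, "org", "org_a", 0.2, log),   # slow entity
        refresh(locks, "org", "org_b", 0.01, log),  # different entity: not blocked
        refresh(locks, "org", "org_a", 0.01, log),  # same entity: waits its turn
    )
    return log


order = asyncio.run(main())
```

Here org_b completes while the slow org_a refresh is still in flight, and the second org_a refresh runs only after the first releases its lock. With a single gate-wide lock, org_b would have waited behind org_a.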
4.3 Memory
Each asyncio.Lock is ~200 bytes. With 1000 active orgs+projects per instance, that's ~200KB of locks. Acceptable.
For high-cardinality cleanup: an LRU eviction (drop locks not touched in 1h) keeps the dict bounded. Not strictly needed at our scale.
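One way the idle-lock cleanup could look is sketched below. It is an assumption-laden illustration: `ExpiringLocks` is a hypothetical name, eviction is driven by last-touch time instead of a strict LRU order, and a lock that is currently held is never dropped, so an in-flight refresh keeps its lock.

```python
import asyncio
import time
from typing import Callable


class ExpiringLocks:
    """Hypothetical lock registry that forgets idle, unheld locks."""

    def __init__(self, max_idle_s: float = 3600.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: dict[str, tuple[asyncio.Lock, float]] = {}
        self._max_idle_s = max_idle_s
        self._clock = clock

    def get(self, key: str) -> asyncio.Lock:
        entry = self._entries.get(key)
        lock = entry[0] if entry else asyncio.Lock()
        self._entries[key] = (lock, self._clock())  # record the last touch
        return lock

    def evict_idle(self) -> int:
        """Drop locks untouched for max_idle_s that nobody holds; return the count."""
        cutoff = self._clock() - self._max_idle_s
        stale = [
            key
            for key, (lock, touched) in self._entries.items()
            if touched < cutoff and not lock.locked()
        ]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def __len__(self) -> int:
        return len(self._entries)
```

`evict_idle` could be called from the periodic reconciler or a timer. A lock that was just fetched always has a fresh timestamp, so a one-hour cutoff leaves a wide margin between fetching a lock and entering it.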
4.4 Why this is transitional, not the final answer
Per-entity locks reduce blast radius but don't fix the underlying problem: N OPAs need to be kept in sync. Proposal A removes the N-OPAs entirely, which makes per-entity locks moot (the gate goes away).
If we're shipping Proposal A in the next 2-4 weeks, skip B. If A is more like 1-3 months out, ship B as a bridge — it's a small change with measurable impact on tail latency.
5. Recommendation
- Ship Proposal B (per-entity locks) first: a small PR that eliminates cross-entity blocking immediately and reduces blast radius. ~50 LOC + tests.
- Plan and ship Proposal A (centralized OPA) as a 3-week project:
  - Week 1: Build + Phase 1 (deploy alongside sidecars).
  - Week 2: Phases 2-4 (env switch + staging + prod canary).
  - Week 3: Phases 5-6 (full rollout + sidecar removal).
- Delete the gate code in Phase 6. This is the win: the residual flake bug class is structurally gone, and the operational surface area drops from 5 interlocking pieces to 1 clean service.
6. Open questions
- Do we keep the periodic reconciler in the centralized model? Probably yes, as a backstop in case the OPA service restarts and loses its data document. Frequency can drop from 60s → 5min.
- Can we replicate OPA writes to two regions? Yes via OPA bundles, but only worth it if we have a multi-region SLO. Currently single-region (europe-west1).
- How does this interact with the per-tenant Rego module work? Same way: those modules upload via `PUT /v1/policies/<id>` and would now hit the centralized OPA. No conceptual change.
- What about ephemeral / dev mode? The `TAPPASS_ALLOW_EPHEMERAL` flag still works: when set, tappass skips OPA queries entirely. Centralized OPA doesn't change this.
7. What this concept is NOT
- It is not a request to ship right now. It's a design proposal for the user to react to.
- It is not a replacement for any of the PR-#219 → #226 fixes. Those are real and stay merged. They make the current architecture as good as it can be — Proposal A is what makes it actually bullet-proof.
- It is not a database migration. Postgres remains the source of truth and its schema doesn't change.