Centralized OPA + per-entity gate locks — Concept

Status: Concept / architecture draft. Not yet a feature spec.
Date: 2026-05-04
Origin: End of the PR-#219 → #226 series that closed every deterministic policy-enforcement gap. The 7-PR sequence got Tests 4/10/11 to 100% reliability, but Tests 6/7/8 still occasionally flake (~1 cluster failure per 3-run batch). The remaining cause is structural (per-instance OPA sidecars with cross-instance synchronisation), not something a smaller patch can fix. This concept proposes the architectural shift the user approved at the end of that conversation.

1. The thesis

Replace the "OPA sidecar per Cloud Run instance + KV-version-gate cross-instance sync" architecture with one shared OPA cluster that every tappass instance queries directly. Eliminate the cross-instance state-divergence bug class entirely. As a transitional measure for any code that still cares about per-instance refresh ordering, switch the version gate's single asyncio.Lock to per-entity locks so a slow refresh on one entity stops blocking refreshes on others.

The 7-PR series proved the per-instance-OPA-with-version-gate model is correctable but not stable. Every fix exposed a smaller race. The architecture itself is the issue.

2. What we have today

              ┌──────────┐
              │ Postgres │   (source of truth)
              └─────┬────┘
                    │
        ┌───────────┼───────────┐
        ↓           ↓           ↓
  ┌──────────┐┌──────────┐┌──────────┐
  │ tappass A││ tappass B││ tappass C│
  │ + OPA A  ││ + OPA B  ││ + OPA C  │
  └────┬─────┘└────┬─────┘└────┬─────┘
       │           │           │
       └───────────┼───────────┘
                   ↓
              ┌──────────┐
              │  Redis   │   (KV: per-entity version stamps)
              └──────────┘

Each tappass Cloud Run instance runs an OPA sidecar in the same container. Three instances = three OPAs = three copies of data.tappass.policy_data. The PolicyVersionGate keeps them in sync via Redis-published version stamps.

What works: the gate is correct. PR #219–#226 closed every deterministic gap (event-loop binding, store cache invalidation, refresh timeout, bootstrap contention, cold-start priming).

What still hurts:

N OPAs to keep in sync. Every PUT must propagate to N instances. The gate does this lazily on read. There is always a window where an instance has stale data.
Cross-instance refresh contention. When the gate refreshes, it pushes to local OPA. If multiple instances refresh the same entity simultaneously (for example during a Cloud Run scale-out burst), they all hit OPA's single-threaded transaction processor in series.
Rollout-window staleness. During a deploy, the old revision keeps serving while the new one warms. Old revisions don't have the latest gate fixes. Every release introduces a small flake window.
Operational surface area. PolicyVersionGate, store invalidation, periodic reconciler, bootstrap reconciler, gate priming — five interlocking pieces, all to solve the same underlying problem: keeping N OPAs consistent.

What the cost actually is: ~15% of policy_floor reads made in the first ~10 seconds after a PUT land on an instance that has not refreshed yet and see stale data. Production impact is small (the gate retries, so the very next read sees fresh data), but it is a real correctness window that we cannot fully close without changing the architecture.

3. Proposal A — centralized OPA cluster

3.1 Architecture

              ┌──────────┐
              │ Postgres │   (still source of truth)
              └─────┬────┘
                    │
        ┌───────────┼───────────┐
        ↓           ↓           ↓
  ┌──────────┐┌──────────┐┌──────────┐
  │ tappass A││ tappass B││ tappass C│
  └────┬─────┘└────┬─────┘└────┬─────┘
       │           │           │
       └───────────┼───────────┘
                   ↓
              ┌──────────────────┐
              │  tappass-opa     │   (Cloud Run service)
              │  (1+ replicas)   │
              └──────────────────┘

A single Cloud Run service tappass-opa, internal-ingress-only (only reachable from tappass instances). Every tappass instance points its OPA client at the service URL instead of localhost:8181. Writes go to one place. Reads see exactly one truth.

3.2 What this eliminates

Component	Status after migration
`PolicyVersionGate`	Delete entirely
KV (Redis) policy version stamps	Delete entirely
OrgPolicyStore.invalidate / ProjectStore.invalidate	Keep (still useful for non-OPA paths like dashboard reads)
Periodic reconciler (`reconcile_loop.py`)	Keep — backstop for OPA restart
Bootstrap reconciler (batched)	Keep — seeds OPA on cold start
Per-write push (push_org_policy / push_project)	Keep — now goes to the centralized URL
`_applied` per-instance tracking	Delete entirely
Gate priming, refresh timeout, cache invalidation in gate	Delete entirely

Net code change: one small Terraform module added, roughly 600 lines of gate machinery and tests removed.

3.3 Availability

A single OPA = a single point of failure unless we replicate.

Cloud Run with min_instances ≥ 2 gives us multi-replica out of the box.

Replica consistency: OPA replicas need to share the data document. Three options:

Option	Mechanism	Trade-off
A. Single writer + bundle pull	Writes go to one designated replica; others pull every N seconds via OPA's Bundle API	Simple. Eventual consistency, ~1-5s lag between replicas.
B. Sticky-session per tenant	Cloud Run routes by org_id; one tenant always hits one replica	No sync needed but loses load-balancing freedom.
C. Single replica + Cloud Run auto-recovery	Just min=1, max=1; rely on Cloud Run to restart on failure	Simplest. Brief outage on restart (~5s).

Recommendation: start with C (single replica, Cloud Run auto-recovery). Our policy_data write rate is single-digit per minute. Restarts are rare. Add multi-replica later if/when SLO demands it.

3.4 Latency

Path	Today (sidecar)	Centralized	Delta
OPA decision query	~30ms (localhost)	~35-40ms (intra-VPC)	+5-10ms
Policy write (PUT)	~600ms (single entity)	~600ms (same)	none
Bootstrap reconcile	~1s (batched, PR #224)	~1s (same)	none

The 5-10ms decision-query overhead is well below the 30s pipeline-step ceiling. We currently spend that much on cache-staleness retries — it's a worthwhile trade.
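
To put the added hop in proportion, the figures from the table above can be compared against both today's query time and the pipeline-step ceiling. This is only a back-of-the-envelope sketch: the constant names are illustrative, and the inputs are the approximate values quoted in this section, not new measurements.

```python
# Figures taken from the latency table above; all values in milliseconds.
SIDECAR_QUERY_MS = 30.0
ADDED_MS_LOW, ADDED_MS_HIGH = 5.0, 10.0
PIPELINE_STEP_CEILING_MS = 30_000.0  # the 30s pipeline-step ceiling


def overhead_share(added_ms: float, reference_ms: float) -> float:
    """Return added latency as a fraction of a reference duration."""
    return added_ms / reference_ms


# Relative to today's sidecar query, the extra hop is a visible increase.
vs_query = (
    overhead_share(ADDED_MS_LOW, SIDECAR_QUERY_MS),
    overhead_share(ADDED_MS_HIGH, SIDECAR_QUERY_MS),
)
# Relative to the pipeline-step ceiling, it is a tiny fraction.
vs_ceiling = (
    overhead_share(ADDED_MS_LOW, PIPELINE_STEP_CEILING_MS),
    overhead_share(ADDED_MS_HIGH, PIPELINE_STEP_CEILING_MS),
)

print(f"vs sidecar query: {vs_query[0]:.0%} to {vs_query[1]:.0%}")
print(f"vs step ceiling:  {vs_ceiling[0]:.3%} to {vs_ceiling[1]:.3%}")
```

In other words, each decision query gets roughly a sixth to a third slower, while the share of the 30s ceiling consumed by the extra hop stays well under a tenth of a percent.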

3.5 Security

tappass-opa is internal-ingress only (ingress = "internal" in Terraform).
mTLS between tappass and tappass-opa via SPIFFE.
Same OPA bundle (Rego rules) baked into the new service's image.
Same signing config. No new attack surface.

3.6 Migration path

Each step is a deploy that's safely revertable.

Phase	Change	Risk
1. Build	New `tappass-opa` Cloud Run service, internal-ingress, deployed alongside existing sidecars. Not yet wired up.	Zero — additive.
2. Env switch	Add `TAPPASS_OPA_URL` env var. Default `localhost:8181` (sidecar). When set, tappass uses centralized URL.	Zero — opt-in.
3. Staging	Set env var on staging, point at `tappass-opa`. Run deep e2e for 24h.	Low — staging only.
4. Prod canary	One prod instance opt-in via env var. Watch latency + error budget for 24h.	Low — single instance.
5. Prod rollout	Flip env var on all prod instances. Sidecars still running (unused).	Low — quick rollback.
6. Cleanup	Remove sidecars from container spec. Remove `PolicyVersionGate`, KV bumps, gate-priming code.	Zero once Phase 5 is stable.

Phase 6 is when the residual flakes go to zero.
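
The Phase 2 env switch could look roughly like the sketch below. Only the TAPPASS_OPA_URL variable and the localhost:8181 default come from the table above. The function name, the `env` parameter and the rule that a bare host:port means plain HTTP are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Default from Phase 2: the local sidecar address.
DEFAULT_OPA_ADDRESS = "localhost:8181"


def resolve_opa_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OPA base URL, preferring TAPPASS_OPA_URL when it is set.

    Hypothetical helper: the real tappass client may resolve this differently.
    """
    env = os.environ if env is None else env
    # An unset or blank variable falls back to the sidecar (opt-in behaviour).
    raw = env.get("TAPPASS_OPA_URL", "").strip() or DEFAULT_OPA_ADDRESS
    # Assumption: a value without a scheme is treated as plain HTTP.
    if "://" not in raw:
        raw = f"http://{raw}"
    return raw.rstrip("/")
```

Because an unset variable resolves to the sidecar, rolling back Phases 3 to 5 means unsetting the variable and redeploying. No code revert is involved.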

3.7 What we keep doing the same

Postgres as durable source of truth.
Per-write push_org_policy / push_project after every PUT — these now hit the centralized OPA, get processed in OPA's transaction order, and that's it.
Periodic reconciler — still a backstop in case OPA restarts mid-incident.
Bootstrap reconciler — still seeds OPA on cold start (just one cold start now, on the OPA service, not N).
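
As an illustration of the per-write push against the centralized service, the sketch below builds (but does not send) the HTTP request. It relies on OPA's Data API, where `PUT /v1/data/<path>` creates or overwrites a document. The function name and the per-entity layout under data.tappass.policy_data are assumptions; the real push_org_policy / push_project may use a different path and payload shape.

```python
import json
import urllib.parse
import urllib.request


def build_policy_push(
    base_url: str, kind: str, ent_id: str, policy: dict
) -> urllib.request.Request:
    """Build the PUT request that writes one entity's policy data to OPA.

    Hypothetical layout: data.tappass.policy_data.<kind>.<ent_id>
    """
    # Quote the entity id so unusual characters cannot alter the path.
    safe_id = urllib.parse.quote(ent_id, safe="")
    url = f"{base_url}/v1/data/tappass/policy_data/{kind}/{safe_id}"
    body = json.dumps(policy).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Example: the same call works for a sidecar or the centralized service;
# only base_url differs.
request = build_policy_push(
    "http://localhost:8181", "org", "org_a", {"floor": "strict"}
)
```

Passing the request to `urllib.request.urlopen` with a timeout would perform the write. With a single OPA there is exactly one such write per PUT, instead of one per instance.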

3.8 Validation plan

The migration is correct iff: after Phase 6, the deep e2e is 3/3 green on Tests 4, 6, 7, 8, 10, 11, 12 across 10 consecutive runs. Specifically:

Test 7's 20-call burst: 0 leaks, deterministically.
Test 8 post-reset: always allows, deterministically.
No policy_version_gate_* events in Cloud Logging (because that code is gone).
opa_decision p99 latency stays under 100ms.
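
The last two criteria can be checked mechanically. The sketch below assumes latency samples and event names have already been exported from Cloud Logging into plain Python lists. The function names and the export step are hypothetical, and the 100ms budget is the one stated above.

```python
import statistics
from typing import Iterable, List, Sequence

P99_BUDGET_MS = 100.0
GATE_EVENT_PREFIX = "policy_version_gate_"


def p99(samples_ms: Sequence[float]) -> float:
    """Return the 99th percentile of the given latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute a percentile")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[98]


def leftover_gate_events(event_names: Iterable[str]) -> List[str]:
    """Return any event names that still come from the deleted gate code."""
    return [name for name in event_names if name.startswith(GATE_EVENT_PREFIX)]


def validation_passes(samples_ms: Sequence[float], event_names: Iterable[str]) -> bool:
    """True when p99 is within budget and no gate events were logged."""
    return p99(samples_ms) < P99_BUDGET_MS and not leftover_gate_events(event_names)
```

The deterministic Test 7 and Test 8 outcomes still have to come from the e2e suite itself. This only covers the log-derived checks.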

4. Proposal B — per-entity locks (transitional)

If we don't do the full migration immediately, we can address the smaller "cross-entity blocking" issue right now with a one-file change.

4.1 Today's lock

class PolicyVersionGate:
    def __init__(self, ...):
        self._lock = asyncio.Lock()  # ONE lock for the whole gate

    async def _ensure(self, ..., kind, ent_id, ...):
        async with self._lock:
            # refresh THIS entity

A slow refresh on org_a blocks every other entity's refresh. With per-instance OPA contention, that means a stuck entity can stall the entire gate for 10s (the refresh timeout). Every other tenant is inside the blast radius.

4.2 Proposed lock

class PolicyVersionGate:
    def __init__(self, ...):
        self._locks: dict[str, asyncio.Lock] = {}
        self._locks_lock = asyncio.Lock()  # for managing the dict

    async def _get_lock(self, key: str) -> asyncio.Lock:
        async with self._locks_lock:
            lock = self._locks.get(key)
            if lock is None:
                lock = asyncio.Lock()
                self._locks[key] = lock
            return lock

    async def _ensure(self, ..., kind, ent_id, ...):
        local_key = f"{kind}:{ent_id}"
        lock = await self._get_lock(local_key)
        async with lock:
            # refresh THIS entity

One lock per (kind, ent_id) tuple. Refreshes for org_a and org_b proceed in parallel. Concurrent refreshes for the same entity still serialise (correct — we don't want duplicate work).

4.3 Memory

Each asyncio.Lock is ~200 bytes. With 1000 active orgs+projects per instance, that's ~200KB of locks. Acceptable.

For high-cardinality cleanup: an LRU eviction (drop locks not touched in 1h) keeps the dict bounded. Not strictly needed at our scale.
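
If eviction ever becomes necessary, one possible shape is sketched below. The class name, the injectable clock and the rule "never evict a held lock" are assumptions for illustration, not part of the current gate. The held-lock check matters because dropping a lock that a coroutine still holds would let a second refresh for the same entity start on a fresh lock.

```python
import asyncio
import time
from collections import OrderedDict
from typing import Callable


class EvictingLockTable:
    """Per-key asyncio locks with idle-time eviction (hypothetical sketch)."""

    def __init__(
        self,
        max_idle_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Maps key -> (lock, last_used); ordered least recently used first.
        self._entries: OrderedDict = OrderedDict()
        self._max_idle = max_idle_seconds
        self._clock = clock

    def __len__(self) -> int:
        return len(self._entries)

    def get(self, key: str) -> asyncio.Lock:
        entry = self._entries.pop(key, None)
        lock = entry[0] if entry is not None else asyncio.Lock()
        # Re-insert at the most recently used end with a fresh timestamp.
        self._entries[key] = (lock, self._clock())
        return lock

    def evict_idle(self) -> int:
        """Drop locks idle for longer than max_idle_seconds; return the count."""
        cutoff = self._clock() - self._max_idle
        removed = 0
        for key, (lock, last_used) in list(self._entries.items()):
            if last_used > cutoff:
                break  # everything after this entry was used more recently
            if lock.locked():
                continue  # never drop a lock that is currently held
            del self._entries[key]
            removed += 1
        return removed
```

`evict_idle` could be called from an existing periodic task. A dedicated background loop is probably not justified at the scale described above.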

4.4 Why this is transitional, not the final answer

Per-entity locks reduce blast radius but don't fix the underlying problem: N OPAs need to be kept in sync. Proposal A removes the N-OPAs entirely, which makes per-entity locks moot (the gate goes away).

If we're shipping Proposal A in the next 2-4 weeks, skip B. If A is more like 1-3 months out, ship B as a bridge — it's a small change with measurable impact on tail latency.

5. Recommendation

Ship Proposal B (per-entity locks) first — small PR, eliminates cross-entity blocking immediately, reduces blast radius. ~50 LOC + tests.
Plan and ship Proposal A (centralized OPA) as a 3-week project:
- Week 1: Build + Phase 1 (deploy alongside sidecars).
- Week 2: Phase 2-4 (env switch + staging + prod canary).
- Week 3: Phase 5-6 (full rollout + sidecar removal).
Delete the gate code in Phase 6. This is the win — the residual flake bug class is structurally gone, and the operational surface area drops from 5 interlocking pieces to 1 clean service.

6. Open questions

Do we keep the periodic reconciler in the centralized model? Probably yes — as a backstop in case the OPA service restarts and loses its data document. Frequency can drop from 60s → 5min.
Can we replicate OPA writes to two regions? Yes via OPA bundles, but only worth it if we have a multi-region SLO. Currently single-region (europe-west1).
How does this interact with the per-tenant Rego module work? Same way — those modules upload via PUT /v1/policies/<id> and would now hit the centralized OPA. No conceptual change.
What about ephemeral / dev mode? The TAPPASS_ALLOW_EPHEMERAL flag still works — when set, tappass skips OPA queries entirely. Centralized OPA doesn't change this.

7. What this concept is NOT

It is not a request to ship right now. It's a design proposal for the user to react to.
It is not a replacement for any of the PR-#219 → #226 fixes. Those are real and stay merged. They make the current architecture as good as it can be — Proposal A is what makes it actually bullet-proof.
It is not a database migration. Postgres remains the source of truth and its schema doesn't change.