Security architecture

TapPass's security posture has four pillars. They stack — the outer layers protect the inner ones, and every layer fails closed. This page is the map; the sub-pages are the detail.

The four pillars

Identity — every actor (human, service, agent, customer SDK) has a verifiable identity before they touch the request path.
Trust zones — hard boundaries between planes. Crossings are explicit, logged, and rate-limited.
Encryption — data is encrypted at rest under a KMS-wrapped DEK; secrets never leave Google Secret Manager in plaintext; transport is TLS to the edge and mTLS or signed within the VPC.
Audit — every decision lands in a hash-chained, Ed25519-signed trail that's tamper-evident under offline verification.

Trust zones

Four concentric zones. Each crossing is a boundary with its own auth mechanism; no request skips zones.

flowchart TB
  subgraph Z1["Zone 1 — Public internet"]
    U["End users / customer agents"]
  end
  subgraph Z2["Zone 2 — Cloudflare edge"]
    CF["WAF + DDoS<br/>+ Access for internal-docs<br/>+ Pages hosting"]
  end
  subgraph Z3["Zone 3 — Cloud Run pod"]
    TP["tappass container<br/>(FastAPI, uvicorn workers)"]
    OPA["OPA sidecar<br/>localhost:8181"]
    TP <-->|loopback<br/>no TLS| OPA
  end
  subgraph Z4["Zone 4 — Private VPC"]
    DB["Cloud SQL Postgres<br/>unix socket + MD5"]
    RD["Memorystore Redis<br/>password + TLS"]
    SM["Secret Manager<br/>IAM-bound"]
    KMS["KMS<br/>IAM-bound"]
    GCS["GCS<br/>IAM-bound"]
  end

  U -->|TLS 1.3| CF
  CF -->|TLS + X-Origin-Verify HMAC| TP
  TP -->|VPC connector<br/>PRIVATE_RANGES_ONLY| DB
  TP -->|VPC connector| RD
  TP -->|Google API| SM
  TP -->|Google API| KMS
  TP -->|Google API| GCS

  classDef zone1 fill:#3a1f1f,stroke:#8a4646,color:#f5d5d5
  classDef zone2 fill:#3a2e1b,stroke:#8a7240,color:#f5e7c7
  classDef zone3 fill:#1f3a2c,stroke:#468a68,color:#c7f5d7
  classDef zone4 fill:#1b2a3a,stroke:#406f8a,color:#c7dff5
  class U zone1
  class CF zone2
  class TP,OPA zone3
  class DB,RD,SM,KMS,GCS zone4

A request falls through these zones once. Every step is a boundary:

Edge → Cloud Run: Cloudflare must sign each request with X-Origin-Verify, or the AuthMiddleware rejects it. This stops anyone who discovers the .run.app URL from bypassing the WAF.
Cloud Run → VPC: egress is PRIVATE_RANGES_ONLY, so the container can't reach the public internet. The only exceptions are allow-listed vendor APIs (Resend, Sentry, PostHog, LLM providers), which go out via 0.0.0.0/0 on named egress.
VPC → data: Cloud SQL uses a unix-socket mount (no TCP exposure), Memorystore is password-auth + private IP, Secret Manager is IAM-bound, KMS is IAM-bound.
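
The Edge → Cloud Run check can be sketched as follows. This is a minimal illustration, not the AuthMiddleware code: it assumes the header carries a hex HMAC-SHA256 over the method and path under a shared secret, and the names `sign_origin`, `origin_status`, and `EXEMPT_PREFIXES` are hypothetical. Only the `/health*` exemption and the 401 on a missing header come from this page.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical allow-list; this page only names `/health*` as exempt.
EXEMPT_PREFIXES = ("/health",)


def sign_origin(secret: bytes, method: str, path: str) -> str:
    """Edge side: compute the header value (assumed HMAC-SHA256, hex-encoded)."""
    message = f"{method.upper()} {path}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def origin_status(secret: bytes, method: str, path: str,
                  header: Optional[str]) -> int:
    """Origin side: return 200 to continue or 401 to reject."""
    if path.startswith(EXEMPT_PREFIXES):
        return 200  # health checks skip the edge signature
    if not header:
        return 401  # a missing header fails closed
    expected = sign_origin(secret, method, path)
    # compare_digest avoids leaking the mismatch position through timing.
    if hmac.compare_digest(expected.encode(), header.encode()):
        return 200
    return 401
```

A request that reaches the `.run.app` URL without passing through the edge has no way to produce the header, so it gets a 401 before any handler runs.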

Identity flow

TapPass has three distinct identity primitives, each for a different actor. Confusing them is the most common security mistake.

1. Customer `tp_` keys — for agents

sequenceDiagram
  participant A as Agent (customer)
  participant GW as Gateway
  participant DB as developer_keys (hashed)
  participant RD as Redis
  A->>GW: Authorization: Bearer tp_dev_…
  GW->>GW: api_key.py — Argon2id hash
  GW->>DB: lookup hash
  DB-->>GW: agent_id, org_id
  GW->>RD: rate-limit check
  GW->>GW: request.state.account = Account

Format: tp_dev_<random> (developer key) or tp_live_<random> (scoped deploy key). Stored hashed (Argon2id) in developer_keys.
Rotation: customer-initiated via dashboard; old key revoked, not deleted. See Rotate API keys.
Scope: one key → one agent_id. A key can never escalate across orgs.
Transport: Authorization: Bearer tp_... header over TLS.
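
The key format above can be checked before any hashing or lookup happens. The sketch below is an assumption-laden illustration, not the contents of api_key.py: `ParsedKey` and `parse_bearer_key` are hypothetical names, and the Argon2id hashing and developer_keys lookup are deliberately left out.

```python
from typing import NamedTuple, Optional

# The two key kinds named on this page.
KEY_KINDS = {"dev": "developer key", "live": "scoped deploy key"}


class ParsedKey(NamedTuple):
    kind: str    # "dev" or "live"
    secret: str  # the random part after the prefix


def parse_bearer_key(authorization: Optional[str]) -> Optional[ParsedKey]:
    """Return the parsed key, or None when the header is not a tp_ key."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return None
    # Split into at most three parts so the random part may contain "_".
    parts = token.strip().split("_", 2)
    if len(parts) != 3 or parts[0] != "tp":
        return None
    _, kind, secret = parts
    if kind not in KEY_KINDS or not secret:
        return None
    return ParsedKey(kind, secret)
```

Returning None for anything malformed lets the caller map every failure to the same 401, which keeps the boundary fail-closed.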

2. Session JWTs — for humans on the dashboard

sequenceDiagram
  participant H as Human (browser)
  participant IdP as SSO IdP (Google / Azure)
  participant API as /sso/callback
  participant DB as auth_sessions

  H->>IdP: OIDC / SAML sign-in
  IdP-->>H: id_token
  H->>API: /sso/callback?code=…
  API->>IdP: exchange code
  IdP-->>API: verified email + domain
  API->>API: mint session JWT (EdDSA)
  API->>DB: revoke prior sessions for this account
  API->>DB: insert new session
  API-->>H: Set-Cookie: tappass_session=… (HttpOnly, Secure, SameSite)

Issuer: TapPass backend; signing key in Secret Manager (TAPPASS_JWT_SECRET / TAPPASS_TOKEN_KEY_FILE).
Rotation: every login revokes prior sessions (identity/signup.py:318-325), so a stolen session JWT stays valid only until the account's next login.
Storage: HttpOnly + Secure cookie; never in localStorage.
Allowed domains: gated by TAPPASS_SSO_ALLOWED_DOMAINS so the SSO provider can't log in arbitrary Gmail users.
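
The allowed-domains gate can be illustrated with a short sketch. It assumes TAPPASS_SSO_ALLOWED_DOMAINS holds a comma-separated list, which this page does not confirm; `parse_allowed_domains` and `is_allowed_email` are hypothetical names.

```python
def parse_allowed_domains(raw: str) -> frozenset:
    """Parse the setting (assumed comma-separated) into lowercase domains."""
    return frozenset(d.strip().lower() for d in raw.split(",") if d.strip())


def is_allowed_email(email: str, allowed: frozenset) -> bool:
    """Accept only a verified email whose domain is on the allow-list."""
    if email.count("@") != 1:
        return False
    local, _, domain = email.partition("@")
    if not local or not domain:
        return False
    # An empty allow-list rejects everyone, so a missing setting fails closed.
    return domain.lower() in allowed
```

The comparison is an exact match on the domain, so a subdomain or a lookalike suffix is rejected unless it is listed explicitly.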

3. SPIFFE / JWT-SVID — for inter-service workload identity

sequenceDiagram
  participant WL as Customer workload / SDK
  participant SPIRE as Customer SPIRE Agent
  participant GW as tappass backend
  participant B as SPIFFE bundle (pinned)

  WL->>SPIRE: fetch JWT-SVID
  SPIRE-->>WL: JWT-SVID (short-lived)
  WL->>GW: Authorization: Bearer JWT-SVID
  GW->>B: resolve signing bundle
  B-->>GW: trust anchor for this SPIFFE ID
  GW->>GW: verify signature + audience + expiry
  GW->>GW: request.state.workload = SpiffeId

Used by customer-side SDK deployments that want mTLS-equivalent identity without the cert dance — the SDK obtains a JWT-SVID from their SPIRE Agent and we verify it against a pinned bundle.
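
The audience and expiry part of that verification can be sketched on its own. This covers the claim checks only: signature verification against the pinned bundle is out of scope here and must happen before any claim is trusted. `decode_claims` and `check_claims` are hypothetical names, and the trust-domain check is an assumption about how the pinned bundle is scoped.

```python
import base64
import json
import time
from typing import Optional


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a compact JWT WITHOUT verifying it."""
    _header, payload_b64, _signature = token.split(".")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def check_claims(claims: dict, audience: str, trust_domain: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the SPIFFE ID if the claims are acceptable, else None."""
    now = time.time() if now is None else now
    sub = claims.get("sub")
    if not isinstance(sub, str) or not sub.startswith(f"spiffe://{trust_domain}/"):
        return None  # subject must be a SPIFFE ID in the expected trust domain
    aud = claims.get("aud")
    if isinstance(aud, str):
        aud = [aud]
    if not isinstance(aud, list) or audience not in aud:
        return None  # token was minted for a different audience
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None  # missing or past expiry
    return sub
```

Any failed check yields None rather than a partial identity, so the caller never sets request.state.workload from a token that did not pass every step.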

Fail-closed boundaries

A fail-closed boundary rejects the request when its own data source is unavailable; a fail-open boundary lets the request through instead. We fail closed on everything that matters:

Boundary	Behaviour on dependency failure
Auth middleware	Cannot verify `tp_` key → 401
OPA policy engine	Timeout (>500ms) → deny (`opa_authz_unavailable_denied`)
Vault (credential access)	Vault unreachable → 503 to agent, no fallback
Audit writer	DB unavailable → retry with backoff, not silent drop
X-Origin-Verify	Missing header → 401 unless allow-listed path (`/health*`)
Rate limiter	Redis unavailable → in-memory fallback (documented trade-off)
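
The OPA row is the clearest example of the pattern. The sketch below shows one way to turn a timeout or an error into a deny; it is not the gateway's code. `authorize` and `query_opa` are hypothetical names, and the `allowed` and `policy_denied` reason strings are invented for illustration. Only the 500 ms budget and `opa_authz_unavailable_denied` come from the table.

```python
import concurrent.futures
from typing import Callable, Tuple

OPA_TIMEOUT_SECONDS = 0.5  # the 500 ms budget from the table above

# Shared pool, so a slow policy call never blocks the request thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def authorize(query_opa: Callable[[], bool],
              timeout: float = OPA_TIMEOUT_SECONDS) -> Tuple[bool, str]:
    """Run the policy query; any timeout or error becomes a deny."""
    future = _pool.submit(query_opa)
    try:
        allowed = future.result(timeout=timeout)
    except Exception:
        # Covers the timeout as well as connection or decoding errors.
        return False, "opa_authz_unavailable_denied"
    # Only an explicit True allows; None or any other value denies.
    if allowed is True:
        return True, "allowed"
    return False, "policy_denied"
```

The key design choice is that there is no code path where an exception results in an allow: the only way to return True is an explicit True from the policy engine within the budget.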

Encryption at rest

Two layers that stack. The row ciphertext in Postgres is not decryptable on its own: an attacker with DB access alone can't read a provider key.

flowchart BT
  Row["Row in Postgres<br/>vault_llm_keys.ciphertext"]
  DEK["DEK (data-encryption key)<br/>per-org, in memory"]
  Seed["DEK seed<br/>Secret Manager<br/>(staging + compose path)"]
  KMS["KMS KEK<br/>projects/.../cryptoKeys/vault-dek<br/>(prod path)"]

  Row -->|AEAD-decrypt with DEK| DEK
  DEK -->|unwrap| KMS
  DEK -.->|or PBKDF2-derive| Seed

  classDef data fill:#1b2a3a,stroke:#406f8a,color:#c7dff5
  classDef key fill:#1f3a2c,stroke:#468a68,color:#c7f5d7
  classDef kms fill:#3a2e1b,stroke:#8a7240,color:#f5e7c7
  class Row data
  class DEK,Seed key
  class KMS kms

Two paths, toggled by TAPPASS_VAULT_KEY_KMS:

Empty / unset: DEK seed comes straight from TAPPASS_VAULT_KEY in Secret Manager; PBKDF2 derives the AEAD key per row.
Set to a KMS key URI: DEK is generated per-org, wrapped by KMS at rest, unwrapped at request time. The KMS KEK never leaves Google's HSMs.

Prod runs the KMS path (shipped in commit 2a1843f). Staging runs the seed path because it keeps local tests cheap.

See tappass/vault/crypto.py for the implementation.
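
For orientation, the path toggle and the seed-path derivation can be sketched as below. This is not the code in tappass/vault/crypto.py: `vault_key_path` and `derive_row_key` are hypothetical names, and the per-row salt, SHA-256 as the PBKDF2 hash, the iteration count, and the 32-byte key length are all assumptions. The AEAD step and the KMS unwrap are omitted.

```python
import hashlib
from typing import Mapping

PBKDF2_ITERATIONS = 600_000  # assumed value, not taken from the codebase


def vault_key_path(env: Mapping[str, str]) -> str:
    """Return "kms" when TAPPASS_VAULT_KEY_KMS is set, else "seed"."""
    return "kms" if env.get("TAPPASS_VAULT_KEY_KMS", "").strip() else "seed"


def derive_row_key(seed: bytes, row_salt: bytes,
                   iterations: int = PBKDF2_ITERATIONS) -> bytes:
    """Seed path: derive a 32-byte key for one row from the DEK seed."""
    return hashlib.pbkdf2_hmac("sha256", seed, row_salt, iterations, dklen=32)
```

Because the derivation is deterministic, the same seed and salt always give the same key, while a different salt gives an unrelated key, so one row's key reveals nothing about another's.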

Secret management in one picture

flowchart LR
  OP["1Password<br/>Engineering vault"]
  TF["terraform tfvars<br/>gitignored"]
  SM["GCP Secret Manager<br/>tappass-*-dsn, *-api-key, *-vault-*"]
  CR["Cloud Run<br/>env var via secretKeyRef"]

  OP -->|human eyes,<br/>break-glass| TF
  TF -->|terraform apply| SM
  SM -->|secretKeyRef at<br/>container start| CR

  classDef human fill:#3a2e1b,stroke:#8a7240,color:#f5e7c7
  classDef code fill:#1b2a3a,stroke:#406f8a,color:#c7dff5
  class OP human
  class TF,SM,CR code

Every runtime secret is in Secret Manager; code reaches it via tappass.secrets.get(name), which checks env vars first, then the configured backend.
Humans access the same values via 1Password for break-glass; 1Password is the master — Secret Manager is the deployment target.
tfvars files are gitignored; the .example versions are committed with placeholders.
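
The env-first lookup order can be shown in a few lines. This is a sketch of the behaviour described above, not the tappass.secrets module: `get_secret` is a hypothetical stand-in, the backend is modelled as a plain callable, and it assumes the env var name equals the secret name.

```python
import os
from typing import Callable, Mapping, Optional


def get_secret(name: str,
               backend: Callable[[str], Optional[str]],
               env: Mapping[str, str] = os.environ) -> str:
    """Check env vars first, then the configured backend."""
    value = env.get(name)
    if value:
        return value  # injected via secretKeyRef at container start
    value = backend(name)
    if value is None:
        # The error names the secret but never includes a value.
        raise KeyError(f"secret {name!r} not found in env or backend")
    return value
```

Raising on a missing secret, instead of returning an empty string, keeps a misconfigured deployment from starting with a blank credential.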

Full detail: Secret management.

Audit: why chained + signed

Every decision writes an AuditEvent. Each event hashes together the previous event's hash + the new event's payload, then signs the result with the audit chain's Ed25519 key. Tampering with any event breaks the chain for every subsequent event.

Detection: the integrity-check background worker (observability/background.py:integrity_check_worker) re-verifies the chain every 4 hours and fires a SEV alert on mismatch.
Retention: audit events are the one object exempt from GDPR erasure — see Customer data export for how we anonymise the user_id while keeping the event.
Export: a daily cold-storage copy goes to GCS to meet the SLA in the DPA.
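
The chaining step can be sketched with the standard library. The example below covers the hash chain only and omits the Ed25519 signature; SHA-256, the canonical JSON encoding, the all-zero genesis value, and the names `event_hash`, `append_event`, and `first_broken_index` are assumptions for illustration, not the AuditEvent implementation.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumed starting value for the first event


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event's hash together with this event's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_event(chain: List[dict], payload: dict) -> None:
    """Append an event linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "hash": event_hash(prev, payload)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Return the index of the first event that fails verification, or None."""
    prev = GENESIS
    for i, event in enumerate(chain):
        if event["hash"] != event_hash(prev, event["payload"]):
            return i
        prev = event["hash"]
    return None
```

Editing a payload breaks verification at that event. Recomputing that event's hash to hide the edit only moves the break to the next event, which is why an attacker would have to rewrite, and re-sign, everything after the tampered entry.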

Full detail: Audit trail internals.

Threat model shorthand

Attacker	Primary defence	Secondary
Scanner hitting `.run.app` URL directly	X-Origin-Verify HMAC	Rate limiter
Leaked customer `tp_` key	Hashed storage + self-serve rotation	Per-agent scoping: one key ≠ org takeover
Compromised session cookie	Login-rotation of session JWT	SameSite + Secure + HttpOnly
Rogue SDK flooding `/hooks`	Rate limiter (Redis)	Per-agent cost envelope
Insider with DB read	Row-level crypto under KMS DEK	No plaintext LLM keys in DB
Vendor breach (Sentry, PostHog)	`send_default_pii=False`; no secrets in events	Scrub middleware (`_scrub_event`)

We deliberately do not defend against a fully-compromised GCP project — if an attacker has GCP Organization Admin, the game is over and our DPA recovery SLA kicks in.

Do's and don'ts

Do

Land new secrets in Secret Manager via terraform, with a corresponding secretKeyRef in cloudrun.tf.
Emit an AuditEvent for every new decision path.
Fail closed by default; make fail-open explicit with a comment that justifies it.

Don't

Log secrets or PII (the _scrub_event hook catches common cases but it's not a safety net for sloppy logger.info).
Add middleware that bypasses AuthMiddleware — extend it or add a new allow-list entry.
Cache credentials. The vault is the source of truth, every call.
Introduce a new identity primitive. If tp_, session JWT, or SPIFFE doesn't fit the use case, escalate before writing code.

Also see

Access control — how humans get provisioned.
Secret management — rotation playbook.
Audit trail internals — hash-chain mechanics.
Compliance program — frameworks, ownership, DPA.
Deployment architecture — where each layer runs physically.