Feature flags
We use feature flags to ship risky changes safely. We also remove them — a flag that’s been on for 30 days in prod is technical debt.
Where they live
    flags:
      new_trust_engine:
        default: false
        rollout:
          staging: true
          prod:
            percent: 10
            tenants: ["early-access-tenant-id"]
        owner: "@jens"
        created: 2026-03-15
        remove_by: 2026-06-15

      multiline_pii_detection:
        default: true  # fully rolled out, candidate for removal
        owner: "@eng-lead"
        created: 2026-01-10
        remove_by: 2026-04-10  # overdue, remove next sprint

Config hot-reloads on file change. No restart.
Reading a flag
    from tappass.config import feature_flags

    if feature_flags.is_on("new_trust_engine", tenant_id=ctx.tenant.id):
        result = new_engine.compute(...)
    else:
        result = legacy_engine.compute(...)

The evaluator checks:
- Tenant allowlist (explicit “on” for this tenant)
- Tenant blocklist (explicit “off”)
- Percentage rollout (deterministic hash of tenant_id → stable assignment)
- Environment default
- Global default
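
The checks above can be sketched roughly as follows. This is a minimal illustration, not the actual tappass evaluator: the `FlagRule` dataclass, its field names, the `bucket` helper, the choice of SHA-256, and the assumption that the checks run top to bottom with the first match winning are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlagRule:
    """Hypothetical in-memory form of one flag in one environment."""
    default: bool = False                     # global default
    env_default: Optional[bool] = None        # environment default, if set
    percent: int = 0                          # percentage rollout, 0-100
    allow: set = field(default_factory=set)   # tenants explicitly on
    block: set = field(default_factory=set)   # tenants explicitly off


def bucket(flag: str, tenant_id: str) -> int:
    """Map (flag, tenant) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_on(flag: str, rule: FlagRule, tenant_id: str) -> bool:
    if tenant_id in rule.allow:        # 1. tenant allowlist
        return True
    if tenant_id in rule.block:        # 2. tenant blocklist
        return False
    if rule.percent > 0 and bucket(flag, tenant_id) < rule.percent:
        return True                    # 3. percentage rollout
    if rule.env_default is not None:   # 4. environment default
        return rule.env_default
    return rule.default                # 5. global default
```

Because the bucket depends only on its inputs, a tenant that is on at 10% stays on when the rollout ramps to 25%. Salting the hash with the flag name is an added design choice here (so the same tenants are not first in every rollout); the text above only says the hash is over tenant_id.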
Naming
- Lowercase snake_case
- Describes the new behaviour being gated, not the old one
- Good: new_trust_engine, multiline_pii_detection, background_audit_export
- Bad: disable_legacy_engine, experimental_mode_v2, alice_test
“Disable X” inverts the polarity and makes reading the code harder. “Experimental” tells you nothing.
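
Part of these rules can be checked mechanically. The sketch below is one possible lint, assuming a hypothetical `naming_problems` helper and a deny-list of prefixes taken from the "Bad" examples; nothing like it is confirmed to exist in the repo.

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
# Hypothetical deny-list drawn from the "Bad" examples above.
BAD_PREFIXES = ("disable_", "experimental_")


def naming_problems(name: str) -> list[str]:
    """Return the naming-rule violations found in a flag name."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("not lowercase snake_case")
    for prefix in BAD_PREFIXES:
        if name.startswith(prefix):
            problems.append(f"starts with '{prefix}'")
    return problems
```

A check like this passes alice_test, since nothing about that name is syntactically wrong. Whether a name describes the behaviour being gated still needs a human reviewer.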
Lifecycle
    propose → ship gated → ramp → fully-on → remove
                            │
                            └─ remove_by deadline

Every flag has a remove_by date set at creation. Six weeks default. If it’s still needed after that, either:
- Promote to config. If the behaviour is tenant-specific and permanent, move to the tenant’s policy profile.
- Renegotiate. Set a new remove_by with a reason in the PR description.
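
A deadline is only useful if something checks it. As a rough sketch, assuming flags have already been loaded into a plain dict with `remove_by` parsed as a `date` (the loader, the dict shape, and both helper names are hypothetical), overdue flags could be listed like this:

```python
from datetime import date, timedelta

DEFAULT_LIFETIME = timedelta(weeks=6)  # the six-week default from the text


def default_remove_by(created: date) -> date:
    """Deadline for a flag created on the given day."""
    return created + DEFAULT_LIFETIME


def overdue_flags(flags: dict, today: date) -> list[str]:
    """Names of flags whose remove_by date has passed, sorted."""
    return sorted(
        name for name, cfg in flags.items()
        if cfg["remove_by"] < today
    )


flags = {
    "new_trust_engine": {"remove_by": date(2026, 6, 15)},
    "multiline_pii_detection": {"remove_by": date(2026, 4, 10)},
}
# The "today" value is an arbitrary example date.
print(overdue_flags(flags, today=date(2026, 5, 1)))
# prints ['multiline_pii_detection']
```

Run in CI or as a scheduled job, a check of this kind would turn an overdue remove_by into a visible failure rather than a comment in the config.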
Removal
Removing a flag is a single PR that:
- Deletes the if flag.is_on(): branches, keeping only the new path
- Removes the flag row from config/feature_flags.yaml
- Deletes any test that asserted both-branch behaviour
No deprecation comments. No TODO markers. Just delete.
Kill-switches
Some flags are kill-switches rather than rollout gates. These stay in config indefinitely and default to “safe”:

    kill_switches:
      disable_azure_content_safety:
        default: false
        description: "Emergency off-switch if Azure backend goes sideways"

      fail_closed_on_detection_backend_error:
        default: false
        description: "Flip to true if a backend corrupts requests"

Kill-switches don’t have remove_by. They exist for emergencies and are reviewed quarterly.
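
To make the second switch concrete, here is a hedged sketch of what "fail closed" could mean at a call site. The `BackendError` exception, the `run_detection` function, and the "allow"/"block" return values are all hypothetical, and the real code may read the switch differently.

```python
class BackendError(Exception):
    """Hypothetical error raised when a detection backend fails."""


def run_detection(text: str, backend, fail_closed: bool) -> str:
    """Return "allow" or "block" for a request.

    With fail_closed off, a backend failure lets the request through.
    With it on, a backend failure blocks the request.
    """
    try:
        return "block" if backend(text) else "allow"
    except BackendError:
        return "block" if fail_closed else "allow"
```

The switch changes behaviour only on the error path, so flipping it in an emergency does not alter results for requests the backend handles normally.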
What NOT to flag
- Bug fixes. If it’s a bug, fix it. No flag.
- Documentation changes. Obviously.
- Refactors. If you refactor behind a flag and the flag stays forever, you’ve just made the code worse.
- UI-only experiments — those go through PostHog feature flags in the frontend, not the backend config.
Observability on flags
Every flag evaluation emits a metric:

    tappass_feature_flag_eval_total{flag="new_trust_engine", tenant="...", value="on"}

Grafana dashboard “Feature flags” shows eval distribution. If a flag is set to 10% but you see 40% on, there’s a bug in your rollout criteria. Investigate before it causes an incident.
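
The 10% versus 40% comparison can be expressed as a small check over the "on" and "off" counts from that metric. This is only a sketch: the function names and the 5-point tolerance are placeholders, and how the counts are fetched from the metrics backend is left out.

```python
def observed_on_percent(on_count: int, off_count: int) -> float:
    """Share of evaluations that returned "on", as a percentage."""
    total = on_count + off_count
    if total == 0:
        return 0.0
    return 100.0 * on_count / total


def rollout_drift(configured_percent: float, on_count: int, off_count: int,
                  tolerance: float = 5.0) -> bool:
    """True when the observed share is more than `tolerance` points off config."""
    observed = observed_on_percent(on_count, off_count)
    return abs(observed - configured_percent) > tolerance
```

For example, `rollout_drift(10, on_count=40, off_count=60)` returns `True`. The counter counts evaluations rather than tenants, so counting distinct tenant label values per flag value may give a fairer comparison when a few tenants dominate traffic.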
Audit trail
Enabling a flag for a tenant (or globally) emits a feature_flag.changed event with before/after values. The event goes into the audit trail alongside everything else, so we can tell you exactly when a flag changed for any tenant.
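
For illustration, such an event might carry fields like the ones below. Only the event name and the presence of before/after values come from the text above; the `FlagChanged` dataclass, the remaining field names, and the `flag_changed_event` helper are assumptions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FlagChanged:
    """Hypothetical payload for a feature_flag.changed audit event."""
    flag: str
    tenant_id: Optional[str]   # None for a global change
    before: bool
    after: bool
    changed_at: str            # ISO 8601 timestamp, UTC
    event: str = "feature_flag.changed"


def flag_changed_event(flag: str, tenant_id: Optional[str],
                       before: bool, after: bool,
                       now: Optional[datetime] = None) -> dict:
    """Build the event as a plain dict, ready to serialise."""
    now = now or datetime.now(timezone.utc)
    return asdict(FlagChanged(flag, tenant_id, before, after, now.isoformat()))
```

Recording both values, rather than only the new one, lets a reader of the audit trail see what a change did without replaying earlier events.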