Feature flags

We use feature flags to ship risky changes safely. We also remove them — a flag that’s been on for 30 days in prod is technical debt.

config/feature_flags.yaml
flags:
  new_trust_engine:
    default: false
    rollout:
      staging: true
      prod:
        percent: 10
        tenants: ["early-access-tenant-id"]
    owner: "@jens"
    created: 2026-03-15
    remove_by: 2026-06-15
  multiline_pii_detection:
    default: true # fully rolled out, candidate for removal
    owner: "@eng-lead"
    created: 2026-01-10
    remove_by: 2026-04-10 # overdue, remove next sprint

Config hot-reloads on file change. No restart.
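
One way to get reload-without-restart is to compare the file's modification time on each read and re-parse only when it changes. The sketch below is an illustration under assumptions, not the tappass loader: `ReloadingConfig` is a hypothetical name, and it reads JSON so the example stays stdlib-only (the real file is YAML, and the real loader may use file-system notifications rather than polling).

```python
import json
import os


class ReloadingConfig:
    """Re-reads a config file whenever its modification time changes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path):
        self._path = path
        self._mtime_ns = None  # mtime of the version currently held in memory
        self._data = {}

    def get(self):
        # A stat() per access is cheap; the file is parsed only after a change.
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as fh:
                self._data = json.load(fh)
            self._mtime_ns = mtime_ns
        return self._data
```

Callers go through `get()` every time instead of caching the returned dict, which is what makes a file edit visible without a restart.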

from tappass.config import feature_flags

if feature_flags.is_on("new_trust_engine", tenant_id=ctx.tenant.id):
    result = new_engine.compute(...)
else:
    result = legacy_engine.compute(...)

The evaluator checks these in order and stops at the first one that applies:

  1. Tenant allowlist (explicit on for this tenant)
  2. Tenant blocklist (explicit off)
  3. Percentage rollout (deterministic hash of tenant_id → stable assignment)
  4. Environment default
  5. Global default
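
The five checks above can be sketched as one function. This is a minimal illustration under several assumptions, not the actual evaluator: the flag dict mirrors the YAML shape shown earlier, `blocked_tenants` is a hypothetical key (the sample config only shows `tenants`), the hash is assumed to be SHA-256 over the flag name and tenant id, a configured `percent` is treated as decisive, and a bare boolean under an environment is treated as that environment's default.

```python
import hashlib


def bucket(flag_name, tenant_id):
    """Map (flag, tenant) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_on(flag_name, flag, env, tenant_id):
    """Evaluate one flag for one tenant in one environment."""
    rollout = flag.get("rollout", {}).get(env)
    if isinstance(rollout, dict):
        if tenant_id in rollout.get("tenants", []):          # 1. allowlist
            return True
        if tenant_id in rollout.get("blocked_tenants", []):  # 2. blocklist (hypothetical key)
            return False
        if "percent" in rollout:                             # 3. percentage rollout
            return bucket(flag_name, tenant_id) < rollout["percent"]
    elif isinstance(rollout, bool):                          # 4. environment default
        return rollout
    return bool(flag.get("default", False))                  # 5. global default
```

Hashing the flag name together with the tenant id keeps each tenant's assignment stable across evaluations while avoiding the same tenants landing in the first 10% of every flag.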

Naming rules for flags:

  • Lowercase snake_case
  • Describes the new behaviour being gated, not the old one
  • Good: new_trust_engine, multiline_pii_detection, background_audit_export
  • Bad: disable_legacy_engine, experimental_mode_v2, alice_test

“Disable X” inverts the polarity and makes reading the code harder. “Experimental” tells you nothing.
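
The naming rules above are mechanical enough to lint. The sketch below is one possible check, with a hypothetical function name and illustrative word lists that would need tuning; it is meant for rollout flags only, since kill-switches are named differently.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Illustrative lists, not an agreed standard.
INVERTED_PREFIXES = ("disable_", "no_")
VAGUE_WORDS = {"experimental", "test", "tmp", "temp"}


def name_problems(name):
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not lowercase snake_case")
    if name.startswith(INVERTED_PREFIXES):
        problems.append("inverted polarity")
    if VAGUE_WORDS & set(name.split("_")):
        problems.append("uninformative word")
    return problems
```

A check like this could run in CI against the keys under `flags:` so that a badly named flag is caught in the PR that introduces it.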

propose → ship gated → ramp → fully-on → remove
└─ remove_by deadline

Every flag has a remove_by date set at creation. Six weeks default. If it’s still needed after that, either:

  1. Promote to config. If the behaviour is tenant-specific and permanent, move to the tenant’s policy profile.
  2. Renegotiate. Set a new remove_by with a reason in the PR description.
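
Because every flag carries a remove_by date, overdue flags can be listed automatically. A minimal sketch, assuming the config has already been parsed into a dict whose remove_by values are `datetime.date` objects (a YAML loader such as PyYAML typically produces these for unquoted ISO dates); `overdue_flags` is a hypothetical helper name.

```python
from datetime import date


def overdue_flags(flags, today):
    """Return (name, owner, days_overdue) for flags past remove_by, most overdue first."""
    late = []
    for name, spec in flags.items():
        remove_by = spec.get("remove_by")
        if remove_by is not None and remove_by < today:
            late.append((name, spec.get("owner"), (today - remove_by).days))
    return sorted(late, key=lambda row: row[2], reverse=True)


flags = {
    "new_trust_engine": {"owner": "@jens", "remove_by": date(2026, 6, 15)},
    "multiline_pii_detection": {"owner": "@eng-lead", "remove_by": date(2026, 4, 10)},
}
report = overdue_flags(flags, today=date(2026, 4, 20))
```

With the two flags from the sample config and a check date of 2026-04-20, only multiline_pii_detection is reported, ten days past its deadline.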

Removing a flag is a single PR that:

  1. Deletes the if feature_flags.is_on(...): branches, keeping only the new path
  2. Removes the flag row from config/feature_flags.yaml
  3. Deletes any test that asserted both-branch behaviour

No deprecation comments. No TODO markers. Just delete.

Some flags are kill-switches rather than rollout gates. These stay in config indefinitely and default to “safe”:

kill_switches:
  disable_azure_content_safety:
    default: false
    description: "Emergency off-switch if Azure backend goes sideways"
  fail_closed_on_detection_backend_error:
    default: false
    description: "Flip to true if a backend corrupts requests"

Kill-switches don’t have remove_by. They exist for emergencies and are reviewed quarterly.

Don't put these behind a flag:

  • Bug fixes. If it’s a bug, fix it. No flag.
  • Documentation changes. Obviously.
  • Refactors. If you refactor behind a flag and the flag stays forever, you’ve just made the code worse.
  • UI-only experiments — those go through PostHog feature flags in the frontend, not the backend config.

Every flag evaluation emits a metric:

tappass_feature_flag_eval_total{flag="new_trust_engine", tenant="...", value="on"}

Grafana dashboard “Feature flags” shows eval distribution. If a flag is set to 10% but you see 40% on, there’s a bug in your rollout criteria — investigate before it causes an incident.
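
The 10%-versus-40% comparison can also be automated. A rough sketch, assuming the on and off counts have already been pulled from the metric above; the function name and the five-point tolerance are illustrative assumptions.

```python
def rollout_drift(on_count, off_count, configured_percent, tolerance_points=5.0):
    """Return the observed on-percentage if it is more than tolerance_points
    away from the configured percentage, otherwise None."""
    total = on_count + off_count
    if total == 0:
        return None  # no evaluations yet, nothing to compare
    observed = 100.0 * on_count / total
    if abs(observed - configured_percent) > tolerance_points:
        return observed
    return None
```

Assignment is per tenant while the counter is per evaluation, so a few high-traffic or allowlisted tenants can move the ratio on their own. Counting distinct `tenant` label values per `value` is likely a stricter comparison than raw evaluation counts.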

Enabling a flag for a tenant (or globally) emits a feature_flag.changed event with before/after values. The event goes into the audit trail alongside everything else — we can tell you exactly when a flag changed for any tenant.
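
For orientation, a feature_flag.changed event might look something like the dict built below. Only the event name and the presence of before/after values come from the description above; every other field name (`flag`, `tenant_id`, `actor`, `at`) and the helper itself are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def flag_changed_event(flag, before, after, actor, tenant_id=None, now=None):
    """Build a hypothetical audit payload for a flag change."""
    now = now or datetime.now(timezone.utc)
    return {
        "type": "feature_flag.changed",
        "flag": flag,
        "tenant_id": tenant_id,  # None stands for a global change
        "before": before,
        "after": after,
        "actor": actor,
        "at": now.isoformat(),
    }


event = flag_changed_event(
    "new_trust_engine", False, True, "@jens",
    tenant_id="early-access-tenant-id",
    now=datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc),
)
serialized = json.dumps(event)
```

Recording both values rather than only the new one means a single event is enough to answer "what was it before?" without replaying earlier history.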