Feature flags

We use feature flags to ship risky changes safely. We also remove them — a flag that's been on for 30 days in prod is technical debt.

Where they live

Flags are defined in config/feature_flags.yaml:

flags:
  new_trust_engine:
    default: false
    rollout:
      staging: true
      prod:
        percent: 10
        tenants: ["early-access-tenant-id"]
    owner: "@jens"
    created: 2026-03-15
    remove_by: 2026-06-15

  multiline_pii_detection:
    default: true     # fully rolled out — candidate for removal
    owner: "@eng-lead"
    created: 2026-01-10
    remove_by: 2026-04-10  # overdue, remove next sprint

Config hot-reloads on file change. No restart.
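
One way to get reload-on-change without a restart is to check the file's modification time on each read and re-parse only when it has moved. The sketch below is an illustration under that assumption, not the tappass loader: ReloadingConfig and the injected parse function are hypothetical names, and a real loader would pass in a YAML parser.

```python
import os
from typing import Any, Callable


class ReloadingConfig:
    """Re-parse a config file whenever its mtime or size changes (hypothetical helper)."""

    def __init__(self, path: str, parse: Callable[[str], Any]) -> None:
        self._path = path
        self._parse = parse   # e.g. a YAML loader; injected so the sketch stays stdlib-only
        self._stamp = None    # (mtime_ns, size) of the version currently cached
        self._data = None

    def get(self) -> Any:
        st = os.stat(self._path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:  # first read, or the file changed since the last read
            with open(self._path, encoding="utf-8") as fh:
                self._data = self._parse(fh.read())
            self._stamp = stamp
        return self._data
```

Polling on read keeps the sketch small. A file-watcher would avoid the stat call per read, at the cost of a background thread.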

Reading a flag

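from tappass.config import feature_flags

# Gated path: taken only when the flag evaluates to on for this tenant.
if feature_flags.is_on("new_trust_engine", tenant_id=ctx.tenant.id):
    result = new_engine.compute(...)
# Legacy path: this branch is deleted when the flag is removed.
else:
    result = legacy_engine.compute(...)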

The evaluator checks these in order, and the first rule that matches wins:

Tenant allowlist (explicit on for this tenant)
Tenant blocklist (explicit off)
Percentage rollout (deterministic hash of tenant_id → stable assignment)
Environment default
Global default
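
The precedence above can be sketched as a single function. This is a minimal illustration, not the tappass evaluator: it takes the parsed flags and the environment as explicit arguments, the blocked_tenants key is a hypothetical name (the config example shows no blocklist field), the hash is salted with the flag name so different flags pick different tenants, and a configured percent is treated as decisive. All of those are assumptions.

```python
import hashlib


def bucket(flag: str, tenant_id: str) -> int:
    """Stable bucket in [0, 100) for a (flag, tenant) pair."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_on(flags: dict, name: str, tenant_id: str, env: str) -> bool:
    flag = flags[name]
    rule = flag.get("rollout", {}).get(env)
    if isinstance(rule, dict):
        if tenant_id in rule.get("tenants", []):          # 1. tenant allowlist
            return True
        if tenant_id in rule.get("blocked_tenants", []):  # 2. tenant blocklist (hypothetical key)
            return False
        if "percent" in rule:                             # 3. percentage rollout
            return bucket(name, tenant_id) < rule["percent"]
    elif isinstance(rule, bool):                          # 4. environment default
        return rule
    return bool(flag.get("default", False))               # 5. global default
```

Because the bucket depends only on the flag name and tenant ID, a tenant that is in at 10% is still in at 25%. Ramping up never flips an already-enabled tenant back off.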

Naming

Lowercase snake_case
Describes the new behaviour being gated, not the old one
Good: new_trust_engine, multiline_pii_detection, background_audit_export
Bad: disable_legacy_engine, experimental_mode_v2, alice_test

"Disable X" inverts the polarity and makes reading the code harder. "Experimental" tells you nothing.

Lifecycle

propose → ship gated → ramp → fully-on → remove
                                  │
                                  └─ remove_by deadline

Every flag has a remove_by date set at creation. Six weeks default. If it's still needed after that, either:

Promote to config. If the behaviour is tenant-specific and permanent, move to the tenant's policy profile.
Renegotiate. Set a new remove_by with a reason in the PR description.
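
A deadline only helps if something checks it. The sketch below shows one possible check, assuming the flags have already been parsed into a dict and remove_by is a datetime.date. The function names are hypothetical.

```python
from datetime import date, timedelta

DEFAULT_LIFETIME = timedelta(weeks=6)  # the six-week default described above


def default_remove_by(created: date) -> date:
    """Deadline for a flag created on `created` when no other date is agreed."""
    return created + DEFAULT_LIFETIME


def overdue_flags(flags: dict, today: date) -> list[str]:
    """Names of flags whose remove_by date has passed, oldest deadline first."""
    late = [
        (cfg["remove_by"], name)
        for name, cfg in flags.items()
        if cfg["remove_by"] < today
    ]
    return [name for _, name in sorted(late)]
```

Run against the two example flags with a date of 2026-05-01, this would report multiline_pii_detection only.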

Removal

Removing a flag is a single PR that:

Deletes the if feature_flags.is_on(...) branches, keeping only the new path
Removes the flag row from config/feature_flags.yaml
Deletes any test that asserted both-branch behaviour

No deprecation comments. No TODO markers. Just delete.
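
Before opening the removal PR it helps to confirm nothing still mentions the flag. The helper below is a hypothetical sketch that scans a source tree for leftover references. The function name and the file suffixes are assumptions, and a plain text search does the same job.

```python
from pathlib import Path


def find_references(root: str, flag_name: str, suffixes=(".py", ".yaml", ".yml")) -> list[str]:
    """Return 'path:line' for every line under root that still mentions flag_name."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if flag_name in line:
                hits.append(f"{path}:{lineno}")
    return hits
```

An empty result after the PR's changes means the flag is gone from code, config and tests.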

Kill-switches

Some flags are kill-switches rather than rollout gates. These stay in config indefinitely and default to "safe":

kill_switches:
  disable_azure_content_safety:
    default: false
    description: "Emergency off-switch if Azure backend goes sideways"

  fail_closed_on_detection_backend_error:
    default: false
    description: "Flip to true if a backend corrupts requests"

Kill-switches don't have remove_by. They exist for emergencies and are reviewed quarterly.
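
How a kill-switch is consulted at the call site might look like the sketch below. It is an assumption about the semantics, not the real detection pipeline: run_detection, DetectionUnavailable and the backend callable are hypothetical, and fail-closed is read here as "reject the request when the backend errors".

```python
from typing import Callable, Mapping


class DetectionUnavailable(Exception):
    """Raised when detection fails and the system is configured to fail closed."""


def run_detection(text: str, backend: Callable[[str], list], switches: Mapping[str, bool]) -> list:
    if switches.get("disable_azure_content_safety", False):
        return []  # backend switched off: skip the call entirely
    try:
        return backend(text)
    except Exception as exc:
        if switches.get("fail_closed_on_detection_backend_error", False):
            # Fail closed: surface the error so the caller rejects the request.
            raise DetectionUnavailable("detection backend failed") from exc
        return []  # fail open: let the request through without findings
```

With both switches at their default of false the backend is called and errors fail open, so flipping a switch is always a deliberate act.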

What NOT to flag

Bug fixes. If it's a bug, fix it. No flag.
Documentation changes. Obviously.
Refactors. If you refactor behind a flag and the flag stays forever, you've just made the code worse.
UI-only experiments. Those go through PostHog feature flags in the frontend, not the backend config.

Observability on flags

Every flag evaluation emits a metric:

tappass_feature_flag_eval_total{flag="new_trust_engine", tenant="...", value="on"}

Grafana dashboard "Feature flags" shows eval distribution. If a flag is set to 10% but you see 40% on, there's a bug in your rollout criteria — investigate before it causes an incident.
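
When comparing the dashboard against the configured percentage, note that the counter counts evaluations, so one busy tenant can skew the raw on/off ratio. Counting distinct tenants is closer to what the percentage controls. The sketch below assumes the metric samples have been exported as (flag, tenant, value, count) rows. That row format and the function name are hypothetical.

```python
from collections import defaultdict


def tenant_on_share(samples, flag: str) -> float:
    """Fraction of distinct tenants that evaluated `flag` to "on".

    samples: iterable of (flag, tenant, value, count) rows, one per label set.
    """
    seen = defaultdict(set)  # value -> tenants observed with that value
    for name, tenant, value, count in samples:
        if name == flag and count > 0:
            seen[value].add(tenant)
    total = len(seen["on"] | seen["off"])
    return len(seen["on"]) / total if total else 0.0
```

Tenants on the allowlist count as "on" too, so a small gap above the configured percentage is expected.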

Audit trail

Enabling or disabling a flag for a tenant (or globally) emits a feature_flag.changed event with before/after values. The event goes into the audit trail alongside everything else, so we can tell you exactly when a flag changed for any tenant.
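
For illustration, such an event could be shaped like the sketch below. Only the event name and the before/after values come from the description above. Every other field name (actor, at, tenant_id) and the flag_changed helper are assumptions, not the real audit schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class FlagChanged:
    flag: str
    tenant_id: Optional[str]  # None means the change applies globally
    before: Any
    after: Any
    actor: str                # who made the change (hypothetical field)
    at: str                   # ISO 8601 timestamp in UTC (hypothetical field)
    type: str = "feature_flag.changed"


def flag_changed(flag, before, after, actor, tenant_id=None, now=None) -> str:
    """Serialise one change as a JSON line for the audit trail."""
    now = now or datetime.now(timezone.utc)
    event = FlagChanged(flag, tenant_id, before, after, actor, now.isoformat())
    return json.dumps(asdict(event), sort_keys=True)
```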