Intent → Policy: a smart, declarative mapping (no if/else trees)

The problem

The pipeline-creation Step 2 asks the operator a high-level question:

What data does it handle? ☐ Customer PII ☐ Payment data ☐ Source code & secrets ☐ Internal docs only ☐ External communications ☐ Health data ☐ EU residents

We need to translate those checkboxes into:

Pipeline steps (which detectors to enable, in what mode)
Tool constraints (per-tool parameter rules)
OPA Rego policies (instantiated from templates)

A naïve `if pii: enable_pii_detector` chain doesn't scale: 7 categories × N regulations × M tools × K outcomes grows combinatorially. Hardcoded mappings calcify the moment a new category, regulation, or tool lands.

The pattern: two layers of indirection

Categories       →     Concerns         →     Mitigations
(UI labels)            (risk/reg model)        (steps + constraints + Rego)
"Customer PII"         data_leak              detect_pii (block)
"Payment data"         pci_dss                detect_secrets (block)
"EU residents"         gdpr_required          tool: send_email exclude *.us
                       audit_required         scan_output (block)
                       …                      …

Why two layers, not one direct map:

Dedup is automatic. Customer PII and Health Data both trigger data_leak. With direct mapping, detect_pii would be enabled twice (once per category) and we'd need conflict-resolution logic in the UI. With concerns as a set, dedup falls out for free.
Categories, concerns, and mitigations evolve at different rates, so each gets its own layer:
- Categories change when product UX is redesigned (every 6 months)
- Concerns change when regulations land (HIPAA, DORA, EU AI Act; every few years)
- Mitigations change when detectors / tools / policies ship (every sprint)
Composability. New category → one entry pointing at existing concerns. New regulation → one concern entry pointing at mitigations. New mitigation → registered once, every concern that needs it picks it up.
The NLP-interview future works without modification (see § "Forward: NLP → policy" below).
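
The dedup argument can be sketched in a few lines of Python. The two mappings below are trimmed-down, illustrative stand-ins for the catalog described in this document, and `concerns_for` / `steps_for` are hypothetical helper names, not part of any shipped module.

```python
# Illustrative subset of the category -> concern mapping (not the full catalog).
CATEGORY_TRIGGERS: dict[str, list[str]] = {
    "customer_pii": ["data_leak", "audit_required"],
    "health_data": ["hipaa", "data_leak", "access_control_strict", "audit_required"],
}

# Illustrative subset of the concern -> pipeline-step mapping.
CONCERN_STEPS: dict[str, list[str]] = {
    "data_leak": ["detect_pii", "scan_output"],
    "hipaa": ["detect_pii", "classify_data"],
    "audit_required": ["audit_signing"],
    "access_control_strict": ["require_approval"],
}


def concerns_for(category_ids: list[str]) -> set[str]:
    """Union the concerns triggered by each ticked category."""
    concerns: set[str] = set()
    for cid in category_ids:
        concerns.update(CATEGORY_TRIGGERS[cid])
    return concerns


def steps_for(concerns: set[str]) -> set[str]:
    """Union the steps each concern asks for; duplicates collapse in the set."""
    return {step for concern in concerns for step in CONCERN_STEPS[concern]}


ticked = ["customer_pii", "health_data"]
concerns = concerns_for(ticked)
steps = steps_for(concerns)
print(sorted(concerns))
print(sorted(steps))
```

Both categories trigger `data_leak`, and both `data_leak` and `hipaa` ask for `detect_pii`, yet each appears once in the result. No conflict-resolution code is involved at this stage; the set does the work.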

The catalog shape

Two YAML files, hand-curated by domain experts (compliance + security):

`tappass/policy/intent_catalog.yaml`

categories:
  customer_pii:
    label: "Customer PII"
    hint: "Names, emails, addresses, phone numbers"
    triggers: [data_leak, audit_required]

  payment_data:
    label: "Payment data"
    hint: "Card numbers, IBANs, transaction IDs"
    triggers: [pci_dss, fraud_risk, data_leak, audit_required]

  source_code_secrets:
    label: "Source code & secrets"
    hint: "Code snippets, API keys, credentials"
    triggers: [secrets_leak, code_exec_risk, ip_protection]

  internal_docs_only:
    label: "Internal docs only"
    hint: "No customer or regulated data"
    triggers: []        # No concerns; lowers the floor (see § The "Internal docs only" exception)

  external_comms:
    label: "External communications"
    hint: "Sends email / messages outside the org"
    triggers: [exfiltration_risk, recipient_validation, audit_required]

  health_data:
    label: "Health data"
    hint: "Medical records, diagnoses, test results"
    triggers: [hipaa, data_leak, access_control_strict, audit_required]

  eu_residents:
    label: "EU residents"
    hint: "GDPR applies; EU data residency required"
    triggers: [gdpr_required, data_residency]

`tappass/policy/concerns.yaml`

concerns:
  data_leak:
    summary: "Sensitive data leaving the org via any channel"
    pipeline_steps:
      detect_pii:        { enabled: true, on_detection: block }
      scan_output:       { enabled: true, on_detection: block }
    tool_constraints: {}
    rego_templates:
      - block_tool_when_pii_detected:
          target_tool: send_email

  secrets_leak:
    summary: "API keys / credentials in agent input or output"
    pipeline_steps:
      detect_secrets:    { enabled: true, on_detection: block }
    tool_constraints:
      Bash:
        command:         { not_contains: ["~/.aws", "~/.ssh/id_", "AWS_SECRET"] }
    rego_templates: []

  pci_dss:
    summary: "PCI-DSS scope: card numbers, transaction IDs"
    pipeline_steps:
      detect_pii:        { enabled: true, on_detection: block }   # additive — dedups
      detect_secrets:    { enabled: true, on_detection: block }
      classify_data:     { enabled: true }
    tool_constraints: {}
    rego_templates: []

  hipaa:
    summary: "Protected Health Information"
    pipeline_steps:
      detect_pii:        { enabled: true, on_detection: block }   # phi reuses pii detector for now
      classify_data:     { enabled: true }
    tool_constraints:
      send_email:
        to:              { exclude_pattern: "^(?!.*@(acme|hospital)\\.(com|org)).*" }
    rego_templates: []

  gdpr_required:
    summary: "EU data subjects in scope"
    pipeline_steps:
      detect_pii:        { enabled: true, on_detection: block }
      classify_data:     { enabled: true }
    tool_constraints:
      send_email:
        to:              { exclude: ["*@*.us", "*@*.cn"] }
    rego_templates: []

  data_residency:
    summary: "Data must remain in a specified region"
    pipeline_steps: {}
    tool_constraints: {}
    rego_templates:
      - block_egress_outside_region:
          allowed_regions: [eu]

  exfiltration_risk:
    summary: "External destinations reachable via tools"
    pipeline_steps:
      detect_exfiltration: { enabled: true, on_detection: block }
    tool_constraints: {}
    rego_templates: []

  recipient_validation:
    summary: "Validate destination domains for outbound messaging"
    pipeline_steps: {}
    tool_constraints:
      send_email:
        to:              { match: "^[^@]+@(allowed-domain-1|allowed-domain-2)\\." }
    rego_templates: []

  code_exec_risk:
    summary: "Tool calls that execute code"
    pipeline_steps:
      detect_code_exec:  { enabled: true, on_detection: block }
    tool_constraints:
      Bash:
        command:         { not_contains: ["rm -rf", "sudo", "eval $", "curl | sh"] }
    rego_templates: []

  audit_required:
    summary: "Regulatory audit trail required"
    pipeline_steps:
      audit_signing:     { enabled: true }
    tool_constraints: {}
    rego_templates: []

  access_control_strict:
    summary: "Step-up auth / approval for sensitive actions"
    pipeline_steps:
      require_approval:  { enabled: true, on_detection: block }
    tool_constraints: {}
    rego_templates: []

  fraud_risk:
    summary: "High-value transaction patterns"
    pipeline_steps:
      detect_anomaly:    { enabled: true, on_detection: notify }
    tool_constraints:
      transfer_funds:
        amount:          { max: 10000 }
    rego_templates: []

  ip_protection:
    summary: "Source code / IP must not leak externally"
    pipeline_steps:
      detect_secrets:    { enabled: true, on_detection: block }
    tool_constraints:
      Read:
        file_path:       { not_contains: ["/repo/", ".env"] }
    rego_templates: []

The catalog is data, not code. A compliance officer with no Python knowledge can add a category or concern by editing YAML.
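
Because the catalog is hand-edited, a load-time consistency check is one way to catch typos before they reach the resolver. The sketch below assumes the two YAML files have already been parsed into plain dicts by whatever YAML loader the project uses; `validate_catalog` and the set of allowed concern keys are assumptions drawn from the illustrative YAML above, not an existing API.

```python
def validate_catalog(categories: dict, concerns: dict) -> list[str]:
    """Return human-readable problems; an empty list means the catalog is consistent."""
    problems: list[str] = []
    # Keys seen in the illustrative concerns.yaml; anything else is likely a typo.
    allowed_keys = {"summary", "pipeline_steps", "tool_constraints", "rego_templates"}

    # Every trigger must point at a defined concern.
    for cat_id, cat in categories.items():
        for trigger in cat.get("triggers", []):
            if trigger not in concerns:
                problems.append(f"category {cat_id!r} triggers unknown concern {trigger!r}")

    # Flag misspelled keys and concerns that no category can ever reach.
    triggered = {t for cat in categories.values() for t in cat.get("triggers", [])}
    for concern_id, concern in concerns.items():
        unknown = set(concern) - allowed_keys
        if unknown:
            problems.append(f"concern {concern_id!r} has unknown keys {sorted(unknown)}")
        if concern_id not in triggered:
            problems.append(f"concern {concern_id!r} is not triggered by any category")

    return problems


# Inline dicts stand in for the parsed YAML files; note the deliberate typo.
categories = {
    "customer_pii": {"label": "Customer PII", "triggers": ["data_leak", "audit_requird"]},
    "internal_docs_only": {"label": "Internal docs only", "triggers": []},
}
concerns = {
    "data_leak": {"summary": "...", "pipeline_steps": {}, "tool_constraints": {}, "rego_templates": []},
    "audit_required": {"summary": "...", "pipeline_steps": {}, "tool_constraints": {}, "rego_templates": []},
}

for problem in validate_catalog(categories, concerns):
    print(problem)
```

Run in CI against the real files, a check along these lines would let a compliance officer edit YAML without needing to read the resolver to know whether the edit is well-formed.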

The resolver (one function, ~40 lines)

from __future__ import annotations

from dataclasses import dataclass, field

# Catalog, _merge_step_cfg and _merge_constraint are defined elsewhere in this module.


@dataclass
class ResolvedPolicy:
    pipeline_steps: dict[str, dict] = field(default_factory=dict)
    tool_constraints: dict[str, dict] = field(default_factory=dict)
    rego_template_instances: list[dict] = field(default_factory=list)
    # Audit trail: which concerns produced each line ...
    provenance: dict[str, list[str]] = field(default_factory=dict)
    # ... and which ticked categories triggered those concerns.
    category_triggers: dict[str, list[str]] = field(default_factory=dict)


def resolve(category_ids: list[str], catalog: Catalog) -> ResolvedPolicy:
    """Categories → concerns → mitigations, deduplicated and merged."""
    concerns: set[str] = set()
    cat_to_concerns: dict[str, list[str]] = {}
    for cid in category_ids:
        cat = catalog.categories[cid]  # KeyError on an unknown category id
        cat_to_concerns[cid] = list(cat.triggers)
        concerns.update(cat.triggers)

    out = ResolvedPolicy(category_triggers=cat_to_concerns)
    # Sorted so the merged policy and its provenance are deterministic.
    for concern_id in sorted(concerns):
        c = catalog.concerns[concern_id]

        # Pipeline steps: strictest wins (block > notify > log; lower threshold wins).
        for step, cfg in c.pipeline_steps.items():
            existing = out.pipeline_steps.get(step, {})
            out.pipeline_steps[step] = _merge_step_cfg(existing, cfg)
            out.provenance.setdefault(f"step:{step}", []).append(concern_id)

        # Tool constraints: union per (tool, param).
        for tool, rules in c.tool_constraints.items():
            existing = out.tool_constraints.get(tool, {})
            out.tool_constraints[tool] = _merge_constraint(existing, rules)
            out.provenance.setdefault(f"tool:{tool}", []).append(concern_id)

        # Rego templates: one instance per distinct (template_id, params).
        # Each entry is a one-key mapping {template_id: params}, as in the YAML.
        for tmpl in c.rego_templates:
            if tmpl not in out.rego_template_instances:
                out.rego_template_instances.append(tmpl)
            template_id = next(iter(tmpl))
            # Record provenance even when the instance was already added.
            out.provenance.setdefault(f"rego:{template_id}", []).append(concern_id)

    return out

That's the entire engine. Forty lines. The catalog does the heavy lifting; the resolver is a fold.

Conflict resolution rules (built into `_merge_*`)

When two concerns disagree on a step or constraint, the stricter wins. No exceptions, no operator override at this layer (the operator can override afterwards by editing the resulting pipeline directly; that is a separate, audited action).

Field                               Strict-wins rule
`enabled`                           `True` wins over `False`
`on_detection`                      `block` > `notify` > `log`
`max` (numeric ceiling)             lower wins
`min` (numeric floor)               higher wins
`contains` (must contain)           union (logical AND across all sources)
`not_contains` (must not contain)   union
`exclude` / `not_match`             union

This rule is the opinionated part. Document it prominently so operators understand that ticking more boxes never relaxes anything.

The "Internal docs only" exception

This is the one negative assertion in the catalog. It triggers no concerns, so a pipeline with only internal_docs_only checked gets the slim proxy with no detection enabled.

But the moment the operator also ticks customer_pii (or anything else), that category's concerns apply in full: internal_docs_only contributes nothing to the concern set, so it has nothing to cancel them with. It is effectively a "lower the floor" signal when ticked alone, never an override.

Operator-facing explainability

Once resolve() returns a ResolvedPolicy, the UI shows a preview before commit:

Based on what you ticked, we'll enable:

  ✓ detect_pii (block)
    Because: customer_pii, payment_data, eu_residents, health_data
  ✓ detect_secrets (block)
    Because: source_code_secrets, payment_data
  ✓ detect_exfiltration (block)
    Because: external_comms
  ✓ scan_output (block)
    Because: customer_pii, payment_data, health_data
  …

  ✓ Tool constraint: Bash.command not_contains: ["rm -rf", "sudo", "eval $", "curl | sh", "~/.aws", "~/.ssh/id_", "AWS_SECRET"]
    Because: source_code_secrets

  ✓ OPA template: block_egress_outside_region (allowed: [eu])
    Because: eu_residents

9 steps · 6 tool constraints · 2 OPA policies

Every line has a because trail. This solves the "why is this step here?" question that operators ask three months later when they want to relax something.
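
One way the "Because:" trail could be derived, assuming two inputs are available: the resolver's provenance map (policy line → concern ids) and the triggers of the ticked categories (category → concern ids). The function name `because` and the sample data are illustrative.

```python
def because(
    provenance: dict[str, list[str]],
    category_triggers: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Map each policy line to the ticked categories that caused it."""
    # Invert category -> concerns into concern -> categories.
    concern_to_categories: dict[str, list[str]] = {}
    for category, triggers in category_triggers.items():
        for concern in triggers:
            concern_to_categories.setdefault(concern, []).append(category)

    trail: dict[str, list[str]] = {}
    for line, concern_ids in provenance.items():
        categories: list[str] = []
        for concern in concern_ids:
            for category in concern_to_categories.get(concern, []):
                if category not in categories:  # keep first-seen order, no repeats
                    categories.append(category)
        trail[line] = categories
    return trail


# Sample inputs for an operator who ticked three boxes.
category_triggers = {
    "customer_pii": ["data_leak", "audit_required"],
    "payment_data": ["pci_dss", "fraud_risk", "data_leak", "audit_required"],
    "source_code_secrets": ["secrets_leak", "code_exec_risk", "ip_protection"],
}
provenance = {
    "step:detect_pii": ["data_leak", "pci_dss"],
    "step:detect_secrets": ["secrets_leak", "pci_dss", "ip_protection"],
    "step:scan_output": ["data_leak"],
}

trail = because(provenance, category_triggers)
for line, cats in trail.items():
    print(f"✓ {line}\n    Because: {', '.join(cats)}")
```

Keeping the concern ids in the provenance map and deriving categories at render time means the same data can answer both "which regulation asked for this?" and "which checkbox asked for this?".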

Forward: NLP → policy (the interview)

The same resolver runs. The LLM never sees Rego, never sees step names. It produces a category set:

Operator says: "We're a Belgian fintech, our agents send transactional
emails to EU customers about wire transfers."

LLM extracts (via tool call to a small classifier):
  - geography: EU (Belgium)
  - industry: fintech
  - data: payment, customer PII
  - tools used: outbound email
  - regulated: yes (PCI, GDPR)

LLM proposes categories with confidence scores:
  - eu_residents       (confident: explicit "EU customers")
  - customer_pii       (high: customers + emails)
  - payment_data       (high: "wire transfers")
  - external_comms     (high: "send emails")
  - source_code_secrets (low: not mentioned, leave unticked)

UI shows the proposal — operator confirms or flips boxes.
Same resolve() runs. Same ResolvedPolicy lands.

Why this composes:

The catalog stays declarative and human-curated
The LLM is a typed-selection helper from prose → categories (a small, well-bounded task)
No LLM-generated Rego ever runs (huge safety win — Rego from natural language is dangerous)
Confidence scores let the UI distinguish "we read this directly" from "we inferred this"
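
The "typed selection" boundary can be sketched as a validation step between the classifier and the UI. Everything here is an assumption for illustration: the `CategoryProposal` shape, the ordering of the confidence labels (taken from the example above), and the `preselect` helper. The point it shows is that an id the catalog does not know never reaches resolve().

```python
from dataclasses import dataclass

# The seven category ids from the illustrative catalog.
KNOWN_CATEGORIES = frozenset({
    "customer_pii", "payment_data", "source_code_secrets", "internal_docs_only",
    "external_comms", "health_data", "eu_residents",
})
# Assumed ordering, weakest to strongest.
CONFIDENCE_LEVELS = ("low", "high", "confident")


@dataclass(frozen=True)
class CategoryProposal:
    category_id: str
    confidence: str   # one of CONFIDENCE_LEVELS
    evidence: str     # phrase from the operator's prose, shown in the UI


def preselect(
    proposals: list[CategoryProposal],
    min_confidence: str = "high",
) -> tuple[list[str], list[str]]:
    """Return (ids to pre-tick, ids rejected because the catalog does not define them)."""
    threshold = CONFIDENCE_LEVELS.index(min_confidence)
    ticked: list[str] = []
    rejected: list[str] = []
    for p in proposals:
        if p.category_id not in KNOWN_CATEGORIES:
            rejected.append(p.category_id)  # hallucinated id: dropped, never resolved
        elif CONFIDENCE_LEVELS.index(p.confidence) >= threshold:
            ticked.append(p.category_id)
    return ticked, rejected


proposals = [
    CategoryProposal("eu_residents", "confident", "EU customers"),
    CategoryProposal("customer_pii", "high", "customers + emails"),
    CategoryProposal("payment_data", "high", "wire transfers"),
    CategoryProposal("external_comms", "high", "send transactional emails"),
    CategoryProposal("source_code_secrets", "low", "not mentioned"),
    CategoryProposal("crypto_assets", "high", "fintech"),  # not in the catalog
]
ticked, rejected = preselect(proposals)
print(ticked)
print(rejected)
```

The operator still confirms or flips the pre-ticked boxes; this step only guarantees that whatever the model proposes is a subset of the catalog's vocabulary.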

The interview can iterate:

"I see you ticked health data. Are EU residents in scope?"
"Will any of these tools post to webhooks or external APIs?"

Each question fills a category cell. Three to five questions usually cover most pipelines.

Catalog as knowledge graph: a follow-up "explain this rule" view can walk backwards — show "detect_pii is enabled because of these concerns, which were triggered by these categories, which the LLM derived from these phrases in the interview." Every decision is traceable to a phrase the operator said.

What ships in the smallest viable cut

Two catalog files + one resolver module + one route + one UI panel:

tappass/policy/intent_catalog.yaml (categories — 7 entries)
tappass/policy/concerns.yaml (concerns — 13 entries)
tappass/policy/intent_resolver.py (the 40-line resolve function + merge helpers + a Catalog loader)
Extend OnboardAgentRequest (and / or PipelineCreate) with categories: list[str]. On submit, run resolve(), write the resulting steps + constraints + Rego templates into the per-tenant pipeline + emit one policy_intent_applied audit event with the full provenance map
Frontend Step 3: render the preview from the resolver's response

The NLP layer is a phase-2: a separate /agents/onboard/from-prose endpoint that takes a string, calls a classifier, returns the proposed category set. Same resolver runs after operator confirms.

What this concept does NOT prescribe

Specific catalog values — the YAML above is illustrative. Compliance/security folks will tune them.
The Rego template implementations — those live in tappass/policy/templates/builtins.py (already shipped: 3 templates; needs ~5 more to cover the catalog above).
The classifier model — could be a small fine-tuned local model, an LLM with structured output, or a rules-based parser. Decoupled from this design.
Whether categories are global or per-org — start global; per-org overrides are a phase-3.

Questions to resolve before we build

Who owns catalog edits? A compliance role in the dashboard, or YAML in source control? (Recommend: YAML in source for v1; admin UI in v2 with audit on every catalog edit.)
Do operator overrides get persisted as catalog deltas, or as direct pipeline edits? (Recommend: direct pipeline edits — the catalog stays clean, the operator's override is visible in the pipeline diff.)
What happens when the catalog changes after agents are onboarded? (Recommend: do nothing automatically — surface a "your pipeline diverges from the current catalog defaults" advisory in the dashboard. Re-applying is an explicit, audited action.)
