Pre-deployment evaluator

What it does: Runs 50+ adversarial probes against the agent before it ships and produces a quantitative pass/fail report against the policy.

1. Vision context

The architecture's runtime spine governs what the agent does once deployed. The procurement-gate question, "is the agent safe to ship?", needs an answer before deployment. Giskard centers on this capability and Enoki includes it, so TapPass needs to be competitive on it.

The evaluator runs adversarial probes against the agent under the candidate policy. Probes test for prompt injection, data disclosure, excessive agency, over-refusal, jailbreaks, tool misuse, and compliance-specific failures. The output is a quantitative report on which the operator gates the release (or chooses not to gate and ships anyway, knowing the risks).

Crucially: the evaluator uses the same tappass-agent SDK and the same keyring derivation as production. There is no divergence between what we tested and what we deployed. See architecture §8 for full design.

2. Functional specification

CLI: tappass eval run --agent <pkg> --policy <id> --packs <list> --probe-suite <ver> [--gate fail-on=critical].

Behavior:

Derive a temporary SandboxConfig for the candidate policy (calls policy-to-sandbox-config-builder with mode=evaluation).
Spawn an ephemeral sandbox with that config.
For each probe in the suite, drive the agent through the probe scenario.
Collect, per probe: whether the agent emitted calls to denied tools, whether detections fired, whether loop_guard triggered, whether the agent leaked PII, whether it followed prompt-injection bait, and whether it over-refused.
Aggregate per-probe pass/fail; compute pack-level scores.
Emit JUnit XML + TapPass trace bundle.
Optionally exit non-zero based on --gate.
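
The aggregation and gating steps above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: the ProbeResult shape, the severity names other than "critical", the pass-fraction scoring, and the "at or above threshold" reading of --gate fail-on are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity ordering; the spec only names "critical".
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class ProbeResult:
    """Assumed per-probe record; field names are illustrative."""
    probe_id: str
    pack: str
    severity: str
    passed: bool


def pack_scores(results):
    """Return {pack: fraction of that pack's probes that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.pack] += 1
        passes[r.pack] += int(r.passed)
    return {pack: passes[pack] / totals[pack] for pack in totals}


def gate_exit_code(results, fail_on=None):
    """Return 1 if any failed probe is at or above fail_on severity, else 0.

    With fail_on=None the run is advisory and always exits 0.
    """
    if fail_on is None:
        return 0
    threshold = SEVERITY_RANK[fail_on]
    for r in results:
        if not r.passed and SEVERITY_RANK[r.severity] >= threshold:
            return 1
    return 0
```

Keeping the gate decision separate from scoring means the same results can feed both an advisory report and a blocking CI check.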

Output: structured artifacts persisted to the eval_runs table and surfaced on a dashboard for review.
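
One way the JUnit XML artifact could be produced with the Python standard library is sketched below. The mapping is an assumption rather than the specified format: each probe becomes a testcase, the pack name is used as classname, and a failed probe carries a failure element.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, results):
    """Serialize probe results as a single JUnit <testsuite> string.

    results: iterable of (probe_id, pack, passed, message) tuples
    (an assumed shape, chosen for illustration).
    """
    results = list(results)
    failures = sum(1 for _, _, passed, _ in results if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for probe_id, pack, passed, message in results:
        case = ET.SubElement(suite, "testcase", classname=pack, name=probe_id)
        if not passed:
            # A failed probe gets a <failure> child describing what went wrong.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")
```

Most CI systems that read JUnit reports key on the tests and failures counts and on the failure children, so this subset is usually enough for a merge gate to display per-probe results.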

3. Technical design

The evaluator lives at tappass/eval/. For each probe, the driver shells out to the agent package's run-task CLI, collects audit events from the running ephemeral sandbox, and correlates them with the probe by trace id.
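
The correlation step might look like the sketch below. It assumes the driver records which trace id it assigned to each probe and that every audit event is a dict carrying a "trace_id" key; both shapes are hypothetical.

```python
def correlate_by_trace(probe_runs, audit_events):
    """Group audit events under the probe whose trace id they carry.

    probe_runs: {trace_id: probe_id}, recorded by the driver per probe.
    audit_events: iterable of dicts with a "trace_id" key (assumed shape).
    Returns (by_probe, unmatched), where by_probe maps each probe_id to
    its events and unmatched holds events with an unknown trace id.
    """
    # Seed every probe so a probe with zero events is still reported.
    by_probe = {probe_id: [] for probe_id in probe_runs.values()}
    unmatched = []
    for event in audit_events:
        probe_id = probe_runs.get(event.get("trace_id"))
        if probe_id is None:
            unmatched.append(event)
        else:
            by_probe[probe_id].append(event)
    return by_probe, unmatched
```

Returning unmatched events separately, instead of dropping them, makes it visible when the sandbox emits activity that no probe accounts for.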

4. Definition of done

All acceptance_criteria pass.
Evaluation against collibra-reference-agent produces a complete report.
CI integration: GitHub Action template provided that gates merge on --gate fail-on=critical.
Evaluation history queryable per agent / per policy version.
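
The history requirement could be met with a store along these lines. This is a sketch using SQLite for illustration only; the spec names the eval_runs table but not its columns or backing database, so every column below is an assumption.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store and create a hypothetical eval_runs schema if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_runs (
               run_id INTEGER PRIMARY KEY,
               agent TEXT NOT NULL,
               policy_id TEXT NOT NULL,
               policy_version TEXT NOT NULL,
               passed INTEGER NOT NULL,
               created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record_run(conn, agent, policy_id, policy_version, passed):
    """Insert one evaluation run and return its run_id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO eval_runs (agent, policy_id, policy_version, passed) "
            "VALUES (?, ?, ?, ?)",
            (agent, policy_id, policy_version, int(passed)),
        )
    return cur.lastrowid


def history(conn, agent, policy_version=None):
    """Return runs for an agent, optionally narrowed to one policy version."""
    sql = ("SELECT run_id, policy_id, policy_version, passed "
           "FROM eval_runs WHERE agent = ?")
    params = [agent]
    if policy_version is not None:
        sql += " AND policy_version = ?"
        params.append(policy_version)
    sql += " ORDER BY run_id"
    return conn.execute(sql, params).fetchall()
```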

5. Coordination notes

With policy-to-sandbox-config-builder: we call derive(mode=evaluation) as a pure function: nothing is persisted and no audit event is emitted. The builder must support this mode.
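
A contract check for the purity requirement might look like the sketch below. The derive function here is a hypothetical stand-in, not the builder's real implementation, and the policy and config shapes are invented for illustration. The check only covers determinism and non-mutation of the input; confirming that nothing is persisted or audited would need hooks into the builder itself.

```python
import copy


def derive(policy, mode="production"):
    """Hypothetical stand-in for the builder's derive(); shapes are assumed."""
    denied = tuple(sorted(policy.get("denied_tools", [])))
    return {
        "policy_id": policy["id"],
        "denied_tools": denied,
        "ephemeral": mode == "evaluation",
    }


def check_pure(derive_fn, policy, mode="evaluation"):
    """Return True if derive_fn is deterministic and leaves policy untouched."""
    snapshot = copy.deepcopy(policy)
    first = derive_fn(policy, mode=mode)
    second = derive_fn(policy, mode=mode)
    return first == second and policy == snapshot
```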

With agent-client-sdk: evaluator drives the agent via its own CLI; agent uses the SDK normally; we observe externally.

With behavior-drift-monitor: evaluation runs establish the baseline tool-call distribution; the drift monitor compares observed production behavior against that baseline.
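
As an illustration of what a baseline tool-call distribution and a comparison against it could look like, the sketch below normalizes tool-call counts into frequencies and measures the gap with total variation distance. The choice of metric is an assumption made here for concreteness; the spec does not say how the drift monitor compares distributions.

```python
from collections import Counter


def tool_call_distribution(tool_calls):
    """Return {tool_name: relative frequency} for a sequence of tool names."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}


def total_variation(baseline, observed):
    """Total variation distance between two distributions, in [0, 1].

    Tools missing from one side are treated as having frequency 0.
    """
    tools = set(baseline) | set(observed)
    return 0.5 * sum(
        abs(baseline.get(t, 0.0) - observed.get(t, 0.0)) for t in tools
    )
```

A distance of 0 means production calls tools in the same proportions seen during evaluation; a distance of 1 means the two sets of tool calls do not overlap at all.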

Open questions:

(Q) Should evaluation be a gate (block deploy on failure) or advisory (report and let operator decide)? Lean: configurable per cascade level. Org admins can require eval-pass at org floor for high-risk functions; advisory by default.
(Q) Probe-author SDK for customers who want to add custom probes? Lean: ship core suite v1; custom-probe SDK in v2.

6. Out of scope

Quality / hallucination benchmarking beyond what's needed for policy conformance — separate concern; this evaluator checks whether hallucinations cross policy boundaries, not whether they're factually accurate.
Model fine-tuning to fix probe failures — orthogonal.
Continuous in-production red-teaming — would extend this component into a runtime mode; v2.