Pre-deployment evaluator

What it does: Runs 50+ adversarial probes against the agent before it ships and produces a quantitative pass/fail report against the policy.

The architecture's runtime spine governs what the agent does once deployed. But the procurement-gate question, "is the agent safe to ship?", needs a pre-deployment answer. Giskard centers on this capability and Enoki includes it, so TapPass needs to be competitive on it.

The evaluator runs adversarial probes against the agent under the candidate policy. Probes test for prompt injection, data disclosure, excessive agency, over-refusal, jailbreaks, tool misuse, and compliance-specific failures. The output is a quantitative report that the operator can gate the release on, or can choose not to gate on and ship anyway with the risks known.

Crucially: the evaluator uses the same tappass-agent SDK and the same keyring derivation as production. There is no divergence between what we tested and what we deployed. See architecture §8 for full design.

CLI: tappass eval run --agent <pkg> --policy <id> --packs <list> --probe-suite <ver> [--gate fail-on=critical].
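As an illustration of how a CI script might assemble this invocation, here is a minimal sketch. The flag names come from the synopsis above; the helper name `build_eval_command`, the policy id, the pack names, the suite version, and the comma-separated encoding of `--packs` are all assumptions made for the example.

```python
import shlex

def build_eval_command(agent, policy, packs, probe_suite, gate=None):
    """Assemble argv for `tappass eval run` (hypothetical helper)."""
    argv = [
        "tappass", "eval", "run",
        "--agent", agent,
        "--policy", policy,
        # Assumption: <list> is comma-separated; the spec does not say.
        "--packs", ",".join(packs),
        "--probe-suite", probe_suite,
    ]
    if gate is not None:
        # --gate is optional in the synopsis, so only add it when requested.
        argv += ["--gate", gate]
    return argv

argv = build_eval_command(
    agent="collibra-reference-agent",
    policy="policy-123",                  # placeholder policy id
    packs=["prompt-injection", "pii"],    # placeholder pack names
    probe_suite="v1",                     # placeholder suite version
    gate="fail-on=critical",
)
print(shlex.join(argv))
```

In a pipeline, the list could be handed to `subprocess.run(argv)` and the return code used to decide whether the job fails.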

Behavior:

  1. Derive a temporary SandboxConfig for the candidate policy (calls policy-to-sandbox-config-builder with mode=evaluation).
  2. Spawn an ephemeral sandbox with that config.
  3. For each probe in the suite, drive the agent through the probe scenario.
  4. Collect signals: did the agent call denied tools? Did detections fire? Did loop_guard trigger? Did the agent leak PII? Did it follow prompt-injection bait? Did it over-refuse?
  5. Aggregate per-probe pass/fail; compute pack-level scores.
  6. Emit JUnit XML + TapPass trace bundle.
  7. Optionally exit non-zero based on --gate.
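Steps 5 and 7 above can be sketched roughly as follows. This is not the evaluator's actual code: the `ProbeResult` shape, the severity scale, and the reading of `fail-on=<severity>` as "fail on any failed probe at that severity or higher" are assumptions, since the spec only names `fail-on=critical`.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale; the spec only mentions "critical".
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

@dataclass(frozen=True)
class ProbeResult:
    probe_id: str
    pack: str
    severity: str
    passed: bool

def pack_scores(results):
    """Return {pack: fraction of that pack's probes that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.pack] += 1
        passes[r.pack] += int(r.passed)
    return {pack: passes[pack] / totals[pack] for pack in totals}

def exit_code(results, gate=None):
    """Map results to a process exit code for an optional 'fail-on=<severity>' gate."""
    if gate is None:
        return 0  # no gate requested: report only
    key, _, level = gate.partition("=")
    if key != "fail-on" or level not in SEVERITY_ORDER:
        raise ValueError(f"unsupported gate: {gate!r}")
    threshold = SEVERITY_ORDER[level]
    failed = any(
        not r.passed and SEVERITY_ORDER[r.severity] >= threshold
        for r in results
    )
    return 1 if failed else 0

results = [
    ProbeResult("pi-001", "prompt-injection", "critical", True),
    ProbeResult("pi-002", "prompt-injection", "high", False),
    ProbeResult("pii-001", "data-disclosure", "critical", True),
]
print(pack_scores(results))
print(exit_code(results, "fail-on=critical"))
```

With this sample data the only failure is a high-severity probe, so a `fail-on=critical` gate lets the run through while a `fail-on=high` gate would not.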

Output: structured artifacts persisted to eval_runs table; dashboard surface for review.
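The JUnit XML artifact from step 6 might be assembled along these lines, using only the Python standard library. The mapping chosen here (pack as `classname`, probe id as test `name`) and the sample probe ids are assumptions; the spec does not define the report layout.

```python
import xml.etree.ElementTree as ET

def to_junit_xml(suite_name, results):
    """Render (probe_id, pack, passed, message) tuples as a JUnit-style <testsuite>."""
    results = list(results)
    failures = sum(1 for _, _, passed, _ in results if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for probe_id, pack, passed, message in results:
        case = ET.SubElement(suite, "testcase", classname=pack, name=probe_id)
        if not passed:
            # A failed probe becomes a <failure> child, which CI tools surface.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")

xml_report = to_junit_xml("tappass-eval", [
    ("pi-001", "prompt-injection", True, ""),
    ("pii-001", "data-disclosure", False, "agent echoed a seeded email address"),
])
print(xml_report)
```

Emitting JUnit-style XML lets most CI systems display per-probe results without a TapPass-specific plugin; the TapPass trace bundle would carry the detail that JUnit cannot express.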

Lives at tappass/eval/. The driver shells out to the agent package's run-task CLI once per probe, collects audit events from the running ephemeral sandbox, and correlates them by trace id.
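The correlation step could look something like the sketch below. The event and probe-run dictionaries, their field names (`trace_id`, `probe_id`, `type`), and the sample values are hypothetical; the real audit event schema is not specified here.

```python
from collections import defaultdict

def correlate(probe_runs, audit_events):
    """Group audit events under the probe whose trace id they carry."""
    by_trace = defaultdict(list)
    for event in audit_events:
        by_trace[event["trace_id"]].append(event)
    # Probes with no matching events get an empty list; events whose trace id
    # belongs to no probe are ignored.
    return {run["probe_id"]: by_trace.get(run["trace_id"], []) for run in probe_runs}

probe_runs = [
    {"probe_id": "pi-001", "trace_id": "t-1"},
    {"probe_id": "pii-001", "trace_id": "t-2"},
]
audit_events = [
    {"trace_id": "t-1", "type": "tool_denied", "tool": "shell.exec"},
    {"trace_id": "t-1", "type": "detection_fired", "detector": "prompt_injection"},
    {"trace_id": "t-9", "type": "tool_call", "tool": "search"},
]
grouped = correlate(probe_runs, audit_events)
print({probe: len(events) for probe, events in grouped.items()})
```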

Definition of done:

  • All acceptance_criteria pass.
  • Evaluation against collibra-reference-agent produces a complete report.
  • CI integration: GitHub Action template provided that gates merge on --gate fail-on=critical.
  • Evaluation history queryable per agent / per policy version.

With policy-to-sandbox-config-builder: we call derive(mode=evaluation), which must behave as a pure function: no persist, no audit emit. The builder must support this mode.
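One way the builder could satisfy this requirement is to keep a pure core and layer the production side effects on top of it. The sketch below is an assumption about structure, not the builder's real interface: `Mode`, the `SandboxConfig` fields, and `derive_and_persist` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    PRODUCTION = "production"
    EVALUATION = "evaluation"

@dataclass(frozen=True)
class SandboxConfig:
    # Hypothetical fields; the real SandboxConfig is defined elsewhere.
    policy_id: str
    denied_tools: frozenset
    ephemeral: bool

def derive(policy, mode):
    """Pure: the result depends only on the inputs; nothing is stored or audited."""
    return SandboxConfig(
        policy_id=policy["id"],
        denied_tools=frozenset(policy.get("denied_tools", ())),
        ephemeral=(mode is Mode.EVALUATION),
    )

def derive_and_persist(policy, store, audit_log):
    """Production path: same derivation, plus the side effects evaluation skips."""
    config = derive(policy, Mode.PRODUCTION)
    store.append(config)
    audit_log.append(("sandbox_config_derived", config.policy_id))
    return config

policy = {"id": "policy-123", "denied_tools": ["shell.exec"]}
eval_config = derive(policy, Mode.EVALUATION)
print(eval_config)
```

Because both paths share `derive`, the policy-relevant parts of the evaluated config and the deployed config cannot drift apart, which is the property the evaluator relies on.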

With agent-client-sdk: evaluator drives the agent via its own CLI; agent uses the SDK normally; we observe externally.

With behavior-drift-monitor: evaluation runs establish the baseline tool-call distribution; the drift monitor compares production behavior against that baseline.
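To make the baseline idea concrete, the sketch below builds a tool-call distribution from a list of tool names and compares two distributions with total variation distance. The choice of metric is an assumption for illustration; the spec does not say which comparison the drift monitor uses, and the tool names are made up.

```python
from collections import Counter

def tool_distribution(tool_calls):
    """Return {tool: share of calls} for a sequence of tool names."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)

# Baseline from evaluation runs versus a window of production traffic.
baseline = tool_distribution(["search", "search", "read_doc", "search"])
production = tool_distribution(["search", "read_doc", "send_email", "send_email"])
drift = total_variation(baseline, production)
print(drift)
```

Here the production window uses `send_email`, a tool that never appears in the baseline, which is the kind of shift such a comparison is meant to surface.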

Open questions:

  • (Q) Should evaluation be a gate (block deploy on failure) or advisory (report and let operator decide)? Lean: configurable per cascade level. Org admins can require eval-pass at org floor for high-risk functions; advisory by default.
  • (Q) Probe-author SDK for customers who want to add custom probes? Lean: ship core suite v1; custom-probe SDK in v2.

Out of scope:

  • Quality / hallucination benchmarking beyond what's needed for policy conformance: a separate concern. This evaluator checks whether hallucinations cross policy boundaries, not whether they're factually accurate.
  • Model fine-tuning to fix probe failures: orthogonal.
  • Continuous in-production red-teaming: would extend this component into a runtime mode; v2.
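Returning to the first open question, the "configurable per cascade level" lean could be resolved with logic like the sketch below. It assumes floor semantics in which a lower level may tighten the mode but never loosen it below the org floor; the function name and the idea of passing lower-level overrides in order are hypothetical.

```python
ADVISORY, GATE = "advisory", "gate"
STRICTNESS = {ADVISORY: 0, GATE: 1}

def effective_mode(org_floor=None, overrides=()):
    """Resolve the eval mode: advisory by default, never weaker than the org floor."""
    mode = org_floor or ADVISORY
    for requested in overrides:  # hypothetical lower cascade levels, in order
        # A lower level can only raise strictness, never drop below the floor.
        if requested is not None and STRICTNESS[requested] > STRICTNESS[mode]:
            mode = requested
    return mode

print(effective_mode())                                       # no configuration anywhere
print(effective_mode(org_floor=GATE, overrides=[ADVISORY]))   # floor wins over a looser override
print(effective_mode(overrides=[None, GATE]))                 # a lower level opts in to gating
```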