# Debugging
## The golden rule

The audit trail is ground truth. If you want to know what happened to a request, start there. Logs are useful; Grafana is useful; the audit trail is the source of truth.
## Local debugging

### Getting into a running test

```shell
pytest --pdb tests/integration/test_pipeline.py::test_pii_block
```

Drops into pdb on failure. Use `!` in pdb to run Python, `u`/`d` to walk frames.
For async code, `breakpoint()` in the test itself works with recent Python + asyncio.
### Attaching to the running server

```shell
# Start the server with debugpy
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client -m tappass.main
```

Then attach VS Code / Cursor with a `launch.json` entry that connects on port 5678.

We keep a template `launch.json` in `.vscode/` inside `tappass/`.
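If the template isn't there yet, a minimal attach entry looks roughly like this (this is the standard VS Code debugpy attach shape, not anything tappass-specific):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach to tappass",
      "type": "debugpy",
      "request": "attach",
      "connect": { "host": "localhost", "port": 5678 },
      "justMyCode": false
    }
  ]
}
```

`justMyCode: false` lets you step into library code, which you usually want when chasing a pipeline bug.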
### Seeing what the pipeline did

A local server writes every audit event to `data/audit.jsonl` as well as the database — tail it:

```shell
tail -f data/audit.jsonl | jq '{ts, event_kind, agent: .agent_id, detections: [.detections[]?.category], verdict: .policy_result.verdict}'
```

Handy for sanity-checking a fresh step change without opening a DB client.
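If jq isn't handy, the same projection is a few lines of Python. The field names below are the ones from the jq filter above; the sample event is made up for illustration:

```python
import json


def summarize(line: str) -> dict:
    """Project one audit event down to the fields we usually care about."""
    ev = json.loads(line)
    return {
        "ts": ev.get("ts"),
        "event_kind": ev.get("event_kind"),
        "agent": ev.get("agent_id"),
        "detections": [d.get("category") for d in ev.get("detections") or []],
        "verdict": (ev.get("policy_result") or {}).get("verdict"),
    }


sample = (
    '{"ts": "2024-01-01T00:00:00Z", "event_kind": "decision", '
    '"agent_id": "agent_42", "detections": [{"category": "pii"}], '
    '"policy_result": {"verdict": "block"}}'
)
print(summarize(sample))
```

Wrap it in a `for line in sys.stdin` loop and pipe `tail -f data/audit.jsonl` into it for a live view.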
### Hitting your local server from a real SDK

```shell
export TAPPASS_URL=http://localhost:9620
export TAPPASS_API_KEY=tp_dev_alice
python -c "import os; from tappass import Agent; print(Agent(os.environ['TAPPASS_URL'], os.environ['TAPPASS_API_KEY']).chat('hi').content)"
```

`make seed-dev-keys` pre-populates `tp_dev_alice`, `tp_dev_bob`, etc. into the local DB.
### The “why is this flag being ignored?” debug

Flags flow in via header → `identity/api_key.py` → `PipelineContext.flags`. If a flag isn’t taking effect:

- Log the final `ctx.flags` at engine entry: set `TAPPASS_LOG_LEVEL=DEBUG`.
- Check the step actually reads the flag; many steps only read config, not per-request flags. Flag mapping lives in `pipeline/flag_binding.py`.
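The second bullet is the usual culprit. A purely illustrative sketch of the difference — the step classes and field names here are hypothetical, not tappass's real step API:

```python
from dataclasses import dataclass, field


@dataclass
class PipelineContext:
    """Hypothetical stand-in for the real context: per-request flags live here."""
    user_message: str = ""
    flags: dict = field(default_factory=dict)


class RedactStep:
    """Reads only static config, so per-request flags are silently ignored."""

    def __init__(self, config: dict):
        self.config = config

    def enabled(self, ctx: PipelineContext) -> bool:
        return self.config.get("redact", False)  # ctx.flags never consulted


class FlagAwareRedactStep(RedactStep):
    """The fix: a per-request flag overrides the static config default."""

    def enabled(self, ctx: PipelineContext) -> bool:
        return ctx.flags.get("redact", self.config.get("redact", False))
```

A step shaped like `RedactStep` is exactly the case where `ctx.flags` looks right in the DEBUG log yet nothing changes.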
## Production debugging

No SSH. No pdb. You have:

- Grafana for metrics
- Cloud Run logs via `gcloud logging tail "resource.type=cloud_run_revision"`
- Audit trail queries against the prod Postgres read replica
- PagerDuty incident threads for the timeline
### Reading Cloud Run logs

```shell
gcloud logging tail \
  "resource.type=cloud_run_revision AND resource.labels.service_name=tappass" \
  --project=tappass-prod-eu-west1 \
  --format="value(timestamp, jsonPayload.level, jsonPayload.message)"
```

Filter further:

```shell
# Only errors
gcloud logging read \
  'resource.type=cloud_run_revision AND severity>=ERROR AND resource.labels.service_name=tappass' \
  --project=tappass-prod-eu-west1 --limit=100

# Specific audit_id
gcloud logging read \
  'resource.type=cloud_run_revision AND jsonPayload.audit_id="ae_01JC..."' \
  --project=tappass-prod-eu-west1
```

Every log line is JSON. Keys we always emit: `audit_id`, `tenant_id`, `agent_id`, `request_id`, `level`, `message`.
### Querying the prod audit trail (read-only)

Use the read replica. Connect via Cloud SQL Proxy:

```shell
cloud-sql-proxy \
  --address 127.0.0.1 \
  --port 5433 \
  tappass-prod-eu-west1:europe-west1:tappass-prod-pg-replica &

psql -h 127.0.0.1 -p 5433 -U tappass-ro -d tappass
```

Your `tappass-ro` role is scoped read-only to the replica. You can’t mutate even if you try — the replica is streaming.
Common queries live in Data model → Useful read-only queries.
### Finding why a specific call was blocked

```sql
SELECT audit_id, ts, detections, policy_result, agent_id
FROM audit_events
WHERE request_id = :the_request_id;
```

`request_id` is what the customer saw in the response’s `X-Request-Id` header. Paste that into the query → you see exactly which step flagged it and which policy verdict fired.
### When Grafana says “latency spike”

- Drill into the “Pipeline Steps” Grafana dashboard — which step is slow?
- If it’s a detection backend: check its health endpoint (`/backends/<name>/health`)
- If it’s `call_llm`: check the upstream provider status page
- If it’s `audit_write`: Postgres is the bottleneck — check the Postgres dashboard
### The “I think it’s the CDN” check

```shell
# What did Cloudflare actually send?
curl -sI https://app.tappass.ai/v1/chat/completions   # watch cf-cache-status

# Bypass the edge
curl --resolve app.tappass.ai:443:<CF_ORIGIN_IP> https://app.tappass.ai/...
```

## Profiling

```shell
# Local profiling of a single test
pytest -k test_big_pipeline --profile --profile-svg
```

Generates `prof/combined.svg`. It opens as a flame graph.
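If the profiling plugin isn't installed, stdlib `cProfile` gets you a quick text view of any hot function. A minimal sketch (`slow_sum` is just a stand-in for whatever step you suspect):

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top: int = 10) -> str:
    """Run fn under cProfile and return the top entries by cumulative time."""
    prof = cProfile.Profile()
    prof.runcall(fn, *args)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


def slow_sum(n: int) -> int:
    """Stand-in for a suspect pipeline step."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(profile_call(slow_sum, 100_000))
```

No flame graph, but it answers "where does the time go?" without installing anything.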
### Async blocking

The hot path is async. If a step goes CPU-bound, it blocks the event loop.

```python
async def run(self, ctx, config):
    # BAD — blocks the loop
    result = expensive_cpu_thing(ctx.user_message)

    # GOOD — offload
    result = await asyncio.to_thread(expensive_cpu_thing, ctx.user_message)
```

Telltale sign in Grafana: `event_loop_lag` climbs, but individual step `duration_ms` stays flat.
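A self-contained way to see the difference. Nothing below is tappass code; it just demonstrates that `asyncio.to_thread` keeps the loop responsive while CPU work runs:

```python
import asyncio
import time


def expensive_cpu_thing(n: int) -> int:
    """Stand-in for CPU-bound work, e.g. a heavy regex or tokenizer pass."""
    return sum(i * i for i in range(n))


async def heartbeat(interval: float, ticks: list) -> None:
    """Append a timestamp every `interval` seconds: a proxy for loop responsiveness."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> int:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(0.01, ticks))
    # Offloaded to a worker thread, so the heartbeat keeps ticking meanwhile.
    await asyncio.to_thread(expensive_cpu_thing, 2_000_000)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return len(ticks)  # multiple ticks means the loop stayed responsive


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Replace `to_thread` with a plain call and the tick count collapses to roughly one: that is exactly the `event_loop_lag` climb the Grafana panel shows.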
## When you’re stuck

In order:

1. Read the last 100 lines of logs for the affected `request_id`
2. Read the audit event
3. Read the test that covers this path (write one if it doesn’t exist)
4. Re-read the relevant step’s code, top to bottom
5. `git log -p` the file to see what changed recently

Then ask in `#eng` with a specific `audit_id` and what you’ve ruled out.
Don’t ask “why is X broken?” without these five steps. Half the time, step 2 answers it.