
Debugging

The audit trail is ground truth. If you want to know what happened to a request, start there. Logs and Grafana are useful corroboration; the audit trail is authoritative.

Terminal window
pytest --pdb tests/integration/test_pipeline.py::test_pii_block

This drops into pdb at the point of failure. In pdb, use ! to run arbitrary Python, and u/d to walk up and down the stack frames.

For async code, breakpoint() in the test itself works with recent Python + asyncio.

Terminal window
# Start the server with debugpy
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client -m tappass.main

Then attach VS Code / Cursor with a launch.json entry for connect on port 5678.

We keep a template launch.json in .vscode/ inside tappass/.
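If the template is missing, an attach entry looks roughly like this (a sketch; the port must match the --listen flag above):

```json
{
  "name": "Attach to tappass",
  "type": "debugpy",
  "request": "attach",
  "connect": { "host": "localhost", "port": 5678 }
}
```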

A local server writes every audit event to data/audit.jsonl as well as the database — tail it:

Terminal window
tail -f data/audit.jsonl | jq '{ts, event_kind, agent: .agent_id, detections: [.detections[]?.category], verdict: .policy_result.verdict}'

Handy for sanity-checking a fresh step change without opening a DB client.

Terminal window
export TAPPASS_URL=http://localhost:9620
export TAPPASS_API_KEY=tp_dev_alice
python -c "import os; from tappass import Agent; print(Agent(os.environ['TAPPASS_URL'], os.environ['TAPPASS_API_KEY']).chat('hi').content)"

make seed-dev-keys pre-populates tp_dev_alice, tp_dev_bob, etc. into the local DB.

The “why is this flag being ignored?” debug


Flags flow in via header → identity/api_key.py → PipelineContext.flags. If a flag isn’t taking effect:

  1. Log the final ctx.flags at engine entry: set TAPPASS_LOG_LEVEL=DEBUG.
  2. Check the step actually reads the flag; many steps only read config, not per-request flags. Flag mapping lives in pipeline/flag_binding.py.

In production there is no SSH and no pdb. You have:

  • Grafana for metrics
  • Cloud Run logs via gcloud logging tail "resource.type=cloud_run_revision"
  • Audit trail queries against prod Postgres read replica
  • PagerDuty incident threads for timeline
Terminal window
gcloud logging tail \
"resource.type=cloud_run_revision AND resource.labels.service_name=tappass" \
--project=tappass-prod-eu-west1 \
--format="value(timestamp, jsonPayload.level, jsonPayload.message)"

Filter further:

Terminal window
# Only errors
gcloud logging read \
'resource.type=cloud_run_revision AND severity>=ERROR AND resource.labels.service_name=tappass' \
--project=tappass-prod-eu-west1 --limit=100
# Specific audit_id
gcloud logging read \
'resource.type=cloud_run_revision AND jsonPayload.audit_id="ae_01JC..."' \
--project=tappass-prod-eu-west1

Every log line is JSON. Keys we always emit: audit_id, tenant_id, agent_id, request_id, level, message.

Use the read replica. Connect via Cloud SQL Proxy:

Terminal window
cloud-sql-proxy \
--address 127.0.0.1 \
--port 5433 \
tappass-prod-eu-west1:europe-west1:tappass-prod-pg-replica &
psql -h 127.0.0.1 -p 5433 -U tappass-ro -d tappass

Your tappass-ro role is scoped read-only to the replica. You can’t mutate even if you try — the replica is streaming.

Common queries live in Data model → Useful read-only queries.

SELECT audit_id, ts, detections, policy_result, agent_id
FROM audit_events
WHERE request_id = :the_request_id;

request_id is what the customer saw in the response’s X-Request-Id header. Paste that into the query → you see exactly which step flagged it and which policy verdict fired.

  1. Drill into the Grafana dashboard “Pipeline Steps” — which step is slow?
  2. If it’s a detection backend: check its health endpoint (/backends/<name>/health)
  3. If it’s call_llm: check the upstream provider status page
  4. If it’s audit_write: Postgres is the bottleneck — check the Postgres dashboard
Terminal window
# What did Cloudflare actually send?
curl -sI https://app.tappass.ai/v1/chat/completions # watch cf-cache-status
# Bypass the edge
curl --resolve app.tappass.ai:443:<CF_ORIGIN_IP> https://app.tappass.ai/...
Terminal window
# Local profiling of a single test
pytest -k test_big_pipeline --profile --profile-svg

Generates prof/combined.svg. Opens as a flame graph.

The hot path is async. If a step goes CPU-bound, it blocks the event loop.

src/tappass/pipeline/steps/slow_step.py
async def run(self, ctx, config):
    # BAD — blocks the loop
    result = expensive_cpu_thing(ctx.user_message)

    # GOOD — offload
    result = await asyncio.to_thread(expensive_cpu_thing, ctx.user_message)

Telltale sign in Grafana: event_loop_lag climbs, but individual step duration_ms stays flat.

In order:

  1. Read the last 100 lines of logs for the affected request_id
  2. Read the audit event
  3. Read the test that covers this path (write one if it doesn’t exist)
  4. Re-read the relevant step’s code, top to bottom
  5. git log -p the file to see what changed recently
  6. Ask in #eng with a specific audit_id and what you’ve ruled out

Don’t ask “why is X broken?” in #eng without working through steps 1–5 first. Half the time, step 2 answers it.