# Debugging
## The golden rule

The audit trail is ground truth. If you want to know what happened to a request, start there. Logs are useful; Grafana is useful; the audit trail is the source of truth.
## Local debugging

### Getting into a running test

```shell
pytest --pdb tests/integration/test_pipeline.py::test_pii_block
```

Drops into pdb on failure. Use `!` in pdb to run Python, `u`/`d` to walk frames.
For async code, `breakpoint()` in the test itself works with recent Python + asyncio.
### Attaching to the running server

```shell
# Start the server with debugpy
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client -m tappass.main
```

Then attach VS Code / Cursor with a `launch.json` entry that connects on port 5678.

We keep a template `launch.json` in `.vscode/` inside `tappass/`.
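If the template isn't there yet, a minimal attach entry looks roughly like this (this is the standard VS Code debugpy attach shape, not anything tappass-specific):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach to tappass",
      "type": "debugpy",
      "request": "attach",
      "connect": { "host": "localhost", "port": 5678 },
      "justMyCode": false
    }
  ]
}
```

`justMyCode: false` lets you step into library code, which you usually want when chasing a pipeline bug.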
### Seeing what the pipeline did

A local server writes every audit event to `data/audit.jsonl` as well as the database — tail it:

```shell
tail -f data/audit.jsonl | jq '{ts, event_kind, agent: .agent_id, detections: [.detections[]?.category], verdict: .policy_result.verdict}'
```

Handy for sanity-checking a fresh step change without opening a DB client.
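If jq isn't handy, the same projection is a few lines of Python. The field names below are the ones from the jq filter above; the sample event is made up for illustration:

```python
import json


def summarize(line: str) -> dict:
    """Project one audit event down to the fields we usually care about."""
    ev = json.loads(line)
    return {
        "ts": ev.get("ts"),
        "event_kind": ev.get("event_kind"),
        "agent": ev.get("agent_id"),
        "detections": [d.get("category") for d in ev.get("detections") or []],
        "verdict": (ev.get("policy_result") or {}).get("verdict"),
    }


sample = (
    '{"ts": "2024-01-01T00:00:00Z", "event_kind": "decision", '
    '"agent_id": "agent_42", "detections": [{"category": "pii"}], '
    '"policy_result": {"verdict": "block"}}'
)
print(summarize(sample))
```

Wrap it in a `for line in sys.stdin` loop and pipe `tail -f data/audit.jsonl` into it for a live view.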
### Hitting your local server from a real SDK

```shell
export TAPPASS_URL=http://localhost:9620
export TAPPASS_API_KEY=tp_dev_alice
python -c "import os; from tappass import Agent; print(Agent(os.environ['TAPPASS_URL'], os.environ['TAPPASS_API_KEY']).chat('hi').content)"
```

`make seed-dev-keys` pre-populates `tp_dev_alice`, `tp_dev_bob`, etc. into the local DB.
### The “why is this flag being ignored?” debug

Flags flow in via header → `identity/api_key.py` → `PipelineContext.flags`. If a flag isn’t taking effect:

- Log the final `ctx.flags` at engine entry: set `TAPPASS_LOG_LEVEL=DEBUG`.
- Check the step actually reads the flag; many steps only read config, not per-request flags. Flag mapping lives in `pipeline/flag_binding.py`.
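The second bullet is the usual culprit. A purely illustrative sketch of the difference — the step classes and field names here are hypothetical, not tappass's real step API:

```python
from dataclasses import dataclass, field


@dataclass
class PipelineContext:
    """Hypothetical stand-in for the real context: per-request flags live here."""
    user_message: str = ""
    flags: dict = field(default_factory=dict)


class RedactStep:
    """Reads only static config, so per-request flags are silently ignored."""

    def __init__(self, config: dict):
        self.config = config

    def enabled(self, ctx: PipelineContext) -> bool:
        return self.config.get("redact", False)  # ctx.flags never consulted


class FlagAwareRedactStep(RedactStep):
    """The fix: a per-request flag overrides the static config default."""

    def enabled(self, ctx: PipelineContext) -> bool:
        return ctx.flags.get("redact", self.config.get("redact", False))
```

A step shaped like `RedactStep` is exactly the case where `ctx.flags` looks right in the DEBUG log yet nothing changes.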
## Production debugging

No SSH. No pdb. You have:

- Grafana for metrics
- Cloud Run logs via `gcloud logging tail "resource.type=cloud_run_revision"`
- Audit trail queries against the prod Postgres read replica
- PagerDuty incident threads for the timeline
### Reading Cloud Run logs

```shell
gcloud logging tail \
  "resource.type=cloud_run_revision AND resource.labels.service_name=tappass" \
  --project=tappass-prod-eu-west1 \
  --format="value(timestamp, jsonPayload.level, jsonPayload.message)"
```

Filter further:

```shell
# Only errors
gcloud logging read \
  'resource.type=cloud_run_revision AND severity>=ERROR AND resource.labels.service_name=tappass' \
  --project=tappass-prod-eu-west1 --limit=100

# Specific audit_id
gcloud logging read \
  'resource.type=cloud_run_revision AND jsonPayload.audit_id="ae_01JC..."' \
  --project=tappass-prod-eu-west1
```

Every log line is JSON. Keys we always emit: `audit_id`, `tenant_id`, `agent_id`, `request_id`, `level`, `message`.
### Querying the prod audit trail (read-only)

Use the read replica. Connect via Cloud SQL Proxy:

```shell
cloud-sql-proxy \
  --address 127.0.0.1 \
  --port 5433 \
  tappass-prod-eu-west1:europe-west1:tappass-prod-pg-replica &

psql -h 127.0.0.1 -p 5433 -U tappass-ro -d tappass
```

Your `tappass-ro` role is scoped read-only to the replica. You can’t mutate even if you try — the replica is streaming.
Common queries live in Data model → Useful read-only queries.
### Finding why a specific call was blocked

```sql
SELECT audit_id, ts, detections, policy_result, agent_id
FROM audit_events
WHERE request_id = :the_request_id;
```

`request_id` is what the customer saw in the response’s `X-Request-Id` header. Paste that into the query → you see exactly which step flagged it and which policy verdict fired.
### When Grafana says “latency spike”

- Drill into the “Pipeline Steps” Grafana dashboard — which step is slow?
- If it’s a detection backend: check its health endpoint (`/backends/<name>/health`)
- If it’s `call_llm`: check the upstream provider status page
- If it’s `audit_write`: Postgres is the bottleneck — check the Postgres dashboard
### The “I think it’s the CDN” check

```shell
# What did Cloudflare actually send?
curl -sI https://app.tappass.ai/v1/chat/completions   # watch cf-cache-status

# Bypass the edge
curl --resolve app.tappass.ai:443:<CF_ORIGIN_IP> https://app.tappass.ai/...
```

## Profiling

```shell
# Local profiling of a single test
pytest -k test_big_pipeline --profile --profile-svg
```

Generates `prof/combined.svg`. It opens as a flame graph.
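If the profiling plugin isn't installed, stdlib `cProfile` gets you a quick text view of any hot function. A minimal sketch (`slow_sum` is just a stand-in for whatever step you suspect):

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top: int = 10) -> str:
    """Run fn under cProfile and return the top entries by cumulative time."""
    prof = cProfile.Profile()
    prof.runcall(fn, *args)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


def slow_sum(n: int) -> int:
    """Stand-in for a suspect pipeline step."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(profile_call(slow_sum, 100_000))
```

No flame graph, but it answers "where does the time go?" without installing anything.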
### Async blocking

The hot path is async. If a step goes CPU-bound, it blocks the event loop.

```python
async def run(self, ctx, config):
    # BAD — blocks the loop
    result = expensive_cpu_thing(ctx.user_message)

    # GOOD — offload
    result = await asyncio.to_thread(expensive_cpu_thing, ctx.user_message)
```

Telltale sign in Grafana: `event_loop_lag` climbs, but individual step `duration_ms` stays flat.
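A self-contained way to see the difference. Nothing below is tappass code; it just demonstrates that `asyncio.to_thread` keeps the loop responsive while CPU work runs:

```python
import asyncio
import time


def expensive_cpu_thing(n: int) -> int:
    """Stand-in for CPU-bound work, e.g. a heavy regex or tokenizer pass."""
    return sum(i * i for i in range(n))


async def heartbeat(interval: float, ticks: list) -> None:
    """Append a timestamp every `interval` seconds: a proxy for loop responsiveness."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> int:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(0.01, ticks))
    # Offloaded to a worker thread, so the heartbeat keeps ticking meanwhile.
    await asyncio.to_thread(expensive_cpu_thing, 2_000_000)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return len(ticks)  # multiple ticks means the loop stayed responsive


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Replace `to_thread` with a plain call and the tick count collapses to roughly one: that is exactly the `event_loop_lag` climb the Grafana panel shows.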
## When you’re stuck

In order:

1. Read the last 100 lines of logs for the affected `request_id`
2. Read the audit event
3. Read the test that covers this path (write one if it doesn’t exist)
4. Re-read the relevant step’s code, top to bottom
5. `git log -p` the file to see what changed recently

Then ask in `#eng` with a specific `audit_id` and what you’ve ruled out.
Don’t ask “why is X broken?” without these five steps. Half the time, step 2 answers it.