
Testing

Core server tests live in tappass/tests/. SDK tests live in tappass-sdk/tests/. Same philosophy across both.

Integration tests with real dependencies beat mocked unit tests for anything on the request path.

We were burned once by mocks — the tests passed, prod broke. Rule since then: if your change touches Postgres, an external API, or the pipeline, there must be an integration test that hits a real Postgres or a recorded/fake HTTP layer. See this feedback memory for the incident.

| Layer | Tool | Runs against | When to write |
| --- | --- | --- | --- |
| Unit | pytest | Pure Python, no I/O | Domain logic, helpers, pure functions |
| Integration | pytest + real Postgres + httpx.MockTransport for upstreams | Real DB, fake HTTP | API routes, pipeline, vault, audit |
| Contract | pytest + recorded upstream responses | Real provider APIs (periodic, not on every PR) | Provider client changes |
| E2E | Playwright | Full server + frontend in staging | Release candidate smoke |

CI runs unit and integration tests on every PR. Contract and E2E run nightly against staging.

```sh
# All tests (from tappass/ repo root)
pytest

# Fast subset (unit only, no DB)
pytest -m "not integration"

# Single file
pytest tests/unit/pipeline/steps/test_detect_pii.py

# Drop into pdb on the first failure
pytest -x --pdb

# With coverage
pytest --cov=src/tappass --cov-report=term-missing

# Show the slowest 20 tests
pytest --durations=20
```

Integration tests need a real Postgres. The test suite spins one up per session via testcontainers:

tests/conftest.py

```python
from collections.abc import Iterator

import pytest
from testcontainers.postgres import PostgresContainer


@pytest.fixture(scope="session")
def pg_container() -> Iterator[PostgresContainer]:
    with PostgresContainer("postgres:15") as pg:
        yield pg
```

On first run this pulls the Postgres image (~200 MB). After that, tests reuse the container for the session.

Per-test isolation happens via transaction rollback, not DB drop:

```python
from collections.abc import AsyncIterator

import pytest
from sqlalchemy.ext.asyncio import AsyncSession


@pytest.fixture
async def db(pg_session: AsyncSession) -> AsyncIterator[AsyncSession]:
    tx = await pg_session.begin()
    try:
        yield pg_session  # hand the test the session, not the transaction
    finally:
        await tx.rollback()  # undo everything the test wrote
```

So a test cannot “leak” state into the next test. Don’t fight this — use factories, not shared state.

Factories (no fixtures-by-name-everywhere)


We use factory-boy with async support. One factory per model:

tests/factories.py

```python
from factory import Faker, LazyAttribute, SubFactory

# AsyncSQLAlchemyFactory comes from the project's async factory-boy integration.


class TenantFactory(AsyncSQLAlchemyFactory):
    class Meta:
        model = Tenant

    name = Faker("company")
    slug = LazyAttribute(lambda o: slugify(o.name))


class AgentFactory(AsyncSQLAlchemyFactory):
    class Meta:
        model = Agent

    tenant = SubFactory(TenantFactory)
    label = Faker("slug")
```

In a test:

```python
async def test_something(db):
    agent = await AgentFactory.create(session=db)
    # agent is a real row with real relationships
```
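The LazyAttribute above derives slug from name via a slugify helper. A minimal stand-in for that helper (the real one may differ; this is only to show the derivation):

```python
import re

def slugify(name: str) -> str:
    # Minimal stand-in for the project's slugify helper (assumption):
    # lowercase, collapse non-alphanumeric runs to hyphens, trim the edges.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

slug = slugify("Acme Corp, Inc.")  # → "acme-corp-inc"
```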

Never import unittest.mock. Use httpx.MockTransport:

```python
from httpx import AsyncClient, MockTransport, Response


def _handler(request):
    if request.url.path == "/v1/chat/completions":
        return Response(200, json={"choices": [{"message": {"content": "hello"}}]})
    return Response(404)


async def test_openai_client_retries():
    transport = MockTransport(_handler)
    async with AsyncClient(transport=transport) as http:
        client = OpenAIClient(http=http, base_url="https://api.openai.com")
        reply = await client.chat(...)
        assert reply == "hello"
```

This lets you assert on requests and simulate 500s, timeouts, and partial streaming, all without patching.

```python
@pytest.mark.integration      # requires real Postgres
@pytest.mark.slow             # takes > 1s; excluded by default in pre-commit
@pytest.mark.contract         # runs against a live upstream; nightly only
@pytest.mark.flaky(reruns=3)  # for tests that are genuinely flaky by design (rare)
```

A default pytest run includes integration tests but not contract tests. Pre-commit runs pytest -m "not slow".
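If the markers are registered in conftest.py (one plausible setup; the project may register them in pyproject.toml instead), the registration looks roughly like this:

```python
MARKERS = [
    "integration: requires real Postgres",
    "slow: takes > 1s; excluded by pre-commit",
    "contract: hits live upstreams; nightly only",
]

def pytest_configure(config):
    # pytest calls this hook at startup; registering markers here
    # silences PytestUnknownMarkWarning for custom marks.
    for line in MARKERS:
        config.addinivalue_line("markers", line)

# Tiny stand-in config so the hook can be exercised without pytest itself:
class _FakeConfig:
    def __init__(self):
        self.registered = []
    def addinivalue_line(self, name, line):
        self.registered.append((name, line))

cfg = _FakeConfig()
pytest_configure(cfg)
```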

For pipeline steps (see Pipeline step anatomy):

  1. Unit: given a fake backend, the step returns the right Detection[]
  2. Unit: on_detection=redact actually modifies ctx.user_message
  3. Integration: POST to /v1/chat/completions with a triggering payload → event lands in audit_events with the expected detections
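Points 1 and 2 of that checklist can be sketched as follows; Detection, the ctx shape, and the step body are hypothetical stand-ins for the real interfaces, kept only to show the test shape:

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    kind: str
    span: tuple[int, int]


@dataclass
class Ctx:
    user_message: str
    detections: list[Detection] = field(default_factory=list)


def detect_pii_step(ctx: Ctx, backend) -> Ctx:
    # Unit-test shape: backend is a fake; the step records detections
    # and, in "redact" mode, rewrites the offending span in place.
    for kind, start, end in backend(ctx.user_message):
        ctx.detections.append(Detection(kind, (start, end)))
        ctx.user_message = ctx.user_message[:start] + "[REDACTED]" + ctx.user_message[end:]
    return ctx


# Fake backend: reports one email detection at a hardcoded span for this message.
fake_backend = lambda text: [("email", 6, 21)] if "@" in text else []
ctx = detect_pii_step(Ctx("mail: bob@example.com"), fake_backend)
```

The point is that both assertions, "right Detection list" and "ctx.user_message actually modified", run against a pure function with no I/O.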

For API routes:

  1. Happy path: 200 with expected shape
  2. Auth: 401 with no key, 403 with wrong tenant’s key
  3. Validation: 422 on malformed payload
  4. Idempotency: two identical requests produce one audit event (if the route is idempotent)
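Point 4, the idempotency property, reduces to "identical payloads map to one audit event". A dependency-free sketch of that invariant (all names are illustrative, not the real audit code):

```python
import hashlib
import json

audit_events: dict[str, dict] = {}

def record_idempotent(payload: dict) -> str:
    # Key the event by a canonical hash of the payload, so two
    # byte-identical requests collapse into one audit event.
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    audit_events.setdefault(key, {"payload": payload})
    return key

k1 = record_idempotent({"agent": "a1", "action": "rotate"})
k2 = record_idempotent({"action": "rotate", "agent": "a1"})  # same payload, reordered
```

sort_keys=True makes the serialization canonical, so key order in the incoming dict does not defeat the dedup.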

For vault providers:

  1. Round-trip: set then get returns what we stored
  2. Encryption: raw Postgres row never contains the plaintext
  3. Versioning: set twice, old version still retrievable by version id
  4. Deletion: deleted secret is unreadable, audit event recorded
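The first three properties can be exercised against an in-memory sketch of the provider contract (this is an illustration of what the tests assert, not the real provider, which encrypts and writes audit events):

```python
from __future__ import annotations


class InMemoryVault:
    """Illustrative in-memory sketch of the provider contract under test
    (round-trip, versioning, deletion)."""

    def __init__(self):
        self._store: dict[str, list[bytes]] = {}

    def set(self, name: str, value: str) -> int:
        versions = self._store.setdefault(name, [])
        versions.append(value.encode())  # a real provider encrypts here
        return len(versions)             # 1-based version id

    def get(self, name: str, version: int | None = None) -> str:
        versions = self._store[name]
        return (versions[version - 1] if version else versions[-1]).decode()

    def delete(self, name: str) -> None:
        self._store.pop(name)  # a real provider also records an audit event


vault = InMemoryVault()
v1 = vault.set("api_key", "old")
vault.set("api_key", "new")
current = vault.get("api_key")          # latest version
old = vault.get("api_key", version=v1)  # old version still retrievable
```

The encryption property (point 2) is the one that genuinely needs the integration layer: it asserts on the raw Postgres row, which no in-memory fake can stand in for.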

We use syrupy for snapshot tests on API response shapes. When you intentionally change a response:

```sh
pytest --snapshot-update
```

Review the diff carefully before committing. A snapshot diff in a PR without accompanying test assertions is a red flag — it means the contract changed silently.

We tolerate zero flakes in main. If a test is flaky:

  1. Skip it with @pytest.mark.skip(reason="flaky, see #issue") — open an issue
  2. Do not sprinkle retries; fix the root cause
  3. Target: flakes fixed or removed within two weeks

Test data hygiene:

  • No real PII in fixtures — use faker
  • No real API keys anywhere in tests/ — tp_test_xxx placeholders are fine; live keys never
  • No production database dumps for seeding — use factories

gitleaks scans tests/ too; a committed live key will block CI.

Target: > 85% on new code in a PR (per-file, not project-wide). Existing low-coverage files can slide, but any file you touch must meet the target.

Coverage is reported as a CI comment by the codecov action. A drop of more than 2% blocks merge.

tappass-sdk/ tests are trickier because the SDK is what customers use — we can’t break the public surface. Two extra layers:

  1. API compatibility tests — every method of the Agent class has a test that asserts its signature. Adding a required argument fails the suite.
  2. Version pinning in examples — tappass-examples/ pins the SDK version. A PR that changes SDK behaviour must update the example, or CI in the examples repo fails.
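A signature-compatibility test of that sort can be sketched with the stdlib inspect module; the Agent method shown is a hypothetical stand-in for the SDK's public surface:

```python
import inspect


class Agent:
    # Hypothetical stand-in for the SDK's public Agent class.
    def chat(self, message: str, *, stream: bool = False) -> str: ...


def required_params(fn) -> list[str]:
    # Parameters with no default value; adding one of these to a
    # public method is exactly the break the suite should catch.
    return [
        name
        for name, p in inspect.signature(fn).parameters.items()
        if name != "self"
        and p.default is inspect.Parameter.empty
        and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    ]


chat_required = required_params(Agent.chat)  # a new required arg changes this list
```

Asserting `required_params(Agent.chat) == ["message"]` in the suite means any PR that adds a required argument fails immediately, while adding an optional keyword argument stays green.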