Database migration

TapPass migrations are plain .sql files in tappass/deploy/migrations/ numbered NNN_short_name.sql, applied via a Cloud Run Job (tappass-migrate) that runs the shared runner.sh. Terraform uploads the migration files to GCS, then a null_resource triggers the job — idempotent by schema_migrations tracking.

Golden rules

Always write forward-only migrations for prod. Rollback is hard once customer data has flowed through a new column — the safe recovery path is forward with a compensating migration, not DROP COLUMN.
Every migration is wrapped in BEGIN/COMMIT by the runner. If yours needs a different transaction shape (e.g. CREATE INDEX CONCURRENTLY which can't run in a transaction), split it out and flag it in the PR description.
Stage first, prod second. The staging DB is your canary.

Author a migration

1. Pick the next number

cd ~/tappass/tappass/deploy/migrations
ls [0-9]*.sql | tail -3
# Highest-numbered file + 1

2. Write the SQL

Name the file NNN_short_describe_change.sql. Keep it focused on a single change — mixing two migrations into one file makes partial rollback impossible.

-- deploy/migrations/036_agents_last_seen_index.sql
-- Why: /api/agents list spent 300ms on a full-table scan of
-- agents.last_seen; add a btree index so it drops to <10ms.

BEGIN;

CREATE INDEX CONCURRENTLY IF NOT EXISTS agents_last_seen_idx
  ON agents (last_seen);

COMMIT;

3. Regenerate `schema.sql`

schema.sql is the "if this DB were empty, build it from scratch" snapshot. CI has a guard that fails if it's out of sync.

cd ~/tappass/tappass
./deploy/scripts/dump-schema.sh   # or equivalent per repo conventions
git diff deploy/migrations/schema.sql   # sanity-check the diff

4. Test locally against docker-compose

cd ~/tappass/tappass
docker compose up -d postgres
./deploy/scripts/apply-migrations.sh
# Expect: one "applied NNN_..." line. Re-run: zero lines (idempotent).

Open psql and confirm the schema change:

docker compose exec postgres psql -U tappass -d tappass -c '\d agents'

5. Unit / integration tests

cd ~/tappass/tappass
.venv/bin/python -m pytest tests/integration -q -k "<area-you-changed>"

Commit with a conventional-commit subject: db: add agents_last_seen_idx.

Apply to staging

The normal path — via terraform apply

Terraform hashes the migrations directory and re-triggers the job whenever a file changes:

cd ~/tappass/tappass/deploy/terraform/environments/gcp-staging
terraform plan -target=module.tappass.null_resource.run_migrations
terraform apply -target=module.tappass.null_resource.run_migrations

Watch the Cloud Run Job execution:

gcloud run jobs executions list --job=tappass-migrate \
  --project=tappass-staging --region=europe-west1 --limit=1

gcloud run jobs executions describe <execution-name> \
  --project=tappass-staging --region=europe-west1 --format='value(status.conditions)'

The manual path — direct job execution

Useful when terraform is blocked on unrelated drift.

gcloud run jobs execute tappass-migrate \
  --project=tappass-staging --region=europe-west1 --wait

The --wait flag blocks until the execution finishes and returns a non-zero exit code on failure, which is what you want in a terminal session.

Verify the migration landed

# Connect via Cloud SQL Auth Proxy or the GCP console
gcloud sql connect tappass-db --project=tappass-staging --user=tappass

# In psql:
# \d <changed_table>              — verify the schema
# SELECT version, applied_at, source FROM schema_migrations ORDER BY version DESC LIMIT 5;

Apply to prod

Only after staging has run green for at least 24 hours (so background jobs, uptime checks, and any customer traffic have exercised the change).

cd ~/tappass/tappass/deploy/terraform/environments/gcp-prod
terraform plan -target=module.tappass.null_resource.run_migrations
terraform apply -target=module.tappass.null_resource.run_migrations

Or, manually:

gcloud run jobs execute tappass-migrate \
  --project=tappass-prod --region=europe-west1 --wait

Never skip the --wait — if the job fails partway through a multi-statement migration, the schema_migrations table won't have the version recorded and the next run will retry cleanly. Silent failures without --wait hide this from you.

Rollback

Rollback is risky — the file you want to undo is tracked in schema_migrations and customers may have written data into a new column or relied on a new constraint.

For a failed migration (transaction rolled back)

If the Cloud Run Job failed inside BEGIN/COMMIT, Postgres has already rolled back — nothing to do. Fix the SQL, push a new file with an incremented number, re-run.

For a successful migration you now regret

Don't delete the file. Write a compensating migration:

-- deploy/migrations/037_revert_agents_last_seen_index.sql
-- Why: the 036 index caused write amplification under burst load;
-- restoring the pre-036 state with a forward-only compensation.

BEGIN;
DROP INDEX IF EXISTS agents_last_seen_idx;
COMMIT;

Forward migration through deletion > backward migration via file removal, because the file sequence in schema_migrations is the audit trail.

Data-destructive rollback

If you need to remove a column that has customer data:

Export the column to an audit log or GCS bucket first.
Get written sign-off from the customer success owner.
Apply the DROP COLUMN migration in a quiet traffic window.
Keep the export for the retention period required by the contract / DPA.

See Restore from backup if the rollback also requires restoring an earlier snapshot.

Common failure modes

Symptom	Cause	Fix
`terraform plan` shows `null_resource.run_migrations must be replaced`	Any file under `deploy/migrations/` changed	Expected — that's how we trigger the job
Job exits with `connection refused`	Cloud SQL Auth Proxy not attached to the job	Check `migrations.tf` — `cloud_sql_instances` annotation
`ERROR: relation "X" already exists`	Someone else applied the migration out of band	`psql` → manually `INSERT INTO schema_migrations` for that version
`schema.sql` staleness guard fails in CI	You forgot step 3 (regenerate schema.sql)	Regenerate and commit
Migration succeeds but app errors about a missing column at startup	App image doesn't contain code that uses the new column yet — deploy order wrong	Deploy code that can tolerate both states first, migrate, then deploy code that requires the new state. Deploy core server

Also see

Deploy core server — order-of-ops when a release couples schema + code changes.
Restore from backup — if a migration corrupts data.
Ops → Deployments — where the migration job fits in the overall CI/CD flow.