Database migration
TapPass migrations are plain .sql files in
tappass/deploy/migrations/ numbered NNN_short_name.sql, applied
via a Cloud Run Job (tappass-migrate) that runs the shared
runner.sh. Terraform uploads the migration files to GCS, then a
null_resource triggers the job — idempotent by schema_migrations
tracking.
Golden rules
- Always write forward-only migrations for prod. Rollback is
hard once customer data has flowed through a new column — the safe
recovery path is forward with a compensating migration, not
DROP COLUMN. - Every migration is wrapped in
BEGIN/COMMITby the runner. If yours needs a different transaction shape (e.g.CREATE INDEX CONCURRENTLYwhich can't run in a transaction), split it out and flag it in the PR description. - Stage first, prod second. The staging DB is your canary.
Author a migration
Section titled “Author a migration”1. Pick the next number
Section titled “1. Pick the next number”cd ~/tappass/tappass/deploy/migrationsls [0-9]*.sql | tail -3# Highest-numbered file + 12. Write the SQL
Section titled “2. Write the SQL”Name the file NNN_short_describe_change.sql. Keep it focused on a
single change — mixing two migrations into one file makes partial
rollback impossible.
-- deploy/migrations/036_agents_last_seen_index.sql-- Why: /api/agents list spent 300ms on a full-table scan of-- agents.last_seen; add a btree index so it drops to <10ms.
BEGIN;
CREATE INDEX CONCURRENTLY IF NOT EXISTS agents_last_seen_idx ON agents (last_seen);
COMMIT;3. Regenerate schema.sql
Section titled “3. Regenerate schema.sql”schema.sql is the "if this DB were empty, build it from scratch"
snapshot. CI has a guard that fails if it's out of sync.
cd ~/tappass/tappass./deploy/scripts/dump-schema.sh # or equivalent per repo conventionsgit diff deploy/migrations/schema.sql # sanity-check the diff4. Test locally against docker-compose
Section titled “4. Test locally against docker-compose”cd ~/tappass/tappassdocker compose up -d postgres./deploy/scripts/apply-migrations.sh# Expect: one "applied NNN_..." line. Re-run: zero lines (idempotent).Open psql and confirm the schema change:
docker compose exec postgres psql -U tappass -d tappass -c '\d agents'5. Unit / integration tests
Section titled “5. Unit / integration tests”cd ~/tappass/tappass.venv/bin/python -m pytest tests/integration -q -k "<area-you-changed>"Commit with a conventional-commit subject: db: add agents_last_seen_idx.
Apply to staging
Section titled “Apply to staging”The normal path — via terraform apply
Section titled “The normal path — via terraform apply”Terraform hashes the migrations directory and re-triggers the job whenever a file changes:
cd ~/tappass/tappass/deploy/terraform/environments/gcp-stagingterraform plan -target=module.tappass.null_resource.run_migrationsterraform apply -target=module.tappass.null_resource.run_migrationsWatch the Cloud Run Job execution:
gcloud run jobs executions list --job=tappass-migrate \ --project=tappass-staging --region=europe-west1 --limit=1
gcloud run jobs executions describe <execution-name> \ --project=tappass-staging --region=europe-west1 --format='value(status.conditions)'The manual path — direct job execution
Section titled “The manual path — direct job execution”Useful when terraform is blocked on unrelated drift.
gcloud run jobs execute tappass-migrate \ --project=tappass-staging --region=europe-west1 --waitThe --wait flag blocks until the execution finishes and returns a
non-zero exit code on failure, which is what you want in a terminal
session.
Verify the migration landed
Section titled “Verify the migration landed”# Connect via Cloud SQL Auth Proxy or the GCP consolegcloud sql connect tappass-db --project=tappass-staging --user=tappass
# In psql:# \d <changed_table> — verify the schema# SELECT version, applied_at, source FROM schema_migrations ORDER BY version DESC LIMIT 5;Apply to prod
Section titled “Apply to prod”Only after staging has run green for at least 24 hours (so background jobs, uptime checks, and any customer traffic have exercised the change).
cd ~/tappass/tappass/deploy/terraform/environments/gcp-prodterraform plan -target=module.tappass.null_resource.run_migrationsterraform apply -target=module.tappass.null_resource.run_migrationsOr, manually:
gcloud run jobs execute tappass-migrate \ --project=tappass-prod --region=europe-west1 --waitNever skip the --wait — if the job fails partway through a
multi-statement migration, the schema_migrations table won't have
the version recorded and the next run will retry cleanly. Silent
failures without --wait hide this from you.
Rollback
Section titled “Rollback”Rollback is risky — the file you want to undo is tracked in
schema_migrations and customers may have written data into a new
column or relied on a new constraint.
For a failed migration (transaction rolled back)
Section titled “For a failed migration (transaction rolled back)”If the Cloud Run Job failed inside BEGIN/COMMIT, Postgres has
already rolled back — nothing to do. Fix the SQL, push a new file
with an incremented number, re-run.
For a successful migration you now regret
Section titled “For a successful migration you now regret”Don't delete the file. Write a compensating migration:
-- deploy/migrations/037_revert_agents_last_seen_index.sql-- Why: the 036 index caused write amplification under burst load;-- restoring the pre-036 state with a forward-only compensation.
BEGIN;DROP INDEX IF EXISTS agents_last_seen_idx;COMMIT;Forward migration through deletion > backward migration via file
removal, because the file sequence in schema_migrations is the
audit trail.
Data-destructive rollback
Section titled “Data-destructive rollback”If you need to remove a column that has customer data:
- Export the column to an audit log or GCS bucket first.
- Get written sign-off from the customer success owner.
- Apply the
DROP COLUMNmigration in a quiet traffic window. - Keep the export for the retention period required by the contract / DPA.
See Restore from backup if the rollback also requires restoring an earlier snapshot.
Common failure modes
Section titled “Common failure modes”| Symptom | Cause | Fix |
|---|---|---|
terraform plan shows null_resource.run_migrations must be replaced | Any file under deploy/migrations/ changed | Expected — that's how we trigger the job |
Job exits with connection refused | Cloud SQL Auth Proxy not attached to the job | Check migrations.tf — cloud_sql_instances annotation |
ERROR: relation "X" already exists | Someone else applied the migration out of band | psql → manually INSERT INTO schema_migrations for that version |
schema.sql staleness guard fails in CI | You forgot step 3 (regenerate schema.sql) | Regenerate and commit |
| Migration succeeds but app errors about a missing column at startup | App image doesn't contain code that uses the new column yet — deploy order wrong | Deploy code that can tolerate both states first, migrate, then deploy code that requires the new state. Deploy core server |
Also see
Section titled “Also see”- Deploy core server — order-of-ops when a release couples schema + code changes.
- Restore from backup — if a migration corrupts data.
- Ops → Deployments — where the migration job fits in the overall CI/CD flow.