Skip to content

Database migration

TapPass migrations are plain .sql files in tappass/deploy/migrations/ numbered NNN_short_name.sql, applied via a Cloud Run Job (tappass-migrate) that runs the shared runner.sh. Terraform uploads the migration files to GCS, then a null_resource triggers the job — idempotent by schema_migrations tracking.

Golden rules

  • Always write forward-only migrations for prod. Rollback is hard once customer data has flowed through a new column — the safe recovery path is forward with a compensating migration, not DROP COLUMN.
  • Every migration is wrapped in BEGIN/COMMIT by the runner. If yours needs a different transaction shape (e.g. CREATE INDEX CONCURRENTLY which can't run in a transaction), split it out and flag it in the PR description.
  • Stage first, prod second. The staging DB is your canary.
Terminal window
cd ~/tappass/tappass/deploy/migrations
ls [0-9]*.sql | tail -3
# Highest-numbered file + 1

Name the file NNN_short_describe_change.sql. Keep it focused on a single change — mixing two migrations into one file makes partial rollback impossible.

-- deploy/migrations/036_agents_last_seen_index.sql
-- Why: /api/agents list spent 300ms on a full-table scan of
-- agents.last_seen; add a btree index so it drops to <10ms.
BEGIN;
CREATE INDEX CONCURRENTLY IF NOT EXISTS agents_last_seen_idx
ON agents (last_seen);
COMMIT;

schema.sql is the "if this DB were empty, build it from scratch" snapshot. CI has a guard that fails if it's out of sync.

Terminal window
cd ~/tappass/tappass
./deploy/scripts/dump-schema.sh # or equivalent per repo conventions
git diff deploy/migrations/schema.sql # sanity-check the diff
Terminal window
cd ~/tappass/tappass
docker compose up -d postgres
./deploy/scripts/apply-migrations.sh
# Expect: one "applied NNN_..." line. Re-run: zero lines (idempotent).

Open psql and confirm the schema change:

Terminal window
docker compose exec postgres psql -U tappass -d tappass -c '\d agents'
Terminal window
cd ~/tappass/tappass
.venv/bin/python -m pytest tests/integration -q -k "<area-you-changed>"

Commit with a conventional-commit subject: db: add agents_last_seen_idx.

Terraform hashes the migrations directory and re-triggers the job whenever a file changes:

Terminal window
cd ~/tappass/tappass/deploy/terraform/environments/gcp-staging
terraform plan -target=module.tappass.null_resource.run_migrations
terraform apply -target=module.tappass.null_resource.run_migrations

Watch the Cloud Run Job execution:

Terminal window
gcloud run jobs executions list --job=tappass-migrate \
--project=tappass-staging --region=europe-west1 --limit=1
gcloud run jobs executions describe <execution-name> \
--project=tappass-staging --region=europe-west1 --format='value(status.conditions)'

Useful when terraform is blocked on unrelated drift.

Terminal window
gcloud run jobs execute tappass-migrate \
--project=tappass-staging --region=europe-west1 --wait

The --wait flag blocks until the execution finishes and returns a non-zero exit code on failure, which is what you want in a terminal session.

Terminal window
# Connect via Cloud SQL Auth Proxy or the GCP console
gcloud sql connect tappass-db --project=tappass-staging --user=tappass
# In psql:
# \d <changed_table> — verify the schema
# SELECT version, applied_at, source FROM schema_migrations ORDER BY version DESC LIMIT 5;

Only after staging has run green for at least 24 hours (so background jobs, uptime checks, and any customer traffic have exercised the change).

Terminal window
cd ~/tappass/tappass/deploy/terraform/environments/gcp-prod
terraform plan -target=module.tappass.null_resource.run_migrations
terraform apply -target=module.tappass.null_resource.run_migrations

Or, manually:

Terminal window
gcloud run jobs execute tappass-migrate \
--project=tappass-prod --region=europe-west1 --wait

Never skip the --wait — if the job fails partway through a multi-statement migration, the schema_migrations table won't have the version recorded and the next run will retry cleanly. Silent failures without --wait hide this from you.

Rollback is risky — the file you want to undo is tracked in schema_migrations and customers may have written data into a new column or relied on a new constraint.

For a failed migration (transaction rolled back)

Section titled “For a failed migration (transaction rolled back)”

If the Cloud Run Job failed inside BEGIN/COMMIT, Postgres has already rolled back — nothing to do. Fix the SQL, push a new file with an incremented number, re-run.

Don't delete the file. Write a compensating migration:

-- deploy/migrations/037_revert_agents_last_seen_index.sql
-- Why: the 036 index caused write amplification under burst load;
-- restoring the pre-036 state with a forward-only compensation.
BEGIN;
DROP INDEX IF EXISTS agents_last_seen_idx;
COMMIT;

Forward migration through deletion > backward migration via file removal, because the file sequence in schema_migrations is the audit trail.

If you need to remove a column that has customer data:

  1. Export the column to an audit log or GCS bucket first.
  2. Get written sign-off from the customer success owner.
  3. Apply the DROP COLUMN migration in a quiet traffic window.
  4. Keep the export for the retention period required by the contract / DPA.

See Restore from backup if the rollback also requires restoring an earlier snapshot.

SymptomCauseFix
terraform plan shows null_resource.run_migrations must be replacedAny file under deploy/migrations/ changedExpected — that's how we trigger the job
Job exits with connection refusedCloud SQL Auth Proxy not attached to the jobCheck migrations.tfcloud_sql_instances annotation
ERROR: relation "X" already existsSomeone else applied the migration out of bandpsql → manually INSERT INTO schema_migrations for that version
schema.sql staleness guard fails in CIYou forgot step 3 (regenerate schema.sql)Regenerate and commit
Migration succeeds but app errors about a missing column at startupApp image doesn't contain code that uses the new column yet — deploy order wrongDeploy code that can tolerate both states first, migrate, then deploy code that requires the new state. Deploy core server