Zero-Downtime PostgreSQL Schema Migrations: The Complete Guide to Changing Production Databases Without Breaking Anything
It's 2 AM. Your team just deployed a migration that adds a NOT NULL column to a table with 50 million rows. The migration acquired an ACCESS EXCLUSIVE lock, blocking every query on that table. Your API response times spiked from 50ms to 30 seconds. Your on-call engineer's phone is exploding. Customers are tweeting screenshots of error pages.
You've just learned, the hard way, that ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT in PostgreSQL can be a weapon of mass destruction in production: before PostgreSQL 11 it rewrites the entire table, and even on newer versions the lock queue can freeze every query on the table.
This guide exists so you never have to learn that lesson the hard way. We'll cover every pattern, tool, and gotcha involved in changing PostgreSQL schemas without taking your application offline. Not theory. Production-tested strategies that work on tables with hundreds of millions of rows.
Why Schema Migrations Are Dangerous
Most developers treat database migrations as simple code deployments. Write an ALTER TABLE, run it during deploy, move on. This works fine in development, where your users table has 50 rows. In production, where it has 50 million rows, the same migration can bring down your entire application.
Here's why:
PostgreSQL's Lock System
Every DDL statement in PostgreSQL acquires locks. The problem isn't that locks exist; it's the lock queue. When a DDL statement requests an ACCESS EXCLUSIVE lock and can't get it immediately (because active queries hold conflicting locks), it waits in the queue. While it waits, every new query that needs any lock on that table also queues behind it.
Timeline of disaster:
00:00 Active SELECT queries running (hold ACCESS SHARE locks)
00:01 ALTER TABLE requests ACCESS EXCLUSIVE lock → waits in queue
00:02 New SELECT query arrives → queues behind ALTER TABLE
00:03 New SELECT query arrives → queues behind ALTER TABLE
00:04 New SELECT query arrives → queues behind ALTER TABLE
...   Every query is now queued. Application appears frozen.
00:30 Original SELECT finishes → ALTER TABLE acquires lock
00:31 ALTER TABLE runs (could take minutes on large tables)
02:00 ALTER TABLE completes → queued queries finally execute
That's a 2-minute outage caused by a single ALTER TABLE statement. On a busy table, this can spiral into connection pool exhaustion, application crashes, and cascading failures across your entire system.
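The queue dynamics above can be sketched as a toy model. This is illustrative only: the lock mode names match PostgreSQL's, but the scheduling is deliberately simplified to show why one waiting DDL statement blocks every later arrival.

```typescript
// Toy model of the lock queue: ACCESS SHARE requests normally run
// concurrently, but once an ACCESS EXCLUSIVE request is waiting,
// every later arrival queues behind it regardless of its own mode.
type LockMode = "ACCESS SHARE" | "ACCESS EXCLUSIVE";

interface LockRequest {
  name: string;
  mode: LockMode;
}

function blockedRequests(arrivals: LockRequest[]): string[] {
  const blocked: string[] = [];
  let exclusiveWaiting = false;
  for (const req of arrivals) {
    if (exclusiveWaiting) {
      // Anything arriving after the DDL queues behind it
      blocked.push(req.name);
    } else if (req.mode === "ACCESS EXCLUSIVE") {
      // Conflicts with the readers already running, so it waits
      exclusiveWaiting = true;
      blocked.push(req.name);
    }
    // Otherwise: a reader that arrived before the DDL runs freely
  }
  return blocked;
}
```

In the timeline above, only the first SELECT runs to completion; the ALTER TABLE and every SELECT that arrives after it end up blocked.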
The Operations That Kill You
Not all schema changes are equally dangerous. Here's the risk matrix:
| Operation | Lock Type | Risk Level | Duration on 50M rows |
|---|---|---|---|
| ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE | 🟢 Low | Milliseconds |
| ADD COLUMN ... DEFAULT (PG 11+) | ACCESS EXCLUSIVE | 🟢 Low | Milliseconds |
| ADD COLUMN ... NOT NULL DEFAULT (PG 11+) | ACCESS EXCLUSIVE | 🟡 Medium | Milliseconds (but lock queue risk) |
| DROP COLUMN | ACCESS EXCLUSIVE | 🟢 Low | Milliseconds |
| ALTER COLUMN TYPE | ACCESS EXCLUSIVE | 🔴 Critical | Minutes to hours (full table rewrite) |
| ADD INDEX | SHARE | 🔴 Critical | Minutes (blocks writes) |
| ADD INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | 🟢 Low | Minutes (but non-blocking) |
| ADD NOT NULL CONSTRAINT | ACCESS EXCLUSIVE | 🔴 Critical | Minutes (full table scan) |
| ADD FOREIGN KEY | SHARE ROW EXCLUSIVE | 🔴 Critical | Minutes (full table scan) |
The operations marked 🔴 are the ones that cause outages. Let's learn how to make each of them safe.
The Expand-Contract Pattern
The expand-contract pattern (also called "parallel change") is the foundational strategy for zero-downtime migrations. The idea is simple: never make a breaking change in a single step. Instead, break it into multiple, non-breaking steps.
EXPAND-CONTRACT PATTERN

Phase 1: EXPAND
- Add new column/table alongside old one
- Deploy code that writes to BOTH old and new
- Old readers continue to work unchanged

Phase 2: MIGRATE
- Backfill existing data to new column/table
- Verify data consistency
- Switch readers to new column/table

Phase 3: CONTRACT
- Remove old column/table
- Remove compatibility code
- Clean up
Let's see this applied to the most common and dangerous migrations.
Example: Renaming a Column
You want to rename users.full_name to users.display_name. A naive ALTER TABLE users RENAME COLUMN full_name TO display_name will break every query referencing full_name the instant it runs.
Step 1 (Expand): Add the new column
-- Deploy 1: Add new column (instant, no rewrite)
ALTER TABLE users ADD COLUMN display_name TEXT;

-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_display_name()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
    IF NEW.display_name IS NULL AND NEW.full_name IS NOT NULL THEN
      NEW.display_name := NEW.full_name;
    ELSIF NEW.full_name IS NULL AND NEW.display_name IS NOT NULL THEN
      NEW.full_name := NEW.display_name;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_sync_display_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_display_name();
Step 2 (Backfill): Copy existing data
-- Backfill in batches to avoid long-running transactions
DO $$
DECLARE
  batch_size INT := 10000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE users
    SET display_name = full_name
    WHERE id IN (
      SELECT id FROM users
      WHERE display_name IS NULL AND full_name IS NOT NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    RAISE NOTICE 'Updated % rows', rows_updated;
    PERFORM pg_sleep(0.1); -- Brief pause to reduce load
    COMMIT;
  END LOOP;
END $$;
Step 3 (Switch readers): Update application code to read from display_name
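During the transition both columns exist, so application code can prefer the new column and fall back to the old one, which works regardless of whether a given row has been backfilled yet. A minimal sketch (the row shape here is hypothetical, not from any library):

```typescript
// Transitional read: prefer display_name, fall back to full_name.
// Safe to deploy before, during, or after the backfill completes.
interface UserRow {
  full_name: string | null;
  display_name: string | null;
}

function displayName(row: UserRow): string | null {
  return row.display_name ?? row.full_name;
}
```

Once the backfill is verified and the contract phase drops full_name, this helper collapses to a plain column read.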
Step 4 (Contract): Remove old column and trigger
-- Deploy 3: After all code reads from display_name
DROP TRIGGER trigger_sync_display_name ON users;
DROP FUNCTION sync_display_name();
ALTER TABLE users DROP COLUMN full_name;
Four steps instead of one statement. But zero downtime.
Safe Index Creation
Creating indexes on large tables is one of the most common causes of production outages. A standard CREATE INDEX acquires a SHARE lock, blocking all writes (INSERT, UPDATE, DELETE) for the entire duration of the index build.
Always Use CONCURRENTLY
-- ❌ DANGEROUS: Blocks all writes
CREATE INDEX idx_users_email ON users(email);

-- ✅ SAFE: Non-blocking
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CREATE INDEX CONCURRENTLY builds the index without holding a lock that blocks writes. It does this by scanning the table twice โ once to build the initial index, and once to capture any changes that happened during the first scan.
The CONCURRENTLY Gotchas
There are important caveats you must know:
1. It can leave an invalid index behind. If CREATE INDEX CONCURRENTLY encounters an error (e.g., a unique constraint violation), the command fails but leaves behind an INVALID index that consumes space and slows writes until you drop it. Always verify:
-- Check for invalid indexes after creation
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND indexname = 'idx_users_email';

-- Check validity
SELECT pg_index.indisvalid
FROM pg_index
JOIN pg_class ON pg_index.indexrelid = pg_class.oid
WHERE pg_class.relname = 'idx_users_email';
If the index is invalid, drop it and try again:
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;
-- Then retry CREATE INDEX CONCURRENTLY
2. It cannot run inside a transaction. You cannot wrap CREATE INDEX CONCURRENTLY in a BEGIN...COMMIT block. Most migration tools run each migration in a transaction by default; you need to disable this for concurrent index creation.
3. It takes longer. Because it scans the table twice and doesn't hold an exclusive lock, concurrent index creation takes 2-3x longer than a regular CREATE INDEX. On a 100 million row table, this could mean 30+ minutes. Plan accordingly.
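To illustrate caveat 2, here is how a custom migration runner might decide whether a migration can be wrapped in a transaction. Most real tools expose an explicit per-migration flag for this instead; sniffing the SQL, as below, is just a sketch:

```typescript
// Decide whether a migration may run inside a transaction.
// CREATE/DROP INDEX CONCURRENTLY must run outside one, so any
// migration containing CONCURRENTLY gets its own non-transactional run.
function canWrapInTransaction(migrationSql: string): boolean {
  return !/\bCONCURRENTLY\b/i.test(migrationSql);
}
```

The trade-off of skipping the transaction is that a failed migration leaves partial state behind (such as the INVALID index from caveat 1), so the runner also needs cleanup-and-retry logic.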
Adding NOT NULL Constraints Safely
Adding a NOT NULL constraint to an existing column is deceptively dangerous. PostgreSQL must scan the entire table to verify no NULL values exist, and it holds an ACCESS EXCLUSIVE lock while doing so.
The Safe Pattern
-- Step 1: Add a CHECK constraint with NOT VALID (instant, no scan)
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: Validate the constraint in a separate transaction
-- This scans the table but only acquires a
-- SHARE UPDATE EXCLUSIVE lock (allows reads AND writes)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;

-- Step 3 (optional): Convert to a proper NOT NULL constraint
-- In PostgreSQL 12+, if a valid CHECK constraint exists,
-- adding NOT NULL is instant (no table scan needed)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- Step 4: Drop the now-redundant CHECK constraint
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
Why this works: NOT VALID tells PostgreSQL "I promise this constraint holds for new rows, but don't check existing rows yet." The VALIDATE CONSTRAINT step then checks existing rows with a weaker lock that doesn't block reads or writes.
Adding Foreign Keys Safely
Foreign key creation is another silent killer. By default, ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY scans both the referencing and referenced tables while holding locks.
-- ❌ DANGEROUS: Locks both tables during validation
ALTER TABLE orders
ADD CONSTRAINT fk_orders_user_id
FOREIGN KEY (user_id) REFERENCES users(id);

-- ✅ SAFE: Two-step approach
-- Step 1: Add constraint without validating (instant)
ALTER TABLE orders
ADD CONSTRAINT fk_orders_user_id
FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;

-- Step 2: Validate separately (weaker lock)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user_id;
The NOT VALID trick works the same way as with CHECK constraints. New rows are validated immediately, and existing rows are validated in a separate step with a less restrictive lock.
Changing Column Types
Changing a column type (e.g., INT to BIGINT, or VARCHAR(50) to VARCHAR(255)) typically requires a full table rewrite. On a table with millions of rows, this means minutes of downtime.
The Expand-Contract Approach
-- Step 1: Add new column with desired type
ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;

-- Step 2: Create sync trigger
CREATE OR REPLACE FUNCTION sync_amount_v2()
RETURNS TRIGGER AS $$
BEGIN
  NEW.amount_v2 := NEW.amount;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_sync_amount_v2
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION sync_amount_v2();

-- Step 3: Backfill (batched)
UPDATE orders SET amount_v2 = amount
WHERE id BETWEEN 1 AND 1000000;
-- ... repeat for all ranges

-- Step 4: Switch application code to use amount_v2

-- Step 5: Clean up
DROP TRIGGER trigger_sync_amount_v2 ON orders;
DROP FUNCTION sync_amount_v2();
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;
The Exception: Varchar Length Increase
Some column type changes are actually safe because they don't require a table rewrite:
-- ✅ SAFE: Increasing VARCHAR length (no rewrite needed)
-- Only safe when increasing, not decreasing!
ALTER TABLE users ALTER COLUMN name TYPE VARCHAR(255);

-- ✅ SAFE: Changing VARCHAR to TEXT (removes the limit entirely)
-- TEXT and VARCHAR are stored identically in PostgreSQL
ALTER TABLE users ALTER COLUMN name TYPE TEXT;
Lock Timeout: Your Safety Net
Every migration should set a lock timeout. Without it, a migration will wait indefinitely for a lock, queueing all other queries behind it.
-- Set a 5-second lock timeout for this session
SET lock_timeout = '5s';

-- Now try the migration
ALTER TABLE users ADD COLUMN bio TEXT;

-- If the lock can't be acquired within 5 seconds,
-- PostgreSQL raises an error instead of waiting forever
In your migration scripts, always set a lock timeout:
-- At the top of every migration file
SET lock_timeout = '5s';
SET statement_timeout = '30s';

-- Your migration DDL here
ALTER TABLE users ADD COLUMN bio TEXT;

-- Reset for safety
RESET lock_timeout;
RESET statement_timeout;
Retry Logic
When using lock timeouts, you need retry logic. The migration might fail because a long-running query held a conflicting lock. That's okay; retry after a brief pause:
// Retry a DDL statement that may hit lock_timeout (SQLSTATE 55P03).
// Assumes `db` is your pg client.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function safeMigrate(sql: string, maxRetries = 5): Promise<void> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await db.query("SET lock_timeout = '5s'");
      await db.query(sql);
      console.log(`Migration succeeded on attempt ${attempt}`);
      return;
    } catch (error: any) {
      if (error.code === '55P03' && attempt < maxRetries) {
        // Lock timeout - wait and retry with linear backoff
        console.log(
          `Lock timeout on attempt ${attempt}, ` +
          `retrying in ${attempt * 2}s...`
        );
        await sleep(attempt * 2000);
      } else {
        throw error;
      }
    }
  }
}
Batched Backfills: The Art of Moving Data
When you need to update millions of existing rows (e.g., backfilling a new column), doing it in a single UPDATE is dangerous. It creates one massive transaction that:
- Holds row-level locks on all affected rows
- Generates massive WAL (Write-Ahead Log) volume
- Can cause replication lag
- Bloats the table (dead tuples that need vacuuming)
The Batch Pattern
-- Backfill in chunks of 10,000 rows
DO $$
DECLARE
  batch_size INT := 10000;
  last_id BIGINT := 0;
  max_id BIGINT;
  rows_updated INT;
BEGIN
  SELECT MAX(id) INTO max_id FROM orders;
  WHILE last_id < max_id LOOP
    UPDATE orders
    SET amount_cents = amount * 100
    WHERE id > last_id
      AND id <= last_id + batch_size
      AND amount_cents IS NULL;
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    last_id := last_id + batch_size;
    RAISE NOTICE 'Processed up to id %, updated % rows',
      last_id, rows_updated;
    -- Pause briefly to let replicas catch up
    -- and allow autovacuum to process dead tuples
    PERFORM pg_sleep(0.05);
    -- Commit each batch separately
    COMMIT;
  END LOOP;
END $$;
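The id-range arithmetic in a keyset backfill is easy to get wrong by one. As a sanity check, here is the same half-open chunking as a standalone helper (the function name is mine, not from any library):

```typescript
// Split (0, maxId] into half-open ranges (last, last + batchSize],
// mirroring the `id > last AND id <= last + batch_size` predicate
// used in the SQL loop. Each range is [exclusiveLow, inclusiveHigh].
function batchRanges(
  maxId: number,
  batchSize: number
): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let last = 0; last < maxId; last += batchSize) {
    ranges.push([last, Math.min(last + batchSize, maxId)]);
  }
  return ranges;
}
```

Because the ranges are half-open and adjacent, every id is visited exactly once, even when maxId is not a multiple of the batch size.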
Monitoring Your Backfill
While a backfill is running, monitor these metrics:
-- Check replication lag (crucial for read replicas)
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Check table bloat (dead tuples accumulating)
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / GREATEST(n_live_tup, 1) * 100, 2) AS dead_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Check for long-running queries that might conflict
SELECT
  pid,
  now() - query_start AS duration,
  state,
  left(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC;
Migration Tooling for Production
pgroll: Zero-Downtime Migration Tool
pgroll is a schema migration tool specifically designed for zero-downtime changes. It automatically implements the expand-contract pattern:
{
  "name": "add_display_name",
  "operations": [
    {
      "rename_column": {
        "table": "users",
        "from": "full_name",
        "to": "display_name"
      }
    }
  ]
}
pgroll handles the expand phase (creating views, triggers, and temporary columns), the migration phase, and the contract phase automatically. It creates versioned schema views so old and new application versions can coexist.
Reshape
Reshape follows a similar philosophy: declarative, zero-downtime migrations with automatic expand-contract:
[[actions]]
type = "alter_column"
table = "users"
column = "full_name"

[actions.changes]
name = "display_name"
sqitch + Custom Scripts
For teams preferring more control, sqitch combined with custom scripts provides a lightweight alternative:
# sqitch workflow
sqitch add rename-user-column \
  -n "Rename full_name to display_name (phase 1: expand)"
sqitch deploy
sqitch verify
Framework-Specific Tools
| Framework | Tool | Zero-Downtime Support |
|---|---|---|
| Rails | strong_migrations gem | Blocks dangerous operations, suggests safe alternatives |
| Django | django-pg-zero-downtime-migrations | Adds lock timeouts and safe patterns |
| Laravel | No built-in | Manual patterns required |
| Node.js/TypeScript | graphile-migrate, node-pg-migrate | Good control, manual patterns |
| Go | goose, atlas | Atlas has declarative migrations with safety checks |
The Pre-Migration Checklist
Before running any migration in production:
1. Check Active Locks and Long-Running Queries
-- Kill any query running longer than 5 minutes on target table
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE '%your_table%'
  AND state != 'idle'
  AND query_start < now() - interval '5 minutes';
2. Test on a Production-Sized Dataset
Never test migrations only on your development database. Create a staging environment with production-scale data:
# Dump production table structure
pg_dump --schema-only production_db > schema.sql

# Generate test data at scale with pgbench
# (pgbench creates its own benchmark tables; use it as a rough
#  load baseline, or load anonymized production data instead)
pgbench -i -s 1000 staging_db
3. Set Timeouts
SET lock_timeout = '5s';
SET statement_timeout = '30min'; -- for long backfills
4. Have a Rollback Plan
Every migration should have a tested rollback:
-- migration.sql
ALTER TABLE users ADD COLUMN bio TEXT;

-- rollback.sql
ALTER TABLE users DROP COLUMN IF EXISTS bio;
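One cheap guardrail is a CI check that every migration file ships with a rollback. A sketch, assuming a `name.sql` / `name.rollback.sql` naming convention (the convention is hypothetical; adapt the pairing rule to your tool's layout):

```typescript
// Return migration files that have no matching rollback file.
// Pairing rule: "001_add_bio.sql" expects "001_add_bio.rollback.sql".
function missingRollbacks(files: string[]): string[] {
  const present = new Set(files);
  return files
    .filter((f) => f.endsWith(".sql") && !f.endsWith(".rollback.sql"))
    .filter((f) => !present.has(f.replace(/\.sql$/, ".rollback.sql")));
}
```

Failing the build when this list is non-empty forces the "write and test the rollback" discipline before the migration ever reaches production.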
5. Monitor During Deployment
-- Watch for lock contention in real-time
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocked_locks.locktype = blocking_locks.locktype
  AND blocked_locks.database = blocking_locks.database
  AND blocked_locks.relation = blocking_locks.relation
  AND blocked_locks.pid != blocking_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
Real-World Migration Playbook
Here's a complete, copy-paste playbook for the most common production migrations:
Playbook 1: Add a New Required Column with Default
-- Step 1: Add nullable column (instant)
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN subscription_tier TEXT;

-- Step 2: Backfill existing rows
DO $$
DECLARE
  batch_size INT := 5000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE users
    SET subscription_tier = 'free'
    WHERE subscription_tier IS NULL
      AND id IN (
        SELECT id FROM users
        WHERE subscription_tier IS NULL
        LIMIT batch_size
        FOR UPDATE SKIP LOCKED
      );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;
    PERFORM pg_sleep(0.05);
  END LOOP;
END $$;

-- Step 3: Add NOT NULL constraint safely
ALTER TABLE users
ADD CONSTRAINT users_subscription_tier_not_null
CHECK (subscription_tier IS NOT NULL) NOT VALID;

ALTER TABLE users VALIDATE CONSTRAINT users_subscription_tier_not_null;

-- Step 4: Set NOT NULL (instant with valid CHECK constraint)
ALTER TABLE users ALTER COLUMN subscription_tier SET NOT NULL;

-- Step 5: Set default for future rows
ALTER TABLE users ALTER COLUMN subscription_tier SET DEFAULT 'free';

-- Step 6: Clean up CHECK constraint
ALTER TABLE users DROP CONSTRAINT users_subscription_tier_not_null;
Playbook 2: Create Index on Large Table
-- Step 1: Create index concurrently
SET statement_timeout = '0'; -- Disable for long-running index build
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders(customer_id, created_at DESC);

-- Step 2: Verify index is valid
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT 1
    FROM pg_index i
    JOIN pg_class c ON i.indexrelid = c.oid
    WHERE c.relname = 'idx_orders_customer_created'
      AND i.indisvalid = true
  ) THEN
    RAISE EXCEPTION 'Index is invalid! Drop and retry.';
  END IF;
END $$;
Playbook 3: Replace a Table (Full Schema Change)
-- When the changes are so extensive that expand-contract
-- on individual columns doesn't make sense

-- Step 1: Create new table with desired schema
CREATE TABLE users_v2 (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  display_name TEXT NOT NULL,
  email TEXT NOT NULL UNIQUE,
  subscription_tier TEXT NOT NULL DEFAULT 'free',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Step 2: Create indexes on new table
CREATE INDEX idx_users_v2_email ON users_v2(email);
CREATE INDEX idx_users_v2_created ON users_v2(created_at);

-- Step 3: Copy data in batches (similar to backfill pattern)

-- Step 4: Create a trigger on old table to sync new rows
-- to users_v2

-- Step 5: Swap tables atomically
BEGIN;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;

-- Step 6: Update sequences, foreign keys, etc.

-- Step 7: Drop old table after verification period
Common Mistakes and How to Avoid Them
Mistake 1: Running Migrations During Peak Traffic
Schedule schema migrations during your lowest-traffic window. Even "safe" migrations benefit from lower concurrency.
Mistake 2: Not Testing the Rollback
Every migration rollback should be tested. "Just drop the column" is a rollback plan that destroys data. Consider whether you need to preserve data during rollback.
Mistake 3: Forgetting About ORMs
Your ORM might generate SQL that references columns by name. When using expand-contract, ensure your ORM version can handle the transitional state (both old and new columns existing).
Mistake 4: Ignoring Replication Lag
If you use read replicas, schema changes propagate via replication. A backfill that writes 10 million rows can cause significant replication lag, making your read replicas return stale data.
Solution: Monitor pg_stat_replication during backfills and throttle if lag exceeds your threshold:
// Throttled backfill loop. getReplicationLag, updateBatch,
// MAX_LAG_BYTES and sleep are application-provided helpers.
async function throttledBackfill(batchSize: number) {
  while (true) {
    const lag = await getReplicationLag();
    if (lag > MAX_LAG_BYTES) {
      console.log(`Replication lag ${lag} bytes, pausing...`);
      await sleep(5000);
      continue;
    }
    const updated = await updateBatch(batchSize);
    if (updated === 0) break;
    await sleep(50); // baseline throttle
  }
}
Mistake 5: Deploying Application Code and Migration Simultaneously
The application code and the migration should be deployed in separate steps. Deploy the migration first. Verify it succeeded. Then deploy the code that uses the new schema. This decoupling is essential for safe rollbacks.
The Migration Safety Checklist
Before every production migration:
- Lock timeout is set (SET lock_timeout = '5s')
- Statement timeout is set for long operations
- Migration tested on production-scale dataset
- Rollback script written and tested
- No concurrent deployments or maintenance windows
- Replication lag monitoring is active
- Backfill uses batched updates (not single UPDATE)
- Indexes created with CONCURRENTLY
- NOT NULL constraints added via CHECK + VALIDATE pattern
- Foreign keys added with NOT VALID + VALIDATE
- Application code is backward-compatible with old schema
- On-call engineer is aware of the migration
- Traffic is at its lowest point (if possible)
Schema migrations don't have to be scary. The patterns in this guide have been battle-tested on databases serving millions of requests per day. The key insight is simple: never make a breaking change in a single step. Expand first, contract later. Set timeouts. Backfill in batches. Monitor everything.
Your database is the foundation of your application. Treat its schema changes with the same care you'd give to open-heart surgery: careful planning, precise execution, and constant monitoring.