Koder.ai

© 2026 Koder.ai. All rights reserved.

Aug 31, 2025·6 min

Zero-downtime schema changes with expand/contract pattern

Learn zero-downtime schema changes using the expand/contract pattern: add columns safely, backfill in batches, deploy compatible code, then remove old paths.


Why schema changes cause outages

Downtime from a database change isn't always a clean, obvious outage. To users it can look like a page that loads forever, a checkout that fails, or an app that suddenly shows "something went wrong." For teams it shows up as alerts, rising error rates, and a backlog of failed writes that need cleanup.

Schema changes are risky because the database is shared by every running version of your app. During a release you often have old and new code live at the same time (rolling deploys, multiple instances, background jobs). A migration that looks correct can still break one of those versions.

Common failure modes include:

  • New code writes to a column that doesn't exist yet, causing immediate errors.
  • Old code reads a column or table that a migration renamed or dropped, causing crashes after deploy.
  • A backfill or index build spikes CPU or locks rows, making normal requests slow or timing out.
  • A "quick" constraint change (like NOT NULL) blocks writes while the table is checked.

Even when the code is fine, releases get blocked because the real problem is timing and compatibility across versions.

Zero-downtime schema changes come down to one rule: every intermediate state must be safe for both old and new code. You change the database without breaking existing reads and writes, ship code that can handle both shapes, and only remove the old path once nothing depends on it.

That extra effort is worth it when you have real traffic, strict SLAs, or lots of app instances and workers. For a tiny internal tool with a quiet database, a planned maintenance window may be simpler.

Expand/contract in plain words

Most incidents from database work happen because the app expects the database to change instantly, while the database change takes time. The expand/contract pattern avoids that by breaking one risky change into smaller, safe steps.

For a short period, your system supports two "dialects" at once. You introduce the new structure first, keep the old one working, move data gradually, then clean up.

The pattern is simple:

  • Expand: add what you need (columns, tables, indexes) without breaking the current app.
  • Run both paths: deploy code that works with old and new structures so mixed versions still behave.
  • Contract: once everything uses the new structure, remove the old schema and the old code.

This plays nicely with rolling deploys. If you update 10 servers one by one, you'll briefly run old and new versions together. Expand/contract keeps both compatible with the same database during that overlap.

It also makes rollbacks less scary. If a new release has a bug, you can roll back the app without rolling back the database, because the old structures still exist during the expand window.

Example: you want to split a PostgreSQL column full_name into first_name and last_name. You add the new columns (expand), ship code that can write and read both shapes, backfill old rows, then drop full_name once you're confident nothing uses it (contract).

What "expand" usually includes

The expand phase is about adding new options, not removing old ones.

A common first move is adding a new column. In PostgreSQL, it's usually safest to add it as nullable and without a default. Adding a non-null column with a default can trigger a table rewrite or heavier locks, depending on your Postgres version and the exact change. A safer sequence is: add nullable, deploy tolerant code, backfill, then later enforce NOT NULL.

Indexes also need care. Creating a normal index can block writes longer than you expect. When you can, use concurrent index creation so reads and writes keep flowing. It takes longer, but avoids the release-stopping lock.
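In PostgreSQL, the two expand moves above might look like this (customers and phone_e164 are the example names used later in this post):

-- add the new column: nullable, no default, so no rewrite or long lock
ALTER TABLE customers ADD COLUMN phone_e164 text;

-- build the index without blocking writes
-- (CONCURRENTLY cannot run inside a transaction block, so your
-- migration tooling must support non-transactional statements)
CREATE INDEX CONCURRENTLY idx_customers_phone_e164
    ON customers (phone_e164);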

Expand can also mean adding new tables. If you're moving from a single column to a many-to-many relationship, you might add a join table while keeping the old column in place. The old path keeps working while the new structure starts collecting data.

In practice, expand often includes:

  • Adding new nullable columns or new tables alongside existing ones
  • Adding indexes in a non-blocking way when possible
  • Using feature flags to control when new reads or writes turn on
  • Writing to both old and new fields (dual-write) when needed
  • Keeping reads backward compatible (old, new, or fallback)

After expand, old and new app versions should be able to run at the same time without surprises.

Deploying code that stays compatible

Most release pain happens in the middle: some servers run new code, others still run old code, while the database is already changing. Your goal is straightforward: any version in the rollout should work with both the old and the expanded schema.

A common approach is dual-write. If you add a new column, the new app writes to both the old and the new column. Old app versions keep writing only the old one, which is fine because it still exists. Keep the new column optional at first, and delay strict constraints until you're sure all writers have been upgraded.

Reads usually switch more carefully than writes. For a while, keep reads on the old column (the one you know is fully populated). After backfill and verification, switch reads to prefer the new column, with a fallback to the old if the new is missing.
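As SQL, that fallback read can be a single COALESCE; the column names below come from the phone example later in this post:

-- prefer the new column, fall back to the old one until backfill is verified
SELECT COALESCE(phone_e164, phone) AS phone
FROM customers
WHERE id = $1;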

Also keep your API output stable while the database changes underneath. Even if you introduce a new internal field, avoid changing response shapes until all consumers are ready (web, mobile, integrations).

A rollback-friendly rollout often looks like this:

  • Release 1: add the new column and ship code that can read old data and write both columns.
  • Release 2: backfill existing rows, then ship code that prefers reading the new column but can fall back.
  • Release 3: stop writing the old column (but keep it present).
  • Release 4: remove old reads, then remove the old column.

The key idea is that the first irreversible step is dropping the old structure, so you postpone it until the end.

Backfill data safely (without overloading the DB)


Backfilling is where many "zero-downtime schema changes" go wrong. You want to fill the new column for existing rows without long locks, slow queries, or surprise load spikes.

Batching matters. Aim for batches that finish quickly (seconds, not minutes). If each batch is small, you can pause, resume, and tune the job without blocking releases.

To track progress, use a stable cursor. In PostgreSQL that's often the primary key. Process rows in order and store the last id you completed, or work in id ranges. This avoids expensive full-table scans when the job restarts.

Here is a simple pattern:

-- one batch: claim the next 1000 unfilled rows in id order
-- (PostgreSQL UPDATE has no ORDER BY/LIMIT, so select the ids first)
UPDATE my_table
SET new_col = ...
WHERE id IN (
    SELECT id
    FROM my_table
    WHERE new_col IS NULL
      AND id > $last_id
    ORDER BY id
    LIMIT 1000
);

Make the update conditional (for example, WHERE new_col IS NULL) so the job is idempotent. Reruns only touch rows that still need work, which reduces unnecessary writes.

Plan for new data arriving during the backfill. The usual order is:

  • Update application code first so new writes also populate the new field.
  • Backfill historical rows in batches.
  • Run a short catch-up loop that rechecks recent rows.
  • If needed, add a guardrail (like a trigger or default) to prevent new NULLs.
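One possible guardrail is a BEFORE trigger along these lines. This is a sketch: normalize_phone is a hypothetical conversion function standing in for your own logic, and EXECUTE FUNCTION requires PostgreSQL 11+:

-- fill the new column automatically if a writer forgets it
CREATE OR REPLACE FUNCTION fill_phone_e164() RETURNS trigger AS $$
BEGIN
    IF NEW.phone_e164 IS NULL THEN
        NEW.phone_e164 := normalize_phone(NEW.phone);  -- hypothetical converter
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_fill_phone_e164
    BEFORE INSERT OR UPDATE ON customers
    FOR EACH ROW EXECUTE FUNCTION fill_phone_e164();

Remember to list this trigger in your contract checklist, or it will quietly outlive the migration.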

A good backfill is boring: steady, measurable, and easy to pause if the database gets hot.

Verifying the migration is really done

The riskiest moment isn't adding the new column. It's deciding you can rely on it.

Before you move to contract, prove two things: the new data is complete, and production has been reading it safely.

Start with completeness checks that are fast and repeatable:

  • Confirm the new column has no unexpected NULLs.
  • Compare how many rows are eligible vs how many were filled.
  • Spot-check a handful of IDs and compare old vs new values.
  • Test edge cases (empty strings, zero, very old records).
  • Rerun the same checks later to ensure nothing drifts.

If you're dual-writing, add a consistency check to catch silent bugs. For example, run a query hourly that finds rows where old_value <> new_value and alert if it's not zero. This is often the quickest way to discover that one writer still updates only the old column.
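A minimal version of that scheduled check, using the my_table names from earlier; IS DISTINCT FROM also flags rows where one side is NULL, which a plain <> comparison would silently skip:

-- alert if this count is ever non-zero after dual-write is fully deployed
SELECT count(*) AS mismatched
FROM my_table
WHERE old_value IS DISTINCT FROM new_value;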

Watch basic production signals while the migration runs. If query time or lock waits spike, even your "safe" verification queries may be adding load. Monitor error rates for any code paths that read the new column, especially right after deploys.

How long should you keep both paths? Long enough to survive at least one full release cycle and one backfill rerun. Many teams use 1-2 weeks, or until they're confident no old app version is still running.

Contract phase: removing the old path

Contract is where teams get nervous because it feels like the point of no return. If expand was done right, contract is mostly cleanup, and you can still do it in small, low-risk steps.

Pick the moment carefully. Don't drop anything right after a backfill finishes. Give it at least one full release cycle so delayed jobs and edge cases have time to show themselves.

A safe contract sequence usually looks like this:

  • Stop dual-write and confirm new writes land only in the new column(s).
  • Remove old reads in the application so the fallback is gone.
  • Delete dead code paths, feature flags, and background jobs that reference the old schema.
  • Remove temporary triggers, sync jobs, or compatibility views.
  • Drop old indexes and constraints, then drop the old column.

If you can, split contract into two releases: one that removes code references (with extra logging), and a later one that removes database objects. That separation makes rollback and troubleshooting much easier.

PostgreSQL specifics matter here. Dropping a column is mostly a metadata change, but it still takes an ACCESS EXCLUSIVE lock briefly. Plan for a quiet window and keep the migration fast. If you created extra indexes, prefer dropping them with DROP INDEX CONCURRENTLY to avoid blocking writes (it can't run inside a transaction block, so your migration tooling needs to support that).
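A sketch of those last two steps (idx_customers_phone is a hypothetical index name; lock_timeout makes the DROP COLUMN fail fast instead of queuing behind long-running queries):

-- must run outside a transaction block
DROP INDEX CONCURRENTLY idx_customers_phone;

-- metadata-only, but briefly takes ACCESS EXCLUSIVE; bail out quickly
-- rather than block traffic if the lock isn't immediately available
SET lock_timeout = '2s';
ALTER TABLE customers DROP COLUMN phone;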

Common mistakes and traps


Zero-downtime migrations fail when the database and the app stop agreeing on what's allowed. The pattern works only if every intermediate state is safe for both old code and new code.

Traps that break production

These mistakes show up often:

  • Adding NOT NULL too early, while an older app version can still write rows without the new field.
  • Backfilling a huge table in one transaction, which can hold locks, bloat tables, and cause timeouts.
  • Assuming a default is free. In PostgreSQL, adding a column with a volatile default rewrites the whole table (and before version 11, any default did).
  • Switching reads to the new column before writes are reliably populating it.
  • Forgetting other writers and readers (cron jobs, workers, exports, reporting queries).

A realistic scenario: you start writing full_name from the API, but a background job that creates users still only sets first_name and last_name. It runs at night, inserts rows with full_name = NULL, and later code assumes full_name is always present.
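When you do eventually enforce NOT NULL, you can sidestep the long validation lock by proving non-nullness through a CHECK constraint first. A sketch against the customers example:

-- add the constraint without scanning existing rows (brief lock only)
ALTER TABLE customers
    ADD CONSTRAINT phone_e164_not_null
    CHECK (phone_e164 IS NOT NULL) NOT VALID;

-- validate separately: scans the table under a weaker lock,
-- so normal reads and writes keep flowing
ALTER TABLE customers VALIDATE CONSTRAINT phone_e164_not_null;

-- on PostgreSQL 12+, SET NOT NULL sees the validated constraint
-- and skips its own full-table scan
ALTER TABLE customers ALTER COLUMN phone_e164 SET NOT NULL;
ALTER TABLE customers DROP CONSTRAINT phone_e164_not_null;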

How to avoid getting stuck mid-migration

Treat each step like a release that may run for days:

  • Keep the new column nullable during the transition, and enforce "required" in code first.
  • Backfill in small batches with pauses, and watch DB load.
  • Make code tolerant: read both paths, write both paths when needed, handle missing values.
  • Audit every place that touches the table, including workers and reporting.

Quick checklist before each release

A repeatable checklist keeps you from shipping code that only works in one database state.

Before you deploy, confirm the database already has the expanded pieces in place (new columns/tables, indexes created in a low-lock way). Then confirm the app is tolerant: it should work against the old shape, the expanded shape, and a half-backfilled state.

Keep the checklist short:

  • Expansion is present: new schema objects exist and were added in a low-lock way.
  • Compatibility is real: the app works with old and expanded schema, including workers and admin paths.
  • Backfill is controlled: small batches, pausable, with basic progress metrics.
  • Read switch is planned: you know exactly when reads move, and how you roll back if results look wrong.
  • Contract is delayed: you wait at least one or two release cycles before dropping old objects.

A migration is only done when reads use the new data, writes no longer maintain the old data, and you've verified the backfill with at least one simple check (counts or sampling).

A realistic example: replace a column without downtime


Say you have a PostgreSQL table customers with a column phone that stores messy values (different formats, sometimes blank). You want to replace it with phone_e164, but you can't block releases or take the app down.

A clean expand/contract sequence looks like this:

  • Expand: add phone_e164 as nullable, with no default, and no heavy constraints yet.
  • Compatible deploy: update code to write both phone and phone_e164, but keep reads on phone so nothing changes for users.
  • Backfill: convert existing rows in small batches (for example, 1,000 at a time).
  • Switch reads: deploy code that reads phone_e164 first, and falls back to phone if it's still NULL.
  • Contract: once you're sure everything uses phone_e164, remove the fallback, drop phone, then add stricter constraints if you still need them.

Rollback stays simple when each step is backward compatible. If the read switch causes issues, roll back the app and the database still has both columns. If backfill causes load spikes, pause the job, reduce batch size, and continue later.

If you want the team to stay aligned, document the plan in one place: the exact SQL, which release flips reads, how you measure completion (like percent non-NULL phone_e164), and who owns each step.
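That completion metric can be a one-line query (count over a column skips NULLs, so this is the percent of rows already backfilled):

SELECT round(100.0 * count(phone_e164) / count(*), 2) AS pct_done
FROM customers;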

Next steps: make it repeatable

Expand/contract works best when it feels routine. Write a short runbook your team can reuse for every schema change, ideally one page and specific enough that a new teammate can follow it.

A practical template covers:

  • Expand (exact migrations)
  • Code changes (what must stay backward compatible, and where dual-read or dual-write is used)
  • Backfill (batch size, rate limits, pause/resume)
  • Verify (the queries and metrics that prove correctness)
  • Contract (what gets removed, and when)

Decide ownership up front. "Everyone thought someone else would do contract" is how old columns and feature flags live for months.

Even if the backfill runs online, schedule it when traffic is lower. It's easier to keep batches small, watch DB load, and stop quickly if latency climbs.

If you're building and deploying with Koder.ai (koder.ai), Planning Mode can be a useful way to map the phases and checkpoints before you touch production. The same compatibility rules still apply, but having the steps written down makes it harder to skip the boring parts that prevent outages.

FAQ

Why do schema changes cause outages even when the SQL looks correct?

Because your database is shared by every running version of your app. During rolling deploys and background jobs, old and new code can run at the same time, and a migration that changes names, drops columns, or adds constraints can break whichever version wasn’t written for that exact schema state.

What does “zero-downtime schema change” really mean?

It means you design the migration so every intermediate database state works for both old and new code. You add new structures first, run with both paths for a while, then remove the old structures only after nothing depends on them.

What’s the difference between the expand and contract phases?

Expand adds new columns, tables, or indexes without removing anything the current app needs. Contract is the cleanup phase where you remove the old columns, old reads/writes, and temporary sync logic after you’ve proven the new path is fully working.

What’s the safest way to add a new column in PostgreSQL?

Adding a nullable column with no default is usually the safest starting point, because it avoids heavy locks and keeps old code working. Then you deploy code that can handle the column being missing or NULL, backfill gradually, and only later tighten constraints like NOT NULL.

When should I use dual-write, and what does it do?

It’s when the new app version writes to both the old field and the new field during the transition. That keeps data consistent while you still have older app instances and jobs that only know about the old field.

How do I backfill data without slowing down production?

Backfill in small batches that finish quickly, and make each batch idempotent so reruns only update rows that still need work. Keep an eye on query time, lock waits, and replication lag, and be ready to pause or shrink batch size if the database starts to heat up.

How can I verify the migration is actually done before dropping anything?

First, check completeness, like how many rows still have NULL in the new column. Then do a consistency check comparing old and new values for a sample (or continuously for all rows if it’s cheap), and watch production errors right after deploys to catch code paths still using the wrong schema.

What are the most common migration steps that break production?

NOT NULL or new constraints can block writes while the table is validated, and normal index creation can hold locks longer than you expect. Renames and drops are also risky because older code may still reference the old names during a rolling deploy.

When is it safe to do the contract step and drop the old column?

Only after you’ve stopped writing the old field, switched reads to the new field without relying on fallbacks, and waited long enough to be confident no old app versions or workers are still running. Many teams treat this as a separate release so a rollback stays simple.

Do I always need expand/contract, or can I just take a maintenance window?

If you can tolerate a maintenance window and there’s little traffic, a simple one-shot migration may be fine. If you have real users, multiple app instances, background workers, or an SLA, expand/contract is usually worth the extra steps because it keeps rollouts and rollbacks safer; in Koder.ai Planning Mode, writing down the phases and checks ahead of time helps you avoid skipping the “boring” steps that prevent outages.
