Services

Backend Engineering, Infrastructure

Industry

B2B Marketplace

Year

2023

Four million rows. Zero seconds of downtime.

A B2B marketplace needed a new non-nullable column on its largest table - 4 million rows of listing data, queried on every page load. Django's default migration strategy locks the entire table for the duration of the backfill. On a 4M-row table, that means 30+ seconds of blocked reads, blocked writes, connection pool exhaustion, and HTTP 502s cascading across the entire application. The column had to be added during peak traffic without anyone noticing.

01.
THE CHALLENGE

The 34-Second Blackout

Django's AddField with null=False and a default value issues a single ALTER TABLE statement. PostgreSQL acquires an ACCESS EXCLUSIVE lock on the table - the most restrictive lock level. Every other transaction that touches the table queues behind it. On a 4M-row table with average row width, the backfill takes 34 seconds. During those 34 seconds, every listing page, every search query, every admin panel view returns a 504 gateway timeout. The connection pool fills up. The load balancer starts returning 502s. Users see error pages. Monitoring fires. And the migration is not even halfway done.

ALTER TABLE with a default is not a schema change. It is a 34-second outage wearing a migration number.

02.
THE SOLUTION

Decompose into Four Safe Steps

The solution breaks the single dangerous migration into four operations that never hold a table lock for more than 10 milliseconds. Step 1: AddField with null=True - instant, no data rewrite. Step 2: a data migration that backfills in batches of 5,000 rows, selecting pk-ordered batches and copying values with one bulk UPDATE per batch, with a 100ms sleep between batches. Step 3: AlterField to set null=False - a constraint-only operation, no data rewrite. Step 4: a RunSQL that sets the column default at the database level. Traffic flows uninterrupted between every step.
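Steps 1 and 4 are small enough to show in full. A sketch reusing the app, table, and field names from the snippets in this section; the dependency name and the default value are illustrative, and in practice each step lives in its own migration file:

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [('listings', '0003_previous')]  # illustrative

    operations = [
        # Step 1: nullable column - no table rewrite, no backfill,
        # the ACCESS EXCLUSIVE lock is held only for the catalog update.
        migrations.AddField(
            model_name='listing',
            name='category_v2',
            field=models.CharField(max_length=64, null=True),
        ),
        # Step 4 (its own migration in practice): database-level default
        # so any writer that predates the new field still inserts a value.
        migrations.RunSQL(
            sql="ALTER TABLE listings_listing "
                "ALTER COLUMN category_v2 SET DEFAULT 'uncategorized';",
            reverse_sql="ALTER TABLE listings_listing "
                        "ALTER COLUMN category_v2 DROP DEFAULT;",
        ),
    ]
```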

The batch backfill data migration:

Python
import time

from django.db.models import F


def forwards(apps, schema_editor):
    Listing = apps.get_model('listings', 'Listing')
    batch_size = 5_000
    total = Listing.objects.filter(category_v2__isnull=True).count()

    # Each pass re-selects the first batch of still-unfilled rows by pk,
    # so the filter shrinks as rows are backfilled.
    for _ in range(0, total, batch_size):
        batch = list(
            Listing.objects
            .filter(category_v2__isnull=True)
            .order_by('pk')
            .values_list('pk', flat=True)[:batch_size]
        )
        if not batch:
            break
        Listing.objects.filter(pk__in=batch).update(
            category_v2=F('category')  # copy from old column
        )
        time.sleep(0.1)  # yield I/O to production queries
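Stripped of the ORM, the loop above reduces to: take the first N unfilled keys, fill them, repeat until none remain. A minimal pure-Python model of that control flow (the function and data shapes are ours, for illustration only):

```python
import time


def backfill_in_batches(rows, batch_size=5_000, pause=0.0):
    """Repeatedly take the first batch_size unfilled rows (by pk),
    copy 'category' into 'category_v2', and pause between batches.
    Mirrors the migration's control flow without the ORM."""
    batches = 0
    while True:
        # Next batch of unprocessed primary keys, lowest pk first.
        batch = sorted(
            pk for pk, row in rows.items() if row["category_v2"] is None
        )[:batch_size]
        if not batch:
            break
        for pk in batch:
            rows[pk]["category_v2"] = rows[pk]["category"]
        batches += 1
        time.sleep(pause)  # yield I/O between batches
    return batches


rows = {pk: {"category": f"c{pk}", "category_v2": None} for pk in range(12_500)}
n = backfill_in_batches(rows, batch_size=5_000)
# 12,500 rows / 5,000 per batch -> 3 batches, every row filled
```

Because each pass re-selects only still-unfilled rows, the loop is also restartable: if it dies halfway, rerunning it picks up exactly where it stopped.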

Watch the Difference

A side-by-side comparison. The naive approach locks the table for 34 seconds while traffic piles up. The zero-downtime approach processes the same 4M rows without blocking a single request.

4.0M rows processed in each run:

Naive (ALTER TABLE ... SET DEFAULT): 34s runtime, 34s of downtime, every request blocked
Zero-downtime (4-step decomposition): 42s runtime, 0s of downtime, zero blocked requests

SeparateDatabaseAndState for ORM accuracy:

Python
from django.db import migrations, models


class Migration(migrations.Migration):
    operations = [
        # Step 3: add the NOT NULL constraint only - no data rewrite.
        migrations.SeparateDatabaseAndState(
            state_operations=[
                migrations.AlterField(
                    model_name='listing',
                    name='category_v2',
                    field=models.CharField(max_length=64),
                ),
            ],
            database_operations=[
                # Note: SET NOT NULL makes PostgreSQL scan the table to
                # verify no NULLs remain. On PostgreSQL 12+, a previously
                # validated CHECK (category_v2 IS NOT NULL) constraint
                # lets it skip that scan entirely.
                migrations.RunSQL(
                    sql='ALTER TABLE listings_listing '
                        'ALTER COLUMN category_v2 '
                        'SET NOT NULL;',
                    reverse_sql='ALTER TABLE listings_listing '
                                'ALTER COLUMN category_v2 '
                                'DROP NOT NULL;',
                ),
            ],
        ),
    ]
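A bare SET NOT NULL forces PostgreSQL to scan the table for NULLs while holding an ACCESS EXCLUSIVE lock. On PostgreSQL 12+, the scan can be moved outside that lock with a NOT VALID check constraint. A sketch, reusing the table and column names above; the constraint and dependency names are illustrative:

```python
from django.db import migrations


class Migration(migrations.Migration):
    # Run outside a single wrapping transaction so the brief
    # ACCESS EXCLUSIVE lock from ADD CONSTRAINT is released
    # before the long VALIDATE scan begins.
    atomic = False

    dependencies = [('listings', '0004_backfill_category_v2')]  # illustrative

    operations = [
        # 1. Declare the constraint without checking existing rows:
        #    a brief lock, no scan.
        migrations.RunSQL(
            "ALTER TABLE listings_listing "
            "ADD CONSTRAINT category_v2_not_null "
            "CHECK (category_v2 IS NOT NULL) NOT VALID;",
            reverse_sql="ALTER TABLE listings_listing "
                        "DROP CONSTRAINT category_v2_not_null;",
        ),
        # 2. Validate: scans the table, but under a SHARE UPDATE
        #    EXCLUSIVE lock, so reads and writes continue.
        migrations.RunSQL(
            "ALTER TABLE listings_listing "
            "VALIDATE CONSTRAINT category_v2_not_null;",
            reverse_sql=migrations.RunSQL.noop,
        ),
        # 3. On PostgreSQL 12+, SET NOT NULL sees the validated CHECK
        #    constraint and skips its own scan; then drop the helper.
        migrations.RunSQL(
            "ALTER TABLE listings_listing "
            "ALTER COLUMN category_v2 SET NOT NULL; "
            "ALTER TABLE listings_listing "
            "DROP CONSTRAINT category_v2_not_null;",
            reverse_sql="ALTER TABLE listings_listing "
                        "ALTER COLUMN category_v2 DROP NOT NULL;",
        ),
    ]
```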

Production Safeguards

Batch Sleep Interval

A 100ms sleep between batches prevents the migration from monopolizing database I/O. Without the sleep, the backfill runs faster but competes with production queries for disk bandwidth. The sleep turns a 28-second migration into a 42-second migration, but p99 query latency stays flat throughout. The extra 14 seconds are invisible to users; the alternative 34-second lock is not.

SeparateDatabaseAndState

Django's migration framework tracks both the live database schema and its own record of model state. When a change is applied with raw SQL, Django's recorded state does not learn about it and drifts from reality. SeparateDatabaseAndState pairs the raw SQL with a state-only operation - effectively telling Django 'the database already has this change; just update your model definition' - which prevents Django from trying to re-apply the change on the next migrate run.

Staging Replica Validation

Every decomposed migration runs on a staging database restored from a production snapshot. pg_stat_activity monitors lock waits in real time. If any step holds an ACCESS EXCLUSIVE lock for more than 50ms, the migration is redesigned before it touches production. This validation caught two issues: a forgotten index rebuild that would have locked the table, and a CHECK constraint that PostgreSQL validates by scanning every row.
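The lock-wait monitoring described above can be a single query joining PostgreSQL's standard pg_locks and pg_stat_activity views. A sketch of such a query (the constant name is ours; executing it needs a driver such as psycopg, not shown):

```python
# Sessions currently waiting on a lock, longest wait first.
# pg_locks.granted is false while a lock request is still queued.
LOCK_WAITS_SQL = """
SELECT a.pid,
       a.query,
       l.mode,
       now() - a.query_start AS waiting_for
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted
ORDER BY waiting_for DESC;
"""
```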

03.
THE RESULT

Invisible Infrastructure Changes

The 4-step migration processed 4 million rows in 42 seconds with zero downtime. Request latency stayed within normal p99 bounds throughout. No connection pool warnings. No 502s. No monitoring alerts. The product team did not know the migration had run until they checked the changelog. The pattern has since been applied to 11 additional schema changes on tables ranging from 500K to 12M rows, all during peak traffic hours.

KEY METRICS

4M Rows Migrated
0s Downtime
<1ms p99 Impact

WHAT THE CLIENT SAYS

"We used to schedule migrations at 3 AM and hope nothing broke. Now they run during business hours and nobody notices. That shift changed how we think about database changes entirely."

Lead Developer

B2B Marketplace · Engineering

FAQ

Why not use pt-online-schema-change or similar tools?

What if the backfill fails halfway?

Does this work for all column types?

TECHNOLOGY STACK