Services

Backend Engineering, DevOps

Industry

B2B Marketplace

Year

2023-2024

Surviving 50,000 webhook events per day without losing one.

A B2B marketplace relied on SendGrid for transactional email: signup confirmations, password resets, invoice notifications. SendGrid fires webhook events for every delivery, open, click, and bounce. At scale, these events arrive in bursts of hundreds per second. The existing signal handler dropped events silently whenever two webhooks hit the same database row at the same time. Nobody noticed until a compliance audit revealed gaps in the delivery logs.

01.
THE CHALLENGE

Silent Data Loss Under Concurrency

The webhook endpoint processed events synchronously inside a Django signal handler. Under normal load, it worked fine. During email campaigns, when thousands of messages went out within minutes, SendGrid would fire hundreds of tracking events simultaneously. Two webhooks trying to create the same EmailHooks record at the same instant caused database deadlocks. Django caught the IntegrityError, the signal handler crashed, and the event was lost. No retry, no log entry, no alert. The compliance team discovered the gaps weeks later when delivery confirmations did not match SendGrid's own records.

Two webhooks hitting the same row at the same instant. One wins, one crashes silently. Nobody notices for weeks.
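The failure mode can be reproduced without a database. In this toy sketch (an assumption, not the original handler's code), a set and a lock stand in for the EmailHooks table and its unique index; a barrier forces both workers past the "does this row exist?" check before either inserts, so one insert violates uniqueness and the event is silently dropped:

```python
import threading

seen = set()            # plays the EmailHooks table
write_lock = threading.Lock()
lost = []
barrier = threading.Barrier(2)

def insert_unique(event_id):
    # stands in for INSERT against a UNIQUE constraint
    with write_lock:
        if event_id in seen:
            raise KeyError("duplicate")  # plays the role of IntegrityError
        seen.add(event_id)

def naive_handler(event_id):
    exists = event_id in seen     # check ...
    barrier.wait()                # force both workers past the check
    if not exists:
        try:
            insert_unique(event_id)  # ... then insert: one of the two fails
        except KeyError:
            lost.append(event_id)    # swallowed: silent data loss

workers = [threading.Thread(target=naive_handler, args=("sg-evt-1",))
           for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(f"processed={len(seen)} lost={len(lost)}")  # processed=1 lost=1
```

The check-then-insert gap is microseconds wide in production, which is why the bug only surfaced during campaign bursts.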

02.
THE SOLUTION

Deadlock-Resilient Signal Processing

Atomic Writes

Every webhook event enters a transaction.atomic() block with get_or_create keyed on the event_id. If the record already exists, the handler returns early. This makes the operation idempotent by design. No duplicate writes, no partial state.
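The early-return semantics can be sketched with a toy get_or_create (an assumption for illustration: a dict stands in for the EmailHooks table, its key playing the unique index on event_id; in the real schema it is that database constraint, not Python, that arbitrates the race):

```python
store = {}  # plays the EmailHooks table; the key acts as the unique index

def get_or_create(event_id, defaults):
    if event_id in store:
        return store[event_id], False        # existing row: created=False
    row = {"event_id": event_id, **defaults}
    store[event_id] = row
    return row, True                         # new row: created=True

hook, created = get_or_create("sg-123", {"event": "delivered"})
dup, created_again = get_or_create("sg-123", {"event": "delivered"})
print(created, created_again, hook is dup)  # True False True
```

In the real handler, the early return on created == False is what makes a redelivered webhook a harmless no-op.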

Automatic Retry

If a deadlock is detected, the handler makes up to five attempts, sleeping between them with progressive backoff: 0.5s, 1.0s, 1.5s, then 2.0s before the final attempt. The staggering breaks the thundering-herd pattern in which concurrent retries would immediately deadlock again. Only genuine deadlocks are retried; an IntegrityError from a real constraint violation is raised immediately.
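The deadlock check can be made more robust than matching the message text alone. This helper is an illustrative assumption, not part of the original handler: it matches MySQL's deadlock error code (1213) first and falls back to message matching, alongside the backoff schedule the handler uses:

```python
# Hypothetical helper: classify a database error as a retryable deadlock.
# MySQL reports deadlocks as error code 1213 ("Deadlock found when trying
# to get lock"); message matching remains as a fallback for drivers that
# don't expose the code in exc.args.
MYSQL_DEADLOCK_CODE = 1213

def is_retryable_deadlock(exc):
    args = getattr(exc, "args", ())
    if args and args[0] == MYSQL_DEADLOCK_CODE:
        return True
    return "deadlock" in str(exc).lower()

def backoff_delay(attempt, base=0.5):
    # progressive backoff: 0.5s, 1.0s, 1.5s, 2.0s for attempts 0..3
    return base * (attempt + 1)
```

A constraint violation such as "UNIQUE constraint failed" is correctly classified as non-retryable and propagates immediately.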

Deferred Enrichment

Once the database write succeeds, transaction.on_commit() queues a Celery task that calls the SendGrid Messages API and enriches the record with full metadata. Because the callback is registered on commit, the async task only fires after the data is committed; if the write rolls back, the enrichment never runs.
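The on_commit contract can be modeled in a few lines. This is a toy of its observable behavior only, not Django's implementation: callbacks registered inside a transaction run only after a successful commit and are discarded on rollback.

```python
class Transaction:
    def __init__(self):
        self._on_commit = []

    def on_commit(self, fn):
        self._on_commit.append(fn)

    def commit(self):
        callbacks, self._on_commit = self._on_commit, []
        for fn in callbacks:
            fn()  # e.g. enrich_hook.delay(pk) in the real handler

    def rollback(self):
        self._on_commit = []  # write failed: enrichment never queued

fired = []
tx = Transaction()
tx.on_commit(lambda: fired.append("enrich"))
tx.rollback()
print(fired)  # [] -- a rolled-back write queues nothing

tx.on_commit(lambda: fired.append("enrich"))
tx.commit()
print(fired)  # ['enrich'] -- the task fires only after commit
```

This ordering is what prevents the Celery worker from racing the database: by the time the task runs, the row it enriches is guaranteed to exist.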


Under The Hood

The signal handler with retry logic and deferred enrichment:

Python
from time import sleep

from django.db import IntegrityError, OperationalError, transaction
from django.dispatch import receiver
from django.utils.dateparse import parse_datetime

# event_received (custom signal), EmailHooks, and enrich_hook
# are defined elsewhere in the application.

MAX_RETRIES = 5

@receiver(event_received)
def handle_webhook_event(sender, event_data, **kwargs):
    for attempt in range(MAX_RETRIES):
        try:
            with transaction.atomic():
                hook, created = EmailHooks.objects.get_or_create(
                    event_id=event_data['sg_event_id'],
                    defaults={
                        'email': event_data['email'],
                        'event': event_data['event'],
                        'timestamp': parse_datetime(
                            event_data['timestamp']
                        ),
                        'sg_message_id': event_data.get('sg_message_id'),
                    }
                )
                if not created:
                    return  # Already processed — idempotent

                transaction.on_commit(
                    lambda pk=hook.pk: enrich_hook.delay(pk)
                )
            return

        except (IntegrityError, OperationalError) as exc:
            if 'Deadlock' in str(exc) and attempt < MAX_RETRIES - 1:
                sleep(0.5 * (attempt + 1))
                continue
            raise

The Celery task that enriches each record via the SendGrid API:

Python
import json

from celery import shared_task

@shared_task(bind=True, max_retries=3)
def enrich_hook(self, hook_id):
    hook = EmailHooks.objects.get(pk=hook_id)

    try:
        # The SendGrid client returns an HTTP response;
        # the message metadata lives in its JSON body.
        response = sg_client.client.messages._(
            hook.sg_message_id
        ).get()
        msg = json.loads(response.body)

        hook.subject = msg['subject']
        hook.from_email = msg['from_email']
        hook.opens_count = msg.get('opens_count', 0)
        hook.clicks_count = msg.get('clicks_count', 0)
        hook.status = 'enriched'
        hook.save(update_fields=[
            'subject', 'from_email',
            'opens_count', 'clicks_count', 'status',
        ])

    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

03.
THE RESULT

Zero Data Loss at Scale

The rebuilt handler processed over 50,000 webhook events per day during peak campaign periods without a single lost event. The retry mechanism caught deadlocks transparently. The compliance team never saw another gap in the delivery logs. The Celery enrichment pipeline added full message metadata within seconds of each event, giving the support team real-time visibility into email delivery status.

KEY METRICS

50,000+ Daily Events
0 Events Lost
100% Retry Success
WHAT THE CLIENT SAYS

"We went from discovering missing delivery records weeks after the fact to having real-time visibility into every email event. The compliance team finally trusts the data."

Engineering Lead

B2B Marketplace · Platform Operations

FAQ

Why not use a dedicated message queue instead of Django signals?

What happens if all five retries fail?

How do you prevent the Celery task from running on a rolled-back transaction?

TECHNOLOGY STACK