On one project, 40 million product prices got dumped to FTP every 30 minutes. That sounds extreme until you realise why it was there: the event-driven sync running alongside it kept dropping deltas.

This is the universal observation in continuous data sync — events drop. Service buses, queues, retries, dead-letter queues, idempotency keys: none of it stops a missing emit at the source. The trigger doesn’t fire, the CDC log rotates past the consumer, the iDoc sticks in a workflow, the message gets ack’d before the downstream commits. By the time anyone notices the drift, reconciliation is its own project.

The pattern

Run the event-driven path as normal. Then also schedule a periodic full export from the source — every 30 minutes, every hour, nightly, whatever cadence the source can sustain. A small process diffs consecutive dumps and emits the missing rows onto the same downstream pipeline as the events. Same destination, same writer, same idempotency keys. The event path keeps latency low between dumps. The dump-and-diff path catches whatever the event path missed.

The other property worth naming: the whole flow stays unidirectional. Data goes source → dump → diff → destination, and the destination never calls back to the source. That sounds minor until your first big DLQ replay — the common “event carries an ID, consumer fetches the row” pattern melts source rate limits the moment you replay anything at scale. Removing the back-channel removes a whole class of incidents. Reconciliation stops being a quarterly fire drill and becomes the slower lane of normal traffic.

                     ┌──────────────────────────┐
                     │  Source (Oracle / SAP)   │
                     │  Primary OLTP            │
                     └──────────┬───────────────┘
           ┌────────────────────┴────────────────────┐
           │                                         │
    events │ (CDC / iDocs /              snapshot to │
           │  triggers)                  read replica│
           ▼                                         ▼
    ┌────────────┐                       ┌────────────────────┐
    │  Event     │                       │  Periodic dump     │
    │  stream    │                       │  → Parquet (e.g.   │
    │            │                       │     every 30 min)  │
    └─────┬──────┘                       └──────────┬─────────┘
          │                                         │
          │                                         ▼
          │                                ┌────────────────┐
          │                                │  DuckDB diff   │
          │  fat deltas                    │  T  vs  T-1    │
          │  (full row,                    │  → fat deltas  │
          │   idempotent)                  └──────┬─────────┘
          │                                       │
          ▼                                       ▼
          ┌─────────────────────────────────────────────┐
          │  Shared destination pipeline                │
          │  (queue / bus, idempotent writer)           │
          └─────────────────────┬───────────────────────┘
                                │
                                ▼
                        ┌────────────────┐
                        │  Destination   │
                        │  RDBMS         │
                        └────────────────┘

    all arrows point downstream — destination never queries source

Tooling: Parquet for the dump, DuckDB for the diff

The choice of format and engine matters more than it sounds. Dump as Parquet — columnar, compressed; a 40-million-row price snapshot lands at a few hundred megabytes. Diff with DuckDB — single binary, reads Parquet directly, and a FULL OUTER JOIN (or EXCEPT) across two snapshots returns the changed rows in seconds on a laptop. No Spark cluster, no warehouse spend, no “we’re standing up Airflow for this.” The diff is a SQL file and a cron entry.
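
A minimal sketch of both steps, in DuckDB SQL. Everything here is illustrative rather than lifted from the original project: the extract is assumed to arrive as a CSV, the table is assumed to carry product_id, price and updated_at, and the file names encode the 30-minute cadence.

-- 1. Stage the latest extract as a Parquet snapshot.
--    (All table, column, and file names below are illustrative.)
COPY (
  SELECT product_id, price, updated_at
  FROM read_csv_auto('prices_export_1130.csv')
) TO 'prices_1130.parquet' (FORMAT PARQUET);

-- 2. Diff the new snapshot against the previous one.
--    The output rows are fat deltas: the full row plus an op marker.
COPY (
  SELECT
    coalesce(t.product_id, p.product_id) AS product_id,
    t.price,
    t.updated_at,
    CASE
      WHEN p.product_id IS NULL THEN 'insert'
      WHEN t.product_id IS NULL THEN 'delete'
      ELSE 'update'
    END AS op
  FROM read_parquet('prices_1130.parquet') t
  FULL OUTER JOIN read_parquet('prices_1100.parquet') p
    ON t.product_id = p.product_id
  WHERE p.product_id IS NULL
     OR t.product_id IS NULL
     OR t.price IS DISTINCT FROM p.price
     OR t.updated_at IS DISTINCT FROM p.updated_at
) TO 'deltas_1130.parquet' (FORMAT PARQUET);

The WHERE clause is what keeps the output small: rows that didn’t change never leave the diff, so the downstream pipeline only sees real deltas plus whatever the event path already carried.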

When this works

  • Extraction runs against a read replica, snapshot, or staging area — not the primary OLTP. Table size stops mattering once this is true; 40 million rows every 30 minutes is comfortable when you’re not touching production.
  • The source can express per-row ordering. An updated_at column, an Oracle SCN, a Postgres LSN, SAP CHANGED_ON + CHANGED_AT, or a dedicated version column: something monotonic per row. Without this you can’t tell stale from fresh at the destination, and the dump-and-diff path quietly overwrites live state with snapshot-old values. This is the unsung prerequisite; a quick check for it is sketched just after this list. (See the version-ordering section at the bottom — this is the one that bites the business.)
  • Your delta records carry the full row, not just IDs. Dump-and-diff produces fat deltas naturally; make sure your destination writer doesn’t degrade to ID-only and force a query-back. Lose this and the unidirectional benefit collapses.
  • The dump completes well inside the cadence interval. A 30-minute cycle needs a dump that finishes in well under 30 — ideally half. Otherwise dumps overlap, the diff falls behind, and your reconciliation latency is whatever dump-plus-diff actually takes, not the cadence you wrote on the diagram.
  • Destination writes are idempotent and version-ordered. Applying the same row twice doesn’t break anything, and a stale delta can’t overwrite a fresher one. Both contracts are explained at the bottom — they’re where the business risk lives.
  • You can store at least two dumps to diff against. Cheap on object storage, FTP, or NFS.
  • “Eventually consistent within the dump interval” is acceptable to the consumer.
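
Before trusting that ordering signal, it’s worth confirming it actually moves whenever the content does. A rough sketch, reusing the hypothetical snapshot files and column names from the diff example above:

-- Rows whose content changed between snapshots without the ordering
-- signal moving forward. Anything non-zero here means the diff path
-- can overwrite fresher state downstream.
SELECT count(*) AS rows_with_stuck_versions
FROM read_parquet('prices_1130.parquet') t
JOIN read_parquet('prices_1100.parquet') p
  ON t.product_id = p.product_id
WHERE t.price IS DISTINCT FROM p.price
  AND t.updated_at <= p.updated_at;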

When it doesn’t

  • No read replica or staging area — every export hits the primary and degrades it.
  • The source has no per-row ordering signal — no timestamp, no version, no SCN/LSN. You can’t differentiate stale from fresh at the destination, and the diff path will eventually clobber live state with snapshot-old values.
  • You can’t fit the full row in the delta and the destination has to call back anyway. The unidirectional property is the whole point; without it you’re doing batch on top of events for no real win.
  • Destination writes aren’t idempotent or version-aware — overlap between the event path and the diff path causes double-apply, or a stale diff-delta silently overwrites a fresher event.
  • You need sub-second reconciliation. The dump interval is your floor, and a 30-minute floor is too high.

Idempotent writes, in business terms

If your destination writer treats every delta as new business — INSERT INTO billing_events ..., increment a counter, append to a ledger — running events and diff together turns into a corruption machine. Both paths will eventually emit the same change, and the destination will:

  • bill the customer twice for one logical price update
  • double-count the metric for one logical event
  • trip a unique-constraint violation, alarm, and stall the pipeline

The fix is the shape of the write, not retry logic:

  • UPSERT keyed on the source’s natural ID: INSERT ... ON CONFLICT (product_id) DO UPDATE (Postgres), MERGE (Oracle/SQL Server). Two writes, one row.
  • Set absolute values, never increment. Store price = 11, not price = price + 1. Increment-style writes are inherently non-idempotent — replay them and you double the effect.
  • Dedupe by content hash when the business event genuinely is “this thing happened” (a payment, a click) rather than “this is the new state.” Hash the payload, swallow duplicates by hash before applying.

The destination should shrug if you tell it the same true thing twice.
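
For the dedupe-by-hash case, a minimal sketch on the destination side, assuming Postgres and an illustrative payment_events table (none of these names come from the original system):

-- Append-only events: a unique constraint on a content hash makes the
-- second arrival of the same payload a no-op instead of a double-count.
-- (Table and column names are illustrative.)
CREATE TABLE IF NOT EXISTS payment_events (
    payload       jsonb        NOT NULL,
    payload_hash  text         NOT NULL UNIQUE,
    received_at   timestamptz  NOT NULL DEFAULT now()
);

INSERT INTO payment_events (payload, payload_hash)
VALUES (:payload, md5(:payload::text))
ON CONFLICT (payload_hash) DO NOTHING;

Whether the hash is computed in the writer or in the statement matters less than the constraint itself: replaying the same payload simply cannot create a second row.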

Version-ordered writes (the bigger one)

Idempotency stops you double-applying. It does not stop a stale delta from clobbering a fresh one. The diff path produces deltas with values from the last snapshot — those values can be 30 minutes old by the time they hit the destination. If a fresh event for the same row arrives first, and the stale diff-delta lands second, the destination ends up with last-snapshot’s price overwriting today’s.

The business consequences land here:

  • selling at last week’s price for the next half-hour
  • shipping to an address the customer changed an hour ago
  • charging a customer who churned this morning
  • invoices going out at the wrong tax rate

The fix is a monotonic version on every record at the source, plus a conditional write at the destination. Common version sources:

  • An updated_at timestamp on the row (watch for clock skew on multi-master sources, and sub-second ties)
  • An Oracle SCN, Postgres LSN, or SAP CHANGED_ON + CHANGED_AT
  • A dedicated version column the source bumps on every write
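
For that third option, here is one way a source can maintain the column, sketched for a Postgres-style source; Oracle and SAP have their own mechanisms, and the table name is illustrative:

-- Bump the version on every write so downstream writers can order deltas.
-- (Illustrative Postgres trigger; not the original system's implementation.)
CREATE OR REPLACE FUNCTION bump_version() RETURNS trigger AS $$
BEGIN
    NEW.version := OLD.version + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_bump_version
BEFORE UPDATE ON products
FOR EACH ROW
EXECUTE FUNCTION bump_version();

However the version is produced, the contract is the same: for a given row it only ever moves forward.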

The destination’s write becomes conditional:

INSERT INTO products (id, price, version)
VALUES (:id, :price, :version)
ON CONFLICT (id) DO UPDATE
   SET price = EXCLUDED.price, version = EXCLUDED.version
 WHERE EXCLUDED.version > products.version;

Stale deltas hit the predicate and become no-ops. This is the property that lets two parallel paths run without one quietly overwriting the other.
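
To see that concretely, here is the same statement applied twice for one hypothetical product, with the fresher event landing first:

-- Event path delivers the fresher state first.
INSERT INTO products (id, price, version)
VALUES (42, 11.00, 7)
ON CONFLICT (id) DO UPDATE
   SET price = EXCLUDED.price, version = EXCLUDED.version
 WHERE EXCLUDED.version > products.version;
-- row 42 now holds price 11.00 at version 7

-- Diff path later replays the same product from an older snapshot.
INSERT INTO products (id, price, version)
VALUES (42, 10.50, 5)
ON CONFLICT (id) DO UPDATE
   SET price = EXCLUDED.price, version = EXCLUDED.version
 WHERE EXCLUDED.version > products.version;
-- the conflict fires, the predicate fails, nothing changes: still 11.00 at version 7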

Of the two contracts: idempotency is “don’t break things on overlap.” Version-ordering is “don’t go backwards.” The second is the one that costs the business money.

Most teams overengineer this

I’ve watched teams reach for Kafka + Debezium + a stream-processing framework + a feature store + a CDC orchestrator before they had a single user complaining about latency. Most companies don’t run at Amazon or Google scale. Most data-sync use cases tolerate a 30-minute drift, or a 6-hour drift, just fine. If your business can wait the dump interval to converge — and most businesses can — the pattern in this post is genuinely all you need. A SQL diff over two Parquet snapshots beats a streaming pipeline you have to babysit at 2 AM.

So here’s the checklist. Reach for this pattern when most of these tick:

  • Drift window of minutes-to-hours is acceptable to the business — 30 minutes, an hour, overnight. If the consumer can wait the dump interval, you don’t need streaming.
  • You can extract from a read replica, snapshot, or staging area — not the primary OLTP. If every export hits production, stop here.
  • The source has a per-row ordering signal: updated_at, SCN, LSN, version column. Without this, version-ordered destination writes are impossible and the diff path is unsafe.
  • The dump finishes well inside the cadence interval. Half the cycle is a comfortable rule of thumb.
  • Delta records carry the full row, not just IDs. Otherwise the destination calls back to the source and you’ve lost the unidirectional benefit.
  • You can store at least two dumps to diff against. Object storage, FTP, or NFS — all cheap.
  • Destination writes are idempotent — UPSERT on the natural key, set-don’t-increment.
  • Destination writes are version-aware — conditional on incoming.version > stored.version.

If most of those tick, you don’t need streaming CDC. You need cron, Parquet, DuckDB, and a writer that respects versions. Build the streaming pipeline the day a real consumer asks for sub-second freshness — not before.

First in a short series on data sync patterns I’d reach for again.