What Is Idempotence in Data Sync?

Utku Zihnioglu

CEO & Co-founder

A subscription renewal triggers in Stripe. The sync to HubSpot succeeds, then the worker crashes before it writes the cursor. Twelve seconds later the retry runs. Now you have two contacts with the same email, the renewal automation fires twice, and the MRR widget on the revenue dashboard is off by one customer. Nobody saw an error. The pipeline did exactly what it was told. It was just told to do something that wasn't idempotent.

Idempotence is the property that would have prevented all of it. A sync is idempotent when running it twice produces the same destination state as running it once. Hit retry, hit it again, hit it ten times. The CRM still has one contact, the renewal email still goes out once, and the revenue widget still shows one customer. For analytics pipelines this matters because duplicate rows skew aggregates. For operational sync between SaaS tools, it matters because every duplicate is a customer who gets emailed twice or a sales rep who calls the wrong John Smith.

If you read the Wikipedia entry on idempotence, you'll see it defined as a math property: applying a function more than once gives you the same result as applying it once. abs(abs(-3)) is still 3. The data sync version is the same idea, dragged into the messy real world of partial failures, network blips, and webhooks that fire twice on retry. The bar isn't "the function is pure." The bar is "this pipeline can crash and resume without poisoning the destination."
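
As a quick illustration of the math definition, here is a minimal Python sketch. The helper names `is_idempotent` and `increment` are made up for this example; the check simply compares applying a function twice against applying it once.

```python
def is_idempotent(f, x):
    """Return True if applying f twice to x matches applying it once."""
    return f(f(x)) == f(x)


def increment(n):
    # Not idempotent: every extra application changes the result.
    return n + 1


print(is_idempotent(abs, -3))        # True: abs(abs(-3)) == abs(-3)
print(is_idempotent(increment, -3))  # False: -1 != -2
```

A blind INSERT behaves like `increment`. A sync you can safely retry has to behave like `abs`.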

What idempotence means for data sync

In a sync context, idempotence shows up in three places: how records are identified, how they're written, and how progress is tracked.

  • Primary keys. Every record has a stable identifier the destination recognizes. A Stripe customer's cus_xxx ID maps to a HubSpot contact's external_id. The next sync looks at the ID, finds the existing contact, and updates it instead of creating a new one.

  • Upserts, not inserts. The write path is "update if it exists, insert if it doesn't." A blind INSERT is the opposite of idempotent: run it twice and you get two rows. An upsert keyed on the primary key collapses both runs into the same end state.

  • Durable cursors. The pipeline tracks how far it got using a checkpoint that survives a crash. On restart it picks up at the last confirmed cursor, which means some records get processed again. That's fine, because the keys and upserts make the replay safe.

Pull any one of those out and idempotence breaks. Upserts without stable primary keys have nothing reliable to match on, so they update the wrong row or create a new one. Cursors without upserts mean a replayed batch creates duplicates. Drop the cursor entirely and the pipeline either re-syncs from scratch every time or silently loses data after a failure.
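
The three pieces can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not a production design: `Destination`, `CursorStore`, and `run_sync` are hypothetical names, the destination is an in-memory dict, and the "durable" cursor is a plain object standing in for a database row or file. The run reproduces the opening scenario, where the write lands and the worker dies before the cursor is saved.

```python
from dataclasses import dataclass, field


@dataclass
class Destination:
    """Stand-in for a CRM: records stored by external ID."""
    contacts: dict = field(default_factory=dict)

    def upsert(self, external_id, fields):
        # Update in place if the key exists, insert otherwise.
        self.contacts.setdefault(external_id, {}).update(fields)


@dataclass
class CursorStore:
    """Stand-in for a durable checkpoint."""
    position: int = 0


def run_sync(source, dest, cursor, crash_before_commit=False):
    """Write every record past the cursor, then commit the new cursor."""
    pending = [r for r in source if r["seq"] > cursor.position]
    for record in pending:
        dest.upsert(record["id"], {"email": record["email"]})
    if crash_before_commit:
        raise RuntimeError("worker crashed before writing the cursor")
    if pending:
        cursor.position = max(r["seq"] for r in pending)


source = [
    {"seq": 1, "id": "cus_001", "email": "ada@example.com"},
    {"seq": 2, "id": "cus_002", "email": "lin@example.com"},
]
dest, cursor = Destination(), CursorStore()

try:
    run_sync(source, dest, cursor, crash_before_commit=True)
except RuntimeError:
    pass  # the writes landed, the checkpoint did not

run_sync(source, dest, cursor)  # the retry replays both records
print(len(dest.contacts), cursor.position)  # 2 2
```

The retry processes both records a second time, and the destination still holds two contacts. Swap `upsert` for an append to a list and the same run leaves four.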

What goes wrong without idempotence in operational tools

The warehouse-loading guides will tell you that non-idempotent pipelines duplicate rows in Snowflake, which makes your COUNT(DISTINCT customer_id) queries lie. That's true, and it's also the least painful version of the problem. When the destination is a CRM, support tool, or marketing platform, the consequences leak into the world.

A few specific failure modes we've seen on customer accounts:

  • Duplicate contacts. Sync retries after a network timeout. The retry doesn't recognize the contact it already created, so it creates a second one with the same email. Now your sales team sees two records for the same person, your deduplication report grows, and one of them is missing the deal you just attached.

  • Double-fired marketing emails. The sync writes a lifecycle_stage change from lead to customer. The destination's automation is wired to send a welcome email when that field flips. The retry writes the change again, the automation runs again, and the customer gets two welcome emails twelve seconds apart.

  • Inflated counts and metrics. A weekly MRR report sums subscription_amount across all active contacts. After a partial sync failure and replay, three high-value customers exist twice. The number on the dashboard is real, the customers behind it are not.

  • Webhook storms. A non-idempotent webhook handler treats every retry as a new event. Stripe retries failed webhooks for up to three days. A bad handler can turn one renewal into thirty.

These aren't theoretical. Any team that has ever filed a "why does HubSpot have three of me" ticket has felt them. The damage is operational, not analytical, and the people who feel it first are not on the data team.
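
For the webhook case, one common guard is to remember which event IDs have already been handled. The sketch below assumes each delivery carries a unique `id` that stays the same across retries. The payload shape is simplified for illustration and is not a faithful copy of any vendor's event format, and the in-memory `set` stands in for a durable store.

```python
processed_event_ids = set()  # a durable table or key-value store in practice
renewals_applied = []


def handle_webhook(event):
    """Apply an event at most once, keyed on the sender's event ID."""
    event_id = event["id"]
    if event_id in processed_event_ids:
        return "duplicate"  # acknowledge the retry, change nothing
    renewals_applied.append(event["data"]["subscription"])
    processed_event_ids.add(event_id)
    return "processed"


# Hypothetical event; the same payload is delivered thirty times.
event = {"id": "evt_123", "type": "invoice.paid", "data": {"subscription": "sub_456"}}
results = [handle_webhook(event) for _ in range(30)]
print(results.count("processed"), len(renewals_applied))  # 1 1
```

In a real handler the ID check and the side effect would need to commit together, otherwise a crash between the two reopens the gap.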

How idempotent sync works: keys, upserts, and cursors

The architecture is simpler than the failure modes make it sound. Three rules cover most of it.

| Component | What it does | What goes wrong without it |
|---|---|---|
| Stable primary key | Identifies the same record across runs | Same record written as two different rows on retry |
| Upsert write semantics | Updates in place if the key exists | INSERT creates a duplicate every replay |
| Durable cursor | Lets the pipeline resume from a checkpoint | Full re-sync on every restart, or silent data loss |

For database sources, the primary key is usually the database's own primary key, and the cursor is a last_modified_at timestamp or a transaction log position. For SaaS sources the primary key is the tool's record ID (Stripe customer ID, HubSpot contact ID, Intercom user ID), and the cursor is whatever modification timestamp the API exposes. Some APIs are easier to make idempotent than others. The ones that don't expose a stable record ID or a useful change timestamp are the reason webhook idempotency keys exist as a separate concept.
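
For a database destination, the upsert can be a single SQL statement. Here is a minimal sketch using Python's built-in `sqlite3` module. It assumes SQLite 3.24 or newer for the `ON CONFLICT ... DO UPDATE` clause; the table and column names are illustrative, and other databases spell the same idea differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        external_id      TEXT PRIMARY KEY,
        email            TEXT NOT NULL,
        last_modified_at TEXT NOT NULL
    )
""")

UPSERT = """
    INSERT INTO contacts (external_id, email, last_modified_at)
    VALUES (?, ?, ?)
    ON CONFLICT(external_id) DO UPDATE SET
        email = excluded.email,
        last_modified_at = excluded.last_modified_at
"""


def load_batch(rows):
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(UPSERT, rows)


batch = [
    ("cus_001", "ada@example.com", "2024-01-01T00:00:00Z"),
    ("cus_002", "lin@example.com", "2024-01-01T00:05:00Z"),
]
load_batch(batch)
load_batch(batch)  # replay of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
# The newest modification timestamp seen so far is the next cursor value.
next_cursor = conn.execute("SELECT MAX(last_modified_at) FROM contacts").fetchone()[0]
print(row_count, next_cursor)  # 2 2024-01-01T00:05:00Z
```

Loading the batch twice leaves two rows, and the cursor is the latest `last_modified_at` seen.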

HTTP method semantics got this right early. PUT is defined as idempotent because it sets a resource to a given state, so repeating the request leaves the same result. POST carries no such guarantee, and in practice it usually creates a new resource on each call. A sync that uses PUT-shaped operations on a known key is idempotent for free. One that uses POST everywhere has to invent its own deduplication layer, usually as an idempotency key the client attaches to every request.
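
The difference is easy to see in a toy model. `ContactsApi` below is an in-memory stand-in invented for this example, not any real vendor's interface: `put` writes to a key the caller already knows, and `post` lets the server pick the ID unless the caller supplies an idempotency key.

```python
import itertools


class ContactsApi:
    """In-memory stand-in for a destination API (hypothetical)."""

    def __init__(self):
        self.contacts = {}
        self._next_id = itertools.count(1)
        self._result_by_key = {}

    def put(self, contact_id, body):
        # PUT-shaped: set the resource at a known key to this state.
        self.contacts[contact_id] = dict(body)
        return contact_id

    def post(self, body, idempotency_key=None):
        # POST-shaped: the server assigns the ID, so a blind retry adds a record.
        if idempotency_key is not None and idempotency_key in self._result_by_key:
            return self._result_by_key[idempotency_key]  # replay the first result
        contact_id = f"contact_{next(self._next_id)}"
        self.contacts[contact_id] = dict(body)
        if idempotency_key is not None:
            self._result_by_key[idempotency_key] = contact_id
        return contact_id


body = {"email": "ada@example.com"}

api = ContactsApi()
api.put("cus_001", body)
api.put("cus_001", body)
put_count = len(api.contacts)

api = ContactsApi()
api.post(body)
api.post(body)
blind_post_count = len(api.contacts)

api = ContactsApi()
api.post(body, idempotency_key="renewal-001")
api.post(body, idempotency_key="renewal-001")
keyed_post_count = len(api.contacts)

print(put_count, blind_post_count, keyed_post_count)  # 1 2 1
```

The idempotency key earns POST the same guarantee PUT gets from the resource key, at the cost of a lookup table the server has to keep.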

There's also a soft-failure version of the problem that's worth a paragraph. Hard failures crash the pipeline and force a clean restart from the last cursor. Soft failures load corrupted data without raising an error: a partial batch, a truncated field, a record mapped to the wrong tenant. Idempotence helps with hard failures because the replay is safe. Soft failures still need human investigation, but at least the rollback-and-replay step doesn't make things worse.

Idempotence in operational sync vs. warehouse loading

Most of what's written about idempotence assumes a warehouse on the receiving end. Snowflake or BigQuery, batch loads, cursors at the batch boundary, the occasional duplicate row that an analyst has to clean up later. The architecture is well-understood and the failure mode is "queries return wrong numbers."

Operational sync flips two things about that picture:

  1. The destination is a SaaS tool, not a table. You can't run DELETE FROM contacts WHERE created_at > <timestamp> to clean up a bad replay. You have to call the destination's API, find the duplicates, decide which one is canonical, and merge them. That's a per-record manual step that doesn't scale.

  2. Writes have side effects. Writing a row to a warehouse triggers nothing except maybe a downstream dbt model. Writing a record to HubSpot can trigger a workflow, an email, a Slack notification, a Salesforce sync, and a Zapier zap. A non-idempotent sync into an operational tool isn't just a data problem. It's an automation problem.

That's why idempotence is the floor for operational sync, not a brag-worthy feature. A warehouse loader can survive without it because the cleanup is SQL. A CRM sync can't, because the cleanup is your support team explaining to a customer why they got two invoices.

I'll admit the line is fuzzier than it sounds. Reverse-ETL tools sit between these two worlds, with a warehouse on the source side and an operational tool on the destination side. The good ones treat idempotence the same way an operational sync would. The bad ones leak warehouse semantics into the destination and produce exactly the duplicate-contact problem above. Either way, the bar is set by the destination, not the source.

How Oneprofile makes sync idempotent by default

Oneprofile's sync engine handles the three components above without configuration. Every connected tool exposes its native record IDs as primary keys. Every write is an upsert keyed on those IDs, with field-level change tracking so only the changed fields are sent. Cursors are durable and per-sync-config, so a crash mid-batch resumes from the last confirmed checkpoint instead of starting over. There are no idempotency keys for you to attach, no dedupe job for you to schedule nightly, no middleware to harden against replays.
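
To make "field-level change tracking" concrete, here is a rough Python sketch of the general idea. It is an illustration under assumptions, not Oneprofile's implementation: `changed_fields`, `sync_record`, and the `api_calls` log are hypothetical names, and the destination is a plain dict.

```python
_MISSING = object()  # distinguishes "field absent" from "field set to None"


def changed_fields(current, incoming):
    """Return only the incoming fields whose values differ from the current record."""
    return {k: v for k, v in incoming.items() if current.get(k, _MISSING) != v}


api_calls = []  # stand-in for requests sent to the destination


def sync_record(destination, external_id, incoming):
    current = destination.get(external_id, {})
    diff = changed_fields(current, incoming)
    if not diff:
        return  # nothing changed, so nothing is written
    api_calls.append((external_id, diff))
    destination.setdefault(external_id, {}).update(diff)


destination = {}
record = {"email": "ada@example.com", "lifecycle_stage": "customer"}
sync_record(destination, "cus_001", record)
sync_record(destination, "cus_001", record)  # replay: no diff, no call
sync_record(destination, "cus_001", {**record, "lifecycle_stage": "churned"})
print(len(api_calls), api_calls[-1][1])  # 2 {'lifecycle_stage': 'churned'}
```

Because the replay produces an empty diff, no write reaches the destination, so an automation wired to a field change has nothing to fire on.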

The honest scope: this works because every Oneprofile integration is a record-based connector with stable IDs. If you wire up a generic webhook source where the payload doesn't include a stable record ID, you're back to needing idempotency keys at the application layer. Nobody can fully solve that one. What we do remove is the case where idempotence is a configuration choice you have to make per integration. For Stripe, HubSpot, Postgres, Salesforce, Intercom, and the rest of the catalog, it's the default and you can't turn it off.

If you only take one thing from this post: a sync tool that brags about idempotence is selling you a floor as a ceiling. Ask the question backward instead. What does it cost when sync isn't idempotent? Two of every contact, two of every email, two of every line on the revenue dashboard. That's the bill you stop paying.
