Identity Field Sync: Stop Duplicate Records

Identity Field Sync: Stop Duplicate Records

Identity Field Sync: Stop Duplicate Records

Photo of Utku Zihnioglu

Utku Zihnioglu

CEO & Co-founder

You map email as the identity field, save the sync, and walk away. The next morning your CRM has 14,000 contacts instead of 10,000. Four thousand of them are blank shells with a creation date and nothing else. Nobody touched the CRM overnight. The sync did this, and it did exactly what you told it to.

Here's what happened. Of your 10,000 Stripe customers, 4,000 had a blank email field. Plenty of B2B accounts get created from an invoice or a manual entry where nobody filled in an email. When the sync hit those 4,000 records, it had no value to match on. So it followed its rule. No match found, create a new record. Four thousand empty contacts, one per blank-email customer.

This is the failure mode nobody writes about. Search for identity resolution and you'll get a thousand articles arguing deterministic versus probabilistic matching, identity graphs, customer 360. Useful eventually. But none of them mention the thing that actually breaks a small team's first identity field sync: the field you picked is only filled on half your records.

How a bad identity field sync turns into a duplicate factory

The identity field is the value a sync uses to decide whether a source record is the same customer as an existing destination record. Email, customer ID, phone number. When two records share that value, the sync updates the existing record. When it finds no match, behavior depends on your sync mode. In "update or create" mode, no match means create.

That last sentence is where the trouble lives. A blank identity value never matches anything, because blank does not equal blank in matching logic. The sync treats every blank-email record as a brand-new customer it has never seen. Run it twice and the blanks duplicate again, because the second run also can't match them to the empties it created on the first run.

So the math compounds. A field that's missing on 40% of records doesn't cause 40% trouble. It causes a fresh batch of orphans on every run, and each orphan is invisible until someone opens the CRM and wonders why half the contact list has no name.

The people who get burned by this aren't careless. Email is the obvious choice. It's the canonical identity field in every example, every tutorial, every competitor's docs. The problem isn't the choice of email as a concept. The problem is that nobody checked whether email was actually present on the source records before wiring it up as the match key. This is the most common of the field mapping mistakes I see, and it has nothing to do with picking the wrong type of field. It's about coverage.

Why match key coverage beats the field's data type

When teams pick an identity field, they reason about it the way the textbooks tell them to. Is it unique? Is it stable? Is it the same across both systems? All good questions. All beside the point if the field is empty.

Coverage is the percentage of source records where the field has a value. A field can be perfectly unique, perfectly stable, and a terrible match key because it's only filled 55% of the time. The other 45% will either duplicate or get skipped, and which one depends on your sync mode in ways that are easy to get wrong.

Think about it from the destination's side. The destination doesn't care that email is theoretically the right identity. It cares whether the incoming record has a value it can match against. A high-coverage field that's slightly less "correct" produces a cleaner sync than a perfect-on-paper field that's blank on a third of your rows. This is the part that surprises people: in a real customer data sync, identity is decided by what's filled in, not by what's semantically ideal.

A quick way to think about the tradeoff:

  • High coverage, decent uniqueness: Usually your best match key. Customer ID from your billing system, filled on 100% of rows.

  • Perfect uniqueness, low coverage: A trap. Email or phone that's blank on a big slice of records.

  • High coverage, low uniqueness: Risky for a different reason. A field like "company domain" might be filled everywhere but shared across many contacts, which merges records that shouldn't merge.

The deterministic matching everyone debates only works when the deterministic key is present. Record linkage, the academic name for this whole problem, has been studied for decades, and the literature is full of clever algorithms for fuzzy cases. For most small teams, you never reach the clever part. You just need a key that's actually there.

How to read source field fill rates before an identity field sync

Before you save any mapping, look at the fill rate of the field you want to use as the identity. This takes two minutes and saves you a cleanup job.

If your source is a database, it's one query. For a customers table where you're considering email as the match key:

SELECT
  count(*) AS total,
  count(email) AS filled,
  round(100.0 * count(email) / count(*), 1) AS pct_filled
FROM customers;
SELECT
  count(*) AS total,
  count(email) AS filled,
  round(100.0 * count(email) / count(*), 1) AS pct_filled
FROM customers;
SELECT
  count(*) AS total,
  count(email) AS filled,
  round(100.0 * count(email) / count(*), 1) AS pct_filled
FROM customers;

count(email) ignores nulls while count(*) counts every row, so the ratio is your coverage. (Note that a field storing empty strings instead of nulls will fool this. An empty string is not null, so count() will treat it as filled even though it's useless as a match key. Check for '' too if your source is sloppy about that.)

If your source is a SaaS tool, you usually can't run SQL against it, so you read field statistics from the API or the tool's own field-usage view. A lot of CRMs and billing tools expose how many records have a given field populated. The number you want is the same: of all records, how many have a non-blank value in this field.

What counts as good coverage? My rough rule: above 95% and you're fine, 80 to 95% means you'll get some duplicates and should decide whether you can live with them, below 80% and you should pick a different identity field. These aren't hard lines. A field at 88% might be perfectly acceptable if the missing records are test accounts you don't care about. Run the count first, then make the call with real numbers instead of a guess.

When email or user ID is the wrong identity field sync key

Email feels like the universal identity, and for consumer products with mandatory signup it usually is. But there are whole categories of source data where the obvious field is the wrong one.

Billing systems are the classic example. A Stripe customer can be created from a checkout (email present) or from an invoice your finance team raised manually (email often blank). If you sync Stripe to your CRM keyed on email, the invoice-created customers all duplicate. The better identity field is the Stripe customer ID, which exists on every customer object because Stripe assigns it. It's less human-friendly, but it's filled 100% of the time, which is the only property that matters for a match key sync.

Databases have the opposite trap. Your users table has a primary key that's filled on every row, so coverage is perfect. But if the destination tool has no place to store your internal ID, that key is useless as a cross-system identity. You need a field that exists and is filled on both sides. Sometimes that's email after all, sometimes it's an external ID column you added for exactly this reason.

I'll be honest about where this gets hard. If your source genuinely has no high-coverage field that the destination also understands, no amount of checking fixes that. You either backfill the missing identifier upstream or accept some duplicates and dedupe later. There's no trick that conjures a match key out of data that doesn't have one. That's the case where the heavyweight identity-resolution machinery the enterprise vendors sell actually earns its keep, and I won't pretend a direct sync replaces it.

Catching low-coverage fields before the identity field sync runs

Everything above is a manual checklist, and manual checklists get skipped at 5pm on a Friday. The real fix is for the tool to refuse to let you make the mistake.

This is the part we built Oneprofile around, because we kept watching people get burned by it. When you pick an identity field in the mapping form, Oneprofile reads the actual field statistics from the source and shows you the coverage right there. Pick a field that's filled on 60% of records and the form tells you 40% of records will fail to match, suggests a higher-coverage field from the same source schema, and makes you confirm before it'll save. The duplicate factory never gets a chance to run, because the form caught it at mapping time instead of you catching it at the duplicate-cleanup stage three days later.

We did this for a selfish reason. Cleaning up duplicate records is the worst kind of support ticket. By the time someone notices, downstream tools have already emailed the empty contacts, sales has already assigned them, and untangling it means reconciling state across several systems. Stopping it at the form is cheaper for everyone, us included.

That's about all I want to say about the product. The point of this post isn't the feature. The point is that identity in a sync is an engineering question disguised as a data-modeling question. Before you argue about deterministic versus probabilistic, before you draw an identity graph, run one count query against the field you're about to map. If it comes back at 62%, you've just saved yourself a very bad morning.

What is an identity field in a data sync?

Why does my sync create duplicate records?

Should I always use email as the match key?

How do I check identity field coverage before mapping?

Can I fix duplicate records after a sync runs?

Ready to get started?

No credit card required

Free 100k syncs every month