Last month we shipped sync alerts, and it was one of those features where the internal reaction was basically "how did we not have this already." We'd been looking at sync run logs manually. Clicking into individual runs to check if something failed. Not on any schedule, just whenever someone remembered to look. We built data sync alerting because we got tired of finding out about failures hours after they happened, and I suspect most teams running data sync pipelines are in the same spot.
The weird thing about data sync alerting is that the infrastructure around it is almost always overkill or nonexistent. Either you have nothing and find out about sync failures when a sales rep complains, or you've stitched together a Grafana dashboard polling a log connector that writes to your warehouse so you can build charts that nobody actually watches. There's very little middle ground.
Why silent sync failures cost more than downtime
When an API goes down, you know. Your monitoring fires, customers report it, the team scrambles. The damage window is small because the problem is loud.
Sync failures are the opposite. A sync from Stripe to your CRM fails at 3 AM because Stripe rate-limited the export. No error in your CRM. No error in Stripe. The sync tool logged it somewhere, but nobody checks that log at 3 AM. By 9 AM, your support team is looking at a customer whose subscription renewed yesterday but whose CRM record still says "trial." They offer a discount to someone who already pays full price.
That support interaction costs real money and real trust. Multiply it across a team of reps working stale records all morning, and the cost of a single silent failure compounds fast.
The insidious part: nobody traces the bad interaction back to the sync failure. The rep blames the CRM. The ops person blames the rep for not checking Stripe directly. The actual root cause sits in a sync run log that nobody opened. Without sync failure alerts, the feedback loop between "something broke" and "someone notices" stretches from minutes to hours to days.
Downtime is a fire. Silent sync failure is carbon monoxide.
Five data sync alerting conditions worth monitoring
Not everything is worth an alert. Alert fatigue is real, and the moment your team starts ignoring Slack notifications from your sync tool, you've lost the entire system. These five conditions cover the failure modes that actually cause operational harm.
Run failure is the baseline. If a sync run fails entirely, you need to know within minutes, not the next time someone opens the dashboard. This covers API auth expiration, destination downtime, and network failures. Any sync tool that doesn't alert on run failure is asking you to poll a status page manually.
Error rate spikes catch partial failures. A sync run can succeed overall while rejecting 40% of its records due to field validation errors or missing required properties. If your normal error rate is 0.5% and it jumps to 15%, something changed in the source data or the destination schema. This is where thresholds matter: alert on deviation from baseline, not on any individual error.
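A minimal sketch of that deviation check, assuming run stats are available as simple records (the `SyncRun` shape and the threshold value are illustrative, not any particular tool's API):

```python
from dataclasses import dataclass

@dataclass
class SyncRun:
    total_records: int
    failed_records: int

def error_rate(run: SyncRun) -> float:
    """Fraction of records the run rejected."""
    if run.total_records == 0:
        return 0.0
    return run.failed_records / run.total_records

def error_rate_spiked(current: SyncRun, history: list[SyncRun],
                      min_jump: float = 0.05) -> bool:
    """Alert on deviation from baseline, not on any individual error.

    Fires when the current run's error rate exceeds the average of
    recent runs by more than `min_jump` (5 percentage points here).
    """
    if not history:
        return False
    baseline = sum(error_rate(r) for r in history) / len(history)
    return error_rate(current) - baseline > min_jump
```

A run at 15% against a 0.5% baseline trips the alert; a run at 0.7% doesn't, even though individual records failed.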
Duration anomalies surface performance degradation before it becomes a failure. If a sync that normally takes 3 minutes suddenly takes 45, the destination API is probably throttling you, or the source dataset grew unexpectedly. You want to investigate before the next run times out entirely.
Missed schedules catch the failure mode that log-based monitoring misses completely. If a sync is supposed to run every 15 minutes and it doesn't run at all, there's no log entry to find. No run means no log. The only way to catch this is a system that knows when a sync was expected and flags when it doesn't happen.
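Because there's no log entry to scan for, the check has to be purely time-based. A sketch, assuming a fixed-interval schedule and a grace period for normal jitter (both parameter names are hypothetical):

```python
from datetime import datetime, timedelta

def schedule_missed(last_run_at: datetime,
                    interval: timedelta,
                    grace: timedelta = timedelta(minutes=5),
                    now: datetime | None = None) -> bool:
    """Flag a sync whose next expected run never started.

    No run means no log, so the only signal is the clock: the last
    run plus the interval (plus some grace) is already in the past.
    """
    now = now or datetime.utcnow()
    expected_by = last_run_at + interval + grace
    return now > expected_by

# A sync scheduled every 15 minutes whose last run was an hour ago:
schedule_missed(datetime.utcnow() - timedelta(hours=1),
                timedelta(minutes=15))  # True -> fire the alert
```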
Profile count deviation is the one most teams overlook. If you normally sync 50,000 profiles and a run processes 12, something is wrong with the source query or the source data itself. A percentage-based threshold against the rolling average catches data truncation, accidental filters, and upstream schema changes.
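Profile count deviation (and duration anomalies, which follow the same pattern) reduces to a percentage threshold against a rolling average. A sketch, with the 50% threshold chosen arbitrarily for illustration:

```python
def deviates_from_baseline(current: float, recent: list[float],
                           max_deviation: float = 0.5) -> bool:
    """Flag values far from the rolling average of recent runs.

    Works for profile counts (12 vs. a usual 50,000) and for run
    durations (45 minutes vs. a usual 3) alike: anything more than
    `max_deviation` (50%) away from the average is suspicious.
    """
    if not recent:
        return False
    average = sum(recent) / len(recent)
    if average == 0:
        return current != 0
    return abs(current - average) / average > max_deviation

deviates_from_baseline(12, [50_000, 49_500, 50_200])  # True
```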
Here's how these map to the damage they prevent:
| Condition | What it catches | Damage if missed |
|---|---|---|
| Run failure | Auth expiration, API outages | Hours of completely stale data |
| Error rate spike | Schema drift, field validation | Partial data gaps across records |
| Duration anomaly | Rate limiting, data volume growth | Future run timeouts and cascading delays |
| Missed schedule | Scheduler bugs, infrastructure issues | Undetectable data staleness |
| Profile count deviation | Source query errors, upstream changes | Sync runs successfully on wrong data |
How to set up data sync alerts without a monitoring stack
The standard approach in the data integration world goes something like this: enable a platform log connector, route logs into your warehouse, build dbt models to aggregate sync statistics, create a BI dashboard, then set up alerts on the dashboard metrics. I've seen this architecture described in vendor blog posts as if it's a reasonable thing to ask of a team that just wants to know when their CRM data goes stale.
That's five moving parts to answer the question "did my sync fail?" And every one of those parts is itself something that can fail silently.
The better approach is alerting built into the sync tool itself. No intermediate warehouse, no log connector, no dashboard you have to remember to check. The sync engine knows when a run failed, what the error rate was, how long it took, and whether it ran at all. It should be able to tell you directly.
What that looks like in practice:
1. You define a rule: "alert me when error rate exceeds 5% on any sync."
2. You pick a destination: Slack channel, email, in-app notification.
3. The sync engine evaluates the condition after every run and fires the alert if the threshold is crossed.
Three configuration steps. No warehouse in the middle. No dbt model. No BI tool.
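Tying those three steps together, here's a sketch of what the rule-plus-destination shape could look like, using a Slack incoming webhook as the destination. The rule fields and webhook URL are illustrative, not Oneprofile's actual API:

```python
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class AlertRule:
    sync_name: str      # which sync this rule watches ("any" for all)
    condition: str      # e.g. "error_rate"
    threshold: float    # e.g. 0.05 for "error rate exceeds 5%"
    slack_webhook: str  # where the alert goes

def evaluate_after_run(rule: AlertRule, observed: float) -> None:
    """Run by the sync engine after every run: no warehouse,
    no dbt model, no BI tool in the middle."""
    if observed <= rule.threshold:
        return
    message = (f"Sync alert: {rule.condition} on '{rule.sync_name}' "
               f"is {observed:.1%}, above the {rule.threshold:.1%} threshold.")
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        rule.slack_webhook, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # posts the message to Slack

# "Alert me when error rate exceeds 5% on any sync":
rule = AlertRule("any", "error_rate", 0.05,
                 "https://hooks.slack.com/services/T000/B000/XXXX")
# evaluate_after_run(rule, observed=0.15)  # would post to Slack
```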
I'm not arguing against data pipeline observability platforms for teams that need them. If you're managing 200 pipelines across multiple orchestrators, Datadog or Monte Carlo earn their keep. But if you're running 5-20 sync configurations between SaaS tools, maintaining an external data sync monitoring stack is probably more work than the syncs themselves.
Data sync alerting vs. logging: what gets you to the problem faster
Logging and alerting solve different problems, and most sync tools only give you the first one.
Logs are an audit trail. They tell you what happened after you already know something is wrong. You open the sync run history, find the failed run, expand the error details, and read the message. Logs are essential for diagnosis but useless for detection.
Alerting is detection. It tells you something is wrong before anyone notices the downstream effects. The support rep hasn't opened the stale CRM record yet. The marketing automation hasn't sent the wrong email yet. You have a window to fix the problem before it causes damage.
Most data integration tools give you excellent logging and zero alerting. Run history, error details, per-record outcomes, sometimes even per-record data sync error handling with retry queues. All of which is great when you're debugging a problem you already know about.
The gap is the detection layer. Who's watching the logs? In my experience, nobody is, unless you've built a dedicated monitoring pipeline to do it. And that monitoring pipeline is itself a data pipeline that can fail.
There's something circular about building a data pipeline to monitor your data pipeline. At Oneprofile, we wanted to avoid that. When we built sync alerts, the design goal was: the sync engine watches itself. Five condition types, configurable thresholds, Slack and in-app destinations. No external dependencies.
The practical difference: with logging alone, you find out about a sync failure when the CRM data is wrong. With alerting, you find out about a sync failure when the sync fails.
What should I alert on for data sync?
The five conditions covered above: run failures, error rate spikes, duration anomalies, missed schedules, and profile count deviations. They catch the failure modes that cause operational harm without drowning your team in notifications.
Is data sync monitoring the same as data observability?
No. Data observability platforms like Monte Carlo are built for teams managing hundreds of pipelines across multiple orchestrators. Sync monitoring is narrower: the sync engine watching its own runs and flagging the conditions above.
Do I need Datadog or PagerDuty for sync alerts?
Not if you're running 5-20 sync configurations between SaaS tools. Alerting built into the sync engine answers "did my sync fail?" without an external monitoring stack. At hundreds of pipelines, dedicated platforms start to earn their keep.
How quickly should sync failure alerts fire?
Within minutes of the run, not the next time someone opens a dashboard. When the sync engine evaluates conditions after every run, detection latency is bounded by run frequency rather than by when a human checks the logs.
