You find out your Stripe integration broke three days ago because a support rep tells you the CRM shows the wrong plan for a customer. You check the sync logs. Every run for the past 72 hours errored on authentication. The OAuth token expired on Friday. Nobody got notified. Three days of subscription changes, cancellations, and upgrades never reached HubSpot. This is the gap that integration monitoring is supposed to close.
Most teams only find out about broken connections after the damage is done. Not through an alert. Through a person downstream who trusted the data and got burned. The problem is that most teams think of monitoring as watching sync runs, not watching the connections underneath them.
For how this fits into the broader picture of keeping your data accurate and traceable, the customer data governance guide covers the audit trail and change attribution side.
Why integrations fail silently: expired tokens, revoked keys, and rotated credentials
The failure modes are boring and predictable. That is exactly why they catch people off guard.
OAuth 2.0 tokens expire. Every provider sets different lifetimes. Some tokens last an hour. Others last 90 days. A few last indefinitely until someone rotates them. When a token expires, the API returns a 401, and every sync config using that connection fails on its next run. If the sync runs every 15 minutes, you get a failure within 15 minutes. If it runs daily, the failure waits until midnight and you find out Monday morning.
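Expiry is the one failure mode with a countdown, so it can be tracked ahead of time. The sketch below assumes the provider's token response includes an `expires_in` lifetime in seconds (the OAuth 2.0 spec recommends it, but not every provider sends it). The function names and the seven-day warning window are illustrative choices, not any provider's API.

```python
from datetime import datetime, timedelta, timezone


def expires_at_from_grant(issued_at: datetime, expires_in_seconds: int) -> datetime:
    """Turn the `expires_in` lifetime from a token response into an absolute time."""
    return issued_at + timedelta(seconds=expires_in_seconds)


def expiry_status(
    expires_at: datetime,
    now: datetime,
    warn_within: timedelta = timedelta(days=7),  # assumed warning window
) -> str:
    """Classify a token by how close it is to its known expiry time."""
    if now >= expires_at:
        return "expired"
    if expires_at - now <= warn_within:
        return "expiring_soon"
    return "ok"


# Example: a token issued at midnight UTC with a one-hour lifetime.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expiry_status(expires_at_from_grant(issued, 3600), now=issued))
```

This only covers scheduled expiry. A revoked key gives no advance warning, so a check like this complements a live probe rather than replacing it.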
API keys get revoked when an admin cleans up old credentials, when someone leaves the company and IT resets their tokens, or when the SaaS vendor rotates keys as part of a security incident response. Revoked keys produce the same 401 error as expired tokens, but without the predictability. There is no countdown timer on a revocation.
Permission scopes change more subtly. An OAuth connection that was granted read and write access gets downgraded to read-only when someone re-authorizes without checking the right boxes, a common pitfall in role-based access control models. The connection still works for reading data. Writes fail silently or with unhelpful error messages that don't surface as credential issues.
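A scope downgrade is easy to detect if you compare what the connection was granted against what the integration needs. This is a minimal sketch, assuming the provider reports granted scopes as a space-delimited string (the OAuth 2.0 convention; some providers use commas instead). The scope names are placeholders, not real provider scopes.

```python
def missing_scopes(required: set[str], granted_scope_string: str) -> set[str]:
    """Return the required scopes that are absent from the granted scope string."""
    granted = set(granted_scope_string.split())
    return required - granted


# Hypothetical scope names for a CRM connection that must read and write contacts.
required = {"contacts.read", "contacts.write"}

# After a careless re-authorization, only read access was granted.
print(missing_scopes(required, "contacts.read"))
```

An empty result means the connection still has everything it needs. A non-empty result names exactly which permission was lost, which is more useful than a vague write error hours later.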
What makes all of this dangerous is the gap between the failure and the discovery. A sync run failing is noisy. An error shows up in logs, a dashboard turns red, maybe an alert fires. But the gap between the credential breaking and the first sync run attempt can be hours or days. During that gap, the integration looks healthy to everyone because nothing has tried to use it yet.
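The size of that gap is simple arithmetic: the credential breaks at some moment, and nothing notices until the next scheduled run. A rough sketch, assuming fixed-interval sync runs and a hypothetical helper name:

```python
from datetime import datetime, timedelta


def first_failed_run(broke_at: datetime, first_run: datetime, interval: timedelta) -> datetime:
    """Return the first scheduled run at or after the moment the credential broke."""
    if broke_at <= first_run:
        return first_run
    # Count whole intervals elapsed, rounding up to the next scheduled run.
    runs, remainder = divmod(broke_at - first_run, interval)
    if remainder:
        runs += 1
    return first_run + runs * interval


schedule_start = datetime(2024, 1, 1)   # runs anchored at midnight
broke_at = datetime(2024, 1, 5, 9, 0)   # credential breaks at 09:00

daily = first_failed_run(broke_at, schedule_start, timedelta(days=1))
print(daily - broke_at)  # silent gap before a daily sync notices
```

With a daily schedule, a credential that breaks at 09:00 is not exercised until the following midnight. With a 15-minute schedule the same break surfaces within minutes, which is why low-frequency syncs benefit most from a separate probe.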
The difference between sync monitoring and integration health monitoring
Sync monitoring tells you whether a sync run succeeded. Integration monitoring tells you whether the connection underneath those runs is alive.
These are different questions with different timelines.
| What's monitored | When you find out | What broke | Damage window |
|---|---|---|---|
| Sync runs | After a run fails | Could be data, mapping, rate limits, or credentials | Minutes to hours (per run schedule) |
| Integration health | Before a run attempts | Credentials, permissions, API availability | Zero (caught proactively) |
Most teams only have the first one. They watch for failed sync runs and investigate when something turns red. The problem is that sync run monitoring is reactive. By the time a run fails because the token expired, you have already missed one full sync cycle. If your syncs run hourly, you lost an hour of data. If they run daily, you lost a day.
Integration monitoring flips the timeline. A health check probes the connection itself, independent of any sync schedule. It authenticates against the API, confirms the token is valid, checks that required scopes are present, and reports back. If the probe fails, you know the connection is broken without waiting for a sync to hit it, and for syncs that run infrequently, before the next run even tries.
The observability article in this cluster covers the five signals every data practice should track. Integration health is a prerequisite for all five. Freshness, volume, schema, lineage, and quality all assume the connection is alive. If the connection is dead, none of them matter.
What a real integration health check actually tests
A health check is not a ping. Checking whether api.hubspot.com responds to an HTTP request tells you that HubSpot is up. It tells you nothing about whether your specific connection still works.
A meaningful health check tests three things:
Authentication. Can this credential actually authenticate? The check makes an API call using the stored token or API key and confirms it gets a successful response. A 401 means the credential is expired or revoked. A 403 might mean scopes changed. Anything other than success means the integration is degraded.
Scope validation. Does this connection have the permissions it needs? A CRM integration that needs to read contacts and write custom properties should verify both. An integration that lost write access but still has read access will look healthy until you try to sync data to it.
API availability. Is the provider's API responding within expected latency? This is less about your credentials and more about the provider's infrastructure. If HubSpot or Salesforce is experiencing degraded performance, your health check should reflect that before your next sync attempts 10,000 record writes against a struggling endpoint.
A good health check runs all three and produces a single status: healthy, warning, or error. Healthy means everything passed. Warning means the connection works but something is off (maybe latency is high or a non-critical scope is missing). Error means the integration is broken and syncs will fail.
The check itself should be lightweight. One or two API calls, not a full data pull. You want to test the connection, not simulate a sync run.
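The three tests and the single status can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: `ProbeResponse`, `HealthResult`, and `check_health` are hypothetical names, `probe` stands in for whatever one lightweight authenticated call your provider supports, and the two-second latency threshold is arbitrary. Treating a missing required scope or any non-2xx response as an error is also a choice made here.

```python
import time
from dataclasses import dataclass, field
from typing import AbstractSet, Callable


@dataclass
class ProbeResponse:
    status_code: int
    granted_scopes: AbstractSet[str]


@dataclass
class HealthResult:
    status: str  # "healthy", "warning", or "error"
    reasons: list = field(default_factory=list)


def check_health(
    probe: Callable[[], ProbeResponse],
    required_scopes: AbstractSet[str],
    optional_scopes: AbstractSet[str] = frozenset(),
    max_latency_s: float = 2.0,                 # assumed threshold
    clock: Callable[[], float] = time.monotonic,
) -> HealthResult:
    """Run one lightweight probe and reduce the outcome to a single status."""
    started = clock()
    try:
        response = probe()
    except Exception as exc:  # network failure, timeout, DNS, and so on
        return HealthResult("error", [f"API unreachable: {exc}"])
    latency = clock() - started

    # 1. Authentication
    if response.status_code == 401:
        return HealthResult("error", ["credential expired or revoked (401)"])
    if response.status_code == 403:
        return HealthResult("error", ["permission denied, scopes may have changed (403)"])
    if not 200 <= response.status_code < 300:
        return HealthResult("error", [f"unexpected status {response.status_code}"])

    # 2. Scope validation
    missing_required = required_scopes - response.granted_scopes
    if missing_required:
        return HealthResult("error", [f"missing required scopes: {sorted(missing_required)}"])

    # 3. Non-fatal findings: the connection works, but something is off
    reasons = []
    missing_optional = optional_scopes - response.granted_scopes
    if missing_optional:
        reasons.append(f"missing optional scopes: {sorted(missing_optional)}")
    if latency > max_latency_s:
        reasons.append(f"high latency: {latency:.2f}s")
    return HealthResult("warning" if reasons else "healthy", reasons)
```

Passing the probe in as a callable keeps the status logic provider-agnostic and makes it testable without network access: a fake probe that returns a 401 exercises the same path a real expired token would.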
Designing proactive integration monitoring: scheduled probes, on-demand checks, and alerting
Integration monitoring has three components, and you need all three for it to actually work.
Scheduled probes. Health checks that run automatically on a recurring schedule. Every few hours is a reasonable starting point. The probe cycle should be independent of your sync schedule. Your syncs might run every 15 minutes, but the health check doesn't need to run that often because credential failures don't happen every 15 minutes. They happen when someone rotates a key, when an OAuth token reaches its expiration window, or when an admin changes permissions. Checking every 4-6 hours catches most of these within a reasonable window.
On-demand checks. The ability to trigger a health check manually, right now. This matters when you are debugging. A sync failed and you want to know whether it is a data problem or a credential problem. Running a health check against the connection answers that question in seconds instead of reading through error logs.
Dedicated alerting. A health check that fails but doesn't alert anyone is theater. The alert needs to be specific: "Integration Health Check Failed: HubSpot production connection. Token expired." Not a generic "something went wrong" notification that gets ignored. The alert condition should be separate from sync failure alerts. You want to know that the connection broke, not just that a run using it failed. The distinction matters because fixing a broken credential is a different workflow than fixing a bad field mapping.
This three-part setup turns integration breakage from a multi-day silent failure into a same-day notification. The probe catches it. The alert tells you. The on-demand check lets you verify the fix worked.
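Two pieces of this setup are small enough to sketch: deciding which connections are due for a scheduled probe, and raising a connection-specific alert. The names below are hypothetical, the four-hour interval is one point in the range suggested above, and alerting only when a connection newly enters the error state is an assumption added here to avoid repeating the same alert on every probe.

```python
from datetime import datetime, timedelta
from typing import Optional


def due_for_probe(
    last_checked: dict,                      # integration name -> datetime or None
    now: datetime,
    interval: timedelta = timedelta(hours=4),
) -> list:
    """Return the integrations whose last health check is older than the interval."""
    due = []
    for name, checked_at in last_checked.items():
        # Never-checked connections are always due.
        if checked_at is None or now - checked_at >= interval:
            due.append(name)
    return sorted(due)


def health_alert(integration: str, previous: str, current: str, reason: str) -> Optional[str]:
    """Return an alert message when a connection newly enters the error state."""
    if current != "error" or previous == "error":
        return None  # still healthy, or already alerted on an earlier probe
    return f"Integration Health Check Failed: {integration}. {reason}"


now = datetime(2024, 1, 1, 12, 0)
last_checked = {
    "hubspot": datetime(2024, 1, 1, 7, 0),   # 5 hours ago: due
    "stripe": datetime(2024, 1, 1, 10, 0),   # 2 hours ago: not due
    "salesforce": None,                      # never checked: due
}
print(due_for_probe(last_checked, now))
print(health_alert("HubSpot production connection", "healthy", "error", "Token expired."))
```

Because `health_alert` is keyed on connection status rather than on sync run results, it stays separate from sync failure alerts, and the message tells the recipient which credential to fix.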
How to monitor integration health without building your own probe infrastructure
The DIY version of this is a script that runs on a cron job, authenticates against each provider's API, parses the response, and sends a Slack message when something fails. Teams have been doing this forever. It works until you have 15 integrations and the cron job script is 400 lines of Python that nobody wants to maintain.
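For a sense of what the DIY version looks like at its smallest, here is a sketch. The probe functions are hypothetical placeholders for real authenticated API calls, and `webhook_url` would be your own Slack incoming webhook URL. Slack incoming webhooks accept a JSON body with a `text` field, which is all this relies on.

```python
import json
import urllib.request
from typing import Callable


def run_probes(probes: dict) -> dict:
    """Run each probe callable and collect failures as {integration: reason}."""
    failures = {}
    for name, probe in probes.items():
        try:
            probe()  # a probe signals failure by raising
        except Exception as exc:
            failures[name] = str(exc)
    return failures


def slack_payload(failures: dict) -> bytes:
    """Build the JSON body for a Slack incoming webhook."""
    lines = [f"- {name}: {reason}" for name, reason in sorted(failures.items())]
    text = "Integration health check failed:\n" + "\n".join(lines)
    return json.dumps({"text": text}).encode("utf-8")


def notify_slack(webhook_url: str, failures: dict) -> None:
    """POST the failure summary to Slack. Only call this when failures is non-empty."""
    request = urllib.request.Request(
        webhook_url,
        data=slack_payload(failures),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

A cron entry would call `run_probes` with one callable per integration and pass any failures to `notify_slack`. The maintenance burden comes from the part not shown: each provider needs its own probe, its own credential storage, and its own interpretation of error responses.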
The alternative is a sync platform that treats connection health as a first-class concept, not an afterthought.
There is a broader pattern here. The data quality problem across tools is almost always a connectivity problem in disguise. Stale data, broken silos, inconsistent records: these are symptoms. The root cause is usually a broken or misconfigured connection that nobody noticed. Integration monitoring addresses the root cause directly.
Oneprofile runs health checks on every integration on a schedule and on demand. Each integration shows a status indicator: healthy, warning, or error. The check history logs every probe result, so you can see when a connection started degrading. A dedicated "Integration Health Check Failed" alert condition fires the moment a credential breaks, separate from sync run alerts. No separate monitoring vendor, no cron job to maintain, no 400-line Python script.
Whether your team builds its own probes or uses a platform that includes them, the principle is the same. Test the connection, not just the sync. Catch the credential failure before it becomes a data quality failure. Most of the silent breakage that produces stale CRM data, missed subscription updates, and confused support interactions starts with a broken credential that nobody tested.
