Outages that cross vendor boundaries are a different problem from outages contained within a single layer. A pure carrier outage is bad. A pure cloud outage is bad. An outage where the carrier says it's the cloud, the cloud provider says it's the network, and the IT team can't see far enough into either side to disprove either claim — that's where minutes turn into hours and hours turn into days.
The pattern that wastes the most time
It looks like this. A multi-site organization loses application access at twelve locations. Users can't reach the SaaS platform they depend on. The IT team opens tickets with the carrier and the cloud provider in parallel. Both vendors come back within an hour saying the issue isn't on their side.
The carrier says their circuits are healthy and the BGP sessions are up. The cloud provider says their region is operating normally and the application is responding to test traffic. The application vendor says they're seeing connections from some of the customer's sites but not the affected ones.
Meanwhile, the locations are still down. The IT team has three vendors each saying it isn't their problem, no shared visibility into the path between them, and a clock that's already at three hours and counting.
This is the failure mode that fragmented vendor relationships produce. Every individual vendor is doing their job. Nobody is doing the job of figuring out where in the path the failure actually lives.
What a cross-vendor diagnostic actually looks like
The diagnostic starts with one principle: don't accept any vendor's assertion that the problem isn't on their side until you've verified the layers that bracket their layer. Here's the sequence we run.
Step one: confirm the symptom precisely
"The application is down" is not a symptom. It's a category. Get specific: which sites, which users, which actions, what error message, at what step. Capture timestamps. Capture screenshots. The shape of the failure tells you which layers can be ruled out and which ones can't.
If twelve sites are down and forty-eight aren't, the failure is unlikely to sit at the application or cloud layer — those layers rarely treat one site differently from another. The problem is almost certainly somewhere in the path that's different for those twelve sites.
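One way to make that split concrete is to line up the affected and unaffected sites against whatever path attributes you have on file and see what only the affected ones share. A minimal sketch in Python, with invented site names and attributes standing in for a real inventory:

```python
from collections import Counter

# Hypothetical site inventory: the names and attributes are placeholders,
# not real infrastructure. The goal is to see what the failing sites share.
sites = {
    "site-01": {"carrier": "carrier-a", "circuit": "mpls",      "down": True},
    "site-02": {"carrier": "carrier-a", "circuit": "broadband", "down": False},
    "site-03": {"carrier": "carrier-b", "circuit": "mpls",      "down": False},
    # ... the remaining sites
}

down = [name for name, attrs in sites.items() if attrs["down"]]
up = [name for name, attrs in sites.items() if not attrs["down"]]

# For each path attribute, compare its distribution across down and up sites.
for attr in ("carrier", "circuit"):
    down_counts = Counter(sites[name][attr] for name in down)
    up_counts = Counter(sites[name][attr] for name in up)
    print(f"{attr}: down={dict(down_counts)} up={dict(up_counts)}")
# If every affected site shares one value that the healthy sites don't,
# that shared element of the path is where to look first.
```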
Step two: validate the carrier path layer by layer
Don't accept "circuits are up." Get the specific data. From each affected site, can the local network reach the carrier's gateway? Can it reach the carrier's next hop? Can it reach the public internet? Can it resolve DNS? Can it complete a TCP handshake to the destination?
If the carrier has SD-WAN visibility, pull the path metrics for the affected sites. If they don't, run traceroutes from the locations themselves. If you can't access the locations, ask someone on-site to run a single command and report back.
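If you do have hands on a machine at an affected site, the checks in this step are small enough to script. The sketch below assumes Linux ping flags and placeholder addresses for the gateway, next hop, and SaaS endpoint; substitute the values from your own carrier documentation:

```python
import socket
import subprocess

# Placeholder targets for the layer-by-layer checks. These are assumptions,
# not real addresses; fill in your own gateway, next hop, and endpoint.
CHECKS = [
    ("carrier gateway", "192.0.2.1"),     # placeholder: local circuit gateway
    ("carrier next hop", "192.0.2.254"),  # placeholder: first hop past the gateway
    ("public internet", "8.8.8.8"),       # any well-known public address
]
SAAS_HOST = "app.example.com"             # placeholder SaaS endpoint
SAAS_PORT = 443

def ping(host: str) -> bool:
    """ICMP reachability via the system ping (1 probe, 2-second timeout, Linux flags)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host], capture_output=True
    ).returncode == 0

def tcp_handshake(host: str, port: int, timeout: float = 3.0) -> bool:
    """Can we complete a TCP handshake to the destination?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for label, target in CHECKS:
    print(f"{label:18} {'reachable' if ping(target) else 'UNREACHABLE'}")

try:
    addr = socket.gethostbyname(SAAS_HOST)
    print(f"{'dns resolution':18} {SAAS_HOST} -> {addr}")
except socket.gaierror:
    print(f"{'dns resolution':18} FAILED")

print(f"{'tcp handshake':18} "
      f"{'completed' if tcp_handshake(SAAS_HOST, SAAS_PORT) else 'FAILED'}")
```

The tooling isn't the point. The point is that each line of output maps to one layer, so the first failure tells you exactly where the path stops.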
The data either confirms the carrier's claim or contradicts it. Don't move past this step until one of those is true.
Step three: validate the cloud path with actual traffic, not status pages
"The region is healthy" means the region is healthy in aggregate. It doesn't mean the path from your specific carrier, through your specific peering arrangement, into your specific VPC, is healthy. Status pages publish what the cloud provider sees from their side. They don't publish what you're experiencing from yours.
From a known-working location, can you reach the cloud endpoint? From the affected locations? From a third-party network outside both the carrier and cloud? The triangulation tells you whether the failure is in the cloud provider's network, in the peering between carrier and cloud, or somewhere on the carrier side that the carrier hasn't identified yet.
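A small probe you can run unchanged from each vantage point keeps the comparison honest, because every location reports the same fields. This is a sketch, assuming a placeholder endpoint name and nothing beyond the Python standard library; swap in whatever your application actually fronts:

```python
import json
import socket
import ssl
import time

# Meant to be run unchanged from three vantage points: a known-working site,
# an affected site, and a network outside both the carrier and the cloud
# (a phone hotspot works). The endpoint name below is a placeholder.
ENDPOINT = "app.example.com"   # placeholder cloud-hosted endpoint
PORT = 443

result = {"vantage": socket.gethostname(), "endpoint": ENDPOINT}
start = time.monotonic()
try:
    addr = socket.gethostbyname(ENDPOINT)
    result["resolved_ip"] = addr
    with socket.create_connection((addr, PORT), timeout=5) as raw:
        with ssl.create_default_context().wrap_socket(
            raw, server_hostname=ENDPOINT
        ) as tls:
            result["tls_ok"] = True
            result["tls_version"] = tls.version()
except Exception as exc:  # record the failure instead of crashing
    result["error"] = f"{type(exc).__name__}: {exc}"
result["elapsed_s"] = round(time.monotonic() - start, 2)

print(json.dumps(result, indent=2))
```

Run it from all three vantage points and diff the output. A DNS failure only at the affected sites points at carrier-side resolution, a connection timeout only at the affected sites points at the path or the peering, and a clean handshake everywhere pushes the investigation to step four.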
Step four: validate the application layer separately from the network
If the network path is healthy end-to-end and the application is still failing, the failure is at the application layer. This is where most diagnostics stop too early — the network is exonerated, the cloud is exonerated, and the team starts looking at firewall rules or DNS records that haven't changed in six months.
Get authentication logs. Get application logs. Get the timestamps of the last successful transaction from each affected site versus the unaffected sites. The pattern almost always tells you something.
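If the application vendor can export transactions, even a crude comparison of last-success timestamps per site is revealing. A sketch, assuming a CSV export with hypothetical site, timestamp, and status columns rather than any real vendor schema:

```python
import csv
from datetime import datetime

# Assumed export format: one row per transaction with "site", "timestamp"
# (ISO 8601), and "status" columns. File name and columns are placeholders.
AFFECTED = {"site-01", "site-04", "site-07"}   # placeholder site identifiers

last_success: dict[str, datetime] = {}
with open("app_transactions.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        if row["status"] != "success":
            continue
        ts = datetime.fromisoformat(row["timestamp"])
        site = row["site"]
        if site not in last_success or ts > last_success[site]:
            last_success[site] = ts

for site, ts in sorted(last_success.items(), key=lambda kv: kv[1]):
    flag = "AFFECTED" if site in AFFECTED else ""
    print(f"{site:10} last success {ts.isoformat()} {flag}")
# If every affected site stopped succeeding at the same minute, look for a
# change that landed at that moment. If they stopped at different times,
# the failures probably have different causes.
```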
Why this is hard to do without a platform
Each step above requires data that lives in a different vendor's system. The carrier has the path data. The cloud provider has the cloud-side path data. The application vendor has the application logs. The internal team has the local network data.
If those four data sources don't come together in one place, the diagnostic has to be run by phone calls, screenshots, and email threads. That's the friction that turns three-hour outages into eighteen-hour outages.
The platform answer isn't magic. It's a single team with read access to all four layers and the relationships with each vendor to escalate when the data conflicts. The diagnostic moves from "whose problem is this" to "where in the path is the failure," which is a different question that produces an answer faster.
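The mechanical version of "one place" doesn't have to be sophisticated. Even normalizing the four feeds into a single sorted timeline, as in the toy sketch below with invented events, changes the conversation:

```python
from datetime import datetime

# Toy illustration only: the events below are invented examples, not real
# vendor output. Each feed is reduced to (timestamp, source, message).
events = [
    (datetime(2024, 3, 1, 9, 2), "carrier", "BGP session flap on edge router"),
    (datetime(2024, 3, 1, 9, 5), "local",   "site-04 loses reachability to gateway"),
    (datetime(2024, 3, 1, 9, 6), "app",     "last successful login from site-04"),
    (datetime(2024, 3, 1, 9, 40), "cloud",  "region status page: all green"),
]

for ts, source, message in sorted(events):
    print(f"{ts:%H:%M}  {source:8}  {message}")
# The ordering alone often settles the argument: here the carrier-side flap
# precedes the local and application symptoms, which no single vendor's view
# would have shown on its own.
```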
The shorter version
Outages aren't solved by vendors who confirm the problem isn't theirs. They're solved by someone who can read all the layers and trace the failure to whichever layer it's actually in. When that someone exists, outages compress from days to hours. When they don't, every vendor is innocent and the lights are still off.
That's the work. It isn't a slide. It's the difference between a partner you call when something is wrong and a vendor you can't reach until they've already cleared themselves.