What can we learn from the recent AWS outage, and how can we apply those lessons to our own infrastructure?
What Happened?
On October 20, 2025, AWS experienced a major disruption that rippled across the internet (and social media), affecting widely used services such as Zoom, Microsoft Teams, Slack, and Atlassian. The issue originated not in a single data center or customer workload, but in the AWS control plane, the management layer that coordinates how resources like EC2 instances, DynamoDB tables, and IAM roles operate.
The initial trigger appears to have been a DNS failure affecting the DynamoDB API endpoint in the US-EAST-1 region, compounded by a malfunction in the subsystem that monitors network load balancers. Because that health-monitoring subsystem also runs in US-EAST-1, AWS throttled new EC2 instance launches while it was restored.
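To make the trigger concrete, a minimal external probe of the affected endpoint might look like the sketch below. The hostname is the public DynamoDB endpoint for US-EAST-1; the slow-resolution threshold and the alert wording are illustrative assumptions, not AWS's internal monitoring.

```python
import socket
import time

# Public DynamoDB endpoint for US-EAST-1, the endpoint whose DNS resolution
# was reportedly affected. The threshold below is a hypothetical alerting value.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
SLOW_THRESHOLD_SECONDS = 1.0

def probe_dns(hostname: str) -> bool:
    """Return True if the hostname resolves; report how long resolution took."""
    start = time.monotonic()
    try:
        addresses = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        print(f"ALERT: DNS resolution failed for {hostname}: {exc}")
        return False
    elapsed = time.monotonic() - start
    status = "SLOW" if elapsed > SLOW_THRESHOLD_SECONDS else "ok"
    print(f"{status}: {hostname} -> {len(addresses)} address(es) in {elapsed:.3f}s")
    return True

if __name__ == "__main__":
    probe_dns(ENDPOINT)
```

Running a probe like this from outside the affected provider is one simple way to distinguish "our workload is broken" from "the endpoint itself is unreachable."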
Though AWS markets its regions as isolated clusters of data centers with independent power and cooling, this incident showed that core control-plane functions remain centralized, creating hidden dependencies through which failures can cascade globally.
Root Cause: A Single-Region Control Plane
Analysts quickly identified that US-EAST-1 hosts the shared control plane behind many of AWS's global services. Workloads running in Europe or Asia still rely on API calls that route back to or through US-EAST-1, so a failure there had global consequences.
When the region's DNS and health-check subsystems failed, those control-plane calls stalled worldwide. The result was a global slowdown in EC2 launches, configuration updates, and authentication, even though the other regions were technically "healthy."
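You cannot relocate AWS's internal control plane, but you can keep your own clients from silently inheriting that dependency. Here is a minimal boto3 sketch that pins a client to a regional endpoint and fails fast rather than hanging when an endpoint stalls; the region, endpoint URL, and timeout values are assumptions to adapt, not recommended defaults.

```python
import boto3
from botocore.config import Config

# Hypothetical settings: explicit region plus short timeouts so a stalled
# endpoint produces a fast, handleable error instead of a hung request.
regional_config = Config(
    region_name="eu-west-1",
    connect_timeout=2,
    read_timeout=5,
    retries={"max_attempts": 2, "mode": "standard"},
)

# Pin the client to the regional DynamoDB endpoint rather than relying on
# defaults that may resolve to a globally shared endpoint.
dynamodb = boto3.client(
    "dynamodb",
    config=regional_config,
    endpoint_url="https://dynamodb.eu-west-1.amazonaws.com",
)

tables = dynamodb.list_tables(Limit=10)
print(tables.get("TableNames", []))
```

Tight connect and read timeouts will not keep a dependency healthy, but they turn a worldwide stall into an error your code can catch and route around.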
AWS's own design guidance encourages customers to spread workloads across availability zones for resiliency, but those customer-facing mechanisms ultimately depend on the same centralized control plane. In other words, data-plane isolation worked as designed, but control-plane isolation did not.
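That distinction shows up directly in which API calls your request path makes. A sketch of the split in boto3 terms, with a hypothetical table name, key schema, and region:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Control-plane call: provisions resources through the shared management
# layer. Calls like this are best kept off the request path entirely.
def provision_table(name: str) -> None:
    dynamodb.create_table(
        TableName=name,
        AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )

# Data-plane call: operates on resources that already exist. Keeping the hot
# path limited to calls like this is what data-plane isolation buys you.
def read_item(table: str, key: str) -> dict:
    response = dynamodb.get_item(TableName=table, Key={"pk": {"S": key}})
    return response.get("Item", {})
```

A request path that only ever issues data-plane calls against pre-provisioned resources has far less to lose when the management layer degrades.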
This pattern has surfaced before, not just at AWS. Cloudflare, Microsoft, and Google have all suffered outages triggered by control-plane or configuration failures that propagated globally. The lesson here is that in modern distributed systems, control-plane fragility can become a single point of failure.
The Broader Pattern
AWS may be in the spotlight now, but nearly every major cloud or CDN provider (AWS, Cloudflare, Microsoft, Google) has experienced control-plane-related outages in the past five years. These are rarely caused by attacks; more often, they stem from routine operational changes, misconfigurations, or centralized service dependencies.
The October 2025 AWS outage simply demonstrates that no cloud provider is immune. The best defense is architectural: distribute risk, decouple dependencies, and design for graceful degradation.
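As one concrete flavor of graceful degradation, the sketch below wraps a control-plane-dependent lookup (configuration, feature flags, service discovery) in a last-known-good cache. The refresh interval, the error handling, and the fetcher interface are illustrative assumptions rather than a prescription.

```python
import time
from typing import Callable, Optional

class DegradingCache:
    """Serve a last-known-good value when the upstream dependency fails.

    A deliberately small sketch: the refresh interval and staleness policy
    are assumptions to adapt to your own dependencies.
    """

    def __init__(self, fetch: Callable[[], dict], refresh_seconds: float = 30.0):
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._value: Optional[dict] = None
        self._fetched_at: float = 0.0

    def get(self) -> dict:
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self._refresh_seconds:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception as exc:
                if self._value is None:
                    raise  # nothing cached yet; surface the failure
                # Degrade gracefully: keep serving the stale copy and note it.
                print(f"upstream fetch failed ({exc}); serving stale value")
        return self._value
```

With a wrapper like this around each external lookup, a stalled dependency slows refreshes instead of taking down request handling.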
