AWS Outage Points to High Risk of Single Cloud

AWS, the world’s largest cloud infrastructure service, on Monday suffered a technical glitch affecting websites and businesses worldwide, along with millions of users. The incident raises questions about the reliability and security of hyperscaler cloud services.

The outage, discovered early Monday morning ET, reportedly took down 113 AWS services, knocking out online access to a multitude of high-profile customers, including Snapchat, Disney+, Reddit, the Wall Street Journal, ChatGPT, Signal, Coinbase, McDonald’s, United Airlines, and Canva, among many others. Despite AWS declaring that it solved the issue earlier today, its network continues to be troubled by performance issues as of this midday writing.

A Compounded Problem

There has been no simple fix for the AWS outage. Initially, AWS described the problem as “related to DNS [Domain Name System] resolution of the DynamoDB API endpoint in US-EAST-1 [a key region in AWS’s network].” That DNS issue was resolved a couple of hours later, by which time other, related problems had surfaced.
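To make the DNS failure mode concrete, here is a minimal Python sketch of how a client might detect that an endpoint hostname no longer resolves and try an alternative. The function names and the injectable `resolver` parameter are illustrative assumptions, not anything AWS has published, and the sketch says nothing about how AWS's own systems handled the incident.

```python
import socket


def resolve_endpoint(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if DNS resolution fails."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved; the client cannot reach the service by name.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


def first_resolvable(hostnames, resolver=socket.getaddrinfo):
    """Return the first hostname in the list that resolves, or None if none do."""
    for name in hostnames:
        if resolve_endpoint(name, resolver=resolver):
            return name
    return None
```

A caller might pass a list such as `["dynamodb.us-east-1.amazonaws.com", "dynamodb.us-west-2.amazonaws.com"]` to `first_resolvable`. That choice of fallback is an assumption for illustration only: switching regions helps only if the application's data and configuration are also available there.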

AWS posted that “requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates.” EC2 is, of course, AWS’s Elastic Compute Cloud, which provides scalable virtual servers for customer application deployment.

The EC2 launch problem was still being tackled when, a couple of hours later, AWS reported: “We have confirmed multiple AWS services experienced network connectivity issues in the US-EAST-1 Region.” The connectivity problem, seemingly related to the EC2 issue, was traced to “an underlying internal subsystem responsible for monitoring the health of our network load balancers.” Mitigation efforts were ongoing as of this writing.

A Datacenter Under a Cloud

US-EAST-1, located in northern Virginia, was spotlighted unfavorably in the news. This particular facility, dedicated to AWS, has been foundational for its U.S. services since at least 2006. Notably, that datacenter also was involved in a couple of other AWS outages in 2020 and 2021.

US-EAST-1 is key to AWS’s U.S. availability zones, which are networked datacenters within a region. An availability zone can span one or more physical datacenters, or it can be made up of virtual resources within datacenters. US-EAST-1 hosts six availability zones, with another under construction for 2026 release.

A number of posters on Reddit complained that AWS depends heavily on US-EAST-1 for its core architecture, as do many AWS customers. And the datacenter has become notorious. “It’s always been like this,” wrote one. “Every time we’ve seen a huge outage like this, it’s been us-east-1.”

A Multi-Faceted Cluster of Issues

While the US-EAST-1 datacenter may or may not have flaws in its management, AWS’s cluster of issues also reflects the interdependence of applications and services in cloud-based environments based on microservices. That these complex hyperscaler architectures can be thrown off by a malfunction in a DNS server or load balancer shows how interconnected the cloud functions are and how difficult it can be to recover services once a problem occurs. In domino fashion, one service outage can quickly escalate to other services, and one compromised location can affect thousands of sites and millions of users, as happened today.
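The domino effect can be sketched as a walk over a service dependency graph. The Python example below is a simplified illustration: the service names and the dependency map are hypothetical, loosely modeled on the sequence AWS described, and do not represent AWS's actual internal architecture.

```python
from collections import deque


def affected_services(dependents, failed):
    """Return every service impacted when `failed` goes down.

    `dependents` maps a service to the services that depend on it directly.
    A breadth-first walk collects the direct and indirect dependents.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical dependency map, for illustration only.
DEPENDENTS = {
    "dns": ["database-api"],
    "database-api": ["instance-launch", "customer-app-a"],
    "instance-launch": ["container-service"],
    "load-balancer-health": ["customer-app-b"],
}
```

In this toy map, calling `affected_services(DEPENDENTS, "dns")` returns the DNS service plus the four services downstream of it, while the unrelated `load-balancer-health` branch is untouched. Real cloud dependency graphs are far larger and often circular, which is part of why recovery is slow.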

More customers, increased use of AWS APIs, and more reliance on AWS services all mean more exposure when something goes awry. Fixing the problems also becomes more complicated. And users must rely on AWS to fix things. Backup to another cloud is costly and complicated, which has fed a trend away from multicloud services and toward hybrid cloud architectures with substantial on-premises infrastructure and some repatriation of cloud services to on-prem.

Of course, there will be customer backlash from the outage, which has cost businesses and consumers losses that could range “in the hundreds of billions” of dollars, according to one CNN source. AWS protects itself with service level agreements that stipulate the amount of uptime it is responsible for providing to each customer, and contracts vary. Still, most of the compensation will probably be in the form of service credits.
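To see what an uptime commitment means in practice, the arithmetic can be sketched in a few lines of Python. The percentages used here are illustrative assumptions, not the terms of any specific AWS contract, and the function name is hypothetical.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Return the downtime budget, in minutes, implied by an uptime percentage.

    Assumes a billing period of `days` full days (30 by default).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100
```

Under these assumptions, a 99.99% commitment over a 30-day month allows roughly 4.3 minutes of downtime, and 99.9% allows about 43 minutes. An outage lasting many hours exceeds either budget by a wide margin, which is when credit clauses typically come into play.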

Futuriom Take: The recent massive AWS outage highlights the complex, fragile construction of cloud services and the risks involved in relying on them. The specific problems AWS encountered point to the potential need for changes to its infrastructure to ensure future reliability.