Why One Bug Caused Widespread Internet Chaos

Anybody familiar with the concept of the butterfly effect is likely to have some sympathy for the Amazon Web Services engineer who inadvertently caused an Internet meltdown on February 28, 2017.

The cloud-based infrastructure provider suffered an hours-long outage that crippled a significant proportion of the Internet, stopping a number of high-profile websites and apps in their tracks. The outage was centered in the Northern Virginia region and left AWS unable to service requests for around five hours.

Although there was no evidence that this was linked to a malicious attack, people (unsurprisingly) wanted to know what was going on.

And it turns out that human error was to blame. To be more specific, a member of the Simple Storage Service (S3) team entered a command incorrectly while debugging a problem that was slowing down the S3 billing process.

Yes, a typo brought parts of the Internet to their knees.

According to Amazon Web Services, the debugging was only supposed to affect a limited number of servers. Instead, the incorrect command took out a larger set of servers than intended in US-EAST-1, a huge region that is also Amazon’s oldest, ZDNet reported. The removed servers also supported two other S3 subsystems, the index subsystem and the placement subsystem. Both of these subsystems required a full restart and safety checks that took “longer than expected.”

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” AWS said in a statement. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The full AWS explanation for the outage can be read here.

Never Hurts To Plan For An Unexpected Event

The outage lasted only a few hours, but it highlights how far the ripples can spread when human error comes into play.

AWS said that it builds its systems with the assumption that things might fail but admitted that it had not restarted either of these subsystems in its larger regions for years. S3 has experienced massive growth, which means that any outage or bug is always going to have a significant effect on the digital world.

To be fair to AWS, it has already made several changes to its operational practices as a result of this unforeseen event.

These changes include modifying the tools used to remove capacity and adding safeguards to ensure that other subsystems are not affected by an incorrect input. AWS has also begun auditing its other operational tools, working to speed up recovery time, splitting services into small partitions or cells, and reducing the dependency that the AWS Service Health Dashboard (which was also taken out by the incorrect command) has on S3.
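
To make the idea of a capacity-removal safeguard concrete, here is a minimal Python sketch. It is not AWS's tooling: the `Subsystem` class, the `plan_removal` function, the `min_required` floor and the `max_batch` limit are all hypothetical names chosen for illustration. The sketch assumes a safeguard that rejects any request that would leave a subsystem below a minimum server count, and that caps how many servers can be removed in a single step.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would leave a subsystem under-provisioned."""


@dataclass
class Subsystem:
    # All fields are hypothetical; a real system would track far more state.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot serve traffic


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> int:
    """Return how many servers may be removed in this step, or raise.

    The request is rejected outright if it would drop the subsystem below
    its minimum. Otherwise the removal is throttled to at most max_batch
    servers, so a mistyped input cannot take out a large set at once.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    return min(requested, max_batch)
```

With a subsystem of 100 servers and a floor of 80, a request to remove 3 servers is allowed in full, a request for 20 is throttled to the first 5, and a request for 21 is refused before anything is touched. The point of the design is that the check runs against the input itself, so an operator's typo fails loudly instead of silently removing capacity.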

AWS has apologized for the impact that this event had on its thousands of customers, and it acknowledged that one small bug caused a whole heap of problems. With that in mind, this widespread outage caused by human error should be a wake-up call for any company that tests for bugs only infrequently.

Published: March 3, 2017