Why One Bug Caused Widespread Internet Chaos
Anybody familiar with the concept of the butterfly effect is likely to have some sympathy for the Amazon Web Services engineer who inadvertently caused an Internet meltdown on February 28, 2017.
The cloud-based infrastructure provider suffered an hours-long outage that crippled a significant proportion of the Internet, with a plethora of high-profile websites and apps stopped in their tracks. The outage was centered in the Northern Virginia region and ensured that AWS was unable to service requests for around five hours.
Although there was no evidence that this was linked to a malicious attack, people (unsurprisingly) wanted to know what was going on.
And it turns out that human error was to blame. To be more specific, a member of the Simple Storage Service (S3) team typed in the wrong command as part of a debugging process intended to speed up the S3 billing process.
Yes, a typo brought parts of the Internet to its knees.
According to Amazon Web Services, the debugging was only supposed to affect a limited number of servers. The incorrect command not only took out a larger set of servers than intended in the US-EAST-1 datacenter—a huge location that is also Amazon’s oldest, ZDNet reported—but also removed support for two other S3 subsystems—the index subsystem and the placement subsystem. Both of these subsystems required a full restart and safety checks that took “longer than expected.”
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said AWS, in a press release. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
The full AWS explanation for the outage can be read here.
Never Hurts To Plan For An Unexpected Event
The outage only lasted for a few hours but it highlights the ripples that can be created when the human factor comes into play.
AWS said that it builds its systems with the assumption that things might fail but admitted that it had not restarted either of these subsystems in its larger regions for years. S3 has experienced massive growth, which means that any outage or bug is always going to have a significant effect on the digital world.
To be fair to AWS, it has already made several changes to its operational practices as a result of this unforeseen event.
These changes include modifying the tools used to remove capacity and adding safeguards to ensure that other subsystems are not affected by an incorrect input. AWS has also begun auditing its other operational tools to speed up recovery time, splitting services into small partitions or cells and reducing the dependency that the AWS Service Health Dashboard—which was also taken out by the incorrect command—has on S3.
AWS has apologized for the impact that this event had for its thousands of customers, but it acknowledged that one small bug caused a whole heap of problems. With that in mind, this widespread outage through human error should be a wakeup call for any companies that test for bugs on an infrequent basis.