What caused the Amazon Web Services outage that took down a big part of the internet?
In case you have been living under a rock the past week, Amazon took down a big part of the internet and left many people without the apps and services they use every day. Tens of thousands of websites using the Amazon Web Services (AWS) cloud computing platform were down because of a typo. Yes, a typo. One incorrectly entered command sent a large part of the web into a panic.
Amazon released a letter explaining exactly what happened. You can read the full letter here; the excerpt below gets right to the point.
“We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
It took four hours and 17 minutes for the issue to be finally resolved. According to Amazon, the reason was that both of the systems that went down required a full restart, and that isn’t as simple as rebooting your laptop.
Amazon has apologized and, to make sure this never happens again, says it has changed its software tools so that its engineers cannot make the same mistake again.
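For the curious, here is a rough idea of what a safeguard like that could look like. This is only a minimal sketch in Python, not Amazon's actual tooling: the function name `plan_removal`, the server names, and the limits are all made up for illustration. The idea is that the tool checks a removal request before acting on it, and refuses if the request is too large or would leave too few servers running.

```python
class CapacityError(Exception):
    """Raised when a removal request looks unsafe."""


def plan_removal(active_servers, requested, min_required, max_per_run=2):
    """Return the servers that are safe to remove, or raise CapacityError.

    All names and limits here are hypothetical.
    active_servers: names of servers currently in service
    requested:      names the operator asked to remove
    min_required:   fewest servers the subsystem needs to keep running
    max_per_run:    cap on removals per command, so a typo cannot remove many
    """
    active = set(active_servers)
    wanted = set(requested)

    # Refuse names that are not part of the fleet (likely a typo).
    unknown = wanted - active
    if unknown:
        raise CapacityError(f"not in service: {sorted(unknown)}")

    # Refuse requests that remove too many servers at once.
    if len(wanted) > max_per_run:
        raise CapacityError(
            f"asked to remove {len(wanted)} servers, limit is {max_per_run}"
        )

    # Refuse requests that would drop the subsystem below its minimum.
    remaining = len(active) - len(wanted)
    if remaining < min_required:
        raise CapacityError(
            f"removal would leave {remaining} servers, need {min_required}"
        )

    return sorted(wanted)
```

With a made-up fleet of ten servers and a minimum of six, asking to remove two servers would go through, while a mistyped request for eight would be rejected before anything was touched.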
It is truly remarkable to think how a simple mistake affected thousands of people. Did the outage affect you in any way? Let us know in the comments!