
Thursday, March 2, 2017

AWS says a typo caused the massive S3 failure this week

Everybody makes mistakes. But at the scale of Amazon Web Services, one incorrect command can cause a massive failure that paralyzes popular sites and services. That is apparently what happened earlier this week, when the Amazon Simple Storage Service (S3) in the provider's Northern Virginia region suffered an 11-hour system failure.

Other Amazon services in the US-EAST-1 region that rely on S3, such as Elastic Block Store, Lambda, and new instance launches for the Elastic Compute Cloud, were also impacted by the disruption.

AWS apologized for the incident in a postmortem published Thursday. The outage affected the likes of Netflix, Reddit, Imgur, and Adobe. More than half of the top 100 online retail sites experienced slower load times during the failure, according to the website monitoring service Apica.

Here is what caused the failure, and what Amazon plans to do about it:

According to Amazon, an authorized S3 team member executed a command that was supposed to remove a small number of servers from one of the S3 subsystems used by the billing process, in response to that billing process running more slowly than expected. One of the command's parameters was entered incorrectly, and the command removed a large number of servers supporting two critical S3 subsystems.
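The danger of this kind of operational command is that a single mistyped argument changes its scope entirely. A minimal hypothetical sketch of the failure mode (AWS has not published its internal tooling; `remove_capacity`, the fleet list, and the parameter shape are all invented for illustration):

```python
def remove_capacity(servers, count):
    """Mark the first `count` servers for removal and return both groups.

    Hypothetical stand-in for an internal capacity-removal tool;
    the real command and its parameters are not public.
    """
    removed = servers[:count]
    remaining = servers[count:]
    return removed, remaining

fleet = [f"server-{i}" for i in range(500)]

# Intended: take a small number of servers out of service.
removed, _ = remove_capacity(fleet, 5)
print(len(removed))    # 5

# One slipped digit removes a hundred times more servers.
removed, _ = remove_capacity(fleet, 500)
print(len(removed))    # 500 -- the entire fleet
```

Nothing in the tool distinguishes the intended small removal from the catastrophic one; both are valid inputs, which is why AWS's fix (below in its postmortem) focuses on the tool itself rather than on operator discipline.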

The index subsystem manages the metadata and location information for all S3 objects within the region, while the placement subsystem manages the allocation of storage for new objects and depends on the index subsystem to function properly. Removing a significant portion of their capacity meant that both subsystems required a full restart.

As it turns out, Amazon had not completely restarted these subsystems in its larger regions for several years, and S3 has experienced massive growth in the interim. Restarting the subsystems therefore took longer than expected, which added to the duration of the outage.

In response to the incident, AWS is making several changes to its internal tools and processes. The tool responsible for the failure has been modified to remove capacity more slowly and to block any request that would take capacity below safe minimum levels. AWS is also auditing its other operational tools to ensure they have similar safety checks.
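Those two safeguards can be pictured as a guard wrapped around the removal tool: a per-call rate limit, plus a floor below which capacity may never drop. Again a hedged sketch with invented names and threshold values (`MIN_CAPACITY`, `MAX_REMOVE_PER_CALL`, `safe_remove`), since the real tooling is internal to AWS:

```python
MIN_CAPACITY = 100          # minimum servers a subsystem needs (assumed value)
MAX_REMOVE_PER_CALL = 10    # rate limit: capacity comes out slowly (assumed value)

def safe_remove(servers, count):
    """Remove at most `count` servers, refusing any request that would
    exceed the per-call rate limit or drop capacity below the safety floor.

    Hypothetical illustration of the safeguards AWS describes.
    """
    if count > MAX_REMOVE_PER_CALL:
        raise ValueError(f"refusing to remove {count} servers at once "
                         f"(limit {MAX_REMOVE_PER_CALL})")
    if len(servers) - count < MIN_CAPACITY:
        raise ValueError(f"removal would leave {len(servers) - count} servers, "
                         f"below the floor of {MIN_CAPACITY}")
    return servers[count:]

fleet = [f"server-{i}" for i in range(500)]
fleet = safe_remove(fleet, 5)      # fine: a small, safe removal
try:
    safe_remove(fleet, 495)        # the fat-fingered request is now rejected
except ValueError as e:
    print("blocked:", e)
```

The point of the design is that the mistyped parameter from the incident becomes an error at the tool boundary instead of an action taken against the fleet.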

AWS engineers will also begin refactoring the S3 index subsystem to speed up restarts and reduce the blast radius of similar problems in the future.

The cloud provider has also modified its Service Health Dashboard to run across multiple regions. AWS employees were unable to update the dashboard during the outage because its administration console depended on S3 in the affected region.
