Amazon’s Cloud Storage Goes Down – What Went Wrong and What Can Companies Learn?
On February 28th, at around 9:37 am PST, a large number of cloud servers were unintentionally switched off, leading to widespread disruption of services for nearly 150,000 websites and apps.
However, this crash wasn’t caused by hackers, server failure, or any nefarious activity. It was caused by a typo.
Yes, a typo caused the largest cloud failure of 2017.
A simple data entry error caused massive website outages, with reports of entire sites crashing and thousands more left with broken links and missing images. A small, seemingly insignificant action, a single misplaced keystroke of the kind people make every day, broke large parts of the internet and brought much digital communication to a halt.
According to Amazon, the error occurred during a routine system update to fix a billing service bug. An authorized Amazon Web Services S3 team member used “an established playbook” command designed to shut down a small number of servers for repair. However, “one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
An Unresponsive Web
For nearly 4 hours, Amazon’s cloud-based solution, Simple Storage Service (S3), had difficulty sending and receiving clients’ data. This inability to communicate affected many of the most popular sites on the web: Medium, SoundCloud, Kickstarter, The Verge, and even the SEC all experienced outages or complete crashes.
Many popular apps were impacted by the outage as well. People were unable to connect to and control Nest thermostats, and Lyft users were unable to request rides.
Why did so many of the biggest apps and sites on the web experience outages? AWS S3 isn’t some small-scale cloud service. It’s one of the largest cloud storage solutions in use today. In fact, Amazon provides more than 40 percent of cloud computing.
S3 is used for just about everything on the web, from building websites and apps to storing images, customer data, and customer transactions. It is used by more than half a million customers.
According to cloud analyst Dave Bartoletti, AWS S3 has more than “3 to 4 trillion pieces of data stored in it.” That’s a lot of information suddenly unavailable.
In other words, when Amazon’s S3 went down, it impacted hundreds of thousands of sites and millions of lives. eCommerce sites were unable to conduct business, image sharing sites were unable to upload or display images, news sites were unable to publish, and music streaming services fell silent.
Head in the Cloud
Cloud storage solutions helped create the modern, mobile web. In fact, cloud solutions have become such an integral aspect of our digital experience it’s hard to imagine life without their benefits. That is, until something happens that disrupts our ability to access the cloud in our day to day lives—like not being able to turn up the heat, conduct business, or hail a ride.
Flexibility is a primary reason cloud computing services are so valuable for the internet. Companies, websites, and individuals can outsource many of the most expensive and time-consuming computer operations to professionals, freeing them up to focus attention on their business, website, or app. Cloud storage providers handle a wide range of operations including:
- Cyber Security
- Site Security
- Data Storage
- Data Processing
- Payment Processing
- Technology Upkeep
- Software Updates
With so many benefits, it’s no wonder that most of the web operates in the cloud. It’s simply a more efficient way to utilize the benefits of the web without having to undertake the maintenance required to maximize its use. It democratizes computing power, allowing thousands of users to share the benefit of state of the art servers and expert maintenance. In short, it helps create a reliable, interactive digital experience for both website administrators and visitors.
Cloud storage solutions provide small websites the same benefits and digital infrastructure as much larger sites. This helps level the playing field between the biggest and smallest players on the internet.
It means small mom and pop operations can utilize the same services as global corporations. Cloud solutions provide smaller companies and entrepreneurs the benefit of state-of-the-art technology as well as access to experts in cyber security, engineering, programming, and other data storage and processing specialists they may otherwise not have access to.
Clouds on the Horizon
Obviously, cloud computing services aren’t going anywhere. In fact, 41% of all enterprise workloads are housed in the cloud. That number is only expected to grow, reaching 60% by 2018.
The web is migrating to the cloud, and this digital exodus isn’t likely to change anytime soon.
According to Amazon, S3 “has experienced massive growth over the last several years,” which is why an error of this type has such a dramatic impact. Much of the web is rushing to, and relying on, the cloud for its data storage and processing.
Amazon isn’t the only cloud service that has experienced substantial growth.
According to Forbes, worldwide spending on cloud storage solutions will grow at nearly 20% a year, from almost $70 billion in 2015 to more than $140 billion in 2019. The bulk of these cloud services are operated by giants in the tech industry: Amazon, Microsoft, Google, and Oracle.
Indeed, Morgan Stanley predicts Microsoft cloud products will account for 30% of the company’s revenue by 2018. In other words, there are more clouds on the horizon. As it becomes more efficient and cost effective, cloud computing will only continue to grow.
Up to this point, the shift to cloud computing has been a highly efficient arrangement for modern commerce and the shape of the web. Website administrators and app developers can outsource the responsibilities of operating and maintaining costly servers to a company dedicated to cloud computing services at a more equitable cost.
By freeing up valuable time and resources that would otherwise be spent on data storage and processing, websites and apps are able to focus their energy on their business. Likewise, cloud providers can focus on data storage and processing solutions.
Even with all the benefits, there is still one glaring shortcoming of cloud services: vulnerability. With so much of the cloud under the control of just a few large companies, a small error (like a typo) can spell disaster for large swaths of the internet.
These kinds of outages could disrupt more than a few eCommerce sites and apps. As more of the web migrates to the cloud, the probability of another widespread crash only increases. The next outage could be much worse and bring day-to-day life to a grinding halt.
After the outage, many were quick to point out that it might not be such a great idea to have so much of the internet running and relying on one cloud service. Because Amazon provides nearly half of all cloud services, an AWS outage unfortunately affects most of the internet.
The internet was intentionally designed with a decentralized architecture to prevent these types of failures. The ARPANET, an early predecessor of the internet, was designed around robustness and survivability. While some argue the original goal was to create a communication network that could survive a nuclear attack, its decentralized design was crafted to keep operating even after the loss of large portions of the underlying network.
In other words, the internet’s infrastructure was designed to be intentionally dispersed and decentralized. With this principle in mind, a failure or outage at one control center should not impact the overall effectiveness of the web and digital communication as a whole. However, as the web migrates from small private servers to large-scale public services like Amazon’s S3, the risk of large-scale outages and disruptions only increases.
Cloud storage solutions are incredibly popular because they power the mobile web. The cloud frees up valuable time and resources otherwise spent on data storage and processing. The benefits of cloud computing for websites and apps are incalculable. However, the downside of trusting offsite servers to house valuable data becomes incredibly apparent when considering events like the Feb. 28th S3 crash.
In order to minimize the shortcomings of cloud storage, cloud protection protocols and fail safes must be updated. The vulnerability of cloud-based storage was uncomfortably highlighted during the outage of Feb. 28th. Amazon, and other large-scale cloud computing providers, should use this as an opportunity to reevaluate the way their cloud networks are supported, secured, and maintained.
Additionally, this failure underscores the value of the decentralized design principles behind ARPANET and the early internet. With cloud storage centers spread across different geographic locations, a single hack, failure, or breach at one center wouldn’t bring broad segments of the internet’s infrastructure to a screeching halt.
A Learning Experience
According to Ed Anderson, an analyst at Gartner Inc., the “fact that an incorrect keyboard entry could bring down an entire region shows they have some operational issues.” Amazon, and other cloud providers, must make changes to the operation and maintenance of the cloud.
As more websites, more apps, and more data storage and processing migrate to the cloud, cloud storage solutions must adapt to new vulnerabilities. According to Amazon, the S3 disruption has caused them to make “several changes as a result of this operational event.” Amazon is using this as a learning experience to add safeguards and operational protocols.
In fact, Amazon has claimed they “will do everything…to learn from this event and use it to improve” their cloud services.
They are also “auditing [their] other operational tools to ensure [they] have similar safety checks.” These checks will attempt to prevent an “incorrect input from triggering a similar event in the future.”
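To make the idea of a safety check concrete, here is a minimal sketch of how a capacity-removal tool might validate its input before acting. It is an illustration only, not Amazon’s actual tooling: the function name, the 5% per-command limit, and the minimum-capacity parameter are all assumptions made for this example.

```python
class CapacityRemovalError(Exception):
    """Raised when a removal request would take too many servers offline."""


def plan_removal(active_servers, requested, min_required, max_fraction=0.05):
    """Return the servers to remove, or raise if the request looks unsafe.

    Every name and threshold here is hypothetical, for illustration only.
    """
    requested = sorted(set(requested))

    # Reject identifiers that are not part of the active fleet (likely typos).
    unknown = [s for s in requested if s not in active_servers]
    if unknown:
        raise CapacityRemovalError(f"unknown servers: {unknown}")

    # Check 1: cap the share of the fleet a single command may remove.
    limit = int(len(active_servers) * max_fraction)
    if len(requested) > limit:
        raise CapacityRemovalError(
            f"request removes {len(requested)} servers; limit is {limit}"
        )

    # Check 2: never drop below the minimum capacity the service needs.
    remaining = len(active_servers) - len(requested)
    if remaining < min_required:
        raise CapacityRemovalError(
            f"only {remaining} servers would remain; need {min_required}"
        )

    return requested
```

With checks like these, a mistyped input that selects far more servers than intended is rejected before anything is switched off, rather than being carried out as entered.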
One of the most important, and telling, changes Amazon is implementing after the outage “involves breaking services into small partitions” they call “cells.” This process will “reduce blast radius and improve recovery” times.
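The effect of partitioning on “blast radius” can be sketched in a few lines. The example below assumes a simple hash-based scheme for assigning customers to cells. Amazon has not published its assignment method here, so the function names and the hashing approach are hypothetical.

```python
import hashlib


def cell_for(customer_id, num_cells):
    """Map a customer to one cell using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell, num_cells):
    """Return the customers whose cell is the one that failed."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    hit = affected_customers(customers, failed_cell=3, num_cells=10)
    print(f"{len(hit)} of {len(customers)} customers affected")
```

Because each customer belongs to exactly one cell, a failure confined to a single cell touches only the customers mapped to it, roughly one in ten with ten cells, instead of everyone at once.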
Breaking up the cloud into smaller clouds will only help for so long. The truth is the cloud needs to diversify and democratize.
A Smaller, More Efficient Cloud
If bigger isn’t better, what is the future of cloud computing services? The answer is a smaller, rapidly reactive, and more efficient cloud.
While large companies like Amazon and Microsoft make up the lion’s share of the cloud market, smaller, local cloud hosting companies provide cloud storage services without some of the vulnerabilities of larger cloud hosts.
What Can Companies Learn?
This wasn’t the first, nor will it be the last, cloud outage. However, companies that offer cloud storage solutions must revisit their operational procedures. New protocols must be enacted that will prevent user error from disabling the cloud.
In this regard, large-scale cloud hosts like Amazon can learn a lot from smaller cloud hosting services.
NIC Cloud Solutions
At NIC, we pride ourselves on promoting decentralization and vital redundancies. Located in Los Angeles, NIC offers top-quality cloud services including business email, data storage, and more. With a highly secure primary data center in Los Angeles and a secure recovery server in Phoenix, Arizona, NIC uses simultaneous, redundant storage to mitigate the kinds of outages seen during Amazon’s S3 failure.
Because data can be pulled from both centers, if something were to happen to one data center you wouldn’t lose data availability.
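The general pattern behind this kind of redundancy, writing every object to both locations and falling back to the second when the first is unreachable, can be sketched as follows. This is a simplified, in-memory illustration under assumed names (`DataCenter`, `replicated_put`, `read_with_failover`), not a description of NIC’s actual systems.

```python
class DataCenterUnavailable(Exception):
    """Raised when a data center cannot serve requests."""


class DataCenter:
    """A toy in-memory stand-in for a storage site (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self._objects = {}

    def put(self, key, value):
        if not self.online:
            raise DataCenterUnavailable(self.name)
        self._objects[key] = value

    def get(self, key):
        if not self.online:
            raise DataCenterUnavailable(self.name)
        return self._objects[key]


def replicated_put(centers, key, value):
    """Write to every data center so each holds a full copy."""
    for center in centers:
        center.put(key, value)


def read_with_failover(centers, key):
    """Try each data center in order; return the first successful read."""
    for center in centers:
        try:
            return center.get(key)
        except DataCenterUnavailable:
            continue
    raise DataCenterUnavailable("all data centers are offline")
```

For example, after `replicated_put([primary, recovery], "invoice-42", data)`, setting `primary.online = False` still lets `read_with_failover([primary, recovery], "invoice-42")` return the data from the recovery site.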
Bigger Isn’t Always Better
The streamlined efficiency of smaller cloud hosting services like NIC offers a roadmap for the future of cloud computing services. Huge corporations like Amazon and Microsoft can learn a lot from smaller cloud providers like NIC.
Maintaining redundant data centers ensures the risk of disruption is minimized. Smaller, more efficient server cells provide a decentralized data net that is less likely to crash.
All it took was one keystroke, one incorrect data entry by a team member, to bring down much of the web for nearly 4 hours. This catastrophe demonstrates just how cloud-reliant the internet has become. It also demonstrates the need for updated protocols for securing the cloud.
As more human activity comes to depend on the cloud, from thermostat control to the rest of the internet of things, cloud reliability and sound server protocols become ever more important.