By Sven Hammar
October 26, 2012 11:00 AM EDT
On Monday, Amazon Web Services, the leading provider of cloud services, suffered an outage, and as a result a long list of well-known and popular websites went dark. According to Amazon’s Service Health Dashboard, the outage began as degraded performance of a small number of Elastic Block Store (EBS) storage volumes in the US-EAST-1 Region, then spread to include problems with the Relational Database Service and Elastic Beanstalk as well.
The only surprising thing about this AWS outage was that anyone was surprised by it. It wasn’t the first time AWS had a major outage or problems with this data center. If you remember, back in June a line of powerful thunderstorms knocked the power out at a major Amazon hosting center. The backup generator failed, then the software failed, and, well, you know the drill. A corollary of Murphy’s Law is that if multiple things can go wrong, they will all go wrong at once.
In both of these instances (and in all Amazon Web Services outages, in fact) some customers were knocked “off the air” while others continued running without a hiccup. You would think that companies would eventually learn to anticipate the inevitable AWS outages and take active steps to prepare for them. There are best practices and solutions for reducing vulnerability to an outage, but they’re rarely implemented. That’s because people don’t think that anything could happen to Amazon. Obviously, things happen.
Instances like this are a learning opportunity if we take the time to think about why they happened and what could have been done to prevent them. Here are six lessons that I think we can learn from the Amazon Web Services outages.
Lesson 1 — Clouds are made of components that can fail. When people think of the cloud, they picture some amorphous and untouchable blob up in the sky. And while that’s a nice bit of marketing, it is not a useful model for operational planning. Be mindful of your cloud provider’s architecture and how it is built to manage the failure of a component or a zone blackout. Then anticipate that failures can happen at any point in the cloud infrastructure.
Lesson 2 — The stress of failure will trigger a cascade of other failures. After reading a description of the outage, you get the sense that it was just one thing after another. What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours. Remember Murphy and his law?
Lesson 3 — Spikes matter. When a cloud fails, hundreds of customers are affected. As they try to recover, they stress the cloud provider’s infrastructure with a peak load that is all but guaranteed to cause even more problems. Once you get these transition spikes, they get worse and worse. Every time you reboot, it takes longer and longer. If you have ten servers doing that, that’s bad. If you spike a thousand servers, that’s really bad. Something that would have taken five minutes to fix will now take five hours once you are caught in that kind of transition spike.
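One common way for a client to avoid adding to a recovery spike is to retry with exponential backoff and random jitter, so that a thousand servers do not all come back at the same instant. The sketch below is only an illustration of that general technique, not something described in the outage reports; the function names, the default values, and the decision to retry on any exception are all assumptions made for the example.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry.

    The ceiling doubles on every retry (base, 2*base, 4*base, ...) up to
    `cap`, and each wait is drawn uniformly between 0 and that ceiling so
    that many clients do not retry in lockstep.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def retry(operation, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call `operation` up to `attempts` times, backing off between tries.

    Illustrative only: a real client would catch just the errors it knows
    to be transient rather than every Exception.
    """
    for delay in backoff_delays(attempts - 1, **backoff_kwargs):
        try:
            return operation()
        except Exception:
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return operation()
```

The cap keeps the wait bounded, and the jitter is what spreads the load: without it, every client that failed at the same moment would also retry at the same moment.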
Lesson 4 — Cloud providers provide the tools to manage failure, but it is up to you to put your own failover plans in place. AWS, for example, is divided into regions, each made up of multiple availability zones. If a component in the Virginia region goes down and takes everything there with it, then (in theory) you should be able to move your workload and data to another region. That other region might be hosted, unaffected, in Ireland, and then you are up and running again. This is one of the big differences between the cloud and more traditional approaches to IT. It is up to the application (and by extension, the application’s designer) to manage its interaction with the cloud environment, up to and including failover. Most cloud providers offer tools and frameworks to support failover, but you are responsible for building that best practice into your system operations and your applications.
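To make the idea of application-managed failover concrete, here is a minimal sketch of the routing half of the problem: try each deployment in priority order and use the first one that passes a health check. The endpoint URLs and the `is_healthy` callback are hypothetical, and the sketch deliberately leaves out the harder part, which is keeping the data in the second location up to date.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint, in priority order, that passes its check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated like a failed check.
            continue
    raise RuntimeError("no healthy endpoint available")


# Hypothetical deployments in two locations, primary first.
ENDPOINTS = [
    "https://app.virginia.example.com",
    "https://app.ireland.example.com",
]


def virginia_is_down(endpoint):
    """Stand-in health check that simulates an outage in the primary."""
    return "virginia" not in endpoint


active = pick_endpoint(ENDPOINTS, virginia_is_down)
```

In a real system the health check would be an actual request with a timeout, and the choice would normally be made by DNS or a load balancer rather than by each caller, but the decision logic is the same.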
Lesson 5 — You need to put your failover plans through a full-blown load test. It’s not enough to have a strategy in place for failover. You have to test it under real-world conditions. Even the best-laid failover plans, once designed and implemented, might have hiccups when a real outage occurs. A full-blown cloud load test can help you see how long the failover process will take to kick in and what other dependencies might need to be sorted out. Obviously this isn’t easy. If it were, Reddit, Foursquare, Airbnb and others wouldn’t have been impacted by the AWS outage.
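One number worth capturing during such a test is how long the application stays unreachable after the primary is taken down. The sketch below shows one way to measure that, assuming you supply a `check` function that returns True once the application answers again; the function name, the parameters, and their defaults are illustrative and not taken from any particular load-testing tool.

```python
import time


def time_to_recover(check, timeout=300.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True and report the elapsed seconds.

    Returns None if the application has not recovered within `timeout`.
    `clock` and `sleep` are parameters so the loop can be exercised in a
    test without actually waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Start the timer at the moment you trigger the simulated outage, and run it while the load test is still generating traffic, since failover that works on an idle system may behave differently under load.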
Lesson 6 — Conduct fire drills. While a load test will confirm that your failover plan works as you expect, it will also give your team some real experience in executing the plan. Remember the fire drills you used to do in school? Fire drills help train students, teachers, and others to know exactly what they’re supposed to do and where they’re supposed to go in the event of an emergency. All the bugs in the process are worked out during the fire drill, and the more everybody does the drills, the more comfortable they are with what they need to do. And if a real emergency happens, everybody knows how to leave the building calmly. You want to do the same thing with your failover plan, and load testing can help you get there. Fire drills save lives and load tests save cloud apps.
Is your failure worth more than $28?
Amazon offers reimbursement to its customers based on the amount of downtime the customer experiences. The last time Amazon Web Services went down for us, we got a $28 reimbursement. So my final lesson learned (I guess this makes for seven lessons) is this: The cost of downtime for your organization (in lost revenue, poor customer experience, etc.) is far, far greater than just what you are paying your cloud provider. $28 is not going to save your day. You have to make sure that you have a failover solution that’s ready and working. Don’t wait for Amazon to solve this problem for you, because it’s only a $28 problem for Amazon.
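A back-of-the-envelope calculation makes the gap plain. The revenue and outage figures below are made up purely for illustration; only the $28 credit comes from our own experience, and the function names are mine.

```python
def downtime_cost(revenue_per_hour, hours_down):
    """Estimate revenue lost while the application was unavailable."""
    return revenue_per_hour * hours_down


def uncovered_loss(revenue_per_hour, hours_down, provider_credit):
    """Loss that remains after subtracting the provider's reimbursement."""
    return max(0.0, downtime_cost(revenue_per_hour, hours_down) - provider_credit)


# Hypothetical business: $1,000 of revenue per hour, down for 5 hours.
loss = uncovered_loss(revenue_per_hour=1000.0, hours_down=5.0,
                      provider_credit=28.0)
```

Plug in your own hourly revenue and a realistic outage length, and add whatever you think a frustrated customer costs you; the credit will be a rounding error for almost any business.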
The biggest lesson learned from these AWS outages is that you need to configure properly and you need to train your people. These types of events will always happen, and when they do, you need to be trained ahead of time. Load testing itself is a good way to validate and train. That way when a real emergency occurs, your team can react in a calm, collected manner to a situation they’ve experienced dozens of times before.