Amazon Web Services: A Scaling Issue?

The Internet as we know it is home to a vast amount of information and content, where everyone from individuals like you and me to businesses small and large is able to interact on a global scale in ways never before thought possible. Perhaps more importantly, the global network has changed the way that we go about our daily lives and how we express, share, and gain knowledge from ideas. Online portals and communities serve as the hubs for all of this interaction. After all, what would the Internet be like without social networks such as Facebook and community-driven news websites like Reddit? But while very few people would challenge the importance of these revolutionary web services, the question of how exactly these sites run rarely crosses our minds as users.

Simply put, the “big” websites that have become everyday parts of our Internet lives have amazingly large infrastructures backing what we as consumers see on the front end. These infrastructures only gain in complexity as the sites and applications they support grow. The complexity and scalability of in-house servers used to be one of the biggest things keeping the CEOs of large Internet-driven companies awake at night, but “newer” services such as Amazon’s Elastic Compute Cloud (EC2) have taken much of the hassle out of such nightmares, making it easy for websites and cloud-based services to scale as their success rises. Services like EC2 allow businesses to outsource their server needs and focus their creative efforts more efficiently, all while operating on platforms built to be more or less bulletproof, often at more competitive pricing to boot.

With this in mind, it’s easy to see why Amazon EC2 has become such a popular choice for big (and growing) businesses and websites. But as I’m sure you’ve seen or heard recently, last week brought a surprising outage to Amazon’s EC2 service, leaving countless websites offline. Left and right, users were seeing messages stating that an issue with Amazon’s platform had crippled many major websites. But why exactly did EC2, a service known for its flexibility, scalability, and stability, suddenly go haywire?

In order to understand my theory, one must first understand the pricing system used by Amazon Web Services. As you would guess, Amazon has a handful of data centers in a variety of geographical regions, allowing businesses to operate servers around the world. However, because the cost of operating differs quite a bit from one area to another, Amazon charges more for its services in areas with higher overhead. Even though this is really an entirely separate discussion, the fundamental concept is that Amazon has to offset higher taxes, regulatory fees, cost of living for its employees, and a list of other operational costs. This is why operating a server in California, for example, can easily cost a business about 25% more than running one in Virginia.
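To put that premium in perspective, here is a rough back-of-the-envelope sketch. The only figure taken from above is the roughly 25% difference; the hourly rate, the fleet size, and the `regional_monthly_cost` helper are hypothetical numbers and names I made up for illustration, not Amazon's actual pricing.

```python
def regional_monthly_cost(base_hourly_rate, premium, instances, hours=730):
    """Monthly cost of a fleet in a region priced `premium` above the baseline.

    `premium` is a fraction, e.g. 0.25 for a region that costs 25% more.
    `hours` defaults to roughly the number of hours in a month.
    """
    return base_hourly_rate * (1 + premium) * instances * hours


# Hypothetical baseline rate in dollars per instance-hour (NOT a real AWS price).
BASE_RATE = 0.10
FLEET_SIZE = 20  # hypothetical number of always-on servers

virginia = regional_monthly_cost(BASE_RATE, 0.00, FLEET_SIZE)
california = regional_monthly_cost(BASE_RATE, 0.25, FLEET_SIZE)

print(f"Virginia:   ${virginia:,.2f} per month")
print(f"California: ${california:,.2f} per month")
print(f"Difference: ${california - virginia:,.2f} per month")
```

Even with these made-up numbers, the point holds: the premium scales linearly with the size of the fleet, so the bigger the site, the stronger the pull toward the cheapest region.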

The most cost-effective Amazon data-center for businesses is located in Virginia, and is the same data center that experienced technical issues last week.

Because of the nature of many websites and their operational infrastructure, using the cheapest data center is a perfectly acceptable option even if it is geographically distant from the majority of visitors and users. Having said this, EC2 stands for “Elastic Compute Cloud” and is meant to do just that: compute. This means that EC2 isn’t designed to serve traffic so much as it is to handle large amounts of data and process queries that consume local hardware resources. The bulk of the bandwidth-intensive content delivery (static files, images, etc.) for many large websites is often handled by other distribution systems such as Amazon’s own CloudFront CDN (Content Delivery Network).

In turn, this means that while EC2 servers are fundamental to the operation of a large and complex website or service, the location of the server really isn’t that important, because users typically won’t see a noticeable difference in speed from one data center to another. Many companies would be wasting money by provisioning servers outside of Virginia, since the increased cost wouldn’t deliver any better service.

So, back to the issue at hand: this makes the Virginia data center a very popular choice for businesses and large sites. As a matter of fact, I’d even go as far as to guess that Virginia is the most popular of Amazon’s data centers.

If you cannot see the point I’m trying to get at by now, the concept is truly very simple. I think that Amazon is in the midst of a scaling issue because the popularity of the Virginia data center is higher than it probably should be given its current limitations, especially with rapidly growing websites and services. Amazon’s own health status dashboard stated that in order to “safely re-mirror the stuck [EBS] volumes” and get websites back online, the company had to “add the capacity” necessary to do so.

Sure, this was what many people would consider to be a freak accident, and I can understand that in this particular case Amazon’s resources were simply not enough to handle it effectively. And really, I don’t think that Amazon is at a point where its capacity is bursting at the seams, but that doesn’t change my opinion that Amazon could likely have handled the entire situation much more effectively had it had more resources and been better prepared for this type of failure.

So where does that leave us? Is EC2, the service people use when they fear for the scalability of their own infrastructure, in the midst of scaling issues of its own? As much as I’d love to give you an answer, I really can’t, as Amazon is the only entity that possesses this knowledge. Nonetheless, I think this unfortunate event with EC2 serves as a wake-up call that even services designed for scalability are not necessarily foolproof.

For Amazon’s sake, I really do hope that this outage is not going to become a normal thing, as future problems would surely damage the company’s reputation as a cloud service provider in a market that has a great deal of potential.

Lastly, if you’re looking for additional web service testing guides, the tutorial site guru99 has some great how-tos.