
How Resilient is your Cloud Application?

Resiliency testing plays an important role in the successful performance of applications on the cloud. At least 77 percent of enterprises run at least one application, or a portion of their enterprise applications, on cloud infrastructure. However, research shows that around two-thirds of these companies haven't experienced any significant benefits in scalability, savings, or IT modernisation.

While cloud infrastructure does have the potential to make businesses leaner, more efficient, and more secure, several underlying risks undermine its ability to do so. These risks range from performance failures in migrated systems to underutilised business potential. Quite often, performance issues persist, especially where on-premises applications and legacy systems are migrated without optimisation and testing.

For the purpose of this article, let’s review one of the most common failures seen in applications with high availability requirements. A closer look at the causes and possible methods of prevention could be eye-opening.

High Availability: The Ultimate Promise of the Cloud

High availability is the ability of a system to provide smooth, uninterrupted service no matter what the conditions are. This is one of the most appreciated advantages of migrating to the cloud. High availability is typically achieved through an integrated approach involving high levels of automation and monitoring, load balancing, clustering, early detection of imminent failures, and automated failover to a secondary system in the event the primary fails. What is important to note is that all these precautionary activities take place at the infrastructure layer.
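
As a rough illustration of what that infrastructure-layer work involves, here is a minimal Python sketch of a health-check loop with automated failover. The endpoints, threshold, and probe interval are hypothetical placeholders for this article, not any provider's actual mechanism.

    import time
    import urllib.request

    # Hypothetical health-check endpoints -- placeholders, not real services.
    PRIMARY = "https://primary.example.com/healthz"
    SECONDARY = "https://secondary.example.com/healthz"

    FAILURE_THRESHOLD = 3   # consecutive failed probes before failing over
    PROBE_INTERVAL = 5      # seconds between health probes

    def is_healthy(url, timeout=2.0):
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:      # connection errors and timeouts both count as unhealthy
            return False

    def monitor_and_failover():
        """Probe the active endpoint; after repeated primary failures, fail over to the secondary."""
        active, failures = PRIMARY, 0
        while True:
            if is_healthy(active):
                failures = 0
            elif active == PRIMARY:
                failures += 1
                if failures >= FAILURE_THRESHOLD:
                    print("Primary unhealthy -- failing over to secondary")
                    active, failures = SECONDARY, 0
            time.sleep(PROBE_INTERVAL)

Real load balancers and orchestrators do this with far more sophistication, but the principle is the same: the detection and switchover happen below the application, in the infrastructure layer.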

Why is Resiliency Testing Ignored?

Often, during an on-premises migration to the cloud, customers overlook the robustness and resiliency of their system under the assumption that the cloud provider is entirely responsible for assuring over 99% uptime and availability. This assumption, however, is only partly true. While cloud infrastructure providers do employ various strategies to guarantee high availability, architecting and designing the application for resiliency remains the customer's responsibility. Since most customers are unaware of this requirement and are ill-prepared for resiliency, system availability is often compromised despite having the best cloud infrastructure provider at their service.

What does Application Resilience Mean?

Application resilience refers to the ability of an application to continually provide and maintain acceptable levels of service even in the face of challenges and less-than-ideal operating conditions. In simpler words, it is an application's ability to be prepared for disruptions and unforeseen changes to its environment. This includes its ability to recover from faults and, under extreme conditions, to degrade gracefully.
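
A simple, hypothetical Python sketch of what this can look like in code: a call to a flaky downstream service is retried with backoff and, when it still fails, the application degrades gracefully to a stale cached response instead of erroring out. The service and function names are invented for illustration.

    import logging
    import time

    log = logging.getLogger("resilience-demo")

    def fetch_recommendations(user_id):
        """Stand-in for a downstream recommendation service that is currently failing."""
        raise TimeoutError("recommendation service did not respond")

    def cached_recommendations(user_id):
        """Degraded but acceptable fallback: serve a stale, precomputed list."""
        return ["popular-track-1", "popular-track-2"]

    def get_recommendations(user_id, retries=2, backoff=0.5):
        """Retry the primary dependency, then degrade gracefully instead of failing outright."""
        for attempt in range(retries + 1):
            try:
                return fetch_recommendations(user_id)
            except TimeoutError:
                log.warning("attempt %d failed for user %s", attempt + 1, user_id)
                time.sleep(backoff * (2 ** attempt))   # exponential backoff between retries
        return cached_recommendations(user_id)         # graceful degradation

    print(get_recommendations("u42"))   # falls back to the cached list while the dependency is down

The user still gets a usable, if slightly stale, response; the alternative is the kind of visible outage described in the Spotify example below.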

Whether an application is migrated from an on-premises environment to the cloud or has been developed natively, it risks failure during operation if it has not been architected and remediated for resilience before it goes live. To realise the full benefits of cloud infrastructure, resiliency best practices are imperative.

Recently, the Swedish music streaming service Spotify experienced an outage during which customers were unable to listen to their favourite songs for nearly an hour. The outage was caused by failures in several microservices, triggered by a network problem. Unfortunately, their system had not been adequately architected for early detection of failure and recovery, or tested for resilience, and this resulted in a cascade of downstream failures.

How is Resilience Testing Different from Conventional Testing?

Resiliency testing checks a system's capacity to remain operational, without disrupting service, under live conditions. Much of its difficulty lies in working out how a cloud application can be tested, evaluated, and characterised for resilience in the first place. Conventional testing, on the other hand, cannot adequately reveal application resiliency issues, for the following reasons:

  • Conventional testing methods are limited to business use-cases or are requirements-driven and therefore cannot uncover hidden architectural flaws.
  • Complex services often require the interplay of several software entities. The heterogeneous, multi-layer architecture needed to manage these complex services is itself a source of application failure.
  • Limited exposure to real production usage patterns leaves teams ill-equipped to identify, classify, and manage previously unknown and emergent behaviours in cloud application architectures, especially in hybrid and multi-cloud systems.
  • Many failures are caused by internal system errors that remain latent and asymptomatic until certain environmental factors trigger them into large-scale failures (see the sketch after this list).
  • Multi-user, authenticated usage across the cloud, with different stakeholders and administrators at each layer, leads to unpredictable configuration changes that break interfaces within the architecture.
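
The latent-fault point is worth making concrete. The sketch below is a hypothetical Python example, not drawn from any real system: a preference-saving routine swallows a transient write failure, every requirements-driven test passes over it, and only an injected network fault exposes the silent data loss.

    import random

    def save_preferences(user_id, prefs, write_to_store):
        """Latent defect: a transient write failure is swallowed and reported as success."""
        try:
            write_to_store(user_id, prefs)
        except ConnectionError:
            pass          # the write is silently dropped -- harmless until the network blips
        return True       # the caller believes the preferences were saved

    def reliable_store(user_id, prefs):
        """Healthy dependency: conventional test runs use this path and always pass."""
        return None

    def flaky_store(user_id, prefs):
        """Injected fault: the dependency fails a fraction of the time, as it can in production."""
        if random.random() < 0.3:
            raise ConnectionError("transient network fault")

    # Requirements-driven check: passes, so the defect stays latent.
    assert save_preferences("u1", {"theme": "dark"}, reliable_store) is True

    # Fault-injected runs: still report success even when roughly 30% of writes are lost.
    assert all(save_preferences("u1", {"theme": "dark"}, flaky_store) for _ in range(100))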

Effective Resiliency Testing Strategies

In late 2015, a major outage shook the foundations of Amazon Web Services and resulted in hours of downtime for many of the large tech companies that depended on its services. Netflix, however, recovered relatively unscathed with just a few minutes of downtime. The secret of Netflix's success was its Simian Army, which ran outage and system-failure simulations that alerted its cloud engineers to impending failure modes. Since then, this scenario has become a well-researched case study, replacing cloud-migration complacency with a new norm: failure is the rule, not the exception.

In the cloud, application resiliency is both a challenge and a necessity because of the multi-tier, highly integrated technology infrastructure and the distributed systems involved. The interplay of these varying elements can cause surprising hiccups and outages even when the cloud infrastructure provider keeps up their side of the bargain. Cloud engineers therefore have to watch for imminent failure, using strategies such as the following to test, evaluate, and characterise application resilience:

  • Outline availability goals and determine application layer resilience attributes through collaboration with other cloud architects.
  • Observe usage patterns and/or extrapolate existing data to predict future usage patterns, hypothesise failure modes, analyse the impact of the failure on business, and prioritise testing of the impending failure modes.
  • Manufacture errors, or rig the internal architecture for failure, and study the effects of those failures in development and testing environments (a minimal sketch of such fault injection follows this list). These rigged errors can include delayed responses, over-utilisation of resources, network outages, transient conditions, extreme conditions, user-generated errors, and more.
  • Alternate between varying combinations of fabricated errors, varying both the severity of individual issues and the combinations in which they occur, to observe application-layer behaviour.
  • Isolate anomalous behaviour, then increase the severity of the issue, or the complexity of the error combination, to determine failure criticality.
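
One lightweight way to manufacture such errors is to wrap a dependency call in a fault-injecting decorator that adds latency or raises exceptions on demand. The Python sketch below is illustrative only; the fault rates, delays, and function names are assumptions, not the defaults of any particular chaos-testing tool.

    import functools
    import random
    import time

    def inject_faults(latency_s=0.0, error_rate=0.0, exc=ConnectionError):
        """Decorator that adds artificial delay and/or random failures to a call.

        latency_s  -- fixed delay before every call (simulates a slow network)
        error_rate -- probability in [0, 1] that the call raises `exc` instead of running
        """
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if latency_s:
                    time.sleep(latency_s)                            # delayed response
                if random.random() < error_rate:
                    raise exc(f"injected fault in {func.__name__}")  # simulated outage
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # Example: rig a hypothetical downstream call with 200 ms of latency and a 20% failure rate.
    @inject_faults(latency_s=0.2, error_rate=0.2)
    def fetch_profile(user_id):
        return {"id": user_id, "plan": "premium"}

Running the application layer against calls rigged like this, while dialling the severity up and combining several faults at once, is the kind of controlled experiment the last two bullets describe.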

Organisations can be better prepared to harvest the benefits of the cloud by adopting an architecture-driven testing approach that provides insight into cloud application resiliency well before the application goes live, leaving sufficient time to perform the necessary remediation.

Kittu

I'm an experienced Technical Consultant who's worked in the IT industry for a while. I know my way around Web Servers, Bash scripting, Red Hat Linux, MySQL, Virtualization, and Openstack. I earned a Diploma in Engineering, focusing on Electronics and Communications Engineering from Govt Polytechnic College, Ezhokone. I'm passionate about finding smart solutions and making technology work for people.
