Resilient Microservice Applications, by Design, and without the Chaos
Fault injection testing is vital for assessing the resilience of distributed microservice applications against infrastructure and downstream service failures. Because it is typically performed in production, this testing may adversely affect customers, and it often fails to identify application bugs, particularly infrequent ones or those that affect only a subset of customers. While academics recognize the problem of detecting resilience bugs in development, prior to deployment of application code to production, their research has been limited by a lack of access to industrial applications, which has resulted in solutions that may or may not be fully aligned with the industry’s needs.
This dissertation demonstrates that these types of resilience bugs can be identified during development, before application code is deployed to production, through the use of a developer-centric fault injection technique and a principled approach to microservice application testing. It then demonstrates that this can be done in a manner that aligns with industrial practitioners’ needs by co-evolving the fault injection technique and principled approach with an industrial partner, one of the largest food delivery services in the United States, which results in the discovery of deep, previously undiscovered resilience bugs in their application.
This dissertation begins by constructing a microservice application corpus and introducing a novel tracing technique that captures all inter-service communication in a microservice application. Combined with the corpus, this tracing technique enables the development of an exhaustive fault injection testing technique designed specifically for microservice environments. This technique is then refined by implementing a novel test case reduction strategy that minimizes the exploration of redundant fault injection scenarios, thereby increasing the performance and usability of the technique. The practicality of these techniques is then validated using a case study taken from an industrial microservice application. While this case study confirms the fault injection technique’s effectiveness, it both highlights deficiencies in the application of the technique and identifies emergent behavior that is inherent to industrial microservice applications and their piecemeal approach to application resilience. These observations inform the design of a new principled approach for testing microservice applications for resilience, which extends the fault injection technique’s usability by ensuring that developers write tests for their applications that are sufficient for bug identification.
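The idea of exhaustive fault injection with test case reduction can be illustrated with a small sketch. This is not the dissertation's implementation. It is a toy model under stated assumptions: the call graph `CALLS`, the fallback set `FALLBACKS`, and the functions `run` and `explore` are all hypothetical, execution is assumed deterministic, and the reduction rule shown (skip a scenario when one of its injected faults would never be reached) is one simple possibility, not necessarily the strategy the dissertation uses.

```python
from itertools import combinations

# Hypothetical call graph: each service lists the downstream services it calls, in order.
CALLS = {
    "gateway": ["orders", "recommendations"],
    "orders": ["payments"],
    "payments": [],
    "recommendations": [],
}
# Hypothetical (caller, callee) edges where the caller tolerates a failed call.
FALLBACKS = {("gateway", "recommendations")}


def run(service, faults, trace):
    """Simulate one request; record every call made; return True on success."""
    for dep in CALLS[service]:
        edge = (service, dep)
        trace.append(edge)  # toy stand-in for tracing inter-service communication
        ok = edge not in faults and run(dep, faults, trace)
        if not ok and edge not in FALLBACKS:
            return False  # the failure propagates to the caller
    return True


def explore(root):
    """Try every fault combination, skipping those that are provably redundant."""
    baseline = []
    run(root, frozenset(), baseline)
    results, traces, canon = {}, {}, {}
    skipped = 0
    for size in range(len(baseline) + 1):
        for combo in combinations(baseline, size):
            faults = frozenset(combo)
            # Redundant if some fault e is never reached once the others are injected:
            # the scenario then behaves exactly like an already explored smaller one.
            smaller = None
            for e in faults:
                rep = canon[faults - {e}]
                if e not in traces[rep]:
                    smaller = rep
                    break
            if smaller is not None:
                canon[faults] = smaller
                skipped += 1
                continue
            trace = []
            results[faults] = run(root, faults, trace)
            traces[faults] = trace
            canon[faults] = faults
    return results, skipped


results, skipped = explore("gateway")
for faults, ok in results.items():
    print(sorted(faults), "->", "ok" if ok else "request failed")
print("executed:", len(results), "skipped as redundant:", skipped)
```

In this toy model, three traced calls yield eight possible fault combinations, of which only four need to be executed; the rest inject a fault on a call that is never made because an earlier failure already aborted the request.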
With this principled approach, it is shown that deep, previously undiscovered, resilience bugs can be identified in large-scale, industrial microservice applications, in development, and before code ships to production.
History
Date
2024-05-07
Degree Type
- Dissertation
Department
- Software and Societal Systems (S3D)
Degree Name
- Doctor of Philosophy (PhD)