Carnegie Mellon University
Fu_cmu_0041E_10498.pdf (5.99 MB)

Non-Intrusive Causal Dependency Model-based Performance Anomaly Detection and Localization in Cloud Applications

thesis
posted on 2020-01-15, 19:02 authored by Senbo Fu
Infrastructure-as-a-Service (IaaS) Cloud is a popular platform for providing virtual computing and storage
resources to millions of users all over the world. It allows many Cloud users to deploy their applications
in a simple and cost-effective way. Cloud applications are usually deployed in virtual components such as
virtual machines and containers. Each virtual component provides a specific function and they together
provide services to customers. Due to their complex and dynamic nature, Cloud applications are prone to
performance anomalies. Performance anomalies degrade the quality of experience for the users and may
cause loss of revenue for service providers. Performance anomalies could propagate from one component
to another through their interactions. A faulty component could cause abnormal behaviors in many other
components. When there is an anomaly, it is important to detect it and locate the faulty component as
quickly as possible. Virtual components owned by Cloud tenants provide neither visibility nor access to
Cloud providers. Existing IaaS Cloud infrastructures usually monitor the resource consumption and activity of each component. However, resource utilization metrics do not reflect the actual service performance. We propose decentralized methods for anomaly detection and localization in a non-intrusive fashion. We detect performance anomalies and localize the faulty component in Cloud applications without any information about the inner workings of virtual components. Our systems do not own these virtual components and treat them as black boxes. We monitor network traffic from each virtual component and its
interaction with other components. The interaction behavior is not affected by fault propagation if all components involved in the local interaction are normal. This discovery helps us quickly filter out normal components. We classify these interactions into three different dependency primitives. We show that these dependency primitives help achieve better anomaly detection and localization in Cloud applications.
We propose DMADL (Dependency Model-based Anomaly Detection and Localization) to estimate the mean response time of each component using the arrival and departure pattern of data packets. DMADL achieves anomaly localization through the dependency model and component impact analysis. We also propose DMFDL (Dependency Model-based Flow ratio analysis for Anomaly Detection and Localization) and DMCDL (Dependency Model-based flow Correlation analysis for Anomaly Detection and Localization) to model the relationship that the response flow always follows the request flow within an acceptable time limit for each component service. This relationship holds under varying workload conditions at any component of a Cloud application as long as the component operates normally. We evaluate our methods in realistic deployment scenarios using the CloudSuite web search application, the Olio web application, and the MediaWiki application. The results show that DMADL achieves accurate response time estimation at each component. DMADL has around 95% precision and 5% false
negative rate in both anomaly detection and localization under varying workload scenarios. DMFDL and DMCDL have, on average, 87% precision and 5% false negative rate in anomaly detection and localization. Compared to the anomaly detection methods based on resource utilization metrics, DMFDL and DMCDL achieve on average 18% higher precision, and 17% fewer false negatives. In anomaly localization, DMFDL and DMCDL achieve around 15% higher precision and 10% fewer false negatives than FChain, another black-box component-level fault localization method. FChain relies on the chronological changing order
of components without considering the dependency model. We also evaluate DMADL, DMFDL, and DMCDL with extensive chronic faults. We show that our methods detect anomalies within 5 minutes for extensive chronic faults.
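
The two black-box ideas named in the abstract (estimating mean response time from packet arrival and departure patterns, and checking that the response flow tracks the request flow) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the fixed time windows, the one-response-per-request assumption, and the min_ratio threshold are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def mean_response_time(arrivals: Sequence[float],
                       departures: Sequence[float]) -> float:
    """Estimate mean response time from packet timestamps alone.

    Assumption (illustrative): every request seen in the window produced
    exactly one response in the same window. Under that assumption the
    mean of (departure - arrival) does not depend on how requests are
    paired with responses, so no per-request matching is needed.
    """
    if len(arrivals) != len(departures) or not arrivals:
        raise ValueError("need equal, non-zero request and response counts")
    return (sum(departures) - sum(arrivals)) / len(arrivals)


def flow_ratio_anomalies(requests: Sequence[int],
                         responses: Sequence[int],
                         min_ratio: float = 0.8) -> List[int]:
    """Return indices of time windows whose response/request ratio is low.

    requests[i] and responses[i] are packet (or message) counts observed
    at one component during window i. min_ratio is a hypothetical
    threshold; windows with no requests are skipped.
    """
    flagged = []
    for i, (req, resp) in enumerate(zip(requests, responses)):
        if req > 0 and resp / req < min_ratio:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Three requests, each answered 0.5 s later.
    print(mean_response_time([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))
    # The third window answers far fewer requests than it receives.
    print(flow_ratio_anomalies([100, 120, 110, 90], [98, 118, 40, 88]))
```

Both functions use only externally observable traffic, which matches the non-intrusive setting described above; a real deployment would also need to handle requests whose responses fall into a later window.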

History

Date

2020-01-02

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Hyong Kim
