QoE Based Management and Control for Large-Scale VoD System in the Cloud
Cloud infrastructure has become an ideal platform for large-scale applications such as Video-on-Demand (VoD). As VoD systems migrate to the Cloud, new challenges emerge. Virtualization and resource sharing make the Cloud a complex system, which complicates Quality of Experience (QoE) management, and operational failures in the Cloud can lead to session crashes. Beyond the Cloud, large-scale video streaming involves many other systems, including Content Delivery Networks (CDNs), multiple transit networks, access networks, and user devices. An anomaly in any of these systems can degrade users' QoE. Identifying the anomalous system that causes the degradation is challenging for VoD providers because they have limited visibility into these systems. We propose to apply end-user QoE to the management and control of large-scale VoD systems in the Cloud. We present three QoE-based management and control systems and validate them in production Clouds. QMan, a QoE-based management system for VoD in the Cloud, adaptively controls server selection based on user QoE. QWatch, a scalable monitoring system, detects and locates anomalies based on end-user QoE. QRank, a scalable anomaly identification system, identifies the anomalous systems causing QoE anomalies. The proposed systems are developed and evaluated in production Clouds (Microsoft Azure, Google Cloud, and Amazon Web Services). QMan provides 30% more users with QoE above the “good” Mean Opinion Score (MOS) than existing server selection systems. QMan also discovers operational failures through QoE-based server monitoring and prevents streaming session crashes. QWatch effectively detects and locates QoE anomalies in our extensive experiments in production Clouds. By contrast, anomaly detection methods based on system metrics produce numerous false positives and false negatives.
QRank identifies the anomalous systems causing 99.98% of all QoE anomalies among transit networks, access networks, and user devices. Our extensive experiments in production Clouds show that transit networks are the most common bottleneck causing QoE anomalies. Cloud providers should therefore identify bottleneck transit networks and establish appropriate peering with Internet Service Providers (ISPs) to bypass these bottlenecks.