Diagnosing User-Visible Performance Problems in Production High-Density Wi-Fi Networks
Large-scale, high-density Wi-Fi networks use hundreds of access points to serve thousands of closely packed users within a large physical space (hundreds of thousands of square feet or more, such as in a stadium or arena). Because of their scale, these are complex and dynamic systems composed of several layers and multiple components within each layer, and faults may be present in any one of these components. The problems that manifest from these faults are usually not network-wide and may be localized to certain physical areas of the network. This makes these problems challenging to detect and diagnose; in most cases, only a small number of devices tend to be impacted by any given problem. However, many such problems may occur simultaneously in different areas of the network. Adding to the complexity is the dynamic nature of such networks, where the physical positions of radios (in end-user devices), human bodies, and other objects in the space are constantly changing, thereby creating a continually changing RF environment. Taken together, these properties make problem diagnosis in large-scale, high-density Wi-Fi networks challenging. There are many existing techniques for diagnosing problems in Wi-Fi networks. Many of these approaches rely on data from only a single perspective of the network to diagnose problems, for example, the client, the infrastructure (access points), or external Wi-Fi sensors that passively monitor the network. In addition, many of these approaches require the invasive modification of the network’s components in order to collect data, through techniques such as installing specialized software on clients, modifying the firmware on access points, or even physically installing specialized devices in the RF environment of the Wi-Fi network. Finally, many approaches rely on offline analysis of the collected instrumentation, in which case diagnosis cannot be done in real time (minutes or less). 
Many others require network connectivity for real-time diagnosis, in which case the device must be able to communicate using the Wi-Fi infrastructure (which may itself be experiencing a problem). As a result, many of these approaches are difficult to deploy in production networks (due to the high financial cost or maintenance effort required), and those that are deployed often fail to detect and diagnose problems that are localized to a small number of devices (10 or fewer) or problems that are only present for a short time (minutes or less). This dissertation takes a unique approach that contrasts with existing approaches in three key ways. First, we combine the Wi-Fi performance data from multiple layers of the Wi-Fi network and attempt to diagnose problems at all of these layers, rather than focusing on a single layer alone, and we introduce a fault model that includes faults that can occur across all layers of the system. Second, we require no invasive modification of the Wi-Fi network or its components in order to collect data and perform problem diagnosis and mitigation. Third, we present an infrastructure-free approach to problem diagnosis that relies on Bluetooth communication with other devices nearby (peers) to perform diagnosis based on multiple perspectives of the Wi-Fi network. With this approach, our diagnosis algorithm is able to collect data from multiple network perspectives without relying on Wi-Fi infrastructure, which may be slow or unavailable. Our approach begins with the construction of an instrumentation and data-collection system to obtain Wi-Fi performance metrics from both the client and infrastructure perspectives of the network. We then build upon our instrumentation to determine when user-visible problems occur. We define a user-visible problem as a Wi-Fi-network-performance problem that causes users to disengage from using the network. 
Once we have detected a user-visible problem, we then proceed to diagnose the root cause of the problem as one of the faults in our fault model using an approach based on decision trees. Finally, based on the diagnosed fault, we apply an automated mitigation strategy, which forces the device to associate with a different access point that will likely provide better performance. To validate our approach and demonstrate its real-world impact, we have conducted a number of studies to collect data in support of our approach from both a laboratory testbed and real-world production Wi-Fi networks. We used our instrumentation and data-collection system to obtain data from over 25 real-world, large-scale, high-density Wi-Fi networks located within collegiate and professional stadiums. Our diagnostic system was deployed in a real-world mobile video-streaming application used over the Wi-Fi networks in these stadiums. Using this data, we determined the thresholds for when a Wi-Fi performance problem becomes user visible, based on our study of when users disengage from using the video-streaming application in the face of buffering. In addition to obtaining real-world data, we have studied this phenomenon in a testbed for fault injection and diagnosis that has been deployed both in a lab environment and in an arena to collect data on the behavior of large-scale, high-density Wi-Fi networks and understand how best to diagnose problems. Using this testbed, we evaluated the performance of our problem-diagnosis approach in terms of its precision and recall on injected faults. We also evaluated the performance of our mitigation strategy on our testbed by injecting faults and verifying that the selected mitigation strategy successfully mitigated the problem caused by that fault. 
We found that our approach diagnoses the correct root cause of faults with high precision and recall (often above 90%) and can mitigate problems via alternative access-point selection in 100% of our test cases. While we have studied our approach in certain test environments and for video-streaming applications, we believe that our approach can be applied to any Wi-Fi network and to many applications other than video streaming. Our work in this dissertation could be extended through the automated discovery of the parameters for our diagnosis and mitigation algorithms that provide the best performance in other Wi-Fi networks, along with further studies of how Wi-Fi performance problems manifest in other types of applications and under what conditions users disengage from those applications due to problems.
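
As an illustration of how precision and recall on injected faults can be computed, the following is a minimal sketch and not the evaluation code used in this work. It assumes each testbed trial yields one injected fault label and one diagnosed fault label; the function name precision_recall and the fault names in the example are hypothetical and are not taken from our fault model.

```python
from collections import Counter


def precision_recall(injected, diagnosed):
    """Return {fault: (precision, recall)} from paired trial labels.

    injected[i] is the fault injected in trial i and diagnosed[i] is the
    fault reported for that trial (both are hypothetical label strings).
    """
    if len(injected) != len(diagnosed):
        raise ValueError("each trial needs one injected and one diagnosed label")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(injected, diagnosed):
        if truth == guess:
            tp[truth] += 1   # correct diagnosis of this fault
        else:
            fp[guess] += 1   # fault reported but not the one injected
            fn[truth] += 1   # injected fault that was missed

    results = {}
    for fault in sorted(set(injected) | set(diagnosed)):
        reported = tp[fault] + fp[fault]
        actual = tp[fault] + fn[fault]
        precision = tp[fault] / reported if reported else 0.0
        recall = tp[fault] / actual if actual else 0.0
        results[fault] = (precision, recall)
    return results


# Illustrative labels only; these are not results from the dissertation.
injected = ["interference", "interference", "weak_signal", "weak_signal", "ap_overload"]
diagnosed = ["interference", "weak_signal", "weak_signal", "weak_signal", "ap_overload"]
scores = precision_recall(injected, diagnosed)
```

Under these illustrative labels, precision for a fault is the fraction of trials diagnosed as that fault in which it was actually injected, and recall is the fraction of trials injecting that fault that were diagnosed correctly.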