Interpretable Learning and Pattern Mining: Scalable Algorithms and Data-Driven Applications
Thesis posted on 10.07.2020 by Amin Hosseininasab
In recent years, data-driven methodologies have enjoyed great success, due in large part to the increasing accessibility of highly accurate machine learning tools. The challenge, however, is that such tools generally offer little interpretability in their learning tasks. This is a significant concern in many applications, such as scenarios with ethical implications or high risk. On the other hand, current interpretable learning algorithms are typically less accurate for prediction tasks and do not scale to large datasets. This motivates the theme of this dissertation: to develop interpretable, yet accurate and scalable, learning algorithms and apply them to real-life data-driven applications in management science.

As our first problem, we investigate the Multiple Sequence Alignment (MSA) problem, where the aim is to learn the optimal alignment between sequences of data. Applications of MSA include bioinformatics, high-frequency trading, speech recognition, and computer vision. Although higher-quality alignments offer significantly more insight for practitioners, most MSA algorithms are heuristic, and have been shown to often generate alignments far below the accuracy of an optimal solution on benchmark instances. In fact, only one viable exact algorithm has been developed for MSA, and it is limited to solving small instances with five sequences and a total of 600 data entries. Using tools from dynamic programming, mathematical programming, constraint programming, and decomposition techniques, we develop a novel exact alignment algorithm capable of aligning up to ten sequences and 1,600 total data entries. Our method closes 37 out of 51 real-life benchmark instances to optimality for the first time, and considerably improves alignment quality on the remaining instances.
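As background for the alignment setting, a minimal dynamic-programming sketch for aligning two sequences can help fix ideas. This is the classical pairwise recursion that exact MSA methods generalize to many sequences; the scoring values below are arbitrary assumptions for illustration, not the dissertation's model.

```python
# Pairwise sequence alignment by dynamic programming (Needleman-Wunsch
# style). Illustrative only: exact MSA aligns many sequences at once,
# and the match/mismatch/gap scores here are made-up assumptions.

def align_score(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning the prefix a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # align a[:i] against all gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # align b[:j] against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match/mismatch
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[n][m]
```

The table grows exponentially in the number of sequences, which is why exact MSA requires the mathematical-programming and decomposition machinery developed in the dissertation rather than plain dynamic programming.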
In the next chapter, we develop novel techniques for constraint-based sequential pattern mining (SPM). SPM is an unsupervised learning task that involves finding frequent patterns in sequential data. Frequent patterns are used, for example, to extract knowledge from data and to develop novel association rules. Constraint satisfaction is often critical in the practice of SPM, to prevent overwhelming the practitioner with millions of uninteresting patterns. Unfortunately, the literature accommodates only simple constraints, and is further limited to databases with up to a million data entries. Using novel modelling techniques, we increase the scalability of our SPM algorithm by an order of magnitude, and further design and prove constraint-specific inference for a number of complex constraints common in practice. Our algorithm is the first to handle complex constraints such as average and median, and is also competitive with, or more efficient than, state-of-the-art SPM algorithms restricted to simple constraints.
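To make the SPM setting concrete, the following is a brute-force toy sketch of constraint-based mining: patterns are counted as frequent if enough sequences contain them as subsequences, and an additional "average" constraint filters the result. The database, item prices, and thresholds are invented for illustration; the dissertation's algorithm is far more scalable than this enumeration.

```python
from itertools import permutations

# Toy constraint-based SPM sketch (not the dissertation's algorithm).
# A pattern is frequent if at least min_support sequences contain it
# as a subsequence; an average-price constraint (one of the "complex"
# constraint types) then filters the frequent patterns. All data here
# is made up for illustration.

prices = {"a": 10, "b": 30, "c": 50}
database = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"]]

def is_subsequence(pattern, sequence):
    it = iter(sequence)
    # each 'in' call consumes the iterator, preserving order
    return all(item in it for item in pattern)

def mine(db, length=2, min_support=2, max_avg_price=35):
    items = sorted({x for seq in db for x in seq})
    result = {}
    for pattern in permutations(items, length):
        support = sum(is_subsequence(pattern, seq) for seq in db)
        avg = sum(prices[x] for x in pattern) / length
        if support >= min_support and avg <= max_avg_price:
            result[pattern] = support
    return result
```

Even this tiny example shows why constraints matter: without the average-price filter, the pattern (b, c) would be reported despite being uninteresting to a practitioner who only cares about low-priced item sequences.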
We next study data-driven sequential decision making, with an emphasis on interpretability and on the sequential structure captured by SPM. Using a novel data tree model of the database, we scale SPM algorithms from databases with ten million data entries to databases with three billion.
We leverage data trees to design a pattern mining algorithm capable of extracting novel sequential patterns from large datasets, and design an interpretable knowledge tree, equipped with statistical hypothesis tests, to increase reliability in data-driven decision making. Using our approach, we investigate two large-scale real-world applications in marketing and finance. In marketing, we consider reducing the skip rate of users on an online music streaming platform. We find that almost all one billion user skips in the database can be explained using an average of 6,400 sequential patterns, with an average likelihood of 83%. In finance, we assess using historical sequential patterns of price changes to aid investment decision making in the stock market. We find that, at best, 80% of nine hundred thousand monthly price change events can be explained using approximately 7,000 sequential patterns, with a low average likelihood of 53%.

In the final chapter, we study the provider network selection and insurance design problem faced by a major healthcare insurance provider. The provider network consists of physicians and hospitals
that are under contract with the insurer and offer a wide range of healthcare services to insured patients. The problem involves choosing or changing the physicians and hospitals to contract, and designing insurance plans to target patients under competition with a rival firm. We provide a novel methodology that incorporates the literature's interpretable utility approach for patient behavior into a simultaneous multi-column-and-row generation optimization framework. By optimizing the provider network as a whole, we compose higher-quality insurance plans that increase the profit of the insurer by an average of 548% on test instances, while decreasing overall patient healthcare costs by 36% and the overall premiums paid by patients by 21%. We show how interpretable information can be extracted from the optimization framework to aid decision making, and lastly analyze the impact of inaccurate patient predictions on the insurer. Our results show that a more accurate yet interpretable predictor of patient choice can greatly enhance insurance plan design.
Department: Tepper School of Business
Degree: Doctor of Philosophy (PhD)