By Colleen M. Farrelly, Kaplan University
Real data can be messy, and, in many fields, it can include very few observations, making statistical analyses a challenge and pushing machine learning algorithms to their limit. Sometimes, data breaks the assumptions of most extant methods in statistics, complicating analysis and muddying the validity of results. However, recent developments in a field called topological data analysis (TDA) has provided a set of tools to wrangle messy and/or small data in a robust manner.
TDA--and the approach of applying topological concepts to statistical problems--is subfield of analytics developed from ideas in algebraic and differential topology. Though the theory is rooted in graduate level mathematics, a basic understanding and a package implementation of the algorithms allow it to be more accessible to data miners, particularly those with a strong mathematical background. This tutorial reviews two key concepts (along with associated R packages) and details an example problem of each with implementations in R. Many other topics and algorithms exist in TDA, and interested readers may find additional topics on my SlideShare site.
Topology is a field of mathematics studying properties of spaces that are constant under continuous changes (stretching, twisting…). Subfields of topology include differential topology (where spaces are studies through defined functions and their behavior on that space), algebraic topology (where spaces are classified either by the number of holes present or the ability to shrink paths to points), point set topology (determining properties of points and sequences in those spaces), and connections to differential geometry (determining if a space admits certain metrics and how metrics can be transformed into each other).Topology is also related to graph theory and knot theory. In recent years, discrete extensions in each branch have been investigated and developed that fit nicely within a data analysis framework.
A main concept underlying TDA is persistent homology. Homology counts the number of n-dimensional holes in a space (recorded as a sequence of numbers, called Betti numbers). For instance, a breakfast donut has two open circles (1st Betti number) and one connected component (0th Betti number), while a baseball has one open circle and one connected piece; thus, these are different topological spaces. Functions will behave differently on them. Persistent homology tracks these "holes" in data through slicing it through a series of lenses, analogous to how an MRI captures different layers of an organ, recording its findings as it examines the slice. Holes are tracked through slices to measure the lifetime of each hole, which relates to its importance as a feature. Two spaces can be topologically compared statistically by comparing diagrams of these holes (much like comparing histograms). The TDA package in R provides many functions for computing persistent homology (ripsDiag), plotting persistent homology (plot), and comparing diagrams (bottleneck, wasserstein). The comparison metrics obtained can be permuted or compared with a null distribution to test whether the distance between diagrams is a statistically-significant one.
One interesting property of persistence diagrams is the relationship between their 0th Betti number diagram and hierarchical clustering; single-linkage hierarchical clustering tracks these connected components over distance in a way that is equivalent to tracking connected components through persistent homology. Thus, depending on the goal of a project, one can use persistence diagrams or hierarchical clustering dendrograms to represent project results. Dendrograms offer a familiar way to present data, while retaining the same information used. An example is shown below (Figure 1) for a very simple psychometric measure of two traits (cut-off for persistence diagram slightly lower as a result of filter value choice).
This tool provided a way to analyze and validate a new psychometric survey, measuring identity through a scale that captures expression across contexts and identity components. This scale involved 91 items, and pilot data included 406 participants. Generally, psychometric tools require at least 5-10 participants for each item on the scale to assess validity and obtain stable results (ex. factor analysis); existing tools also assume a hierarchical structure within a psychometric survey, which does not exist for context-wise identity measure.
Single-linkage hierarchical clustering (pvclust R package) was used on repeated samples of 130 participants to obtain connected components at varying Euclidean distances between items on the scale; as was previously mentioned, this corresponds to tracking features using the 0th Betti numbers (Figure 2). Comparison across samples using a distance metric that can handle tree structures (Hausdorff distance) yielded differences similar to the simulated null distribution, suggesting that connected components did not differ across samples. Thus, subscales of the measure were consistent across samples, providing both subscale groupings for the identity survey and validation of the tool. This would not have been possible with existing methods, but TDA was able to handle both the relatively small sample size and the lack of survey structure.
Another key tool in topology is Morse functions. With properties highly compatible with use in a discrete setting, Morse functions are continuous functions with constraints on function behavior at peaks and valleys that make the function stable even under noise in a dataset. The hills and valleys defined by Morse functions on a topological space (such as a dataset) are used to partition the space, which essentially clusters data points originating from the same peaks and valleys. This is a bit like a soccer player dribbling a ball across a field, mentally noting the paths of his dribbles, such that during the game, he can understand into which region of the field a kicked ball will roll.
Clustering itself can be the goal-- such as with Reeb graphs, Morse-Smale clustering, or Mapper clustering--or it can be used to create partitioned regression models. Morse-Smale regression is a fairly new algorithm, which has shown good accuracy relative to many other machine learning algorithms on a variety of regression tasks. It first partitions a dataset and then fits an elastic net model to each partition. One of my recent papers details extensions of Morse-Smale regression to multivariate models fit to partitions. Thus, the algorithm can partition space, fit a model such as a tree or random forest to the partitions, and provide good visual insight into how the partitions differ with respect to predictors. This is particularly suited to insurance risk modeling and medical risk modeling (survival analysis in clinical trials, for instance). The R package msr contains the original algorithm, and additional Morse-Smale regression models can be built using this package with other regression functions, such as ranger (random forest) or evtree (evolved tree models).
Let’s consider a toy dataset, the Swedish 3rd party motor insurance dataset (found here: https://www.statsci.org/data/glm.html). The task is to predict 1977 payouts from insurance claims given several predictors. Not only would it be nice to have a predictive model with low error but also a model which provides insights into different types of claims. Morse-Smale regression allows for insight into subgroups of insurance claims and what predictors are important within those specific subgroups.
The original algorithm gives some insight into subgroup structure through its projection plots. In these, each group is represented by a line (Figure 3, left plot), and one of the three axes of the graph corresponds to the outcome function used to slice the data. The other two are created by projection methods (dimensionality reduction). Detailed plots show how the groups differ at each slice across each predictor in the model.
The models fit within each partition provide insight into which predictors are contributing to claim payout for that subgroup. Thus, from plotting the importance of random forest predictors within each subgroup, models can be compared across subgroups (Figure 3, right plot). This provides important insight into main drivers of claim payouts for different types of claims; from this, insurance policies might be modified to include tiers of risk related to those subgroups and the main predictors of claim payout within that subgroup.
Though it is a nascent branch of data analytics, TDA—and topology in general—have a lot to offer data science. Given its ability to study general properties of data and to extend extant methods (ameliorating their limitations), these methods are likely to flourish in the coming years. A few helpful resources are provided below for reader who would like to dive deeper into this field:
- Edelsbrunner, H., & Harer, J. (2008). Persistent homology-a survey. Contemporary mathematics, 453, 257-282.
- Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The analysis of bridging constructs with hierarchical clustering methods: An application to identity. Journal of Research in Personality, 70, 93-106.
- Farrelly, C. M. (2017). Extensions of Morse-Smale Regression with Application to Actuarial Science. arXiv preprint arXiv:1708.05712.
- Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse–smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
- Nielson, J. L., Paquette, J., Liu, A. W., Guandique, C. F., Tovar, C. A., Inoue, T., ... & Lum, P. Y. (2015). Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature communications, 6.
- Singh, G., Mémoli, F., & Carlsson, G. E. (2007, September). Topological methods for the analysis of high dimensional data sets and 3d object recognition. In SPBG (pp. 91-100).
- Wasserman, L. (2016). Topological Data Analysis. arXiv preprint arXiv:1609.08227.
Bio: Colleen M. Farrelly is a data scientist at Kaplan University whose areas of expertise includes topology/topological data analysis, ensemble learning, nonparametric statistics, manifold learning, and explaining mathematics to lay audiences (https://www.quora.com/profile/Colleen-Farrelly-1). When she isn’t doing data science, she is a poet and author (https://www.amazon.com/Colleen-Farrelly/e/B07832WQG7).
- Quantum Machine Learning: An Overview
- A Primer on Web Scraping in R
- Topological Data Analysis – Open Source Implementations