Topological data analysis


In applied mathematics, topological data analysis is an approach to the analysis of datasets using techniques from topology. Extraction of information from datasets that are high-dimensional, incomplete and noisy is generally challenging. TDA provides a general framework to analyze such data in a manner that is insensitive to the particular metric chosen and provides dimensionality reduction and robustness to noise. Beyond this, it inherits functoriality, a fundamental concept of modern mathematics, from its topological nature, which allows it to adapt to new mathematical tools.
The initial motivation is to study the shape of data. TDA has combined algebraic topology and other tools from pure mathematics to allow mathematically rigorous study of "shape". The main tool is persistent homology, an adaptation of homology to point cloud data. Persistent homology has been applied to many types of data across many fields. Moreover, its mathematical foundation is also of theoretical importance. The unique features of TDA make it a promising bridge between topology and geometry.

Basic theory

Intuition

TDA is premised on the idea that the shape of data sets contains relevant information. Real high-dimensional data is typically sparse, and tends to have relevant low dimensional features. One task of TDA is to provide a precise characterization of this fact. For example, the trajectory of a simple predator-prey system governed by the Lotka–Volterra equations forms a closed circle in state space. TDA provides tools to detect and quantify such recurrent motion.
Many algorithms for data analysis, including those used in TDA, require setting various parameters. Without prior domain knowledge, the correct collection of parameters for a data set is difficult to choose. The main insight of persistent homology is to use the information obtained from all parameter values by encoding this huge amount of information into an understandable and easy-to-represent form. With TDA, there is a mathematical interpretation when the information is a homology group. In general, the assumption is that features that persist for a wide range of parameters are "true" features. Features persisting for only a narrow range of parameters are presumed to be noise, although the theoretical justification for this is unclear.

Early history

Precursors to the full concept of persistent homology appeared gradually over time. In 1990, Patrizio Frosini introduced a pseudo-distance between submanifolds, and later the size function, which on 1-dimensional curves is equivalent to 0th persistent homology. Nearly a decade later, Vanessa Robins studied the images of homomorphisms induced by inclusion. Shortly thereafter, Herbert Edelsbrunner et al. introduced the concept of persistent homology together with an efficient algorithm and its visualization as a persistence diagram. Gunnar Carlsson et al. reformulated the initial definition and gave an equivalent visualization method called persistence barcodes, interpreting persistence in the language of commutative algebra.
Within algebraic topology, persistent homology emerged through the work of Sergey Barannikov on Morse theory. In 1994, under the name of canonical forms, Barannikov showed that the set of critical values of a smooth Morse function is canonically partitioned into "birth–death" pairs, classified the corresponding filtered complexes, and described their invariants, equivalent to the persistence diagram and persistence barcode, together with an efficient algorithm for their calculation.

Concepts

Some widely used concepts are introduced below. Note that some definitions may vary from author to author.
A point cloud is often defined as a finite set of points in some Euclidean space, but may be taken to be any finite metric space.
The Čech complex of a point cloud is the nerve of the cover of balls of a fixed radius around each point in the cloud.
A persistence module U indexed by ℝ is a vector space U_t for each t ∈ ℝ, and a linear map u_t^s : U_s → U_t whenever s ≤ t, such that u_t^t is the identity for all t and u_t^s = u_t^r ∘ u_r^s whenever s ≤ r ≤ t. An equivalent definition is a functor from ℝ, considered as a partially ordered set, to the category of vector spaces.
The persistent homology group PH_k of a point cloud X is the persistence module defined as PH_k(X) = ∏_r H_k(X_r), where X_r is the Čech complex of radius r of the point cloud X and H_k is the k-th homology group.
A persistence barcode is a multiset of intervals in ℝ, and a persistence diagram is a multiset of points in Δ := {(u, v) ∈ ℝ² : u ≤ v}.
The Wasserstein distance between two persistence diagrams X and Y is defined as W_p[L_q](X, Y) := inf_{φ : X → Y} [ Σ_{x ∈ X} ‖x − φ(x)‖_q^p ]^{1/p}, where 1 ≤ p, q ≤ ∞ and φ ranges over bijections between X and Y. Please refer to figure 3.1 in Munch for an illustration.
The bottleneck distance between X and Y is W_∞[L_∞](X, Y) := inf_{φ : X → Y} sup_{x ∈ X} ‖x − φ(x)‖_∞. This is the special case of the Wasserstein distance obtained by letting p = q = ∞.
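These matching distances can be computed by brute force on small diagrams. Below is a minimal sketch, assuming the usual convention that unmatched points may be paired with their nearest diagonal point; all function names are illustrative:

```python
from itertools import permutations

def diag(p):
    # Nearest diagonal point {u = v} to p = (birth, death) in the L-infinity metric.
    m = (p[0] + p[1]) / 2.0
    return (m, m)

def bottleneck(X, Y):
    # Brute-force bottleneck distance between two small persistence diagrams,
    # given as lists of (birth, death) pairs.  Points may match each other or
    # the diagonal; exponential in diagram size, so purely didactic.
    Xa = [(p, False) for p in X] + [(diag(q), True) for q in Y]
    Ya = [(q, False) for q in Y] + [(diag(p), True) for p in X]

    def cost(a, b):
        (p, p_on_diag), (q, q_on_diag) = a, b
        if p_on_diag and q_on_diag:      # diagonal-to-diagonal matches are free
            return 0.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))   # L-infinity norm

    return min(max(cost(a, b) for a, b in zip(Xa, perm))
               for perm in permutations(Ya))

X = [(0.0, 4.0), (1.0, 2.0)]
Y = [(0.0, 4.5)]
print(bottleneck(X, Y))   # 0.5: (0,4)->(0,4.5); (1,2) is matched to the diagonal
```

Production implementations replace the permutation search with bipartite matching algorithms, which is what makes the distance usable on real diagrams.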

Basic property

Structure theorem

The first classification theorem for persistent homology appeared in 1994 via Barannikov's canonical forms. The classification theorem interpreting persistence in the language of commutative algebra appeared in 2005: for a finitely generated persistence module C with coefficients in a field F,
H(C; F) ≅ ⊕_i x^{t_i} · F[x] ⊕ (⊕_j x^{r_j} · (F[x] / (x^{s_j} · F[x]))).
Intuitively, the free parts correspond to the homology generators that appear at filtration level t_i and never disappear, while the torsion parts correspond to those that appear at filtration level r_j and last for s_j steps of the filtration.
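The correspondence between summands and bars can be made concrete: a persistent Betti number simply counts the bars that span a given range of the filtration. A minimal sketch, assuming the half-open bar convention [birth, death) and an illustrative function name:

```python
def persistent_betti(bars, r, s):
    # beta^{r,s}: the rank of the map H(X_r) -> H(X_s), i.e. the number of
    # bars (b, d) already born at r and still alive at s (b <= r and s < d).
    assert r <= s
    return sum(1 for (b, d) in bars if b <= r and s < d)

bars = [(0.0, float("inf")), (0.0, 0.5), (1.0, 3.0)]   # a toy barcode
print(persistent_betti(bars, 1.0, 2.0))   # 2: the infinite bar and (1.0, 3.0)
```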
Persistent homology is visualized through a barcode or persistence diagram. The barcode has its root in abstract mathematics. Namely, the category of finite filtered complexes over a field is semi-simple. Any filtered complex is isomorphic to its canonical form, a direct sum of one- and two-dimensional simple filtered complexes.

Stability

Stability is desirable because it provides robustness against noise. If X is any space which is homeomorphic to a simplicial complex, and f, g : X → ℝ are continuous tame functions, then the persistence vector spaces {H_k(f⁻¹([0, r]))} and {H_k(g⁻¹([0, r]))} are finitely presented, and W_∞(D(f), D(g)) ≤ ‖f − g‖_∞, where W_∞ refers to the bottleneck distance and D is the map taking a continuous tame function to the persistence diagram of its k-th homology.

Workflow

The basic workflow in TDA is:
  1. If X is a point cloud, replace X with a nested family of simplicial complexes X_r (such as the Čech or Vietoris–Rips complex). This process converts the point cloud into a filtration of simplicial complexes. Taking the homology of each complex in this filtration gives a persistence module H_k(X_{r_1}) → H_k(X_{r_2}) → H_k(X_{r_3}) → ⋯
  2. Apply the structure theorem to obtain the persistent Betti numbers, persistence diagram, or equivalently, barcode.
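The two steps above can be sketched end-to-end in the simplest case, 0-dimensional homology of a Vietoris–Rips filtration, where the bars come out of a Kruskal-style union-find over edges sorted by length. This is a didactic sketch, not an optimized implementation:

```python
from itertools import combinations
from math import dist, inf

def persistence_h0(points):
    # 0-dimensional persistent homology of the Vietoris-Rips filtration.
    # Every point is born at scale 0; a component dies at the length of the
    # edge that first merges it into another component.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = sorted((dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # this edge merges two components
            bars.append((0.0, length))
            parent[max(ri, rj)] = min(ri, rj)
    bars.append((0.0, inf))          # one component never dies
    return bars

# Two well-separated clusters: one long bar reveals the two-cluster structure.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
for bar in sorted(persistence_h0(pts), key=lambda b: b[1]):
    print(bar)
```

Each finite bar records the scale at which a cluster merges into an older one; the single infinite bar corresponds to the component that remains at every scale.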

Computation

The first algorithm over all fields for persistent homology in the setting of algebraic topology was described by Barannikov, through reduction to canonical form by upper-triangular matrices. The algorithm for persistent homology over ℤ₂ was given by Edelsbrunner et al. Afra Zomorodian and Carlsson gave the practical algorithm to compute persistent homology over all fields. Edelsbrunner and Harer's book gives general guidance on computational topology.
One issue that arises in computation is the choice of complex. The Čech complex and the Vietoris–Rips complex are most natural at first glance; however, their size grows rapidly with the number of data points. The Vietoris–Rips complex is preferred over the Čech complex because its definition is simpler and the Čech complex requires extra effort to define in a general finite metric space. Efficient ways to lower the computational cost of homology have been studied. For example, the α-complex and witness complex are used to reduce the dimension and size of complexes.
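The relationship between the two complexes can be seen on a single triangle: with balls of radius r, a Vietoris–Rips triangle needs only pairwise distances ≤ 2r, while a Čech triangle needs the three balls to share a common point, i.e. a minimal enclosing ball of radius ≤ r. A small sketch under that convention (function names are illustrative):

```python
from math import dist, sqrt

def miniball_radius(a, b, c):
    # Radius of the smallest ball enclosing three points in the plane:
    # half the longest side if the triangle is obtuse (or degenerate),
    # otherwise the circumradius R = xyz / (4 * area).
    x, y, z = sorted([dist(a, b), dist(b, c), dist(a, c)])
    if x * x + y * y <= z * z:              # obtuse: longest side is a diameter
        return z / 2.0
    s = (x + y + z) / 2.0                   # Heron's formula for the area
    area = sqrt(s * (s - x) * (s - y) * (s - z))
    return x * y * z / (4.0 * area)

def in_rips(tri, r):
    # Vietoris-Rips: the triangle appears once all pairwise distances are <= 2r.
    a, b, c = tri
    return max(dist(a, b), dist(b, c), dist(a, c)) <= 2 * r

def in_cech(tri, r):
    # Cech: the triangle appears once the three radius-r balls have a common
    # point, i.e. the minimal enclosing ball has radius <= r.
    return miniball_radius(*tri) <= r

tri = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]    # equilateral, side 1
r = 0.55                                              # between 1/2 and 1/sqrt(3)
print(in_rips(tri, r), in_cech(tri, r))               # True False
```

For scales between 1/2 and 1/√3 the triangle is present in the Vietoris–Rips complex but not yet in the Čech complex, illustrating how the two filtrations interleave.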
Recently, discrete Morse theory has shown promise for computational homology because it can reduce a given simplicial complex to a much smaller cellular complex which is homotopy equivalent to the original one. This reduction can in fact be performed as the complex is constructed, by using matroid theory, leading to further performance increases. Another recent algorithm saves time by ignoring the homology classes with low persistence.
Various software packages are available, such as javaPlex, Dionysus, Perseus, PHAT, DIPHA, GUDHI, Ripser, and TDAstats. A comparison between these tools is done by Otter et al. Giotto-tda is a Python package dedicated to integrating TDA in the machine learning workflow by means of a scikit-learn API. The R package TDA is capable of calculating recently invented concepts like landscape and the kernel distance estimator. The Topology ToolKit is specialized for continuous data defined on manifolds of low dimension, as typically found in scientific visualization. Cubicle is optimized for large grayscale image data in dimension 1, 2 or 3, using cubical complexes and discrete Morse theory. Another R package, TDAstats, uses the Ripser library to calculate persistent homology.

Visualization

High-dimensional data is impossible to visualize directly. Many methods have been invented to extract a low-dimensional structure from the data set, such as principal component analysis and multidimensional scaling. However, the problem itself is ill-posed, since many different topological features can be found in the same data set. Thus, the study of visualization of high-dimensional spaces is of central importance to TDA, although it does not necessarily involve the use of persistent homology. Recent attempts have nevertheless been made to use persistent homology in data visualization.
Carlsson et al. have proposed a general method called MAPPER. It inherits the idea of Jean-Pierre Serre that a covering preserves homotopy. A generalized formulation of MAPPER is as follows:
Let X and Z be topological spaces and let f : X → Z be a continuous map. Let U = {U_α} be a finite open covering of Z. The output of MAPPER is the nerve of the pullback cover M(U, f) := N(f⁻¹(U)), where each preimage f⁻¹(U_α) is split into its connected components. This is a very general concept, of which the Reeb graph and merge trees are special cases.
This is not quite the original definition. Carlsson et al. choose Z to be ℝ or ℝ², and cover it with open sets such that at most two intersect. This restriction means that the output is in the form of a complex network. Because the topology of a finite point cloud is trivial, clustering methods are used to produce the analogue of connected sets in the preimage when MAPPER is applied to actual data.
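A toy version of this construction, with single-linkage clustering standing in for connected components and the first coordinate as filter function, can be sketched as follows (all names, thresholds, and the choice of cover are illustrative):

```python
from math import dist, cos, sin, pi

def cluster(idxs, points, eps):
    # Single-linkage clustering at threshold eps: the stand-in for taking
    # connected components of a preimage on finite data.
    clusters = []
    for i in idxs:
        hit = [c for c in clusters
               if any(dist(points[i], points[j]) <= eps for j in c)]
        merged = {i}.union(*hit) if hit else {i}
        clusters = [c for c in clusters if c not in hit] + [merged]
    return clusters

def mapper(points, f, intervals, eps):
    # Toy MAPPER: cover the filter range with overlapping intervals,
    # cluster each preimage, and connect clusters that share data points.
    nodes = []
    for (lo, hi) in intervals:
        pre = [i for i, p in enumerate(points) if lo <= f(p) <= hi]
        nodes.extend(cluster(pre, points, eps))
    edges = {(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges

# A sampled circle with the x-coordinate as filter: the output graph is a cycle.
points = [(cos(2 * pi * k / 20), sin(2 * pi * k / 20)) for k in range(20)]
intervals = [(-1.1, -0.3), (-0.5, 0.5), (0.3, 1.1)]
nodes, edges = mapper(points, lambda p: p[0], intervals, eps=0.7)
print(len(nodes), len(edges))   # 4 4
```

On this uniformly sampled circle the output is a four-node cycle, recovering the loop in the data; changing the cover or the clustering threshold can change the graph, which is the instability discussed below.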
Mathematically speaking, MAPPER is a variation of the Reeb graph. If M(U, f) is at most one dimensional, then for each k ≥ 0, H_k(X) ≅ H_0(N(U); F̂_k) ⊕ H_1(N(U); F̂_{k−1}), where F̂_k denotes the sheaf of k-th homology groups of the preimages. The added flexibility also has disadvantages. One problem is instability, in that some change of the choice of the cover can lead to a major change in the output of the algorithm. Work has been done to overcome this problem.
Three successful applications of MAPPER can be found in Carlsson et al. A comment on the applications in this paper by J. Curry is that "a common feature of interest in applications is the presence of flares or tendrils".
A free implementation of MAPPER written by Daniel Müllner and Aravindakshan Babu is available online. MAPPER also forms the basis of Ayasdi's AI platform.