Tutorial on Vietoris-Rips Filtration¶

This tutorial introduces how to compute persistence diagrams using the Vietoris-Rips filtration.

Advantages of Vietoris-Rips Filtration¶

Computable from a distance matrix: It can be computed as long as you have the distance information between all pairs of points.
- This makes it applicable not only in Euclidean spaces but also in non-Euclidean spaces, such as for measuring distances between genes.
Handles high-dimensional data: While the alpha filtration is currently limited to 2 or 3 dimensions, the Vietoris-Rips filtration can be applied to higher-dimensional data.

Disadvantages and Remarks¶

High computational cost: It suffers from a combinatorial explosion.
- Computing higher-order homology is particularly difficult.
- For practical applications, it's necessary to limit the maximum dimension of homology to be computed.
- It is also a common practice to ignore edges above a certain distance threshold to reduce computation time.
Approximates geometric information: The method only provides an approximation of the geometric features.
- Compared to an alpha filtratoin, it may not accurately reflect the underlying geometric structure.
- For this reason, you should take care when interpreting the results.

Preparation¶

Loads the necessary libraries.

In [2]:

import numpy as np
import homcloud.interface as hc
from scipy.spatial import distance_matrix
import matplotlib.pyplot as plt

100 points in 200 dimensional space is randomly generated.

In [3]:

data = np.random.uniform(size=(100, 200))

We will compute a distance matrix using scipy.spatial.distance_matrix.

In [4]:

dmatrix = distance_matrix(data, data)

The shape of the matrix is 100x100. Each element is non-negative value.

In [5]:

dmatrix.shape

Out[5]:

(100, 100)

Computing Persistence Diagrams¶

You can compute persistence diagrams from a distance matrix via a Vietoris-Rips filtration using homcloud.interface.PDList.from_rips_filtration. The typical arguments are as follows:

First argument: A distance matrix (a symmetric matrix storing distance information between pairs of points)
maxdim: The maximum dimension of homology to compute (e.g., 1 computes 0-dimensional and 1-dimensional homology)
save_to: The filename to save the computed persistence diagrams
maxvalue (optional): Specifies the maximum distance value (edges exceeding this value will be ignored)

In [6]:

%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=3, save_to="rips.pdgm")

CPU times: user 1.17 s, sys: 62.1 ms, total: 1.23 s
Wall time: 1.22 s

Out[6]:

PDList(path=rips.pdgm)

Relationship Between Computation Time and `maxdim`¶

The computation time for persistence diagrams varies significantly depending on the value of maxdim:

maxdim = 2: Completes almost instantly
maxdim = 3: Takes about 5 to 10 seconds
maxdim = 4: Around 2 minutes
maxdim > 4: Not yet tested, but the computational complexity is expected to grow exponentially with respect to maxdim

As shown above, the higher the dimension, the more rapidly the computational cost increases. Therefore, it is important to appropriately limit maxdim for practical use.

These computations are performed using Ripser, a high-speed library for persistent homology for Vietoris-Rips filtrations. Ripser is currently considered one of the fastest implementations for computing persistence diagrams based on Vietoris-Rips filtrations.

パーシステンス図の可視化¶

We will load the persistence diagrams.

In [7]:

pdlist = hc.PDList("rips.pdgm")

Visualize the 1th persistence diagram.

In [8]:

pd1 = pdlist.dth_diagram(1)
pd1.histogram().plot(colorbar={"type": "log"})

No description has been provided for this image

The 2nd, and third diagrams are also visualized.

In [9]:

pd2 = pdlist.dth_diagram(2)
pd2.histogram().plot(colorbar={"type": "log"})

In [10]:

pd3 = pdlist.dth_diagram(3)
pd3.histogram().plot(colorbar={"type": "log"})

Effect of `maxvalue`¶

We examine the effect of maxvalue when calculating persistence diagrams.

We visualize the histogram of distances in the distanse matrix as follows.

In [11]:

plt.hist(dmatrix[np.triu_indices(100, 1)], 100); None

The distances are distributed between 5 and 6.5.

We calculate persistence diagrams with patameter maxvalue=5.6.

In [12]:

%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=3, maxvalue=5.6, save_to="rips-5.6.pdgm")

CPU times: user 15.6 ms, sys: 988 μs, total: 16.6 ms
Wall time: 16.5 ms

Out[12]:

PDList(path=rips-5.6.pdgm)

The computation time becomes much shorter.

We will plot the persistence diagrams when maxvalue is given.

In [13]:

pdlist56 = hc.PDList("rips-5.6.pdgm")

In [14]:

pd56_1 = pdlist56.dth_diagram(1)
pd56_1.histogram((5.1, 5.7)).plot(colorbar={"type": "log"})
pd56_2 = pdlist56.dth_diagram(2)
pd56_2.histogram((5.3, 5.9)).plot(colorbar={"type": "log"})
pd56_3 = pdlist56.dth_diagram(3)
pd56_3.histogram((5.3, 5.9)).plot(colorbar={"type": "log"})

The upper bound of birth and death times is 5.6.

We will plot the diagrams with and without maxvalue. The left column has PDs without maxvalue, and the right has PDs with maxvalue. The first, second, and third rows have 0th, 1st, and 2nd persistence diagrams, respectively.

In [15]:

fig, axes = plt.subplots(3, 2, figsize=(14, 14)) 
pd1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, ax=axes[0][0])
pd56_1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, ax=axes[0][1])
pd2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[1][0])
pd56_2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[1][1])
pd3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[2][0])
pd56_3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[2][1])

The distributions of pairs below 5.6 are exactly the same. The pairs above 5.6 are completely vanished when maxvalue is given.

So where have those pairs gone? Indeed, the pairs become essential pair if the birth times are less than 5.6. Other pairs are completely ignored.

A pair is defined essential if the death time is infinity. The birth times of the essential pairs is stored in essential_births attribute.

The following cells shows the attribute.

In [16]:

pd2.essential_births

Out[16]:

array([], dtype=float64)

In [17]:

pd56_2.essential_births

Out[17]:

array([5.59813023, 5.59808731, 5.59780836, 5.59765434, 5.59636068,
       5.59636068, 5.59493494, 5.59493494, 5.59484053, 5.59393644,
       5.59317541, 5.59219551, 5.58871603, 5.58834219, 5.58564186,
       5.58529902, 5.58446312, 5.5807929 , 5.57627106, 5.57603884,
       5.57436275, 5.56771517, 5.56705999, 5.56274271, 5.56092215,
       5.55802393, 5.55626774, 5.55130339, 5.55130339, 5.5430665 ,
       5.53529167, 5.53497267, 5.53283215, 5.50469589, 5.49289036])

essential_births is empty array if maxvalue is not given. On the other hand, maxvalue is given, essential_births has some values.

This means that the pairs with birth < 5.6 and death > 5.6 changes from normal pairs to essential pairs.

The following code plots the essential pairs with normal pairs.

In [18]:

pd56_1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True, title="1D")
pd56_2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True, title="2D")
pd56_3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True, title="3D")

Inverse analysis¶

逆解析について¶

Each point on the persistence diagram corresponds to a homology structure (e.g., a ring in one dimension, a hole in two dimensions) that exists in the original data. If these structures can be extracted, it is possible to gain a deeper understanding of the geometric and topological features of the data, which is extremely useful for analysis.

When using alpha filtrations, it is possible to extract the corresponding structures relatively efficiently using optimal volume and stable volume methods.

However, because the number of simplices in Vietoris-Rips complexes is very large, these inverse analysis methods are computationally expensive and often impractical.

For that case, if the diagram is a 1st persistence diagram, we can effectively extract the ring structures using shortest path algorithm for graphs.

This tutorial uses this approach, called optimal 1-cycle.

It should be noted that the rings obtained by the approach are approximate structures and do not fully reflect the persistent homology information.

We need to give save_graph=True argument when calling from_rips_filtration to compute optimal 1-cycles later. The argument saves the graph to a .pdgm file.

In [19]:

%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=1, save_graph=True, save_to="rips_with_graph.pdgm")

CPU times: user 11.1 ms, sys: 4 μs, total: 11.1 ms
Wall time: 11.1 ms

Out[19]:

PDList(path=rips_with_graph.pdgm)

The 1st persistence diagram is plotted.

In [20]:

pd1 = hc.PDList("rips_with_graph.pdgm").dth_diagram(1)

In [21]:

pd1.histogram().plot(colorbar={"type": "log"})

The pair nearest to (5.3, 5.6) is extracted.

In [22]:

pair = pd1.nearest_pair_to(5.3, 5.6)
pair

Out[22]:

Pair(5.344056129455566, 5.576271057128906)

The optimal 1-cycle is computed. The computation time is reasonably short.

In [23]:

optimal_1_cycle = pair.optimal_1_cycle()

boundary_points method returns the indices of points on the cycle.

In [24]:

optimal_1_cycle.boundary_points()

Out[24]:

[99, 46, 95, 81, 14, 98]

Remark¶

When calculating the optimal 1-cycle, an error Pairs with the same birth time may occur. This happens when there are two same distances.

You can avoid the error by adding very small noise to the distances.

This tutorial ends here.