Vietoris-Rips filtration tutorial¶
This tutorial is an example of calculating a persistence diagram using Vectoris-Rips filtration.
Vietoris-Rips filtration offers the following advantages.
- It can be calculated with only the distance information between two points(distance matrix)
- It can be calculated in spaces other than Euclidean space like distance information between genes.
- alpha filtration can currently only be calculated in two or three dimensions, but can also be calculated in higher dimensional points.
On the other hand, there are the following disadvantages.
- High computation cost due to combination explosion.
- Especially difficult to calculate higher order homology.
- Therefore, it is mandatory to limit the order to be calculated.
- Also, sides that are larger than a certain distance may need to be ignored for practical calculation time.
- Only approximately reflects the geometric information of the data.
- There is a considerable deviation from how to place the ball at each point.
First load the required libraries
import numpy as np
import homcloud.interface as hc
from scipy.spatial import distance_matrix
import matplotlib.pyplot as plt
%matplotlib inline
Prepare 100 random 200-dimensional data
data = np.random.uniform(size=(100, 200))
Calculate the distance matrix using scipy.spatial.distance_matrix
.
dmatrix = distance_matrix(data, data)
distance matrix is a 100x100 matrix.
dmatrix.shape
(100, 100)
Calculation of persistence diagram¶
Calculate the persistence diagram with homcloud.interface.from_rips_filtration
. The meanings of the arguments is as follows.
- First argument is distance matrix
- Specify maximum order with
maxdim
. - Specify save destination with
save_to
. - Although not specified here, specify the maximum value of birth/death time handled by
maxvalue
.
%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=3, save_to="rips.pdgm")
CPU times: user 963 ms, sys: 111 ms, total: 1.07 s Wall time: 1.07 s
PDList(path=rips.pdgm)
The calculation time varies considerably depending on what maxdim is used.
- 2 - moment
- 3 - 5-10 sec
- 4 - 2min
- Any more hasn't be tried. Expected to increase exponentially with respect to maxdim.
These calculations are calculated by a program called risper. This risper is (probably) the fastest program for calculating persistent diagrams with Vietoris-Rips filtration.
Visualization of persistence diagrams¶
Load the persistence diagram as follows.
pdlist = hc.PDList("rips.pdgm")
Visualize the first-order persistence diagram
pd1 = pdlist.dth_diagram(1)
pd1.histogram().plot(colorbar={"type": "log"})
Visualize second-order persistence diagram.
pd2 = pdlist.dth_diagram(2)
pd2.histogram().plot(colorbar={"type": "log"})
pd3 = pdlist.dth_diagram(3)
pd3.histogram().plot(colorbar={"type": "log"})
The effect of maxvalue¶
Now, let's see what happens if you specify maxvalue
when calculating PD here. First, let's look at the histogram to see how the distance matrix values are distributed.
Use numpy.triu_indices
to extract only the upper half of the distance matrix.
plt.hist(dmatrix[np.triu_indices(100, 1)], 100); None
It seems to be distributed between 5 and 6.5 .
Here, let's calculate with 5.6 as the upper limit.
%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=3, maxvalue=5.6, save_to="rips-5.6.pdgm")
CPU times: user 8.43 ms, sys: 645 µs, total: 9.08 ms Wall time: 8.31 ms
PDList(path=rips-5.6.pdgm)
As you can see, the calculation time is much faster than before.
Then let's plot PD.
pdlist56 = hc.PDList("rips-5.6.pdgm")
pd56_1 = pdlist56.dth_diagram(1)
pd56_1.histogram().plot(colorbar={"type": "log"})
pd56_2 = pdlist56.dth_diagram(2)
pd56_2.histogram().plot(colorbar={"type": "log"})
pd56_3 = pdlist56.dth_diagram(3)
pd56_3.histogram().plot(colorbar={"type": "log"})
The maximum value of the PD range is 5.6. Let's compare it with no maxvalue.
The left column PDs have infinite maxvalue, and the right column PDs have 5.6 maxvalue. The first, second, and third rows correspond to the 0th, 1st, and 2nd PDs.
fig, axes = plt.subplots(3, 2, figsize=(14, 14))
pd1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, ax=axes[0][0])
pd56_1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, ax=axes[0][1])
pd2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[1][0])
pd56_2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[1][1])
pd3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[2][0])
pd56_3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, ax=axes[2][1])
If you specify maxvalue, you can see that both birth and death are cut off at 5.6 . Both parts below 5.6 have the same histogram.
So where did these cut birth-death pairs go?
In fact, part of it is a birth-death pair with ∞ death time.
The birth time of such a birth-death pair is stored in the attribute essential_births
.
pd2.essential_births
array([], dtype=float64)
pd56_2.essential_births
array([5.59863234, 5.59797287, 5.59765339, 5.59693623, 5.59558058, 5.59281778, 5.59062338, 5.5886116 , 5.58736086, 5.58697224, 5.58520603, 5.58346081, 5.58338881, 5.58194208, 5.57865238, 5.57865238, 5.57846498, 5.57749414, 5.57318592, 5.57309628, 5.57207203, 5.57063961, 5.56938648, 5.56735373, 5.56732082, 5.56546545, 5.56084299, 5.55829144, 5.55690098, 5.55544519, 5.55516768, 5.55488634, 5.55367947, 5.55089426, 5.54538584, 5.54367685, 5.53900909, 5.53416777, 5.5338974 , 5.51954889, 5.51243114, 5.5016222 , 5.49529028, 5.48711061, 5.4784236 ])
Thus, if you do not specify maxvalue, essential_births is empty, but if you do, you will get some values. birth-death pair with birth less than 5.6 and death more than 5.6 move here. If both birth and death are 5.6 or more, those birth-death pairs are erased without trace.
You can plot diagrams with birth-death pairs whose death times are infinity as follows.
pd56_1.histogram((5.1, 5.7)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True)
pd56_2.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True)
pd56_3.histogram((5.3, 5.9)).plot(colorbar={"type": "log", "max": 10}, plot_ess=True)
Inverse analysis (optimal 1-cycle)¶
Each point on the persistent diagram corresponds to a homological structure of the original data (rings in 1D, voids in 2D, etc.). If you can extract the structure, it will be useful for data analysis. In the case of alpha complexes, this can be done by using optimal volume. However, extracting the structure of the Vietoris-Rips filtration is difficult because of the large number of simplices. However, for 1D PDs, it is relatively easy to extract such structures by certain approximations. In this section, we will explain how to extract such a structure (called the optimal 1-cycle).
To compute the optimal 1-cycle, you need to add the argument save_graph=True
when calling from_rips_filtration
. The maxdim should be set to 1 since only 1D PD is used.
%%time
hc.PDList.from_rips_filtration(dmatrix, maxdim=1, save_graph=True, save_to="rips_with_graph.pdgm")
CPU times: user 5.29 ms, sys: 1.29 ms, total: 6.57 ms Wall time: 5.91 ms
PDList(path=rips_with_graph.pdgm)
1D PD is loaded and plotted.
pd1 = hc.PDList("rips_with_graph.pdgm").dth_diagram(1)
pd1.histogram().plot(colorbar={"type": "log"})
Pick up the point closest to (5.3, 5.6).
pair = pd1.nearest_pair_to(5.3, 5.6)
pair
Pair(5.271453380584717, 5.52301549911499)
The optimal 1-cycle is computed. The computation cost is not so expensive.
optimal_1_cycle = pair.optimal_1_cycle()
We call the method boundary_point
to get the points on the corresponding ring. Each point is represented by a number, which is assigned in 0-origin ascending order.
optimal_1_cycle.boundary_points()
[92, 18, 24, 6, 15, 8, 32]
Remark¶
When calculating the optimal 1-cycle, you may get the error Pairs with the same birth time
. The function does not work if the non-symmetric elements of the distance matrix have the same value. If such an error occurs, adding a small noise to the distance matrix will do the trick. Add the noise while maintaining the symmetry.
This concludes this tutorial.