Machine learning on pointclouds¶

This tutorial shows how to apply machine learning methods combined with persistent homology to a pointcloud dataset.

This tutorial uses scikit-learn (sklearn): https://scikit-learn.org/stable/index.html.

In [1]:
import os  # for makedirs
import homcloud.interface as hc  # HomCloud interface
import numpy as np  # Numerical array library
from tqdm.notebook import tqdm  # For progressbar
import matplotlib.pyplot as plt  # Plotting
import sklearn.linear_model as lm  # Machine learning
from sklearn.decomposition import PCA  # for PCA
from sklearn.model_selection import train_test_split
import pyvista as pv  # for 3D visualization
In [2]:
%matplotlib inline

Before machine learning¶

We check the input data before applying machine learning methods. The data for this tutorial is prepared in the pc directory.

In [3]:
ls pc/
0000.txt  0026.txt  0052.txt  0078.txt  0104.txt  0130.txt  0156.txt  0182.txt
0001.txt  0027.txt  0053.txt  0079.txt  0105.txt  0131.txt  0157.txt  0183.txt
0002.txt  0028.txt  0054.txt  0080.txt  0106.txt  0132.txt  0158.txt  0184.txt
0003.txt  0029.txt  0055.txt  0081.txt  0107.txt  0133.txt  0159.txt  0185.txt
0004.txt  0030.txt  0056.txt  0082.txt  0108.txt  0134.txt  0160.txt  0186.txt
0005.txt  0031.txt  0057.txt  0083.txt  0109.txt  0135.txt  0161.txt  0187.txt
0006.txt  0032.txt  0058.txt  0084.txt  0110.txt  0136.txt  0162.txt  0188.txt
0007.txt  0033.txt  0059.txt  0085.txt  0111.txt  0137.txt  0163.txt  0189.txt
0008.txt  0034.txt  0060.txt  0086.txt  0112.txt  0138.txt  0164.txt  0190.txt
0009.txt  0035.txt  0061.txt  0087.txt  0113.txt  0139.txt  0165.txt  0191.txt
0010.txt  0036.txt  0062.txt  0088.txt  0114.txt  0140.txt  0166.txt  0192.txt
0011.txt  0037.txt  0063.txt  0089.txt  0115.txt  0141.txt  0167.txt  0193.txt
0012.txt  0038.txt  0064.txt  0090.txt  0116.txt  0142.txt  0168.txt  0194.txt
0013.txt  0039.txt  0065.txt  0091.txt  0117.txt  0143.txt  0169.txt  0195.txt
0014.txt  0040.txt  0066.txt  0092.txt  0118.txt  0144.txt  0170.txt  0196.txt
0015.txt  0041.txt  0067.txt  0093.txt  0119.txt  0145.txt  0171.txt  0197.txt
0016.txt  0042.txt  0068.txt  0094.txt  0120.txt  0146.txt  0172.txt  0198.txt
0017.txt  0043.txt  0069.txt  0095.txt  0121.txt  0147.txt  0173.txt  0199.txt
0018.txt  0044.txt  0070.txt  0096.txt  0122.txt  0148.txt  0174.txt  label.txt
0019.txt  0045.txt  0071.txt  0097.txt  0123.txt  0149.txt  0175.txt
0020.txt  0046.txt  0072.txt  0098.txt  0124.txt  0150.txt  0176.txt
0021.txt  0047.txt  0073.txt  0099.txt  0125.txt  0151.txt  0177.txt
0022.txt  0048.txt  0074.txt  0100.txt  0126.txt  0152.txt  0178.txt
0023.txt  0049.txt  0075.txt  0101.txt  0127.txt  0153.txt  0179.txt
0024.txt  0050.txt  0076.txt  0102.txt  0128.txt  0154.txt  0180.txt
0025.txt  0051.txt  0077.txt  0103.txt  0129.txt  0155.txt  0181.txt

There are 200 pointclouds, from 0000.txt to 0199.txt. The text file label.txt contains a label (0 or 1) for each pointcloud, so the pointclouds are classified into two groups by the labels.

We read the label data and show it.

In [4]:
labels = np.loadtxt("pc/label.txt")
labels
Out[4]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

This is an array of 0s and 1s. We show the indices of the 0-labeled and the 1-labeled data; NumPy makes this easy.

In [5]:
np.nonzero(labels == 0), np.nonzero(labels == 1)
Out[5]:
((array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
         51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
         68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
         85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),),
 (array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
         113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
         126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
         139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
         152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
         165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
         178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
         191, 192, 193, 194, 195, 196, 197, 198, 199]),))

The pointclouds from 0 to 99 are labeled 0, and those from 100 to 199 are labeled 1.

Read pointclouds¶

We read the pointcloud data from the text files.

In [6]:
pointclouds = [np.loadtxt("pc/{:04d}.txt".format(i)) for i in tqdm(range(200))]

The shape of the 0th data is 1000x3, meaning that 1000 points are distributed in $\mathbb{R}^3$. Every pointcloud in this dataset has the same number of points. If you want to apply machine learning to persistence diagrams, you should keep the numbers of points comparable across samples; when the numbers of points vary widely, learning often does not work well.

In [7]:
pointclouds[0].shape
Out[7]:
(1000, 3)
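As a quick sanity check (not part of the original notebook), we can confirm that every pointcloud has the same shape:

assert all(pc.shape == (1000, 3) for pc in pointclouds)  # all 200 pointclouds consist of 1000 points in R^3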

We visualize the 0th and the 100th data. PyVista's subplot functionality is used to plot them side by side.

In [8]:
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
pl.show()
(Output: interactive 3D view of the two pointclouds)

According to the labels, these two pointclouds belong to different groups, but it is difficult to spot the difference by eye.

Computing persistence diagrams¶

Now we compute the persistence diagrams with HomCloud. The results are saved into the pd directory.

The saved results can be reused any number of times.

In [9]:
os.makedirs("pd", exist_ok=True)  # 保存先のディレクトリを作成
for i in tqdm(range(200)):
    hc.PDList.from_alpha_filtration(pointclouds[i], save_boundary_map=True, save_to="pd/{:04d}.pdgm".format(i))

This tutorial uses the 2nd PDs, which means that we focus on void/bubble structures. In fact, these data are artificially generated and have characteristic void structures. Of course, we could use the 0th or 1st PDs instead.
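For example, the 1st PDs (ring structures) could be loaded in the same way, only changing the degree; pds1 is an illustrative name and is not used later in this tutorial:

pds1 = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(1) for i in range(200)]  # 1st PDs instead of 2nd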

In [10]:
pds = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(2) for i in range(200)]

Now we display the PDs. No. 0 (0-labeled) and No. 100 (1-labeled) are shown.

In this analysis, we expect that some important characteristic geometric structures of the dataset are encoded in these diagrams.

In [11]:
pds[0].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
pds[100].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
(Output: the 2nd persistence diagrams of No. 0 and No. 100)

Vectorize using Persistence Image (PI)¶

We use PI for vectorization. See https://arxiv.org/abs/1507.06217 and http://www.jmlr.org/papers/volume18/16-337/16-337.pdf for details. Note that the vectorization method used here differs slightly from that paper; the variant is described in https://arxiv.org/abs/1706.10082 and https://link.springer.com/article/10.1007%2Fs41468-018-0013-5.

Before vectorization, we choose the parameters for the PI vectorization. (0, 0.03), 128 means that the square [0, 0.03]x[0, 0.03] is divided into 128x128 boxes. sigma=0.002 specifies the standard deviation of the Gaussian distribution. weight=("atan", 0.01, 3) specifies the weight function determined by the distance from the diagonal line; the function is $\mathrm{atan}(0.01 \cdot \ell^3)$, where $\ell$ is the lifetime (= death - birth).

Of course, these parameters affect the learning performance. You should find "good" parameters by trial and error or by more systematic means such as cross-validation.
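For reference, the weight applied to each birth-death pair can be written down explicitly. This is a plain NumPy sketch of the formula above, not a function of HomCloud's API:

def atan_weight(lifetime, c=0.01, p=3):
    # weight = atan(c * lifetime**p), matching weight=("atan", 0.01, 3)
    return np.arctan(c * lifetime ** p)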

In [12]:
spec = hc.PIVectorizeSpec((0, 0.03), 128, sigma=0.002, weight=("atan", 0.01, 3))

Now we vectorize PDs.

In [13]:
pdvects = np.vstack([spec.vectorize(pd) for pd in pds])

We check the minimum and maximum elements of the vectors. The maximum is about $10^{-10}$, which is very small, so we normalize the vectors for better learning performance.

In [14]:
pdvects.min(), pdvects.max()
Out[14]:
(0.0, 2.099302720413663e-10)
In [15]:
pdvects = pdvects / pdvects.max()

Principal Component Analysis (PCA)¶

Now we use PCA for analysis. PCA projects high-dimensional vectors onto a low-dimensional subspace while preserving as much of the variance (information) as possible. PCA is always worth trying since its results are quite robust for various inputs. Applied to PDs, PCA projects each diagram to a low-dimensional point, which helps us understand the relationships among the PDs.

Here we project the PI vectors into a two-dimensional space for convenient visualization.

In [16]:
pca = PCA(n_components=2)
pca.fit(pdvects)
Out[16]:
PCA(n_components=2)
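The two components keep only part of the total variance. How much is retained can be checked with sklearn's explained_variance_ratio_ attribute (this check is not part of the original notebook):

pca.explained_variance_ratio_  # fraction of the variance captured by each of the two components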

The result is visualized as follows. The red points in the following figure correspond to 0-labeled data, and the blue points correspond to 1-labeled data.

In [17]:
reduced = pca.transform(pdvects)  # Project all vectors into R^2
plt.gca().set_aspect('equal')  # Set the aspect ratio of the figure
plt.scatter(reduced[labels == 0, 0], reduced[labels == 0, 1], c="r")  # Show 0-labeled data in red
plt.scatter(reduced[labels == 1, 0], reduced[labels == 1, 1], c="b")  # Show 1-labeled data in blue
Out[17]:
<matplotlib.collections.PathCollection at 0x7f787035a410>
(Output: 2D scatter plot of the projected PI vectors)

On the 2D plane, the red and blue points are well separated, so the PI vectors probably capture some characteristic difference between the two groups.

Here we want to understand the meaning of the X-axis and the Y-axis. Standard PCA computes a 2D plane in the high-dimensional space and projects the data orthogonally onto it. This means that there are two vectors $v_1$ and $v_2$ such that each high-dimensional PI vector $w_i$ (where the index $i$ runs over the pointclouds) is approximated as $$ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0, $$ where $v_0$ is the average of $w_1, \ldots, w_{200}$. For a precise statement, see the theory of PCA. In any case, the two vectors $v_1, v_2$ are the key to understanding the above figure.

In sklearn, $v_0$ is stored in PCA.mean_, and $v_1, v_2$ are stored in PCA.components_.

In [18]:
pca.mean_
Out[18]:
array([3.32287297e-05, 3.52588389e-05, 3.67793759e-05, ...,
       9.95550058e-08, 7.27036741e-08, 7.13024938e-08])
In [19]:
pca.components_
Out[19]:
array([[ 2.59806997e-07,  2.70363482e-07,  2.77565448e-07, ...,
        -3.62878022e-09, -4.09395688e-09, -3.50500694e-09],
       [-6.08939376e-07, -6.42464501e-07, -6.65625771e-07, ...,
         4.65980431e-09, -5.76173853e-10,  1.41943997e-09]])
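As a numerical check of the approximation $w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0$, we can reconstruct a PI vector from its two PCA coordinates and compare it with the original. This is a sketch, not part of the original notebook; pca.inverse_transform computes the same reconstruction:

w0_approx = pca.mean_ + reduced[0] @ pca.components_  # mu_01 * v1 + mu_02 * v2 + v0
np.linalg.norm(w0_approx - pdvects[0]) / np.linalg.norm(pdvects[0])  # relative reconstruction error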

One advantage of PI is the ability to convert a vector back into a "persistence diagram". The diagram corresponding to $v_0$ is, in a sense, the average of the 200 PDs. We can visualize the average, $v_1$, and $v_2$ in the form of persistence diagrams. The following shows such diagrams; the spec object created by hc.PIVectorizeSpec above converts the vectors into diagrams with its histogram_from_vector method.

In [20]:
spec.histogram_from_vector(pca.mean_).plot()
(Output: persistence-diagram plot of the average vector $v_0$)
In [21]:
spec.histogram_from_vector(pca.components_[0, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
spec.histogram_from_vector(pca.components_[1, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
(Output: persistence-diagram plots of $v_1$ and $v_2$)

Because the PCA vectors have both positive and negative elements, the plots are easier to read when the sign is shown in different colors. colorbar={"type":"linear-midpoint", "midpoint": 0} adjusts the colorbar accordingly.

Logistic regression¶

Next we try logistic regression, which is also a good partner for persistent homology. Two-class classification is convenient here because the learned result is easy to interpret.

Before learning, we split the dataset into a training set and a test set, using sklearn.

In [22]:
pdvects_train, pdvects_test, labels_train, labels_test = train_test_split(pdvects, labels, test_size=0.25)

Logistic regression is applied to the training set using sklearn's LogisticRegression class. The regularization parameter C is fixed by hand in this tutorial; you can also choose it automatically by cross-validation with sklearn's LogisticRegressionCV class, as sketched below.
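A minimal sketch of that cross-validated alternative is shown here; the candidate values in Cs and the choice cv=5 are illustrative, not taken from the original tutorial:

cv_model = lm.LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1.0], cv=5, solver="lbfgs", max_iter=1000)
cv_model.fit(pdvects_train, labels_train)
cv_model.C_  # regularization strength selected by cross-validation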

In [23]:
model = lm.LogisticRegression(C=0.01, solver="lbfgs")
In [24]:
model.fit(pdvects_train, labels_train)
Out[24]:
LogisticRegression(C=0.01)

We check the prediction accuracy on the test set.

In [25]:
model.score(pdvects_test, labels_test)
Out[25]:
0.8
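Besides the overall accuracy, a confusion matrix gives a more detailed view of the errors (this check is not part of the original notebook):

from sklearn.metrics import confusion_matrix
confusion_matrix(labels_test, model.predict(pdvects_test))  # rows: true labels, columns: predicted labels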

The accuracy is reasonably good. Next we visualize the learned result.

In [26]:
spec.histogram_from_vector(model.coef_.ravel()).plot(colorbar={"type": "linear-midpoint", "midpoint": 0.0})
(Output: persistence-diagram plot of the learned coefficients)

This figure means that a PD with many birth-death pairs in the red area is likely to come from 1-labeled data, and a PD with many birth-death pairs in the blue area is likely to come from 0-labeled data.

Inverse analysis (Optimal volume)¶

The red and blue areas are important for our analysis, so we clip them with a threshold of 0.005. NumPy's inequality operators are convenient for this.

In [27]:
coef = model.coef_.ravel()
In [28]:
red_area = spec.mask_from_vector(coef > 0.005)
In [29]:
blue_area = spec.mask_from_vector(coef < -0.005)

The clipped areas are visualized.

In [30]:
blue_area.plot(title="blue", ax=plt.subplot(1, 2, 1))
red_area.plot(title="red", ax=plt.subplot(1, 2, 2))
(Output: plots of the clipped blue and red areas)

We pick out the birth-death pairs in these areas from the No. 0 (0-labeled) PD and the No. 100 (1-labeled) PD.

In [31]:
pairs_in_blue_0 = blue_area.filter_pairs(pds[0].pairs())
pairs_in_red_0 = red_area.filter_pairs(pds[0].pairs())
In [32]:
pairs_in_blue_100 = blue_area.filter_pairs(pds[100].pairs())
pairs_in_red_100 = red_area.filter_pairs(pds[100].pairs())

The following shows the number of pairs.

In [33]:
len(pairs_in_blue_0), len(pairs_in_red_0), len(pairs_in_blue_100), len(pairs_in_red_100)
Out[33]:
(3, 2, 1, 5)

Indeed, the No. 0 PD has more pairs in the blue area, and the No. 100 PD has more pairs in the red area. The following code shows the birth and death times of the pairs.

In [34]:
pairs_in_blue_0
Out[34]:
[Pair(0.008484743298521374, 0.015315059365253271),
 Pair(0.0086407914437143, 0.01634538350339904),
 Pair(0.006687816186736094, 0.01715384087925338)]

Now we compute optimal volumes for these pairs and visualize them (using stable volumes instead is also a good idea; see the sketch below). This lets us visualize which voids play an important role in the classification.
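For comparison, a stable-volume version would look like the sketch below. It assumes that your HomCloud version provides Pair.stable_volume, and the noise bandwidth 0.0005 is only an illustrative value:

# Assumption: Pair.stable_volume(epsilon) is available in your HomCloud version
stable_volumes_blue_0 = [pair.stable_volume(0.0005) for pair in pairs_in_blue_0]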

In [35]:
optimal_volumes_blue_0 = [pair.optimal_volume(cutoff_radius=0.4) for pair in pairs_in_blue_0]
optimal_volumes_red_100 = [pair.optimal_volume(cutoff_radius=0.4) for pair in pairs_in_red_100]
In [36]:
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
for ov in optimal_volumes_blue_0:
    pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue")
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
for ov in optimal_volumes_red_100:
    pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red")
pl.show()
(Output: interactive 3D views of the pointclouds with their optimal volumes)

Side by side, the difference is a bit difficult to interpret, so we also display both sets of optimal volumes in the same space.

In [37]:
pl = pv.Plotter()
for ov in optimal_volumes_blue_0:
    pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue", opacity=0.5)
for ov in optimal_volumes_red_100:
    pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red", opacity=0.5)
pl.show()
(Output: interactive 3D view of both sets of optimal volumes in the same space)

The blue voids are typical for 0-labeled data, and the red voids are typical for 1-labeled data.

You can see that the blue voids are a bit smaller than the red ones. Indeed, in the learned coefficient diagram the blue area is located below the red one. In the 2nd PDs, the death time corresponds to the square of the radius of the maximal sphere inscribed in a void.

The death time at the center of the blue area is about 0.017, and at the center of the red area it is about 0.022. This means the square roots of 0.017 and 0.022 are the characteristic length scales of the 0-labeled and 1-labeled data, respectively.

In [38]:
np.sqrt(0.017), np.sqrt(0.022)
Out[38]:
(0.130384048104053, 0.14832396974191325)

From the PDs we can also see that the pointclouds contain many smaller voids, but such voids are common to all pointclouds and therefore do not distinguish the two groups.

We finish the tutorial here.