Machine learning on pointclouds¶
This tutorial shows how to apply machine learning methods combined with persistent homology to a pointcloud dataset.
This tutorial uses sklearn https://scikit-learn.org/stable/index.html.
import os # for makedirs
import homcloud.interface as hc # HomCloud
import numpy as np # Numerical array library
from tqdm.notebook import tqdm # For progressbar
import matplotlib.pyplot as plt # Plotting
import sklearn.linear_model as lm # Machine learning
from sklearn.decomposition import PCA # for PCA
from sklearn.model_selection import train_test_split
import pyvista as pv # for 3D visualization
%matplotlib inline
Before machine learning¶
We check the input data before applying machine learning methods.
The data for this tutorial is prepared in the pc directory.
ls pc/
0000.txt 0026.txt 0052.txt 0078.txt 0104.txt 0130.txt 0156.txt 0182.txt 0001.txt 0027.txt 0053.txt 0079.txt 0105.txt 0131.txt 0157.txt 0183.txt 0002.txt 0028.txt 0054.txt 0080.txt 0106.txt 0132.txt 0158.txt 0184.txt 0003.txt 0029.txt 0055.txt 0081.txt 0107.txt 0133.txt 0159.txt 0185.txt 0004.txt 0030.txt 0056.txt 0082.txt 0108.txt 0134.txt 0160.txt 0186.txt 0005.txt 0031.txt 0057.txt 0083.txt 0109.txt 0135.txt 0161.txt 0187.txt 0006.txt 0032.txt 0058.txt 0084.txt 0110.txt 0136.txt 0162.txt 0188.txt 0007.txt 0033.txt 0059.txt 0085.txt 0111.txt 0137.txt 0163.txt 0189.txt 0008.txt 0034.txt 0060.txt 0086.txt 0112.txt 0138.txt 0164.txt 0190.txt 0009.txt 0035.txt 0061.txt 0087.txt 0113.txt 0139.txt 0165.txt 0191.txt 0010.txt 0036.txt 0062.txt 0088.txt 0114.txt 0140.txt 0166.txt 0192.txt 0011.txt 0037.txt 0063.txt 0089.txt 0115.txt 0141.txt 0167.txt 0193.txt 0012.txt 0038.txt 0064.txt 0090.txt 0116.txt 0142.txt 0168.txt 0194.txt 0013.txt 0039.txt 0065.txt 0091.txt 0117.txt 0143.txt 0169.txt 0195.txt 0014.txt 0040.txt 0066.txt 0092.txt 0118.txt 0144.txt 0170.txt 0196.txt 0015.txt 0041.txt 0067.txt 0093.txt 0119.txt 0145.txt 0171.txt 0197.txt 0016.txt 0042.txt 0068.txt 0094.txt 0120.txt 0146.txt 0172.txt 0198.txt 0017.txt 0043.txt 0069.txt 0095.txt 0121.txt 0147.txt 0173.txt 0199.txt 0018.txt 0044.txt 0070.txt 0096.txt 0122.txt 0148.txt 0174.txt label.txt 0019.txt 0045.txt 0071.txt 0097.txt 0123.txt 0149.txt 0175.txt 0020.txt 0046.txt 0072.txt 0098.txt 0124.txt 0150.txt 0176.txt 0021.txt 0047.txt 0073.txt 0099.txt 0125.txt 0151.txt 0177.txt 0022.txt 0048.txt 0074.txt 0100.txt 0126.txt 0152.txt 0178.txt 0023.txt 0049.txt 0075.txt 0101.txt 0127.txt 0153.txt 0179.txt 0024.txt 0050.txt 0076.txt 0102.txt 0128.txt 0154.txt 0180.txt 0025.txt 0051.txt 0077.txt 0103.txt 0129.txt 0155.txt 0181.txt
There are 200 pointclouds, from 0000.txt to 0199.txt. The text file label.txt contains a label, 0 or 1, for each pointcloud; that is, the pointclouds are classified into two groups by these labels.
We read the label data and show it.
labels = np.loadtxt("pc/label.txt")
labels
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
This is an array of 0s and 1s. We show the indices of the 0-labeled data and the 1-labeled data. NumPy is useful for that purpose.
np.nonzero(labels == 0), np.nonzero(labels == 1)
((array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),), (array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199]),))
The pointclouds from 0 to 99 are labeled 0, and those from 100 to 199 are labeled 1.
Read pointclouds¶
We read the pointcloud data from the text files.
pointclouds = [np.loadtxt("pc/{:04d}.txt".format(i)) for i in tqdm(range(200))]
The shape of the 0th data is 1000x3, which means that 1000 points are distributed in $\mathbb{R}^3$. Every pointcloud in this dataset has the same number of points. If you want to apply machine learning to persistence diagrams, you should adjust the numbers of points; when the numbers of points in your dataset vary widely, learning often does not work well.
pointclouds[0].shape
(1000, 3)
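In this dataset every pointcloud already has 1000 points. If your own pointclouds had different sizes, one simple remedy is to subsample each of them down to a common number of points. The following is a minimal sketch under that assumption (not needed for this tutorial's data; the fixed random seed is an arbitrary choice):
rng = np.random.default_rng(0)
n_points = min(pc.shape[0] for pc in pointclouds)  # target size: the smallest pointcloud
subsampled = [pc[rng.choice(pc.shape[0], n_points, replace=False)] for pc in pointclouds]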
We visualize the 0th data and the 100th data. PyVista's subplot functionality is used to plot them side by side.
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
pl.show()
The label data says that these two pointclouds have an important difference, but it is difficult to identify that difference by eye.
Computing persistence diagrams¶
Now we compute persistence diagrams with HomCloud. The results are saved into the pd directory. The saved results can be reused any number of times.
os.makedirs("pd", exist_ok=True) # 保存先のディレクトリを作成
for i in tqdm(range(200)):
hc.PDList.from_alpha_filtration(pointclouds[i], save_boundary_map=True, save_to="pd/{:04d}.pdgm".format(i))
This tutorial uses the 2nd PDs, which means that we focus on void/bubble structures. In fact, these data are artificially generated and have characteristic void structures. Of course, we could use the 0th or 1st PDs instead.
pds = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(2) for i in range(200)]
Now we display the PDs. No. 0 (0-labeled) and No. 100 (1-labeled) are shown.
In this analysis, we believe that some important characteristic geometric structures of the dataset are encoded into these diagrams.
pds[0].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
pds[100].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
Vectorize using Persistence Image (PI)¶
We use PI for vectorization. Please see https://arxiv.org/abs/1507.06217 (http://www.jmlr.org/papers/volume18/16-337/16-337.pdf) for details. Note that the vectorization method used here is slightly different from that paper; it follows the method described in https://arxiv.org/abs/1706.10082 (https://link.springer.com/article/10.1007%2Fs41468-018-0013-5).
Before vectorization, we decide the parameters for the PI vectorization. (0, 0.03), 128 means that the square [0, 0.03]x[0, 0.03] is divided into 128x128 boxes. sigma=0.002 specifies the standard deviation of the Gaussian distribution. weight=("atan", 0.01, 3) specifies the weight function determined by the distance from the diagonal line; the function is $\textrm{atan}(0.01 \cdot \ell^3)$, where $\ell$ is the lifetime (= death - birth).
Of course these parameters affect the learning performance. You should find "good" parameters by trial-and-error or by more systematic methods such as cross-validation.
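For reference, here is a plain-NumPy sketch of the weight function described above (HomCloud evaluates it internally; this snippet only illustrates the formula):
def pi_weight(birth, death, c=0.01, p=3):
    # atan(c * lifetime**p), where lifetime = death - birth
    return np.arctan(c * (death - birth) ** p)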
spec = hc.PIVectorizeSpec((0, 0.03), 128, sigma=0.002, weight=("atan", 0.01, 3))
Now we vectorize PDs.
pdvects = np.vstack([spec.vectorize(pd) for pd in pds])
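Each row of pdvects is the PI vector of one diagram. A quick shape check (not part of the original tutorial):
pdvects.shape  # (number of pointclouds, length of a PI vector)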
We check the minimum and maximum elements of the vectors. The maximum is about $10^{-10}$, which is very small; we should normalize the vectors so that learning works well.
pdvects.min(), pdvects.max()
(0.0, 2.099302720413663e-10)
pdvects = pdvects / pdvects.max()
Principal Component Analysis (PCA)¶
Now we use PCA for analysis. PCA projects high dimensional vectors onto low dimensional vectors with as little loss of information as possible. PCA is always worth trying since its result is quite robust to various inputs. By applying PCA to the PI vectors, you can project each PD onto a low dimensional point and examine the relationships among the PDs.
Now we project the PI vectors onto a two dimensional space for the convenience of visualization.
pca = PCA(n_components=2)
pca.fit(pdvects)
PCA(n_components=2)
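As an optional check (not part of the original tutorial), the explained variance ratio tells how much of the total variance the two components capture:
pca.explained_variance_ratio_  # fraction of the variance captured by each of the two components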
The result is visualized as follows. The red points in the following figure correspond to 0-labeled data, and the blue points correspond to 1-labeled data.
reduced = pca.transform(pdvects) # Project all vectors into R^2
plt.gca().set_aspect('equal') # Set the aspect ratio of the figure
plt.scatter(reduced[labels == 0, 0], reduced[labels == 0, 1], c="r") # Show 0-labeled data in "r"ed
plt.scatter(reduced[labels == 1, 0], reduced[labels == 1, 1], c="b") # Show 1-labeled data in "b"lue
<matplotlib.collections.PathCollection at 0x7f787035a410>
On the 2D plane, the red points and the blue points are well separated. The PI vectors probably capture some characteristic difference between the two groups.
Here, we want to understand the meaning of the X-axis and the Y-axis. In standard PCA, the projection is the orthogonal projection onto a 2D plane in the high dimensional space computed by PCA. This means that there are two vectors $v_1$ and $v_2$ such that each high-dimensional PI vector $w_i$ (the index $i$ is the pointcloud index) is approximated as follows: $$ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0, $$ where $v_0$ is the average of $w_1, \ldots, w_{200}$. For a precise understanding, please refer to the theory of PCA. In any case, the two vectors $v_1, v_2$ are the key to interpreting the above figure.
In the sklearn library, $v_0$ is stored in PCA.mean_, and $v_1, v_2$ are stored in PCA.components_.
pca.mean_
array([3.32287297e-05, 3.52588389e-05, 3.67793759e-05, ..., 9.95550058e-08, 7.27036741e-08, 7.13024938e-08])
pca.components_
array([[ 2.59806997e-07, 2.70363482e-07, 2.77565448e-07, ..., -3.62878022e-09, -4.09395688e-09, -3.50500694e-09], [-6.08939376e-07, -6.42464501e-07, -6.65625771e-07, ..., 4.65980431e-09, -5.76173853e-10, 1.41943997e-09]])
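As a sanity check (a sketch, not part of the original tutorial), we can confirm the approximation above numerically; this reconstruction is exactly what pca.inverse_transform computes:
# Rebuild each PI vector from its two PCA coordinates plus the mean.
approx = reduced @ pca.components_ + pca.mean_  # same as pca.inverse_transform(reduced)
np.abs(pdvects - approx).max()  # nonzero, since only two components are kept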
One advantage of PI is the ability to convert a vector back into a "persistence diagram".
The diagram corresponding to $v_0$ is, in some sense, the average of the 200 PDs.
We can visualize the average, $v_1$, and $v_2$ in the form of persistence diagrams.
The following shows such diagrams. The spec object (created by hc.PIVectorizeSpec) used for the vectorization can convert the vectors back into diagrams.
spec.histogram_from_vector(pca.mean_).plot()
spec.histogram_from_vector(pca.components_[0, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
spec.histogram_from_vector(pca.components_[1, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
Because the PCA vectors have both positive and negative elements, they are easier to understand if the sign is shown in different colors. colorbar={"type":"linear-midpoint", "midpoint": 0} adjusts the colorbar for that purpose.
Logistic regression¶
Next we try logistic regression, which is also a good partner of persistent homology. Two-class classification is convenient because the learned result is easy to interpret.
Before learning, we split the dataset into a training set and a test set. We use sklearn for that purpose.
pdvects_train, pdvects_test, labels_train, labels_test = train_test_split(pdvects, labels, test_size=0.25)
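A quick check of the split sizes (not in the original tutorial): with test_size=0.25, 150 of the 200 vectors go to the training set and 50 to the test set.
len(pdvects_train), len(pdvects_test)  # (150, 50)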
Logistic regression is applied to the training set. LogisticRegression in sklearn is available for this. The regularization parameter C is fixed to a reasonable value in this tutorial. You can also choose the regularization parameter automatically by cross-validation with the LogisticRegressionCV class in sklearn, as sketched below.
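The following is a minimal sketch of such a cross-validated choice (not part of this tutorial's pipeline; the Cs grid, cv=5, and max_iter are illustrative values):
from sklearn.linear_model import LogisticRegressionCV

# Search a small grid of regularization strengths with 5-fold cross-validation.
cv_model = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1.0], cv=5, solver="lbfgs", max_iter=1000)
cv_model.fit(pdvects_train, labels_train)
cv_model.C_  # the regularization strength selected by cross-validation
In the rest of this tutorial we simply fix C=0.01.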
model = lm.LogisticRegression(C=0.01, solver="lbfgs")
model.fit(pdvects_train, labels_train)
LogisticRegression(C=0.01)
We check the accuracy of the prediction on the test set.
model.score(pdvects_test, labels_test)
0.8
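If you want more detail than a single accuracy number, a confusion matrix is a quick check (a sketch using sklearn.metrics, not part of the original tutorial):
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels.
confusion_matrix(labels_test, model.predict(pdvects_test))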
The accuracy is quite good. Then we visualize the learned result.
spec.histogram_from_vector(model.coef_.ravel()).plot(colorbar={"type": "linear-midpoint", "midpoint": 0.0})
This figure means that a PD with many birth-death pairs in the red area has a high probability of being 1-labeled, and a PD with many birth-death pairs in the blue area has a high probability of being 0-labeled.
Inverse analysis (Optimal volume)¶
OK, the red and blue areas are important for our analysis. We clip these areas with a threshold of 0.005. NumPy's inequality operators are useful here.
coef = model.coef_.ravel()
red_area = spec.mask_from_vector(coef > 0.005)
blue_area = spec.mask_from_vector(coef < -0.005)
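For reference (not part of the original tutorial), we can count how many elements of the coefficient vector exceed each threshold:
# Number of vector elements in the red and blue areas, respectively.
(coef > 0.005).sum(), (coef < -0.005).sum()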
The clipped areas are visualized.
blue_area.plot(title="blue", ax=plt.subplot(1, 2, 1))
red_area.plot(title="red", ax=plt.subplot(1, 2, 2))
We pick up the birth-death pairs in these areas from the No. 0 (0-labeled) PD and the No. 100 (1-labeled) PD.
pairs_in_blue_0 = blue_area.filter_pairs(pds[0].pairs())
pairs_in_red_0 = red_area.filter_pairs(pds[0].pairs())
pairs_in_blue_100 = blue_area.filter_pairs(pds[100].pairs())
pairs_in_red_100 = red_area.filter_pairs(pds[100].pairs())
The following shows the number of pairs.
len(pairs_in_blue_0), len(pairs_in_red_0), len(pairs_in_blue_100), len(pairs_in_red_100)
(3, 2, 1, 5)
Indeed, the No. 0 PD has more pairs in the blue area, and the No. 100 PD has more pairs in the red area. The following code shows the birth and death times of the pairs.
pairs_in_blue_0
[Pair(0.008484743298521374, 0.015315059365253271), Pair(0.0086407914437143, 0.01634538350339904), Pair(0.006687816186736094, 0.01715384087925338)]
Now we compute optimal volumes for these pairs and visualize them (using stable volumes instead is also a good idea). This allows us to visualize which voids play an important role in the classification.
optimal_volumes_blue_0 = [pair.optimal_volume(cutoff_radius=0.4) for pair in pairs_in_blue_0]
optimal_volumes_red_100 = [pair.optimal_volume(cutoff_radius=0.4) for pair in pairs_in_red_100]
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
for ov in optimal_volumes_blue_0:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue")
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
for ov in optimal_volumes_red_100:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red")
pl.show()
It is a bit difficult to compare the two plots, so we also display these optimal volumes in the same space.
pl = pv.Plotter()
for ov in optimal_volumes_blue_0:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue", opacity=0.5)
for ov in optimal_volumes_red_100:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red", opacity=0.5)
pl.show()
The blue voids are typical for 0-labeled data, and the red voids are typical for 1-labeled data.
You can see that the blue voids are a bit smaller than the red ones. Indeed, in the visualization of the learned coefficients the blue area is located below the red one. In a 2nd PD, the death time corresponds to the squared radius of the maximal sphere inscribed in a void.
The death time at the center of the red area is about 0.022, and that of the blue area is about 0.017. This means that the square roots of 0.017 and 0.022 are the characteristic length scales of the 0-labeled and 1-labeled data, respectively.
np.sqrt(0.017), np.sqrt(0.022)
(0.130384048104053, 0.14832396974191325)
From the PDs we can also see that there are many smaller voids in the pointclouds, but those voids are common to all pointclouds and therefore do not distinguish the two groups.
We finish the tutorial here.