Machine Learning Tutorial for Point Cloud¶
First, we import the necessary libraries. We'll be using Python machine learning libraries such as scikit-learn. For more information on scikit-learn, please refer to the official documentation: https://scikit-learn.org/stable/index.html Make sure to also study any other libraries used in this tutorial on your own.
import os
import homcloud.interface as hc
import numpy as np
from tqdm.notebook import tqdm # For progressbar
import matplotlib.pyplot as plt
import sklearn.linear_model as lm # for logistic regression
from sklearn.decomposition import PCA # for PCA
from sklearn.model_selection import train_test_split
import pyvista as pv # for 3D visualization
Checking the Data¶
Before starting with machine learning, let's first take a look at the data we'll be working with.
The tutorial dataset is located in the pc directory.
ls pc/
0000.txt 0026.txt 0052.txt 0078.txt 0104.txt 0130.txt 0156.txt 0182.txt 0001.txt 0027.txt 0053.txt 0079.txt 0105.txt 0131.txt 0157.txt 0183.txt 0002.txt 0028.txt 0054.txt 0080.txt 0106.txt 0132.txt 0158.txt 0184.txt 0003.txt 0029.txt 0055.txt 0081.txt 0107.txt 0133.txt 0159.txt 0185.txt 0004.txt 0030.txt 0056.txt 0082.txt 0108.txt 0134.txt 0160.txt 0186.txt 0005.txt 0031.txt 0057.txt 0083.txt 0109.txt 0135.txt 0161.txt 0187.txt 0006.txt 0032.txt 0058.txt 0084.txt 0110.txt 0136.txt 0162.txt 0188.txt 0007.txt 0033.txt 0059.txt 0085.txt 0111.txt 0137.txt 0163.txt 0189.txt 0008.txt 0034.txt 0060.txt 0086.txt 0112.txt 0138.txt 0164.txt 0190.txt 0009.txt 0035.txt 0061.txt 0087.txt 0113.txt 0139.txt 0165.txt 0191.txt 0010.txt 0036.txt 0062.txt 0088.txt 0114.txt 0140.txt 0166.txt 0192.txt 0011.txt 0037.txt 0063.txt 0089.txt 0115.txt 0141.txt 0167.txt 0193.txt 0012.txt 0038.txt 0064.txt 0090.txt 0116.txt 0142.txt 0168.txt 0194.txt 0013.txt 0039.txt 0065.txt 0091.txt 0117.txt 0143.txt 0169.txt 0195.txt 0014.txt 0040.txt 0066.txt 0092.txt 0118.txt 0144.txt 0170.txt 0196.txt 0015.txt 0041.txt 0067.txt 0093.txt 0119.txt 0145.txt 0171.txt 0197.txt 0016.txt 0042.txt 0068.txt 0094.txt 0120.txt 0146.txt 0172.txt 0198.txt 0017.txt 0043.txt 0069.txt 0095.txt 0121.txt 0147.txt 0173.txt 0199.txt 0018.txt 0044.txt 0070.txt 0096.txt 0122.txt 0148.txt 0174.txt label.txt 0019.txt 0045.txt 0071.txt 0097.txt 0123.txt 0149.txt 0175.txt 0020.txt 0046.txt 0072.txt 0098.txt 0124.txt 0150.txt 0176.txt 0021.txt 0047.txt 0073.txt 0099.txt 0125.txt 0151.txt 0177.txt 0022.txt 0048.txt 0074.txt 0100.txt 0126.txt 0152.txt 0178.txt 0023.txt 0049.txt 0075.txt 0101.txt 0127.txt 0153.txt 0179.txt 0024.txt 0050.txt 0076.txt 0102.txt 0128.txt 0154.txt 0180.txt 0025.txt 0051.txt 0077.txt 0103.txt 0129.txt 0155.txt 0181.txt
There are 200 text data files, ranging from 0000.txt to 0199.txt.
In addition, label.txt contains the corresponding labels for each data file, with values of either 0 or 1.
This indicates that the data is divided into two distinct groups.
Let's read and display the labels.
labels = np.loadtxt("pc/label.txt")
labels
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
This is an array of 200 labels, each with a value of either 0 or 1. Now, let's check which data points are labeled as 0 and which are labeled as 1. We'll use NumPy for this task.
np.nonzero(labels == 0), np.nonzero(labels == 1)
((array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),), (array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199]),))
We can see that the data from index 0 to 99 is labeled as 0, and the data from index 100 to 199 is labeled as 1.
Loading Point Cloud Data¶
Load the point cloud data from the files.
pointclouds = [np.loadtxt("pc/{:04d}.txt".format(i)) for i in tqdm(range(200))]
Each of the 200 data samples is an array of size 1000×3. In other words, there are 200 point clouds, each consisting of 1000 points in 3D space; we will verify this below.
First, let's visualize the 0th and 100th point clouds.
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
pl.show()
all(pointcloud.shape == (1000, 3) for pointcloud in pointclouds)
True
When performing machine learning with persistence diagrams, it's not strictly necessary for each point cloud to have exactly the same number of points. However, if the point counts differ too much, the model may not perform well.
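If you adapt this tutorial to your own data, a quick way to check the spread of point counts is something like the following (a small sketch, reusing the point clouds already loaded):
# Sketch: check the minimum and maximum number of points per point cloud.
counts = [pointcloud.shape[0] for pointcloud in pointclouds]
print(min(counts), max(counts))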
Computing Persistence Diagrams¶
Now, let's compute the persistence diagrams using HomCloud.
The resulting files will be saved in the pd directory.
Since the saved data can be reused, you won't need to recompute them each time you perform a different analysis.
os.makedirs("pd", exist_ok=True)
for i in tqdm(range(200)):
hc.PDList.from_alpha_filtration(pointclouds[i], save_boundary_map=True,
save_to="pd/{:04d}.pdgm".format(i))
In this tutorial, we will use 2-dimensional persistence diagrams, focusing on void structures in the data. While it is possible to use 0-dimensional or 1-dimensional diagrams as well, this dataset shows its most distinctive features in the 2-dimensional case, which is why we focus on 2-dimensional diagrams here.
Although we won't go into detail in this tutorial, more advanced analysis methods can involve using both 1D and 2D persistence information together (a rough sketch is shown after the vectorization step below).
pds = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(2) for i in tqdm(range(200))]
Let's display the 0th (label 0) and 100th (label 1) persistence diagrams.
pds[0].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
pds[100].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
Vectorization Using Persistence Images (PI)¶
For more information on Persistence Images, please refer to the following paper:
However, the method used in this tutorial is not exactly the same as the one proposed in that paper. Instead, it is a modified version based on the following work:
First, we define the parameters for vectorization.
The setting (0, 0.03), 128 specifies that the square region [0, 0.03] × [0, 0.03] will be divided into a 128×128 grid.
sigma=0.002 defines the standard deviation of the Gaussian distribution.
The option weight=("atan", 0.01, 3) applies a weighting based on the distance from the diagonal. In this case, the weight is given by the function $\mathrm{atan}(0.01 \cdot \ell^3)$, where $\ell$ denotes the lifetime (i.e., death − birth).
vectorize_spec = hc.PIVectorizeSpec((0, 0.03), 128, sigma=0.002, weight=("atan", 0.01, 3))
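To see how this weight behaves, we can plot $\mathrm{atan}(0.01 \cdot \ell^3)$ over the lifetime range used above (a quick illustration, not part of the tutorial's pipeline):
# Illustration: the atan weight grows with lifetime, so pairs near the diagonal get small weights.
lifetimes = np.linspace(0, 0.03, 200)
plt.plot(lifetimes, np.arctan(0.01 * lifetimes ** 3))
plt.xlabel("lifetime (death - birth)")
plt.ylabel("weight")
plt.show()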
Let's vectorize the persistence diagrams.
pdvects = np.vstack([vectorize_spec.vectorize(pd) for pd in pds])
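As mentioned above, more advanced pipelines sometimes combine 1D and 2D persistence information. A rough sketch of that idea is shown below; it reuses the same PI parameters for dimension 1 purely for illustration (in practice, the range and sigma for the 1D diagrams would need their own tuning):
# Sketch: vectorize the 1D diagrams as well and concatenate the vectors with the 2D ones.
pds1 = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(1) for i in tqdm(range(200))]
pdvects_1d = np.vstack([vectorize_spec.vectorize(pd) for pd in pds1])
pdvects_combined = np.hstack([pdvects_1d, pdvects])  # shape: (200, twice the PI vector length)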
Check the minimum and maximum of vector elements.
pdvects.min(), pdvects.max()
(np.float64(0.0), np.float64(2.099302720413663e-10))
The maximum value is approximately $2 \times 10^{-10}$, which is extremely small.
Such a low magnitude may negatively affect machine learning performance, so normalization is recommended.
There are various known methods for normalization, but here we will use a simple approach: scaling the array so that its maximum value becomes 1.
pdvects = pdvects / pdvects.max()
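For reference, among the "various known methods" one alternative is to standardize each grid cell with scikit-learn; this is shown only for comparison and is not used in the rest of the tutorial:
# Alternative (not used below): standardize each PI grid cell to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
pdvects_standardized = StandardScaler().fit_transform(pdvects)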
Principal Component Analysis (PCA)¶
We begin our analysis using PCA (Principal Component Analysis).
PCA is a technique for reducing high-dimensional data to a lower-dimensional space while preserving as much information as possible.
Because it rarely produces unstable results, PCA is a valuable first approach when combining persistence diagrams with machine learning.
In combination with persistence diagrams, PCA allows each diagram to be represented as a low-dimensional point, making it easier to visually understand the relationships between diagrams.
In this tutorial, we will reduce the dimensionality to 2D for easier visualization.
pca = PCA(n_components=2)
pca.fit(pdvects)
PCA(n_components=2)
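Before plotting, it can be useful to check how much of the total variance the two components retain (a quick check, not part of the original tutorial):
# Fraction of the variance explained by each of the two principal components.
print(pca.explained_variance_ratio_)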
We visualize how the data is represented in two dimensions. Red indicates label 0, and blue indicates label 1.
reduced = pca.transform(pdvects) # Reduce all data to 2D
plt.gca().set_aspect('equal') # Set equal aspect ratio for x and y axes
plt.scatter(reduced[labels == 0, 0], reduced[labels == 0, 1], c="r") # Plot data with label 0 in red
plt.scatter(reduced[labels == 1, 0], reduced[labels == 1, 1], c="b") # Plot data with label 1 in blue
<matplotlib.collections.PathCollection at 0x76126ac32450>
Since the red and blue points are reasonably well separated in the 2D plane, the vectors computed from the Persistence Images appear to capture the differences between the two groups effectively.
So, what do the X and Y axes in this plot represent?
If we denote the vectors corresponding to the X and Y axes as $ v_1 $ and $ v_2 $, then the high-dimensional vector $ w_i $ obtained from the Persistence Image (where $ i $ is the index of the data point) can be approximated as:
$$ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0 $$
Here, $ v_0 $ is the mean vector of all $ w_1, w_2, \ldots $, and the coordinates $ (\mu_{i1}, \mu_{i2}) $ correspond to the position of each point in the plot above (for more details, please refer to the fundamentals of PCA).
In other words, by understanding what kind of vectors $ v_1 $ and $ v_2 $ are, we can gain deeper insight into the meaning of this plot.
In scikit-learn's PCA implementation, $ v_0 $ is stored in PCA.mean_, and $ v_1 $ and $ v_2 $ are stored in PCA.components_.
pca.mean_
array([3.32287297e-05, 3.52588389e-05, 3.67793759e-05, ..., 9.95550058e-08, 7.27036741e-08, 7.13024938e-08], shape=(8256,))
pca.components_
array([[ 2.59806997e-07, 2.70363482e-07, 2.77565448e-07, ..., -3.62878022e-09, -4.09395688e-09, -3.50500694e-09], [-6.08939377e-07, -6.42464501e-07, -6.65625771e-07, ..., 4.65980431e-09, -5.76173853e-10, 1.41943997e-09]], shape=(2, 8256))
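As a sanity check of the approximation $ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0 $, we can rebuild the vectors from the 2D coordinates and measure the relative residual (a sketch; the exact value depends on the data):
# Reconstruct each PI vector from its two PCA coordinates and the mean vector.
reconstructed = pca.mean_ + reduced @ pca.components_
# Relative residual of the rank-2 approximation (0 would mean a perfect reconstruction).
print(np.linalg.norm(pdvects - reconstructed) / np.linalg.norm(pdvects))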
One of the advantages of Persistence Images (PI) is that the vectors can be converted back into signed histograms, similar in form to persistence diagrams.
This allows us to visualize vectors such as the mean vector and the principal components $ v_1 $ and $ v_2 $ in the style of persistence diagrams.
Let's try it out. The hc.PIVectorizeSpec object used for vectorization (vectorize_spec above) holds the information needed for this transformation, and its histogram_from_vector method converts a vector back into a histogram.
vectorize_spec.histogram_from_vector(pca.mean_).plot()
vectorize_spec.histogram_from_vector(pca.components_[0, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
vectorize_spec.histogram_from_vector(pca.components_[1, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
The PCA vectors contain both positive and negative components, which correspond to the positive and negative directions along each axis.
To make this visually intuitive, it's helpful to map positive values to red and negative values to black when displaying them, such as in a heatmap.
This can be adjusted using the setting colorbar={"type": "linear-midpoint", "midpoint": 0}.
Furthermore, you can extract the birth-death pairs that fall within the red or blue regions and perform inverse analysis to investigate their characteristics in more detail.
We will explore this inverse analysis in the next section on logistic regression.
Logistic Regression¶
Next, we move on to logistic regression. After PCA, logistic regression is another method that works well in combination with persistent homology.
It is especially suitable for binary classification tasks because of its simplicity and interpretability.
Before training the model, we first split the data into a training set and a test set.
pdvects_train, pdvects_test, labels_train, labels_test = train_test_split(pdvects, labels, test_size=0.25)
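To keep the class proportions equal in both splits and to make the result reproducible, you can also pass stratify and random_state; an optional variant (the seed value is arbitrary):
# Optional variant: stratified split with a fixed random seed for reproducibility.
pdvects_train, pdvects_test, labels_train, labels_test = train_test_split(
    pdvects, labels, test_size=0.25, stratify=labels, random_state=0)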
We will train the model using the training set. Here, we use the LogisticRegression class from sklearn. While the regularization parameter C is manually set to a reasonable value in this example, you can also use the LogisticRegressionCV class to select it automatically through cross-validation (a sketch using LogisticRegressionCV follows the fitted model below).
model = lm.LogisticRegression(C=0.01, solver="lbfgs")
model.fit(pdvects_train, labels_train)
LogisticRegression(C=0.01)
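As mentioned above, the regularization strength can instead be chosen automatically with LogisticRegressionCV; a minimal sketch (the candidate values for C are illustrative):
# Sketch: choose C by cross-validation instead of fixing it by hand.
model_cv = lm.LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1.0], cv=5, solver="lbfgs", max_iter=1000)
model_cv.fit(pdvects_train, labels_train)
print(model_cv.C_)  # selected C (one entry per class)
print(model_cv.score(pdvects_test, labels_test))  # accuracy with the selected C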
We will check the performance using the test set.
model.score(pdvects_test, labels_test)
0.86
The accuracy looks good. Let's visualize the learned coefficients.
vectorize_spec.histogram_from_vector(model.coef_.ravel()).plot(colorbar={"type": "linear-midpoint", "midpoint": 0.0})
This plot can be interpreted as follows:
If a PD has many birth-death pairs located in the dark red regions, it is likely to belong to label 1.
Conversely, if many birth-death pairs are found in the dark blue regions, the data is more likely to belong to label 0.
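This interpretation can be checked directly: the sign of the linear score $ w \cdot x + b $ determines the predicted label, so it should be negative for the 0th diagram and positive for the 100th (a quick check; the exact values depend on the train/test split):
# Linear scores for one label-0 and one label-1 sample: negative -> label 0, positive -> label 1.
print(model.decision_function(pdvects[[0, 100]]))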
Applying Inverse Analysis¶
Let’s take a closer look at the red and blue regions. We begin by extracting the pairs contained within each region from the persistence diagram.
To do this, we isolate the red and blue regions using a threshold value of 0.005. Specifically, we extract the parts greater than 0.005 and less than -0.005. Using NumPy's inequality operations, we generate a boolean vector, which is then passed to the mask_from_vector method of vectorize_spec to construct a MaskHistogram object.
This object can be used to visualize the extracted regions and to retrieve the birth-death pairs contained within them.
coef = model.coef_.ravel()
red_area = vectorize_spec.mask_from_vector(coef > 0.005)
blue_area = vectorize_spec.mask_from_vector(coef < -0.005)
The regions are visualized as follows.
blue_area.plot(title="blue", ax=plt.subplot(1, 2, 1))
red_area.plot(title="red", ax=plt.subplot(1, 2, 2))
We extract the birth-death pairs contained in the selected regions from the 0th persistence diagram (label 0) and the 100th persistence diagram (label 1) using MaskHistogram's filter_pairs method.
pairs_in_blue_0 = blue_area.filter_pairs(pds[0].pairs())
pairs_in_red_0 = red_area.filter_pairs(pds[0].pairs())
pairs_in_blue_100 = blue_area.filter_pairs(pds[100].pairs())
pairs_in_red_100 = red_area.filter_pairs(pds[100].pairs())
We print the number of pairs in each region.
len(pairs_in_blue_0), len(pairs_in_red_0), len(pairs_in_blue_100), len(pairs_in_red_100)
(3, 2, 1, 7)
Indeed, the 0th persistence diagram has more pairs in the blue region than in the red region, while the 100th persistence diagram has clearly more pairs in the red region. Upon inspection, we find that the following pairs are included.
pairs_in_blue_0
[Pair(0.008484743298521374, 0.015315059365253271), Pair(0.0086407914437143, 0.01634538350339904), Pair(0.006687816186736094, 0.01715384087925338)]
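Each Pair object carries its birth and death times, so you can inspect the lifetimes directly; a small sketch, assuming HomCloud's birth_time/death_time/lifetime accessors:
# Sketch (assumes Pair exposes birth_time(), death_time(), and lifetime()):
# list birth, death, and lifetime for the blue-region pairs of the 0th diagram.
for pair in pairs_in_blue_0:
    print(pair.birth_time(), pair.death_time(), pair.lifetime())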
Here, we compute and visualize the stable volumes, a tool introduced in the point cloud tutorial. This allows us to visually identify which types of void structures play a significant role in the classification task.
stable_volumes_blue_0 = [pair.stable_volume(0.0003, cutoff_radius=0.4) for pair in pairs_in_blue_0]
stable_volumes_red_100 = [pair.stable_volume(0.0003, cutoff_radius=0.4) for pair in pairs_in_red_100]
pl = pv.Plotter(shape=(1, 2))
pl.subplot(0, 0)
pl.add_mesh(pv.PointSet(pointclouds[0]))
for ov in stable_volumes_blue_0:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue")
pl.subplot(0, 1)
pl.add_mesh(pv.PointSet(pointclouds[100]))
for ov in stable_volumes_red_100:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red")
pl.show()
Since it is difficult to see the difference this way, we display all of them in a single view.
pl = pv.Plotter()
for ov in stable_volumes_blue_0:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="blue", opacity=0.5)
for ov in stable_volumes_red_100:
pl.add_mesh(ov.to_pyvista_boundary_mesh(), color="red", opacity=0.5)
pl.show()
The blue volumes represent voids that are "typical" for label 0, while the red volumes correspond to voids that are "typical" for label 1.
The blue voids appear slightly narrower and smaller than the red ones. Looking at the plot of the learned coefficients, we can see that the blue region lies lower along the Y-axis (death) than the red region. In fact, in a 2-dimensional persistence diagram, the death value corresponds to the square of the radius of the largest ball that can fit into the void, so the learning result suggests that label 1 tends to have larger voids at this scale.
Specifically, the center of the red region has a death value of approximately 0.022, while the blue region's center is around 0.017. Taking the square roots of these values gives us the radii of the largest balls that can fit into the respective voids:
np.sqrt(0.017), np.sqrt(0.022)
(np.float64(0.130384048104053), np.float64(0.14832396974191325))
Additionally, by examining the original persistence diagrams, we can observe that smaller voids are abundant and commonly shared between both groups.
This concludes the tutorial.