Machine Learning Tutorial for Point Cloud¶

First, we import the necessary libraries. We'll be using Python machine learning libraries such as scikit-learn. For more information on scikit-learn, please refer to the official documentation (https://scikit-learn.org/stable/index.html). Make sure to also study any other libraries used in this tutorial on your own.

In [1]:
import os  # for makedirs
import homcloud.interface as hc  # HomCloud
import homcloud.plotly_3d as p3d  # 3D visualization
import plotly.graph_objects as go  # 3D visualization
import numpy as np  # Numerical array library
from tqdm.notebook import tqdm  # For progressbar
import matplotlib.pyplot as plt  # Plotting
import sklearn.linear_model as lm  # Machine learning
from sklearn.decomposition import PCA  # for PCA
from sklearn.model_selection import train_test_split
import plotly.io
plotly.io.renderers.default = 'notebook'  # Plotly setting. Choose your favorite renderer.

Checking the Data¶

Before starting with machine learning, let's first take a look at the data we'll be working with. The tutorial dataset is located in the pc directory.

In [2]:
ls pc/
0000.txt  0026.txt  0052.txt  0078.txt  0104.txt  0130.txt  0156.txt  0182.txt
0001.txt  0027.txt  0053.txt  0079.txt  0105.txt  0131.txt  0157.txt  0183.txt
0002.txt  0028.txt  0054.txt  0080.txt  0106.txt  0132.txt  0158.txt  0184.txt
0003.txt  0029.txt  0055.txt  0081.txt  0107.txt  0133.txt  0159.txt  0185.txt
0004.txt  0030.txt  0056.txt  0082.txt  0108.txt  0134.txt  0160.txt  0186.txt
0005.txt  0031.txt  0057.txt  0083.txt  0109.txt  0135.txt  0161.txt  0187.txt
0006.txt  0032.txt  0058.txt  0084.txt  0110.txt  0136.txt  0162.txt  0188.txt
0007.txt  0033.txt  0059.txt  0085.txt  0111.txt  0137.txt  0163.txt  0189.txt
0008.txt  0034.txt  0060.txt  0086.txt  0112.txt  0138.txt  0164.txt  0190.txt
0009.txt  0035.txt  0061.txt  0087.txt  0113.txt  0139.txt  0165.txt  0191.txt
0010.txt  0036.txt  0062.txt  0088.txt  0114.txt  0140.txt  0166.txt  0192.txt
0011.txt  0037.txt  0063.txt  0089.txt  0115.txt  0141.txt  0167.txt  0193.txt
0012.txt  0038.txt  0064.txt  0090.txt  0116.txt  0142.txt  0168.txt  0194.txt
0013.txt  0039.txt  0065.txt  0091.txt  0117.txt  0143.txt  0169.txt  0195.txt
0014.txt  0040.txt  0066.txt  0092.txt  0118.txt  0144.txt  0170.txt  0196.txt
0015.txt  0041.txt  0067.txt  0093.txt  0119.txt  0145.txt  0171.txt  0197.txt
0016.txt  0042.txt  0068.txt  0094.txt  0120.txt  0146.txt  0172.txt  0198.txt
0017.txt  0043.txt  0069.txt  0095.txt  0121.txt  0147.txt  0173.txt  0199.txt
0018.txt  0044.txt  0070.txt  0096.txt  0122.txt  0148.txt  0174.txt  label.txt
0019.txt  0045.txt  0071.txt  0097.txt  0123.txt  0149.txt  0175.txt
0020.txt  0046.txt  0072.txt  0098.txt  0124.txt  0150.txt  0176.txt
0021.txt  0047.txt  0073.txt  0099.txt  0125.txt  0151.txt  0177.txt
0022.txt  0048.txt  0074.txt  0100.txt  0126.txt  0152.txt  0178.txt
0023.txt  0049.txt  0075.txt  0101.txt  0127.txt  0153.txt  0179.txt
0024.txt  0050.txt  0076.txt  0102.txt  0128.txt  0154.txt  0180.txt
0025.txt  0051.txt  0077.txt  0103.txt  0129.txt  0155.txt  0181.txt

There are 200 text data files, ranging from 0000.txt to 0199.txt.
In addition, label.txt contains the corresponding labels for each data file, with values of either 0 or 1.
This indicates that the data is divided into two distinct groups.

We read and display the labels.

In [3]:
labels = np.loadtxt("pc/label.txt")
labels
Out[3]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

This is an array of 200 labels, each with a value of either 0 or 1. Now, let's check which data points are labeled as 0 and which are labeled as 1. We'll use NumPy for this task.

In [4]:
np.nonzero(labels == 0), np.nonzero(labels == 1)
Out[4]:
((array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
         51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
         68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
         85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),),
 (array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
         113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
         126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
         139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
         152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
         165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
         178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
         191, 192, 193, 194, 195, 196, 197, 198, 199]),))

We can see that the data from index 0 to 99 is labeled as 0, and the data from index 100 to 199 is labeled as 1. Now, let's visualize the 0th and 100th data points.

In [5]:
go.Figure(data=[p3d.PointCloud(np.loadtxt("pc/0000.txt"))], layout=dict(scene=p3d.SimpleScene()))
In [6]:
go.Figure(data=[p3d.PointCloud(np.loadtxt("pc/0100.txt"))], layout=dict(scene=p3d.SimpleScene()))

Loading Point Cloud Data¶

Now, let's load the point cloud data from the files.

In [7]:
pointclouds = [np.loadtxt("pc/{:04d}.txt".format(i)) for i in tqdm(range(200))]

All 200 data samples are arrays of size 1000×3. That is, there are 200 point clouds, each consisting of 1000 points in 3D space. We verify this with the code below.

In [8]:
all(pointcloud.shape == (1000, 3) for pointcloud in pointclouds)
Out[8]:
True

When performing machine learning with persistence diagrams, it's not strictly necessary for each point cloud to have exactly the same number of points. However, if the point counts differ too much, the model may not perform well.

Computing Persistence Diagrams¶

Now, let's compute the persistence diagrams using HomCloud.
The resulting files will be saved in the pd directory.

Since the saved data can be reused, you won't need to recompute them each time you perform a different analysis.

In [9]:
os.makedirs("pd", exist_ok=True)
for i in tqdm(range(200)):
    hc.PDList.from_alpha_filtration(pointclouds[i], save_boundary_map=True, 
                                    save_to="pd/{:04d}.pdgm".format(i))

In this tutorial, we will use 2-dimensional persistence diagrams, focusing on void structures in the data. While it's possible to use 0-dimensional or 1-dimensional diagrams as well, this dataset shows its most distinctive features in the 2-dimensional case, which is why we focus on 2-dimensional diagrams here.

Although we won’t go into detail in this tutorial, more advanced analysis methods can involve using both 1D and 2D persistence information together.

In [10]:
pds = [hc.PDList("pd/{:04d}.pdgm".format(i)).dth_diagram(2) for i in tqdm(range(200))]

We display the 0th (label 0) and 100th (label 1) persistence diagrams.

In [11]:
pds[0].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})
pds[100].histogram(x_range=(0, 0.03)).plot(colorbar={"type": "log"})

Vectorization Using Persistence Images (PI)¶

For more information on Persistence Images, please refer to the following paper:

  • http://www.jmlr.org/papers/volume18/16-337/16-337.pdf

However, the method used in this tutorial is not exactly the same as the one proposed in that paper. Instead, it is a modified version based on the following work:

  • https://link.springer.com/article/10.1007%2Fs41468-018-0013-5

First, we define the parameters for vectorization.
The setting (0, 0.03), 128 specifies that the square region [0, 0.03] × [0, 0.03] will be divided into a 128×128 grid.
sigma=0.002 defines the standard deviation of the Gaussian distribution.
The option weight=("atan", 0.01, 3) applies a weighting based on the distance from the diagonal.
In this case, the weight is determined by the function $\textrm{atan}(0.01 \cdot \ell^3)$, where $\ell$ represents the lifetime (i.e., death − birth).

In [12]:
vectorize_spec = hc.PIVectorizeSpec((0, 0.03), 128, sigma=0.002, weight=("atan", 0.01, 3))

Let's vectorize the persistence diagrams.

In [13]:
pdvects = np.vstack([vectorize_spec.vectorize(pd) for pd in pds])

Check the minimum and maximum of vector elements.

In [14]:
pdvects.min(), pdvects.max()
Out[14]:
(np.float64(0.0), np.float64(2.099302720413663e-10))

The maximum value is approximately $2 \times 10^{-10}$, which is extremely small.
Such a low magnitude may negatively affect machine learning performance, so normalization is recommended.

There are various known methods for normalization, but here we will use a simple approach: scaling the array so that its maximum value becomes 1.

In [15]:
pdvects = pdvects / pdvects.max()

Principal Component Analysis (PCA)¶

We begin our analysis using PCA (Principal Component Analysis).
PCA is a technique for reducing high-dimensional data to a lower-dimensional space while preserving as much information as possible.
Because it rarely produces unstable results, PCA is a valuable first approach when combining persistence diagrams with machine learning.

In combination with persistence diagrams, PCA allows each diagram to be represented as a low-dimensional point, making it easier to visually understand the relationships between diagrams.

In this tutorial, we will reduce the dimensionality to 2D for easier visualization.

In [16]:
pca = PCA(n_components=2)
pca.fit(pdvects)
Out[16]:
PCA(n_components=2)

We visualize how the data is represented in two dimensions. Red indicates label 0, and blue indicates label 1.

In [17]:
pca_reduced = pca.transform(pdvects) # Reduce all data to 2D
plt.gca().set_aspect('equal') # Set equal aspect ratio for x and y axes
plt.scatter(pca_reduced[labels == 0, 0], pca_reduced[labels == 0, 1], c="r")  # Plot data with label 0 in red
plt.scatter(pca_reduced[labels == 1, 0], pca_reduced[labels == 1, 1], c="b")  # Plot data with label 1 in blue
Out[17]:
<matplotlib.collections.PathCollection at 0x73452df9ec30>

Since the red and blue points are reasonably well separated in the 2D plane, the vectors computed from the Persistence Images appear to capture the differences between the two groups effectively.

So, what do the X and Y axes in this plot represent?

If we denote the vectors corresponding to the X and Y axes as $ v_1 $ and $ v_2 $, then the high-dimensional vector $ w_i $ obtained from the Persistence Image (where $ i $ is the index of the data point) can be approximated as:

$$ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0 $$

Here, $ v_0 $ is the mean vector of all $ w_1, w_2, \ldots $, and the coordinates $ (\mu_{i1}, \mu_{i2}) $ correspond to the position of each point in the plot above (for more details, please refer to the fundamentals of PCA).
In other words, by understanding what kind of vectors $ v_1 $ and $ v_2 $ are, we can gain deeper insight into the meaning of this plot.

In scikit-learn's PCA implementation, $ v_0 $ is stored in PCA.mean_, and $ v_1 $, $ v_2 $ are stored in PCA.components_.
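
The approximation above is easy to check numerically. The sketch below uses random stand-in data (not the tutorial's PI vectors; the shapes, seed, and noise scale are assumptions for illustration) lying close to a 2D plane, and confirms that `mean_` and `components_` reconstruct each sample from its two PCA coordinates:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 200 samples near a 2D plane in 50-dimensional space
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 50))
coords = rng.normal(size=(200, 2))
w = coords @ basis + rng.normal(scale=0.01, size=(200, 50)) + 5.0

pca = PCA(n_components=2)
mu = pca.fit_transform(w)  # the coordinates (mu_i1, mu_i2) of each sample

# Reconstruct w_i as mu_i1 * v1 + mu_i2 * v2 + v0
w_approx = mu @ pca.components_ + pca.mean_

# Because the data is nearly planar, the relative reconstruction error is tiny
rel_err = np.linalg.norm(w - w_approx) / np.linalg.norm(w)
print(rel_err)
```

With the real PI vectors the residual would be larger, since they are not exactly two-dimensional, but the reconstruction formula is the same.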

In [18]:
pca.mean_
Out[18]:
array([3.32287297e-05, 3.52588389e-05, 3.67793759e-05, ...,
       9.95550058e-08, 7.27036741e-08, 7.13024938e-08], shape=(8256,))
In [19]:
pca.components_
Out[19]:
array([[ 2.59806997e-07,  2.70363482e-07,  2.77565448e-07, ...,
        -3.62878022e-09, -4.09395688e-09, -3.50500694e-09],
       [-6.08939377e-07, -6.42464501e-07, -6.65625771e-07, ...,
         4.65980431e-09, -5.76173853e-10,  1.41943997e-09]],
      shape=(2, 8256))

One of the advantages of Persistence Images (PI) is that the vectors can be converted back into signed histograms, similar in form to persistence diagrams.
This allows us to visualize vectors such as the mean vector and the principal components $ v_1 $ and $ v_2 $ in the style of persistence diagrams.

Let’s try it out. The hc.PIVectorizeSpec object used for vectorization contains the necessary information for this transformation, and its histogram_from_vector method converts a vector back into a histogram.

In [20]:
vectorize_spec.histogram_from_vector(pca.mean_).plot()
In [21]:
vectorize_spec.histogram_from_vector(pca.components_[0, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})
vectorize_spec.histogram_from_vector(pca.components_[1, :]).plot(colorbar={"type":"linear-midpoint", "midpoint": 0})

The PCA vectors contain both positive and negative components, which correspond to the positive and negative directions along each axis.
To make this visually intuitive, it's helpful to map positive values to red and negative values to black when displaying them, such as in a heatmap.
This can be adjusted using the setting colorbar={"type": "linear-midpoint", "midpoint": 0}.

Furthermore, you can extract the birth-death pairs that fall within the red or blue regions and perform inverse analysis to investigate their characteristics in more detail.
We will explore this inverse analysis in the next section on logistic regression.

Logistic Regression¶

Next, we move on to logistic regression. After PCA, logistic regression is another method that works well in combination with persistent homology.
It is especially suitable for binary classification tasks due to its interpretability and simplicity.

Before training the model, we first split the data into a training set and a test set.

In [22]:
pdvects_train, pdvects_test, labels_train, labels_test = train_test_split(pdvects, labels, test_size=0.25)

We will train the model using the training set.
Here, we use the LogisticRegression class from sklearn.
While the regularization parameter C is manually set to a reasonable value in this example, you can also use the LogisticRegressionCV class to automatically select the optimal parameter through cross-validation.
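
As a sketch of the cross-validated alternative (shown here with random stand-in data rather than the tutorial's PI vectors; the candidate grid `Cs`, the fold count, and the seed are assumptions):

```python
import numpy as np
import sklearn.linear_model as lm

# Hypothetical binary classification data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))
y = (X[:, 0] + 0.3 * rng.normal(size=150) > 0).astype(float)

# LogisticRegressionCV evaluates each candidate C by k-fold cross-validation
cv_model = lm.LogisticRegressionCV(Cs=np.logspace(-4, 2, 7), cv=5, solver="lbfgs")
cv_model.fit(X, y)
print(cv_model.C_)  # the selected regularization strength (one entry per class)
```

The selected C_ could then be used in place of a hand-picked value.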

In [23]:
model = lm.LogisticRegression(C=0.01, solver="lbfgs")
In [24]:
model.fit(pdvects_train, labels_train, )
Out[24]:
LogisticRegression(C=0.01)

We will check the performance using the test set.

In [25]:
model.score(pdvects_test, labels_test)
Out[25]:
0.9

The accuracy looks good. Let's visualize the learned coefficients.

In [26]:
vectorize_spec.histogram_from_vector(model.coef_.ravel()).plot(colorbar={"type": "linear-midpoint", "midpoint": 0.0})

This plot can be interpreted as follows:
if a persistence diagram has many birth-death pairs located in the dark red regions, it is likely to belong to label 1. Conversely, if many birth-death pairs fall in the dark blue regions, the data is more likely to belong to label 0.
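
Mechanically, this reading comes straight from the logistic model: label 1 is predicted when the linear score $\langle \mathrm{coef}, x \rangle + \mathrm{intercept}$ is positive. A toy sketch with hypothetical 3-element vectors (not the tutorial's actual high-dimensional ones):

```python
import numpy as np

# Hypothetical coefficients: a "red" (positive) cell and a "blue" (negative) cell
coef = np.array([0.8, -0.6, 0.1])
intercept = -0.05

def prob_label1(x):
    score = coef @ x + intercept
    return 1.0 / (1.0 + np.exp(-score))  # logistic sigmoid

# A PI vector with mass in the positive-coefficient cell leans toward label 1;
# one with mass in the negative-coefficient cell leans toward label 0
mostly_red = np.array([1.0, 0.1, 0.2])
mostly_blue = np.array([0.1, 1.0, 0.2])
print(prob_label1(mostly_red), prob_label1(mostly_blue))
```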

Applying Inverse analysis¶

Let’s take a closer look at the red and blue regions. We begin by extracting the pairs contained within each region from the persistence diagram.

To do this, we isolate the red and blue regions using a threshold value of 0.005: we extract the parts greater than 0.005 and less than -0.005. Using NumPy's inequality operations, we generate a boolean vector, which is then passed to the mask_from_vector method of vectorize_spec to construct a MaskHistogram object.

This object can be used to visualize the extracted regions and to retrieve the birth-death pairs contained within them.

In [27]:
coef = model.coef_.ravel()
In [28]:
red_area = vectorize_spec.mask_from_vector(coef > 0.005)
In [29]:
blue_area = vectorize_spec.mask_from_vector(coef < -0.005)

The regions are visualized as follows.

In [30]:
blue_area.plot(title="blue", ax=plt.subplot(1, 2, 1))
red_area.plot(title="red", ax=plt.subplot(1, 2, 2))

We extract the birth-death pairs contained in the selected regions from the 0th persistence diagram (label 0) and the 100th persistence diagram (label 1) using MaskHistogram's filter_pairs method.

In [31]:
pairs_in_blue_0 = blue_area.filter_pairs(pds[0].pairs())
pairs_in_red_0 = red_area.filter_pairs(pds[0].pairs())
In [32]:
pairs_in_blue_100 = blue_area.filter_pairs(pds[100].pairs())
pairs_in_red_100 = red_area.filter_pairs(pds[100].pairs())

We print the number of pairs in each region.

In [33]:
len(pairs_in_blue_0), len(pairs_in_red_0), len(pairs_in_blue_100), len(pairs_in_red_100)
Out[33]:
(3, 2, 0, 5)

Indeed, the 0th persistence diagram has more pairs in the blue region, while the 100th persistence diagram has its pairs concentrated in the red region. Upon inspection, we find that the following pairs are included.

In [34]:
pairs_in_blue_0
Out[34]:
[Pair(0.008484743298521374, 0.015315059365253271),
 Pair(0.0086407914437143, 0.01634538350339904),
 Pair(0.006687816186736094, 0.01715384087925338)]

Here, we compute and visualize the stable volume, a tool used in the point cloud tutorial. This allows us to visually identify which types of void structures play a significant role in the classification task.

In [35]:
stable_volumes_blue_0 = [pair.stable_volume(0.0003, cutoff_radius=0.4) for pair in pairs_in_blue_0]
stable_volumes_red_100 = [pair.stable_volume(0.0003, cutoff_radius=0.4) for pair in pairs_in_red_100]
In [36]:
go.Figure(data=[
    ov.to_plotly3d_mesh(color="blue") for ov in stable_volumes_blue_0
] + [
    p3d.PointCloud(pointclouds[0])
], layout=dict(scene=p3d.SimpleScene()))
In [37]:
go.Figure(data=[
    ov.to_plotly3d_mesh(color="red") for ov in stable_volumes_red_100
] + [
    p3d.PointCloud(pointclouds[100])
], layout=dict(scene=p3d.SimpleScene()))

Since the difference is hard to see in separate views, we display all of the volumes together.

In [38]:
fig = go.Figure(data=[
    ov.to_plotly3d_mesh(color="blue") for ov in stable_volumes_blue_0
] + [
    ov.to_plotly3d_mesh(color="red") for ov in stable_volumes_red_100
], layout=dict(scene=p3d.SimpleScene()))
fig.update_traces(opacity=0.5, selector=dict(type="mesh3d"))

The blue volumes represent voids that are "typical" for label 0, while the red volumes correspond to voids that are "typical" for label 1.

The blue voids appear slightly narrower and smaller than the red ones. Looking at the learned coefficient histogram, we can see that the blue region lies lower along the Y-axis (death) than the red region. In fact, in a 2-dimensional persistence diagram computed from an alpha filtration, the death value corresponds to the square of the radius of the largest ball that can fit into the void. This means that the learning result suggests label 1 tends to have larger voids at this scale.

Specifically, the center of the red region has a death value of approximately 0.022, while the blue region's center is around 0.017. Taking the square roots of these values gives us the radii of the largest balls that can fit into the respective voids:

In [39]:
np.sqrt(0.017), np.sqrt(0.022)
Out[39]:
(np.float64(0.130384048104053), np.float64(0.14832396974191325))

Additionally, by examining the original persistence diagrams, we can observe that smaller voids are abundant and commonly shared between both groups.

This tutorial ends here.

Appendix¶

Supplementary discussion about PCA¶

PCA approximates vectors $(w_i)_{i=1,\ldots}$ as follows:

$$ w_i \approx \mu_{i1} v_1 + \mu_{i2} v_2 + v_0 $$

In fact, PCA is computed under the constraint that $\{v_1, v_2\}$ is an orthonormal system. While this constraint is mainly a computational convenience, it gives the following advantages:

  • The distance between two points in the PCA projection space approximates the distance in the original high-dimensional space
  • The angle between two vectors in the PCA projection space approximates the angle in the original high-dimensional space
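
The first property can be checked numerically. The sketch below uses hypothetical data lying close to a 2D plane in 30 dimensions (shapes, seed, and noise scale are assumptions), so a 2-component PCA projection nearly preserves the distance between two points; projection onto an orthonormal system can only shrink distances, and here the part lost is just the small off-plane noise:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 50 points near a 2D plane embedded in 30 dimensions
rng = np.random.default_rng(2)
pts = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30))
pts += rng.normal(scale=0.01, size=(50, 30))  # small off-plane noise

proj = PCA(n_components=2).fit_transform(pts)

# Distance between the first two points, before and after projection
d_orig = np.linalg.norm(pts[0] - pts[1])
d_proj = np.linalg.norm(proj[0] - proj[1])
print(d_orig, d_proj)
```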

Meanwhile, a disadvantage of PCA is that its axes need not correspond to directions that are meaningful in the data. Let's see how this disadvantage appears when PI and PCA are combined.

The following shows the PCA projection plot again.

In [40]:
plt.gca().set_aspect('equal')  
plt.scatter(pca_reduced[labels == 0, 0], pca_reduced[labels == 0, 1], c="r")
plt.scatter(pca_reduced[labels == 1, 0], pca_reduced[labels == 1, 1], c="b")
Out[40]:
<matplotlib.collections.PathCollection at 0x73452dd761e0>

In this figure, the red and blue points (label 0 and label 1) are separated along a diagonal direction in the low-dimensional space obtained via PCA, as illustrated by the arrow below.

In [41]:
plt.gca().set_aspect('equal')  
plt.scatter(pca_reduced[labels == 0, 0], pca_reduced[labels == 0, 1], c="r")
plt.scatter(pca_reduced[labels == 1, 0], pca_reduced[labels == 1, 1], c="b")
plt.plot([-4, 4], [-5, 5], c="k")
plt.annotate('', xy=(-4, 4), xytext=(6, -4), arrowprops=dict(arrowstyle='<->', color='k', lw=2, mutation_scale=20))
Out[41]:
Text(6, -4, '')

This means that the direction of the arrow, rather than the PCA axes themselves, is what distinguishes the labels.

Therefore, we want to know what this direction means. From PCA theory, we can map the arrow direction back into the original high-dimensional PI space: the vector $(\alpha_1, \alpha_2) \in \mathbb{R}^2$ in the PCA projection space corresponds to $\alpha_1 v_1 + \alpha_2 v_2$ in the PI space.

Since the vector of the arrow is $(10, -8)$, we can reconstruct the high-dimensional vector using pca.components_.

In [42]:
a = np.array([10, -8], dtype=float)
a /= np.linalg.norm(a)
a_in_pi = a[0] * pca.components_[0] + a[1] * pca.components_[1]
a_in_pi.shape
Out[42]:
(8256,)

We can visualize the high-dimensional vector in the form of a signed histogram using histogram_from_vector.

In [43]:
vectorize_spec.histogram_from_vector(a_in_pi).plot(colorbar={"type": "linear-midpoint", "midpoint": 0.0})

The figure looks similar to the reconstructed histogram of the logistic regression coefficients. This is natural, since the arrow direction is what matters for classifying the PDs through PIs.