Two widely used algorithms for dimensionality reduction are PCA and t-SNE. Bear in mind that PCA is much faster than t-SNE.
Principal Component Analysis - PCA
Allows you to reduce the dimensionality of a dataset consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible. This is done by transforming the variables into a new coordinate system whose axes, known as principal components (or simply, the PCs), are orthogonal to each other, using spectral decomposition.
The principal components are ordered so that the first component retains the maximum possible variation from the original variables, and each subsequent component retains progressively less. The principal components are the eigenvectors of the (symmetric) covariance matrix, and hence they are orthogonal.
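A minimal NumPy sketch of this idea, using hypothetical correlated data: the PCs come out of the eigendecomposition of the covariance matrix, and the eigenvalues give the variance retained by each component.

```python
import numpy as np

# Hypothetical data: 200 samples, 5 features, with some correlation injected
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)      # 5x5 covariance matrix

# Spectral decomposition: eigenvectors of the covariance matrix are the PCs
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()   # fraction of variance per PC
X_pca = X_centered @ eigvecs[:, :2]         # project onto the first two PCs
print(explained_ratio)
```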
Use the correlation matrix instead of the covariance matrix when the data isn't scaled; this is equivalent to standardizing the variables before computing the covariance.
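In practice this is usually done with scikit-learn. A sketch (assuming scikit-learn is installed and reusing the feature matrix X from the sketch above): standardizing the features first makes PCA equivalent to working on the correlation matrix.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing first = PCA on the correlation matrix
# (use when the features are on different scales)
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca.fit_transform(X)
print(pca.named_steps["pca"].explained_variance_ratio_)
```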
t-distributed Stochastic Neighbor Embedding - t-SNE
This method calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space, then sets the probabilities in the low-dimensional space to be as similar as possible to those in the high-dimensional space. The difference between the two probability distributions is measured with the Kullback-Leibler divergence, which is minimized during optimization.
Probabilistic approach to placing samples from a high-dimensional space into a low-dimensional space so as to preserve the identity of neighbors. First, find an embedding such that the original high-dimensional neighbor distribution is approximated well by the resulting low-dimensional one (t-SNE uses the Kullback-Leibler divergence to measure the "distance" between the distributions and minimizes this objective function). This gives rise to a non-linear embedding where close-by points remain close by and well-separated points remain separated, so that clusters are preserved (although exact distances between far-apart points are not faithfully reproduced).
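A minimal sketch with scikit-learn's TSNE; the dataset and parameter values are assumptions for illustration (perplexity roughly controls the size of the neighborhood whose structure is preserved).

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# The KL divergence between the high- and low-dimensional neighbor
# distributions is minimized by gradient descent, yielding a non-linear
# 2-D embedding in which clusters of similar digits stay together.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)   # (1797, 2)
```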