Dimensionality Reduction

Iñaki Liendo

Dimensionality reduction is a well-known technique mostly used for visualization purposes. Normally, data come in very high-dimensional representations; each feature present in the dataset is represented as a dimension, so the number of dimensions adds up very rapidly. There are two problems that arise from this situation:

  1. The curse of dimensionality comes into play, where the model is prone to overfitting due to the small sample size relative to the number of features. To counteract this problem, we would have to increase the number of samples exponentially with the number of features present in the dataset. If this is infeasible ─which is often the case─, then dimensionality reduction is inevitable.

  2. Visualizing a 30-dimensional dataset is practically impossible for us humans, as the maximum number of dimensions we are able to see is three. Does this mean that we will never get a representational "picture" of the data we are trying to understand? No; one may argue that there are still other alternatives that help us get to know the data, such as comparing each feature with respect to another one. Although this might come in handy in later stages, it is important to get an overall representation of the data embedded in one figure in order to start getting familiarized with the dataset.

In this post we will check out three useful dimensionality reduction ─equally known as feature extraction─ techniques: Principal Component Analysis, Linear Discriminant Analysis and Average Neighborhood Margin Maximization. There is a lot of information out there about the first two methods; not so much about the last one. This post, as well as the others, was inspired by the work of Suárez et al. [3]. However, I am more focused on interpretation than on rigorous mathematical proofs.

Principal Component Analysis

Principal Component Analysis (PCA) is a popular dimensionality reduction technique employed across various fields of study, such as neuroscience, quantitative finance and data science [1]. Although mostly used for exploratory analysis, PCA turns out to be very useful for capturing linear correlations.

As described in Weinberger and Saul's paper, PCA is an eigenvector-based method which uses an orthogonal transformation to convert a set of possibly correlated features into an uncorrelated one. Informally, it can be thought of as fitting an ellipsoid ─or "chopping off" a distribution, as mentioned in [2]─ to the dataset, then finding its axes, known as the Principal Components (PCs), and finally performing a change of basis from the Cartesian plane to the PC subspace.

As mentioned in the above figure, the lower right plot "drops off" the second Principal Component, and a valid question is why we drop that one and not the first PC. The reason is that the goal of Principal Component Analysis is to project the points into lower dimensions without making them collapse into each other; you want them as spread out as possible [2]. So, if we were to project the data orthogonally with respect to the second Principal Component, more points would collapse into each other than if we were to project them with respect to the first one.

In terms of design, PCA is optimized so that the first PC expresses the largest amount of variance present in the data [3]; the second PC then accounts for the second most variance, and so on. Principal Component Analysis is going to help us identify the directions along which we have a lot of variance.
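To make the variance ordering concrete, here is a minimal NumPy sketch (on made-up toy data, not the dataset behind the figures in this post) that recovers the Principal Components as the eigenvectors of the sample covariance matrix and sorts them by the variance they capture:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data with two correlated features (purely illustrative).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

# Center the data and compute the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# The Principal Components are the eigenvectors of the covariance matrix;
# each eigenvalue is the variance captured along its eigenvector.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance per PC:         ", eigvals)
print("explained variance ratio:", eigvals / eigvals.sum())

# Change of basis: express the data in the PC coordinate system.
X_pca = X_centered @ eigvecs
```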

So, how much variance/information are we missing out on when selecting only a subset of dimensions to represent our original dataset? This can be measured by calculating the reconstruction loss, i.e. the Mean Squared Error (MSE) between the original data and its reconstruction from the retained components, or by fetching the fraction of variance each retained Principal Component accounts for and summing them up.

If we want to reduce the number of dimensions of the dataset, then getting rid of highly-correlated features is a good idea, since the variance they account for is already taken care of by their counterparts (see Figure 2). As a consequence, keeping only uncorrelated features is what we should be aiming for, as they explain most of the variance present in the data; thus the reconstruction error will be low. As an example, Figure 3 shows the first 64 Principal Components of the 1024-dimensional Olivetti Faces dataset, which in total account for 89.71% of the variance in the original data.
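As a rough sketch of that example, scikit-learn's copy of the Olivetti Faces dataset (which ships as 64×64 images, i.e. 4096 dimensions, so the exact percentage will differ from the figure) lets us check both the cumulative explained variance and the reconstruction error:

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# 400 face images, each flattened into a single feature vector.
faces = fetch_olivetti_faces()
X = faces.data

pca = PCA(n_components=64)
X_reduced = pca.fit_transform(X)

# Share of the total variance captured by the first 64 Principal Components.
print("explained variance:", pca.explained_variance_ratio_.sum())

# Reconstruction loss: mean squared error between the original images
# and their projections mapped back to pixel space.
X_reconstructed = pca.inverse_transform(X_reduced)
print("reconstruction MSE:", np.mean((X - X_reconstructed) ** 2))
```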

As we have seen, Principal Component Analysis is an efficient dimensionality reduction technique that lowers the computational cost of downstream tasks. One common drawback of PCA is due to its unsupervised nature: separating the classes from each other is not one of the algorithm's goals, so a classification model may perform sub-optimally when used alongside PCA. That is where Linear Discriminant Analysis comes into play, as it is designed both to reduce the number of dimensions and to maximize the distance between different classes.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a supervised distance metric learning algorithm that exploits class information so as to maximize the ratio of between-class variance to within-class variance. In other words, the linear transformation matrix $L$ is formed by the eigenvectors corresponding to the $n$ largest eigenvalues of $S_w^{-1} S_b$, where $S_w$ measures the within-class variance and $S_b$ the between-class variance. Check out [3] for the derivation of why this is true.
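A minimal NumPy sketch of that eigenproblem, assuming a generic labeled dataset X, y, might look like this (scatter matrices are accumulated per class and the eigenvectors of $S_w^{-1} S_b$ are sorted by decreasing eigenvalue):

```python
import numpy as np

def lda_directions(X, y, n_components):
    """Return the top LDA directions as columns of a (d, n_components) matrix."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]

    S_w = np.zeros((d, d))   # within-class scatter
    S_b = np.zeros((d, d))   # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)

    # Eigenvectors of S_w^{-1} S_b, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

# Usage: X_lda = X @ lda_directions(X, y, n_components=2)
```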

That is all to say that LDA does not pay attention to the overall variance of the data, like PCA does. Since it has access to the class each data point belongs to, Linear Discriminant Analysis focuses on maximizing the separability between two or more clusters.

Choosing the number of axes onto which to transform the data is the same for both PCA and LDA: we only need to keep the ones that account for almost all the variance. We can obtain the amount of variance an axis holds by calculating the quotient between its eigenvalue and the sum of all of them ($\mathrm{var}_i = \frac{\lambda_i}{\sum_j \lambda_j}$), and then select all of the axes that are above a certain threshold (see Figure 3). A minor difference from PCA is that the total number of axes we can obtain from LDA is subject to the total number of classes in the data [3]. Namely, if we have $r$ classes, then LDA will be able to find at most $r-1$ directions/axes that maximize the separability of the classes.
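For instance, a quick scikit-learn sketch on the Iris dataset (chosen here only because it has $r = 3$ classes and therefore at most two discriminant axes):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Iris has r = 3 classes, so LDA can produce at most r - 1 = 2 axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Fraction of the discriminative variance captured by each axis.
print(lda.explained_variance_ratio_)
```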

One major drawback of LDA is that it assumes the data follows a Gaussian distribution, since its derivation relies on normally distributed classes. This is rarely the case, but it has been empirically shown [4] that even if the condition does not hold, LDA still does a pretty good job at both dimensionality reduction and class separation. Suárez et al. [3] argue that the algorithm may extract features not well-suited for kNN classification when the distribution of the data is non-Gaussian.

Like every other algorithm, Linear Discriminant Analysis has its limitations, and it is very important to identify them so that we do not end up working with a black-box mindset. The first one was already mentioned: LDA might extract spurious features for kNN to work on if the data is non-Gaussian. Another is that a lot of the information conveyed by our original data may be lost during the transformation, since it only allows for $r-1$ axes to be fetched. Finally, if the size of the dataset is very small, LDA fails to extract the eigenvectors which maximize class separability [3].

Wouldn't it be nice if there were an algorithm that combines the advantages of both PCA and LDA? Luckily there is, and it is called Average Neighborhood Margin Maximization. Unfortunately, it is not as popular as the previous algorithms; nonetheless, it is considered a very efficient method for feature extraction [5] that does not require lots of data to perform well.

Average Neighborhood Margin Maximization

The Average Neighborhood Margin Maximization (ANMM) algorithm attempts to do the same as LDA, that is, maximize class separability. The differences ─and major advantages─ of ANMM are that the number of resulting dimensions does not depend on the number of classes, it does not assume a Gaussian distribution in the data, and, most importantly, it can operate on datasets with a low number of samples.

This supervised dimensionality reduction technique maximizes the average neighborhood margin, which is defined through two types of neighborhoods:

  • The homogeneous neighborhood $N_i^o$ is the set of the $\xi$ nearest points that are similar to a data point $x_i$, i.e. have the same class. This in turn defines a compactness matrix $C = \sum_{i,\, j:\, x_j \in N_i^o} \frac{(x_i - x_j)(x_i - x_j)^T}{|N_i^o|}$. Basically, it measures, on average, how close each data point is to its neighbors $x_j \in N_i^o$ belonging to the same class.

  • Analogously, the heterogeneous neighborhood $N_i^e$ is the set of the $\zeta$ nearest points that are dissimilar to the data point $x_i$. Its scatterness matrix $S$ measures the average spread between dissimilar points: $S = \sum_{i,\, k:\, x_k \in N_i^e} \frac{(x_i - x_k)(x_i - x_k)^T}{|N_i^e|}$.

Since we want to separate the points belonging to the heterogeneous neighborhood from those in the homogeneous one, we maximize their difference; this is known as the average neighborhood margin $\gamma$. By maximizing $\gamma$ we obtain our transformation matrix $L$, whose columns can be proved [5] to be eigenvectors of $S - C$. Dimensionality reduction can then be achieved by taking the $l$ eigenvectors of $S - C$ whose corresponding eigenvalues are the $l$ largest ones.
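A direct, unoptimized NumPy translation of these formulas (a sketch, not the implementation from [5] or from Suárez et al.'s module) could look like this:

```python
import numpy as np

def anmm_directions(X, y, n_components, xi=3, zeta=3):
    """Top-l ANMM directions: eigenvectors of S - C with the largest eigenvalues."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    C = np.zeros((d, d))  # compactness: same-class neighborhoods
    S = np.zeros((d, d))  # scatterness: different-class neighborhoods

    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        # xi nearest homogeneous and zeta nearest heterogeneous neighbors of x_i.
        homo = same[np.argsort(dist[i, same])[:xi]]
        hetero = diff[np.argsort(dist[i, diff])[:zeta]]

        for j in homo:
            v = (X[i] - X[j]).reshape(-1, 1)
            C += (v @ v.T) / len(homo)
        for k in hetero:
            v = (X[i] - X[k]).reshape(-1, 1)
            S += (v @ v.T) / len(hetero)

    # Maximize the average neighborhood margin: top eigenvectors of S - C.
    eigvals, eigvecs = np.linalg.eigh(S - C)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]]

# Usage: X_anmm = X @ anmm_directions(X, y, n_components=2)
```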

In Figure 8 we can appreciate how different parameterizations of $\xi$ and $\zeta$ affect the transformation (Suárez et al. refer to them as num_friends and num_enemies, respectively, in their Python module for DML). We can observe how the number of friends helps the clusters become denser, while the number of enemies encourages class separation. These hyperparameters are similar to the number of neighbors used in kNN: if the number is too low, then the linear transformation will overfit; conversely, if it is too high, the transformation will underfit.

Average Neighborhood Margin Maximization was introduced by Wang & Zhang [5] as a response to LDA's major problem: if the dataset has a small sample size, then $S_w$ becomes singular (thus, not invertible) and the eigenvectors of $S_w^{-1} S_b$ cannot be derived. ANMM avoids this problem by not having to invert any matrix whatsoever. Two other advantages that this dimensionality reduction technique holds over LDA are that the number of desired dimensions does not depend on the number of classes, and that it does not make any assumptions about the distribution of the classes [3].

Conclusions

As we have seen in the last three sections, dimensionality reduction is a very useful method for feature extraction, which in turn helps combat the well-known and much feared curse of dimensionality. Unsupervised algorithms such as Principal Component Analysis are useful when trying to get rid of the linear correlations present in the dataset; similarly, supervised algorithms like Linear Discriminant Analysis and Average Neighborhood Margin Maximization maximize class separability while still reducing the number of dimensions. The common factor is the change of basis all of these transformations perform in order to account for the most variance present in the data.

By keeping only the most informative features (i.e. those accounting for most of the variance) we are able to achieve a low computational cost with respect to both space and time. A lot of people use dimensionality reduction only for understanding and visualizing the data, which is one of the first steps a data scientist should take, as blindly employing a black-box model will be of no use. However, this is not its sole purpose: it also comes in handy when building a data pipeline that feeds a model which needs to be both fast and accurate. There are a lot of ways dimensionality reduction can help us solve a problem, but these two are the most popular go-to use cases.
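As an illustration of the pipeline use case, here is a small scikit-learn sketch; the digits dataset, the 16 retained components and the kNN classifier are arbitrary choices for demonstration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Reduce the 64-dimensional digits to 16 dimensions before classification,
# trading a little accuracy for a much cheaper kNN search.
pipeline = Pipeline([
    ("pca", PCA(n_components=16)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

print(cross_val_score(pipeline, X, y, cv=5).mean())
```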

Finally, let us take a quick survey for each algorithm explained in this post:

  • Principal Component Analysis (PCA) is an unsupervised learning algorithm that estimates a new set of axes ─or Principal Components (PCs), as they are known in this context─ which account for the most variance present in the dataset. It then transforms the original correlated high-dimensional dataset into an uncorrelated lower-dimensional representation by performing a linear transformation that minimizes the reconstruction error, as measured by the Euclidean distance.

  • Linear Discriminant Analysis (LDA) optimally separates classes by maximizing the ratio of between-class variance to within-class variance. It ends up being very useful when it comes to creating lower-dimensional representations of the original data. Nonetheless, for purposes such as classification, LDA relies on the total number of classes, on the data being normally distributed, and on the sample size not being too small.

  • Average Neighborhood Margin Maximization (ANMM) addresses LDA's main shortcomings by maximizing the margin between homogeneous neighborhoods and heterogeneous neighborhoods. Furthermore, it does not assume any class distribution whatsoever and also performs well on small sample sizes.

The next post is going to be about algorithms that improve Nearest Neighbors classifiers, particularly k-Nearest Neighbors. I hope you found this post as useful to read as it was for me to write.

References

[1] Principal Component Analysis. (n.d.). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Principal_component_analysis

[2] MIT OpenCourseWare. (2017, Aug 17). 19. Principal Component Analysis [Video file]. Retrieved from https://www.youtube.com/watch?v=WW3ZJHPwvyg.

[3] Suárez, J. L., García, S., & Herrera, F. (2018). A Tutorial on Distance Metric Learning: Mathematical Foundations, Algorithms and Software. arXiv preprint arXiv:1812.05944.

[4] Raschka, S. (2014). Linear Discriminant Analysis. Retrieved from https://sebastianraschka.com/Articles/2014_python_lda.html.

[5] Wang, F., & Zhang, C. (2007). Feature Extraction by Maximizing the Average Neighborhood Margin. IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2007.383124.