Distance Metric Learning

Iñaki Liendo @Sebasti69376202

When dealing with classification problems we are aiming for our model to accomplish two distinct goals, namely i) minimize the distance between similar points and ii) maximize the distance between dissimilar points.

Popular approaches focus on predefined metrics such as categorical crossentropy and cosine similarity. Though these similarity metrics work well in general, there will come a time where we need to design our own distance metric specific to our problem in order for the model to achieve optimal performance. This is what Distance Metric Learning (DML) is all about: coming up with that problem-specific similarity measure, based on our training data [1]. It also has applications on dimensionality reduction techniques, which I'll cover in one post of this four-part blog series.

Distance metric learning has been demonstrated to be very useful when used jointly with k-nearest neighbors (kNN) [2], as it helps learn an appropriate distance metric instead of just using Euclidean distance. "Unfortunately, Euclidean distances ignore any statistical regularities that might be estimated from a large training set of labeled examples. Ideally, one would like to adapt the distance metric to the application at hand." [2]

In order to learn useful metrics, researchers decided to generalize the notion of euclidean distance by performing a linear transformation LL before calculating the squared distanceDL(xi,xj)=L(xixj)22D_{L}(\vec{x}_{i},\vec{x}_{j})=||L(\vec{x}_{i}-\vec{x}_{j})||^2_{2}. The core challenge for DML is then to estimate LL so as to minimize the distance DD. Other researchers have gone a step further and defined the Mahalanobis metrics, which hold a special place for computing distances in kNN classification.

Mahalanobis Metrics

The only difference between learning a linear transformation LL and a Mahalanobis distance metric DM(xi,xj)=(xixj)TM(xixj)D_{M}(\vec{x_{i}}, \vec{x_{j}}) = (\vec{x_i} - \vec{x_j})^TM(\vec{x_i}-\vec{x_j}) is that on the latter, the matrix MM needs to be positive semidefinite, meaning that DMD_{M} is more prone to achieve convexity than DLD_{L}. On the other hand, lower-projections are easier to achieve using the matrix LL[1, 2].

As a side note, I would like to mention that although it can be proved that DL=DMD_{L}=D_{M}the interpretation for both is different since using the matrix LL means that the learned distance is euclidean after performing the transformation; and using the matrix MM implies learning a whole new metric [1].


Even when DML was introduced as a generalization of measuring distances ─going from Euclidean to Mahalanobis metrics─ in order to learn tailored losses suitable for our specific problem, it is applicable to other domains as well [1]:

  • Dimensionality reduction: by allowing either transformation matrix to be low-rank, we are learning a linear mapping from a high-dimensional space to a low-dimensional one. This may be useful for combating the dimensionality curse, which a lot of machine learning algorithms are susceptible to fall in; keeping a low computational cost in both space and time; and being able to visualize the dataset, as this may be useful for both debugging and familiarizing yourself with the distribution of your data.

  • Axes selection: although quite similar to dimensionality reduction, it allows for the coordinate axes to reveal information about the data, such as its variance and principal components.

  • Clustering algorithms: they tackle the problem that I first mentioned on this entry, which is that of keeping similar points together while separating them from dissimilar ones.

As we can see, Distance Metric Learning is not just about trying to improve the model's accuracy by some percentage with some specially designed loss function. It is also used as a means to arrange the data in such a way that it is more readable for the user, and thus, easier to present. There are several approaches to the DML problem such as eigenvector methods and convex optimization [2] which we will not talk about as they are not within the scope of this post. We will instead analyze the techniques that try to optimize a problem, along with the algorithms they make use of.

Types of problems

Following the work of Suárez et al. [1], there are four techniques that attempt to solve or optimize a problem:

  1. Dimensionality Reduction Techniques. They try to estimate a linear transformation matrix that maps from the dataset space to a lower dimensional space.

  2. Algorithms to Improve Nearest Neighbor Classifiers. Algorithms specifically designed to improve the accuracy of kNN such as Neighborhood Components Analysis and Large Margin Nearest Neighbors.

  3. Algorithms to Improve Nearest Centroid Classifiers. Algorithms specifically designed to improve distance-based classifiers such as NCMML and NCMC.

  4. Information Theory Based Algorithms. After estimating a probability distributions over the data, use algorithms that bring them closer or further away based on their divergences.

In these posts series I will only talk about the first three techniques, as they are ─at least in my experience─ more popular than the last one. Each technique will then be thoroughly analyzed, as well as its corresponding algorithms. So feel free to go check out the one that most catches your attention.


Summing everything up, it is possible for us to search for alternatives instead of mindlessly employing predefined metrics, especially when it comes to classification problems. We are not bound to use handcrafted designs since we can make machines come up with tailored solutions for our specific problems; this is what machine learning is about! However, it does not mean that we are free to crunch some random numbers into a black box and expect nice things to come out of it.

Using Distance Metric Learning for solving problems is one more weapon in the arsenal. There are times where it will come in handy, and there are times where it won't. The typical use cases, as aforementioned, for DML are combating the dimensionality curse, saving computational costs, interpreting data and improving accuracy, among others.

In the next posts we will analyze the inner-workings of the most popular techniques used within DML.


[1] Suárez, J. L., García, S., & Herrera, F. (2018). A Tutorial on Distance Metric Learning: Mathematical Foundations, Algorithms and Software. arXiv preprint arXiv:1812.05944.

[2] Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb), 207-244.