# Optimal Transport Methods in Machine Learning

## Introduction

Optimal transport concepts applied to machine learning are often referred to collectively as computational optimal transport (OT). At its core, machine learning is concerned with comparing complex objects, and measuring such similarities requires a metric, that is, a distance function. Optimal transport is a natural tool here, as it provides principled distance functions between probability distributions (e.g. the Wasserstein metric).

In addition, optimal transport respects the underlying structure and geometry of a problem while providing a framework for comparing probability distributions. Optimal transport methods have received attention from researchers in fields as varied as economics, statistics, and quantum mechanics. Applications of OT methods in a machine learning context can be divided into four categories: learning, domain adaptation, Bayesian inference, and hypothesis testing [1].

### Learning Methods

These methods use transport-based distances in a wide range of research contexts, from language analysis to image recognition. To a certain extent, the geometry underlying optimal transport models helps to generate faster algorithms as well as establish connections between datasets. Optimal transport models still have limitations, however, as some structure may not be captured. This structure can be described as either intrinsic or extrinsic: it is intrinsic if the distributions themselves reflect structured objects (e.g. images with segments), and extrinsic if additional information creates structure (such as groupings) [2]. The following are some types of models that fall into the learning-methods category of OT.

#### Graph-based Semi-supervised Learning

Graph-based semi-supervised learning is an effective approach for classification across a large variety of domains, including image and text classification. The method propagates label information along a graph and is therefore particularly useful when much of the data is unlabeled. To imagine what data is best analyzed by this type of model, consider a graph with multiple nodes. These graph nodes are usually associated with a simple label, like a real number. Now consider more complex node labels, such as probability distributions or histograms. A concrete example of this would be a set of traffic cameras distributed across a geographical area, each of which produces a histogram or a distribution of commuter traffic over a given period of time. Many other interesting research questions exhibit a similar structure (e.g. climate data, rankings of restaurants, etc.). Optimal transport-inspired methods have used the two-Wasserstein distance between distributions to analyze models of this type [3]. The work of Solomon et al. in this area led to additional insights by Gao et al., where Wasserstein propagation on graphs was used as the basis for a message-passing algorithm. This technique facilitated a generalization that could be applied to hypergraphs through Wasserstein barycenters [4].
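To make the histogram-valued labels concrete, the 2-Wasserstein distance between two 1-D histograms can be computed in closed form via quantile functions. The following is a minimal NumPy sketch; the node histograms and bin support are hypothetical stand-ins for, say, per-camera traffic distributions, and this is not the propagation algorithm of [3] itself:

```python
import numpy as np

# Quantile-based 1-D 2-Wasserstein distance between histograms p and q
# defined over the same support (bin centers).
def w2_1d(p, q, support):
    grid = np.linspace(0.0, 1.0, 200, endpoint=False) + 0.5 / 200
    Fp, Fq = np.cumsum(p), np.cumsum(q)          # empirical CDFs
    qp = support[np.searchsorted(Fp, grid)]      # quantile function of p
    qq = support[np.searchsorted(Fq, grid)]      # quantile function of q
    return np.sqrt(np.mean((qp - qq) ** 2))

support = np.arange(5, dtype=float)          # hypothetical bin centers
p = np.array([0.7, 0.2, 0.1, 0.0, 0.0])      # node A: traffic peaked early
q = np.array([0.0, 0.0, 0.1, 0.2, 0.7])      # node B: traffic peaked late
r = np.array([0.6, 0.3, 0.1, 0.0, 0.0])      # node C: similar to node A
print(w2_1d(p, q, support) > w2_1d(p, r, support))  # True: A is closer to C
```

Unlike a bin-wise comparison, this distance accounts for how far mass must move across the support, which is what makes it sensible for ordered labels such as hourly traffic counts.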

#### Generative Adversarial Networks (GAN)

Generative adversarial networks are a class of machine learning frameworks developed in 2014 by Ian Goodfellow, in which two neural networks compete against each other in a game-theoretic sense. If a GAN is trained on a given set of training data, it can produce new data that is superficially similar to the original set (for example, it can create new, similar images based on an initial training set of images). These techniques have been used in semi-supervised learning. They offer increased flexibility in the choice of objective function, including the use of f-divergences and the Jensen-Shannon divergence [5].

##### Wasserstein GAN (WGAN)

The Wasserstein GAN is a tool for unsupervised learning that minimizes the distance between the distribution of the training data and the distribution of the generated data. In certain cases this produces a more stable training process. It minimizes an approximation of the earth mover's (EM) distance, and the optimization problem associated with the WGAN was shown to be sound by Arjovsky et al. This type of GAN reduces some of the problems associated with GAN approaches, including the need to carefully balance the training of the generator and the discriminator (here called the "critic") [5]. The WGAN also estimates the EM distance continuously while training the critic [5]. The following is an algorithm implementing the WGAN approach [5].

#### Algorithm: Wasserstein Generative Adversarial Networks (WGAN)

Require: $\alpha$, the learning rate; $c$, the clipping parameter; $m$, the batch size; $n_{\text{critic}}$, the number of critic iterations per generator iteration.
Require: $w_0$, initial critic parameters; $\theta_0$, initial generator parameters.
1: while $\theta$ has not converged do
2:    for $t = 0, \ldots, n_{\text{critic}}$ do
3:          Sample $\{x^{(i)}\}_{i=1}^{m} \sim \mathbb{P}_r$, a batch from the real data.
4:          Sample $\{z^{(i)}\}_{i=1}^{m} \sim p(z)$, a batch of prior samples.
5:          $g_w \leftarrow \nabla_w \left[ \frac{1}{m} \sum_{i=1}^{m} f_w\left(x^{(i)}\right) - \frac{1}{m} \sum_{i=1}^{m} f_w\left(g_\theta\left(z^{(i)}\right)\right) \right]$
6:          $w \leftarrow w + \alpha \cdot \operatorname{RMSProp}\left(w, g_w\right)$
7:          $w \leftarrow \operatorname{clip}(w, -c, c)$
8:    end for
9:    Sample $\{z^{(i)}\}_{i=1}^{m} \sim p(z)$, a batch of prior samples.
10:   $g_\theta \leftarrow -\nabla_\theta \frac{1}{m} \sum_{i=1}^{m} f_w\left(g_\theta\left(z^{(i)}\right)\right)$
11:   $\theta \leftarrow \theta - \alpha \cdot \operatorname{RMSProp}\left(\theta, g_\theta\right)$
12: end while



Recall that the EM distance is both continuous and differentiable; this implies that the critic should be trained toward optimality. The longer the critic is trained, the more reliable the Wasserstein gradient becomes. The Wasserstein GAN can still have problems due to the enforcement of the Lipschitz constraint on the critic, which requires weight clipping (reflected in the algorithm) [6].
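To make the critic update and weight clipping concrete, the following is a deliberately tiny NumPy sketch with a linear critic and a shift-only generator. These modeling choices (linear $f_w$, fixed $\theta$, the Gaussian data) are hypothetical simplifications for illustration; a real WGAN uses neural networks for both players and alternates generator updates as in the algorithm above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setting: linear critic f_w(x) = w * x and shift generator
# g_theta(z) = theta + z. Real data ~ N(3, 1); prior samples z ~ N(0, 1).
alpha, c, m = 0.05, 0.01, 64   # learning rate, clipping parameter, batch size
w, sq_avg = 0.0, 0.0           # critic weight and RMSProp running average
theta = 0.0                    # fixed here; a full loop would also update it

for _ in range(100):
    x = rng.normal(3.0, 1.0, m)        # batch from the real data
    z = rng.normal(0.0, 1.0, m)        # batch of prior samples
    # Critic gradient: d/dw [ mean f_w(x) - mean f_w(g_theta(z)) ]
    g_w = np.mean(x) - np.mean(theta + z)
    # RMSProp ascent step on w, then clipping to bound the critic's slope
    sq_avg = 0.9 * sq_avg + 0.1 * g_w ** 2
    w = w + alpha * g_w / (np.sqrt(sq_avg) + 1e-8)
    w = float(np.clip(w, -c, c))

# The clipped critic weight always stays inside [-c, c]
print(abs(w) <= c)  # True
```

The clipping step is exactly the crude Lipschitz enforcement criticized in [6], which proposes a gradient penalty instead.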

#### Restricted Boltzmann Machines (RBM)

Restricted Boltzmann machines (RBM) are probabilistic graphical models that can obtain hierarchical features at multiple levels. Originally created under the name "Harmonium" in 1986 by Smolensky, an RBM can learn a probability distribution over a given set of inputs. These machines are able to learn complex and multi-scale distributions based on empirical data. The RBM generally has a layer of latent variables with a given probability distribution, defined over a set of binary observed variables, with parameters that are to be learned.

Learning of the model parameters is usually guided by the Kullback-Leibler divergence (from the training sample to the model). In research by Montavon et al., it is assumed that a metric can be established between observations [7]. This was used to define the Wasserstein distance between the training sample distribution and the distribution formed by the Boltzmann machine.

#### Entropy-regularized Wasserstein Loss

This loss has been used for multi-label classification. It is characterized by a relaxation of the transport problem that accommodates unnormalized measures, replacing the equality constraints on the marginals with soft penalties based on the KL-divergence [8].
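The entropic regularization underlying this kind of loss is typically computed with Sinkhorn iterations. The following is a minimal NumPy sketch of the balanced case, in which each update matches a marginal exactly; the unbalanced variant described above would replace these exact updates with softened, KL-penalized ones. The toy distributions and cost matrix are made up for illustration:

```python
import numpy as np

# Sinkhorn iterations for the entropically regularized transport problem
# between discrete distributions a and b with ground-cost matrix M.
def sinkhorn(a, b, M, reg, n_iter=1000):
    K = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # scale columns toward marginal b
        u = a / (K @ v)                  # scale rows toward marginal a
    return u[:, None] * K * v[None, :]   # plan P = diag(u) K diag(v)

a = np.array([0.6, 0.4])                       # toy source distribution
b = np.array([0.3, 0.3, 0.4])                  # toy target distribution
M = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])                # toy cost matrix
P = sinkhorn(a, b, M, reg=0.1)
print(np.allclose(P.sum(axis=1), a))  # True: row marginals match a
```

Each iteration only involves matrix-vector products, which is what makes this regularized loss practical inside learning pipelines.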

## Domain Adaptation

In domain adaptation the goal is to learn about or extrapolate from one domain to another, often by finding domain-invariant representations [9]. In the cases where these models are useful, the distributions are generally different but related. The technique is often used to transfer information from labeled data to unlabeled data. For example, in some cases labels are available in the source domain, but the classification needs to be conducted on the target domain. If the classifier is trained only on the source domain, it will suffer performance deficiencies in the target domain [10].

By computing the best transportation plan connecting the probability distributions of the source and target domains, the learning samples can be transported into the target domain. The transformation is non-linear and invertible, which allows a variety of machine learning methods to be used on the transformed dataset. Regularized, unsupervised models have been used, as well as Joint Class Proportion and Optimal Transport (JCPOT) to address multi-source domain adaptation [8]. The following is a summary of an algorithm implemented by Redko et al. [11]:

#### Algorithm: Joint Class Proportion and Optimal Transport (JCPOT)

1: Parameters: $\epsilon$, maxIter, $\forall k \; (\mathbf{C}^{(k)}$ and $\lambda^{(k)})$
2: $cpt \leftarrow 0$, $err \leftarrow \infty$
3: $\zeta^{(k)} \leftarrow \exp\left(-\frac{\mathbf{C}^{(k)}}{\epsilon}\right), \quad \forall k$
4: while $cpt <$ maxIter and $err >$ threshold do
5:   $\zeta^{(k)} \leftarrow \operatorname{diag}\left(\frac{\mathbf{m}}{\zeta^{(k)} \mathbf{1}}\right) \zeta^{(k)}, \quad \forall k$
6:   $\mathbf{h}^{(cpt)} \leftarrow \exp\left(\sum_{k=1}^{K} \lambda^{(k)} \log\left(\mathbf{D}_1^{(k)} \zeta^{(k)} \mathbf{1}\right)\right)$
7:   $\zeta^{(k)} \leftarrow \zeta^{(k)} \operatorname{diag}\left(\frac{\mathbf{D}_2^{(k)} \mathbf{h}}{\zeta^{(k)} \mathbf{1}}\right), \quad \forall k$
8:   $err \leftarrow \left\|\mathbf{h}^{(cpt)} - \mathbf{h}^{(cpt-1)}\right\|_2$
9:   $cpt \leftarrow cpt + 1$
10: end while


This algorithm describes the optimization problem solved by Redko et al., which used operators defined such that the class proportions between the source and target domains are equal. A Wasserstein barycenter problem was used to achieve this [11][12].
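The transport-then-learn idea behind these adaptation methods can be sketched end to end in NumPy: solve an entropically regularized OT problem between source and target samples, then move each source point to the barycenter of its matched targets. The Gaussian domains, the regularization scale, and the sample sizes below are all illustrative choices, and this is the basic single-source barycentric mapping rather than the multi-source JCPOT procedure above:

```python
import numpy as np

rng = np.random.default_rng(3)

Xs = rng.normal(0.0, 1.0, (50, 2))           # source samples
Xt = rng.normal(0.0, 1.0, (60, 2)) + 4.0     # target samples: shifted domain

a = np.full(50, 1 / 50)                      # uniform source weights
b = np.full(60, 1 / 60)                      # uniform target weights
M = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # squared-distance cost

K = np.exp(-M / (0.1 * M.mean()))            # Gibbs kernel, scaled reg
u = np.ones(50)
for _ in range(300):                         # Sinkhorn iterations
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]              # transport plan

# Barycentric mapping: each source point moves to the weighted mean of the
# target points it is coupled with; a classifier could then be trained here.
Xs_mapped = (P @ Xt) / P.sum(axis=1, keepdims=True)
print(np.allclose(Xs_mapped.mean(axis=0), Xt.mean(axis=0), atol=0.1))  # True
```

After the mapping, the transported source cloud sits on the target domain, so labels attached to `Xs` can be used to train a classifier that is evaluated on `Xt`.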

## Bayesian Inference

An important aspect of Bayesian inference is calculating expectations with respect to a posterior probability distribution. This is a complicated problem, mainly because of the multidimensional integrals involved. The method usually used to solve these problems is Markov chain Monte Carlo (MCMC), which in turn introduces complications (slow convergence to the empirical expectation, lack of sample independence) [1]. The need for these complicated MCMC models was bypassed in the work of El Moselhy and Marzouk [13]. In their transport-based method, Bayesian inference was formulated in terms of optimal transport theory, and the existence and uniqueness of a measure-preserving map from the prior to the posterior was established. The map was subsequently parametrized and computed via the solution of an optimization problem. This approach bypasses many of the problems associated with MCMC models.
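In one dimension the measure-preserving monotone map has a simple closed form, $T = F_{\text{posterior}}^{-1} \circ F_{\text{prior}}$, which can be illustrated empirically. The Gaussian prior and "posterior" below are made-up stand-ins; El Moselhy and Marzouk parametrize and optimize such maps in high dimensions rather than estimating them from posterior samples as done here [13]:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-D example: push prior samples to the posterior with the
# monotone map T = F_posterior^{-1} o F_prior, estimated from samples.
prior = rng.normal(0.0, 2.0, 5000)        # samples from the prior
posterior = rng.normal(1.0, 0.5, 5000)    # samples from the target posterior

def transport_map(x, source, target):
    # Empirical CDF of `source` evaluated at x, then target quantile function
    u = np.searchsorted(np.sort(source), x) / len(source)
    return np.quantile(target, np.clip(u, 0.0, 1.0))

pushed = transport_map(prior, prior, posterior)
# The pushed-forward samples reproduce the posterior's moments
print(abs(pushed.mean() - 1.0) < 0.05, abs(pushed.std() - 0.5) < 0.05)
```

Once such a map is available, posterior expectations reduce to averages of $g(T(x))$ over independent prior samples, avoiding MCMC's correlated chains.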

## Hypothesis Testing

There are many methods of hypothesis testing that connect to machine learning. This might be more colloquially described as "goodness of fit," a summary of how well the observed data match the data predicted by a given model. The most common tests used in this context are the chi-square, Kolmogorov-Smirnov, and Anderson-Darling tests. These tests are generally based on some distance between probability laws [14]. As such, it is not surprising that del Barrio et al. described tests of goodness of fit based on the $L_2$-Wasserstein distance [14]. The Wasserstein distance was also used (with an entropic smoothing) to investigate the connections between univariate tests (PP/QQ plots, ROC/ODC curves) and multivariate tests (energy statistics, kernel-based maximum mean discrepancy) [15].
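A Wasserstein two-sample test in the spirit of [15] can be sketched with a permutation test, since in 1-D the 1-Wasserstein distance between equal-size samples reduces to an average difference of order statistics. The Gaussian samples, sample sizes, and permutation count below are illustrative choices, not part of the cited procedures:

```python
import numpy as np

rng = np.random.default_rng(1)

def w1(x, y):
    # 1-D 1-Wasserstein distance between equal-size samples: average
    # absolute difference between sorted samples (the quantile coupling)
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def permutation_test(x, y, n_perm=500):
    # Two-sample test using the Wasserstein distance as the test statistic:
    # compare the observed distance against distances under random relabeling
    observed = w1(x, y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if w1(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

x = rng.normal(0.0, 1.0, 100)
y = rng.normal(1.0, 1.0, 100)   # same shape, shifted location
stat, p = permutation_test(x, y)
print(p < 0.05)  # True: the shift is detected
```

Because the statistic is a genuine distance between empirical distributions, the same recipe detects differences in spread or shape, not just in location.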

Recent research on cutting edge topics relating to optimal transport applications to machine learning can be found at MSRI Workshops.

Tutorials on machine learning methods can be found at the following: Optimal Transport for Machine Learning.

## References

1. Kolouri, Soheil, Serim Park, Matthew Thorpe, Dejan Slepcev, Gustavo Rohde. Optimal Mass Transport: Signal Processing and Machine-Learning Applications. IEEE Signal Process Mag. 2017 Jul. 34(4): 43-59.
2. Alvarez-Melis, David, et al. Structured Optimal Transport. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84.
3. Solomon, J., et al. Wasserstein Propagation for Semi-Supervised Learning. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32.
4. Gao, Tingran, et al. Wasserstein Soft Label Propagation on Hypergraphs: Algorithm and Generalization Error Bounds. Univ. of Chicago. 2019
5. Arjovsky, Martin, et al. Wasserstein GAN. arXiv:1701.07875v3 [stat.ML] 6 Dec 2017
6. Gulrajani, I., et al. Improved Training of Wasserstein GANs. arXiv:1704.00028v3 [cs.LG] 25 Dec 2017.
7. Montavon, G., Klaus-Robert Müller, Marco Cuturi. Wasserstein Training of Restricted Boltzmann Machines. 30th Conference on Neural Information Processing Systems (NIPS 2016).
8. Peyre, Gabriel, and Marco Cuturi, "Computational Optimal Transport", Github, https://optimaltransport.github.io/pdf/ComputationalOT.pdf.
9. Courty, Nicolas, et al. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. X, No. X, Jan XX. 2018
10. Flamary, R. Domain Adaptation with Optimal Transport: From Mapping to Learning with Joint Distribution
11. Redko, I., et al. Optimal Transport for Multi-source Domain Adaptation under Target Shift. arXiv:1803.04899v1 [stat.ML] 13 Mar 2018.
12. Benamou, Jean-David, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, & Peyré, Gabriel. 2015. Iterative Bregman projections for regularized Transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138
13. El Moselhy, Tarek A., Youssef M Marzouk. Bayesian Inference with Optimal Maps. Journal of Computational Physics 231(23). September 2011.
14. del Barrio E, Cuesta-Albertos JA, Matrán C, et al. Tests of goodness of fit based on the l_2-Wasserstein distance. The Annals of Statistics. 1999; 27(4):1230–1239
15. Ramdas, Aaditya, Nicolás García Trillos, and Marco Cuturi. On Wasserstein Two-Sample Testing and Related Families of Nonparametric Tests. Entropy 2017, 19(2), 47