# Sinkhorn's Algorithm

Sinkhorn's Algorithm is an iterative numerical method used to obtain an optimal transport plan ${\displaystyle \pi \in \Gamma (\alpha ,\beta )}$ for the Kantorovich Problem with entropic regularization in the case of finitely supported positive measures ${\displaystyle \alpha ,\beta \in {\mathcal {M}}_{+}(X)}$. For further reading see Peyré & Cuturi (pg. 62-73)[1]

## Continuous Problem Formulation

Entropic regularization modifies the Kantorovich problem by adding a Kullback-Leibler divergence term to the optimization goal. Specifically, the general form of the problem is now to determine

${\displaystyle L_{c}^{\epsilon }(\alpha ,\beta )=\inf _{\pi \in \Gamma (\alpha ,\beta )}\underbrace {\int _{X\times Y}c(x,y)\mathop {} \!\mathrm {d} \pi (x,y)} _{{\text{Kantorovich functional }}\mathbb {K} (\pi )}+\epsilon \underbrace {\operatorname {KL} (\pi \mid \alpha \otimes \beta )} _{\text{entropic term}}}$

where ${\displaystyle \alpha \otimes \beta }$ is the product measure of ${\displaystyle \alpha }$ and ${\displaystyle \beta }$, and where

${\displaystyle \operatorname {KL} (\mu \mid \nu )=\int _{X}\log \left({\frac {\mathrm {d} \mu }{\mathrm {d} \nu }}(x)\right)\mathop {} \!\mathrm {d} \mu (x)+\int _{X}(\mathop {} \!\mathrm {d} \nu (x)-\mathop {} \!\mathrm {d} \mu (x))}$

whenever the Radon-Nikodym derivative ${\displaystyle {\tfrac {\mathrm {d} \mu }{\mathrm {d} \nu }}}$ exists (i.e. when ${\displaystyle \mu }$ is absolutely continuous w.r.t. ${\displaystyle \nu }$) and ${\displaystyle +\infty }$ otherwise. This form of the KL divergence is applicable even when ${\displaystyle \mu ,\nu \in {\mathcal {M}}_{+}(X)}$ differ in total mass and it reduces to the standard definition whenever ${\displaystyle \mu }$ and ${\displaystyle \nu }$ have equal total mass. From this definition it immediately follows that for ${\displaystyle \epsilon >0}$ an optimal coupling ${\displaystyle \pi ^{*}}$ must be absolutely continuous w.r.t ${\displaystyle \alpha \otimes \beta }$. As a result, the optimal plan is in some sense less singular and hence "smoothed out."

## Discrete Problem Formulation

To apply Sinkhorn's algorithm to approximate ${\displaystyle L_{c}^{\epsilon }(\alpha ,\beta )}$, it will be necessary to assume finite support so let ${\displaystyle \alpha =\textstyle \sum _{i=1}^{n}a_{i}\delta _{x_{i}}}$ and ${\displaystyle \beta =\textstyle \sum _{j=1}^{m}b_{i}\delta _{y_{j}}}$ and denote the corresponding vector of weights by ${\displaystyle \mathbf {a} \in \mathbb {R} _{+}^{n}}$ and ${\displaystyle \mathbf {b} \in \mathbb {R} _{+}^{m}}$. Additionally let ${\displaystyle C_{ij}=c(x_{i},y_{j})}$ and denote the discrete version of ${\displaystyle \Gamma (\alpha ,\beta )}$ by ${\displaystyle U(a,b)=\{P\in \mathbb {R} ^{n\times m}\mid \textstyle \sum _{j}P_{ij}=a_{i},\textstyle \sum _{i}P_{ij}=b_{j}\}}$. This lets us write the entropic Kantorovich problem as

${\displaystyle L_{c}^{\epsilon }(\mathbf {a} ,\mathbf {b} )=\inf _{P\in U(\mathbf {a} ,\mathbf {b} )}\sum _{i,j}C_{ij}P_{ij}+\epsilon \operatorname {KL} (P\mid \mathbf {a} \mathbf {b} ^{T})}$

where

${\displaystyle \operatorname {KL} (P\mid \mathbf {a} \mathbf {b} ^{T})=\sum _{i,j}P_{ij}\log \left({\frac {P_{ij}}{a_{i}b_{j}}}\right)+a_{i}b_{j}-P_{i,j}}$

See here for more on discrete optimal transport.

## Characterizing the Solution

The solution to the discrete problem formulation is unique and has a special form.

Theorem (Peyré & Cuturi, Proposition 4.3 on pg. 63)[1]
The solution ${\displaystyle P\in \mathbb {R} ^{n\times m}}$ to discrete regularized Kantorovich problem is unique and has the form ${\displaystyle P_{ij}=u_{i}K_{ij}v_{j}}$ for some ${\displaystyle \mathbf {u} \in \mathbb {R} ^{n},\mathbf {v} \in R^{m}}$ where ${\displaystyle K_{ij}=e^{-C_{ij}/\epsilon }}$. Moreover, ${\displaystyle \mathbf {u} }$ and ${\displaystyle \mathbf {v} }$ are unique up to multiplication and division by some scaling factor.

## Sinkhorn's Algorithm

Sinkhorn's algorithm takes advantage of the aforementioned characterization result to iteratively approximate the scaling factors ${\displaystyle \mathbf {u} }$ and ${\displaystyle \mathbf {v} }$. The procedure is simple and only involves matrix-vector multiplication and entrywise division as follows

${\displaystyle u^{(\ell +1)}={\frac {\textbf {a}}{K{\textbf {v}}^{(\ell )}}}\quad {\text{and}}\quad v^{(\ell +1)}={\frac {\textbf {a}}{K^{T}{\textbf {u}}^{(\ell )}}}}$

Once a sufficient number of iterations ${\displaystyle L}$ have been taken, we let ${\displaystyle {\hat {P}}_{ij}=u_{i}^{(L)}K_{ij}v_{j}^{(L)}}$ be our approximation of the optimal plan.