WGAN Part.2


In the previous article, we saw that using the KL divergence in GANs is problematic, so the authors of WGAN proposed the Wasserstein distance as a replacement for the KL and JS divergences:

\[W(\mathbb{P}_r,\mathbb{P}_g)=\color{green}\inf_{\color{blue}\gamma\in\Pi(\mathbb{P}_r,\mathbb{P}_g)}\color{red}{\mathbb{E}_{(x, y)\sim\gamma}}||x-y||\]

Formula explanation

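The inf in the above equation denotes the infimum, i.e. the greatest lower bound. It is taken over all \(\gamma\) in \(\Pi(\mathbb{P}_r,\mathbb{P}_g)\), the set of joint distributions whose marginals are \(\mathbb{P}_r\) and \(\mathbb{P}_g\)

x and y can be thought of as points (or, in the discrete picture, small intervals) of the two distributions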

The latter part can be rewritten in the following form

\[\mathbb{E}_{(x, y)\sim\gamma}||x-y||=\int_y\int_x\gamma(x, y)||x-y||dxdy=\sum_{x, y}||x-y||\gamma (x, y)\]

This part is actually a weighted sum: the amount moved from x to y, \(\gamma(x, y)\), multiplied by the matching entry \(||x-y||\) of a distance matrix

To make it easier to understand, I have drawn a diagram.

![Matrix](03 WGAN part2.assets/Matrix.png)

Suppose the noise and the real images have the distributions shown above. Our generator wants to transform the distribution of the noise into the distribution of the real images. There are many ways to do this, so we label each small interval.

Say we want interval 3 to end up holding 8. We can get there by moving data in from other intervals: for example, take 5 from interval 4 and 1 from interval 1, and use interval 5 to make up the rest, so that interval 3 adds up to 8.

Of course, there are many ways of moving the data around, some good and some bad, and the one above is among the clumsier ones. So we need an indicator to judge how good a given way of moving is. For this we introduce a distance matrix, which records how far each portion of data is moved from its original interval.

![Matrix](03 WGAN part2.assets/Matrix.png)

p.s. The sum of the x-to-y distances, weighted by how much is moved, is exactly the expectation in the formula
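To make the "good and bad ways of moving" concrete, here is a minimal sketch in plain Python. The bin masses and both plans are made-up numbers (they are not read off the figure), and the distance between bins is assumed to be `|i - j|`. It only shows that two valid plans for the same pair of distributions can have different weighted sums.

```python
# Illustrative numbers only: mass per bin before and after moving.
source = [2, 0, 2]
target = [1, 2, 1]

def distance(i, j):
    # Assumption: bin i sits at position i, so the distance is |i - j|.
    return abs(i - j)

def transport_cost(plan):
    """Weighted sum over all (i, j) of plan[i][j] * distance(i, j)."""
    return sum(plan[i][j] * distance(i, j)
               for i in range(len(plan))
               for j in range(len(plan[i])))

def is_valid(plan):
    """Row sums must match the source, column sums must match the target."""
    rows_ok = all(sum(plan[i]) == source[i] for i in range(len(source)))
    cols_ok = all(sum(row[j] for row in plan) == target[j]
                  for j in range(len(target)))
    return rows_ok and cols_ok

# plan[i][j] = amount moved from bin i to bin j
good_plan = [[1, 1, 0],
             [0, 0, 0],
             [0, 1, 1]]
wasteful_plan = [[0, 1, 1],
                 [0, 0, 0],
                 [1, 1, 0]]

print(is_valid(good_plan), transport_cost(good_plan))          # True 2
print(is_valid(wasteful_plan), transport_cost(wasteful_plan))  # True 6
```

Both plans turn `source` into `target`, but the second one ships mass across the whole range in both directions, so its weighted sum is three times larger.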

After reading the above, you will basically understand what the WGAN formula is doing

Incidentally, Wasser and Stein are the German words for "water" and "stone". The distance is actually named after the mathematician Leonid Vaserstein, but the literal meaning gives a nice picture: the change in distribution is like a river scouring the sediment on its bed. Solving the optimization problem above can be thought of as the river carrying mud away from the mud-rich areas and filling in the holes it encounters downstream.

Making the formula computable

As mentioned just now, the formula asks for an infimum. But if we follow the description above literally, there is no fixed procedure: there are a great many possible ways of moving the data, and we would have to consider all of them. So this formula cannot be used for computation directly. It needs some processing first, so that it can actually be calculated.
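As a rough illustration of why the infimum is awkward to compute directly, the sketch below enumerates every plan for a tiny example and keeps the cheapest. The numbers are made up, the distance is assumed to be `|i - j|`, and only whole-number amounts are moved, which is a simplification: the real infimum ranges over real-valued plans.

```python
from itertools import product

# Illustrative numbers only.
source = [2, 0, 2]
target = [1, 2, 1]

def compositions(total, parts):
    """Yield every way to split `total` into `parts` non-negative integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def all_plans(source, target):
    """Yield every integer plan whose row sums are `source`
    and whose column sums are `target`."""
    row_options = [list(compositions(mass, len(target))) for mass in source]
    for rows in product(*row_options):
        col_sums = [sum(row[j] for row in rows) for j in range(len(target))]
        if col_sums == list(target):
            yield rows

def cost(plan):
    # Weighted sum of amount moved times distance |i - j|.
    return sum(amount * abs(i - j)
               for i, row in enumerate(plan)
               for j, amount in enumerate(row))

plans = list(all_plans(source, target))
best = min(plans, key=cost)
print(len(plans), cost(best))  # 4 2
```

Even here the search space has to be enumerated row by row; with more bins and finer amounts the number of candidate plans grows very quickly, which is why the formula gets rewritten before it is used.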

After all, I'm not a math major, so I don't understand a lot of it, but it's necessary to understand the idea.

\[\begin{align}W(\mathbb{P}_r,\mathbb{P}_g)&=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x,y)\sim \gamma}[||x-y||] \tag{1}\\&= \sup_{||f||_L\le1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_g} [f(x)] \tag{2}\\&= \color{blue}\max_{w \in W}\mathbb{E}_{x\sim\mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim \mathbb{P}_z} [f_w(g_{\theta}(z))] \tag{3}\end{align}\]

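In formula (2), the infimum is converted into a supremum, and γ is eliminated.

From formula (1) to formula (2), we use the Kantorovich-Rubinstein duality.

From formula (2) to formula (3), we use the Lipschitz constraint: \(f\) is restricted to a parameterized family \(f_w\) with weights \(w \in W\) that satisfies the constraint, and samples from \(\mathbb{P}_g\) are written as \(g_\theta(z)\) with \(z \sim \mathbb{P}_z\). Since \(f_w\) cannot represent every 1-Lipschitz function, (3) should be read as a practical approximation of (2).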
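The dual form in formula (2) can be sanity-checked on a tiny 1-D case. This is only a sketch under assumptions: the two distributions are made-up numbers on the grid 0..3, and the primal value is taken from the known closed form for 1-D distributions on a unit-spaced grid (the sum of absolute CDF differences) rather than from solving the infimum.

```python
# Illustrative distributions on the integer positions 0..3.
positions = [0, 1, 2, 3]
p_r = [0.5, 0.5, 0.0, 0.0]
p_g = [0.0, 0.0, 0.5, 0.5]

def expectation(f, probs):
    return sum(p * f(x) for x, p in zip(positions, probs))

def dual_objective(f):
    """E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)] for a candidate function f."""
    return expectation(f, p_r) - expectation(f, p_g)

def is_1_lipschitz(f):
    """Check |f(a) - f(b)| <= |a - b| on the grid."""
    return all(abs(f(a) - f(b)) <= abs(a - b) + 1e-12
               for a in positions for b in positions)

def w1_via_cdf():
    """1-D closed form on a unit-spaced grid: sum of |CDF_r - CDF_g|."""
    total, cdf_r, cdf_g = 0.0, 0.0, 0.0
    for a, b in zip(p_r, p_g):
        cdf_r += a
        cdf_g += b
        total += abs(cdf_r - cdf_g)
    return total

candidates = {
    "f(x) = 0": lambda x: 0.0,
    "f(x) = -x/2": lambda x: -0.5 * x,
    "f(x) = -x": lambda x: -1.0 * x,
}
for name, f in candidates.items():
    print(name, is_1_lipschitz(f), dual_objective(f))
print("W1 from CDFs:", w1_via_cdf())  # W1 from CDFs: 2.0
```

Every 1-Lipschitz candidate gives a value no larger than the distance, and `f(x) = -x` reaches it exactly, which is what the supremum in (2) says.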

Formula (1) to Formula (2)

\[\begin{align}W(\mathbb{P}_r, \mathbb{P}_g)&=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x,y)\sim \gamma}[||x-y||]\\&=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_g)} \int_y \int_x \gamma(x,y) ||x-y|| dx dy\\&=\inf_{\Pi} \{\langle\Pi,D\rangle \mid A\Pi = b,\ \Pi\ge0\}\end{align}\]

\(\langle\Pi,D\rangle\) is the inner product \(\Pi^T D\) of two column vectors, which is the same weighted sum mentioned earlier

\[\Pi = \left(\begin{matrix}\gamma(x_1, y_1) \\ \gamma(x_1, y_2)\\ \vdots\\ \gamma(x_2, y_1)\\ \gamma(x_2, y_2)\\ \vdots\\ \gamma(x_n, y_1)\\ \gamma(x_n, y_2)\\ \vdots\end{matrix}\right),\quad D = \left(\begin{matrix}d(x_1, y_1)\\ d(x_1, y_2)\\\vdots\\d(x_2, y_1)\\ d(x_2, y_2)\\\vdots\\d(x_n, y_1)\\d(x_n, y_2)\\ \vdots\end{matrix}\right)\]

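Why do we say that \(A\Pi = b\)? Because \(\gamma\) must have \(\mathbb{P}_r\) and \(\mathbb{P}_g\) as its marginals: summing \(\gamma(x_i, y_j)\) over all \(y_j\) must give \(p_r(x_i)\), and summing it over all \(x_i\) must give \(p_g(y_j)\). Each row of \(A\) encodes one of these sums.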

\[A=\left(\begin{array}{ccc|ccc|c|ccc|c} 1 & 1 & \dots & 0 & 0 & \dots & \dots & 0 & 0 & \dots & \dots \\0 & 0 & \dots & 1 & 1 & \dots & \dots & 0 & 0 & \dots & \dots\\\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots \\0 & 0 & \dots & 0 & 0 & \dots & \dots & 1 & 1 & \dots & \dots\\\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots \\\hline 1 & 0 & \dots & 1 & 0 & \dots & \dots & 1 & 0 & \dots & \dots\\0 & 1 & \dots & 0 & 1 & \dots & \dots & 0 & 1 & \dots & \dots\\\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots \\0 & 0 & \dots & 0 & 0 & \dots & \dots & 0 & 0 & \dots & \dots\\\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots \end{array}\right) \\\Pi =\left(\begin{matrix}\gamma(x_1, y_1)\\ \gamma(x_1, y_2)\\ \vdots\\\hline\gamma(x_2, y_1)\\ \gamma(x_2, y_2)\\\vdots\\\hline\vdots\\\hline\gamma(x_n, y_1)\\ \gamma(x_n, y_2)\\ \vdots\\\hline\vdots \end{matrix}\right),\quad b =\left(\begin{matrix}p_r(x_1)\\ p_r(x_2)\\ \vdots \\ p_r(x_n) \\ \vdots \\\hline p_g(y_1)\\ p_g(y_2)\\ \vdots \\ p_g(y_n) \\ \vdots\end{matrix}\right)\]

In the upper block of \(A\), row \(i\) has ones across the whole group of columns belonging to \(x_i\). In the lower block, row \(j\) has a single one in column \(j\) of every group (for the \(y_n\) row shown, these ones fall in the elided columns).
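The structure of \(A\), \(\Pi\) and \(b\) is easier to see on a small case. Below is a minimal sketch for `n = 3` in plain Python. The plan `gamma` is a made-up example, and the distance is assumed to be `|i - j|`. It builds \(A\) block by block, flattens `gamma` into \(\Pi\), and checks that \(A\Pi\) stacks the two marginals.

```python
n = 3  # number of bins for both x and y (illustrative)

# A made-up transport plan: gamma[i][j] = mass moved from x_i to y_j.
gamma = [[0.2, 0.1, 0.0],
         [0.0, 0.3, 0.1],
         [0.0, 0.0, 0.3]]

# Flatten row by row: (gamma(x_1,y_1), gamma(x_1,y_2), ..., gamma(x_n,y_n)).
Pi = [gamma[i][j] for i in range(n) for j in range(n)]
# Flatten the distance matrix the same way (assumed distance: |i - j|).
D = [abs(i - j) for i in range(n) for j in range(n)]

# Upper block: row i has ones on the i-th group of n columns (sum over y).
upper = [[1 if col // n == i else 0 for col in range(n * n)] for i in range(n)]
# Lower block: row j has a single one at offset j in every group (sum over x).
lower = [[1 if col % n == j else 0 for col in range(n * n)] for j in range(n)]
A = upper + lower

def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]

b = matvec(A, Pi)
p_r = [sum(gamma[i]) for i in range(n)]                        # marginal over y
p_g = [sum(gamma[i][j] for i in range(n)) for j in range(n)]   # marginal over x

inner_product = sum(p * d for p, d in zip(Pi, D))  # <Pi, D>

print(b)
print(p_r + p_g)
print(inner_product)
```

The first two printed lists agree up to floating-point rounding, so the constraint \(A\Pi = b\) is just a compact way of writing the two marginal conditions, and `inner_product` is the weighted sum from earlier.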