WGAN Part.1

Problems of traditional GAN

The problem with the final training result

Can we reach the global optimum of the original GAN of Goodfellow et al., so that \(p_g = p_{data}\)?

Reaching the global optimum requires the discriminator to become optimal, i.e. \(D = D^*\). However, the authors of WGAN show in their first paper that this does not work: under an optimal discriminator the generator no longer receives a useful gradient

Assume: D is the optimal discriminator (D = D*)

Part.1

The support of \(P_r\) (the real data distribution) is a low-dimensional manifold in a high-dimensional space, i.e. its dimensionality is smaller than that of the ambient space. (Since a distribution can in principle extend over the whole space from \(-\infty\) to \(+\infty\), the argument works with supports: it is the supports that we try to separate.)

The support of the generated distribution \(P_g\) has zero measure in the ambient space; it is likewise a low-dimensional manifold in the high-dimensional space

If the two supports do not intersect, we can always find a discriminator that separates them completely (an optimal discriminator)

Part.2

If the two supports do intersect (the authors show that the probability that the two manifolds are not perfectly aligned can be taken to be 1), then, since both are low-dimensional manifolds in a high-dimensional space, their intersection is again a low-dimensional manifold with zero measure in the high-dimensional space

Then we can still separate the two supports almost everywhere: the intersection has zero measure, so it can simply be assigned to either side without affecting the result

Part.3

By derivation, if the two supports can be separated, the JS divergence between the two distributions is \(\log 2\). This means that once the discriminator reaches the optimal discriminator, the objective becomes a constant, so the generator's gradient is zero and the model cannot learn anything

\[\begin{align}f(D^*) &=\int p_{data}(x)\left[\log\frac{2p_{data}(x)}{p_{data}(x) + p_g(x)} - \log 2\right]dx+\int p_{g}(x)\left[\log\frac{2p_{g}(x)}{p_{data}(x) + p_g(x)} - \log 2\right]dx\\&=-\log 4 + D_{KL}\left(P_{data}||\frac{P_{data} + P_g}{2}\right) + D_{KL}\left(P_g||\frac{P_{data} + P_g}{2}\right)\\&=-\log 4+2\,\text{JSD}(P_{data}||P_g)\end{align}\]

Supplementary

About how the \(\log 2\) term is introduced

Since \(D\) is the optimal discriminator, the supports of \(P_{data}\) and \(P_g\) are completely separated, so the integral can be split over the two supports. On the support of \(P_{data}\) the density \(p_g(x)\) can be regarded as 0, and vice versa; substituting 0 makes each log term collapse to \(\log 2\), and integrating each density over its own support gives 1, so each integral contributes exactly \(\log 2\)

(A probability distribution integrates to 1 over its own support.)

\[\begin{align}f(D^*) &= \int p_{data}(x)\left[\log\frac{2p_{data}(x)}{p_{data}(x) + p_g(x)} - \log 2\right]dx + \int p_{g}(x)\left[\log\frac{2p_{g}(x)}{p_{data}(x) + p_g(x)} - \log 2\right]dx\\&=-\log 4+\int_{x\in \text{supp}(P_{data})} p_{data}(x)\log\frac{2p_{data}(x)}{p_{data}(x) + p_g(x)}dx + \int_{x\in \text{supp}(P_{g})} p_{g}(x)\log\frac{2p_{g}(x)}{p_{data}(x) +p_g(x)}dx\\&=-\log 4 + \log 2 + \log 2 = 0\end{align}\]
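To make this concrete, here is a small numerical sanity check, written as a sketch with a hypothetical discrete example (the `kl` and `jsd` helpers are defined only for this illustration): when the two supports are disjoint, the JS divergence equals \(\log 2\), so \(f(D^*) = -\log 4 + 2\log 2 = 0\).

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability bins of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence: 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.5, 0.0, 0.0]   # supported on the first two bins only
p_g    = [0.0, 0.0, 0.5, 0.5]   # supported on the last two bins only

print(jsd(p_data, p_g))                    # ~0.6931, i.e. log 2
print(-np.log(4) + 2 * jsd(p_data, p_g))   # ~0.0, so f(D*) is a constant
```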

Problems in the training process

If during the training process \(D \to D^*\)

Back to the GAN formula

\[\min_G\max_D V(D, G) = \mathbb{E}_{x\sim P_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim P_{z}(z)}[\log(1-D(G(z)))]\]
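For reference, here is a minimal PyTorch-style sketch of this value function, assuming `D` outputs a probability in \((0, 1)\) and `G` maps noise `z` to samples (the function name and the `eps` constant are illustrative choices, not part of the original formulation):

```python
import torch

def gan_value(D, G, real, z, eps=1e-8):
    """V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z}[log(1 - D(G(z)))]."""
    d_real = D(real)          # D(x), a probability in (0, 1)
    d_fake = D(G(z))          # D(G(z)), a probability in (0, 1)
    return torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()

# The discriminator ascends V (maximize over D); the generator descends it (minimize over G):
#   d_loss = -gan_value(D, G, real, z)
#   g_loss =  gan_value(D, G, real, z)
```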

The original loss function used to train the generator

\[\mathbb{E}_{z \sim P_z}[\log(1- D(G(z)))]\]

As mentioned in the previous article, if \(D \to D^*\) this loss saturates towards 0 and its gradient vanishes, which means the generator becomes harder and harder to train as the discriminator keeps getting better. Hence the new loss function

\[\mathbb{E}_{z \sim P_z}[-\log D(G(z))]\]
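Side by side, the two generator losses look as follows; this is a sketch assuming `d_fake = D(G(z))` is a batch of probabilities, and the function names are my own:

```python
import torch

def g_loss_saturating(d_fake, eps=1e-8):
    """E_z[log(1 - D(G(z)))]: flattens out (vanishing gradient) as D(G(z)) -> 0."""
    return torch.log(1.0 - d_fake + eps).mean()

def g_loss_nonsaturating(d_fake, eps=1e-8):
    """E_z[-log D(G(z))]: still gives a strong gradient when D(G(z)) is small."""
    return -torch.log(d_fake + eps).mean()
```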

However, the authors of the paper point out that, under the optimal discriminator, minimizing this loss is equivalent to minimizing

\[\text{KL}(P_g||P_r) - 2\,\text{JSD}(P_g||P_r)\]

This reflects two very serious problems:

Problem 1

Minimizing this objective means pushing the KL divergence down while pushing the JS divergence up. But both divergences measure the same thing, namely how far \(P_g\) is from \(P_r\), so they ought to rise and fall together. Asking one to decrease while the other increases is contradictory, which indicates that there is a problem with the objective itself. In the actual training process we can indeed see the results oscillating all the time.

Problem 2

For the KL divergence term in the objective above

\[\text{KL}(P_g||P_r) = \int_x p_g(x)\log\frac{p_g(x)}{p_r(x)}dx\]

From this formula we can see how asymmetric the penalty is. If \(P_g \to P_r\) the KL divergence tends to 0 and the loss is very small; but, worse, wherever the generator puts probability mass on samples that look nothing like the real data (\(p_g\) large while \(p_r \approx 0\)), the term \(p_g\log(p_g/p_r)\) blows up and produces a huge loss. So once the model happens to generate images very similar to the original data, later results all crowd towards that side, and the model is harshly punished whenever it is a little innovative and moves away from the original data. This makes the model afraid to innovate (the cost of innovation is too high).


*It is ironic that this flawed way of handing out punishment is exactly what makes the model afraid to innovate.
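A toy numerical illustration of this asymmetric punishment, using hypothetical discrete distributions (the bins and numbers are made up purely for illustration):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps avoids log(0) and division by zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p_r = [0.5, 0.5, 0.0]              # real data never falls in the third bin

p_g_innovate = [0.25, 0.25, 0.5]   # generator puts mass where the data has none
p_g_safe     = [1.0, 0.0, 0.0]     # generator plays it safe, covers only one real mode

print(kl(p_g_innovate, p_r))       # huge: heavily punished for "innovating"
print(kl(p_g_safe, p_r))           # small: barely punished for dropping a real mode
```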

Supplementary

Mode collapse

Mode collapse refers to a phenomenon that may occur during GAN training: around some \(z\), changing \(z\) does not change the corresponding \(G(z)\). At that point the GAN has collapsed locally, meaning it cannot produce continuously varying samples, so sample diversity is insufficient.
Take GAN-generated handwritten digits as an example (a rough diagnostic sketch is given after the examples below).

Inter collapse:

The generated numbers have little variety.
For example, the only numbers generated are 0, 1, 7, 9.

Inner collapse:

The numbers generated are almost the same
For example, all the generated 1s look the same
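As a rough diagnostic, one can check how much the generated samples vary as \(z\) varies. The sketch below assumes `G` is any PyTorch generator taking \((n, \text{latent\_dim})\) noise; the batch size and latent dimension are arbitrary choices. A mean pairwise distance close to 0 suggests inner collapse.

```python
import torch

@torch.no_grad()
def sample_diversity(G, latent_dim=100, n=64):
    """Mean pairwise L2 distance between generated samples; near 0 suggests inner collapse."""
    z = torch.randn(n, latent_dim)
    fake = G(z).flatten(start_dim=1)       # (n, num_features)
    dists = torch.cdist(fake, fake, p=2)   # (n, n) pairwise distances
    return dists.sum() / (n * (n - 1))     # average over off-diagonal pairs
```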

Summary

In the first WGAN paper, the authors use a rigorous mathematical derivation to explain why GAN results are poor. GAN's tactics (measuring the gap with the KL or JS divergence) are rejected, but its strategy, i.e. fitting the data by matching distributions, is retained.