## Problem Restatement

This part mainly introduces the principle and realization method of GAN (Generative Adversarial Nets), GAN is proposed by lan.J et al. In 2014, they propose a new framework for estimating generative models via an adversarial process, in which simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptron’s, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference net-works during either training or generation of samples[1].

It can be said that the past few years in the field of machine learning the brightest and hottest new idea is to generate a confrontation network. This idea not only gave birth to a lot of theoretical papers, but also brought an endless practical application. Yann LeCun himself has been generously praised: This is the best idea of the past few years! So we chose GAN as one of our research contents.

## Method Introduction

First of all, introduce the generative model, which in the history of machine learning has been occupying a pivotal position. When we have a lot of data, such as images, voice, text, etc., if the model can help us to simulate the distribution of these high-dimensional data, then for many applications will be of great benefit.

For scenarios where the amount of data is scarce, generating models can help generate data and increase the number of data, thereby improving learning efficiency using semi-supervised learning. Language model is one of the widely used examples of generation model. Through reasonable modeling, language model can not only help to generate language fluent sentences, but also has a wide range of auxiliary applications in machine translation, chat dialogue and other research fields.

GAN uses the maximum likelihood estimate to create the model:

We all know that the probability of simultaneous occurrence of a group of independent events is:

And in real life, we probably do not know what each P (probability distribution model) is, we know that we can observe the source data. So, the maximum likelihood estimate is the way that gives the observation data to evaluate the model parameters (that is, how the distribution model should be estimated). First, suppose we independently sample $X_1,X_2,⋯X_n$ and we use the f model, but its argument θ is unknown, then this equation can be expressed as:

Take the logarithm on both sides:

Set:

$\log L(\theta|X_1,X_2,…X_n)$ called logarithmic likelihood, $\overline{\varphi}$ called average log likelihood. Then the largest average logarithmic likelihood is:

That is the meaning of the first formula.

Most powerful generation models require the use of the Markov chain, and GAN is the only one-to-one generation model that is directly observed from data. To understand the generation of the confrontation model (GAN), we must first know that the generation of the confrontation model is two things: one is the discriminant model and one is the model.

For example, we have some real data, but also a mess of false data. A effort to take the fake data from hand to imitate the real data, and rub into the real data. B is desperately trying to distinguish between real and false data. Here, A is a generation model, the goal is to fool the recognition of B. And B is a discriminant model, trying to distinguish all the false data. So, with the B’s identification skills more and more powerful, A forged data skills are more and more skilled. Where A is the generative model:

Here is our generation model. As shown, it disguises the real data x by generating the model G by generating the noise data z (that is, the false data we say). Of course, because GAN is still a neural network, the generation model needs to be differentiable.

The process of training is also very intuitive, we can use SGD-like algorithm of choice on two minibatches simultaneously: A minibatch of training examples and A minibatch of generated samples. Of course, we can also train a set of each run, the other group is running K times, this can prevent one of them cannot keep up with the rhythm.

Similarly, since the use of SGD optimization, we need an objective function to determine and monitor the results of learning. Here, $J(D)$ represents the objective function of the discriminant network a cross entropy function. Where the left part indicates that D judges that x is true x and the right part indicates the case where the noise data z is forged by the generated network G identified by D.

In this way, the same, $J(G)$ is the representative of the network to generate the objective function, its purpose is to follow the anti-D, so in front of a negative sign[2]:

A more intuitive explanation of this process is as follows: suppose that at the beginning of the training, the true sample distribution, the generated sample distribution, and the discriminant model are black lines, green lines, and blue lines in the figure, respectively. It can be seen that at the beginning of the training, the discriminant model is not well distinguishable from the real sample and the resulting sample. Then when we build the model fixedly, and the optimization of the model, the optimization results as shown in the second picture, we can see that this time the discriminant model has been able to better distinguish between the generation of data and real data. The third step is to fix the discriminant model, improve the generation model, and try to let the discriminant model cannot distinguish between the generated picture and the real picture. In this process, we can see that the image distribution generated by the model is closer to the real picture distribution. , Until the final convergence, generated distribution and true distribution coincidence[3].

## Environment of Reproduction

### Environment

• Windows 10.1 Professional
• i7-6700HQ CPU @ 2.60GHz x8
• Src in Python 3.5.x with Jupyter notebook

### Requirement

• tensorflow (1.1.0)
• numpy (1.11.3)
• matplotlib (2.0.0)
• seaborn (0.7.1)
• scipy (0.18.1)

## Result of Reproduction

We trained adversarial nets an a range of datasets including MNIST[4], the Toronto Face Database (TFD)[5], and CIFAR-10 [6]. The generator nets used a mixture of rectifier linear activations[7][8] and sigmoid activations, while the discriminator net used maxout[9] activations. Dropout [10] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.

We estimate probability of the test set data under pgby fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution. The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [11] and used for various generative models for which the exact likelihood is not tractable[12][13][14]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.

Table 1 Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross validated on each fold and mean log-likelihood on each fold were computed. For MNIST we compare against other models of the real-valued (rather than binary) version of dataset.

In Figures 1 and 2 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.

Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing.

Besides the Reproduction above from authors of the essay, we also try to achieve a toy model of GAN on Gaussian Distribution.

### Figures of Testing Model

Initial
Initially, generate $P_{data}$ and original decision boundary:

Pre-train Decision Surface
If decider is reasonably accurate to start, we get much faster convergence.

Training Loss
Loss Value during iteration

Iteration
After 10,000 iters (Inspite of faster presentation, during less than 15 secs. in this method):

## Reference

[1] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]//Advances in neural information processing systems. 2014: 2672-2680.
[2] http://www.sohu.com/a/121189842_465975
[3] https://baijiahao.baidu.com/po/feed/share?wfr=spider&for=pc&context=%7B%22sourceFrom%22%3A%22bjh%22%2C%22nid%22%3A%22news_3737750434789358240%22%7D
[4] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[5] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.
[6] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[7] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153.IEEE.
[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.
[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks.In ICML’2013.
[10] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[11] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.
[12] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML’12.
[13] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML’13.
[14] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML’14.

## Discriminator Strategy

### The Optimal Discriminator

For G fixed, the optimal discriminator D is:

Differentiate: