Generative adversarial network
A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent approach to generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks compete with each other in a zero-sum game, where one agent's gain is another agent's loss.
Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning.
The core idea of a GAN is "indirect" training through the discriminator, another neural network that judges how "realistic" an input seems and that is itself updated dynamically. The generator is therefore not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.
GANs are similar to mimicry in evolutionary biology, with an evolutionary arms race between the two networks.
Definition
Mathematical
The original GAN is defined as the following game:
Each probability space $(\Omega, \mu_{\text{ref}})$ defines a GAN game.
There are 2 players: generator and discriminator.
The generator's strategy set is $\mathcal{P}(\Omega)$, the set of all probability measures $\mu_G$ on $\Omega$.
The discriminator's strategy set is the set of Markov kernels $\mu_D : \Omega \to \mathcal{P}[0,1]$, where $\mathcal{P}[0,1]$ is the set of probability measures on $[0,1]$.
The GAN game is a zero-sum game, with objective function
$$L(\mu_G, \mu_D) := \mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] + \mathbb{E}_{x \sim \mu_G,\, y \sim \mu_D(x)}[\ln(1 - y)].$$
The generator aims to minimize the objective, and the discriminator aims to maximize the objective.
The generator's task is to approach $\mu_G \approx \mu_{\text{ref}}$, that is, to match its own output distribution as closely as possible to the reference distribution. The discriminator's task is to output a value close to 1 when the input appears to be from the reference distribution, and to output a value close to 0 when the input looks like it came from the generator distribution.
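As a quick sanity check on the objective, consider a discriminator that ignores its input and always outputs $y = 1/2$. This constant strategy is a hypothetical choice made only for illustration. Writing $\mu_{\text{ref}}$ for the reference distribution and $\mu_G$ for the generator distribution, and taking the objective in its standard form from the original GAN paper, both expectations reduce to constants:

$$L(\mu_G, \mu_D) = \mathbb{E}_{x \sim \mu_{\text{ref}}}\left[\ln \tfrac{1}{2}\right] + \mathbb{E}_{x \sim \mu_G}\left[\ln\left(1 - \tfrac{1}{2}\right)\right] = -2 \ln 2 \approx -1.386.$$

This value does not depend on $\mu_G$. Because the constant strategy is always available, the maximizing discriminator can guarantee an objective of at least $-2 \ln 2$ against any generator.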
In practice
The generative network generates candidates while the discriminative network evaluates them. This creates a contest based on data distributions, where the generator learns to map from a latent space to the true data distribution, aiming to produce candidates that the discriminator cannot distinguish from real data. The discriminator's goal is to correctly identify these candidates, but as the generator improves, its task becomes more challenging, increasing the discriminator's error rate.
A known dataset serves as the initial training data for the discriminator. Training involves presenting it with samples from the training dataset until it achieves acceptable accuracy. The generator is trained based on whether it succeeds in fooling the discriminator. Typically, the generator is seeded with randomized input that is sampled from a predefined latent space. Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Independent backpropagation procedures are applied to both networks so that the generator produces better samples, while the discriminator becomes more skilled at flagging synthetic samples. When used for image generation, the generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network.
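The alternating updates described above can be sketched on a one-dimensional toy problem. This is a minimal illustration, not a practical GAN: the generator is assumed to be an affine map `G(z) = a*z + b`, the discriminator a logistic unit `D(x) = sigmoid(w*x + c)`, the reference data a normal distribution with mean 3, and the gradients are written out by hand instead of using backpropagation through deep networks. The function name `train_toy_gan` and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        real = [rng.gauss(3.0, 1.0) for _ in range(batch)]  # assumed reference data
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]    # latent samples
        fake = [a * z + b for z in zs]

        # Discriminator step: gradient ascent on
        # mean ln D(real) + mean ln(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (1.0 - d) * x
            gc += (1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            gw -= d * x
            gc -= d
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient descent on mean ln(1 - D(G(z))),
        # using the freshly updated discriminator.
        ga = gb = 0.0
        for z in zs:
            d = sigmoid(w * (a * z + b) + c)
            ga += -d * w * z
            gb += -d * w
        a -= lr * ga / batch
        b -= lr * gb / batch
    return a, b, w, c


if __name__ == "__main__":
    print(train_toy_gan())
```

The sketch makes no claim about convergence. With a discriminator this simple, the parameters may oscillate instead of settling, which is part of what makes GAN training delicate.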
Relation to other statistical machine learning methods
GANs are implicit generative models, which means that they do not explicitly model the likelihood function nor provide a means for finding the latent variable corresponding to a given sample, unlike alternatives such as flow-based generative models.
Compared to fully visible belief networks such as WaveNet and PixelRNN and autoregressive models in general, GANs can generate one complete sample in one pass, rather than multiple passes through the network.
Compared to Boltzmann machines and linear ICA, there is no restriction on the type of function used by the network.
Since neural networks are universal approximators, GANs are asymptotically consistent. Variational autoencoders might be universal approximators, but it is not proven as of 2017.
Mathematical properties
Measure-theoretic considerations
This section provides some of the mathematical theory behind these methods.
In modern probability theory based on measure theory, a probability space also needs to be equipped with a σ-algebra. As a result, a more rigorous definition of the GAN game would make the following changes:
Each probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$ defines a GAN game.
The generator's strategy set is $\mathcal{P}(\Omega, \mathcal{B})$, the set of all probability measures $\mu_G$ on the measurable space $(\Omega, \mathcal{B})$.
The discriminator's strategy set is the set of Markov kernels $\mu_D : (\Omega, \mathcal{B}) \to \mathcal{P}([0,1], \mathcal{B}([0,1]))$, where $\mathcal{B}([0,1])$ is the Borel σ-algebra on $[0,1]$.
Since issues of measurability never arise in practice, these will not concern us further.
Choice of the strategy set
In the most generic version of the GAN game described above, the strategy set for the discriminator contains all Markov kernels $\mu_D : \Omega \to \mathcal{P}[0,1]$, and the strategy set for the generator contains arbitrary probability distributions $\mu_G$ on $\Omega$.
However, as shown below, the optimal discriminator strategy against any $\mu_G$ is deterministic, so there is no loss of generality in restricting the discriminator's strategies to deterministic functions $D : \Omega \to [0,1]$. In most applications, $D$ is a deep neural network function.
As for the generator, while $\mu_G$ could theoretically be any computable probability distribution, in practice, it is usually implemented as a pushforward: $\mu_G = \mu_Z \circ G^{-1}$. That is, start with a random variable $z \sim \mu_Z$, where $\mu_Z$ is a probability distribution that is easy to compute, then define a function $G : \Omega_Z \to \Omega$. Then the distribution $\mu_G$ is the distribution of $G(z)$.
Consequently, the generator's strategy is usually defined as just $G$, leaving $z \sim \mu_Z$ implicit. In this formalism, the GAN game objective is
$$L(G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{z \sim \mu_Z}[\ln(1 - D(G(z)))].$$
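The pushforward construction can be sketched in a few lines of Python. The example below draws latent samples `z` from a standard normal, maps them through a generator `G`, and forms a Monte Carlo estimate of the objective, taken here in its standard form: the mean of `ln D(x)` over reference samples plus the mean of `ln(1 - D(G(z)))` over latent samples. The particular `G`, `D`, the reference distribution, and the helper name `estimate_objective` are all hypothetical choices made for illustration.

```python
import math
import random


def G(z):
    # Hypothetical generator: an affine map of the latent variable.
    return 2.0 * z + 1.0


def D(x):
    # Hypothetical deterministic discriminator with values in (0, 1).
    return 1.0 / (1.0 + math.exp(-(x - 2.0)))


def estimate_objective(gen, disc, sample_ref, sample_z, n=10000, seed=0):
    """Monte Carlo estimate of E[ln D(x)] + E[ln(1 - D(G(z)))]."""
    rng = random.Random(seed)
    eps = 1e-12  # clamp to keep the logarithms finite if D saturates

    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    real_term = sum(math.log(clamp(disc(sample_ref(rng)))) for _ in range(n)) / n
    fake_term = sum(
        math.log(1.0 - clamp(disc(gen(sample_z(rng))))) for _ in range(n)
    ) / n
    return real_term + fake_term


if __name__ == "__main__":
    value = estimate_objective(
        G,
        D,
        sample_ref=lambda rng: rng.gauss(3.0, 1.0),  # assumed reference distribution
        sample_z=lambda rng: rng.gauss(0.0, 1.0),    # latent distribution mu_Z
    )
    print(value)
```

Only samples of `G(z)` are ever needed, never the density of the generator distribution, which is the practical appeal of defining the generator as a pushforward.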
Generative reparametrization
The GAN architecture has two main components. One is casting optimization into a game, of form $\min_G \max_D L(G, D)$, which is different from the usual kind of optimization, of form $\min_\theta L(\theta)$. The other is the decomposition of $\mu_G$ into $\mu_Z \circ G^{-1}$, which can be understood as a reparametrization trick.
To see its significance, one must compare GAN with previous methods for learning generative models, which were plagued with "intractable probabilistic computations that arise in maximum likelihood estimation and related strategies".
At the same time, Kingma and Welling and Rezende et al. developed the same idea of reparametrization into a general stochastic backpropagation method. Among its first applications was the variational autoencoder.
Move order and strategic equilibria
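In the original paper, as well as most subsequent papers, it is usually assumed that the generator moves first, and the discriminator moves second, thus giving the following minimax game:
$$\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) := \mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] + \mathbb{E}_{x \sim \mu_G,\, y \sim \mu_D(x)}[\ln(1 - y)].$$
If both the generator's and the discriminator's strategy sets are spanned by a finite number of strategies, then by the minimax theorem,
$$\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) = \max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D),$$
that is, the move order does not matter.
However, since neither strategy set is finitely spanned, the minimax theorem does not apply, and the idea of an "equilibrium" becomes delicate. To wit, there are the following different concepts of equilibrium:
- Equilibrium when generator moves first, and discriminator moves second: $\hat\mu_G \in \arg\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D), \quad \hat\mu_D \in \arg\max_{\mu_D} L(\hat\mu_G, \mu_D)$
- Equilibrium when discriminator moves first, and generator moves second: $\hat\mu_D \in \arg\max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} L(\mu_G, \hat\mu_D)$
- Nash equilibrium $(\hat\mu_D, \hat\mu_G)$, which is stable under simultaneous move order: $\hat\mu_D \in \arg\max_{\mu_D} L(\hat\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} L(\mu_G, \hat\mu_D)$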
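Why move order can matter is easiest to see in a game far simpler than the GAN game. The sketch below uses the classic matching-pennies payoff matrix as a stand-in (an assumption for illustration, not a GAN objective). Restricted to pure strategies, min-max and max-min differ, so whoever moves second has an advantage. Once each player may mix over its finite strategies, a grid search over mixing probabilities gives the same value in both orders, as the minimax theorem predicts. All function names are hypothetical.

```python
# Rows: strategies of the minimizing player.
# Columns: strategies of the maximizing player.
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]


def pure_min_max(m):
    # Minimizer commits to a row first, maximizer replies with the best column.
    return min(max(row) for row in m)


def pure_max_min(m):
    # Maximizer commits to a column first, minimizer replies with the best row.
    return max(min(col) for col in zip(*m))


def mixed_value(m, p, q):
    # Expected payoff when the row player picks row 0 with probability p
    # and the column player picks column 0 with probability q.
    return (p * q * m[0][0] + p * (1 - q) * m[0][1]
            + (1 - p) * q * m[1][0] + (1 - p) * (1 - q) * m[1][1])


def mixed_min_max(m, grid=101):
    probs = [i / (grid - 1) for i in range(grid)]
    return min(max(mixed_value(m, p, q) for q in probs) for p in probs)


def mixed_max_min(m, grid=101):
    probs = [i / (grid - 1) for i in range(grid)]
    return max(min(mixed_value(m, p, q) for p in probs) for q in probs)


if __name__ == "__main__":
    print(pure_min_max(PAYOFF))   # 1.0
    print(pure_max_min(PAYOFF))   # -1.0
    print(mixed_min_max(PAYOFF))  # approximately 0
    print(mixed_max_min(PAYOFF))  # approximately 0
```

The grid search is only a numerical illustration for a 2×2 game. It does not extend to the infinite-dimensional strategy sets of the GAN game, which is exactly where the distinction between the equilibrium concepts above becomes relevant.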
Main theorems for GAN game
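The original GAN paper proved the following two theorems:
Theorem (the optimal discriminator computes the Jensen-Shannon divergence): For any fixed generator strategy $\mu_G$, let the optimal reply be $D^* = \arg\max_D L(\mu_G, D)$. Then
$$D^*(x) = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}(x), \qquad L(\mu_G, D^*) = 2 D_{JS}(\mu_{\text{ref}}; \mu_G) - 2 \ln 2,$$
where the derivative is the Radon-Nikodym derivative and $D_{JS}$ is the Jensen-Shannon divergence.
Theorem (the unique equilibrium point): The generator minimizes $L(\mu_G, D^*)$ if and only if $\mu_G = \mu_{\text{ref}}$. At that point the optimal discriminator outputs $1/2$ on all inputs and the objective equals $-2 \ln 2$.
Interpretation: For any fixed generator strategy $\mu_G$, the optimal discriminator keeps track of the likelihood ratio between the reference distribution and the generator distribution:
$$\frac{D(x)}{1 - D(x)} = \frac{d\mu_{\text{ref}}}{d\mu_G}(x) = \frac{\mu_{\text{ref}}(dx)}{\mu_G(dx)}; \qquad D(x) = \sigma\left(\ln \mu_{\text{ref}}(dx) - \ln \mu_G(dx)\right),$$
where $\sigma$ is the logistic function.
In particular, if the prior probability for an image $x$ to come from the reference distribution is equal to $\frac{1}{2}$, then $D(x)$ is just the posterior probability that $x$ came from the reference distribution:
$$D(x) = \Pr(x \text{ came from the reference distribution} \mid x).$$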
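The likelihood-ratio interpretation can be checked numerically on a finite sample space, where probability measures are just lists of probabilities. The sketch below assumes the closed form of the optimal discriminator from the original GAN paper, `D*(x) = ref(x) / (ref(x) + gen(x))`, and verifies three things: the odds `D*/(1 - D*)` equal the likelihood ratio, `D*` is the logistic function of the log-ratio, and the objective at `D*` equals twice the Jensen-Shannon divergence minus `2 ln 2`. The two three-point distributions and all function names are hypothetical.

```python
import math

MU_REF = [0.5, 0.3, 0.2]  # hypothetical reference distribution
MU_G = [0.2, 0.3, 0.5]    # hypothetical generator distribution


def optimal_discriminator(mu_ref, mu_g):
    return [r / (r + g) for r, g in zip(mu_ref, mu_g)]


def objective(mu_ref, mu_g, d):
    # E_{x~ref}[ln D(x)] + E_{x~G}[ln(1 - D(x))] on a finite space.
    return sum(r * math.log(di) + g * math.log(1.0 - di)
               for r, g, di in zip(mu_ref, mu_g, d))


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


if __name__ == "__main__":
    d_star = optimal_discriminator(MU_REF, MU_G)
    for r, g, d in zip(MU_REF, MU_G, d_star):
        print(d / (1.0 - d), r / g, logistic(math.log(r) - math.log(g)), d)
    print(objective(MU_REF, MU_G, d_star), 2.0 * js(MU_REF, MU_G) - 2.0 * math.log(2.0))
```

When the two distributions coincide, every entry of `D*` is `1/2` and the objective reduces to `-2 ln 2`, since the Jensen-Shannon divergence vanishes.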
Training and evaluating GAN
Training
Unstable convergence
While the GAN game has a unique global equilibrium point when both the generator and discriminator have access to their entire strategy sets, the equilibrium is no longer guaranteed when they have a restricted strategy set.
In practice, the generator has access only to measures of form $\mu_Z \circ G_\theta^{-1}$, where $G_\theta$ is a function computed by a neural network with parameters $\theta$, and $\mu_Z$ is an easily sampled distribution, such as the uniform or normal distribution. Similarly, the discriminator has access only to functions of form $D_\zeta$, a function computed by a neural network with parameters $\zeta$. These restricted strategy sets take up a vanishingly small proportion of their entire strategy sets.
Further, even if an equilibrium still exists, it can only be found by searching in the high-dimensional space of all possible neural network functions. The standard strategy of using gradient descent to find the equilibrium often does not work for GANs, and often the game "collapses" into one of several failure modes. To improve the convergence stability, some training strategies start with an easier task, such as generating low-resolution images or simple images, and gradually increase the difficulty of the task during training. This essentially translates to applying a curriculum learning scheme.
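A standard toy illustration of why plain gradient methods can fail in games is the bilinear game `f(x, y) = x * y`, where `x` minimizes and `y` maximizes and the unique equilibrium is `(0, 0)`. This is a stand-in chosen for illustration, not the GAN objective. Under simultaneous gradient descent-ascent, each step multiplies the squared distance to the equilibrium by `1 + lr**2`, so the iterates spiral outward instead of converging. The function name `simultaneous_gda` is hypothetical.

```python
import math


def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x*y.

    Returns the distance to the equilibrium (0, 0) after every step.
    """
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                      # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # both players update at once
        radii.append(math.hypot(x, y))
    return radii


if __name__ == "__main__":
    radii = simultaneous_gda(1.0, 1.0)
    print(radii[0], radii[-1])
```

Smaller learning rates slow the divergence but do not remove it in this toy game, which is one reason GAN training often relies on modified update rules or easier starting tasks.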
Mode collapse
GANs often suffer from mode collapse, where they fail to generalize properly and miss entire modes of the input data. For example, a GAN trained on the MNIST dataset containing many samples of each digit might only generate pictures of digit 0. This was termed "the Helvetica scenario".
A typical mechanism for mode collapse is the generator only generating one or a few of the likely values, or a very incomplete picture of the target distribution. As the discriminator is only trained to distinguish real from fake samples, it may accept such samples as real, and no penalty is imposed on the generator for failing to represent the full range of the target distribution.
Weak discriminators, for instance underparametrized ones or ones trained too slowly compared to the generator, may also be unable to discriminate over the entire support of the distribution, and only learn to discriminate correctly over a very incomplete part of the target distribution.
Some researchers perceive the root problem to be a weak discriminative network that fails to notice the pattern of omission, while others assign blame to a bad choice of objective function. Many solutions have been proposed, but it is still an open problem.
Even the state-of-the-art architecture, BigGAN, could not avoid mode collapse. The authors resorted to "allowing collapse to occur at the later stages of training, by which time a model is sufficiently trained to achieve good results".
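On synthetic benchmarks where the modes of the target distribution are known in advance, mode collapse can be made visible by counting how many modes the generated samples actually reach. The sketch below is a hypothetical diagnostic for one-dimensional data: it assigns each sample to its nearest mode center and reports which modes receive at least a minimum share of the samples. The function name `mode_coverage`, the centers, and the threshold are assumptions for illustration. Real image datasets have no such known centers, and evaluation there needs other tools.

```python
def mode_coverage(samples, centers, min_fraction=0.01):
    """Count samples per nearest mode and list the modes that are covered."""
    counts = [0] * len(centers)
    for s in samples:
        nearest = min(range(len(centers)), key=lambda i: abs(s - centers[i]))
        counts[nearest] += 1
    total = len(samples)
    covered = [i for i, c in enumerate(counts)
               if total and c / total >= min_fraction]
    return counts, covered


if __name__ == "__main__":
    centers = [0.0, 5.0, 10.0]                    # known modes of a toy target
    healthy = [0.1, 4.9, 10.2, 5.1, -0.1, 9.8]    # samples spread over all modes
    collapsed = [0.1, -0.2, 0.05, 0.0]            # samples stuck on one mode
    print(mode_coverage(healthy, centers))    # ([2, 2, 2], [0, 1, 2])
    print(mode_coverage(collapsed, centers))  # ([4, 0, 0], [0])
```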