One-shot learning (computer vision)
One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.
Motivation
The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. It is estimated that a child learns almost all of the 10,000 to 30,000 object categories in the world by age six. This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories, one an unknown object composed of familiar shapes and the other an unknown, amorphous shape, it is much easier for humans to recognize the former, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects.
Background
As with most classification schemes, one-shot learning involves three main challenges:
- Representation: How should objects and categories be described?
- Learning: How can such descriptions be created?
- Recognition: How can a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting?
One-shot learning differs from standard object categorization in its emphasis on knowledge transfer, which makes use of prior knowledge of learned categories to allow learning from minimal training examples. Such knowledge transfer can take several forms:
- Model parameters: Reuses model parameters, based on the similarity between old and new categories (see the sketch following this list). Categories are first learned on numerous training examples; new categories are then learned by transforming the model parameters of those initial categories or by selecting relevant parameters for a classifier.
- Feature sharing: Shares parts or features of objects across categories. One algorithm extracts "diagnostic information" in patches from already learned categories by maximizing the patches' mutual information, and then applies these features to the learning of a new category. A dog category, for example, may be learned in one shot from previous knowledge of horse and cow categories, because dog objects may contain similar distinguishing patches.
- Contextual information: Appeals to global knowledge of the scene in which the object appears. Such global information can be used as frequency distributions in a conditional random field framework to recognize objects. Alternatively, context can consider camera height and scene geometry. Algorithms of this type have two advantages: first, they learn object categories that are relatively dissimilar; second, they perform well in ad hoc situations where an image has not been hand-cropped and aligned.
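To make the first of these mechanisms concrete, the following is a hypothetical sketch (not drawn from any particular published algorithm): the parameters of previously learned, similar categories act as a prior, and the model for a new category shrinks its single observed example toward that prior.

```python
import numpy as np

def transfer_class_mean(new_example, learned_means, prior_weight=0.8):
    """Hypothetical model-parameter transfer: estimate a new category's
    mean feature vector by shrinking the single observed example toward
    the average of previously learned, similar categories."""
    prior_mean = np.mean(learned_means, axis=0)  # knowledge from old categories
    return prior_weight * prior_mean + (1 - prior_weight) * new_example
```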
Theory
Bayesian framework
Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that the object is present in the image with the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence; otherwise, it reports the object's absence. To compute these probabilities, the object class must be modeled from a set of training images containing examples.

To formalize these ideas, let $I$ be the query image, which contains either an example of the foreground category $O_{fg}$ or only background clutter of a generic background category $O_{bg}$. Also let $I_t$ be the set of training images used as the foreground category. The decision of whether $I$ contains an object from the foreground category or only clutter from the background category is

$$R = \frac{p(O_{fg} \mid I, I_t)}{p(O_{bg} \mid I, I_t)} = \frac{p(I \mid I_t, O_{fg})\, p(O_{fg})}{p(I \mid I_t, O_{bg})\, p(O_{bg})}$$
where the class posteriors $p(O_{fg} \mid I, I_t)$ and $p(O_{bg} \mid I, I_t)$ have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image contains an object from the foreground class if $R$ exceeds a certain threshold $T$. We next introduce parametric models for the foreground and background categories, with parameters $\theta$ and $\theta_{bg}$ respectively. The foreground parametric model is learned during the learning stage from $I_t$, together with prior information from previously learned categories. The background model is assumed to be uniform across images. Omitting the constant ratio of category priors, $p(O_{fg})/p(O_{bg})$, and parametrizing over $\theta$ and $\theta_{bg}$ yields

$$R \propto \frac{\int p(I \mid \theta, O_{fg})\, p(\theta \mid I_t, O_{fg})\, d\theta}{\int p(I \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid I_t, O_{bg})\, d\theta_{bg}}$$
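As an illustration of the decision rule (leaving aside how the two marginal likelihoods are computed, which is the subject of the rest of this section), a minimal sketch might look as follows; fg_likelihood and bg_likelihood are assumed to return $p(I \mid I_t, O_{fg})$ and $p(I \mid I_t, O_{bg})$ for a query image:

```python
import numpy as np

def detect_object(image, fg_likelihood, bg_likelihood, threshold=1.0):
    """Report the object's presence if the foreground/background
    likelihood ratio R exceeds a threshold; the constant ratio of
    category priors is folded into the threshold."""
    # Compare in log space for numerical stability.
    log_R = np.log(fg_likelihood(image)) - np.log(bg_likelihood(image))
    return log_R > np.log(threshold)
```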
The posterior distribution of model parameters given the training images, $p(\theta \mid I_t, O_{fg})$, is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models, which approximate this posterior by a delta function at a point estimate $\theta^{*}$ and thereby collapse the integral to $p(I \mid \theta^{*}, O_{fg})$. Instead, it uses a variational approach that incorporates prior information from previously learned categories. However, traditional maximum-likelihood estimation of the model parameters is used for the background model and for the categories learned in advance through training.
Object category model
For each query image $I$ and set of training images $I_t$, a constellation model is used for representation. To obtain this model for a given image, a set of $N$ interesting regions is first detected using the Kadir–Brady saliency detector. Each selected region is represented by its location in the image, $X_i$, and a description of its appearance, $A_i$. Letting $X = \{X_i\}$ and $A = \{A_i\}$, with $X_t$ and $A_t$ the analogous representations for the training images, the expression for $R$ becomes:

$$R \propto \frac{\int p(X, A \mid \theta, O_{fg})\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta}{\int p(X, A \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid X_t, A_t, O_{bg})\, d\theta_{bg}}$$

The likelihoods $p(X, A \mid \theta)$ and $p(X, A \mid \theta_{bg})$ are represented as mixtures of constellation models. A typical constellation model has $P$ parts but $N$ (roughly 100) interest regions, with $P \ll N$. A $P$-dimensional vector $h$ assigns one interest region to each model part; $h$ thus denotes a hypothesis for the model, and a full constellation model is obtained by summing over all possible hypotheses $h$ in the hypothesis space $H$. Finally, the likelihood is written

$$p(X, A \mid \theta) = \sum_{\omega=1}^{\Omega} \sum_{h \in H} p(X, A, h, \omega \mid \theta)$$

where $\omega$ indexes the $\Omega$ mixture components.
The different $\omega$'s represent different configurations of parts, whereas the different hypotheses $h$ represent different assignments of regions to parts, given a part model $\omega$. The assumption that the shape of the model and its appearance are independent allows the joint term $p(X, A, h, \omega \mid \theta)$ to be factored into two separate likelihoods of appearance and shape.
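Schematically, the mixture-of-constellations likelihood is a brute-force sum over mixture components and over all assignments of regions to parts. The sketch below assumes helper functions appearance_likelihood and shape_likelihood (sketched in the following subsections) and per-component parameter dictionaries; enumerating permutations is only feasible for small $N$ and $P$:

```python
from itertools import permutations

def constellation_likelihood(X, A, components):
    """Schematic p(X, A | theta): sum over mixture components omega and
    over all hypotheses h assigning a distinct interest region to each
    of the P model parts. components is a list of (weight, theta) pairs."""
    N = len(X)
    total = 0.0
    for weight, theta in components:
        for h in permutations(range(N), theta["P"]):
            # Shape and appearance are assumed independent, so the
            # joint likelihood factors into two terms.
            total += weight * appearance_likelihood(A, h, theta) \
                            * shape_likelihood(X, h, theta)
    return total
```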
Appearance
The appearance of each feature is represented by a point in appearance space (described below under feature detection and representation). Each part $p$ in the constellation model $\omega$ has a Gaussian density within this space, with mean and precision parameters $\theta_{p,\omega}^{A} = \{\mu_{p,\omega}^{A}, \Gamma_{p,\omega}^{A}\}$. From these, the appearance likelihood described above is computed as a product of Gaussians over the model parts for a given hypothesis $h$ and mixture component $\omega$:

$$p(A \mid h, \omega, \theta^{A}) = \prod_{p=1}^{P} G(A(h_p) \mid \mu_{p,\omega}^{A}, \Gamma_{p,\omega}^{A})$$
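A sketch of that product of Gaussians, with assumed per-part parameters theta["mu"] (means) and theta["prec"] (precision matrices) in appearance space:

```python
import numpy as np

def appearance_likelihood(A, h, theta):
    """Product of Gaussian densities over the P model parts for one
    hypothesis h (a tuple assigning an interest region to each part)."""
    likelihood = 1.0
    for p, region in enumerate(h):
        diff = A[region] - theta["mu"][p]
        prec = theta["prec"][p]           # precision = inverse covariance
        d = len(diff)
        norm = np.sqrt(np.linalg.det(prec) / (2 * np.pi) ** d)
        likelihood *= norm * np.exp(-0.5 * diff @ prec @ diff)
    return likelihood
```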
Shape
The shape of the model for a given mixture component $\omega$ and hypothesis $h$ is represented as a joint Gaussian density of the locations of features. These features are transformed into a scale- and translation-invariant space before the relative locations of the parts are modelled by a $2(P-1)$-dimensional Gaussian. This yields the shape likelihood $p(X \mid h, \omega, \theta^{X})$, completing the representation of $p(X, A, h, \omega \mid \theta)$. To reduce the number of hypotheses in the hypothesis space $H$, only those hypotheses that satisfy the ordering constraint that the $x$-coordinate of each part is monotonically increasing are considered, reducing the size of $H$ by a factor of $P!$.
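A corresponding sketch of the shape term; the normalization used here (locations relative to the first part, divided by the configuration's spread) is one simple choice of translation- and scale-invariant transform, not necessarily the one used in the original work, and the ordering constraint is enforced by assigning violating hypotheses zero likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def shape_likelihood(X, h, theta):
    """Joint Gaussian density over the 2(P-1) relative part locations
    for hypothesis h, after translation/scale normalization."""
    locs = np.array([X[r] for r in h], dtype=float)   # P x 2 locations
    # Ordering constraint: part x-coordinates must increase monotonically.
    if not np.all(np.diff(locs[:, 0]) > 0):
        return 0.0
    rel = locs[1:] - locs[0]                          # translation invariance
    scale = np.linalg.norm(rel)
    if scale == 0.0:
        return 0.0
    rel = (rel / scale).ravel()                       # scale invariance, 2(P-1)-dim
    return multivariate_normal.pdf(rel, mean=theta["shape_mean"],
                                   cov=theta["shape_cov"])
```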
Conjugate densities
To compute $R$, the integral $\int p(X, A \mid \theta, O_{fg})\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta$ must be evaluated, but it is analytically intractable. The object category model above gives information about $p(X, A \mid \theta, O_{fg})$, so what remains is to examine the posterior $p(\theta \mid X_t, A_t, O_{fg})$ and find a sufficient approximation to render the integral tractable. Previous work approximates the posterior by a delta function centered at $\theta^{*}$, collapsing the integral into $p(X, A \mid \theta^{*}, O_{fg})$, where $\theta^{*}$ is normally estimated using a maximum-likelihood or maximum a posteriori procedure. However, because one-shot learning uses few training examples, the posterior will not be well-peaked, as a delta-function approximation assumes. Thus, instead of this traditional approximation, the Bayesian one-shot learning algorithm seeks "to find a parametric form of $p(\theta)$ such that the learning of $p(\theta \mid X_t, A_t, O_{fg})$ is feasible". The algorithm employs a Normal-Wishart distribution as the conjugate prior, and in the learning phase, variational Bayesian methods with the same computational complexity as maximum-likelihood methods are used to learn the hyperparameters of the distribution. Then, since $p(X, A \mid \theta)$ is a product of Gaussians, as chosen in the object category model, the integral reduces to a multivariate Student's t-distribution, which can be evaluated.
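The reduction to a Student's t density can be illustrated for a single Gaussian part. Using the standard conjugate result in the Normal-inverse-Wishart parameterization of this conjugate family, with hypothetical posterior hyperparameters (m, kappa, nu, S), the posterior predictive density is a multivariate Student's t, which the following sketch evaluates:

```python
import numpy as np
from scipy.special import gammaln

def multivariate_t_pdf(x, mu, Sigma, dof):
    """Density of a multivariate Student's t distribution."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)  # Mahalanobis term
    log_pdf = (gammaln((dof + d) / 2) - gammaln(dof / 2)
               - 0.5 * d * np.log(dof * np.pi)
               - 0.5 * np.linalg.slogdet(Sigma)[1]
               - 0.5 * (dof + d) * np.log1p(maha / dof))
    return np.exp(log_pdf)

def posterior_predictive(x, m, kappa, nu, S):
    """Predictive density of a Gaussian whose mean and covariance have a
    Normal-inverse-Wishart posterior (m, kappa, nu, S): a Student's t
    with nu - d + 1 degrees of freedom (standard conjugate result)."""
    d = len(m)
    dof = nu - d + 1
    Sigma = S * (kappa + 1) / (kappa * dof)
    return multivariate_t_pdf(x, m, Sigma, dof)
```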
Implementation
Feature detection and representation
To detect features in an image so that it can be represented by a constellation model, the Kadir–Brady saliency detector is used on grey-scale images to find salient regions of the image. These regions are then clustered, yielding a number of features (the clusters) and the shape parameter $X$, composed of the cluster centers. The Kadir–Brady detector was chosen because it produces fewer, more salient regions, as opposed to feature detectors such as multiscale Harris, which produce numerous, less significant regions.

The regions are then taken from the image and rescaled to a small patch of 11 × 11 pixels, allowing each patch to be represented in a 121-dimensional space. This dimensionality is reduced using principal component analysis, and $A$, the appearance parameter, is then formed from the first 10 principal components of each patch.
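A minimal sketch of this appearance pipeline, assuming the salient regions have already been extracted and rescaled to 11 × 11 grey-scale patches (in practice the PCA basis would be fixed, fitted on patches from the training categories rather than refitted per image):

```python
import numpy as np
from sklearn.decomposition import PCA

def appearance_descriptors(patches, n_components=10):
    """Map 11x11 grey-scale patches to 10-dimensional appearance vectors
    by flattening each patch to 121 dimensions and projecting onto the
    first 10 principal components."""
    flat = np.asarray(patches).reshape(len(patches), -1)  # (num_patches, 121)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(flat)
```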