Rigid motion segmentation
In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques label pixels as belonging to regions that share certain characteristics at a particular time. Here, in contrast, pixels are segmented according to their relative movement over a period of time, i.e. the duration of the video sequence.
A number of methods have been proposed to do so. There is no consistent way to classify motion segmentation due to the large variation in the literature. Depending on the segmentation criterion used in the algorithm, the methods can be broadly classified into the following categories: image difference, statistical methods, wavelets, layering, optical flow and factorization. Moreover, depending on the number of views required, the algorithms can be two-view or multi-view based. Rigid motion segmentation has found increasing application over the recent past with the rise of surveillance and video editing. These algorithms are discussed further below.
Introduction to rigid motion
In general, motion can be considered to be a transformation of an object in space and time. If this transformation preserves the size and shape of the object it is known as a rigid transformation. A rigid transform can be rotational, translational or reflective. We define a rigid transformation mathematically as a map F : ℝ³ → ℝ³, where F is a rigid transform if and only if it preserves isometry (‖F(x) − F(y)‖ = ‖x − y‖ for all points x and y) and space orientation.
In the sense of motion, a rigid transform is the movement of a rigid object in space. As shown in Figure 1, this 3-D motion is the transformation from the original co-ordinates X to the transformed co-ordinates X′, which is the result of rotation and translation captured by the rotation matrix R and the translation vector T respectively. Hence the transform will be:

X′ = R ⋅ X + T

where the rotation matrix R has 9 unknowns, which correspond to the rotation angle about each axis, and the translation vector T has 3 unknowns, which account for translation in the X, Y and Z directions respectively.
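As a minimal sketch of this transform (using NumPy; the rotation axis, angle and translation are arbitrary illustrative choices), the following applies X′ = R ⋅ X + T to a set of 3-D points and checks the two defining properties of a rigid transform, preserved distances and preserved orientation:

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the Z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

R = rotation_z(np.pi / 4)         # rotation by 45 degrees about Z (arbitrary)
T = np.array([1.0, 2.0, 0.5])     # translation in X, Y and Z (arbitrary)

points = np.random.rand(5, 3)     # five 3-D points, one per row
transformed = points @ R.T + T    # X' = R . X + T applied to each point

# Isometry: pairwise distances are preserved by the transform
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(transformed[0] - transformed[1])
assert np.isclose(d_before, d_after)

# Orientation: R is orthonormal with determinant +1
assert np.allclose(R @ R.T, np.eye(3))
assert np.isclose(np.linalg.det(R), 1.0)
```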
This motion in time, when captured by a camera, corresponds to a change of pixels in the subsequent frames of the video sequence. This transformation is also known as 2-D rigid body motion or the 2-D Euclidean transformation. It can be written as:

X′ = R ⋅ X + t

where,
X→ original pixel co-ordinate.
X'→ transformed pixel co-ordinate.
R→ orthonormal rotation matrix with R ⋅ Rᵀ = I and |R| = 1.
t→ translational vector but in the 2D image space.
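A corresponding sketch of the 2-D Euclidean transformation (again NumPy; the angle and offsets are arbitrary) builds the orthonormal R and applies X′ = R ⋅ X + t to a pixel co-ordinate:

```python
import numpy as np

def euclidean_2d(theta, tx, ty):
    """Return a 2-D rigid (Euclidean) transform X' = R . X + t."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])        # orthonormal: R @ R.T == I, det(R) == 1
    t = np.array([tx, ty])
    return lambda X: R @ X + t

warp = euclidean_2d(np.deg2rad(10), 5.0, -3.0)  # arbitrary motion
x = np.array([120.0, 64.0])   # original pixel co-ordinate X
x_new = warp(x)               # transformed pixel co-ordinate X'
```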
To visualize this, consider a video sequence from a traffic surveillance camera. It will contain moving cars, and this movement does not change their shape or size. Moreover, the movement is a combination of rotation and translation of the car in 3-D, which is reflected in its subsequent video frames. Thus the car is said to have rigid motion.
Motion segmentation
Image segmentation techniques are interested in segmenting out the different parts of an image according to the region of interest. As videos are sequences of images, motion segmentation aims at decomposing a video into moving objects and background by segmenting the objects that undergo different motion patterns. The analysis of these spatial and temporal changes occurring in the image sequence, by separating visual features from the scenes into different groups, lets us extract visual information. Each group corresponds to the motion of an object in the dynamic sequence. In the simplest case motion segmentation can mean extracting moving objects from a stationary camera, but the camera can also move, which introduces the relative motion of the static background.

Depending upon the type of visual features that are extracted, motion segmentation algorithms can be broadly divided into two categories. The first is known as direct motion segmentation, which uses pixel intensities from the image. Such algorithms assume constant illumination. The second category of algorithms computes a set of features corresponding to actual physical points on the objects. These sparse features are then used to characterize either the 2-D motion of the scene or the 3-D motion of the objects in the scene.
There are a number of requirements for the design of a good motion segmentation algorithm. The algorithm must extract distinct features that represent the object by a limited number of points, and it must be able to deal with occlusions. The images will also be affected by noise and may have missing data, so it must be robust to both. Some algorithms detect only one object, but a video sequence may contain several different motions, so the algorithm must be able to detect multiple objects. Moreover, the type of camera model, if used, also characterizes the algorithm. Depending upon its object characterization, an algorithm can detect rigid motion, non-rigid motion or both. Algorithms used to estimate single rigid-body motions can provide accurate results with robustness to noise and outliers, but when extended to multiple rigid-body motions they fail. In the case of the view-based segmentation techniques described below, this happens because the single fundamental matrix assumption is violated: each motion must now be represented by a new fundamental matrix corresponding to that motion.
Segmentation algorithms
As mentioned earlier, there is no single way to classify motion segmentation techniques, but depending on the segmentation criterion used in the algorithm they can be broadly classified as follows:
Image difference
It is a very useful technique for detecting changes in images due to its simplicity and its ability to deal with occlusion and multiple motions. These techniques assume constant light source intensity. The algorithm first considers two frames at a time and then computes the pixel-by-pixel intensity difference. On this computation it thresholds the intensity difference and maps the changes onto a contour. Using this contour it extracts the spatial and temporal information required to define the motion in the scene. Though it is a simple technique to implement, it is not robust to noise. Another difficulty with these techniques is camera movement: when the camera moves there is a change in the entire image which has to be accounted for. Many new algorithms have been introduced to overcome these difficulties.
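A minimal sketch of this pipeline with OpenCV is shown below; the frame filenames, the difference threshold of 25 and the minimum contour area are illustrative assumptions:

```python
import cv2

# Two consecutive grayscale frames (hypothetical filenames)
frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Pixel-by-pixel intensity difference between the two frames
diff = cv2.absdiff(frame1, frame2)

# Threshold the intensity difference to keep only significant changes
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# Map the changes onto contours that outline the moving regions
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
moving_regions = [cv2.boundingRect(c) for c in contours
                  if cv2.contourArea(c) > 50]   # assumed minimum area
```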
Statistical theory
Motion segmentation can be seen as a classification problem where each pixel has to be classified as background or foreground. Such classifications are modeled under statistical theory and can be used in segmentation algorithms. These approaches can be further divided depending on the statistical framework used. The most commonly used frameworks are maximum a posteriori (MAP) probability, the particle filter (PF) and expectation maximization (EM).
MAP uses Bayes' rule for implementation, where a particular pixel has to be classified under predefined classes. PF is based on the concept of the evolution of a variable with varying weights over time; the final estimate is the weighted sum of all the variables. Both of these methods are iterative. The EM algorithm is also an iterative estimation method. It computes the maximum likelihood estimate of the model parameters in the presence of missing or hidden data and decides the most likely fit of the observed data.
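As one concrete example of the statistical approach, OpenCV's MOG2 background subtractor models each pixel with a mixture of Gaussians whose parameters are updated online, and classifies every pixel as background or foreground from the learned statistics. A minimal sketch, where the video filename and parameter values are assumptions:

```python
import cv2

# Per-pixel Gaussian mixture background model
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

cap = cv2.VideoCapture("traffic.mp4")   # hypothetical surveillance clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel is classified as foreground (255) or background (0);
    # the mixture parameters are re-estimated frame by frame.
    fg_mask = subtractor.apply(frame)
cap.release()
```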
Optical flow
Optical flow (OF) helps in determining the relative pixel velocity of points within an image sequence. Like image difference, it is also an old concept used for segmentation. Initially the main drawback of OF was the lack of robustness to noise and high computational costs, but due to recent key-point matching techniques and hardware implementations these limitations have diminished. To increase its robustness to occlusion and temporal stopping, OF is generally used together with other statistical or image difference techniques.
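A minimal dense optical flow sketch using OpenCV's Farnebäck method; the frame filenames, parameter values and the 1 px/frame motion threshold are illustrative assumptions:

```python
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: a per-pixel (dx, dy) velocity field between frames
# (arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Crude segmentation: mark pixels moving faster than 1 px/frame
magnitude = np.linalg.norm(flow, axis=2)
moving = magnitude > 1.0
```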
For complicated scenarios, particularly when the camera itself is moving, OF provides a basis for estimating the fundamental matrix where outliers represent other objects moving independently in the scene.
Alternatively, optical flow based on line segments instead of point features can also be used to segment multiple rigid-body motions.
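The point-feature variant can be sketched with OpenCV as follows: sparse ORB features are matched between two views, a single fundamental matrix is fitted with RANSAC, and the RANSAC outliers become candidate points on independently moving objects. The filenames and parameter values are assumptions:

```python
import cv2
import numpy as np

img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Match sparse point features between the two views
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Fit one fundamental matrix to the dominant motion with RANSAC;
# inliers follow the dominant (e.g. camera-induced) motion, while
# outliers are candidate points on independently moving objects.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
independent_points = pts1[inlier_mask.ravel() == 0]
```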
Wavelet
An image is made up of different frequency components. Edges, corners and plane regions can be represented by means of different frequencies. Wavelet-based methods analyse the different frequency components of the images and then study each component at a resolution matched to its scale. Multi-scale decomposition is generally used in order to reduce noise. Though this method provides good results, it is limited by the assumption that the movement of objects is only in front of the camera.
Implementations of wavelet-based techniques exist in combination with other approaches, such as optical flow, and are applied at various scales to reduce the effect of noise.
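A minimal sketch of such a multi-scale analysis with PyWavelets (not any one published method; the filenames, the Haar wavelet, the decomposition depth and both thresholds are assumptions):

```python
import cv2
import numpy as np
import pywt

frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE).astype(float)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE).astype(float)

# Multi-scale decomposition of the frame difference: each level isolates
# a band of frequencies (edges and corners at fine scales, regions at coarse)
coeffs = pywt.wavedec2(frame2 - frame1, wavelet="haar", level=3)

# Reduce noise by soft-thresholding the detail coefficients at every scale
approx, details = coeffs[0], coeffs[1:]
denoised = [approx]
for cH, cV, cD in details:
    denoised.append(tuple(pywt.threshold(c, value=10.0, mode="soft")
                          for c in (cH, cV, cD)))

# Reconstruct a denoised change map and keep strongly changing pixels
change_map = pywt.waverec2(denoised, wavelet="haar")
moving = np.abs(change_map) > 15.0
```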