Rigid motion segmentation
In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques label pixels as belonging to regions that share certain characteristics at a particular time. Here, in contrast, pixels are segmented according to their relative movement over a period of time, i.e. the duration of the video sequence.
A number of methods have been proposed to do so. There is no consistent way to classify motion segmentation due to the large variation in the literature. Depending on the segmentation criterion used in the algorithm, the methods can be broadly classified into the following categories: image difference, statistical methods, wavelets, layering, optical flow and factorization. Moreover, depending on the number of views required, the algorithms can be two-view or multi-view based. Rigid motion segmentation has found increasing application over the recent past with the rise of surveillance and video editing. These algorithms are discussed further below.
Introduction to rigid motion
In general, motion can be considered to be a transformation of an object in space and time. If this transformation preserves the size and shape of the object it is known as a rigid transformation. A rigid transform can be rotational, translational or reflective. We define a rigid transformation mathematically as a map F : ℝ³ → ℝ³, where F is a rigid transform if and only if it preserves isometry (‖F(x) − F(y)‖ = ‖x − y‖ for all points x and y) and space orientation.
In the sense of motion, a rigid transform is the movement of a rigid object in space. As shown in Figure 1, this 3-D motion is the transformation from the original co-ordinates X to the transformed co-ordinates X′, which is the result of rotation and translation captured by the rotation matrix R and the translation vector T respectively. Hence the transform will be:

X′ = R ⋅ X + T

where the rotation matrix R has 9 unknowns, which correspond to the rotation angle about each axis, and the translation vector T has 3 unknowns, which account for translation in the X, Y and Z directions respectively.
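As a minimal sketch of this transform (using NumPy; the rotation axis, angle and translation are arbitrary illustrative choices), the following applies X′ = R ⋅ X + T to a set of 3-D points and checks the two defining properties of a rigid transform, preserved distances and preserved orientation:

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the Z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

R = rotation_z(np.pi / 4)         # rotation by 45 degrees about Z (arbitrary)
T = np.array([1.0, 2.0, 0.5])     # translation in X, Y and Z (arbitrary)

points = np.random.rand(5, 3)     # five 3-D points, one per row
transformed = points @ R.T + T    # X' = R . X + T applied to each point

# Isometry: pairwise distances are preserved by the transform
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(transformed[0] - transformed[1])
assert np.isclose(d_before, d_after)

# Orientation: R is orthonormal with determinant +1
assert np.allclose(R @ R.T, np.eye(3))
assert np.isclose(np.linalg.det(R), 1.0)
```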
This motion in time, when captured by a camera, corresponds to a change of pixels in the subsequent frames of the video sequence. This transformation is also known as 2-D rigid body motion or the 2-D Euclidean transformation. It can be written as:

X′ = R ⋅ X + t

where,
X→ original pixel co-ordinate.
X'→ transformed pixel co-ordinate.
R→ orthonormal rotation matrix with R ⋅ Rᵀ = I and |R| = 1.
t→ translational vector but in the 2D image space.
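A corresponding sketch of the 2-D Euclidean transformation (again NumPy; the angle and offsets are arbitrary) builds the orthonormal R and applies X′ = R ⋅ X + t to a pixel co-ordinate:

```python
import numpy as np

def euclidean_2d(theta, tx, ty):
    """Return a 2-D rigid (Euclidean) transform X' = R . X + t."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])        # orthonormal: R @ R.T == I, det(R) == 1
    t = np.array([tx, ty])
    return lambda X: R @ X + t

warp = euclidean_2d(np.deg2rad(10), 5.0, -3.0)  # arbitrary motion
x = np.array([120.0, 64.0])   # original pixel co-ordinate X
x_new = warp(x)               # transformed pixel co-ordinate X'
```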
To visualize this, consider a video sequence from a traffic surveillance camera. It will contain moving cars, and this movement does not change their shape or size. Moreover, the movement is a combination of rotation and translation of the car in 3-D, which is reflected in its subsequent video frames. Thus the car is said to have rigid motion.
Motion segmentation
Image segmentation techniques are interested in segmenting out the different parts of an image according to the region of interest. As videos are sequences of images, motion segmentation aims at decomposing a video into moving objects and background by segmenting the objects that undergo different motion patterns. The analysis of these spatial and temporal changes occurring in the image sequence, by separating visual features from the scenes into different groups, lets us extract visual information. Each group corresponds to the motion of an object in the dynamic sequence. In the simplest case motion segmentation can mean extracting moving objects from a stationary camera, but the camera can also move, which introduces the relative motion of the static background.

Depending upon the type of visual features that are extracted, motion segmentation algorithms can be broadly divided into two categories. The first is known as direct motion segmentation, which uses pixel intensities from the image. Such algorithms assume constant illumination. The second category of algorithms computes a set of features corresponding to actual physical points on the objects. These sparse features are then used to characterize either the 2-D motion of the scene or the 3-D motion of the objects in the scene.
There are a number of requirements for the design of a good motion segmentation algorithm. The algorithm must extract distinct features that represent the object by a limited number of points, and it must be able to deal with occlusions. The images will also be affected by noise and may have missing data, so it must be robust to both. Some algorithms detect only one object, but a video sequence may contain several different motions, so the algorithm must be able to detect multiple objects. Moreover, the type of camera model, if used, also characterizes the algorithm. Depending upon its object characterization, an algorithm can detect rigid motion, non-rigid motion or both. Algorithms used to estimate single rigid-body motions can provide accurate results with robustness to noise and outliers, but when extended to multiple rigid-body motions they fail. In the case of the view-based segmentation techniques described below, this happens because the single fundamental matrix assumption is violated: each motion must now be represented by a new fundamental matrix corresponding to that motion.
Segmentation algorithms
As mentioned earlier, there is no single way to classify motion segmentation techniques, but depending on the segmentation criterion used in the algorithm they can be broadly classified as follows:
Image difference
It is a very useful technique for detecting changes in images due to its simplicity and its ability to deal with occlusion and multiple motions. These techniques assume constant light source intensity. The algorithm first considers two frames at a time and then computes the pixel-by-pixel intensity difference. On this computation it thresholds the intensity difference and maps the changes onto a contour. Using this contour it extracts the spatial and temporal information required to define the motion in the scene. Though it is a simple technique to implement, it is not robust to noise. Another difficulty with these techniques is camera movement: when the camera moves there is a change in the entire image which has to be accounted for. Many new algorithms have been introduced to overcome these difficulties.
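A minimal sketch of this pipeline with OpenCV is shown below; the frame filenames, the difference threshold of 25 and the minimum contour area are illustrative assumptions:

```python
import cv2

# Two consecutive grayscale frames (hypothetical filenames)
frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Pixel-by-pixel intensity difference between the two frames
diff = cv2.absdiff(frame1, frame2)

# Threshold the intensity difference to keep only significant changes
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# Map the changes onto contours that outline the moving regions
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
moving_regions = [cv2.boundingRect(c) for c in contours
                  if cv2.contourArea(c) > 50]   # assumed minimum area
```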
Statistical theory
Motion segmentation can be seen as a classification problem where each pixel has to be classified as background or foreground. Such classifications are modeled under statistical theory and can be used in segmentation algorithms. These approaches can be further divided depending on the statistical framework used. The most commonly used frameworks are maximum a posteriori (MAP) probability, the particle filter (PF) and expectation maximization (EM).
MAP uses Bayes' rule for implementation, where a particular pixel has to be classified under predefined classes. PF is based on the concept of the evolution of a variable with varying weights over time; the final estimate is the weighted sum of all the variables. Both of these methods are iterative. The EM algorithm is also an iterative estimation method. It computes the maximum likelihood estimate of the model parameters in the presence of missing or hidden data and decides the most likely fit of the observed data.
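As one concrete example of the statistical approach, OpenCV's MOG2 background subtractor models each pixel with a mixture of Gaussians whose parameters are updated online, and classifies every pixel as background or foreground from the learned statistics. A minimal sketch, where the video filename and parameter values are assumptions:

```python
import cv2

# Per-pixel Gaussian mixture background model
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

cap = cv2.VideoCapture("traffic.mp4")   # hypothetical surveillance clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel is classified as foreground (255) or background (0);
    # the mixture parameters are re-estimated frame by frame.
    fg_mask = subtractor.apply(frame)
cap.release()
```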
Optical flow
Optical flow (OF) helps in determining the relative pixel velocity of points within an image sequence. Like image difference, it is also an old concept used for segmentation. Initially the main drawback of OF was the lack of robustness to noise and high computational costs, but due to recent key-point matching techniques and hardware implementations these limitations have diminished. To increase its robustness to occlusion and temporal stopping, OF is generally used together with other statistical or image difference techniques.
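A minimal dense optical flow sketch using OpenCV's Farnebäck method; the frame filenames, parameter values and the 1 px/frame motion threshold are illustrative assumptions:

```python
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: a per-pixel (dx, dy) velocity field between frames
# (arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Crude segmentation: mark pixels moving faster than 1 px/frame
magnitude = np.linalg.norm(flow, axis=2)
moving = magnitude > 1.0
```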
For complicated scenarios, particularly when the camera itself is moving, OF provides a basis for estimating the fundamental matrix where outliers represent other objects moving independently in the scene.
Alternatively, optical flow based on line segments instead of point features can also be used to segment multiple rigid-body motions.
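The point-feature variant can be sketched with OpenCV as follows: sparse ORB features are matched between two views, a single fundamental matrix is fitted with RANSAC, and the RANSAC outliers become candidate points on independently moving objects. The filenames and parameter values are assumptions:

```python
import cv2
import numpy as np

img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Match sparse point features between the two views
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Fit one fundamental matrix to the dominant motion with RANSAC;
# inliers follow the dominant (e.g. camera-induced) motion, while
# outliers are candidate points on independently moving objects.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
independent_points = pts1[inlier_mask.ravel() == 0]
```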
Wavelet
An image is made up of different frequency components. Edges, corners and plane regions can be represented by means of different frequencies. Wavelet-based methods analyse the different frequency components of the images and then study each component at a resolution matched to its scale. Multi-scale decomposition is generally used in order to reduce noise. Though this method provides good results, it is limited by the assumption that the movement of objects is only in front of the camera.
Implementations of wavelet-based techniques exist in combination with other approaches, such as optical flow, and are applied at various scales to reduce the effect of noise.
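A minimal sketch of such a multi-scale analysis with PyWavelets (not any one published method; the filenames, the Haar wavelet, the decomposition depth and both thresholds are assumptions):

```python
import cv2
import numpy as np
import pywt

frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE).astype(float)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE).astype(float)

# Multi-scale decomposition of the frame difference: each level isolates
# a band of frequencies (edges and corners at fine scales, regions at coarse)
coeffs = pywt.wavedec2(frame2 - frame1, wavelet="haar", level=3)

# Reduce noise by soft-thresholding the detail coefficients at every scale
approx, details = coeffs[0], coeffs[1:]
denoised = [approx]
for cH, cV, cD in details:
    denoised.append(tuple(pywt.threshold(c, value=10.0, mode="soft")
                          for c in (cH, cV, cD)))

# Reconstruct a denoised change map and keep strongly changing pixels
change_map = pywt.waverec2(denoised, wavelet="haar")
moving = np.abs(change_map) > 15.0
```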