Volumetric capture
Volumetric capture or volumetric video is a technique that captures a three-dimensional space, such as a location or performance. This type of volumography acquires data that can be viewed on flat screens as well as on 3D displays and VR headsets. Consumer-facing formats are numerous, and the required motion capture techniques lean on computer graphics, photogrammetry, and other computation-based methods. The viewer generally experiences the result in a real-time engine and has direct input in exploring the generated volume.
History
Recording talent without the limitation of a flat screen has been depicted in science fiction for a long time. Holograms and 3D real-world visuals have featured prominently in Star Wars, Blade Runner, and many other science-fiction productions over the years. Through growing advances in computer graphics, optics, and data processing, this fiction has slowly evolved into reality. Volumetric video is the logical next step after stereoscopic film and 360° video: it combines the visual quality of photography with the immersion and interactivity of spatialized content, and could prove to be the most important development in the recording of human performance since the creation of contemporary cinema.
Computer graphics and VFX
Creating 3D models from video, photography, and other ways of measuring the world has always been an important topic in computer graphics. The ultimate goal is to imitate reality in minute detail while giving creatives the power to build worlds atop this foundation to match their vision. Traditionally, artists create these worlds using modeling and rendering techniques developed over decades since the birth of computer graphics. Visual effects in movies and video games paved the way for advances in photogrammetry, scanning devices, and the computational backend needed to handle the data produced by these intensive methods. Generally, these advances have come as a by-product of creating more advanced visuals for entertainment and media rather than as the goal of the field itself.
LIDAR
LIDAR scanning describes a survey method that uses densely packed laser-sampled points to scan static objects into a point cloud. This requires physical scanners and produces enormous amounts of data. In 2007 the band Radiohead used it extensively to create a music video for "House of Cards", capturing point-cloud performances of the singer's face and of select environments in one of the first uses of this technology for volumetric capture. Director James Frost collaborated with media artist Aaron Koblin to capture the 3D point clouds used for this music clip, and while the final output of this work was still a rendered flat representation of the data, the capture and the mindset of the authors were already ahead of their time. Point clouds, being distinct samples of three-dimensional space carrying position and color, create a high-fidelity representation of the real world at the cost of a huge amount of data. However, viewing this data in real time was not yet possible.
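To make those data volumes concrete, the sketch below models a point cloud as plain position-and-color samples and estimates its raw memory footprint; the point count, attribute layout, and frame rate are illustrative assumptions rather than figures from any particular scanner.

```python
import numpy as np

# A point cloud is a set of 3D samples with per-point attributes.
# Assumed layout: 3 x float32 position (12 bytes) + 3 x uint8 color (3 bytes).
num_points = 10_000_000          # illustrative size for a single dense scan
positions = np.zeros((num_points, 3), dtype=np.float32)   # x, y, z in meters
colors = np.zeros((num_points, 3), dtype=np.uint8)        # r, g, b

bytes_per_point = positions.itemsize * 3 + colors.itemsize * 3
frame_bytes = num_points * bytes_per_point
print(f"{bytes_per_point} bytes/point, {frame_bytes / 1e6:.0f} MB per static scan")

# Played back as an animated sequence at 30 frames per second,
# the raw, uncompressed stream would be on the order of:
fps = 30
print(f"~{frame_bytes * fps / 1e9:.1f} GB/s uncompressed")
```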
Structured light
In 2010 Microsoft brought the Kinect to market, a consumer product that used structured light in the infrared spectrum to generate a 3D mesh from its camera. While the intent was to facilitate and innovate in user input and gameplay, it was very quickly adapted as a generic capture device for 3D data in the volumetric capture community. By projecting a known pattern onto the space and capturing how objects in the scene distort it, depth can be recovered and the resulting capture computed into different outputs. Artists and hobbyists started to make tools and projects around the affordable device, sparking a growing interest in volumetric capture as a creative medium.
Researchers at Microsoft then constructed an entire capture stage using multiple cameras, Kinect devices, and algorithms that generated a full volumetric capture from the combined optical and depth information. This is now the Microsoft Mixed Reality Capture Studios, used today both within their research division and in select commercial experiences such as the Blade Runner 2049 VR experience. There are currently three studios in operation: Redmond, WA; San Francisco, CA; and London, England. While this remains a very interesting setup for the high-end market, the affordable price of a single Kinect device led more experimental artists and independent directors to become active in the volumetric capture field. Two results from this activity are EF EVE™ and Depthkit. EF EVE™ supports an unlimited number of Azure Kinect sensors on one PC, giving full volumetric capture with an easy setup, automatic sensor calibration, and VFX functionality. Depthkit is a software suite that allows the capture of geometry data with one structured light sensor, including the Azure Kinect, as well as high-quality color detail from an attached witness camera.
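As a rough sketch of what a depth sensor such as the Kinect delivers, the snippet below back-projects a depth image into a 3D point set using a pinhole camera model; the intrinsic parameters are placeholder assumptions, not the calibration of any actual Kinect model.

```python
import numpy as np

# Assumed pinhole intrinsics (focal lengths fx, fy and principal point cx, cy).
# Real devices ship with per-unit calibration; these values are illustrative.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

def depth_to_points(depth_m: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) depth image in meters to an (N, 3) point set."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no depth reading

# Example with a synthetic flat depth image 2 m from the sensor.
fake_depth = np.full((480, 640), 2.0)
print(depth_to_points(fake_depth).shape)   # (307200, 3)
```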
Photogrammetry
Photogrammetry describes the process of deriving measurements from photographic reference. While the idea is as old as photography itself, advances in volumetric capture research over the years have made it possible to recover ever more geometry and texture detail from a large number of input images. The result is usually split into two composited sources: static geometry and full performance capture. For static geometry, sets are captured with a large number of overlapping digital images, which are aligned to each other using features shared between the images and used as a basis for triangulation and depth estimation. This information is interpreted as 3D geometry, resulting in a near-perfect replica of the set. Full performance capture, by contrast, uses an array of video cameras to capture real-time information. These synchronized cameras are used frame by frame to generate a set of points or geometry that can be played back at speed, resulting in a full volumetric performance capture that can be composited into any environment. In 2008, 4DViews installed one of the first volumetric video capture systems at the DigiCast studio in Tokyo. Later, in 2015, 8i contributed to the field, and more recently Intel, Microsoft, and Samsung have joined in by creating their own capture stages for performance capture and photogrammetry.
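The alignment-and-triangulation step can be sketched for a single image pair. The snippet below uses OpenCV feature matching and two-view triangulation; the camera matrix and image paths are placeholder assumptions, and a production photogrammetry pipeline would bundle-adjust many views rather than rely on a single pair.

```python
import cv2
import numpy as np

# Assumed shared camera intrinsics for both photos (illustrative values).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input photos
img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Find and match features shared between the two overlapping images.
orb = cv2.ORB_create(4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Estimate the relative camera pose from the matched features.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# 3. Triangulate matched points into sparse 3D geometry.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
points_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous
points_3d = (points_h[:3] / points_h[3]).T
print(points_3d.shape)
```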
Virtual reality
As volumetric video developed into a commercially applicable approach to environment and performance capture, the ability to move about the results with six degrees of freedom and true stereoscopy necessitated a new type of display device. With the rise of consumer-facing VR in 2016 through devices such as the Oculus Rift and HTC Vive, this suddenly became possible. Stereoscopic viewing and the ability to rotate and move the head, as well as to move within a small space, allow immersion into environments well beyond what was possible in the past. The photographic nature of the captures, combined with this immersion and interactivity, brings the medium a giant step closer to the holy grail of true virtual reality. With the rise of 360° video content, the demand for 6-DOF capture is growing, and VR in particular drives the applications for this technology, slowly fusing cinema, games, and art with the field of volumetric capture research. Volumetric video is currently being used to deliver virtual concerts via the Scenez application on Meta Quest and Apple Vision Pro devices.
Light fields
Light fields describe, at a given sample point, the incoming light from all directions. This can be used in post-processing to generate effects such as depth of field, as well as to allow the viewer to move their head slightly. Since 2006 Lytro has created consumer-facing cameras for capturing light fields. Fields can be captured inside-out in camera or outside-in from renderings of 3D geometry, representing a huge amount of information ready to be manipulated. Data rates are currently still a large issue, but the technique has great potential for the future, as it samples light and can display the result in a variety of ways.
Another by-product of this technique is a reasonably accurate depth map of the scene, meaning each pixel carries information about its distance from the camera. Facebook uses this idea in its Surround360 camera family to capture 360° video footage that is stitched with the help of depth maps. Extracting this raw data is possible and allows a high-resolution capture of any stage. Again, the data rates combined with the fidelity of the depth maps are major bottlenecks, but ones likely to be overcome with more advanced depth-estimation techniques, compression, and parametric light fields.
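The depth-of-field effect mentioned above can be approximated by synthetic-aperture refocusing: shift each sub-aperture view in proportion to its offset from the central view and average the results. The sketch below assumes a small 4D light field already loaded in memory; the array shape and slope parameter are illustrative assumptions, and integer shifts are used only to keep the example simple.

```python
import numpy as np

def refocus(light_field: np.ndarray, slope: float) -> np.ndarray:
    """Synthetic-aperture refocus of a light field shaped (U, V, H, W).

    Each sub-aperture view (u, v) is shifted by slope times its offset
    from the central view and all views are averaged; rounding to integer
    shifts makes this only an approximation of true refocusing.
    """
    U, V, H, W = light_field.shape
    u0, v0 = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((H, W), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            dy = int(round(slope * (u - u0)))
            dx = int(round(slope * (v - v0)))
            out += np.roll(light_field[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)

# Illustrative 5x5 grid of 100x100 sub-aperture views filled with noise.
lf = np.random.rand(5, 5, 100, 100)
print(refocus(lf, slope=1.5).shape)   # (100, 100) refocused image
```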
Workflows
Different workflows to generate volumetric video are currently available. These are not mutually exclusive and are often used in combination. Two common examples are described below.
Mesh-based
This approach generates a traditional 3D triangle mesh similar to the geometry used for computer games and visual effects. The data volume is usually smaller, but quantizing real-world data into a lower-resolution mesh limits resolution and visual fidelity; trade-offs are generally made between mesh density and final experience performance.
Photogrammetry usually serves as the base for static meshes and is then augmented with performance capture of talent via the same underlying technology, videogrammetry. Intense clean-up is required to create the final set of triangles. To extend beyond the physical world, CG techniques can be deployed to further enhance the captured data, employing artists to build onto and into the static mesh as necessary. Playback is usually handled by a real-time engine and resembles a traditional game pipeline in implementation, allowing interactive lighting changes and creative, archivable ways of compositing static and animated meshes together.
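A mesh-based capture is essentially a sequence of meshes played back like video frames inside a real-time engine. The minimal sketch below only shows the bookkeeping of selecting the mesh frame for the current playback time; the frame rate and data layout are assumptions for illustration, not a description of any particular engine.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MeshFrame:
    vertices: np.ndarray    # (N, 3) float32 positions
    uvs: np.ndarray         # (N, 2) float32 texture coordinates
    triangles: np.ndarray   # (M, 3) int32 vertex indices
    texture: np.ndarray     # (H, W, 3) uint8 color image for this frame

class MeshSequencePlayer:
    """Picks the captured mesh frame matching the current playback time."""

    def __init__(self, frames: list[MeshFrame], capture_fps: float = 30.0):
        self.frames = frames
        self.capture_fps = capture_fps   # assumed capture frame rate

    def frame_at(self, time_seconds: float) -> MeshFrame:
        index = int(time_seconds * self.capture_fps)
        return self.frames[min(index, len(self.frames) - 1)]

# In an engine loop, the selected frame's mesh and texture would be uploaded
# to the GPU and rendered alongside the static photogrammetry environment.
```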
Point-based
Recently the spotlight has shifted towards point-based volumetric capture. The resulting data is represented as points, or particles, in 3D space carrying attributes such as color and point size with them. This allows for greater information density and higher-resolution content. The required data rates are high, and current graphics hardware, being optimized for mesh-based rendering pipelines, is not well suited to rendering this data.
The main advantage of points is the potential for higher spatial resolution. Points can either be scattered on triangle meshes with pre-computed lighting, or used directly from a LIDAR scanner. Performance of talent is captured in the same way as in the mesh-based approach, but more time and computational power can be spent at production time to further improve the data. At playback, level of detail can be used to manage the computational load on the playback device, increasing or decreasing the number of points rendered. Interactive lighting changes are harder to realize because the bulk of the data is pre-baked: while the lighting information stored with the points is very accurate and high-fidelity, it cannot easily be changed for any given situation. Another benefit of point capture is that computer graphics can be rendered at very high quality and also stored as points, opening the door to a seamless blend of real and imagined elements.
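As a minimal sketch of point-cloud level of detail, the function below subsamples a point set based on its distance from the viewer so that far-away captures cost fewer points; the distance bands and keep fractions are illustrative assumptions, not a standard scheme.

```python
import numpy as np

def lod_subsample(points: np.ndarray, camera_pos: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Randomly keep a fraction of points chosen by distance to the camera.

    points: (N, 3) positions; camera_pos: (3,) viewer position.
    The distance thresholds and keep ratios below are purely illustrative.
    """
    distance = np.linalg.norm(points.mean(axis=0) - camera_pos)
    if distance < 2.0:        # close: render everything
        keep = 1.0
    elif distance < 10.0:     # mid range: half the points
        keep = 0.5
    else:                     # far away: a tenth of the points
        keep = 0.1
    mask = rng.random(len(points)) < keep
    return points[mask]

rng = np.random.default_rng(0)
cloud = rng.random((1_000_000, 3)) * 2.0          # synthetic 1M-point capture
print(len(lod_subsample(cloud, np.array([0.0, 0.0, 20.0]), rng)))  # ~100k points
```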
After capturing and generating the data, editing and compositing are done within a real-time engine, connecting the recorded actions to tell the intended story. The final product can then be viewed either as a flat rendering of the captured data or interactively in a VR headset.
While one goal of the point-based approach to volumetric capture is to stream point data from the cloud to users at home, allowing the creation and dissemination of realistic virtual worlds on demand, a second, more recently considered goal is the real-time streaming of live events. This requires very high bandwidth, as each pixel carries depth information in addition to color.
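To give a rough sense of scale, the back-of-the-envelope calculation below estimates the uncompressed bandwidth of a single RGB-D stream; the resolution, bit depths, and frame rate are assumptions chosen only for illustration.

```python
# Assumed single-camera live stream: 1920x1080 pixels, 24-bit color,
# 16-bit depth per pixel, 30 frames per second, no compression.
width, height = 1920, 1080
bytes_per_pixel = 3 + 2            # RGB (3 bytes) + depth (2 bytes)
fps = 30

bytes_per_second = width * height * bytes_per_pixel * fps
print(f"{bytes_per_second / 1e6:.0f} MB/s "
      f"(~{bytes_per_second * 8 / 1e9:.1f} Gbit/s) per camera, uncompressed")
# A multi-camera volumetric rig multiplies this further, which is why
# compression and selective streaming are active areas of work.
```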