KinectFusion: Real-Time Dynamic 3D Surface Reconstruction and Interaction

Shahram Izadi¹, Richard A. Newcombe¹,², David Kim¹, Otmar Hilliges¹, David Molyneaux¹, Steve Hodges¹, Pushmeet Kohli¹, Jamie Shotton¹, Andrew J. Davison², Andrew Fitzgibbon¹

¹ Microsoft Research, {shahrami,b-davidk,otmarh,a-davmo,shodges,pkohli,jamiesho,awf}@microsoft.com
² Imperial College London, {rnewcombe,ajd}@doc.ic.ac.uk
Figure 1: We introduce a new system that rapidly acquires high-quality, geometrically precise 3D models of an entire room using a single moving Kinect camera. The models are generated from noisy Kinect data in real time; an entire room or smaller objects can be reconstructed in seconds (top sequences). We demonstrate a number of compelling new interactive possibilities, such as multi-touch on any arbitrarily shaped surface (bottom left sequence); real-time rigid-body physics simulated on a dynamic reconstructed model (bottom middle sequence); and rapid segmentation and tracking of objects within the model (bottom right sequence).
1 Introduction
We present KinectFusion, a system that takes live depth data from a moving Kinect camera and, in real time, creates high-quality, geometrically accurate 3D models. Our system allows a user holding a Kinect camera to move quickly within any indoor space and rapidly scan and create a fused 3D model of the whole room and its contents within seconds. Even small motions, caused for example by camera shake, lead to new viewpoints of the scene and thus refinements of the 3D model, similar to the effect of image super-resolution. As the camera is moved closer to objects in the scene, more detail can be added to the acquired 3D model. To achieve this, our system continually tracks the 6DOF pose of the camera and rapidly builds a representation of the geometry of arbitrary surfaces. Novel GPU-based implementations of both camera tracking and surface reconstruction allow us to run at interactive, real-time rates that have not previously been demonstrated. We define new instantiations of two well-known graphics algorithms designed specifically for parallelizable GPGPU hardware.
2 KinectFusion
The main system pipeline first takes the live depth map from Kinect and converts it from image coordinates into 3D points and normals in the coordinate space of the camera. Next, the tracking phase computes a rigid 6DOF transform that closely aligns the current oriented points with those of the previous frame, using a novel GPU implementation of the ICP algorithm [Besl and McKay 1992]. This defines a relative rigid transform from the previous camera pose to the current one. These relative transforms are incrementally composed into a global transform that defines the global pose of the camera.
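To make these first pipeline stages concrete, the following Python/NumPy fragment is a minimal CPU sketch of back-projecting a depth map into camera-space points and normals and composing an ICP result onto the running global pose. The intrinsic values and all function names are illustrative assumptions, not the system's API, and the real system runs these stages on the GPU.

```python
import numpy as np

# Assumed pinhole intrinsics for the Kinect depth camera (illustrative values).
FX, FY, CX, CY = 585.0, 585.0, 320.0, 240.0

def depth_to_points(depth):
    """Back-project an HxW depth map (meters) into HxWx3 camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    return np.dstack((x, y, depth))

def estimate_normals(points):
    """Approximate per-pixel normals from differences of neighboring points
    (borders wrap here; a real implementation would handle them explicitly)."""
    dx = np.roll(points, -1, axis=1) - np.roll(points, 1, axis=1)
    dy = np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)
    n = np.cross(dx, dy)
    return n / np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)

def update_global_pose(T_global_prev, T_rel):
    """Compose the relative rigid transform produced by one ICP step
    (current frame -> previous frame, 4x4) onto the running global pose."""
    return T_global_prev @ T_rel
```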
Given the global pose of the camera, points and normals are converted into global coordinates, and a single consistent 3D model is updated. Instead of simply fusing point clouds, we reconstruct surfaces based on a novel GPU-based implementation of volumetric truncated signed distance functions [Curless and Levoy 1996]. Voxels within the volume are updated based on our globally converted measurements: each voxel stores a running average of its distance to the assumed position of a physical surface. Finally, we use GPU-accelerated raycasting to render a view of the volume, and the 3D surfaces it contains, given the current global pose of the camera. This view of the volume equates to a synthetic depth map, which can be used as a less noisy, more globally consistent reference frame for the next iteration of ICP tracking. This allows us to track by comparing the current live depth map with our less noisy raycasted view of the model, as opposed to using only the live depth maps frame-to-frame.

The system can reconstruct a scene within seconds and enables interactive possibilities including: extending multi-touch interactions to any arbitrarily shaped reconstructed surface; advanced features for augmented reality; real-time physics simulated live on the dynamic model; and novel methods for segmentation and tracking of scanned objects. We also present extensions to our GPU-based tracking algorithm that distinguish scene motion from camera motion, thus dealing with dynamic scenes, in particular ones in which users are interacting. See Figure 1 and the accompanying video for examples.
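As a minimal illustration of the volumetric fusion and raycasting steps described above, the sketch below implements a CPU-side truncated signed distance volume in Python/NumPy. The voxel size, truncation band, and class and method names are assumptions for illustration; the actual system updates all voxels in parallel on the GPU.

```python
import numpy as np

VOXEL_SIZE = 0.01  # meters per voxel (assumed)
TRUNC = 0.03       # truncation band in meters (assumed)

class TSDFVolume:
    """Truncated signed distance volume after Curless and Levoy [1996].
    Each voxel keeps a running weighted average of its truncated signed
    distance to the nearest observed surface."""

    def __init__(self, dims=(256, 256, 256)):
        self.tsdf = np.ones(dims, dtype=np.float32)     # truncated distances
        self.weight = np.zeros(dims, dtype=np.float32)  # accumulated weights

    def integrate(self, idx, sdf, w=1.0):
        """Fold one signed-distance observation into voxel index `idx`
        as a running weighted average."""
        d = np.clip(sdf / TRUNC, -1.0, 1.0)
        w_old = self.weight[idx]
        self.tsdf[idx] = (self.tsdf[idx] * w_old + d * w) / (w_old + w)
        self.weight[idx] = w_old + w

    def raycast(self, origin, direction, step=VOXEL_SIZE, max_dist=4.0):
        """March a ray (world coordinates, meters) through the volume and
        return the first positive-to-negative zero crossing, i.e. a point
        on the reconstructed surface, or None if the ray hits nothing."""
        prev, t = None, 0.0
        while t < max_dist:
            p = origin + t * direction
            idx = tuple((p / VOXEL_SIZE).astype(int))
            if all(0 <= i < s for i, s in zip(idx, self.tsdf.shape)):
                d = self.tsdf[idx]
                if prev is not None and prev > 0 >= d:  # sign change: surface
                    return p
                prev = d
            t += step
        return None
```

Raycasting every pixel of such a volume yields the synthetic depth map used as the reference frame for the next ICP iteration.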
References

BESL, P. J., AND MCKAY, N. D. 1992. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 2, 239–256.

CURLESS, B., AND LEVOY, M. 1996. A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH 96, ACM, 303–312.