Augmenting 3D urban environment using mobile devices

Yi Wu, Maha El Choubassi, Igor Kozintsev

Intel Corporation
e-mail: {yi.y.wu, maha.el.choubassi, igor.v.kozintsev}@intel.com

ABSTRACT

We describe an augmented reality prototype for exploring a 3D urban environment on mobile devices. Our system utilizes the location and orientation sensors on the mobile platform as well as computer vision techniques to register the live view of the device with the 3D urban data. In particular, the system recognizes the buildings in the live video, tracks the camera pose, and augments the video with relevant information about the buildings in the correct perspective. The 3D urban data consist of 3D point clouds and corresponding geo-tagged RGB images of the urban environment. We also discuss the processing steps to make such 3D data scalable and usable by our system.

1 INTRODUCTION

Mobile augmented reality (MAR) has attracted great interest recently. Improvements in mobile digital cameras, location and orientation sensors, and computational resources, together with the power of cloud-sourced information, have transformed augmented reality from bulky hardware setups into a new generation of applications running on mobile platforms such as smartphones. Many MAR systems rely exclusively on the GPS sensor and compass to determine the device's location and orientation and then augment points of interest. Examples include Layar [3], Wikitude [4], and others. Another form of MAR system that is becoming increasingly common in mobile and web applications uses the phone camera and object recognition techniques to overlay content onto what the camera is viewing. Systems such as Google Goggles [2], Amazon's SnapTell [1], and the system of [5] are snapshot based and can identify scene content. In this paper, we present a MAR system that augments the 3D urban scenes in the live video of mobile platforms. We combine the rich sensing capabilities of the platform with vision algorithms to recognize the scene's buildings and register the device's live view with the 3D urban data. Our video 3D MAR system runs on a mobile platform with an Intel® Atom™ processor inside.

2 SYSTEM DESCRIPTION

Our MAR system is partitioned into server and client components. Geo-tagged 3D urban data are processed on the server. Based on the device's GPS location, we download a geographically constrained subset of the 3D data from the server and pre-cache it on the client. Therefore, on the client and not the server, we recognize buildings in the live video, track the device's pose, and augment relevant information in the correct location and perspective.

2.1 Server: 3D Data Processing

Thanks to Internet services such as Google Earth, 3D representations of cities are becoming ubiquitous and publicly available. Our urban scene data come from Earthmine.


Earthmine acquired images and 3D data with stereo cameras mounted on top of a vehicle at intervals of approximately 20 meters. For each location, we have a street-level spherical panoramic RGB image, a depth map, the GPS coordinates of the panorama center, the vehicle's orientation, and a 3D point cloud. However, the 3D point cloud and raw image (RGB+depth) data require a huge amount of storage, which is undesirable for mobile devices. We also cannot directly match the spherical image with the 2D projective video. Therefore, we pre-process the Earthmine data for compression and build a database suited to our application.

• 3D point cloud to 3D facade geometry
We use 3D facade geometry, i.e., building facades and the ground plane, to approximate urban scenes, where planar structures are prevalent. We extract multiple planar segments from the 3D point clouds and the images, and hence obtain a compact and scalable data representation that is much more suitable for 3D MAR applications. Each 3D point in the cloud represents the actual 3D location of an image pixel. To extract the planar segments, for example the two facades in Figure 1, we adopt a random sample consensus (RANSAC) approach and combine both the image and the 3D point cloud to guide the sampling process. Before that, we subsample the large number of 3D points for faster computation. In the algorithm, we iterate N_ransac times over the following steps (a minimal code sketch is given at the end of this subsection). First, we randomly select N_ref pixels from the image. For each such reference pixel:
  • in a local neighborhood around the reference pixel, we randomly select 2 more pixels such that the 3 pixels are non-collinear;
  • we compute the normal n_ref in R^3 of the plane P_ref formed by the 3D points corresponding to the 3 pixels (3D points of neighboring pixels are more likely to lie on the same plane);
  • for each 3D point M in R^3, we test whether it lies on the plane P_ref by computing its projection error E = [n_ref · (M − M_ref)]^2, where M_ref is the 3D point corresponding to the reference pixel; if E is less than the tolerance threshold ε, i.e., E < ε, we decide that M lies on P_ref; and
  • finally, we compute the score of P_ref as the normalized number of 3D points "belonging" to it: score_ref = |{M ∈ P_ref}| / N, where N is the total number of points in the 3D point cloud.
Next, we pick the largest score among the N_ref candidate planes. If it exceeds a threshold L, i.e., score_ref > L, we accept P_ref as an extracted plane, obtain the least-squares estimate of its normal vector and bias, and eliminate from the 3D point cloud all points M that "belong" to it, i.e., whose projection error to this plane is less than the tolerance ε.

• Spherical view to 2D view
Visual input from a mobile device camera is projective, not spherical like the Earthmine data. Therefore, we unfold the spherical panoramas by projecting them onto 2D views. In our approach, we project the spherical panorama onto a fixed set of views that equally partition the 360-degree angular range around the panorama and that overlap to avoid abrupt changes. We used 8 views with a 90-degree field of view and 45 degrees of overlap; a sketch of this projection follows the plane-extraction sketch below.
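Below is a minimal, illustrative sketch of the plane-extraction step above, written with NumPy. It assumes the (subsampled) panorama pixels and their 3D points are given as parallel arrays; the array names, neighborhood radius, thresholds, and greedy peel-off loop are our assumptions rather than the exact implementation, and the least-squares refinement of each accepted plane is omitted.

```python
import numpy as np

def extract_facade_planes(points, pixel_xy, n_ransac=50, n_ref=20,
                          neighborhood=15.0, eps=0.05, min_score=0.05):
    """Greedy RANSAC-style facade extraction (illustrative sketch).

    points   : (N, 3) array, one 3D point per (subsampled) image pixel
    pixel_xy : (N, 2) array, image coordinates of the same pixels
    Returns a list of (normal, bias) pairs describing planes n . x = bias.
    """
    planes = []
    remaining = np.arange(len(points))          # indices of points not yet explained
    for _ in range(n_ransac):
        best = None
        for _ in range(n_ref):
            # Reference pixel plus two random pixels in its image neighborhood.
            ref = np.random.choice(remaining)
            near = remaining[np.linalg.norm(
                pixel_xy[remaining] - pixel_xy[ref], axis=1) < neighborhood]
            if len(near) < 3:
                continue
            a, b = np.random.choice(near, 2, replace=False)
            normal = np.cross(points[a] - points[ref], points[b] - points[ref])
            if np.linalg.norm(normal) < 1e-9:   # collinear sample, try again
                continue
            normal /= np.linalg.norm(normal)
            # Projection error E = [n . (M - M_ref)]^2 for every remaining point.
            err = (points[remaining] - points[ref]) @ normal
            inliers = remaining[err ** 2 < eps]
            score = len(inliers) / len(points)  # normalized by the full cloud size
            if best is None or score > best[0]:
                best = (score, normal, points[ref], inliers)
        if best is None or best[0] < min_score:
            continue                            # no well-supported plane this round
        score, normal, origin, inliers = best
        planes.append((normal, float(normal @ origin)))
        remaining = np.setdiff1d(remaining, inliers)   # peel off the explained points
        if len(remaining) < 3:
            break
    return planes
```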
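The panorama unfolding can be sketched in the same spirit. The snippet below assumes the panorama is stored as an equirectangular image and renders 8 overlapping 90-degree perspective views by sampling the panorama along rotated viewing rays; the equirectangular layout, output resolution, and nearest-neighbor sampling are illustrative choices, not necessarily those of the deployed system.

```python
import numpy as np

def panorama_to_views(pano, fov_deg=90.0, step_deg=45.0, out_size=512):
    """Project an equirectangular panorama (H x W x 3) onto overlapping
    perspective views spaced every `step_deg` degrees of heading."""
    h, w = pano.shape[:2]
    f = (out_size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels
    # Pixel grid of the output view, centered on the optical axis.
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2.0,
                       np.arange(out_size) - out_size / 2.0)
    rays = np.stack([u, -v, np.full_like(u, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    views = []
    for heading in np.arange(0.0, 360.0, step_deg):            # 8 headings for 45-degree steps
        t = np.radians(heading)
        # Rotate the viewing rays about the vertical axis to the view heading.
        rot = np.array([[np.cos(t), 0.0, np.sin(t)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(t), 0.0, np.cos(t)]])
        r = rays @ rot.T
        lon = np.arctan2(r[..., 0], r[..., 2])                 # longitude in [-pi, pi]
        lat = np.arcsin(np.clip(r[..., 1], -1.0, 1.0))         # latitude in [-pi/2, pi/2]
        px = ((lon / (2.0 * np.pi) + 0.5) * (w - 1)).astype(int)
        py = ((0.5 - lat / np.pi) * (h - 1)).astype(int)
        views.append(pano[py, px])                             # nearest-neighbor sampling
    return views
```

With `fov_deg=90` and `step_deg=45`, adjacent views share a 45-degree overlap, matching the configuration described above.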


Figure 1: Left: San Francisco City Hall. Right: the 3D point cloud and extracted facades.

• Visual Features Extraction
Instead of storing the raw image data, which consumes large space, we extract visual features from the projected 2D images and use them in the database to represent the images themselves. SIFT [7] and SURF [6] keypoint detectors and descriptors are state-of-the-art visual feature candidates. In this paper, we chose SURF for its trade-off between speed and accuracy.

• Augmentation Content Generation
A well-known problem of large-scale MAR systems is generating the augmentation content. Manually labeling each building is not feasible. Instead, we use geocoding to generate such content. In particular, we compute the GPS coordinates of a planar segment's center, potentially the center of a building facade. Once a geocoding system is in place, we can deploy it as a service for consumers to geo-tag legacy images and video content.

2.2 Client: Online Processing

Based on the client's GPS sensor, we pre-load a constrained subset of the database of Section 2.1 and set up a local server on the device (left of Figure 2). As the user changes location, this server is incrementally updated via the network. The sensor data and video input from the mobile device feed into our detection and tracking algorithms (right of Figure 2). The goal of online building localization is to detect and track the buildings in the camera's view, and hence augment relevant content in the correct perspective.

• Building Detection
Recognizing a building among all the buildings in the world is difficult and time-consuming. By coupling the mobile device's location, e.g., from the GPS sensor or WiFi/3G location services, with its orientation, e.g., from the compass and accelerometer, we simplify the problem to matching against a location- and orientation-constrained image database. We extract SURF features from input video frames on the device and compare them with the visual features of the pre-processed database images; a code sketch of this matching step appears at the end of Section 2.2. Once we detect the building, we overlay the corresponding content. We can simultaneously estimate the device pose from the feature correspondences between the query and the matched image. However, detection-based pose estimation cannot run at the video frame rate, because building detection is computationally expensive even after the geo-filtering.

• Orientation Sensor and Optical Flow Pose Tracking
Once the building is detected, we track the device pose by fusing orientation sensor data and camera input in real time. To track the video content, we use an efficient low-parameter motion-model estimation technique [8], based on a multi-resolution, iterative, gradient-based strategy. The algorithm uses a robust function to ignore irrelevant foreground motion. To refine the tracking and compensate for drift, we fuse the orientation sensor data with the visual input and generate the final results. Pose tracking is fast and satisfactory as long as no new building enters the scene.

• Multi-thread Detection and Tracking
Although both the detection and tracking modules can estimate the camera pose, they take different lengths of time. Therefore, we design a multi-threaded framework as in Figure 2, and we coordinate the outputs of the building detection thread and the tracking thread. We define a confidence measure for the reliability of the building detection output. After the first such output, we augment the device's live view with content, applying the tracking updates to the latest accepted detection result. When new detection results become available, we update the display only if their confidence is larger than that of the prior result. However, we always slowly age the confidence of prior results to account for tracking drift. A code sketch of this coordination is given after Figure 2.
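As a concrete illustration of the detection step, the sketch below matches SURF descriptors from a video frame against the descriptors of the geo-filtered database images using OpenCV. The database layout, the Hessian threshold, the ratio-test value, and the minimum-match count are our assumptions, and SURF requires the opencv-contrib build; another descriptor could be substituted without changing the structure.

```python
import cv2

def detect_building(frame_gray, candidates, min_matches=20, ratio=0.7):
    """Match one video frame against a geo-filtered set of database images.

    candidates : list of dicts, each with precomputed 'descriptors'
                 (float32 SURF descriptors) and an 'info' payload to overlay.
    Returns (best candidate, list of good matches), or (None, []) if no
    candidate collects enough distinctive matches.
    """
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints, descriptors = surf.detectAndCompute(frame_gray, None)
    if descriptors is None:
        return None, []

    matcher = cv2.FlannBasedMatcher()   # approximate NN search on float descriptors
    best, best_matches = None, []
    for cand in candidates:
        knn = matcher.knnMatch(descriptors, cand["descriptors"], k=2)
        # Lowe-style ratio test keeps only clearly distinctive correspondences.
        good = [p[0] for p in knn
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) >= min_matches and len(good) > len(best_matches):
            best, best_matches = cand, good
    return best, best_matches
```

The returned correspondences can then feed a pose estimate for the matched database image, e.g., via a homography or PnP solve.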


Figure 2: Multi-threaded 3D MAR: detecting/tracking buildings and fusing vision with sensors and 3D data for real-time augmentation.
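The coordination policy between the slow detection thread and the fast tracking thread can be captured in a small shared state object. The decay factor, the callable-based pose update, and the locking scheme below are illustrative assumptions rather than the exact implementation.

```python
import threading

DECAY = 0.98   # assumed per-frame aging factor for the confidence of prior detections

class AugmentationState:
    """Shared pose state updated by both the detection and tracking threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pose = None          # pose currently used to render the augmentations
        self.confidence = 0.0     # (aged) reliability of the detection behind it

    def on_tracking_update(self, advance_pose):
        """Called by the fast tracking thread (~20 fps) with a function that
        advances the pose by the sensor/optical-flow motion estimate."""
        with self._lock:
            if self.pose is not None:
                self.pose = advance_pose(self.pose)
                self.confidence *= DECAY      # slowly age to reflect tracking drift

    def on_detection_result(self, pose, confidence):
        """Called by the slow detection thread (~1 fps) with a fresh pose estimate."""
        with self._lock:
            # Replace the displayed result only when the new detection is more
            # reliable than the aged confidence of what is already shown.
            if confidence > self.confidence:
                self.pose, self.confidence = pose, confidence
```

Because the confidence of the displayed result decays on every tracked frame, even a moderately confident new detection eventually takes over, which matches the aging behavior described above.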

3 IMPLEMENTATION AND CONCLUSION

We deployed our system on an Intel Moorestown platform with a 600 MHz single-core, hyperthreaded Atom processor and 512 MB of memory. We tested the system in the Santana Row neighborhood in California, where the Earthmine data cover 721 street locations. In our qualitative experiments, building recognition, tracking, and augmentation work with good accuracy. The detection algorithm alone runs at 1 fps on average, and its speed varies with the size of the constrained database. The tracking algorithm is faster (20 fps). With multithreading, the overall system runs at 7 to 10 fps on average.

In conclusion, we presented a video MAR system that registers the mobile device's live view with rough 3D models of urban scenes. It is the first fully functional end-to-end system with detection and tracking algorithms running efficiently on the Intel Atom platform, together with an expandable database of visual features, 3D models, and metadata. In the future, we will build benchmarks to provide a quantitative evaluation of our matching accuracy and running speed.

ACKNOWLEDGEMENTS

The authors thank Earthmine for providing the 3D data.

REFERENCES

[1] Amazon SnapTell. www.snaptell.com.
[2] Google Goggles. www.google.com/mobile/goggles.
[3] Layar. www.layar.com.
[4] Wikitude. www.wikitude.org.
[5] G. Baatz, K. Köser, D. Chen, R. Grzeszczuk, and M. Pollefeys. Handling Urban Location Recognition as a 2D Homothetic Problem. In Proceedings of the 11th European Conference on Computer Vision, pages 266–279, 2010.
[6] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. Lecture Notes in Computer Science, 3951:404–417, 2006.
[7] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[8] O. Nestares and D. J. Heeger. Robust Multiresolution Alignment of MRI Brain Volumes. Magnetic Resonance in Medicine, 43(5):705–715, 2000.
