©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Towards Real Time 3D Tracking and Reconstruction on a GPU Using Monte Carlo Simulations

Jairo R. Sánchez    Hugo Álvarez    Diego Borro

CEIT and Tecnun (University of Navarra)
Manuel de Lardizábal 15, 20018 San Sebastián, Spain

ABSTRACT

This paper addresses the problem of camera tracking and 3D reconstruction from image sequences, i.e., the monocular SLAM problem. Traditionally, this problem is solved using non-linear minimization techniques that are very accurate but can hardly be used in real time. This work presents a highly parallelizable random sampling approach based on Monte Carlo simulations that fits very well on graphics hardware. The proposed algorithm achieves the same precision as non-linear optimization while running in real time on commodity graphics hardware. Both accuracy and performance are evaluated using synthetic data and real video sequences captured with a hand-held camera. Moreover, results are compared with an implementation of Bundle Adjustment, showing that the presented method obtains similar results in much less time.

1 INTRODUCTION

Real time simultaneous localisation and mapping (SLAM) consists of calculating both the camera motion and the 3D reconstruction of the observed scene at the same time using external measurements. This technology is directly applicable, for example, to robot navigation or to augmented reality applications. External measurements, like video images or laser triangulation sensors, are used to build a virtual representation of the world. This reconstruction is recursively used to locate the robot or the camera. These measurements may contain both outliers and inaccuracies that make the reconstruction incorrect, so robust estimators are needed in order to obtain good results. When the system works with a single camera, it is known as monocular SLAM [4].

Monocular SLAM has been widely studied in previous works. However, almost all solutions are based on very expensive batch optimisation techniques that can hardly be implemented in real time. In fact, although current computers can execute some SLAM algorithms in real time, they use all of their computing power, leaving no time for other tasks.

In recent years, the use of high level programming languages for the graphics processing unit (GPU) has become popular in many scientific areas. These devices provide massively parallel processors for almost all desktop computers. The advantage of using the GPU rather than the CPU is that the GPU has several computing units that can run the same task independently and simultaneously on different data. These devices work as data-streaming processors [8] following the SIMD (Single Instruction Multiple Data) computational model. However, each computing unit has less power than a single CPU, so the benefit of using the GPU comes when the size of the input data is large. This fact makes it necessary to adapt existing algorithms and data structures to fit this programming model.


Existing monocular SLAM algorithms are not well suited to GPU implementation because of their batch nature. Therefore, this work proposes an implementation of monocular SLAM that can achieve a high level of accuracy in real time by taking advantage of the graphics hardware. In this way, the algorithm achieves real time performance without making the computer unusable for other tasks. Instead of using serial batch optimisation techniques, we propose a parallelizable sampling technique, based on Monte Carlo simulations, that fits very well with the GPU programming model. In order to validate the algorithm, results are compared with the implementation of the popular Bundle Adjustment (BA) algorithm provided in [13], which uses Levenberg-Marquardt for the minimization.

The main contribution of this work is a SLAM framework that can be implemented on massively parallel platforms, like GPUs, leaving the CPU free for other tasks. The paper describes the algorithm and a reference implementation that works in real time on commodity graphics hardware, demonstrating that it can achieve a good level of accuracy. It consists of approximating the maximum likelihood estimate by randomly sampling from the space of possible locations of the camera and the 3D points. The convergence of the algorithm depends directly on the number of samples taken from the search space. Since each sample is independent of the others, this method exploits well the data-level parallelism required by this programming model.

An overview of the algorithm is given in Section 3 and it is explained in detail in Sections 5 and 6. An implementation written in CUDA is explained in Section 7 and experimental results are presented in Section 8. Although the algorithm has been developed using CUDA, it can be implemented on any other parallel programming platform.

2 RELATED WORK

Traditionally, the SLAM problem is addressed using recursive estimation (probabilistic maps) or batch optimisation (Bundle Adjustment). The first option is fast but sometimes not as accurate as one would wish. The second option is very accurate but not suitable for real time operation.

Recursive estimators try to compute the reconstruction using Bayesian estimators, typically Extended Kalman Filters (EKF) [22]. We are interested in predicting the sequence of states of our world $X_k = \{\vec{x}_1, \ldots, \vec{x}_k\}$ given a set of observations $Y_k = \{\vec{y}_1, \ldots, \vec{y}_k\}$. The problem is formulated as the conditional probability of the current state given the entire set of observations, $P(\vec{x}_k | Y_k)$. The key is that the current state probability can be expanded using Bayes' rule. Usually the state $\vec{x}_k$ is composed of the camera pose and the 3D structure, and the observation $\vec{y}_k$ consists of image measurements. The problem is then how to approximate these densities.

Batch optimisation is used when real time operation is not needed. The problem is solved using non-linear minimization techniques like Levenberg-Marquardt (LM) [12] and Bundle Adjustment (BA) [20]. The objective is to minimize the reprojection error of the 3D points given the measurements in the image. These methods produce good results if the initial value of the parameters is close to the global optimum.

Authors like [7] use the L∞ norm and convex optimization. However, as the author states, the method is very sensitive to outliers. Recent works like [1] are capable of reconstructing huge scenes using BA, but not in real time. Citing from [1]: “3D reconstruction took 27 hours on 496 compute cores.” Although they use massively parallel computers, their algorithm cannot be used in real time because it is designed to work with large-scale scenes. Authors like [10] use local BA to make the optimization work in real time. They use two threads, one for tracking and the other for mapping. They achieve real time operation but, unlike our proposal, they occupy all the computing resources of the system. Since their implementation relies on task-level parallelism, it cannot be scaled to many-core processors. In fact, at the time of writing we have not found any other thread-oriented parallel SLAM implementation like ours.

3 METHOD OVERVIEW

Existing SLAM methods are inherently iterative. Since our target platform is a parallel system, iterative methods are not the best candidates for exploiting its capabilities, because they usually have dependencies between iterations that cannot be resolved efficiently on a parallel platform like a GPU. In this context, instead of adapting an existing method, we have developed a fully parallelizable algorithm for both camera tracking and 3D structure reconstruction based on Monte Carlo simulations [15]. This approach consists of generating inputs randomly from the domain of the problem and weighting them using some type of function that depends on the measurements obtained from the system. Monte Carlo simulations are suitable for problems where it is not possible to calculate the exact solution with a deterministic algorithm, which is the case we are dealing with. Since each sample is independent, the approach is very well suited to parallel systems and has very good scalability. In contrast, it is very computationally intensive, and it is not adequate for current serial CPUs, even multicore ones.

Figure 1 shows the SLAM pipeline used in this work. It is a bottom-up approach, but the proposed algorithm is suitable for top-down approaches as well. First, a 2D feature tracker finds new features and tracks previously detected ones using optical flow and Kalman filtering algorithms, following the implementation proposed in [18]. A frame k is represented as a set of features $Y_k = \{\vec{y}_k^1, \ldots, \vec{y}_k^n\}$. Each new feature is initialized with a SIFT descriptor [14] used to detect features lost in the past. Although SIFT features are expensive to compute, since our GPU implementation leaves the CPU free, we can absorb this overhead comfortably.

Figure 1: SLAM Pipeline.

In the structure estimation step, the 3D structure is initialized using epipolar geometry. When a large displacement is detected in the image, the camera pose relative to the first frame is estimated using the 8-point algorithm [6]. Then, the 3D structure is built using linear triangulation. This method produces a very noisy structure map, but it is quickly improved by the structure optimizer. The 3D structure is a set of 3D points $Z = \{\vec{z}_1, \ldots, \vec{z}_n\}$ such that each point has an associated 2D feature.

The 3D motion tracker estimates the pose using 3D-2D point matches. Drawing on the previous pose, random samples for the current camera pose are generated and weighted using the Euclidean distance between the measured features and the projections of the 3D points. The pose in frame k is represented as a rotation matrix $R_k$ and a translation vector $\vec{t}_k$ with an associated weight $w_k$. We assume the pinhole camera model, so the relation between a 2D feature in frame k and its associated 3D point is given by

$$\vec{y}_k^i = \begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} f_u(\vec{z}_i, k) \\ f_v(\vec{z}_i, k) \end{pmatrix} = \Pi\left(R_k \vec{z}_i + \vec{t}_k\right) \qquad (1)$$

where Π is the pinhole projection function. For simplicity, the calibration matrix can be omitted in Equation 1 if 2D feature points are represented in normalized coordinates instead of pixel coordinates. When the pose is estimated, the 3D structure is adjusted using the new pose. For this task we have used an improved version of the triangulation algorithm presented in [17]. It is done on the GPU using Monte Carlo simulations as well. In this case, each point in the map is sampled around its neighborhood and projected into stored frames. These projections are compared with the tracked 2D feature points, and the sample that minimizes the sum of Euclidean distances is chosen as the new location of the 3D point. Section 4 describes the implementation of the feature tracker used in this work, and Sections 5 and 6 describe in detail the pose estimation and structure reconstruction algorithms, respectively.
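To make Equation 1 concrete, the following device-side sketch evaluates the projection and the Euclidean residual used throughout the paper; the struct layout and helper names are illustrative assumptions, not the actual implementation.

// Minimal sketch (hypothetical helpers): pinhole projection of Equation 1 in
// normalized coordinates and its residual against a tracked 2D feature.
struct Pose { float R[9]; float t[3]; };   // row-major rotation, translation

__device__ float2 project(const Pose& p, const float3 z)
{
    // Transform the 3D point into the camera frame: R*z + t
    float xc = p.R[0]*z.x + p.R[1]*z.y + p.R[2]*z.z + p.t[0];
    float yc = p.R[3]*z.x + p.R[4]*z.y + p.R[5]*z.z + p.t[1];
    float zc = p.R[6]*z.x + p.R[7]*z.y + p.R[8]*z.z + p.t[2];
    // Pinhole projection Π in normalized coordinates (calibration matrix omitted)
    return make_float2(xc / zc, yc / zc);
}

__device__ float residual(const Pose& p, const float3 z, const float2 y)
{
    float2 proj = project(p, z);
    float du = proj.x - y.x, dv = proj.y - y.y;
    return sqrtf(du * du + dv * dv);   // Euclidean reprojection error
}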

4 FEATURE TRACKING

Since the pipeline described in this work uses bottom-up tracking, it is very important to have robust feature point management. In order to get a good compromise between performance and effectiveness, we have used several techniques for the feature tracking. As said in Section 3, the SIFT descriptor of each feature point is computed. However, the original Difference of Gaussians feature detector has been replaced by the cheaper Shi and Tomasi detector [19]. Feature detection is carried out every frame, but only in image regions that contain no features, i.e., in new areas. Feature points are tracked recursively by calculating their optical flow using the Lucas-Kanade algorithm with two pyramidal reductions [2]. SIFT descriptors are only used to detect features lost in the past.

Although SIFT features are expensive to compute, we have implemented a simplified version inspired by [21]. Furthermore, we have fixed the SIFT kernel and precomputed all kernel orientations to accelerate online processing. Unlike the original version, we only retain one descriptor for each feature, regardless of whether a strong kernel orientation is detected. The main drawback of this restriction is that the matching quality will decrease if the system suffers a strong movement (motion blur) combined with a notable scale change. The first problem is avoided by comparing only features around the reprojection positions of the 3D points (spatial coherence, Section 6.4). The scale problem is also avoided using spatial coherence. However, if the tracking is completely lost, spatial coherence cannot be used. In this case, the scale change is not a problem since these losses are due to sudden movements that usually do not involve large scale changes. Even with these simplifications, the feature tracker is heavier than is usual in this type of application. However, we have exploited the fact that our GPU implementation leaves the CPU free, executing the SIFT algorithm in parallel on the CPU using the OpenMP library.
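For reference, a minimal sketch of the detection and optical flow steps above, written against the OpenCV C++ API; the actual tracker of [18] and our simplified SIFT variant are not reproduced here, and all parameter values are illustrative assumptions.

#include <opencv2/imgproc.hpp>
#include <opencv2/video/tracking.hpp>
#include <vector>

// Detect Shi-Tomasi corners only in new image areas (mask) and track the existing
// ones with pyramidal Lucas-Kanade using two pyramid levels, as described in Section 4.
void trackFeatures(const cv::Mat& prevGray, const cv::Mat& gray, const cv::Mat& newAreaMask,
                   std::vector<cv::Point2f>& points)
{
    std::vector<cv::Point2f> fresh;
    cv::goodFeaturesToTrack(gray, fresh, 200, 0.01, 10.0, newAreaMask);  // Shi-Tomasi [19]

    std::vector<cv::Point2f> next;
    std::vector<unsigned char> status;
    std::vector<float> err;
    if (!points.empty())
        cv::calcOpticalFlowPyrLK(prevGray, gray, points, next, status, err,
                                 cv::Size(21, 21), 2 /* pyramid levels */);  // [2]

    // Keep successfully tracked features and append the newly detected ones.
    std::vector<cv::Point2f> kept;
    for (size_t i = 0; i < next.size(); ++i)
        if (status[i]) kept.push_back(next[i]);
    kept.insert(kept.end(), fresh.begin(), fresh.end());
    points.swap(kept);
}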

5 POSE ESTIMATION

The pose is estimated in each frame using the 2D features given by the feature tracker. Since these features might contain noise and outliers, the direct least squares solution using 2D-3D matches can lead to very large errors in the estimation. There are several solutions to this problem, but because of our parallelism requirements we have chosen to develop a Monte Carlo solution, which can find the correct solution if the number of samples is high enough. In order to facilitate the sampling task, the pose is parameterized using a rotation quaternion and a translation 3-vector instead of its 4x4 matrix representation. Although we have successfully used the minimal representation based on Euler angles, quaternions are safer because they do not suffer from gimbal lock. Thus, the pose in frame k is represented by a 7-vector of the form $\vec{x}_k = [\vec{q}_k, \vec{t}_k]^T$.

5.1 Pose Sampling

In each frame k, a set of samples $A = \{\vec{x}_k^{(1)}, \ldots, \vec{x}_k^{(m)}\}$ is generated drawing on the previous frame. In order to avoid problems with erratic motion patterns, hypotheses are generated using the random walk sampling model. In this way, new samples are generated as random rotations and translations from the previous pose $\vec{x}_{k-1}$, so the sampling function is

$$\vec{x}_k^{(i)} = f(\vec{x}_{k-1}, \vec{\omega}_i, \vec{\tau}_i) = \left(\vec{q}_{k-1} + \vec{\omega}_i,\; \vec{t}_{k-1} + \vec{\tau}_i\right) \qquad (2)$$

where $\vec{\omega}_i$ and $\vec{\tau}_i$ are samples taken from the normally distributed random variables $\Omega \sim N(\vec{0}, \Sigma_{rot})$ and $T \sim N(\vec{0}, \Sigma_{trans})$. Since rotations between frames are very small even with erratic motions, the covariance matrix $\Sigma_{rot}$ is fixed for all frames. Note that an assumption of 10 degrees per frame (for example) means a total of 250 degrees per second at 25 fps. Faster rotations would lead to excessive blur that would make the sequence non-trackable. Moreover, the perturbation model applied to the quaternion is a simple vector addition (followed by normalization) instead of a quaternion composition. This addition has no direct geometric interpretation, but it modifies both angle and axis at the same time. Experimentally, we have concluded that the perturbation of the rotation axis is more important than the perturbation of the magnitude of the rotation itself, leading to a fast and effective perturbation model.

However, the magnitude of the translation can vary a lot between frames, so the covariance matrix $\Sigma_{trans}$ is updated every frame. To calculate this matrix, we assume that the camera only translates locally. In this way, taking Equation 1, the projection of a 3D point can be approximated as

$$\vec{y}_k^i \approx \Pi(R_{k-1}\vec{z}_i + \hat{t}_k) \qquad (3)$$

where the vector $\hat{t}_k$ is the only variable in this translational camera model. If we then calculate the average displacement of the 2D features in image k as $\vec{y}_k^c$, Equation 3 can be linearized as

$$\vec{y}_k^c = \vec{y}_{k-1}^c + J_{k-1}\left(\hat{t}_k - \vec{t}_{k-1}\right) \qquad (4)$$

where $J_{k-1}$ is the Jacobian matrix of Equation 3 with respect to the translation, evaluated at the centroid of the visible 3D points. Using Equation 4, an approximate translation vector $\hat{t}_k$ can be obtained, so the covariance of the random variable $T$ can be set as

$$\Sigma_{trans} = \operatorname{diag}\left(\hat{t}_k - \vec{t}_{k-1}\right) \qquad (5)$$

In this way, sampled poses will on average move the 3D point cloud so that the displacement of its projections is similar to the optical flow of the tracked features.
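As an illustration of Equations 2 and 5, the following host-side sketch generates one hypothesis from standard-normal draws; function and variable names are illustrative, and in the real system the draws come from a precomputed texture (see Section 7.1).

#include <cmath>

struct PoseSample { float q[4]; float t[3]; };   // quaternion + translation (7-vector)

// Generate one hypothesis from the previous pose given standard-normal draws n[0..6],
// the fixed rotation standard deviations sigmaRot[0..3] and the per-frame diagonal
// of Sigma_trans from Equation 5.
PoseSample samplePose(const PoseSample& prev, const float n[7],
                      const float sigmaRot[4], const float sigmaTrans[3])
{
    PoseSample s;
    // Additive perturbation of the quaternion followed by normalization (Section 5.1).
    float norm2 = 0.0f;
    for (int i = 0; i < 4; ++i) { s.q[i] = prev.q[i] + sigmaRot[i] * n[i]; norm2 += s.q[i] * s.q[i]; }
    float inv = 1.0f / std::sqrt(norm2);
    for (int i = 0; i < 4; ++i) s.q[i] *= inv;
    // Random-walk translation scaled by the diagonal covariance of Equation 5.
    for (int i = 0; i < 3; ++i) s.t[i] = prev.t[i] + sigmaTrans[i] * n[4 + i];
    return s;
}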

5.2 Weighting Samples

All the samples in the set A are weighted and the one with the best score is chosen as the correct pose. The weight of a sample can be seen as the probability of the observation given the state, i.e., the probability of the 2D image points given the camera pose: $P(Y_k | \vec{x}_k^{(i)})$. This probability can be approximated proportionally to the norm of the vector containing the projection residuals given the current pose and 3D structure. Any norm can be used, such as the L2 norm or the L∞ norm. However, we have experimentally concluded that the L2 norm is the best option, so the pose in frame k is chosen as

$$\vec{x}_k = \arg\min_{i} \sum_{j=1}^{n} \left\| \Pi\!\left(R_k^{(i)} \vec{z}_j + \vec{t}_k^{(i)}\right) - \vec{y}_k^j \right\| \qquad (6)$$

where $R_k^{(i)}$ is the rotation matrix corresponding to $\vec{q}_k^{(i)}$. Other authors like [16] use the inlier/outlier likelihood, which produces good results as well, but the estimated motion is less smooth. This function is used only to choose the best hypothesis among the samples, whereas the likelihood of the frame is calculated as a threshold function that depends on the number of inliers that the sample gets. If there are enough inliers, the frame is considered a keyframe and it will be used by the structure adjustment algorithm. In addition, only points that have been inliers in previous frames are used for the pose estimation. There are more complex methods for keyframe selection in the literature, but this method is easy to implement and gives good results with our random sampling method. Finally, in order to get a smooth motion and avoid jitter, the pose that is used to render virtual objects is filtered using a double exponential smoothing (DESP) filter [11].
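For reference, a serial sketch of the selection rule in Equation 6 follows; types and names are illustrative, and the parallel version actually used is described in Section 7.1.

#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Hypothesis { float R[9]; float t[3]; };   // rotation (row-major) and translation

// Sum of Euclidean reprojection errors of one hypothesis over all visible points
// (the inner sum of Equation 6), assuming normalized image coordinates.
static float cost(const Hypothesis& h, const std::vector<float>& pts3D /* x,y,z,... */,
                  const std::vector<float>& feats /* u,v,... */)
{
    float sum = 0.0f;
    for (std::size_t j = 0; j * 3 < pts3D.size(); ++j) {
        const float* z = &pts3D[3 * j];
        float xc = h.R[0]*z[0] + h.R[1]*z[1] + h.R[2]*z[2] + h.t[0];
        float yc = h.R[3]*z[0] + h.R[4]*z[1] + h.R[5]*z[2] + h.t[1];
        float zc = h.R[6]*z[0] + h.R[7]*z[1] + h.R[8]*z[2] + h.t[2];
        float du = xc / zc - feats[2 * j];
        float dv = yc / zc - feats[2 * j + 1];
        sum += std::sqrt(du * du + dv * dv);
    }
    return sum;
}

// argmin over the sample set A (Equation 6).
std::size_t bestHypothesis(const std::vector<Hypothesis>& A,
                           const std::vector<float>& pts3D, const std::vector<float>& feats)
{
    std::size_t best = 0;
    float bestCost = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < A.size(); ++i) {
        float c = cost(A[i], pts3D, feats);
        if (c < bestCost) { bestCost = c; best = i; }
    }
    return best;
}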

6 STRUCTURE ADJUSTMENT

The structure reconstruction is performed using an extended version of [17]. Following the same argumentation as in the pose estimation, the 3D map is reconstructed using Monte Carlo simulations as well. Each 3D point is parameterized as a 3-vector $\vec{z}_j = [x_j, y_j, z_j]^T$. The structure estimation has two major parts, i.e., point initialization and point adjustment. New points are constantly added to the map, making it grow, and mapped points are adjusted using the information from new frames.

6.1 Point Initialization

At the beginning of the sequence, since the only data that can be obtained from the images are 2D feature points, the 3D reconstruction is initialized using epipolar geometry estimation. When a large pixel displacement is detected, the essential matrix is calculated using the 8-point algorithm together with RANSAC and adjusted using LM. From this matrix, the camera pose relative to the first frame can be recovered [6]. This first pose estimate allows a prior 3D reconstruction to be calculated using linear triangulation, which is later adjusted when more views become available.

During the rest of the sequence, new points are added to the map as new 2D features are detected. These new points are computed using linear triangulation as well, but in this case the pose data come from the pose estimator instead of epipolar geometry. In order to minimize the uncertainty of the reconstruction, a constraint is imposed on the length of the baseline between the frames used for the triangulation [3]. Linear triangulation is a very ill-conditioned procedure and its results are unusable on their own, but it is a computationally cheap starting point for our adjustment method.

Already mapped points are adjusted when new keyframes become available if their projection residual is greater than a threshold, typically set to 2-3 pixels. The adjustment is done in two steps, i.e., point sampling and weighting. In fact, this is a way of solving the Bundle Adjustment problem in a parallel manner.
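The initialization above relies on the 8-point algorithm with RANSAC followed by LM refinement. As a rough functional sketch, the same building blocks are available in the OpenCV C++ API (an assumed stand-in for the actual implementation; the LM refinement step is omitted and parameter values are illustrative).

#include <opencv2/calib3d.hpp>
#include <vector>

// Recover the relative pose between the first frame and the current one from matched
// normalized feature points, then triangulate an initial (noisy) 3D map.
void initializeStructure(const std::vector<cv::Point2f>& pts0,
                         const std::vector<cv::Point2f>& ptsK,
                         cv::Mat& R, cv::Mat& t, cv::Mat& points4D)
{
    // Essential matrix with RANSAC; points are assumed already normalized, so a focal
    // length of 1.0 and a principal point at (0,0) are used.
    cv::Mat inliers;
    cv::Mat E = cv::findEssentialMat(pts0, ptsK, 1.0, cv::Point2d(0, 0),
                                     cv::RANSAC, 0.999, 1e-3, inliers);
    // Camera pose of frame k relative to the first frame [6].
    cv::recoverPose(E, pts0, ptsK, R, t, 1.0, cv::Point2d(0, 0), inliers);

    // Linear triangulation of the initial structure.
    cv::Mat P0 = cv::Mat::eye(3, 4, CV_64F);                  // [I | 0]
    cv::Mat Pk(3, 4, CV_64F);
    R.copyTo(Pk(cv::Rect(0, 0, 3, 3)));
    t.copyTo(Pk(cv::Rect(3, 0, 1, 3)));
    cv::triangulatePoints(P0, Pk, pts0, ptsK, points4D);      // homogeneous output
}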

6.2 Point Sampling

For each point $\vec{z}_j$, a set of random samples $B = \{\vec{z}_j^{(1)}, \ldots, \vec{z}_j^{(q)}\}$ is generated around its neighborhood. The sampling function follows the random walk model as well:

$$\vec{z}_j^{(i)} = f\left(\vec{z}_j^{(-)}, \vec{\psi}_i\right) = \vec{z}_j^{(-)} + \vec{\psi}_i \qquad (7)$$

where $\vec{z}_j^{(-)}$ is the prior location of the point being adjusted and $\vec{\psi}_i$ is a sample from a normally distributed variable $\Psi \sim N(\vec{0}, \Sigma_j)$.

As in the case of pose sampling, the projection of the point $\vec{z}_j$ can be linearized around its prior location $\vec{z}_j^{(-)}$, having its observation $[\hat{u}_j, \hat{v}_j]^T$ in the last frame:

$$\begin{pmatrix} \hat{u}_j \\ \hat{v}_j \end{pmatrix} = \begin{pmatrix} u_j \\ v_j \end{pmatrix} + J_k^j\left(\vec{z}_j - \vec{z}_j^{(-)}\right) \qquad (8)$$

where $J_k^j$ is the Jacobian of Equation 1 evaluated in the last frame with respect to the variable $\vec{z}_j$. Using Equation 8, the covariance matrix can be set as

$$\Sigma_j = \operatorname{diag}\left(\vec{z}_j - \vec{z}_j^{(-)}\right) \qquad (9)$$

6.3 Weighting Samples

For each sample, the point is projected into the past keyframes and its weight is calculated as the Euclidean distance between those projections and the 2D features taken from the images. The sample that minimizes this error is taken as the new position of the point:

$$\vec{z}_j = \arg\min_{i} \sum_{k} \left\| \Pi\!\left(R_k \vec{z}_j^{(i)} + \vec{t}_k\right) - \vec{y}_k^j \right\| \qquad (10)$$

If after the adjustment the posterior residual is smaller than the prior residual and below a threshold, the point is updated. If not, the point is considered an outlier and removed from the map. The 3D structure is stored in the global memory of the GPU as well. In this way, it is not necessary to transfer it every frame in the pose estimation step.

6.4 Remapping

When the 2D tracking of a feature is lost, the corresponding 3D point in the map is marked as lost. In each frame these lost points are projected using the pose information estimated by the 3D tracker. If the projections fall within the image, the system tries to remap them. Since each feature has an associated SIFT descriptor, we search a fixed-size window around the projection of the point being remapped for a feature with a similar descriptor.
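A sketch of this remapping test follows; the 128-element descriptor layout and the thresholds are assumptions consistent with SIFT [14], not the actual code.

#include <cmath>
#include <cstddef>
#include <vector>

struct Feature2D { float u, v; float desc[128]; };   // tracked feature with SIFT descriptor

// Try to remap a lost 3D point: search a window of radius 'win' around its predicted
// projection (pu, pv) for the feature with the most similar descriptor (Section 6.4).
int remapPoint(float pu, float pv, const float pointDesc[128],
               const std::vector<Feature2D>& candidates, float win, float maxDist)
{
    int best = -1;
    float bestDist = maxDist;          // reject matches above this descriptor distance
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const Feature2D& f = candidates[i];
        if (std::fabs(f.u - pu) > win || std::fabs(f.v - pv) > win) continue;  // spatial coherence
        float d2 = 0.0f;
        for (int k = 0; k < 128; ++k) {
            float diff = f.desc[k] - pointDesc[k];
            d2 += diff * diff;
        }
        float d = std::sqrt(d2);
        if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    return best;   // index of the matched feature, or -1 if the point stays lost
}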

7 GPU IMPLEMENTATION

This section explains the details of the implementation of the algorithm on the parallel computing device. Our reference implementation has been developed using the CUDA computing language [9]; however, it could be programmed using any other alternative. The main characteristics desired in a CUDA program (or in any other GPU-based computing language) are few communications with the host, no dependencies between parallel threads, non-divergent functions (avoiding if-then-else constructs) and a high ratio of arithmetic operations to memory accesses. Taking this into account, the next sections explain the implementation details of the pose estimation and point adjustment algorithms.

7.1 Pose Estimation

The pose is estimated each frame using a single computing kernel. It has two parts, sampling and weighting, and both are performed by the parallel device. Thus, the data needed are the 3D map and its projections in the current frame. Since the GPU lacks random number generation functions, we have adopted a pseudo-random sampling method. An array of random numbers is precomputed on the CPU and loaded into the memory of the device. This array contains 7 numbers for each hypothesis sampled (one for each component of the pose). We have chosen a standard normal distribution since it can easily be rescaled each frame using the covariance matrix computed in Equation 5. This pseudo-random sampling gives good results and avoids having to update the random number array every frame. The array is stored in the texture memory of the device, since it gives the best performance for the memory access patterns of the tracking kernel.

We have partitioned the problem so that each computing thread samples and weights a single hypothesis. Thus, there is no need for communication between threads, since each hypothesis is processed independently of the others. In each frame, the CPU only transfers the coordinates of the tracked feature points. The 3D reconstruction is not transferred since it is permanently stored in the constant memory of the GPU. Algorithm 1 shows a summary of the computing kernel. Although there is a conditional statement in the kernel, it does not cause a performance drop since the condition evaluates identically in all threads.

After the execution of the kernel, a list of weighted samples is obtained. The best sample is chosen as the current camera pose. Searching for the best sample implies looping through the list and comparing the weights of all samples. Since the system typically works with a large number of samples, transferring the whole list to the main memory of the CPU is a very slow operation. Because of this, the search is also done on the GPU using the reduction implementation proposed in [5]. The best pose is transferred to the CPU and appended to the pose history that is kept in the GPU memory.

Algorithm 1 Overview of the tracking kernel.
Require: randomR → texture with random rotations
Require: randomT → texture with random translations
Require: map → vector with 3D points
Require: featurePoints → vector with the feature points
Require: Σtrans → covariance matrix
Require: prevPose → the pose in the previous frame
  idx = getThreadIndex()
  poseSample.R = norm(randomR[idx] * prevPose.R)
  poseSample.T = Σtrans * randomT[idx] + prevPose.T
  residual = 0
  for all P in map do
    if isVisible(P) in current frame then
      imageP = project(P, poseSample)
      measure = correspondence of P in featurePoints
      residual += dist(imageP, measure)
    end if
  end for
  return residual
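The best-sample search follows the shared-memory reduction pattern of [5]. The following kernel is a minimal argmin sketch, assuming a power-of-two block size and a launch with blockDim.x * (sizeof(float) + sizeof(int)) bytes of dynamic shared memory; it is an illustration, not the exact implementation.

// Reduce per-hypothesis residuals to the index of the minimum, entirely on the GPU.
// Each block produces one (value, index) pair; a second launch with a single block
// reduces those partial results in the same way.
__global__ void argminReduce(const float* residuals, int n, float* outVal, int* outIdx)
{
    extern __shared__ float sVal[];                   // blockDim.x floats...
    int* sIdx = (int*)&sVal[blockDim.x];              // ...followed by blockDim.x ints

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sVal[threadIdx.x] = (i < n) ? residuals[i] : 3.4e38f;   // large sentinel for padding
    sIdx[threadIdx.x] = i;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s && sVal[threadIdx.x + s] < sVal[threadIdx.x]) {
            sVal[threadIdx.x] = sVal[threadIdx.x + s];
            sIdx[threadIdx.x] = sIdx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) { outVal[blockIdx.x] = sVal[0]; outIdx[blockIdx.x] = sIdx[0]; }
}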

Figure 2 shows the dataflow in the execution of the pose estimation kernel. As seen, two floating point numbers are transferred for each feature before starting the execution and one pose vector after finishing. All these memory transfers can be done asynchronously, allowing the CPU to work on other tasks.

Figure 2: Execution of the pose estimation kernel.

7.2 Reconstruction Adjustment

The reconstruction is adjusted every frame if the tracked frame is marked as a keyframe. Only points whose projection error is greater than a threshold (typically 2-3 pixels) are subject to adjustment. Since the resolution strategy is similar to the one used in the tracking kernel, the problem is partitioned in a similar manner. There is only one computing kernel and each thread computes a single hypothesis about the position of a point. Although we stated in Section 6.2 that the space is randomly sampled, in order to achieve better performance, and drawing on the fact that mapped points have 3 degrees of freedom, we use the built-in trilinear texture interpolator of the CUDA device to sample the position of the point.

Not all keyframes are used in the minimization. In order to cover most of the sequence and to avoid falling into local minima by using only the last frames, the chosen keyframes are scattered along the timeline. We use a Fibonacci sequence to generate the indices of the keyframes, starting from currentFrame - Fib(2). In this way, many recent frames are used in the adjustment but some old frames are used too, leading to a better global behaviour of the minimization. For each point, the CPU must transfer to the GPU the indices of the frames to be used in the adjustment and the 2D features in those frames. The poses corresponding to those frames are already stored in the GPU memory, since they are updated every frame. Algorithm 2 shows a summary of the adjustment kernel.

Algorithm 2 Overview of the adjustment kernel.
Require: point → initial point location
Require: interpTex → interpolation texture
Require: Σj → covariance of zj
Require: poseHistory → all previous poses
Require: indicesToAdjust → indices of the frames to be used
Require: imagePoints → images of zj in past frames
  idx = getThreadIndex()
  pointSample = point + Σj * interpTex[idx]
  residual = 0
  for all index in indicesToAdjust do
    if isVisible(point) in frame index then
      imageP = project(pointSample, poseHistory[index])
      measure = imagePoints[index]
      residual += dist(imageP, measure)
    end if
  end for
  return residual

Like in the tracking kernel, the best sample is chosen using a reduction algorithm and then transferred to the CPU memory. If the posterior projection error is less than the prior, the point is updated in the map. Figure 3 shows the overview of the kernel execution for each 3D point being adjusted.

Figure 3: Execution of the structure adjustment kernel.
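As an illustration of the keyframe selection heuristic above (our reading of the Fibonacci rule; the exact offsets and the helper interface are assumptions), indices are taken at Fibonacci distances back from the current frame:

#include <vector>

// Choose up to 'maxFrames' keyframe indices scattered along the timeline by stepping
// back from the current frame with Fibonacci offsets Fib(2), Fib(3), ... (Section 7.2).
std::vector<int> selectKeyframes(int currentFrame, int maxFrames,
                                 const std::vector<int>& keyframes /* ascending frame indices */)
{
    std::vector<int> chosen;
    int fibPrev = 1, fib = 1;                       // Fib(1), Fib(2)
    while ((int)chosen.size() < maxFrames) {
        int target = currentFrame - fib;            // frame index a Fibonacci step back
        if (target < 0) break;
        // Use the most recent keyframe at or before the target frame index.
        int pick = -1;
        for (int k : keyframes) {
            if (k <= target) pick = k; else break;
        }
        if (pick >= 0 && (chosen.empty() || pick != chosen.back()))
            chosen.push_back(pick);
        int next = fibPrev + fib;                   // advance the Fibonacci sequence
        fibPrev = fib;
        fib = next;
    }
    return chosen;                                  // mostly recent frames, plus a few old ones
}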

8 EXPERIMENTAL RESULTS

Experiments on real and synthetic data have been carried out in order to validate the performance and precision of the algorithm. The hardware used for the experiments is an Intel Core 2 Duo at 3.0 GHz with 4 GB of RAM and an NVIDIA GeForce GTX 260 with 896 MB of RAM. For the real data experiments a standard webcam has been used, working at a resolution of 320×240. The focal length of the camera is known and lens distortions are not taken into account. In all the experiments, the point adjustment is done using between 10 and 15 frames.

8.1 Precision

The precision of the algorithm has been measured using synthetically generated data. These data are generated by taking the projections of a virtual 3D object observed from a virtual camera. The algorithm is fed only with these projections, and the estimated pose is compared with the position of the virtual camera in each frame. The virtual camera has a focal length of 500 pixels with a viewport of 320×240, and the object is 1 m wide and located 3 m from the camera centre. For testing purposes, the scale of the reconstruction is set manually because of the metric uncertainty inherent to this type of reconstruction method.

Figures 4a and 4b show the rotational and translational error of the estimated pose compared with the ground truth, running on the same sequence with different noise levels. The first test is made with unmodified input data, and the second with 10% of incorrect matches and additive Gaussian noise with mean 0 and standard deviation of 1 in the 2D features. As seen, the algorithm is quite robust against outliers and measurement errors. It gets an average error of 2 cm in the camera position and less than a degree in its orientation with the noisy dataset. In any case, the translational error is mostly due to ambiguities in the reconstruction scale. Additionally, Figure 4c shows the projection error of the estimated 3D points compared with the 2D features given as input, which is the value of the objective function in the minimization process. As expected, the reprojection error increases with the additive noise in the input. However, the error is reduced as the sequence advances. This is the normal behaviour of a Monte Carlo simulation, since it converges to the optimum as the number of samples increases.

Finally, Figure 5 shows the reprojection error in a real video sequence compared with BA. In the tests, both algorithms run under the same conditions, i.e., they are executed using the same feature tracker, the same structure initialization, the same input data, etc. The structure optimization is carried out every frame in both cases and the frames are selected as described in Section 7.2. BA is parameterized with the default configuration given by the author in [13]. The projection errors are measured using the state of the SLAM algorithm in the last frame of the video. Although Figure 5 shows that the proposed method performs better than BA, it should be noted that, since outlier detection and keyframe selection depend on the output of the minimization algorithm, the SLAM sequence evolves differently in each case. This means that the final reconstructions have a different number of points, a different scale, etc. However, the difference between them is about 1-2 pixels, so we can say that both obtain the same level of precision.

Figure 4: Tests on simulated data. (a) Translational error (m) per frame. (b) Rotational error (rad) per frame. Both (a) and (b) compare σ = 0 with 0% outliers against σ = 1 with 10% outliers. (c) Projection error (pix) per frame for σ = 0.0, 0.5 and 1.0.

Figure 5: Projection error comparison (pix per frame, Bundle Adjustment vs. GPU Optimizer).

Figure 6: Projection error in pixels in the outdoor scene.

8.2 Performance

Figure 8a shows the total time needed by the algorithm on a real video sequence and the size of the reconstructed 3D map. In each frame, $2^{18}$ hypotheses about the camera location and the point positions are sampled and weighted. On average, the algorithm needs 25 ms per frame. The real advantage of this method is that most of the time the CPU is free. Comparisons with a CPU implementation of this algorithm are not provided, since it is very slow and unusable for real time applications. Figure 9 shows some screenshots of the real sequence used for these tests.

As shown in Figure 10, the system can also deal with outdoor scenes. In this example, the sequence has been recorded walking around a parking lot with a hand-held camera. The camera translates about 10 meters and rotates about 90 degrees. Figure 6 shows that even with a bad initialization, the error drops to a few pixels quickly.

Figure 8b shows the time needed by BA to process a fragment of the same video using the same parameters. As can be seen, it is not usable for real time tracking with the number of frames and features provided. A better parameterization could improve these times, but the difference is very large. The abrupt changes occur because the LM minimization can detect when the solution is near a local minimum and finish the process early. This is a desirable feature; however, our adjustment cannot detect this situation because of its parallel nature.

The performance of the feature tracker in a portion of the test sequence is shown in Figure 7.

Only the computation time of the SIFT features in the remapping stage is shown, since it is the most time-consuming part. It runs quite fast, because it uses multiple threads and scales well on a multi-core CPU. Feature detection and optical flow computation time is always below 1 ms. In addition, the computation of SIFT features is overlapped with the computations on the GPU.

9 CONCLUSION

This paper presents a new approach to solve the SLAM problem in real time. Its results are very similar to other state-of-the-art methods; however, an important advantage is that it is designed to run on a GPU. Moreover, since it is a random sampling method, it is very robust against outliers and noise. Since it is an estimation-by-tracking algorithm, it has all the classic drawbacks of this type of method. It is very sensitive to rapid camera motions that generate motion blur. Normally, this can be detected and the tracking can be recovered using remapping techniques [23]. However, rapid camera motions can lead to incorrect mappings and to an unrecoverable error state. It also depends on the lighting conditions, but this can be addressed using features robust against illumination changes.

In summary, this paper presents a GPU computational framework that can be used to solve many computer vision problems. In this work we have used it to solve the camera tracking and 3D reconstruction problems in real time, but it can be applied to almost any measurement-driven minimization algorithm, yielding a fast, robust and easy to implement method.

Figure 7: SIFT computation time (ms) and number of descriptors computed per frame, over a portion of the test sequence.

Augmented reality needs a lot of computational resources to work. The trend today is to improve hardware through parallelism rather than raw arithmetic power; thus, as happened with 3D graphics years ago, a good solution is to use specialized massively parallel hardware. In our opinion, in the field of augmented reality, random sampling algorithms are the best candidates for future methods because of their level of parallelism, high accuracy and robustness against outliers.

ACKNOWLEDGEMENTS

The contract of Jairo R. Sánchez is funded by the Ministry of Education of Spain within the framework of the Torres Quevedo Program, and the contract of Hugo Álvarez is funded by a grant from the Government of the Basque Country.

REFERENCES

[1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In International Conference on Computer Vision, Kyoto, Japan, 2009.
[2] J.-Y. Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker. Technical report, Intel Corporation, 2000.
[3] K. Cornelis. From uncalibrated video to augmented reality. PhD thesis, Katholieke Universiteit Leuven, 2004.
[4] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Ninth IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1403–1410, Nice, France, 2003.
[5] M. Harris. Optimizing parallel reduction in CUDA. Technical report, NVIDIA, http://developer.download.nvidia.com/.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[7] F. Kahl. Multiple view geometry and the L∞ norm. In International Conference on Computer Vision, volume 2, pages 1002–1009, 2005.
[8] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmable stream processors. IEEE Computer, pages 54–62, 2003.
[9] D. B. Kirk and W. W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann, 2010.
[10] G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In International Symposium on Mixed and Augmented Reality, 2007.
[11] J. LaViola. Double exponential smoothing: An alternative to Kalman filter-based predictive tracking. In The Immersive Projection Technology and Virtual Environments, pages 199–206. ACM Press, 2003.

[12] K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.
[13] M. A. Lourakis and A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 36(1):1–30, 2009.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[15] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.
[16] M. Pupilli. Particle Filtering for Real-time Camera Localisation. PhD thesis, University of Bristol, 2006.
[17] J. Sánchez, H. Álvarez, and D. Borro. GPU Optimizer: A 3D reconstruction on the GPU using Monte Carlo simulations. In Poster Proceedings of the 5th International Conference on Computer Vision Theory and Applications (VISAPP 2010), pages 443–446, Angers, France, 2010.
[18] J. Sánchez and D. Borro. Automatic augmented video creation for markerless environments. In Poster Proceedings of the 2nd International Conference on Computer Vision Theory and Applications (VISAPP 2007), pages 519–522, Barcelona, Spain, 2007.
[19] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 593–600, 1994.
[20] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: a modern synthesis. In Vision Algorithms: Theory and Practice, 2000.
[21] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Real time detection and tracking for augmented reality on mobile phones. IEEE Transactions on Visualization and Computer Graphics, 16(3):355–368, 2010.
[22] G. Welch and G. Bishop. An introduction to the Kalman filter. In SIGGRAPH, Los Angeles, CA, USA, 2001.
[23] B. Williams, G. Klein, and I. D. Reid. Real-time SLAM relocalisation. In IEEE International Conference on Computer Vision, Rio de Janeiro, 2007.

(a) Total time on a real sequence using GPU.

(b) Total time using Bundle Adjustment.

Figure 8: Execution times.

Figure 9: Example of a real sequence.

Figure 10: Example of an outdoor sequence.
