International Symposium on Mixed and Augmented Reality (ISMAR) 2010, October 13-16, Seoul, Korea
Point-and-Shoot for Ubiquitous Tagging on Mobile Phones
Wonwoo Lee, Y. Park, W. Woo (GIST U-VR Lab.), V. Lepetit (EPFL CVLab.)
Introduction
• We propose a novel 3D augmentation method with minimal user interaction on mobile phones
• In situ 3D augmentation through a simple point-and-shoot approach
• No complex 3D reconstruction
• Target detection from unseen viewpoints
Introduction
• The proposed method follows a standard target learning / detection procedure:
  Input Image → Online Learning → Real-time Detection
Online Target Learning
• Input: image of the target plane
• Output: patch data and camera poses
• Assumptions: known camera parameters; the target surface is horizontal or vertical
[Pipeline: Input Image → Frontal View Generation → Blurred Patch Generation (1st pass: patch warping, 2nd pass: radial blurring, 3rd pass: Gaussian blurring, 4th pass: accumulation) → Post-processing]
Frontal View Generation • We need a frontal view to create the patch data and their associated poses
Targets whose frontal views are available
Frontal View Generation • However, frontal views are not always available in the real world
Targets whose frontal views are NOT available
Frontal View Generation
• Objective: generate a fronto-parallel view image from the input image
• Approach: exploit the phone's built-in accelerometer
• 1 DoF assumption: the target plane differs from the frontal view only by a pitch rotation
Frontal View Generation
• Without loss of generality, the pose of the virtual camera in the fronto-parallel location is set to $[I|0]$; the captured-view camera has pose $[R|t]$
• Under the 1 DoF assumption, the orientation obtained from the accelerometer is $R = \mathrm{Rot}_X(\theta_p)$, a rotation around the X-axis by the pitch angle $\theta_p$
• For a vertical surface at distance $d_0$, the coordinates of the captured camera center are

  $c = [\,0,\; d_0 \sin\theta_p,\; d_0(1-\cos\theta_p)\,]^\top, \qquad t = -Rc$

• The homography that warps the captured image to the virtual frontal view is

  $H_{f \leftarrow c} = \big( K \, ( R - t\,n^\top / d_0 ) \, K^{-1} \big)^{-1}$

  with $K$ the camera calibration matrix and $n$ the plane normal (a numeric sketch follows below)
[Figure 2: defining the camera pose in the case of a vertical surface]
• Note on scale (from the paper, cf. Figure 12): the augmentation corresponds to a large object when the camera is far from the surface and to a small one when it is close; this is intuitive but limited to the range of scales over which the user can move the phone, so the interface also lets the user adjust the scale if needed
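As a rough numeric illustration, the homography above can be assembled in a few lines. This is a minimal sketch assuming a calibrated pinhole camera with intrinsics K, a pitch angle theta_p read from the accelerometer, and a vertical surface at distance d0; the function name and intrinsics values are illustrative, not the authors' implementation:

```python
import numpy as np

def frontal_homography(K, theta_p, d0=1.0):
    """H_{f<-c}: warps the captured image to the virtual frontal view."""
    cth, sth = np.cos(theta_p), np.sin(theta_p)
    # 1 DoF assumption: pitch rotation around the X-axis only.
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, cth, -sth],
                  [0.0, sth,  cth]])
    # Captured camera center for a vertical surface at distance d0,
    # and the corresponding translation t = -R c.
    c = np.array([0.0, d0 * sth, d0 * (1.0 - cth)])
    t = -R @ c
    n = np.array([0.0, 0.0, 1.0])  # plane normal in the frontal frame
    H_c_from_f = K @ (R - np.outer(t, n) / d0) @ np.linalg.inv(K)
    return np.linalg.inv(H_c_from_f)

K = np.array([[700.0, 0.0, 160.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])  # illustrative intrinsics for a 320x480 image
H = frontal_homography(K, np.deg2rad(30.0))
```

Note that d0 cancels inside the homography, so only the pitch angle and the intrinsics matter for the rectification itself.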
Guessing Target Pose
• The orientation of the target (horizontal / vertical) is suggested from the current pose of the phone:
  if $-\pi/4 < \theta_p < \pi/4$, the surface is vertical; otherwise, the surface is horizontal
• A rotation that is too steep cannot give a good frontal image (see the sketch below)
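A minimal sketch of the orientation guess, assuming theta_p is the pitch angle in radians; the pi/4 threshold is the one stated on the slide:

```python
import math

def guess_orientation(theta_p: float) -> str:
    # Surface is assumed vertical when the phone is pitched less than 45 degrees.
    if -math.pi / 4 < theta_p < math.pi / 4:
        return "vertical"
    return "horizontal"
```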
Blurred Patch Generation
• Objective: learn the appearance of a target surface quickly
• We adopt the patch-learning approach of Gepard (Hinterstoisser et al. 2009)
• Gepard achieves real-time learning of a patch on a desktop PC
Review: Gepard
• Fast patch learning by linearizing image warping with PCA
• A 'mean patch' serves as the patch descriptor
• Direct comparison with the input image
• No complex descriptor generation
Review: Gepard
• Difficult to apply directly on a mobile phone platform
• Mobile phone CPUs have low performance
• A large amount of pre-computed data is required (about 90 MB)
Our Solution: A Simple Descriptor for Keypoint Recognition & Coarse Pose Estimation
• Approach: use a blurred patch instead of Gepard's mean patch
• Gepard: the descriptor is a mean patch, the average of the input patch warped over many poses
• Ours: the descriptor is a single warped patch followed by blurring
[Diagram: Gepard: input patch → warped patches → mean → mean patch; Ours: input patch → warping → blurring → our descriptor (blurred patch)]
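To make the contrast concrete, here is a hedged CPU sketch of the two descriptors; the helper warp_patch, the homography inputs, and the kernel size are illustrative assumptions, with OpenCV standing in for the GPU pipeline:

```python
import numpy as np
import cv2

def warp_patch(patch: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Warp a grayscale patch by a 3x3 homography (same output size)."""
    h, w = patch.shape
    return cv2.warpPerspective(patch, H, (w, h))

def gepard_mean_patch(patch: np.ndarray, homographies) -> np.ndarray:
    # Gepard-style descriptor: average the patch warped over a range of poses.
    return np.mean([warp_patch(patch, H) for H in homographies], axis=0)

def our_blurred_patch(patch: np.ndarray, H: np.ndarray, ksize: int = 11) -> np.ndarray:
    # Our-style descriptor: warp once for the pose, then blur so the result
    # tolerates nearby poses and image noise.
    return cv2.GaussianBlur(warp_patch(patch, H), (ksize, ksize), 0)
```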
Blurred Patch Generation
• Generate blurred patches through multi-pass rendering on the GPU
• Faster image processing through the GPU's parallelism
[Pipeline: input patch → 1st pass: patch warping → 2nd pass: radial blurring → 3rd pass: Gaussian blurring → 4th pass: accumulation]
Blurred Patch Generation
• 1st pass: patch warping
• The input patch is rendered from a given viewpoint
• Much faster on the GPU than on the CPU
Blurred Patch Generation
• 2nd pass: radial blurring of the warped patch
• Lets the blurred patch cover a range of poses close to the exact pose
Blurred Patch Generation
• 3rd pass: Gaussian blurring of the radially blurred patch
• Makes the blurred patch robust to image noise
Blurred Patch Generation
• 4th pass: accumulation of blurred patches in a texture unit
• Reduces the number of readbacks from GPU memory to CPU memory (a CPU sketch of all four passes follows below)
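On the phone the four passes run as OpenGL ES 2.0 fragment shaders. As a rough CPU reference, here is a sketch under the slide's settings (128x128 patches, about 10 degrees of radial blur, an 11x11 Gaussian kernel); the rotation-averaging approximation of the radial blur and the step count are illustrative assumptions, with OpenCV in place of the shaders:

```python
import numpy as np
import cv2

def radial_blur(img, max_angle_deg=10.0, n_steps=8):
    """Pass 2: average small rotations about the patch center so the result
    covers a range of poses close to the exact one (an approximation)."""
    h, w = img.shape
    acc = np.zeros_like(img, dtype=np.float32)
    for a in np.linspace(-max_angle_deg, max_angle_deg, n_steps):
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), a, 1.0)
        acc += cv2.warpAffine(img, M, (w, h))
    return acc / n_steps

def blurred_patches(patch, homographies, ksize=11):
    """One blurred patch per learned view."""
    h, w = patch.shape
    out = []
    for H in homographies:
        warped = cv2.warpPerspective(patch, H, (w, h)).astype(np.float32)  # pass 1
        radial = radial_blur(warped)                                       # pass 2
        out.append(cv2.GaussianBlur(radial, (ksize, ksize), 0))            # pass 3
    # Pass 4 (accumulation) batches results in a texture on the GPU so only
    # a few readbacks to CPU memory are needed; on the CPU we just stack.
    return np.stack(out)
```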
Post-Processing
• Downsample the blurred patches from 128x128 to 32x32
• Normalize each patch to zero mean and a standard deviation of 1
• Normalization provides robustness to intensity changes (see the sketch below)
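A one-function sketch of the post-processing, assuming float grayscale patches; the epsilon guard against flat patches is an illustrative addition:

```python
import numpy as np
import cv2

def postprocess(patch128: np.ndarray) -> np.ndarray:
    # Downsample 128x128 -> 32x32, then normalize to zero mean and unit
    # standard deviation for robustness to intensity changes.
    small = cv2.resize(patch128, (32, 32), interpolation=cv2.INTER_AREA)
    small = small.astype(np.float32)
    return (small - small.mean()) / (small.std() + 1e-8)
```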
Detection & Tracking
• The user points the camera at the target
• A square patch at the center of the image is used for detection
[Flowchart: input patch at time t → was the patch detected at (t-1)? NO → patch descriptor comparison → patch verification with NCC; YES → pose update → pose refinement]
Detection & Tracking
• The initial pose is retrieved by comparing the input patch with the learned mean patches (see the sketch below)
• ESM-Blur (Y. Park et al., ISMAR 2009) is applied for further pose refinement
• NEON instructions are used to speed up the pose refinement
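A hedged sketch of the initial pose retrieval by descriptor comparison. Because the descriptors are normalized to zero mean and unit standard deviation, NCC reduces to the mean of elementwise products; the 0.7 threshold is an illustrative value, not the paper's:

```python
import numpy as np

def detect(input_desc, learned_descs, poses, ncc_thresh=0.7):
    """Return the coarse pose of the best NCC match, or None if no match."""
    n = input_desc.size
    scores = [float(np.dot(input_desc.ravel(), d.ravel())) / n
              for d in learned_descs]
    best = int(np.argmax(scores))
    return poses[best] if scores[best] > ncc_thresh else None
```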
Experimental Results
• Patch size: 128 x 128
• Number of views used for learning: 225
• Maximum radial blur range: 10 degrees
• Gaussian blur kernel: 11 x 11
• Memory requirement: 900 KB for 225 views
Experimental Results
Images used for learning
Detection from different views
Experimental Results
Detection at different scales
Experimental Results
Targets whose frontal views are unavailable
Experimental Results
More examples in real scenes
Experimental Results • Instant 3D augmentation
Experimental Results • Share the learned data with nearby mobile phones via Bluetooth communication
Experimental Results
• Test platforms:

            iPhone 3GS        PC
  CPU       ARM 600 MHz       Intel Quad-Core 2.2 GHz
  GPU       PowerVR SGX 535   GeForce 8800 GTX
  Renderer  OpenGL ES 2.0     OpenGL 2.0
Experimental Results
• Learning time vs. number of views, broken down by stage (warping, radial blur, Gaussian blur, accumulation, readback, post-processing):

  Number of views      108      135      210      300      420       540
  iPhone 3GS (ms)   2,746.3  3,396.6  5,396.2  7,993.0  11,019.2  14,162.6
  PC (ms)             148.9    169.2    238.6    324.3     420.5     547.6
• More views require more rendering
• Radial blur is slow on the mobile phone
• Possible speed improvement through shader optimization
Experimental Results
• Comparison with Gepard (Hinterstoisser et al. 2009); data sets from www.metaio.com/research/

  Detection performance (%):

  Data set       Gepard   Proposed
  Sign-1          96.4      93.0
  Sign-2          84.0      76.0
  Car             96.3      86.8
  Wall            91.2      74.0
  City            90.6      90.2
  Cafe            97.0      95.0
  Book            98.6      92.7
  Grass           73.2      57.8
  Macmini         51.6      41.2
  Board           93.4      68.2
  graf1           93.8      95.2
  stop_sign_f     69.2      82.2
  book_SMALL2     94.6      98.6

[Bar chart: detection performance (%) of Gepard vs. the proposed method per data set]
Limitations
• Weak on repetitive textures and reflective surfaces
• Currently handles only a single target
Conclusions
• Potential applications
  • AR tagging of the real world
  • AR apps anywhere, anytime
• Future work
  • Relaxing the 1 DoF constraint
  • Further optimization on mobile phones