Stereo Vision based Robot Navigation

Anusheel Bhushan

Abstract: This paper introduces a method for autonomous navigation of a mobile robot in an indoor environment. Using information only from a stereo camera pair, the robot is able to identify and avoid obstacles and construct a geometric map of the 3D scene. A fast and robust closed-loop visual servoing method is also introduced, enabling the robot to correct its path in real time. Starting from an initial, user-defined world coordinate system, the system autonomously updates the camera calibration, requiring no further user intervention.

Keywords: Robotic Vision, Vision based navigation, Robot Localisation, Servo-control, Stereo Vision, Motion Analysis.

Specifically, the following are the main contributions of this paper:

• We give a simple homography-based algorithm to sample feature points lying on obstacles. These points are used to sample the space around the robot and to construct a map of obstacles.

• We make use of epipolar constraints to expedite the search for image correspondences.

• We give an algorithm to cluster points into objects and to determine free space for the robot to navigate.

• Finally, we propose a fast and robust visual servoing algorithm, which requires tracking only two image features.

The Robot Setup

1 Introduction

Autonomous robot navigation systems typically use information gathered from sensors such as infra-red and sonar. These systems suffer from inherent disadvantages, such as disturbance from heat sources (infra-red sensors) and background noise (sonar sensors). In this paper, we propose a navigation system based purely on stereo vision. Stereo vision has been explored before as a tool for robot navigation and map building: [7, 9] used stereo vision to guide a robot that could plan paths, construct maps and explore an indoor environment, showing that stereo vision is a viable alternative to active sensing devices such as sonar and laser range finders. Vision-based feedback has also been used for continuous control [1, 6, 10] and for human-computer interfaces [3]. [8] introduced 2-1/2-D visual servoing, based on estimating the partial camera displacement from the current to the desired camera pose at each iteration of the control law. In this paper, we propose an autonomous navigation system based only on information from a stereo camera pair.

Figure 1: The Robot geometry

The robot has an on-board Pentium III, 850 MHz processor running Linux. It can communicate via a wireless radio link, but all programs are executed on the robot, without external user intervention. The stereo setup consists of a pan-tilt unit with two CCD color cameras mounted on it. A high-level library provides functions to issue motion commands to the robot. We assume that the environment is a flat planar world, so that once the pan-tilt is fixed, the robot has only three degrees of freedom left: the height of the cameras above the ground and the orientation of the pan-tilt become fixed, and the remaining degrees of freedom correspond to the location of the cameras in the ground plane and the rotation about an axis perpendicular to the ground.

2 Obstacle Detection

Anything not lying on the ground plane is defined as an obstacle. Figure 2 shows a pair of stereo images grabbed by the robot. We wish to determine a set of 3D points that lie on the obstacles.

Camera Calibration

Camera calibration is a well-known problem in computer vision. Tsai's method [12, 13] is based on non-linear refinement and requires the camera to observe a planar pattern at a few different orientations. Though this method is popular for its accuracy, it requires a large number of seed correspondences, and it cannot easily be extended to the case where the camera is in motion and the calibration has to be continuously updated. Similarly, Zhang's method [14] requires the camera to observe a planar pattern shown at a few different orientations. We use a modified version of Tsai's method, allowing the camera to automatically update its calibration whenever it moves to a new location, without requiring any new correspondences.

Initial calibration: We run non-linear Levenberg-Marquardt optimization on a set of 3D points in the world whose 2D correspondences in the images are known. These points are hand-clicked (using a graphical interface) before starting the system; after this, the system requires no further user intervention for calibration. The calibration estimates the internal camera matrix (K), the rotation matrix (R) and the camera center (C). The rotation matrix is specified by three angles, one for the rotation about each of the X, Y and Z axes. Thus, we have 11 parameters (3 for rotation, 3 for the camera center and 5 for the internal camera matrix). Using symbolic differentiation, we find a closed-form expression for the Jacobian, which further expedites each iteration. The optimization is found to converge within 10 iterations.

Subsequent re-calibration: Once the initial calibration is known, we maintain a pool of 3D points and track them, which enables us to recalibrate at any subsequent position. Moreover, since the internal parameters, the height, and the yaw and pitch of the cameras remain constant, there are only 3 degrees of freedom for subsequent calibrations. We assume the ground to be planar. Thus, the recalibration process requires very few points to be tracked.
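As a rough illustration of the initial-calibration step (not the authors' implementation, which uses a symbolic closed-form Jacobian), the following sketch minimizes the reprojection error over the 11 parameters with SciPy's Levenberg-Marquardt solver and numerical differentiation. All variable names and the sample correspondences are illustrative placeholders.

# Sketch: initial calibration by minimising reprojection error over
# 11 parameters (3 rotation angles, camera centre C, 5 intrinsics).
import numpy as np
from scipy.optimize import least_squares

def rotation_xyz(rx, ry, rz):
    """Rotation matrix built from rotations about the X, Y and Z axes."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def reprojection_residuals(params, pts3d, pts2d):
    rx, ry, rz, cx_, cy_, cz_, fx, fy, skew, u0, v0 = params
    K = np.array([[fx, skew, u0], [0.0, fy, v0], [0.0, 0.0, 1.0]])
    R = rotation_xyz(rx, ry, rz)
    C = np.array([cx_, cy_, cz_])
    P = K @ np.hstack([R, (-R @ C).reshape(3, 1)])      # P = K R [I | -C]
    proj = (P @ np.vstack([pts3d.T, np.ones(len(pts3d))])).T
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - pts2d).ravel()

# Hand-clicked 3D-2D correspondences (placeholder values).
pts3d = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 2], [1, 1, 2], [0.5, 0.5, 3], [1, 0.5, 3]], float)
pts2d = np.array([[320, 240], [420, 238], [310, 150], [415, 148], [360, 200], [410, 198]], float)

x0 = np.array([0, 0, 0, 0, 1.0, 0, 800, 800, 0, 320, 240], float)   # initial guess
result = least_squares(reprojection_residuals, x0, args=(pts3d, pts2d), method="lm")
print("residual evaluations:", result.nfev, " final cost:", result.cost)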

Figure 2: Left and right stereo images.

We determine a ground-plane homography (H) that takes points on the ground plane in one image (I1) to the corresponding points in the other image (I2). This is done by specifying (at least) 4 corresponding points on the ground in the two images. I1 is then warped to I2 using H and subtracted from I2. All points on the ground plane give a value close to zero after subtraction; the non-zero pixels that remain correspond to obstacles, i.e. objects not lying in the ground plane. Figure 3 shows the images after applying the homography and subtracting. The homography H remains constant even when the robot moves, since the relative orientation of the stereo cameras is fixed; it can therefore be computed once and used throughout the navigation.
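As a rough illustration of this differencing step (not the authors' code), the homography can be estimated once from the hand-picked ground correspondences and applied with OpenCV as below; image paths, point lists and the threshold value are placeholders.

# Sketch: obstacle detection by ground-plane homography differencing.
import cv2
import numpy as np

def obstacle_mask(img_left, img_right, ground_pts_left, ground_pts_right, thresh=25):
    """Warp the left image onto the right using the ground-plane homography
    and threshold the difference; non-zero pixels indicate obstacles."""
    # Homography from >= 4 corresponding ground-plane points (computed once).
    H, _ = cv2.findHomography(np.float32(ground_pts_left), np.float32(ground_pts_right))
    h, w = img_right.shape[:2]
    warped = cv2.warpPerspective(img_left, H, (w, h))
    diff = cv2.absdiff(warped, img_right)
    if diff.ndim == 3:
        diff = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return H, mask

# Example usage with a stereo pair such as Figure 2 (paths are placeholders).
left = cv2.imread("left.png")
right = cv2.imread("right.png")
gnd_l = [(120, 400), (500, 410), (300, 300), (60, 350)]   # hand-clicked ground points
gnd_r = [(100, 400), (480, 410), (280, 300), (40, 350)]
H, mask = obstacle_mask(left, right, gnd_l, gnd_r)
cv2.imwrite("difference_mask.png", mask)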

Figure 3: Difference images.

We then use the Harris corner detector [4] to extract feature points in the left image. In order to consider points on obstacles only, we select corners that correspond to points not satisfying the ground-plane homography. Figure 4(a) shows the extracted features on obstacles in the left image and Figure 4(b) shows their corresponding points in the right image. Errors such as the lines on the ground visible after subtraction (Figure 3) can be removed by suitable thresholding: we consider a feature point to be non-ground only if its neighbourhood has a large number of non-zero pixels after subtraction.
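One way to implement this selection of non-ground corners is sketched below; it assumes the thresholded difference image from the previous sketch, and the window size and fraction threshold are illustrative.

# Sketch: keep only Harris corners whose neighbourhood in the difference
# mask contains many non-zero pixels (i.e. corners not on the ground plane).
import cv2
import numpy as np

def nonground_corners(gray_left, diff_mask, max_corners=300, win=7, min_frac=0.3):
    corners = cv2.goodFeaturesToTrack(gray_left, max_corners, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True, k=0.04)
    if corners is None:
        return np.empty((0, 2), int)
    kept = []
    half = win // 2
    h, w = diff_mask.shape
    for c in corners.reshape(-1, 2):
        x, y = int(round(c[0])), int(round(c[1]))
        x0, x1 = max(x - half, 0), min(x + half + 1, w)
        y0, y1 = max(y - half, 0), min(y + half + 1, h)
        patch = diff_mask[y0:y1, x0:x1]
        if np.count_nonzero(patch) >= min_frac * patch.size:
            kept.append((x, y))          # corner lies on an obstacle
    return np.array(kept)

# Usage: gray = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY); pts = nonground_corners(gray, mask)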

Figure 5: Epipolar geometry. The search for the correspondence of a point x is restricted to a window band around its epipolar line.

Figure 4: (a) Feature points on obstacles in the left image. (b) Corresponding points in the right image.

3 3D Reconstruction

Once we have a set of non-ground feature points in the left image, we find their 3D coordinates. This requires first finding the corresponding image points in the right image, and then triangulating using the camera matrices.

3.1 Stereo Image Correspondences

We find the correspondence of each left-image feature point in the right image. This is basically a two-step process:

1. Application of the epipolar constraint. The corresponding point must lie on the epipolar line. The epipolar line for a feature point is easily found, since the epipolar geometry remains constant throughout the navigation and is determined by the two camera matrices. The set of candidate points for correspondence is thus reduced by searching only within a band around the epipolar line (see Figure 5).

2. Correlation-based matching. A correlation-based search is done in the right image for every feature in the left image, restricted to the band around the epipolar line. This reduces the search space and expedites the matching significantly. We are able to do this because the two cameras are close to each other, so that the search space is small; wide-baseline stereo would require more sophisticated techniques to find correspondences [2]. (A code sketch of this search appears at the end of this section.)

3.2 Finding 3D

Since we have the point correspondences and the camera matrices, we can find the 3D locations of these points. This is done by the standard method of shooting rays through the corresponding image points and finding their intersection; see [5] for details. The 3D points thus obtained are projected onto the ground plane by setting their y coordinate to zero (an orthographic projection). We thus get a map of the obstacles on the ground plane, which the robot must avoid. We assume that there is no object that the robot could pass through or below. Figure 6 shows the result of projecting the feature points of Figure 4.
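The band-limited correlation search of Section 3.1 can be sketched as follows; the fundamental matrix F is assumed to have been precomputed from the fixed stereo geometry, the images are grayscale arrays, and the patch and band sizes (and the full-width column scan) are illustrative.

# Sketch: search a band around the epipolar line and pick the best match
# by normalised cross-correlation (NCC).
import numpy as np

def ncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return (a * b).sum() / d if d > 0 else -1.0

def match_on_epipolar_band(left, right, pt, F, half_patch=5, band=2):
    """Find the best match of `pt` (x, y in the left image) in `right`.
    Assumes `pt` is away from the image border."""
    x, y = int(pt[0]), int(pt[1])
    h, w = right.shape
    template = left[y - half_patch:y + half_patch + 1,
                    x - half_patch:x + half_patch + 1].astype(float)
    a, b, c = F @ np.array([pt[0], pt[1], 1.0])      # epipolar line: ax + by + c = 0
    best, best_score = None, -np.inf
    for xr in range(half_patch, w - half_patch):      # in practice, limit to a disparity range
        yr = int(round(-(a * xr + c) / b)) if abs(b) > 1e-9 else y
        for dy in range(-band, band + 1):             # band around the epipolar line
            yy = yr + dy
            if yy < half_patch or yy >= h - half_patch:
                continue
            cand = right[yy - half_patch:yy + half_patch + 1,
                         xr - half_patch:xr + half_patch + 1].astype(float)
            score = ncc(template, cand)
            if score > best_score:
                best, best_score = (xr, yy), score
    return best, best_score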

Figure 6: Orthographic Projection of feature points on the ground plane
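The ray-intersection and ground projection of Section 3.2, whose output is shown in Figure 6, can be sketched as a linear (DLT) triangulation; here P1 and P2 stand for the 3x4 camera matrices from the calibration, and the function names are illustrative.

# Sketch: triangulate matched points from the two camera matrices and
# project the result orthographically onto the ground plane.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence x1 <-> x2."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def ground_plane_map(P1, P2, matches):
    """matches: list of ((x1, y1), (x2, y2)) image correspondences."""
    pts = np.array([triangulate(P1, P2, np.array(m[0], float), np.array(m[1], float))
                    for m in matches])
    ground = pts.copy()
    ground[:, 1] = 0.0          # orthographic projection: drop the height (Y)
    return pts, ground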


4 Clustering

For next-view planning, we would like to obtain a top view of the current surroundings. Once we have a sampling of feature points on obstacles projected onto the ground plane, we want to cluster these projected points into objects as they exist in 3D (Figure 7). Once we have the projected objects, next-view planning reduces to a geometric path-planning problem in the plane.

Figure 7: Projected points clustered into objects, represented by their boundaries.

We cannot use techniques like finding the convex hull of a cluster because, in the case of extreme concavities such as an L-shaped object, the convex hull would return an object representation that consumes free space which might have been useful for path planning. Instead, the boundary of a cluster is a more appropriate representation of an object. We give an algorithm to obtain a set of objects (represented by their boundaries) from a set of projected points.

4.1 The Algorithm

1. Do a Delaunay triangulation on the points. We used the implementation provided by [11]. We now have a graph of points and edges from the triangulation. (Steps 1-3 are sketched in code after this list.)

2. From the triangulation, remove the edges whose length is greater than the robot diameter (40 cm). This results in clusters that are sufficiently far apart for the robot to navigate between them.

3. Do a depth-first search (DFS) to find all the connected components. Each component represents an object.

4. Find the boundary¹ of each component. For this we first need to remove cut edges², as shown in Figure 8. This is done by a DFS over each component; the cut edges are removed from the graph and stored separately.

Figure 8: A cut edge.

5. To find the boundary of a component C, find its leftmost vertex l and rightmost vertex r. Then:

   /* Upper boundary */
   u = l;
   while u != r do
       select the edge e = (u, v) that makes the minimum angle with the vertical at u
       (largest dot product with the vertical);
       u = v;
   enddo

   /* Lower boundary */
   u = l;
   while u != r do
       select the edge e = (u, v) that makes the maximum angle with the vertical at u
       (least dot product with the vertical);
       u = v;
   enddo

6. The set of edges returned by the above procedure is the boundary. Adding the cut edges that were removed in step 4, the resulting set of edges constitutes the object boundaries.

¹A boundary is a set of connected edges that completely encloses all the points in the component.
²A cut edge is one whose removal breaks a connected component into two components.
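As noted in step 1, here is a minimal sketch of steps 1-3 (triangulation, long-edge removal and connected components), using SciPy's Delaunay implementation in place of Triangle [11]; the boundary extraction of steps 4-6 would follow the walk described above. The 0.40 m robot diameter is taken from the text; everything else is illustrative.

# Sketch of steps 1-3 of the clustering algorithm.
import numpy as np
from scipy.spatial import Delaunay

def cluster_ground_points(points, robot_diameter=0.40):
    """points: N x 2 array of obstacle points projected on the ground plane.
    Returns a list of clusters, each a list of point indices."""
    tri = Delaunay(points)
    adjacency = {i: set() for i in range(len(points))}
    for simplex in tri.simplices:                     # the three edges of each triangle
        for i, j in ((0, 1), (1, 2), (2, 0)):
            a, b = simplex[i], simplex[j]
            if np.linalg.norm(points[a] - points[b]) <= robot_diameter:
                adjacency[a].add(b)
                adjacency[b].add(a)
    # Depth-first search for connected components; each component is one object.
    seen, clusters = set(), []
    for start in range(len(points)):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adjacency[v] - seen)
        clusters.append(comp)
    return clusters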

5 Motion Control

The path from the source to the destination is broken into a series of straight-line segments, so the motion is a combination of rotations and straight-line translations. However, due to the mechanical nature of the servo motors, the actual motion may differ from the commanded motion. We therefore propose a closed-loop visual servoing feedback system that provides correcting impulses to the robot, making it follow a given straight-line path. Recalibrating after each small motion is one solution, but it involves non-linear optimization and therefore the tracking of many points. Instead we implement the method described below, which requires tracking only two points. We align the robot to face the destination before each movement; thus the motion is a rotation followed by a straight-line translation.

The problem definition: We have a stereo head which is initially calibrated. Using only the grabbed images, we would like a mechanism not only for updating the calibration during the motion, but also for guiding the motion itself. Updating the calibration requires tracking just two points in either one of the stereo cameras. The camera follows a path f(x, z) = 0 on the ground plane, with constant height y, where (x, y, z) is the camera center. Say we want the robot to move along a straight line. The relation between the homogeneous image coordinates (U, V, W) and a 3D point (X, Y, Z) is

$$\begin{pmatrix} U \\ V \\ W \end{pmatrix} = K \, R_{xz} \, R_y(\theta) \begin{pmatrix} 1 & 0 & 0 & -x \\ 0 & 1 & 0 & -y \\ 0 & 0 & 1 & -z \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},$$

where K is the intrinsic camera matrix and R_{xz} is the rotation about the X and Z axes, both of which remain constant during a straight-line motion. R_y(θ) is the rotation component due to rotation about the robot's axis of rotation; the robot can be assumed to rotate about some axis parallel to the Y axis. (x, y, z) is the camera center. The unknowns are x, z and θ; we can assume that y, the height of the camera, is constant. As K R_{xz} is constant throughout the straight-line motion, we can pre-multiply both sides of the equation by (K R_{xz})^{-1} to get

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = R_y(\theta) \begin{pmatrix} 1 & 0 & 0 & -x \\ 0 & 1 & 0 & -y \\ 0 & 0 & 1 & -z \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},$$

where (u, v, w) are the coordinates obtained by pre-multiplying (U, V, W) by (K R_{xz})^{-1}. Expanding,

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{pmatrix} \begin{pmatrix} X - x \\ Y - y \\ Z - z \end{pmatrix} = \begin{pmatrix} (X - x)\cos\theta - (Z - z)\sin\theta \\ Y - y \\ (X - x)\sin\theta + (Z - z)\cos\theta \end{pmatrix},$$

or

$$\frac{u}{w} = \frac{(X - x)\cos\theta - (Z - z)\sin\theta}{(X - x)\sin\theta + (Z - z)\cos\theta}, \qquad \frac{v}{w} = \frac{Y - y}{(X - x)\sin\theta + (Z - z)\cos\theta}.$$

So, given (u/w, v/w) and (X, Y, Z), we want to find the unknowns θ, x and z. By a slight abuse of notation, we denote the normalized coordinates (u/w, v/w) by (u, v). Note that for a straight-line path the deviation about the Y axis is very small, so we can safely assume sin θ ≈ θ and cos θ ≈ 1. Thus we get

$$u = \frac{(X - x) - (Z - z)\theta}{(X - x)\theta + (Z - z)}, \qquad (1)$$

$$v = \frac{Y - y}{(X - x)\theta + (Z - z)}. \qquad (2)$$

Dividing (1) by (2), we get

$$\frac{u}{v}(Y - y) = (X - x) - (Z - z)\theta.$$

If we are tracking two points whose 3D coordinates are known, we get two such equations:

$$\frac{u_i}{v_i}(Y_i - y) = (X_i - x) - (Z_i - z)\theta,$$

$$\frac{u_j}{v_j}(Y_j - y) = (X_j - x) - (Z_j - z)\theta.$$

Subtracting, we get

$$\frac{u_i}{v_i}(Y_i - y) - \frac{u_j}{v_j}(Y_j - y) = (X_i - X_j) - (Z_i - Z_j)\theta.$$

Thus,

$$\theta = \frac{X_i - X_j}{Z_i - Z_j} - \frac{u_i(Y_i - y)}{v_i(Z_i - Z_j)} + \frac{u_j(Y_j - y)}{v_j(Z_i - Z_j)}.$$

Figure 9: Visual servoing along a straight-line path.

Once θ is known, equations (1) and (2) are simply linear equations in x and z and may be solved easily:

$$x = X_i - \frac{(Y_i - y)(\theta + u_i)}{v_i(1 + \theta^2)},$$

$$z = Z_i - \frac{(Y_i - y)(1 - \theta u_i)}{v_i(1 + \theta^2)}.$$

Thus, we have a mechanism to keep the camera calibration updated by tracking just two points, assuming the motion is almost a straight line. The motion is also constrained to follow the given straight-line path as closely as possible: at each image sample we give a corrective motion to the robot, so that θ stays close to 0 and the camera follows the required path f(x, z) = 0. Figure 9 shows a plot of the camera centers projected on the ground plane, with the motion constrained to follow a straight-line path; the servoing feedback applies continuous corrections to the motion. The feedback loop operates almost at the frame rate of the cameras, since only two feature points need to be tracked and the feedback expression is a closed-form solution. Moreover, between successive frames the amount of motion is very small, which makes tracking the features easy.
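The closed-form update derived above is straightforward to implement. The following sketch (illustrative variable names and synthetic test values, not the authors' code) recovers (θ, x, z) from two tracked points and checks the result against the small-angle model of equations (1) and (2).

# Sketch: closed-form servoing update from two tracked points.
import numpy as np

def servo_update(p_i, p_j, uv_i, uv_j, y):
    """p_i, p_j: 3D points (X, Y, Z); uv_i, uv_j: normalised image coordinates
    (u, v) = (u/w, v/w) after pre-multiplying by (K R_xz)^-1; y: camera height.
    Returns (theta, x, z)."""
    (Xi, Yi, Zi), (Xj, Yj, Zj) = p_i, p_j
    (ui, vi), (uj, vj) = uv_i, uv_j
    dZ = Zi - Zj
    theta = (Xi - Xj) / dZ - ui * (Yi - y) / (vi * dZ) + uj * (Yj - y) / (vj * dZ)
    x = Xi - (Yi - y) * (theta + ui) / (vi * (1.0 + theta ** 2))
    z = Zi - (Yi - y) * (1.0 - theta * ui) / (vi * (1.0 + theta ** 2))
    return theta, x, z

# Tiny self-check against the small-angle projection model (illustrative numbers).
theta_true, x_true, z_true, y = 0.02, 0.10, 0.05, 1.2

def project(X, Y, Z):
    denom = (X - x_true) * theta_true + (Z - z_true)
    return ((X - x_true) - (Z - z_true) * theta_true) / denom, (Y - y) / denom

P1, P2 = (1.0, 0.4, 4.0), (-0.8, 1.6, 6.0)
est = servo_update(P1, P2, project(*P1), project(*P2), y)
print("estimated (theta, x, z):", est)   # should recover (0.02, 0.10, 0.05)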

6 Conclusion and Future Work

In this paper, we introduced a method for stereo-vision-based autonomous navigation of a mobile robot in an indoor environment. The method demonstrates that, under reasonable assumptions, stereo vision is a sufficient cue for navigation.

In the future, we would like to automate the remaining stages of our system, such as automatic next-view planning, incremental map updating and path planning. At the moment the user chooses the next view manually, though the subsequent servoing, tracking, recalibration and map building are all done automatically. Though we make a flat-world assumption, it is possible to extend the method to navigation on non-flat terrain; however, the visual servoing and tracking problem would then involve more unknowns and might not have a closed-form solution. Further, modern streaming GPU architectures can be used to speed up modules such as tracking and feature matching, which would enable a near real-time response by the system.

Figure 10: Results: input stereo images.

Figure 11: Results: difference images after applying the homography.

Figure 12: Results: (a) feature points on obstacles; (b) image correspondences.

Figure 13: Results: orthographic projection of the obstacle maps.

References

[1] P. Allen, B. Yoshimi, and A. Timcenko. Hand-eye coordination for robotic tracking and grasping. Visual Servoing, pages 33-70, 1994.

[2] C. Strecha, R. Fransens, and L. Van Gool. Wide-baseline stereo from multiple views: a probabilistic account. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2004), 1:552-559, 2004.

[3] D. Gavrila and L. Davis. Tracking of humans in action: A 3D model-based approach, 1996.

[4] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Fourth Alvey Vision Conference, Manchester, pages 147-151, 1988.

[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN 0521540518, second edition, 2004.

[6] S. Hutchinson, G. D. Hager, and P. Corke. A tutorial introduction to visual servo control. IEEE Transactions on Robotics and Automation, 12, 1996.

[7] C. Jennings, D. Murray, V. Tucakov, and J. Little. Spinoza: A visually guided mobile robot (mobile robot demonstration). In IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[8] E. Malis, F. Chaumette, and S. Boudet. 2-1/2-D visual servoing. IEEE Transactions on Robotics and Automation, 15(2):238-250, April 1999.

[9] D. Murray and C. Jennings. Stereo vision based mapping and navigation for mobile robots. In IEEE International Conference on Robotics and Automation, 1997.

[10] N. Papanikolopoulos, P. Khosla, and T. Kanade. Visual tracking of a moving target by a camera mounted on a robot: A combination of control and vision. IEEE Transactions on Robotics and Automation, 9(1):14-35, 1993.

[11] J. R. Shewchuk. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science, pages 203-222. Springer-Verlag, May 1996.

[12] R. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proc. CVPR '86, pages 364-374, 1986.

[13] R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal on Robotics and Automation, pages 323-344, 1987.

[14] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.
