
Outdoor Scene Image Segmentation Based on Background Recognition and Perceptual Organization

Chang Cheng, Andreas Koschan, Member, IEEE, Chung-Hao Chen, David L. Page, and Mongi A. Abidi

Abstract—In this paper, we propose a novel outdoor scene image segmentation algorithm based on background recognition and perceptual organization. We recognize background objects such as the sky, the ground, and vegetation based on color and texture information. For the structurally challenging objects, which usually consist of multiple constituent parts, we develop a perceptual organization model that captures the nonaccidental structural relationships among the constituent parts of a structured object and, hence, groups them together without depending on a priori knowledge of the specific objects. Our experimental results show that the proposed method outperforms two state-of-the-art image segmentation approaches on two challenging outdoor databases (the Gould data set and the Berkeley segmentation data set) and achieves accurate segmentation quality in various outdoor natural scene environments.

Index Terms—Boundary energy, image segmentation, perceptual organization.

Manuscript received August 17, 2010; revised April 16, 2011 and July 22, 2011; accepted August 16, 2011. Date of publication September 22, 2011; date of current version February 17, 2012. This work was supported in part by the University Research Program in Robotics under Grant DOE-DE-FG52-2004NA25589 and in part by the U.S. Air Force under Grant FA8650-10-1-5902. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xilin Chen. C. Cheng is with Riverbed Technology, Sunnyvale, CA 94085 USA (e-mail: [email protected]). A. Koschan and M. A. Abidi are with the Imaging, Robotics, and Intelligent Systems Laboratory, Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN 37996 USA (e-mail: [email protected]; [email protected]). C.-H. Chen is with the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529 USA (e-mail: [email protected]). D. L. Page is with Third Dimension Technologies LLC, Knoxville, TN 37931 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2011.2169268

I. INTRODUCTION

IMAGE segmentation is considered to be one of the fundamental problems in computer vision. A primary goal of image segmentation is to partition an image into regions of coherent properties so that each region corresponds to an object or area of interest [30]. In general, objects in outdoor scenes can be divided into two categories, namely, unstructured objects (e.g., sky, roads, trees, grass, etc.) and structured objects (e.g., cars, buildings, people, etc.). Unstructured objects usually comprise the backgrounds of images. These background objects usually have nearly homogenous surfaces and are distinct from the structured objects in images. Many recent appearance-based methods have achieved high accuracy in recognizing these background object classes [40], [41], [53]. The challenge for outdoor segmentation comes from the structured objects, which are often composed of multiple parts, with each part having distinct surface characteristics (e.g., colors, textures, etc.). Without certain knowledge about an object, it is difficult to group these parts together. Some studies [2], [4], [40], [41], [53] tackle this difficulty by using object-specific models. However, these methods do not perform well when the images contain objects that have not been seen before. Different from these studies, in this paper, our research objective is to explore detecting object boundaries in outdoor scene images solely based on some general properties of real-world objects, such as perceptual organization laws, without depending on a priori knowledge of the specific objects.

It has long been known that perceptual organization plays a powerful role in human visual perception. Perceptual organization, in general, refers to a basic capability of the human visual system to derive relevant groupings and structures from an image without prior knowledge of its contents. The Gestalt psychologists summarized some underlying principles (e.g., proximity, similarity, continuity, symmetry, etc.) that lead to human perceptual grouping. They believed that these laws capture some basic abilities of the human mind to proceed from the part to the whole [8]. In addition to the classic Gestalt laws, Jacobs [20] and Liu et al. [32] have more recently pointed out that convexity also plays an important role in perceptual organization because many real-world objects, such as buildings, vehicles, and furniture, tend to have convex shapes. These Gestalt laws can be summarized by a single principle, i.e., the principle of nonaccidentalness, which states that such structures are most likely produced by an object or process and are unlikely to arise at random [11], [33]. In other words, the validity of the Gestalt laws originates from the fact that these laws reflect the general properties of man-made and biological objects in the world [34].

However, there are several challenges in applying Gestalt laws to real-world applications. One challenge is to find quantitative and objective measures of these grouping laws. The Gestalt laws are in descriptive form; therefore, one needs to quantify them for scientific use. Another challenge consists of finding a way to combine the various grouping factors since object parts can be attached in many different ways. Under different situations, different laws may apply; therefore, a perceptual organization system should combine as many Gestalt laws as possible. The greater the number of Gestalt laws incorporated, the better the chance that the system applies the appropriate laws in practice. Previous studies have not found elegant solutions to these two challenges.



The main contribution of this paper is a perceptual organization model (POM) developed for boundary detection. The POM quantitatively incorporates a list of Gestalt laws and is therefore able to capture the nonaccidental structural relationships among the constituent parts of a structured object. With this model, we are able to detect the boundaries of various salient structured objects under different outdoor environments. The experimental results show that our proposed method outperformed two state-of-the-art methods [49], [61] on two challenging image databases consisting of a wide variety of outdoor scenes and object classes. An earlier version of this paper appeared in [56].

The remainder of this paper is organized as follows: In Section II, we discuss related studies, including image segmentation and boundary detection methods. In Section III, we describe our POM and scene image segmentation algorithm. The experimental results are presented in Section IV, and Section V concludes this paper.

II. RELATED WORK

Bottom-up image segmentation methods utilize only low-level features such as colors, textures, and edges to decompose an image into uniform regions. Bottom-up methods can be divided into two categories, namely, region-based and contour-based approaches. A group of approaches treats image segmentation as a graph cut problem. Shi and Malik [6] proposed the normalized cut criterion, which removes the trivial solutions of cutting small sets of isolated nodes in the graph. Felzenszwalb and Huttenlocher [5] proposed an efficient graph-based generic image segmentation algorithm. As with the normalized cut method, this method also tries to capture nonlocal image characteristics. Comaniciu and Meer [48] treated image segmentation as a clustering problem in a spatial-range feature space. Their mean-shift segmentation algorithm has demonstrated excellent performance on different image data sets and is considered one of the best bottom-up image segmentation methods. Some of these region-based methods have been widely used to generate coherent regions called superpixels for many applications [1], [42], [51], [53].

Contour closure is one of the important grouping factors identified by the Gestalt psychologists. Early contour-based studies such as active contour methods utilize only boundary properties such as intensity gradients. Zhu and Yuille [29] first used both boundary and region information within an energy optimization model. For their method to achieve good performance, a set of initial seeds needs to be placed inside each homogenous region. Jermyn and Ishikawa [21] proposed a new form of energy function, defined on the space of boundaries in the image domain as a ratio of two integrals around the boundary. The numerator of the energy function is a measure of the "flow" of some quantity into or out of the region; the denominator is a generalized measure of the length of the boundary. The main contribution of this energy function is that it incorporates general types of region information. Our method is based on this form of energy function, which is addressed in detail in Section III. In addition to the above energy function, some studies [45], [46] are built on different energy functions such as the Mumford–Shah segmentation model [47].

Recently, various boundary detection methods based on statistical learning have been proposed in the literature. Martin et al. [24] treated boundary detection as a supervised learning problem. They used a large data set of human-labeled boundaries in natural images to train a boundary model. Their model can then predict the probability of a pixel being a boundary pixel based on a set of low-level cues, such as brightness, color, and texture, extracted from local image patches. Dollar et al. [36] and Hoiem et al. [13] followed a similar idea. Noting the importance of context information, Dollar et al. [36] designed their boundary detection algorithm based on a large number of generic features calculated over a large image patch; this algorithm expects the context information to be provided by a large aperture. Hoiem et al. [13] estimated occlusion boundaries based on both 2-D perceptual cues and 3-D cues such as surface orientation and depth estimates.

Multiclass image segmentation (or semantic segmentation) has become an active research area in recent years. The goal here is to label each pixel in the image with one of a set of predefined object class labels. Many studies operate at the pixel level. Shotton et al. [40] assigned a class label to a pixel based on a joint appearance, shape, and context model. In [50], Shotton et al. proposed the use of semantic texton forests for fast classification. A number of studies utilize superpixels as a starting point for their task. Gould et al. [54] proposed a superpixel-based conditional random field to learn the relative location offsets of categories. In their recent work [52], they developed a classification model defined in terms of a unified energy function over scene appearance and scene geometry. Other notable studies in this area include Micusik and Kosecka [42], Yang et al. [53], and He et al. [55].

Finally, we review some previous efforts to apply Gestalt laws to guide image segmentation. A number of studies [8], [20], [37], [38] applied only one or two Gestalt laws (e.g., proximity, curvilinear continuity, closure, or convexity) to 1-D image features (e.g., lines, curves, and edges) to find closed contours in images. Lowe [8] and Mahamud et al. [37] integrated the proximity and continuity laws to detect smooth closed contours bounding unknown objects in real images. Ren et al. [38] developed a probabilistic model of continuity and closure built on a scale-invariant geometric structure to estimate object boundaries. Jacobs [20] emphasized that convexity plays an important role in perceptual organization and, in many cases, overrules other laws such as closure. Mohan and Nevatia [10] incorporated several Gestalt laws to detect groups of collated features describing objects; their segmentation algorithm is based on a set of ad hoc geometric relationships among these collated features rather than on the optimization of a measure of the value of a group. McCafferty [9] formulated the grouping problem in perceptual organization as an energy minimization problem in which the energy of a grouping is defined as a function of how well it obeys the Gestalt laws; the total energy of a grouping is treated as a linear combination of the individual grouping energy values corresponding to the Gestalt laws. Desolneux et al. [39] studied four Gestalt laws, namely, similarity in color, similarity in size, alignment, and proximity, in point, line, and curve image features. They proposed corresponding quantitative measurements of the significance of the four Gestalt laws and also showed the importance of the collaboration of Gestalt laws in the perceptual organization process.

III. IMAGE SEGMENTATION ALGORITHM

Here, we present a novel image segmentation algorithm for outdoor scenes. Our research objective is to explore detecting object boundaries solely based on some general properties of real-world objects, such as perceptual organization laws, without depending on object-specific knowledge. Our image segmentation algorithm is built around a POM, which is the main contribution of this paper. The POM quantitatively incorporates a list of Gestalt cues; by doing this, the POM can detect many structured object boundaries without any object-specific knowledge. Most studies to date apply Gestalt laws to zero- or 1-D image features (e.g., points, lines, curves, etc.). Different from these studies, our method applies Gestalt laws to 2-D image features, i.e., object parts. We first give formal definitions of salient structured objects and object parts in images.

Definition 1: A salient structured object refers to a structured object with an independent and detectable physical boundary. An independent physical boundary means that the boundary of the object should not be contained in another structured object. For example, the window of a building should be treated as a part of the building because the whole physical boundary of the window is contained in the building's physical boundary. In addition, the physical boundary of a salient object should be detectable by state-of-the-art computer vision algorithms. For a group of people, if each individual is too small or several people wear clothes of the same color, making it difficult to clearly detect each individual's boundary with today's computer vision technology, then the whole group of people should be treated as a single salient object.

Definition 2: An object part refers to a homogenous portion of a salient structured object surface in an image.

Based on our empirical observation, most object parts have approximately homogenous surfaces (e.g., color, texture, etc.). Therefore, the homogenous patches in an image approximately correspond to the parts of the objects in the image. Throughout this paper, we use this definition for object parts.

In the remainder of this section, we first introduce how to recognize common background objects such as skies, roads, and vegetation in outdoor natural scenes. Then, we present our POM and the boundary detection algorithm. Finally, we describe our image segmentation algorithm based on the POM.

A. Background Identification in Outdoor Natural Scenes

According to [40], objects appearing in natural scenes can be roughly divided into two categories, namely, unstructured and structured objects. Unstructured objects typically have nearly homogenous surfaces, whereas structured objects typically consist of multiple constituent parts, with each part having a distinct appearance (e.g., color, texture, etc.). The common backgrounds in outdoor natural scenes are unstructured objects such as skies, roads, trees, and grasses.


These background objects have low visual variability and, in most cases, are distinguishable from the structured objects in an image. For instance, the sky usually has a uniform appearance with blue or white colors, whereas a tree or a patch of grass usually has a textured appearance with green colors. Therefore, these background objects can be accurately recognized solely based on appearance information.

Suppose we use a bottom-up segmentation method to segment an outdoor image into uniform regions. Then, some of the regions must belong to the background objects. To recognize these background regions, we use a technique similar to [40]. The key to this method is to use textons to represent object appearance. The term texton was first presented in [44] for describing human textural perception. The whole textonization process proceeds as follows: First, the training images are converted to the perceptually uniform CIE Lab color space. Then, the training images are convolved with a 17-D filter bank. We use the same filter bank as that in [41], which consists of Gaussians at scales 1, 2, and 4; the x and y derivatives of Gaussians at scales 2 and 4; and Laplacians of Gaussians at scales 1, 2, 4, and 8. The Gaussians are applied to all three color channels, whereas the other filters are applied only to the luminance channel. By doing so, we obtain a 17-D response for each training pixel. The 17-D response is then augmented with the CIE L, a, and b channels to form a 20-D vector. This is different from [41] because we found that, after augmenting the three color channels, we can achieve slightly higher classification accuracy. Then, the Euclidean-distance K-means clustering algorithm is performed on the 20-D vectors collected from the training images to generate K cluster centers. These cluster centers are called textons. Finally, each pixel in each image is assigned to the nearest cluster center, producing the texton map. More details about the textonization process can be found in [41].

After this textonization process, each image region of the training images is represented by a histogram of textons. We then use these training data to train a set of binary AdaBoost classifiers [43] to classify the unstructured objects (e.g., skies, roads, trees, grasses, etc.). Similar to the result in [40], our classifiers also achieve high accuracy in classifying these background objects in outdoor images. An example of background identification is illustrated in Fig. 4(b).
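To make the textonization step concrete, the following is a minimal sketch of the filter-bank and clustering stages described above, assuming scikit-image for the Lab conversion and scikit-learn for K-means. The subsampling rate and the choice of K are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np
from skimage import color
from scipy.ndimage import gaussian_filter, gaussian_laplace
from sklearn.cluster import KMeans

def filter_bank_responses(rgb):
    """17-D response per pixel: Gaussians on L, a, b at scales 1, 2, 4 (9 filters);
    x/y derivatives of Gaussians on L at scales 2, 4 (4); LoG on L at scales
    1, 2, 4, 8 (4)."""
    lab = color.rgb2lab(rgb)
    L = lab[..., 0]
    responses = []
    for c in range(3):                      # Gaussians on all three channels
        for s in (1, 2, 4):
            responses.append(gaussian_filter(lab[..., c], s))
    for s in (2, 4):                        # derivatives on luminance only
        responses.append(gaussian_filter(L, s, order=(0, 1)))  # d/dx
        responses.append(gaussian_filter(L, s, order=(1, 0)))  # d/dy
    for s in (1, 2, 4, 8):                  # Laplacians of Gaussians on luminance
        responses.append(gaussian_laplace(L, s))
    return np.stack(responses, axis=-1)     # H x W x 17

def learn_textons(images, K=400):
    """Cluster 20-D vectors (17 filter responses + L, a, b) into K textons;
    K = 400 and the stride-16 subsampling are assumptions for speed."""
    feats = []
    for rgb in images:
        resp = filter_bank_responses(rgb)
        lab = color.rgb2lab(rgb)
        feats.append(np.concatenate([resp, lab], axis=-1).reshape(-1, 20))
    X = np.concatenate(feats, axis=0)
    km = KMeans(n_clusters=K, n_init=4).fit(X[::16])
    return km   # km.predict(per-pixel 20-D vectors) yields the texton map
```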

B. POM

Most images consist of background and foreground objects. Most foreground objects are structured objects that are often composed of multiple parts, with each part having distinct surface characteristics (e.g., color, texture, etc.). Assume that we can use a bottom-up method to segment an image into uniform patches; then, most structured objects will be oversegmented into multiple patches (parts). After the background patches are identified in the image, the majority of the remaining image patches correspond to the constituent parts of structured objects [see Fig. 4(b) for an example]. The challenge is how to piece the set of constituent parts of a structured object together to form a region that corresponds to the structured object without any object-specific knowledge of the object. To tackle this problem, we develop a POM. Accordingly, our image


segmentation algorithm can be divided into the following three steps.
1) Given an image, use a bottom-up method to segment it into uniform patches.
2) Use the background classifiers to identify background patches.
3) Use the POM to group the remaining patches (parts) into larger regions that correspond to structured objects or semantically meaningful parts of structured objects.

We now go through the details of our POM. Even after background identification, a large number of patches (parts) remains. Different combinations of the parts form different regions. How can we find a region that corresponds to a structured object? We use the Gestalt laws to guide the search for such regions. Our strategy is that, since there always exist some special structural relationships that obey the principle of nonaccidentalness among the constituent parts of a structured object, we may be able to piece the set of parts together by capturing these special structural relationships. The whole process works as follows: We first pick one part and then keep growing the region by trying to group its neighbors with the region. The process stops when none of the region's neighbors can be grouped with the region. To achieve this, we develop a measurement of how good a grouped region is. The region goodness directly depends on how well the structural relationships of the parts contained in the region obey the Gestalt laws; in other words, the region goodness is defined from a perceptual organization perspective. With this region measurement, we can then find the best region that contains the initial part. In most cases, the best region corresponds to a single structured object or a semantically meaningful part of the structured object. This problem is formalized as follows.

Problem Definition: Let I represent a whole image that consists of the regions that belong to backgrounds B and the regions that belong to structured objects O_1, ..., O_n. After the background identification, we know that most of the structured objects in the image are contained in a subregion S_f. Let P be the initial partition of S_f from a bottom-up segmentation method, and let p_i denote a uniform patch from the initial partition P. A given patch p_0 ∈ P is one of the constituent parts of an unknown structured object. Based on the initial part p_0, we want to find the maximum region R* so that the initial part p_0 and any uniform patch p_i ⊆ R*, with p_i ≠ p_0, have some special structural relationships obeying the nonaccidentalness principle with the remaining patches in R*. This is formulated as

    R* = arg min_R E(∂R),  with p_0 ⊆ R ⊆ S_f    (1)

where R is a region in S_f, ∂R is the boundary of R, and E(·) is a boundary energy function. The boundary energy function provides a tool for measuring how good a region is. The goal is to find the best region in S_f that contains the initial part p_0. The boundary energy function is defined as follows [21]:

    E(∂R) = − (∬_R w(x, y) dx dy) / |∂R|    (2)

where |∂R| is the boundary length of R and w is the weight function in region R.
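As an illustration, the following is a minimal sketch of evaluating the boundary energy in (2) for a candidate region given as a binary mask. The negated-flow form of the energy and the 4-connected perimeter estimate are assumptions made for this sketch, not the paper's exact implementation.

```python
import numpy as np

def boundary_length(mask):
    """Approximate |dR| by counting 4-connected label transitions
    (including transitions across the padded image border)."""
    m = np.pad(mask.astype(np.uint8), 1)
    return np.abs(np.diff(m, axis=1)).sum() + np.abs(np.diff(m, axis=0)).sum()

def boundary_energy(mask, w):
    """E(dR) = -(sum of w over R) / |dR|, with w an H x W weight image
    matching the mask shape; lower energy means a better region."""
    flow = w[mask].sum()              # numerator: total "flow" into the region
    return -flow / boundary_length(mask)
```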

The criterion of region goodness depends on how the weight function w is defined and on the boundary length |∂R|. One can use w to encode any information (e.g., color or texture) over region R. In our case, we use w to encode the local structural relationships between neighboring parts contained in region R. The boundary length |∂R|, on the other hand, reflects a global property of region R. In the following, the boundary energy function is designed so that four Gestalt laws, i.e., similarity, symmetry, alignment, and proximity, affect the boundary energy through the weight function w, whereas the convexity law, which concerns a global property, affects the boundary energy through the boundary length |∂R|.

First, we define the weight function in patch p_i as

    w(x, y) = λ^T (1 − |v_i − v_0|),  (x, y) ∈ p_i    (3)

where λ is a weight vector, which we set empirically in our implementation. Vector v_i is a point in the structural context space encoding the structural information of image patch p_i, and v_0 is a reference point in the structural context space, which encodes the structural information of the initial part p_0. Since p_0 is the only known information that we have about the unknown structured object, we use v_0 as the reference point. Notice that, at the beginning of the grouping process, region R contains only part p_0; in this case, v_i = v_0, which means that we always assign the largest weight to the initial part p_0. Then, we try to grow region R by including some of the neighbors of R and assign weights to the included neighbors accordingly so that we can measure whether the grown region is better than the old one. The function |·| calculates the absolute value of each vector element. A large value of the weight function inside a newly included patch means that the current image patch has a strong structural relationship with the constituent parts of the unknown structured object that contains the initial part p_0. The components of v_i are the cohesiveness strength CS_i, which we define later, and the boundary complexity BC_i of image patch p_i, which can be measured as [15]

    BC_i = (1 / N_b) Σ_t S(t) F(t)    (4)

    S(t) = (1 / l_w) Σ_{q ∈ ab} d(q, ab)    (5)

    F(t) = n / l_w    (6)

where N_b is the number of pixels of the boundary of image patch p_i; l_w is the length of a sliding window moved over the entire boundary of patch p_i; S(t) and F(t) are the respective strength and frequency of the singularity at scale (step) t; a and b are the two end pixels of a segment of the boundary in the window; q ranges over the boundary pixels between a and b, with d(q, ab) their deviation from the segment ab; and n is the number of notches in the window. A notch is a nonconvex portion of a polygon, defined as a vertex with an inner angle larger than 180° (see details in [15] and [18]). Examples of the boundary complexity of regular and irregular shapes are shown in Fig. 1.

Fig. 1. Examples of shape regularity. First row: regular shapes. Second row: irregular shapes. Notice that regular shapes have smaller complexity values than irregular shapes.

Based on the similarity of the boundary complexity, we can distinguish man-made object parts, which usually have regular shapes, from vegetation, which usually has irregular shapes. This is especially useful for distinguishing vegetation that may not be recognized solely based on appearance. Therefore, the first Gestalt law we encode in the POM is the similarity law.

After obtaining the boundary complexity of image patch p_i, we then measure how tightly image patch p_i is attached to the parts of the unknown structured object that contains the initial part p_0. The cohesiveness strength CS_i is calculated as

    CS_i = 1,  for i = 0
    CS_i = CS_j · max{Sym(p_i, p_j), Align(p_i, p_j), Att(p_i, p_j)},  for i ≠ 0    (7)

where p_j is a neighboring patch of patch p_i. The maximum value of the cohesiveness strength is one. For patch p_0, the cohesiveness is always set to the maximum value since we know for sure that the patch belongs to the unknown structured object. Assume that the cohesiveness strength of patch p_j is known. Sym(p_i, p_j) measures the symmetry of p_i and p_j along a vertical axis and is defined as

    Sym(p_i, p_j) = δ(c_i, c_j)    (8)

where δ is the Kronecker delta function, and c_i and c_j are the column coordinates of the centroids of p_i and p_j. If c_i and c_j are very close, patches p_i and p_j are approximately symmetric along a vertical axis. If patch p_j has a strong cohesiveness strength to the unknown object containing patch p_0, then patch p_i also has a strong cohesiveness strength to that object; that is, patch p_i is tightly attached to some parts of the unknown object. This is because parts that are approximately symmetric along a vertical axis very likely belong to the same object. Thus, this test encodes the symmetry law. Examples of symmetry relationships are shown in Fig. 2(a).

Align(p_i, p_j) measures the alignment of patches p_i and p_j

    Align(p_i, p_j) = 1 if l_e ∩ B_i ≠ ∅ and l_e ∩ B_j ≠ ∅, and 0 otherwise    (9)

where B_i and B_j are the boundaries of p_i and p_j, respectively; l_e is the extension of the common boundary between p_i and p_j; and ∅ denotes the empty set. This alignment test encodes the continuity law. If two object parts are strictly aligned along a direction, then the boundary of the union of the two components will have good continuation. Accordingly, alignment is a strong indication that these two parts may belong to the same object. Examples of alignment relationships are shown in Fig. 2(b).
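The following sketch illustrates the symmetry test (8) and a simplified stand-in for the alignment test (9) on patches given as binary masks. The tolerance tau and the bounding-box approximation of boundary continuation are hypothetical simplifications introduced for this sketch, not the paper's exact tests.

```python
import numpy as np

def symmetric(mask_i, mask_j, tau=3):
    """Approximate (8): 1 if the centroid columns of the two patches (nearly)
    coincide; tau stands in for the Kronecker-delta comparison."""
    ci = np.argwhere(mask_i)[:, 1].mean()   # centroid column of patch i
    cj = np.argwhere(mask_j)[:, 1].mean()   # centroid column of patch j
    return 1.0 if abs(ci - cj) <= tau else 0.0

def aligned(mask_i, mask_j, tol=2):
    """Crude stand-in for (9): treat the patches as aligned if their bounding
    boxes share top/bottom rows (horizontal alignment) or left/right columns
    (vertical alignment), i.e., the union's outline continues smoothly."""
    ri, rj = np.argwhere(mask_i), np.argwhere(mask_j)
    horiz = (abs(ri[:, 0].min() - rj[:, 0].min()) <= tol and
             abs(ri[:, 0].max() - rj[:, 0].max()) <= tol)
    vert = (abs(ri[:, 1].min() - rj[:, 1].min()) <= tol and
            abs(ri[:, 1].max() - rj[:, 1].max()) <= tol)
    return 1.0 if (horiz or vert) else 0.0
```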

Fig. 2. (a) Symmetry relationships: the red dots indicate the centroids of the components. (b) Alignment relationships. (c) Two components that are strongly attached. (d) Two components that are weakly attached.

If p_i and p_j are neither symmetric nor aligned, then the cohesiveness strength of image patch p_i depends on how it is attached to p_j. Att(p_i, p_j) measures the attachment strength between p_i and p_j and is defined as

    Att(p_i, p_j) = α · (|B_ij| / (|B_i| + |B_j|)) · sin θ, if 1/β < |B_i| / |B_j| < β, and 0 otherwise, for p_j ∈ neighbors(p_i)    (10)

where |B_ij| is the length of the common boundary between p_i and p_j, and α and β are constants that we set empirically in our implementation. The attachment strength depends on the ratio of the common boundary length between p_i and p_j to the sum of the boundary lengths of p_i and p_j. If there is a large size difference between p_i and p_j (i.e., |B_i| ≫ |B_j| or |B_j| ≫ |B_i|), it usually means that the larger one belongs to the background, such as a wall or a large patch of vegetation. If p_i and p_j have similar sizes and share a long common boundary, then p_i and p_j might be adjacent in the 3-D world; in other words, p_i is considered to be in close proximity to p_j. In this case, p_i and p_j are considered to be strongly attached. θ is the angle between the line connecting the two ends of the common boundary and the horizontal line starting from one end of the common boundary. Many objects may be located next to each other in natural outdoor scenes; therefore, even if patches p_i and p_j are tightly attached along the horizontal direction, they may still belong to two neighboring objects. We use "sin θ" in (10) to control the cohesiveness strength of two attached patches according to the attachment orientation. In general, vertically attached parts have larger attachment strength than horizontally attached ones. Examples of strong and weak attachments are shown in Fig. 2(c) and (d), respectively.

We have now explicitly encoded four Gestalt laws (i.e., similarity, symmetry, continuity, and proximity) into our POM. The four Gestalt laws affect the weight function w assigned to different parts and hence affect the boundary energy of different regions R. The convexity law also affects the boundary energy of different regions R; since convexity is a global property, it does so in a different way. As shown in Fig. 3, patch p_j is embedded into patch p_i, which causes a concavity on patch p_i. Due to this concavity, the boundary length of the region that contains both p_i and p_j is shorter than that of the region that contains only patch p_i. As a result, the boundary energy of the region that contains patches p_i and p_j is smaller than that of the region that contains only patch p_i. Therefore, patches p_i and p_j are treated as one entity by our POM. In summary, any parts that are embedded into a big entity will be


grouped together with the big entity because they decrease the boundary length of the newly formed big entity. This is because the embedded components increase the degree of convexity of the big entity in which they are embedded.

Similar to the human visual system, our POM can "perceive" a list of special structural relationships that obey the principle of nonaccidentalness, such as similarity in shape regularity, symmetry, alignment, adjacency, and embedment. The "perception" is quantified by the boundary energy whenever a new member is added to a group. If the new member has some structural relationships obeying the principle of nonaccidentalness with the other members in the group, then the boundary energy of the newly formed group is smaller than that of the old group; otherwise, it is larger.

The remaining task is to find the region in (1) that has the minimum boundary energy among all the regions that contain image patch p_0. In other words, we want to find the best region that contains image patch p_0 such that all the image patches contained in the region have some special structural relationships obeying the principle of nonaccidentalness with each other. This region often corresponds to the whole structured object or a semantically meaningful portion of it. The challenge is that there may exist a large number of possible regions containing image patch p_0 in S_f, and it is computationally expensive to search all of them for the one with the global minimum boundary energy. Therefore, we develop an efficient boundary detection algorithm based on a breadth-first search strategy. Instead of finding the region with the global minimum boundary energy, the algorithm tries to find a region with a local minimum boundary energy. Although the algorithm is not guaranteed to always find the region with the optimal boundary energy, we have found that it works quite well in practice.

Fig. 3. Example of a convexity relationship. The boundary energy of the (middle) region is larger than that of the (right) region that contains both p_i and p_j. Thus, our POM will group p_i and p_j together.

Algorithm 1 Boundary detection based on perceptual organization

INPUT: graph G_f, search width m, and reference region p_0
OUTPUT: region R* that contains p_0 with the minimal boundary energy in a local area of G_f
1. Let R* = p_0.
2. Let N = neighbors(R*).
3. Repeat steps 4-7 for k = 1, ..., m.
4. Select a subset A of N with k regions so that, for every region p in A, there exists a path in A ∪ {R*} connecting p to R*.
5. Measure the boundary energy of R* ∪ A with (2).
6. If E(∂(R* ∪ A)) < E(∂R*), set R* = R* ∪ A, GOTO step 2.
7. Otherwise, select the next set of k regions from N and repeat steps 4-7 until all possible subsets have been tested.
8. Return R*.
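A minimal sketch of Algorithm 1 for the case k = 1 (growing by one neighbor at a time) follows. The `neighbors` callback and the set-valued `boundary_energy` evaluator (e.g., rasterize the union of the patches and apply (2)) are assumed interfaces for this sketch, not part of the original implementation.

```python
def detect_best_region(p0, neighbors, boundary_energy):
    """Greedy local search for the region R* containing p0 with locally
    minimal boundary energy; patches are identified by hashable ids."""
    best = frozenset([p0])             # R* starts as the initial part
    best_e = boundary_energy(best)
    improved = True
    while improved:                    # stop when no neighbor lowers the energy
        improved = False
        for n in neighbors(best):      # immediate neighbors of current region
            cand = best | {n}
            e = boundary_energy(cand)
            if e < best_e:             # accept the lower-energy combination
                best, best_e = cand, e
                improved = True
                break                  # GOTO step 2: recompute the neighbor set
    return best
```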

At the beginning, R* contains only image patch p_0. Then, the algorithm measures the boundary energy of combinations of R* and its immediate neighbors. The algorithm stops when no combination of R* and its immediate neighbors has smaller boundary energy than that of R*. The case "k = 1" in step 3 tests the combination of R* and a single neighboring region of R*; "k = 2" tests the combination of R* and a pair of connected neighboring regions of R*, and so on. In practice, we have found that, even for small values of m, the algorithm performs well in general.

C. Image Segmentation Algorithm

The POM introduced in Section III-B can capture the special structural relationships that obey the principle of nonaccidentalness among the constituent parts of a structured object. To apply the proposed POM to real-world natural scene images, we need to first segment an image into regions so that each region approximately corresponds to an object part. In our implementation, we make use of Felzenszwalb and Huttenlocher's approach [5] to generate initial superpixels for an outdoor scene image. We choose this method because it is very efficient and its results are comparable to those of the mean-shift algorithm [48]. However, the initial superpixels are, in many cases, still too noisy. To further improve the segmentation quality, we apply a segment-merge method to the initial superpixels that merges small regions (i.e., regions smaller than 0.03% of the image size) with their neighbors. These small regions are often caused by the texture of surfaces or by the inhomogeneous portions of some part surfaces. Since these small image regions contribute little to the structure information (shape and size) of object parts, we merge them with their larger neighbors to improve the performance of our POM. In addition, if two adjacent regions have similar colors, we also merge them together. By doing so, we obtain a set of improved superpixels, most of which approximately correspond to object parts.

We now turn to the image segmentation algorithm. Given an outdoor scene image, we first apply the segment-merge technique described above to generate a set of improved superpixels. Most of the superpixels approximately correspond to object parts in the scene. We build a graph to represent these superpixels: Let G = (V, E) be an undirected graph. Each vertex corresponds to a superpixel, and each edge corresponds to a pair of neighboring vertices. We then use the background classifiers described in Section III-A to divide G into two parts: background objects G_b, such as sky, roads, grasses, and trees, and structured parts G_f. Most of the


Fig. 4. Illustration of our segmentation pipeline. (a) Bottom: input images. Top: initial superpixels from [5]. (b) Top: improved superpixels with background objects identified: sky is labeled blue, ground is labeled yellow, and vegetation (tree or grass) is labeled green. Bottom: an example of the perceptual organization process, where E stands for boundary energy. First, the bottom white part of the white car is selected; E for this part is measured as 5.06. Then, our POM groups the two pieces of the front windows with the white part based on the convexity law; E for these two regions is measured as 5.39 and 5.46, respectively. Except for these two parts and the small segment of the front wheel, the other parts do not have special geometric relations with the white part, such as the bottom part of the white sign behind the front of the car; E for the region containing the sign part is measured as 5.09. Therefore, the region with E of 5.46 is detected as the best region for the white part of the car. (c) Top: result of the first round of perceptual organization. Notice that the different parts of the white car have now been grouped into two big pieces, and these two big pieces are aligned. Bottom: final segmentation result after the second round of perceptual organization. Notice that the different parts of the white car are grouped together as a single object. (This figure is best viewed in color.)

structured objects in the scene are therefore contained in G_f. We then apply our perceptual organization algorithm to G_f. At the beginning, all the components in G_f are marked as unprocessed. Then, for each unprocessed component v in G_f, we use the boundary detection algorithm described in Section III-B to detect the best region R* that contains vertex v. Region R* may correspond to a single structured object or a semantically meaningful part of a structured object. We mark all the components comprising R* as processed. The algorithm gradually moves from the ground plane up to the sky until all the components in G_f are processed. This finishes one round of the perceptual organization procedure, and the grouped regions of this round are used as inputs for the next round of perceptual organization on G_f. At the beginning of a new round, we merge adjacent components if they have similar colors and build a new graph for the new components in G_f. This perceptual organization procedure is repeated for multiple rounds until no components in G_f can be grouped with other components. In practice, we find that the result of two rounds of grouping is good enough in most cases. Finally, in a postprocessing procedure, we merge all the adjacent sky and ground objects to generate the final segmentation. An illustration of the algorithm's pipeline is shown in Fig. 4.

IV. EXPERIMENTAL RESULTS

A. Gould Database

We first test our image segmentation algorithm on the recently released Gould image data set (GDS) [52]. This data set contains

715 images of urban and rural scenes assembled from a collection of public image data sets: LabelMe [57], MSRC-21 [40], PASCAL [58], and geometric context [12]. The images in this data set are downsampled to approximately 320 × 240 pixels. The images contain a wide variety of man-made and biological objects such as buildings, signs, cars, people, cows, and sheep. This data set provides ground truth object class segmentations that associate each region with one of eight semantic classes (sky, tree, road, grass, water, building, mountain, or foreground). In addition to the object class labels, ground truth object segmentations that associate each segment with one physical object are also provided. The data set is publicly available from the first author's website [52].

Following the same setup as in [52], we randomly split the data set into 572 training images and 143 testing images. We benchmarked a state-of-the-art class segmentation method, Gould09 [49], for reference on this data set. Like our method, Gould09 also uses superpixels as a starting point. We used the normalized cut algorithm [7] to generate 400 superpixels (per image) for use in the Gould09 method. The Gould09 method is a slight variant of the baseline method described in [54]. The baseline method in [54] achieved comparable results against the relative location prior method in [54], Shotton's method [40], and Yang's method [53] on the MSRC-21 data set [40]. Gould09 is trained on the training set and tested on the testing set. We first use the training images to train five background classifiers for background identification. Then, we test our POM method on both the testing set and the full GDS data set.
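For concreteness, a hedged sketch of training the five binary background classifiers on region-level texton histograms follows, using scikit-learn's AdaBoostClassifier as a stand-in for the boosting procedure of [43]; the estimator count is an assumed setting, not the paper's.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_background_classifiers(region_histograms, region_labels,
                                 classes=("sky", "road", "grass", "tree", "water")):
    """region_histograms: (N, K) texton histograms of training regions.
    region_labels: N class-name strings. Returns one binary one-vs-rest
    AdaBoost classifier per background class."""
    X = np.asarray(region_histograms)
    classifiers = {}
    for c in classes:
        y = np.array([1 if lbl == c else 0 for lbl in region_labels])
        clf = AdaBoostClassifier(n_estimators=100)   # assumed setting
        classifiers[c] = clf.fit(X, y)
    return classifiers
```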

1014

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

TABLE I
SEGMENTATION ACCURACY SCORE ON GDS

We choose the method proposed by Martin [25] as the measurement of segmentation accuracy. The segmentation accuracy score is defined as

    score(G, S) = |G ∩ S| / |G ∪ S|    (11)

where G and S represent the set of pixels in the ground truth segment of an object and in the machine-generated object segment, respectively. Because all the images in this data set are downsized to 320 × 240 pixels, we set the parameters of Felzenszwalb's algorithm [5] to small values to generate the initial superpixels from the input images; we found that, with this set of parameters, Felzenszwalb's algorithm works well for small images (320 × 240). The parameters of our POM are likewise fixed. We used the 572 training images to learn five binary AdaBoost classifiers [59] to identify five background object classes (i.e., sky, road, grass, trees, and water). The software of our work will be available from the first author's website.

Table I compares the performance of our method with that of the baseline method (Gould09) on the GDS. The table presents the segmentation accuracy for the eight classes and for overall objects (average); the segmentation accuracy measurement is based on (11). For each class, the score is averaged over all the salient object segments in the class. For overall objects, the score is averaged over all the detected salient object segments. If the size of a ground truth object segment is smaller than 0.5% of the image size, it is not a salient object and is not counted in the segmentation accuracy. In total, we detected 2757 salient objects from the 143 testing images, i.e., on average, 19 objects per image. We achieve an average improvement of 16.2% over the performance of the Gould09 method. Among the 2757 salient objects detected in the testing images, structured objects (buildings and foregrounds) account for 52.6%. Our method significantly outperforms the Gould09 method on segmenting the structured objects. For the full data set, we detected 13 430 salient objects from 715 images, i.e., on average, 18.8 objects per image. Structured objects account for 54.8% of the total detected salient objects. For the structured objects, the POM does not gain any prior knowledge from the training images, yet it achieves very stable performance on segmenting these difficult structured objects on the full data set. This shows that our POM can successfully handle the various structured objects appearing in outdoor scenes.

The last column in Table I presents the pixel-level accuracy. Pixel-level accuracy reflects how accurate the classification is for multiclass segmentation methods.
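The accuracy score in (11) can be computed directly from two boolean masks; a minimal sketch, under the intersection-over-union reading reconstructed above, is:

```python
import numpy as np

def accuracy_score(G, S):
    """score(G, S) = |G intersect S| / |G union S|; 1.0 is a perfect match."""
    inter = np.logical_and(G, S).sum()
    union = np.logical_or(G, S).sum()
    return inter / union if union > 0 else 0.0
```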


Pixel-level accuracy is computed as the percentage of image pixels assigned to the correct class label (see [40] for a detailed definition). Our POM does not have a pixel-level accuracy because it is not a multiclass segmentation method: it does not label each pixel of an image with one of the eight semantic classes as Gould09 does (see the last row in Fig. 5). Gould09 appears adaptable to variation in the number of semantic classes; the method achieved 70.1% pixel-level accuracy on the 21-class MSRC database according to [54] and an impressive 75.4% pixel-level accuracy on the 8-class GDS. However, the foreground class in the GDS includes a wide variety of structured object classes, such as cars, buses, people, signs, sheep, cows, bicycles, and motorcycles, which have totally different appearance and shape characteristics. This makes it difficult to train an accurate classifier for the foreground class. As a result, the Gould09 method cannot handle complicated environments where multiple foreground objects appear close to each other; in such cases, it often labels a whole group of physically different object instances, such as people, cars, and signs, as one continuous foreground region (see Fig. 5 for examples). This affects the performance of Gould09 on object-level segmentation. If the foreground class were further divided into more semantic object classes, the performance of the Gould09 method on the GDS could be expected to improve.

The small number of semantic classes does not affect our method. Our method only requires identifying five background object classes (i.e., sky, trees, road, grass, and water); the remaining object classes are treated as structured objects, and our POM can handle many structured objects without recognizing them. From this perspective, our method is easy to train compared with the class segmentation methods in the literature.

To gain a qualitative perspective on the performance of the two methods on this data set, we present several representative images, along with the ground truth segmentations, our method's results, and the results of the Gould09 method, in Fig. 5. The first example (first column in Fig. 5) contains a nearly centered person against a vegetation background. The Gould09 method classified only some parts of the centered person as foreground, whereas our method pieced the major portion of the person together. The second example is a typical street scene that contains several structured objects (e.g., a vehicle and buildings) and background objects; these structured objects are physically separated in the image. Our method segmented most of the vehicles and the buildings, whereas the Gould09 method erroneously merged the physically separated buildings together. The third image is a cluttered street scene that contains several vehicles parked in front of a building.

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION

1015

Fig. 5. Examples of our POM segmentation algorithm on the GDS. (Row 1) Input images. (Row 2) Ground truth segmentations. (Row 3) POM (ours) results. (Row 4) Gould09’s results. (Row 5) Gould09’s class segmentation results. (This figure is best viewed in color.)

The vehicles are parked close to each other. Our method segmented most of the vehicles and the background buildings well, whereas Gould09 misclassified some parts of the buildings as foreground and merged the whole group of vehicles together.

B. Berkeley Segmentation Data Set

Furthermore, we evaluate our POM image segmentation method on the Berkeley segmentation data set (BSDS) [60]. BSDS contains a training set of 200 images and a test set of 100 images. For each image, BSDS provides a collection of hand-labeled segmentations from multiple human subjects as

ground truth. BSDS has been widely used as a benchmark for many boundary detection and segmentation algorithms in the literature. We directly evaluate our POM method on the test set of BSDS. The images in this data set are 481 × 321 pixels, larger than the images in the GDS, so we use larger parameter values for Felzenszwalb's algorithm [5] to generate the initial superpixels for an input image. We keep the same parameter settings for our POM and use the same background classifiers trained on the GDS data set to identify background objects in this data set.
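A hedged sketch of generating the initial superpixels with the graph-based method of Felzenszwalb and Huttenlocher [5], via scikit-image's implementation, is shown below; the parameter values are placeholders, not the settings used in this paper.

```python
from skimage import io
from skimage.segmentation import felzenszwalb

def initial_superpixels(path, scale=100.0, sigma=0.8, min_size=50):
    """Return an integer superpixel label map; larger `scale` and `min_size`
    yield larger superpixels, which suits the larger 481 x 321 BSDS images."""
    image = io.imread(path)
    return felzenszwalb(image, scale=scale, sigma=sigma, min_size=min_size)
```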


Fig. 6. Examples of our POM segmentation algorithm on the BSDS data set. (This figure is best viewed in color.)

We apply both region-based and boundary-based measurements to evaluate our POM method on the test set of BSDS. The region-based segmentation accuracy measurement is again based on (11). For each image, BSDS provides a collection of multiple human-labeled segmentations; for simplicity, we select only the first human-labeled segmentation of the collection as ground truth for the image. The score is averaged over all the salient object segments. If the size of a ground truth segment is smaller than 0.5% of the image size, it is not a salient object and is not counted in the segmentation accuracy. In total, we detected 681 salient objects from the 100 images, i.e., on average, 6.8 objects per image. Our POM achieved an average segmentation accuracy score of 53% on the test set of BSDS.

For the boundary-based measurement, we use the precision–recall framework recommended by BSDS. A precision–recall curve is a parameterized curve that captures the tradeoff between accuracy and noise. Precision is the fraction of detections that are true boundaries, whereas recall is the fraction of true boundaries that are detected. Thus, precision is the probability that the segmentation algorithm's signal is valid, and recall is the probability that the ground truth data are detected. These two quantities can be combined in a single quality measure, the F-measure, defined as the weighted harmonic mean of precision and recall. Boundary detection algorithms usually generate a soft boundary map for an image, which consists of one-pixel-wide boundaries valued from zero to one, where high values signify greater confidence in the existence of a boundary. By thresholding the soft boundary map at multiple levels and computing precision and recall at each level, a precision–recall curve can be generated for a boundary detection algorithm.
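For reference, the following sketch computes precision, recall, and the weighted harmonic-mean F-measure from already-matched boundary counts; alpha = 0.5 recovers the balanced F-measure commonly reported on BSDS, and the pixel-matching step itself is assumed to have been done beforehand.

```python
def f_measure(num_matched_detections, num_detections,
              num_matched_truth, num_truth, alpha=0.5):
    """Return (F, precision, recall); F = P*R / (alpha*R + (1-alpha)*P),
    which equals 2PR/(P+R) when alpha = 0.5."""
    precision = num_matched_detections / max(num_detections, 1)
    recall = num_matched_truth / max(num_truth, 1)
    if precision + recall == 0:
        return 0.0, precision, recall
    f = precision * recall / (alpha * recall + (1 - alpha) * precision)
    return f, precision, recall
```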

Different from boundary detection algorithms, our POM segmentation method generates a binary boundary map. Therefore, the precision–recall curve for the POM degenerates into a single point. At this operating point, the F-measure achieved by our POM on the BSDS data set outperforms that of the global probability of boundary [61], the boundary detection algorithm with the best reported performance on BSDS. The boundary detection algorithm ranking and the software for generating precision–recall curves on BSDS can be found on the BSDS web page [60]. Some examples of our method's results on BSDS are presented in Fig. 6.

C. Limitation of Our Method

Fig. 7 shows some examples where our method fails; there are still some mistakes that we would like to address in the future. The segmentation of our POM is mainly based on the geometric relationships between different object parts. This requires obtaining the geometric properties (e.g., shape, size, etc.) of object parts. We assume that object parts have nearly homogenous surfaces and, hence, that the uniform regions in an image correspond to object parts. Although this assumption holds in most cases, there are still some exceptions. For example, in Fig. 7(a), the black car body is painted with different patterns. As a result, the car body is oversegmented into many small parts. In this situation, our POM could not detect any special relationships between the small parts and hence could not piece them together. Similar situations can be found on the woman's clothing in Fig. 7(b) and the leopard in Fig. 7(c).

Some object classes, such as bicycles, motorcycles, or buildings, have very complex structures, and some parts of these objects are not strongly attached to other parts. For these object classes,


Fig. 7. Examples of where our POM segmentation algorithm makes mistakes. (This figure is best viewed in color.)

our POM may not be able to piece the whole object together. Instead, it may only piece together some semantically meaningful parts of the object [see Fig. 7(d)]. For these objects, higher level object-specific knowledge is still required to segment the entire object.

Another problem is caused by strong reflection. An example is shown in Fig. 7(e). Due to strong reflection, the upper rear part of the blue bus shows an extremely bright white color. Our method identified this region as sky and hence did not piece the part together with the bus.

In some cluttered environments, one structured object may stand in front of another structured object. From some viewpoints, some parts of the front object may coincidentally have special geometric relationships with some parts of the background objects. Under these situations, our POM may be confused and merge these parts together [see the car in the middle of Fig. 7(f)]. This problem can be addressed by recognizing the background structured objects. Currently, our method can only identify five homogenous background object classes (i.e., sky, road, trees, grass, and water). Mountains, buildings, and walls are also common background objects in outdoor scenes. We plan to enhance the background identification capability of our method in the future by training more classifiers to identify mountains, buildings, walls, etc. With the ability to identify more background object classes, the performance of our method can be expected to improve further.

V. CONCLUSION AND DISCUSSION

We have presented a novel image segmentation algorithm for outdoor natural scenes. Our main contribution is the development of a POM. Our experimental results show that our proposed

method outperformed two competing state-of-the-art image segmentation approaches (Gould09 [49] and the global probability of boundary [61]) and achieved good segmentation quality on two challenging outdoor scene image data sets (the GDS [52] and BSDS [60]).

It is well accepted that segmentation and recognition should not be separated and should be treated as an interleaved procedure. Our method basically follows this scheme and requires identifying some background objects as a starting point. Compared to the large number of structured object classes, there are only a few common background object classes in outdoor scenes. These background objects have low visual variety and hence can be reliably recognized. After the background objects are identified, we roughly know where the structured objects are and can delimit perceptual organization to certain areas of an image. For many objects with polygonal shapes, such as the major object classes appearing in streets (e.g., buildings, vehicles, signs, people, etc.) and many other objects, our method can piece together the whole object or its main portions without requiring recognition of the individual object parts. In other words, for these object classes, our method provides a way to separate segmentation from recognition. This is the major difference between our method and other class segmentation methods, which require recognizing an object in order to segment it. This paper shows that, for many fairly articulated objects, recognition may not be a requirement for segmentation: the geometric relationships of the constituent parts of the objects provide useful cues indicating the memberships of these parts.


REFERENCES
[1] T. Malisiewicz and A. A. Efros, "Improving spatial support for objects via multiple segmentations," in Proc. BMVC, 2007.
[2] E. Borenstein and E. Sharon, "Combining top-down and bottom-up segmentation," in Proc. IEEE Workshop Perceptual Org. Comput. Vis., CVPR, 2004, pp. 46–53.
[3] X. Ren, "Learning a classification model for segmentation," in Proc. IEEE ICCV, 2003, vol. 1, pp. 10–17.
[4] U. Rutishauser and D. Walther, "Is bottom-up attention useful for object recognition?," in Proc. IEEE CVPR, 2004, vol. 2, pp. 37–44.
[5] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, Sep. 2004.
[6] J. B. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[7] T. Cour, F. Benezit, and J. B. Shi, "Spectral segmentation with multiscale graph decomposition," in Proc. IEEE CVPR, 2005, vol. 2, pp. 1124–1131.
[8] D. Lowe, Perceptual Organization and Visual Recognition. Dordrecht, The Netherlands: Kluwer, 1985.
[9] J. D. McCafferty, Human and Machine Vision: Computing Perceptual Organization. Chichester, U.K.: Ellis Horwood, 1990.
[10] R. Mohan and R. Nevatia, "Perceptual organization for scene segmentation and description," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 6, pp. 616–635, Jun. 1992.
[11] A. Witkin and J. Tenenbaum, "On the role of structure in vision," in Human and Machine Vision, J. Beck, B. Hope, and A. Rosenfeld, Eds. New York: Academic, 1983.
[12] D. Hoiem, A. A. Efros, and M. Hebert, "Geometric context from a single image," in Proc. IEEE ICCV, 2005, vol. 1, pp. 654–661.
[13] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert, "Recovering occlusion boundaries from a single image," in Proc. IEEE ICCV, 2007, pp. 1–8.
[14] D. Hoiem, A. A. Efros, and M. Hebert, "Recovering surface layout from an image," Int. J. Comput. Vis., vol. 75, no. 1, pp. 151–172, Oct. 2007.
[15] H. Su, A. Bouridane, and D. Crookes, "Scale adaptive complexity measure of 2-D shapes," in Proc. IEEE ICPR, 2006, pp. 134–137.
[16] B. Vasselle and G. Giraudon, "2-D digital curve analysis: A regularity measure," in Proc. IEEE ICCV, 1993, pp. 556–561.
[17] K. Plataniotis and A. Venetsanopoulos, Color Image Processing and Applications. Berlin, Germany: Springer-Verlag, 2000, ch. 1, pp. 268–269.
[18] T. Brinkhoff, H. P. Kriegel, and R. Schneider, "Measuring the complexity of polygonal objects," in Proc. 3rd ACM Int. Workshop Adv. Geograph. Inf. Syst., 1995, pp. 109–117.
[19] D. D. Hoffman and M. Singh, "Salience of visual parts," Cognition, vol. 63, no. 1, pp. 29–78, Apr. 1997.
[20] D. W. Jacobs, "Robust and efficient detection of salient convex groups," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 1, pp. 23–37, Jan. 1996.
[21] I. H. Jermyn and H. Ishikawa, "Globally optimal regions and boundaries as minimum ratio weight cycles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1075–1088, Oct. 2001.
[22] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," MIT, Cambridge, MA, Tech. Rep. MIT-CSAIL-TR-2005-056, 2005.
[23] B. C. Russell, "Using multiple segmentations to discover objects and their extent in image collections," in Proc. IEEE CVPR, 2006, vol. 2, pp. 1605–1614.
[24] D. R. Martin, C. C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 530–549, May 2004.
[25] D. R. Martin, "An empirical approach to grouping and segmentation," Ph.D. dissertation, Univ. of California, Berkeley, CA, 2002.
[26] R. Unnikrishnan, C. Pantofaru, and M. Hebert, "A measure for objective evaluation of image segmentation algorithms," in Proc. IEEE CVPR, 2005, vol. 3, pp. 34–41.
[27] H. Zhang, S. Cholleti, S. A. Goldman, and J. E. Fritts, "Meta-evaluation of image segmentation using machine learning," in Proc. IEEE CVPR, 2006, vol. 1, pp. 1138–1145.
[28] F. J. Estrada and A. D. Jepson, "Quantitative evaluation of a novel image segmentation algorithm," in Proc. IEEE CVPR, 2005, vol. 2, pp. 1132–1139.
[29] S. C. Zhu and A. Yuille, "Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 9, pp. 884–900, Sep. 1996.
[30] S. K. Shah, "Performance modeling and algorithm characterization for robust image segmentation," Int. J. Comput. Vis., vol. 80, no. 1, pp. 92–103, Oct. 2008.
[31] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Inst. Technol., Pasadena, CA, Tech. Rep. 7694, 2007.
[32] Z. L. Liu, D. W. Jacobs, and R. Basri, "The role of convexity in perceptual completion: Beyond good continuation," Vis. Res., vol. 39, no. 25, pp. 4244–4257, Dec. 1999.
[33] D. W. Jacobs, "What makes viewpoint-invariant properties perceptually salient?," J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 20, no. 7, pp. 1304–1320, Jul. 2003.
[34] V. Bruce and P. Green, Visual Perception: Physiology, Psychology and Ecology. Hillsdale, NJ: Lawrence Erlbaum, 1990.
[35] Z. Wu and R. Leahy, "An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 11, pp. 1101–1113, Nov. 1993.
[36] P. Dollar, Z. W. Tu, and S. Belongie, "Supervised learning of edges and object boundaries," in Proc. IEEE CVPR, 2006, vol. 2, pp. 1964–1971.
[37] S. Mahamud, L. R. Williams, K. K. Thornber, and K. Xu, "Segmentation of multiple salient closed contours from real images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 4, pp. 433–444, Apr. 2003.
[38] X. F. Ren, C. C. Fowlkes, and J. Malik, "Learning probabilistic models for contour completion in natural images," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 47–63, May 2008.
[39] A. Desolneux, L. Moisan, and J. M. Morel, "A grouping principle and four applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 4, pp. 508–513, Apr. 2003.
[40] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int. J. Comput. Vis., vol. 81, no. 1, pp. 2–23, Jan. 2009.
[41] J. Winn, A. Criminisi, and T. Minka, "Object categorization by learned universal visual dictionary," in Proc. IEEE ICCV, 2005, vol. 2, pp. 1800–1807.
[42] B. Micusik and J. Kosecka, "Semantic segmentation of street scenes by superpixel co-occurrence and 3-D geometry," in Proc. IEEE Workshop VOEC, 2009.
[43] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Ann. Statist., vol. 28, no. 2, pp. 337–407, Apr. 2000.
[44] J. Malik, S. Belongie, T. Leung, and J. Shi, "Contour and texture analysis for image segmentation," Int. J. Comput. Vis., vol. 43, no. 1, pp. 7–27, Jun. 2001.
[45] T. Chan and L. Vese, "Active contours without edges," IEEE Trans. Image Process., vol. 10, no. 2, pp. 266–277, Feb. 2001.
[46] L. A. Vese and T. F. Chan, "A multiphase level set framework for image segmentation using the Mumford and Shah model," Int. J. Comput. Vis., vol. 50, no. 3, pp. 271–293, Dec. 2002.
[47] D. Mumford and J. Shah, "Optimal approximations by piecewise smooth functions and associated variational problems," Commun. Pure Appl. Math., vol. 42, no. 5, pp. 577–685, Jul. 1989.
[48] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.
[49] S. Gould, O. Russakovsky, I. Goodfellow, P. Baumstarck, A. Y. Ng, and D. Koller, The STAIR Vision Library (v2.3), 2009. [Online]. Available: http://ai.stanford.edu/~sgould/svl
[50] J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in Proc. IEEE CVPR, 2008, pp. 1–8.
[51] C. Pantofaru, C. Schmid, and M. Hebert, "Object recognition by integrating multiple image segmentations," in Proc. ECCV, 2008, pp. 481–494.
[52] S. Gould, R. Fulton, and D. Koller, "Decomposing a scene into geometric and semantically consistent regions," in Proc. IEEE ICCV, 2009, pp. 1–8.
[53] L. Yang, P. Meer, and D. J. Foran, "Multiple class segmentation using a unified framework over mean-shift patches," in Proc. IEEE CVPR, 2007, pp. 1–8.
[54] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, "Multi-class segmentation with relative location prior," Int. J. Comput. Vis., vol. 80, no. 3, pp. 300–316, Dec. 2008.
[55] X. He, R. Zemel, and M. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. IEEE CVPR, 2004, pp. 695–702.
[56] C. Cheng, A. Koschan, D. L. Page, and M. A. Abidi, "Scene image segmentation based on perceptual organization," in Proc. IEEE ICIP, 2009, pp. 1801–1804.
[57] B. C. Russell, A. B. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 157–173, May 2008.
[58] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[59] [Online]. Available: http://graphics.cs.msu.ru/ru/science/research/machinelearning/adaboosttoolbox
[60] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE ICCV, 2001, vol. 2, pp. 416–423.
[61] M. Maire, P. Arbelaez, C. C. Fowlkes, and J. Malik, "Using contours to detect and localize junctions in natural images," in Proc. IEEE CVPR, 2008, pp. 1–8.

Chang Cheng received the B.E. degree in electronic machinery from Hangzhou Dianzi University, Hangzhou, China, in 1995, the M.E. degree in computer science from Southeast University, Nanjing, China, in 2001, and the Ph.D. degree in electrical engineering from The University of Tennessee, Knoxville, in 2010. He is currently a Member of the Technical Staff with Riverbed Technology, Sunnyvale, CA. His research interests are in image processing, computer vision, and mobile robotics.

Andreas Koschan (M'90) received the Diplom (M.S.) degree in computer science and the Dr.-Ing. (Ph.D.) degree in computer engineering from Technical University Berlin, Berlin, Germany, in 1985 and 1991, respectively. He is currently a Research Associate Professor with the Department of Electrical and Computer Engineering, The University of Tennessee, Knoxville. His work focuses on color image processing and 3-D computer vision, including stereo vision and laser range-finding techniques. He is a coauthor of four textbooks on color and 3-D image processing. Dr. Koschan is a member of the Society for Imaging Science and Technology (IS&T).

Chung-Hao Chen received the B.S. and M.S. degrees in computer science and information engineering from Fu Jen Catholic University, New Taipei City, Taiwan, in 1997 and 2001, respectively, and the Ph.D. degree in electrical engineering from The University of Tennessee, Knoxville, in 2009. In 2009, he joined the Department of Mathematics and Computer Science, North Carolina Central University, Durham, NC, as an Assistant Professor, a position he held until 2011. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA. His research interests include object tracking, robotics, and image processing.

David L. Page received the B.S. and M.S. degrees in electrical engineering from Tennessee Technological University, Cookeville, in 1993 and 1995, respectively, and the Ph.D. degree from The University of Tennessee, Knoxville, in 2003, through the Imaging, Robotics, and Intelligent Systems Laboratory. He was a Civilian Research Engineer with the Naval Surface Warfare Center, Dahlgren, VA. He also served as a Research Assistant Professor with the Imaging, Robotics, and Intelligent Systems Laboratory, The University of Tennessee, until 2008. Within this laboratory, he was involved with a variety of research topics ranging from robotic vision systems to multivideo security systems. He is currently a Partner and Chief 3-D Architect with Third Dimension Technologies LLC, Knoxville, TN, a technology start-up company developing revolutionary 3-D displays. In this capacity, he serves as the Lead Scientist for algorithm development of 3-D rendering and computer-vision-based calibration.

Mongi A. Abidi received the Principal Engineering degree in electrical engineering from the National Engineering School of Tunis, Tunisia, in 1981 and the M.S. and Ph.D. degrees in electrical engineering from The University of Tennessee, Knoxville, in 1985 and 1987, respectively. He is a Professor with the Department of Electrical and Computer Engineering, The University of Tennessee, where he directs activities in the Imaging, Robotics, and Intelligent Systems Laboratory as an Associate Department Head. He conducts research in the field of 3-D imaging, specifically in the areas of scene building, scene description, and data visualization.
