ON THE ROLE OF STRUCTURE IN PART-BASED OBJECT DETECTION

Giuseppe Passino, Ioannis Patras, Ebroul Izquierdo

Queen Mary, University of London, Mile End Road, London, E1 4NS, UK
{giuseppe.passino,ioannis.patras,ebroul.izquierdo}@elec.qmul.ac.uk

ABSTRACT

Part-based approaches in image analysis aim at exploiting the considerable discriminative power embedded in the relations among image parts. Nonetheless, learning structural information is not always possible without a training set of classified parts, and taking this additional information into account can even degrade the performance of the system. In this paper, a discriminative graphical model for object detection is introduced and used to analyse and report results on the role of structural information in image classification tasks.

1. INTRODUCTION

The problem of automatically analysing multimedia content in a semantic way is significant, both because it involves the intriguing challenge of making a machine perform a task that is currently the prerogative of human beings, and because of the impressive number of potential practical applications. Here we focus on the problem of low-level object detection in images, which involves automatically inferring whether objects of given semantic categories are present in an image, given a set of training examples. The semantic categories present high variance in their appearance, so the challenge is to maximise the exploitation of high semantic-level information related to the role that objects have in the image.

Part-based approaches are particularly appealing for this task, since they allow images to be classified by analysing their basic components. The parts, or patches, are described according to their appearance, and their relations often carry a substantial amount of information that can help the detection process. Probabilistic Graphical Models (PGMs) have proven useful in related problems, such as semantic image segmentation, in which all the pixels of an image need to be labelled as belonging to a specific image class. Graphical structures enable a rigorous probabilistic analysis of the problem while avoiding the explicit enumeration of the full set of hypotheses, by limiting the direct dependencies among different variables. This is important since taking into account a large number of parts tends to lead to intractable problems.

However, in image segmentation problems a strongly labelled ground truth, that is, a set of fully segmented images, is usually available for training. In general, such ground truth often does not exist and is very expensive to obtain. In contrast, object detection systems rely on training data labelled at image level with the categories present in each image (weak labelling). The analogy between the two problems is not immediate, and in the latter case learning structural and appearance information is more difficult. This issue has not yet been sufficiently explored by the community.

In this paper we introduce a discriminative PGM for the object detection task, and we consider different choices of patch locations and graph structures, drawing conclusions with particular attention to the role played by the structural information. The main innovation of the contribution is the application of structural models usually employed in image labelling problems to the semantic object detection task.

The paper is organised as follows. Section 2 gives an overview of works that successfully took advantage of structural information in different tasks related to semantic multimedia analysis. Section 3 presents the proposed probabilistic model. Section 4 introduces the different structural choices that have been analysed in this work. The experimental set-up and results are reported in Section 5, and conclusions are drawn in Section 6.

The research leading to this paper was supported by the European Commission under contract FP6-027026, Knowledge Space of semantic inference for automatic annotation and retrieval of multimedia content - K-Space.

2. STRUCTURE IN PART-BASED MODELS

Structural information has been exploited successfully, by means of PGMs, in different image segmentation systems. For example, in [1] the results of a probabilistic Latent Semantic Analysis (pLSA) are constrained with a Markov Random Field (MRF) imposing Potts-like potentials on the label probabilities. Discriminative PGMs have recently enjoyed considerable success in this field.
In particular, several approaches have recently described the semantic segmentation problem with a Conditional Random Field (CRF) [2, 3, 4], a discriminative version of an MRF. By directly modelling the a posteriori probability of the labelling given the observation, a twofold advantage is achieved. On one side, there is no need to model the probability of the observation given the label of the part: this limits to a certain extent the expressive power of the model, but learning such an appearance model is a demanding task. On the other side, the over-simplifying independence assumptions usually made in generative approaches to keep the model tractable are not required and can be relaxed (this aspect has been exploited in [4] to take into account long-range part dependencies). The presence of complex connection patterns, however, makes inference in the graph intractable, and a number of alternative solutions to embed interactions have been proposed (e.g., in [2, 3]).

In the object detection area, most works focus on learning rigid structural constraints related to the appearance of object classes with a small variance (e.g., [5]). The scene as a whole is not analysed in a structured probabilistic framework; only a model for the objects is learned. By contrast, object detection via CRF has been performed in [6], where a hidden CRF (hCRF) model is proposed. In this framework, patches are represented by interest points and connected via a minimum spanning tree algorithm. Interesting results have been reported; however, no comparison with unconnected models has been made, and the role of the connections is not clear.
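As an illustration of the tree-structured connectivity mentioned above for [6], the following minimal sketch connects a handful of 2-D interest points with a minimum spanning tree. It uses Prim's algorithm on the complete Euclidean graph; the function name `minimum_spanning_tree`, the choice of Euclidean distance as edge weight and the toy coordinates are assumptions made for the example, not details taken from [6] or from this paper.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over 2-D points.

    Returns a list of (i, j) index pairs, one per tree edge.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree node realising that connection
    best_dist[0] = 0.0           # start growing the tree from point 0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest connection of every remaining point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges

# Hypothetical keypoint coordinates, for illustration only.
keypoints = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (1.0, 1.0)]
tree_edges = minimum_spanning_tree(keypoints)
```

The resulting edge list has one edge fewer than the number of points and contains no loops, which is the property that allows exact inference by belief propagation on such a graph.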

In this work we use a CRF to perform object detection and consider the role that structure plays within the graphical model framework.

3. MULTIPLE CATEGORY HCRF

3.1. Probabilistic Modelling

The introduced Multiple-category hCRF is inspired by the hCRF of [6] and the direct discriminative model in [7]. The problem is to establish which semantic classes are present in an image, given a set of weakly labelled examples. Different classes can appear simultaneously, and this can indeed provide additional co-presence information. The observation is composed of the feature vectors associated with each part. A binary random variable is associated with each image-level category, indicating its presence. A random variable is associated with each patch as well, indicating the semantic class to which it belongs. The structure of the graph embedding the dependencies among these variables is represented in Fig. 1.

Fig. 1. An example of Multiple-category hCRF.

For the detection of n object categories, image-level variables are represented by an n-dimensional vector y whose elements y_i can take the values {0, 1}. Patch-level variables related to m image parts form an m-dimensional vector h, whose elements can take values in the set of available labels L plus the “void” label (h_j ∈ L_p = L ∪ {void}). Finally, the observation is an l × m matrix X whose columns x_j are the feature vectors associated with the j-th patch. The a posteriori probability of having a given set of image-level labels and a given patch configuration is

$$p(y, h \mid X; \theta) = \frac{e^{\Psi(X, y, h, \theta)}}{Z(X; \theta)}, \qquad (1)$$

where $Z(X; \theta) = \sum_{y} \sum_{h} e^{\Psi(X, y, h, \theta)}$ is a normalisation factor and θ is the model's parameter vector. The function Ψ, called the local function, is a sum of terms that depend on connected nodes in the graph: this is the key to achieving tractability in the estimation of the normalisation factor and the marginal probabilities. We have

$$\Psi(X, y, h, \theta) = \sum_{i=1,k}^{m} \theta_k \phi^1_k(x_i, h_i) + \sum_{(i,j) \in E,\, k} \theta_k \phi^2_k(h_i, h_j) + \sum_{j=1,k}^{n} \theta_k \phi^3_k(y_j) + \sum_{i=1,j=1,k}^{m,n} \theta_k \phi^4_k(y_j, h_i), \qquad (2)$$

where the parameter index k in each sum spans the subset of parameters referring to the related functions $\phi^1_k$, $\phi^2_k$, $\phi^3_k$ and $\phi^4_k$, which depend on the involved random variables. The φ functions are indicator functions (look-up tables), and the values of the $\phi^1$ functions are weighted with the coefficients of the patch feature vectors $x_i$. E is the set of index pairs associated with edges in the graph. The part labels are never assigned explicitly, so the information is given by the probability of an image-level labelling given the observation, $p(y \mid X; \theta) = \sum_{h} p(y, h \mid X; \theta)$. The patch layer has the role of an intermediate latent layer that is averaged out in the estimation of the presence of different classes.

3.2. Training and Inference

The training of the model can be performed via a Maximum Likelihood (ML) approach, in which the probability of the image-level labelling of the training set is maximised with respect to the parameters of the model. The target function for the optimisation algorithm is

$$L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid X_i; \theta) - \frac{\|\theta\|^2}{2\sigma_\theta^2},$$

where N is the cardinality of the training set. The last term is due to a Gaussian prior, introduced in order to prevent the model from overfitting the training data. The parameter $\sigma_\theta^2$ can be obtained via cross-validation. The optimum solution cannot be calculated in closed form due to the presence of the

normalisation factor Z(X; θ) in (1), and the likelihood is not a convex function of θ due to the hidden layer. Here, we use a Newton gradient ascent method to find a local optimum in an iterative way. We impose an initialisation that favours smooth patch-level configurations. The gradient of the contributions $L_i = \log p(y_i \mid X_i)$ can be evaluated analytically using (1) and averaging out all the variables not in the domain of the $\phi_k$ associated to the vector component $\theta_k$ with respect to which the derivative is computed. Therefore, the derivatives will take one of the forms

$$\frac{\partial L_i}{\partial \theta_k}(\theta) = \sum_{u,l} p(h_u = l \mid y_i, X_i)\, \phi^1_k(x_{iu}, l) - \sum_{u,l} p(h_u = l \mid X_i)\, \phi^1_k(x_{iu}, l) \qquad (3)$$

$$\frac{\partial L_i}{\partial \theta_k}(\theta) = \sum_{(u,v) \in E,\, l_1, l_2} p(h_u = l_1, h_v = l_2 \mid y_i, X_i)\, \phi^2_k(l_1, l_2) - \sum_{(u,v) \in E,\, l_1, l_2} p(h_u = l_1, h_v = l_2 \mid X_i)\, \phi^2_k(l_1, l_2) \qquad (4)$$

$$\frac{\partial L_i}{\partial \theta_k}(\theta) = \phi^3_k(y^{i}_{j_k}) - \sum_{y_{j_k}} p(y_{j_k} \mid X_i)\, \phi^3_k(y_{j_k}) \qquad (5)$$

$$\frac{\partial L_i}{\partial \theta_k}(\theta) = \sum_{u,l} p(h_u = l \mid y_i, X_i)\, \phi^4_k(y^{i}_{j_k}, l) - \sum_{u,l,l_y} p(h_u = l, y_{j_k} = l_y \mid X_i)\, \phi^4_k(l_y, l) \qquad (6)$$
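The gradient forms (3)-(6) share one pattern: an expectation computed with the image-level labels clamped to their training values, minus the same expectation with the labels left free. The following is a minimal brute-force sketch of that pattern on a toy model (two patches, one binary category, two patch labels, a single edge). It enumerates every configuration instead of running BP, and checks form (4) against a finite difference of log p(y|X). The parameter layout (`app`, `pair`, `cat`, `link`), the numeric values and all function names are hypothetical choices for the example, not the authors' implementation.

```python
import itertools
import math

LABELS = (0, 1)      # toy patch labels: 0 = "void", 1 = object class
EDGES = [(0, 1)]     # a single edge between the two patches

def psi(X, y, h, theta):
    """Local function in the spirit of Eq. (2), for one binary category y."""
    s = 0.0
    for i, x in enumerate(X):       # phi^1: appearance, weighted by features
        s += sum(w * xi for w, xi in zip(theta["app"][h[i]], x))
    for i, j in EDGES:              # phi^2: patch-patch compatibility
        s += theta["pair"][h[i]][h[j]]
    s += theta["cat"][y]            # phi^3: category term
    for hi in h:                    # phi^4: category-patch compatibility
        s += theta["link"][y][hi]
    return s

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def configurations(m):
    return list(itertools.product(LABELS, repeat=m))

def log_p_y(X, y, theta):
    """log p(y | X): the hidden patch labels h are summed out by enumeration."""
    hs = configurations(len(X))
    num = log_sum_exp([psi(X, y, h, theta) for h in hs])
    den = log_sum_exp([psi(X, yy, h, theta) for yy in (0, 1) for h in hs])
    return num - den

def pair_gradient(X, y, theta, l1, l2):
    """Form (4): clamped minus free expectation of the pairwise indicator."""
    hs = configurations(len(X))

    def expectation(ys):
        scored = [(psi(X, yy, h, theta), h) for yy in ys for h in hs]
        log_z = log_sum_exp([s for s, _ in scored])
        return sum(
            math.exp(s - log_z)
            * sum(h[i] == l1 and h[j] == l2 for i, j in EDGES)
            for s, h in scored
        )

    return expectation([y]) - expectation([0, 1])

# Hypothetical parameters and features, for illustration only.
theta = {
    "app": [[0.2, -0.1], [-0.3, 0.4]],
    "pair": [[0.5, -0.2], [-0.2, 0.5]],
    "cat": [0.0, 0.1],
    "link": [[0.3, -0.3], [-0.3, 0.3]],
}
X = [(1.0, 0.5), (0.2, 1.5)]

analytic = pair_gradient(X, 1, theta, 1, 1)

# Central finite difference on the same parameter.
eps = 1e-6
theta["pair"][1][1] += eps
up = log_p_y(X, 1, theta)
theta["pair"][1][1] -= 2 * eps
down = log_p_y(X, 1, theta)
theta["pair"][1][1] += eps
numeric = (up - down) / (2 * eps)
```

Enumeration is only feasible for such a tiny graph; its cost grows exponentially with the number of patches, which is why the marginals are obtained by message passing in practice.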

Which of these forms applies depends on whether the index k of the parameter vector is associated with a function $\phi^1$, $\phi^2$, $\phi^3$ or $\phi^4$, respectively. Marginal probabilities in (3)-(6) are calculated using a Belief Propagation (BP) algorithm. All the terms appearing in the gradient can be calculated with only two rounds of BP, one on an unconditioned graph similar to the one in Fig. 1, and the other on a conditioned graph in which the image-level label nodes are assigned and treated as priors on the patch labels.

4. STRUCTURAL CONFIGURATIONS

In this work parts are taken on salient points or on a regular grid. The first approach is preferred in object detection systems, where the actual configuration of keypoint positions can be useful to discriminate different objects. On the other side, segmentation problems often rely on regular structures that allow a uniform coverage of the image area [2, 4]. Having patches on a regular grid additionally eases the patch labelling process. The SIFT algorithm [8] has been adopted for salient point location and feature extraction.

The choice of the connections among the patches is another key point in the system design. The reason is twofold: on one side, the connections determine the dependencies taken into account in the system; on the other side, a highly connected graph makes inference in the model inefficient and approximate, since only in loop-free graphs can inference be performed both exactly and efficiently. For regularly placed patches, a rectangular lattice is the most obvious solution. Loopy Belief Propagation (LBP) is used to perform the inference efficiently. For patches taken at specific interest points, the imposed structure is a Minimum Spanning Tree,

similarly to [6], and the inference can be performed exactly via BP. Additionally, models without inter-patch connections have been analysed in this work.

A complication in the inference is due to the presence of the category nodes. They significantly reduce the average distance between two nodes in the graph. In message passing algorithms such as BP this creates the undesired side-effect of amplifying the effect of a fluctuation in the node probability distributions. The proposed solution relies on an optimised schedule for the message passing procedure in LBP. A common approach to decide the order in which perturbations in the probabilities should propagate within the network is a biggest-first ordering. With this policy, the messages carrying the greatest perturbation are propagated first, resulting in an efficient schedule, with the drawback that each time a message is passed, the perturbations of all the influenced messages have to be re-estimated.

In our system some knowledge about the nature of the problem can be exploited. It is known that all the patch-level variables contribute equally to the category variable distributions, and a single contribution will therefore have little influence on them. For this reason, the inference is performed by iterating message propagation over two node sets: category nodes and patch nodes. Within each set (intra-set message passing), the messages are propagated according to the biggest-first schedule. For the (unconnected) category node set, the inner propagation converges in one step. For the patch-level node set, even in the case of a regular grid, the biggest-first schedule leads to good results in terms of convergence time. This scheme is depicted in Fig. 2. Piecewise convergence has been proposed for LBP by other authors in relation to different problems (e.g., [9]).

Fig. 2. Piecewise message passing schedule for efficient convergence of the LBP. The grayed nodes act as priors and their probability distribution is fixed.

5. EXPERIMENTAL RESULTS

Experiments are run on the Microsoft Research database of images containing objects of nine different semantic categories¹. We took into account five categories, namely “building”, “grass”, “tree”, “cow” and “sky”. The categories are broad in terms of appearance; some of them can be considered foreground, while others are background. Four different configurations of the system have been tested. Regular rectangular 20 by 20 patches have been considered for the configurations based on a regular grid, and patches depending on the scale of the interest points otherwise. The first method achieves a better coverage of the image area, so that all the concepts are uniformly represented, while the second choice leads to more descriptive feature vectors. Only single image-level category models have been tested for this work.

¹ http://research.microsoft.com/vision/cambridge/recognition/default.htm

Detection results for the different set-ups of the model are shown in Table 1. When considering patches on detected points without spatial connections, results are comparable with [7], since the expressivity of the model is similar.

Set-up                  build.   grass   tree   cow   sky
no struct., int. pts    91%      84%     91%    89%   87%
no struct., grid        73%      89%     67%    78%   87%
struct., int. pts       84%      87%     93%    84%   84%
struct., grid           69%      82%     69%    80%   87%

Table 1. Image-level labelling accuracies for different concepts with different set-ups of the model.

As a first result, it is possible to notice that the results for the model with patches placed on interest points are generally better. This is due to the fact that those points are usually more stable and descriptive than the ones whose position is forced by a rigid placement. The only exceptions are concepts like “sky” and “grass”, which are smooth areas and are therefore not generally well covered by the interest point locator. For this reason, forcing a regular coverage increases the representativeness of the patches for these concepts.

The results show that structural information is not always useful to improve detection results. Its first effect is the increased complexity of the model, in terms of both the number of parameters to tune in the training phase and the complexity of BP in the graph. Improvements are reported only on the tree and grass concepts, for the model based on keypoints. This can be explained by the fact that these categories tend to present a homogeneous coverage of the same type of textures, so that structural (proximity) information can be helpful in improving the confidence. The sky category does not present such improvements because, since few keypoints fall in the sky regions, the model tends to discriminate the concept with the help of other (usually unstructured) keypoints falling on objects of different categories. To give a more detailed view of the retrieval performance of the algorithm, Precision-Recall curves of the tested models for patches located on interest points are shown in Fig. 3.

Fig. 3. Precision-Recall curves for the retrieval of images of different categories, for patches taken on interest points, either considering structure (right) or not (left).

6. CONCLUSIONS

We have introduced a framework for object detection in images using discriminative graphical models, in order to evaluate the impact of structural information embedded as links between nodes in a CRF. Results for different system configurations have been reported, together with a discussion of the usefulness of information of different nature. The results found in this work help to evaluate the role played by structural information in object detection via graphical models. This study can give further guidance on the introduction of additional ways to embed interactions between different categories within images.

7. REFERENCES

[1] J. Verbeek and B. Triggs, “Region classification with Markov field aspect models,” in CVPR ’07, 2007, pp. 1–8.
[2] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, “Multiscale conditional random fields for image labeling,” in CVPR ’04, 2004, vol. 2, pp. 695–702.
[3] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in ECCV ’06, 2006, vol. 3951 LNCS.
[4] J. Verbeek and B. Triggs, “Scene segmentation with CRFs learned from partially labeled images,” in NIPS ’07, 2007.
[5] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” IJCV, vol. 61, no. 1, pp. 55–79, Jan. 2005.
[6] A. Quattoni, M. Collins, and T. Darrell, “Conditional random fields for object recognition,” in NIPS ’04, 2004.
[7] I. Ulusoy and C. M. Bishop, “Generative versus discriminative methods for object recognition,” in CVPR ’05, 2005, pp. 258–265.
[8] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
[9] F. Hamze and N. de Freitas, “From fields to trees,” in AUAI ’04, 2004, pp. 243–250, AUAI Press.
