IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

Incorporating the Loss Function into Discriminative Clustering of Structured Outputs Wenliang Zhong, Weike Pan, James T. Kwok, and Ivor W. Tsang

Abstract— Clustering using the Hilbert Schmidt independence criterion (CLUHSIC) is a recent clustering algorithm that maximizes the dependence between cluster labels and data observations according to the Hilbert Schmidt independence criterion (HSIC). It is unique in that structure information on the cluster outputs can be easily utilized in the clustering process. However, while the choice of the loss function is known to be very important in supervised learning with structured outputs, we will show in this paper that CLUHSIC is implicitly using the often inappropriate zero-one loss. We propose an extension called CLUHSICAL (which stands for “Clustering using HSIC and loss”) which explicitly considers both the output dependency and loss function. Its optimization problem has the same form as CLUHSIC, except that its partition matrix is constructed in a different manner. Experimental results on a number of datasets with structured outputs show that CLUHSICAL often outperforms CLUHSIC in terms of both structured loss and clustering accuracy. Index Terms— Clustering methods, dependence Maximization, Hilbert Schmidt independence criterion, loss, structured clustering.

I. I NTRODUCTION

C

LUSTERING aims at grouping examples that are similar to each other into a small number of classes or clusters. It is an invaluable data-analysis tool that has been widely used in diverse domains ranging from engineering, medical science, earth science, social science to economics [1]. Over the decades, a battery of clustering approaches has been developed. In general, these can be considered as representing three different views of clustering [2]. The first one is the geometric view, which includes the classic k-means clustering algorithm [3]. The second one is the statistical view, with examples including the method of information bottleneck [4] and mixture models [5]. The last one is the spectral view, which includes methods such as the normalized cut and various spectral clustering algorithms [6]. Clustering using the Hilbert Schmidt independence criterion (CLUHSIC) [2] is a recent clustering algorithm that unifies these different views

Manuscript received April 23, 2010; revised July 11, 2010; accepted July 13, 2010. Date of publication August 23, 2010; date of current version October 6, 2010. This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region, under Grant 614508. W. Zhong, W. Pan, and J. T. Kwok are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China (e-mail: [email protected]; [email protected]; [email protected]). I. W. Tsang is with the School of Computer Engineering, Nanyang Technological University, 639798, Singapore (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2064177

of clustering. It finds the cluster partitioning such that the resultant cluster labels are maximally dependent on the data observations according to the Hilbert Schmidt independence criterion (HSIC) [7]. It can be shown that all the above views of clustering are special cases of CLUHSIC [2]. CLUHSIC is also a kernel-based method that can be used on structured data. While traditional kernel methods are mainly focused on vectorial inputs and outputs, there is growing interest in extending them to more complex domains with structured data. This has been driven in part by the tremendous amount of structured data available online, such as Internet documents residing in a hierarchy and bioinformatics databases containing DNA sequences. In general, structure information may be present in the inputs and/or outputs. For structured inputs, a variety of kernels have been developed for strings, trees, and graphs [8]. Traditionally, kernel-based clustering algorithms focus only on using kernels defined in the input space [9], [10], On the other hand, CLUHSIC has the important advantage that kernels can be used on both the input and output (cluster labels). Thus, it can be used for clustering data into hierarchies, chains, or even graphs. Recently, Xu et al. [11] proposed a related discriminative clustering algorithm for structured outputs. It is based on the idea of maximum margin clustering [12]–[14] that trains a support vector machine (SVM) by maximizing the margin and minimizing the loss over all possible cluster labels. However, maximum margin clustering is computationally much harder than maximum margin classification. Existing methods typically rely on reformulating and relaxing the non-convex optimization problem as semidefinite programs (SDPs) that are computationally expensive. To combat this problem, Xu et al. [11] proposed a heuristic procedure that avoids SDP entirely. However, as will be shown in Section IV, its empirical performance is not satisfactory. Very recently, Bach and Harchaoui [15] proposed a more efficient convex relaxation which can be solved as a sequence of singular value decompositions. However, this is designed only for vectorial data but not structured data. Moreover, loss functions are not used. Back to the realm of supervised learning, it is well known that the popular zero-one loss function is often inappropriate for structured prediction [16], [17]. The joint kernel k((x, y), (x , y )) = ϕ(x, y), ϕ(x , y ) defined over both input features x and output labels y can be used to measure the similarity based on the input structure and the output label structure. However, even with a good joint kernel, prediction of the structured output may not always be correct due to

1045–9227/$26.00 © 2010 IEEE

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

a

c

b

g Level 1

a

c

b

a

c

b

Level 2

Level 2 (b)

g Level 1

Level 2

Level 3

Level 3 (a)

g Level 1

1565

Level 3 (c)

Fig. 1. Example showing that the misclassified patterns are often assigned to faraway clusters by CLUHSIC. A blue dot indicates a correct cluster assignment, while a red cross indicates an incorrect assignment. (a) Ground truth. (b) CLUHSIC. (c) CLUHSICAL.

insufficient or noisy training examples. Consider, for example, a document that belongs to the category “apple” in the hierarchical classification of text documents. Misassigning this document to a category (say, “orange”) near the true label is much better than assigning it to a distant unrelated category (say, “moon”), although the losses in both cases are one according to the zero-one loss function. Hence, to measure the severity of different errors during training (or, in other words, to discourage patterns from being assigned to distant unrelated categories), the loss function also needs to utilize the structure information. In particular, one can define a loss function (y, u) that penalizes the difference between two outputs y and u according to the structure of the two objects. In general, there are many ways to define the loss function and each structured prediction problem may call for its own loss function. For example, in hierarchical classification, one can use the tree loss, which is defined as the height of the first common ancestor of the true and predicted labels in the hierarchy [17]. In natural language processing, the loss may be defined in terms of the underlying grammar. The above observation also holds for clustering. Consider the example in Fig. 1, in which we show the cluster assignments for a subset of patterns that all belong to the same leaf node. Here, a blue dot indicates a correct cluster assignment while a red cross indicates an incorrect assignment. As shown in Fig. 1(a), ideally all these patterns should be clustered to the same leaf. Despite the fact that CLUHSIC is designed for clustering problems with structured outputs, it does not explicitly consider the use of a loss function. Indeed, as will be shown later in this paper, CLUHSIC essentially uses the often inappropriate zero-one loss function. Fig. 1(b) shows the clustering result of CLUHSIC. As can be seen, the misclassified patterns are often assigned to faraway clusters. In this paper, we extend CLUHSIC so that loss functions can be explicitly considered. Fig. 1(c) shows the clustering result of the proposed method CLUHSICAL, which will be described in more detail in later sections. As can be seen, although the obtained clustering is still imperfect as compared to the ground truth, the patterns are only misassigned to nearby categories. The rest of this paper is organized as follows. Section II first gives a brief review on the CLUHSIC algorithm. Section III then proposes a clustering algorithm for structured outputs based on kernel ridge regression. Motivated by [11] and [15], we perform discriminative clustering by kernel ridge regression for structured data, whose optimization problem is simpler than that of the structural SVM [17]. In particular, we show that CLUHSIC can be considered as a special case with the use of a particular joint kernel and the zero-one loss. Clustering experiments on a number of datasets with structured

outputs are presented in Section IV, and the last section gives some concluding remarks. In the sequel, the transpose of vector/matrix is denoted by the superscript , the identity matrix by I, and the vector of all ones by 1. Moreover, tr(A) denotes the trace of a matrix A, A p = supx=0 Ax p /x p is the matrix p-norm of A, and ◦ the elementwise matrix multiplication. II. CLUHSIC A LGORITHM CLUHSIC [2] is a clustering algorithm based on the dependence maximization view. Given a sample S = {(x1 , y1 ), . . . , (xm , ym )}, the linear dependence between xi ’s and yi ’s can be easily estimated by simple statistics such as linear correlation. However, nonlinear dependencies are more difficult to measure. A recently proposed dependence (or, more precisely, independence) measure is the HSIC [7]. Specifically, let φ and λ be the feature maps on the input and output, respectively. Denote the corresponding reproducing kernel Hilbert space (RKHS) by F and G, and the corresponding kernels by k and l. as The cross-covariance operator Cx y : F → G is defined [18] Cx y = Exy (φ(x) − E[φ(x)]) ⊗ (λ(y) − E[λ(y)]) where ⊗ is the tensor product. HSIC is then defined as the square of the Hilbert-Schmidt norm · H S of Cx y HSIC(F , G) = Cx y 2H S

= Exx yy [k(x, x )l(y, y )] + Exx [k(x, x )]Eyy [l(y, y )] − 2Exy [Ex k(x, x )][Ey l(y, y )].

Given the sample S, it can be shown that an empirical estimate of HSIC is (m − 1)−2 tr(HKHL) (1) where K = [k(xi , x j )], L = [l(yi , y j )] are kernel matrices defined on {x1 , . . . , xm } and {y1 , . . . , ym }, respectively, and H = I − m1 11 is the so-called centering matrix. In the sequel, we assume that K is centered, and so (1) reduces to (m −1)−2 tr(KL). Recent studies [7], [19] show that HSIC has several advantages over other independence measures. First, its empirical estimate in (1) is easy to compute. Moreover, it guarantees good uniform convergence and has very little bias even in high dimensions. To use HSIC in clustering, one first defines a kernel matrix A ∈ Rc×c on a set of c clusters. This is used to model the prior structural relationships and correlation among clusters. In general, kernel entries for clusters that are structurally close to each other will be assigned high values. For example, if the user specifies that the resultant clusters should have a chain structure as in Fig. 2(a), one can use the output kernel

1566

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

(a)

(b)

kernel ridge regression (KRR) [22], which is also known as the least squares SVM [23]. KRR uses the square loss instead of the hinge loss, and leads to a simpler optimization problem than the SVM. This is then extended to the unsupervised learning setting in Section III-B.

(c)

Fig. 2. Some common structures among clusters. (a) Chain. (b) Ring. (c) Tree.

2 1 0 0 matrix 10 21 12 01 . Here, the leftmost cluster corresponds to the 0012

first row/column of the kernel matrix, the second-left cluster corresponds to the second row/column of the kernel matrix, and so on. Similarly, for more complicated structures such 1 (c)], as ring and tree 2 1[Fig. 2(b) and the corresponding kernel 01 2100 matrices are 10 21 12 01 and 10 20 02 01 respectively. 1012

0012

Let be a partition matrix such that its i th row specifies the assignment of the i th pattern to one of the c clusters, i.e., i j ∈ {0, 1} and 1 = 1. The kernel matrix L defined on the yi ’s can then be written as L = A . CLUHSIC aims at finding the cluster assignment such that the resultant L is maximally dependent on the kernel matrix K defined on the xi ’s according to the HSIC. Mathematically, this is formulated as the following optimization problem: max s.t.

tr(KA ) 1 = 1, i j ∈ {0, 1}.

(2)

In [2], this integer programming problem is solved by greedy local search. Specifically, the partition matrix is updated row by row. For the i th row, one assigns the corresponding i th pattern to the cluster with the largest objective value. This process is re-iterated until the assignment matrix does not change. More sophisticated approaches, such as spectral methods [20] and nonnegative matrix factorization [21], are also discussed in [19]. As discussed in [2], CLUHSIC is very flexible as one can use different kernels of K and A in (2). For example, on setting A = I, one recovers standard (kernel) k-means, which implicitly assumes that there is no relationship between the clusters. Alternatively, on setting ⎤⎞ ⎛⎡ wi , wi , . . . , wk ⎦⎠ A = diag ⎝⎣ i∈·1 i∈·2 i∈·k one recovers the weighted k-means where the i th pattern is associated with a weight wi representing its importance. In general, CLUHSIC can be considered as a family of clustering algorithms that unify the traditional geometric, spectral, and statistical dependence views of clustering. Interested readers are referred to [2] for further details. III. C LUSTERING FOR S TRUCTURED O UTPUTS In Section III-A, we first consider the easier setting of supervised learning for structured outputs. Instead of using the SVM as in [16], [17], we consider a related method called 1 Note that for the tree structure in Fig. 2(c), only the four leaves are the observed clusters. The internal nodes and root node are latent and so not involved in defining the output kernel matrix.

A. Kernel Ridge Regression for Structured Outputs For learning with structured outputs (such as sequences, trees, or graphs), it is often more convenient to use a joint feature representation ϕ that is defined on both the input X and output Y [17]. This allows the many-sided dependencies between X and Y to be captured. As in other kernel methods, this joint feature map is related to a joint kernel k as k((x, y), (x , y )) = ϕ(x, y), ϕ(x , y ). Given a set of training patterns {(xi , yi )}m i=1 , KRR then obtains the classification function w, ϕ(x, y) by solving the following optimization problem: C 2 1 w2 + ξiy min w,ξiy 2 m i,y =yi

ξiy , ∀i, ∀y = yi s.t. w, δϕ i (y) = 1 − √ (yi , y)

(3)

where ξiy ’s are slack variables for the errors, δϕ i (y) ≡ ϕ(xi , yi )−ϕ(xi , y) is the feature map for structured prediction, (yi , y) is a loss function penalizing the difference between yi and y, and C is a user-defined regularization parameter. Its main difference with the SVMs primal is that we now have equality constraints instead of inequality constraints. Moreover, note that the slack variables in (3) are scaled with the inverse loss, and this is often called slack re-scaling. Another approach, called margin re-scaling [16], scales the margin by the loss. In this paper, we will focus on slack rescaling, and extension to margin re-scaling is straightforward. Besides, unlike the SVM-struct [17] that uses only one slack variable for each training instance, here we use one slack variable for each (i, y) pair. The following proposition shows that this leads to an efficient optimization problem. Proposition 1: The dual of (3) can be written as 1 (4) max α 1 − α Qα α 2 where α = [αiy ] is the vector of Lagrange multipliers, Q = [Q iy, j y¯ ] is a positive definite matrix with entries δiy j y¯ m . √ 2C (yi , y) (y j , y¯ ) The proof is in Appendix I. Later on, it will be more convenient to write Q as J + S, where Q iy, j y¯ = δϕ i (y), δϕ j (¯y) +

J = [δϕ i (y), δϕ j (¯y)]

(5)

and S is a diagonal matrix with the (i y)th entry (∀i, y = yi ) Siy =

1 m . 2C (yi , y)

(6)

Since the objective function in (4) is a quadratic function with variable α, it is an unconstrained quadratic programming (QP) problem. Its optimum α can be easily obtained as α ∗ = (J + S)−1 1. Plugging this back in (4), the corresponding optimal

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

primal/dual objective value (which are equal as convex QPs have zero duality gap) is then equal to 1 1 (J + S)−1 1. (7) 2 In general, there can be different ways of defining the joint kernel and the corresponding joint feature map ϕ. In this paper, we focus on the construction via the tensor product ϕ(x, y) = φ(x) ⊗ λ(y)

(8)

where φ and λ are feature maps defined on X and Y, respectively. This feature map has been popularly used in classification with taxonomies [17]. Moreover, it can be shown that its joint kernel ϕ(x, y), ϕ(¯x, y¯ ) is simply a product of the corresponding input kernel κ(x, x¯ ) = φ(x), φ(¯x ) and output kernel (y, y¯ ) = λ(y), λ(¯y) [17]. B. CLUHSICAL Algorithm While the training outputs yi ’s are known in classification, they are missing in clustering. Hence, as in maximum margin clustering [13], we want to find {y1 , . . . , ym } such that the objective in (7) is minimized min 1 (J + S)−1 1.

(9)

y1 ,...,ym

However, because of the rather complicated definition of J + S, a direct minimization of 1 (J + S)−1 1 is difficult. With the use of the tensor-product feature map in (8), the following proposition shows that this optimization problem can be simplified by using first-order approximation (the proof is in Appendix II). Proposition 2: Suppose that the tensor-product feature map in (8) is used, and the corresponding joint kernel k(·, ·) is scaled by a factor γ > 0 to γ k(·, ·), where γ <

1 S−1

(10)

p J p

w.r.t. some matrix p-norm · p . Then, as a first-order approximation, (9) can be formulated as max s.t

˜ ˜ 1 ) − C Ctr(K A m i· is as defined in (12)

(11)

where K = [κ(xi , x j )] ∈ Rm×m is the kernel matrix defined on the input A = [ (yi , y j )] ∈ R|Y |×|Y | is the kernel matrix is a matrix such defined on the output C˜ = 4C 2 γ /m 2 , and that its i th row (corresponding to pattern i ) is defined as i· = − (yi , y1 ), · · · − (yi , y −1 ), (yi , y), −(yi , y +1 ), · · · − (yi , y|Y | )

y =yi

(12)

where Y = {y1 , . . . , y|Y | } and output yi of the i th pattern is equal to some y ∈ Y. Problem (11) involves an additional scaling parameter γ ˜ and this may appear cumbersome (hidden in the parameter C), in practical applications. Interestingly, it will be shown in the

1567

following that γ can be removed from the optimization with a In CLUHSIC proper normalization of the partition matrix . [2], each column of is often normalized to have unit 2 -norm [i.e., · j 2 = 1] before aligning K with A. This ensures that the solution is not biased toward clusters having more samples. A similar normalization can also be in (11). However, here, it is more performed on the with the 1 convenient to normalize each column of norm, because 1 in (11) is then a constant and can be dropped, along with the associated C˜ parameter (and thus also by . ¯ The the γ parameter). Denote the 1 -normalized optimization problem (11) then simplifies to ¯ ¯ ). max tr(KA ¯

(13)

This is of the same form as CLUHSIC, except that the (12). In the loss function is now explicitly considered in sequel, this will be called “CLUstering using HSIC And Loss” (CLUHSICAL). C. Relationship with CLUHSIC Unlike the CLUHSIC algorithm, CLUHSICAL allows the explicit specification of a loss function. In particular, when the zero-one loss is used (i.e., (yi , y j ) = 1 for yi = y j ), the following corollary shows that the unnormalized version of CLUHSIC and CLUHSICAL in (11) will yield the same clustering solution. The proof is in Appendix III. Corollary 1: With the zero-one loss, (11) without normalization reduces to CLUHSIC. While CLUHSIC and CLUHSICAL are equivalent under the zero-one loss before normalization (which is also verified experimentally), they can have different results after normalization. Recall that the main difference between CLUHSIC and CLUHSICAL lies in the definition of the partition matrix. In the following, we illustrate that the partition matrix in (12) is more advantageous than the one in CLUHSIC by studying the simplest clustering algorithm in the CLUHSIC family, namely (kernel) k-means clustering. Experimental results in Section IV-C also show that CLUHSICAL often performs better than CLUHSIC even when both are trained with the zero-one loss. Consider the objective |Y | m

|ic |sgn(ic )φ(xi ) − µc 2F

(14)

c=1 i=1

where is an unnormalized partition matrix and F is the feature space induced by the kernel k. Let m c be the number of patterns belonging to the cth cluster and assume that all clusters are of the same size, i.e., m c = m/|Y|. The following lemmas show that the optimization problems of the 1 -normalized CLUHSIC and CLUHSICAL are of the form in (14) (proofs are in Appendix III). Lemma 1: The optimization problem of the 1 -normalized CLUHSIC with A = (m/|Y|)I is of the form (14) with 1 φ(xi ) ∈ c, (15) ic = /c 0 φ(xi ) ∈

1568

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

and the corresponding centers can be obtained as 1 φ(xi ), c = 1, . . . , |Y|. µc = mc

(16)

a

Level 1

g Level 1

c

b

Level 2

Level 2

φ(xi )∈c

Lemma 2: The optimization problem of the 1 -normalized CLUHSICAL with A = ((2|Y| − 1)m/|Y|)I is of the form (14) with |Y| − 1 φ(xi ) ∈ c, (17) ic = /c −1 φ(xi ) ∈

Level 3

Level 3

(a)

(b)

Level 1

and the corresponding centers can be obtained as µc =

m i=1

ic φ(xi ), j =1 | j c |

m

c = 1, . . . , |Y|.

(18)

Note that the As in the two lemmas differ only by the constant 2(|Y| − 1), which is immaterial. Hence, we can examine the difference between the 1 -normalized versions of CLUHSIC and CLUHSICAL by considering the two different settings of ic ’s in (14). Note that, for CLUHSIC, ic is nonzero only when φ(xi ) is in cluster c. On the other hand, for CLUHSICAL, ic in (17) is positive when φ(xi ) belongs to cluster c, and negative otherwise. Hence, from (14), µc will be drawn close to φ(xi ) if φ(xi ) belongs to cluster c, otherwise, µc will be pushed away from φ(xi ). This thus resembles the effect of a margin and is also similar to the rival penalized competitive learning algorithm [24], which has better performance than standard k-means clustering. IV. E XPERIMENTS In this section, we perform experiments on a number of datasets, whose outputs have the structure of a hierarchy (Section IV-A) or a ring (Section IV-B). All these patterns have been manually grouped into different categories, and these will be used as the ground truth cluster labels in the performance evaluation. A. Hierarchy-Structured Datasets 1) Data Sets and Input Kernels: Structural classification of proteins (SCOP) data: The SCOP database [25] provides a detailed description of the structural relationships of all the known proteins. Here, we use the subset extracted in [26] with 15 superfamilies. Ten proteins are selected for each superfamily, leading to a total of 150 patterns. The classification of the proteins is based on a hierarchical structure. The lowest level consists of 15 superfamilies, the next higher level is the fold, then the class, and finally the root [Fig. 3(a)]. We use the three alignment kernels (K1Al , K2Al , and K3Al ) proposed in [26] as input kernels. These alignment kernels are inspired from the convolution kernel in [27], and are constructed based on kernels defined on the protein’s constituent substructures. World Intellectual Property Organization (WIPO) data: This is a set of patent documents from the WIPO and has been popularly used [17], [28], [29]. We use the six largest categories in Section D (each category contains at least 50 documents), with a total of 354 documents. The corresponding hierarchical structure is shown in Fig. 3(b). The linear kernel,

Level 2

(c) Fig. 3. Hierarchically structured datasets. (a) SCOP. (b) WIPO. (c) Facial expression.

which is often appropriate for text classification, is used on the input. Facial expression data: This data set2 has been used in the CLUHSIC paper [2]. It consists of 185 images (each of size 217 × 308) of three types of facial expressions from three subjects. The facial expressions of the same person are first grouped together in the hierarchy [Fig. 3(c)]. As in [2], each pixel of the face image is normalized in each dimension and the Gaussian kernel is used as the input kernel. 2) Experimental Setup: We follow [17] to construct the output feature map for these hierarchically structured data sets. Let Z be the set of nodes in the hierarchy, and let the hierarchy structure be represented by the partial order ≺, where z ≺ y means that node z is an ancestor of node y. A feature λz is then 1 z ≺ y or z = y defined with every node z, as λz (y) = . 0 otherwise By excluding the feature associated with the root node (which is shared by all the classes), it can be easily seen that this feature map is also the same as the one used in [2]. We will compare CLUHSICAL with two existing clustering algorithms for structured data. 1) CLUHSIC in [2]. 2) Discriminative unsupervised training algorithm (labeled “XWSS” in the sequel) in [11]. Since XWSS is based on SDP and so can only be used on very small datasets, we will adopt the heuristic iterative procedure proposed in [11]. The optimization of both CLUHSIC and CLUHSICAL involve integer programming. Since the focus of this paper is not on how to solve this integer program, we will simply follow CLUHSIC in [2] and obtain an approximate solution by greedy local optimization. Its time complexity per iteration is O(mk), and the procedure often converges in fewer than 20 iterations. More sophisticated approaches, such as spectral methods [20] and nonnegative matrix factorization [21] are also discussed in [19]. For further benchmarking, we also report results of kernelized versions (using the same kernel as CLUHSICAL) of several agglomerative hierarchical clustering methods [30], including the following. 2 http : //www.it.usyd.edu.au/ ∼/lesong/cluhsic_datasets.html

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

1569

6

6

7

7

8

8

9

9 10

10 (a)

(b)

(88.7%, 43.3%, 43.3%; 0.68, 1.25)

(62.0%, 42.0%, 42.0%; 0.960, 1.540)

Fig. 4. Example of the dendrogram and clustering result obtained on the facial expression data. Each color corresponds to a level-1 cluster, and each rectangle is a level-2 cluster. Since the dataset has 185 samples, the whole dendrogram has 185 leaves. To reduce clutter, we only show the top 30 clusters.

6

6

7

7

8

8

9

9

(c)

(d)

(62.0%, 42.0%, 42.0%; 0.960, 1.540)

1) Average linkage, which defines the dissimilarity between two clusters A and B as D(A, B) = (1/n A n B ) i∈ A, j ∈B d(ai , b j ). Here, n A and n B are the numbers of instances in clusters A and B, respectively, and d(ai , b j ) is the distance (induced by the same kernel used in CLUHSICAL) between instances ai and b j . 2) Complete linkage, with D(A, B) = maxi∈A, j ∈B d(ai , b j ). 3) Ward’s linkage [31], with D(A, B) = n AnB 2 n A +n B d(A m , Bm ) , where A m , B M are the cluster means of A and B, respectively, and hierarchical k-means algorithm. Since standard agglomerative hierarchical clustering and k-means do not utilize structure information, we implement a structure-sensitive variant as follows. Take as an example the facial expression data [Fig. 3(c)], which has three level-1 clusters and three level2 sub-clusters under each cluster. For hierarchical clustering, we first run the standard algorithm and obtain the complete dendrogram. By going down the root using the dendrogram, we can obtain the three level-1 clusters. For each of these, we further trace down the corresponding branch of the dendrogram and obtain the three sub-clusters (Fig. 4). For the more complex hierarchies in Fig. 3(a) and (b), we align the obtained clusters so that the one with the largest variance corresponds to the tree node having the largest subtree. The hierarchical kmeans procedure is similar, except that instead of going down the dendrogram, we perform kernel k-means to divide a node into several sub-clusters. In general, since hierarchical clustering does not utilize structure information in constructing the dendrogram, there is no guarantee that the variants used in our experiments (average linkage, complete linkage, and Ward’s linkage) can always produce clustering results that conform to the given structure. Nevertheless, empirically, they are able to do so in our experiments. For performance evaluation, we will use the accuracy and tree loss, which is defined as the height of the first common ancestor of the true and predicted labels in the hierarchy [17]. Note that the tree loss is the same as the zero-one loss when measured at the first level of the hierarchy, and so will not be reported in the sequel. Moreover, since the positions of some of the clusters can be permuted [e.g., the leaves under each level-1 cluster in Fig. 3, and the three level-1 clusters in

10

10

(70.0%, 47.3%, 47.3%; 0.827, 1.353)

6

6

7

7

8

8

9

9 10

10 (e)

(f)

(82.0%, 48.7%, 48.7%; 0.693, 1.207)

(94.7%, 58.7%, 58.7%; 0.467, 0.880)

Fig. 5. Confusion matrices on the SCOP data for a typical run. Shown inside the brackets are the accuracies at the first, second, and third levels, and the tree losses at the second and third levels. (a) Complete-linkage. (b) Kernel hierarchical k-means. (c) XWSS. (d) CLUHSIC. (e) CLUHSICAL (0–1 loss). (f) CLUHSICAL (tree loss).

Fig. 3(c)], we will report the performance based on the best permutation of the clustering result. The kernel hierarchical k-means clustering result is used to initialize CLUHSICAL, CLUHSIC, and XWSS (except for the teapot data where kernel hierarchical k-means produces a very poor initialization and so random initialization is used instead). To reduce statistical variability, results for all the methods (except for agglomerative hierarchical clustering, which is deterministic) are averaged over 10 repetitions with 10 random restarts in each repetition. In the 2 -normalized version of CLUHSICAL, we fix C = 0.1m and γ = 1. 3) Importance of the Loss Function: We first illustrate the importance of the loss function. Fig. 5 shows the confusion matrices obtained on the SCOP data for a typical run. The four squares lying along the diagonal correspond to the four subtrees rooted at nodes a, b, c, and g in Fig. 3(a). The small square inside the largest square corresponds to the leftmost subtree under node c. The gray level represents the number of patterns in the cluster (darker color indicates more patterns). In the ideal case, the confusion matrix is diagonal and all the patterns are correctly clustered, resulting in zero error and zero loss. As can be seen from Fig. 5, while all the methods are not perfect, CLUHSICAL, when trained with the tree loss, assigns the mis-clustered patterns to clusters that are still within the target subtree. On the other hand, the other methods may assign the mis-clustered patterns to distant unrelated categories. Fig. 6 illustrates this effect in more detail. For clarity, we only show the assignments for patterns belonging to a particular leaf node. Here, a (blue) dot indicates a correct

1570

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

TABLE I C LUSTERING P ERFORMANCE ON THE SCOP D ATA

kernel matrix

method average-linkage hierarchical complete-linkage clustering Ward’s linkage kernel hierarchical k-means XWSS CLUHSIC CLUHSICAL 0-1 loss ( 1 -normalized) tree loss CLUHSICAL 0-1 loss ( 2 -normalized) tree loss average-linkage hierarchical complete-linkage clustering Ward’s linkage kernel hierarchical k-means XWSS CLUHSIC CLUHSICAL 0-1 loss ( 1 -normalized) tree loss CLUHSICAL 0-1 loss ( 2 -normalized) tree loss average-linkage hierarchical complete-linkage clustering Ward’s linkage kernel hierarchical k-means XWSS CLUHSIC CLUHSICAL 0-1 loss ( 1 -normalized) tree loss CLUHSICAL 0-1 loss ( 2 -normalized) tree loss

K1Al

K2Al

K3Al

a

c

b

g

Level 1

a

c

b

g

Level 2

Level 2

Level 3

Level 3

(a) a

(b) c

b

g

Level 1

a

c

b

g

Level 2

Level 3

(c)

(d) c

b

g

Level 1

a

c

b

Level 2

g

Level 1 Level 2

Level 3

(e)

Level 1 Level 2

Level 3

a

Level 1

Level 3

(f)

Fig. 6. Cluster assignments for patterns coming from a specific cluster. (a)–(f) Results obtained by agglomerative hierarchical clustering with completelinkage, kernel hierarchical k-means, XWSS, CLUHSIC, CLUHSICAL (0–1 loss) and CLUHSICAL (tree loss), respectively.

cluster assignment while a (red) cross indicates an incorrect assignment. As can be seen, CLUHSICAL, when used with the tree loss, tries to avoid assigning mis-clustered patterns to faraway clusters. Hence, incorporation of the loss function is beneficial in clustering structured data. 4) Clustering Performance: Detailed clustering performance results on the various datasets are shown in Tables I–III. The best results and those that are not significantly worse (according to the t-test with a p-value less than 0.05) are printed in bold. As can be seen, CLUHSICAL trained with the structured loss performs well on the SCOP and WIPO datasets,

level 1 75.3 88.7 55.3 55.2 55.2 40.9 71.0 84.9 65.9 86.6 54.7 54.0 72.0 51.9 51.8 50.1 69.1 80.0 66.7 69.4 74.0 51.3 64.7 53.2 53.2 36.6 58.6 71.3 75.5 76.9

accuracy (%) level 2 level 3 34.7 32.7 43.3 43.3 32.7 30.7 35.2 35.2 35.2 35.2 31.9 31.9 47.7 47.6 54.1 53.9 44.4 44.1 53.7 53.7 14.0 12.0 12.0 10.0 41.3 38.0 32.1 32.1 32.1 32.1 29.6 29.6 39.0 39.0 45.5 45.3 39.3 38.5 40.1 39.6 29.3 25.3 28.0 28.0 34.7 32.7 35.7 35.7 35.7 35.7 25.5 25.5 38.0 38.0 43.3 43.2 51.3 50.9 47.5 47.1

tree loss level 2 level 3 0.90 1.57 0.68 1.25 1.12 1.81 1.10 1.74 1.10 1.74 1.27 1.95 0.81 1.34 0.61 1.07 0.90 1.46 0.60 1.06 1.31 2.19 1.34 2.24 0.87 1.49 1.17 1.84 1.17 1.84 1.20 1.91 0.92 1.53 0.75 1.29 0.94 1.56 0.91 1.51 0.97 1.71 1.21 1.93 1.01 1.68 1.12 1.76 1.12 1.76 1.38 2.12 1.03 1.65 0.86 1.42 0.73 1.22 0.76 1.29

and is comparable to the best result (attained by CLUHSICAL trained with the 0-1 loss) on the facial expression data. In particular, CLUHSICAL performs much better than kernel hierarchical k-means. This is because our kernel hierarchical k-means variant can only consider the sub-structure information separately, but not as a whole, whereas CLUHSICAL considers both the output structure and loss function in a more principled way. Results of the various hierarchical clustering variants are also inferior because the structure information cannot be utilized while constructing the dendrogram. Besides, both versions of CLUHSICAL are often better than CLUHSIC. CLUHSICAL trained with the structured loss is often better than that trained with the zero-one loss. Moreover, as discussed in Section III-C, CLUHSICAL often performs better than CLUHSIC even when both are trained with the zero-one loss. The 2 -normalized version also outperforms CLUHSIC, but is sometimes slightly inferior to its 1 -normalized version. B. Ring-Structured Dataset The datasets used in the previous section are all hierarchical in structure. In this section, we will experiment with the teapot data set3 [2], [32] which has a ring structure. It contains 400 teapot images (each of size 76 × 101) rotated from 1°–360°. The Gaussian kernel is used on the input. As for the output, we follow [2] and set A(yi , y j ) to 2 if i = j , 1 if i and j are neighbors, and 0 otherwise. In the first experiment, we use (standard) kernel k-means, CLUHSIC, and CLUHSICAL to cluster the full dataset into 10 3 http : //www.it.usyd.edu.au/ ∼/lesong/cluhsic_datasets.html

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

TABLE II C LUSTERING P ERFORMANCE ON THE WIPO D ATA

method average complete Ward’s kernel hierarchical k-means XWSS CLUHSIC CLUHSICAL 0-1 loss ( 1 -normalized) tree loss CLUHSICAL 0-1 loss ( 2 -normalized) tree loss hierarchical clustering

accuracy (%) lvl 1 lvl 2 lvl 3 32.5 18.4 18.4 24.9 24.9 20.3 27.1 27.1 22.9 26.0 23.5 20.9 26.0 23.5 20.9 26.4 24.7 21.8 30.6 27.0 25.7 33.0 29.6 27.3 31.4 27.7 26.1 35.1 32.1 29.3

tree loss lvl 2 lvl 3 1.49 2.31 1.50 2.30 1.46 2.23 1.51 2.30 1.51 2.30 1.49 2.27 1.42 2.17 1.37 2.10 1.41 2.15 1.33 2.04

1571

TABLE IV C LUSTERING P ERFORMANCE ON THE T EAPOT S UBSET method kernel k-means XWSS [11] CLUHSIC [2] CLUHSICAL 0-1 loss ( 1 -normalized) ring loss CLUHSICAL 0-1 loss ( 2 -normalized) ring loss

accuracy (%) 32.32 13.06 65.51 93.83 98.20 55.29 54.71

ring loss 1.16 1.54 0.39 0.06 0.02 0.85 0.85

TABLE V C LUSTERING P ERFORMANCE IN THE S PECIAL C ASE OF A = I. N UMBER IN

B RACKETS IS THE S TANDARD D EVIATION

TABLE III C LUSTERING P ERFORMANCE ON THE FACIAL E XPRESSION D ATA accuracy (%) level 1 level 2 78.4 42.7 78.4 42.7 88.7 71.9 70.1 54.6 70.1 55.0 89.0 80.8 96.5 94.8 92.3 90.2 94.3 90.8 93.4 91.8

method average hierarchical complete clustering Ward’s kernel hierarchical k-means XWSS CLUHSIC CLUHSICAL 0-1 loss ( 1 -normalized) tree loss CLUHSICAL 0-1 loss ( 2 -normalized) tree loss

(a)

(c)

tree loss level 2 0.79 0.79 0.39 0.75 0.75 0.30 0.09 0.18 0.15 0.15

(b)

(d)

(e)

Fig. 7. Clusters (separated by strokes) obtained on the full teapot dataset in a typical run. Note that the ordering of the grayscales in the k-means solution does not align with those of the ground truth. (a) Ring structure. (b) Kernel k-means. (c) CLUHSIC. (d) CLUHSICAL (0–1 loss). (e) CLUHSICAL (tree loss).

clusters. Fig. 7 shows the clusters obtained for a typical run. Here, the positions of the clusters on the chain are color-coded by the grayscale [Fig. 7(a)]. As can be seen, all the methods can group the patterns into reasonable clusters. However, the ordering of the clusters in the kernel k-means solution does not align with the given ring structure. This is not surprising as structure information is not used in kernel k-means. As for the solutions produced by CLUHSIC and CLUHSICAL, they are very similar and differ only in the slight variations for the sizes of the clusters obtained. However, a quantitative comparison is difficult. Recall that the images are obtained by continuously rotating the teapot from 1°–360°. Hence, there is no natural ground truth on the number of clusters and the size of each cluster. Even if these

data set SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) WIPO face teapot

method CLUHSIC CLUHSICAL CLUHSIC CLUHSICAL CLUHSIC CLUHSICAL CLUHSIC CLUHSICAL CLUHSIC CLUHSICAL CLUHSIC CLUHSICAL

accuracy (%) 66.4 (3.6) 72.8 (4.6) 64.3 (3.7) 69.4 (4.1) 66.3 (4.3) 71.7 (4.1) 43.0 (4.4) 51.5 (4.3) 77.0 (7.6) 90.9 (5.9) 32.7 (5.3) 35.1 (9.4)

were known, it would still be difficult to have a fair performance comparison because of the continuous ring structure. To avoid this problem, we perform a second experiment in which 5 images are removed from every 40 images. Thus, in this ground truth solution, we have a total of 10 well-separated clusters each with 35 images. Besides reporting the zero-one loss, we will also show the ring loss, which is defined as zero if the predicted position is correct, 1 if the predicted position and ground truth position are neighbors, and 2 otherwise. Moreover, as in Section IV-A, the positions of the clusters can be permuted4, and so the reported performance will be based on the best permutation. Results are shown in Table IV. Again, CLUHSICAL trained with the structured loss performs best on both metrics. C. Special Case of A = I and Zero-One Loss In this section, we consider the special case where there is no output structure (i.e., A = I). In this case, CLUHSIC is equivalent to standard kernel k-means. As can be seen from Table V, although CLUHSIC and CLUHSICAL are equivalent under the zero-one loss before normalization (which is also experimentally verified), CLUHSICAL outperforms CLUHSIC after normalization. D. Equation (20) as a Good Approximation of (γ J + S)−1 Finally, we provide empirical evidence that (20) is a good approximation of (γ J + S)−1 . We study the relative difference |t1 − t2 |/t1 in approximating the objective 4 For example, for a ring with c clusters, there are 2c possible permutations obtained by starting from each of the c clusters and then visiting them in either clockwise or anticlockwise manner.

1572

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

TABLE VI R ELATIVE A PPROXIMATION E RRORS U SING THE Z ERO -O NE L OSS AND S TRUCTURED L OSS γ

data set

0-1 loss 1

structured loss

0-1 loss 0.1

structured loss

0-1 loss 0.01

structured loss

face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al ) face WIPO teapot SCOP (K1Al ) SCOP (K2Al ) SCOP (K3Al )

C m 2.82 1.26e-02 4.36e-01 2.75e-01 5.27e-01 2.95e-01 4.26 1.48e-02 5.53e-01 3.47e-01 6.84e-01 3.74e-01 3.77e-02 1.29e-04 4.69e-03 3.04e-03 6.47e-03 3.32e-03 5.40e-02 1.52e-04 5.88e-03 3.92e-03 8.64e-03 4.32e-03 3.94e-04 1.29e-06 4.73e-05 3.09e-05 6.67e-05 3.38e-05 5.60e-04 1.52e-06 5.93e-05 3.99e-05 8.95e-05 4.41e-05

0.1m 3.77e-02 1.29e-04 4.69e-03 3.04e-03 6.47e-03 3.32e-03 5.40e-02 1.52e-04 5.88e-03 3.92e-03 8.64e-03 4.32e-03 3.94e-04 1.29e-06 4.73e-05 3.09e-05 6.67e-05 3.38e-05 5.58e-04 1.52e-06 5.92e-05 3.99e-05 8.95e-05 4.41e-05 3.95e-06 1.29e-08 4.74e-07 3.09e-07 6.70e-07 3.39e-07 5.62e-06 1.52e-08 5.93e-07 4.00e-07 8.99e-07 4.43e-07

t1 = 1 (γ J + S)−1 1 in (7) by its first-order approximation t2 = 1 S−1 1 − γ 1 S−1 JS−1 1 in (20). The labels yi ’s required in the definition of S are set to the ground truths. We vary γ from 1, 0.1 to 0.01, and C from 0.00001m to m (where m is the number of samples). As can be seen from Table VI, the approximation error, depending on C and γ , is usually very small. V. C ONCLUSION In this paper, we studied the problem of clustering structured outputs. We showed that the loss function, which is known to be very important in the supervised structured prediction problems, can also be incorporated into unsupervised learning. In particular, we showed that CLUHSIC is implicitly using the often inappropriate zero-one loss. We provided detailed derivations and proposed an extension called CLUHSICAL that explicitly considers both the output dependency and loss function. Experimental results on a number of datasets show that CLUHSICAL (with structured loss) often outperforms CLUHSIC, XWSS, and kernel hierarchical k-means, and thus confirm the importance of the loss function in clustering structured data.

0.01m 3.94e-04 1.29e-06 4.74e-05 3.09e-05 6.67e-05 3.38e-05 5.60e-04 1.52e-06 5.93e-05 3.99e-05 8.95e-05 4.41e-05 3.95e-06 1.29e-08 4.74e-07 3.09e-07 6.70e-07 3.39e-07 5.62e-06 1.52e-08 5.93e-07 4.00e-07 8.99e-07 4.43e-07 3.96e-08 1.29e-10 4.74e-09 3.09e-09 6.70e-09 3.39e-09 5.62e-08 1.52e-10 5.93e-09 4.00e-09 8.99e-09 4.43e-09

0.001m 3.95e-06 1.29e-08 4.74e-07 3.09e-07 6.70e-07 3.39e-07 5.62e-06 1.52e-08 5.93e-07 4.00e-07 8.99e-07 4.43e-07 3.96e-08 1.29e-10 4.74e-09 3.09e-09 6.70e-09 3.39e-09 5.62e-08 1.52e-10 5.93e-09 4.00e-09 8.99e-09 4.43e-09 3.96e-10 1.29e-12 4.74e-11 3.09e-11 6.70e-11 3.39e-11 5.62e-10 1.51e-12 5.93e-11 4.00e-11 8.99e-11 4.43e-11

0.0001m 3.96e-08 1.29e-10 4.74e-09 3.09e-09 6.70e-09 3.39e-09 5.62e-08 1.52e-10 5.93e-9 4.00e-09 8.99e-09 4.43e-09 3.96e-10 1.29e-12 4.74e-11 3.09e-11 6.70e-11 3.39e-11 5.62e-10 1.52e-12 5.94e-11 4.00e-11 8.99e-11 4.43e-11 3.96e-12 1.33e-14 4.72e-13 3.06e-13 6.69e-13 3.39e-13 5.62e-12 1.60e-14 5.88e-13 3.97e-13 8.97e-13 4.41e-13

0.00001m 3.96e-10 1.29e-12 4.74e-11 3.09e-11 6.70e-11 3.39e-11 5.62e-10 1.52e-12 5.93e-11 4.00e-11 8.99e-11 4.43e-11 3.96e-12 1.40e-14 4.72e-13 3.07e-13 6.69e-13 3.38e-13 5.62e-12 1.48e-14 7.49e-13 3.99e-13 8.97e-13 4.39e-13 3.76e-14 1.13e-16 4.20e-15 2.89e-15 6.51e-15 3.79e-15 5.53e-14 3.38e-16 5.42e-15 4.52e-15 9.40e-15 3.61e-15

A PPENDIX I P ROOF OF P ROPOSITION 1 Introducing Lagrange multipliers αiy ’s for the equality constraints in (3) and denoting δ i = δϕ i (y), δ¯ i = δϕ i (¯y), i, j = (yi , y j ), i. = (yi , y), and i¯. = (yi , y¯ ), 2 − then the Lagrangian is L = 1/2w2 + (C/m) i,y=yi ξiy √ i,y =yi αiy w, δ i − 1 + ξiy / i. . Setting its derivatives w.r.t. the primal variables to zero, and plugging these back into L, we obtain 2 C m αiy 2 1 αiy δ i + L= √ 2 m 2C i. i,y =yi i,y =yi ⎛ ⎞ m αiy ⎠ − αiy ⎝ α j y¯ δ¯ j , δ i − 1 + 2C i. i,y =yi j,¯y =y j 1 C m αiy 2 = αiy α j y¯ δ i , δ¯ j + √ 2 m 2C i. i,y =y i,y =y i

j,¯y =y j

i

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

−

i,y =yi j,¯y =y j

=

i,y =yi

=

i,y =yi

αiy α j y¯ δ i , δ¯ j +

αiy −

i,y =yi

2 m αiy 2C i.

i,y =yi

φ(x j ) ⊗ λ(y j ) − λ(¯y) i. j¯. i. λ(yi ) − λ(y) , = κ(xi , x j ) ij

2 1 m αiy αiy − αiy α j y¯ δ i , δ¯ j − 2 4C i. i,y =yi j,¯y =y j

i,y =yi

δ 1 m iy j y ¯ αiy − αiy α j y¯ δ i , δ¯ j + √ 2 2C i. j¯. i,y =y

Note that, when the kernel k is scaled to γ k as in Proposition 2, the J in (5) will be scaled accordingly as γ J. First, we introduce a few lemmas. Lemma 3: As a first-order approximation, minimizing 1 (γ J + S)−1 1 in (9) for the scaled kernel γ k is the same as maximizing ˜ J1 − 2C 1 1 (19) C1 m where = diag([i. ]i,y=yi ) and γ , C˜ are as defined in Proposition 2. (γ J + S)−1 = [S(γ S−1 J + I)]−1 = −1 that −1Proof: Note −1 S . On using (10), S−1 (γ J) p ≤ S (γ J) + I −1 γ S p J p < 1, this can then be expanded into an infinite series as [33] ∞ i −1 −1 S−1 . (γ J + S) = −S (γ J)

y =y

ij

i j¯. λ(y j ) − λ(¯y) .

i

A PPENDIX II P ROOF OF P ROPOSITION 2

y =yi ,¯y =y j

j¯. λ(y j ) − λ(¯y) = κ(xi , x j ) i. λ(yi ) − λ(y) ,

j,¯y =y j

where δiy j y¯ is 1 if i = j and y = y¯ , and 0 otherwise. This can then be rewritten as the form in (4).

1573

(21)

y¯ =y j

Using (12)

i. λ(yi ) − λ(y)

y =yi

⎡

=⎣

⎤ i. ⎦ λ(yi ) +

y =yi

i· λ) −i. λ(y) = (

y =yi

where λ = [λ(y1 ), . . . , λ(y|Y | )] . Therefore, (21) can be written as i· λ)( j ·λ) κ(xi , x j )( ij

=

i· λλ j · ) κ(xi , x j )(

ij

)1 = tr(KA ) = 1 (K ◦ A where ◦ denotes the Hadamard (or elementwise) product. Lemma 5: The 1 1 term in (19) can be simplified as is as defined in Proposition 2. 1 , where (1/2) Proof: 1 1 =

m i=1 y =yi

i=0

|Y |

1 1 |i j | = 1 2 2 m

i. =

i=1 j =1

Taking the first-order approximation (γ J + S)−1 (I − S−1 (γ J))S−1 = S−1 − γ S−1 JS−1 . (20) Minimizing −1 (γ J + S)−1 1 in (9) then becomes maximizing −1

γ1 S

−1

JS

−1

1−1S

˜ J1 − 2C 1 1 1 = C1 m

on using the definitions of J and S in (5) and (6). Lemma 4: Using the tensor-product feature map in (8), the ), where 1 J1 term in (19) can be simplified as tr(KA A are as defined in Proposition 2. K, , Proof: Again denote ϕii = ϕ(xi , yi ), ϕi. = ϕ(xi , y) and ϕi¯. = ϕ(xi , y¯ ). Note that δ i , δ¯ j i. j¯. 1 J1 = =

ϕii − ϕi. , ϕ j j − ϕ j¯. i. j¯.

i,y =yi j,¯y =y j

=

i,y =yi j,¯y =y j

A PPENDIX III P ROOF OF THE R ESULTS IN S ECTION III-C Proof: (of Corollary 1). When the zero-one loss is used, we have 1 1 =

m

φ(xi ) ⊗ (λ(yi ) − λ(y)) ,

i. =

i=1 y =yi

m (|Y| − 1) = m|Y| − m. i=1

Thus, the second term in (19) is a constant, and (11) reduces to max s.t

i,y =yi j,¯y =y j

Proof: (of Proposition 2). Result follows immediately by plugging Lemmas 4 and 5 into (19).

) tr(KA i· is as defined in (12).

(22)

This is the same as (2) of CLUHSIC, except that the definitions of the partition matrix are, apparently, different. However, i· in (12) reduces to note that with the zero-one loss, i· = [−1, . . . , −1, |Y| − 1, −1, . . . , −1]. ! ! "# $ "# $ (yi −1) times

(|Y |−yi ) times

(23)

1574

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

We can add 1 to each of its entries, and obtain ˆ i· = [0, . . . , 0, |Y|, 0, . . . , 0]. ˆ is essentially the same as the partition matrix Note that in CLUHSIC, except that the only nonzero entry is changed from 1 to the constant |Y|. Using the fact that K is always centered (i.e., K1 = 0) in CLUHSIC ˆ ˆ ) = tr(K( + 11 )A( + 11 ) ) tr(KA = tr(KA ) + tr(KA11 )

Substituting (18) back into (25), we have ⎞ ⎛ |Y | ⎝(|Y| − 1) κ(xi , xi ) + κ(xi , xi )⎠ c=1

− =

|Y |

c=1 m

ˆ ˆ ) ⇔ max tr(KA ). max tr(KA ˆ

and (19) leads to the same partitioning as CLUHSIC. Proof: (of Lemma 1). Setting as in (15), objective (14) reduces to |Y | φ(xi ) − µc 2F (24) c=1 xi ∈c

which is the standard k-means objective. The cluster center can be easily obtained as in (16). Moreover, note that (24) can be |Y | written as c=1 xi ∈c κ(xi , xi ) − 2φ(xi ), µc + µc , µc . Substituting in (16), we have ⎛ ⎞ |Y | |Y | 1 ⎝ κ(xi , xi ) − κ(xi , x j )⎠ m c c=1 xi ∈c c=1 xi ,x j ∈c ⎞ ⎛ | Y m | 1 ⎝m c κ(xi , xi ) − κ(xi , x j )⎠ . = m2 x ,x ∈c c i

j

Now the first term is constant, so this can be dropped from the optimization. When m c = m/|Y|, the objective reduces to the CLUHSIC objective with A = (m/|Y|)I its 1 -normalized and 1 φ(xi ) ∈ c, mc ¯ ic = partition matrix has elements / c. 0 φ(xi ) ∈ Proof: (of Lemma 2). With the setting of , (14) reduces to ⎛ ⎞ |Y | ⎝(|Y| − 1) µc − φ(xi )2F + µc + φ(xi )2F ⎠ c=1

xi ∈c

xi ∈c /

(25) and the cluster center can be obtained by (18) as (|Y| − 1) xi ∈c φ(xi ) − xi ∈c φ(x ) i / µc = m c (|Y| − 1) + m − m c m ic m φ(xi ) = j =1 | j c | i=1

where m j =1

| j c | = m c (|Y| − 1) + m − m c = m c (|Y| − 2) + m.

⎞

ic kc κ(xi , xk )⎠

i=1,k=1

2(|Y| − 1)κ(xi , xi )

c=1

c=1

⎝ m 1 j =1 | j c |

xi ∈c / m

⎞

⎛ |Y | m m ⎝ | j c | −

= tr(KA ).

i=1

xi ∈c

i=1

+ tr(K11 A ) + tr(K11A11 )

Hence

⎛

j =1

kc ic m κ(xi , xk )⎠ . | | jc j =1 j =1 | j c |

m

i=1,k=1

Since the first term is also constant and can be ignored, the objective then reduces to the CLUHSICAL objective with A = (2(|Y| − 1)m)/|Y| when m c = m/|Y|, and its 1 -normalized ¯ ic = (ic )/ m | j c |. partition matrix has elements j =1 R EFERENCES [1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall, 1988. [2] L. Song, A. Smola, A. Gretton, and K. M. Borgwardt, “A dependence maximization view of clustering,” in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, Jun. 2007, pp. 815–822. [3] J. MacQueen, “Some methods of classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math., Statist. Probability, 1967, pp. 281–297. [4] N. Slonim and Y. Weiss, “Maximum likelihood and the information bottleneck,” in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, pp. 351–358. [5] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000. [6] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 849–856. [7] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Proc. Int. Conf. Algorithmic Learn. Theory, Singapore, Oct. 2005, pp. 63–77. [8] T. Gärtner, “A survey of kernels for structured data,” ACM SIGKDD Explorations Newslett., vol. 5, no. 1, pp. 49–58, Jul. 2003. [9] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, “Support vector clustering,” J. Mach. Learn. Res., vol. 2, pp. 125–137, Dec. 2001. [10] M. Girolami, “Mercer kernel-based clustering in feature space,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 780–784, May 2002. [11] L. Xu, D. Wilkinson, F. Southey, and D. Schuurmans, “Discriminative unsupervised learning of structured predictors,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, Jun. 2006, pp. 1057–1064. [12] H. Valizadegan and R. Jin, “Generalized maximum margin clustering and unsupervised kernel learning,” in Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, 2007, pp. 1417– 1424. [13] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum margin clustering,” in Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press, 2005, pp. 1537–1544. [14] L. Xu and D. Schuurmans, “Unsupervised and semi-supervised multiclass support vector machines,” in Proc. 20th Nat. Conf. Artificial Intell., Pittsburgh, PA, 2005, pp. 904–910. [15] F. Bach and Z. Harchaoui, “DIFFRAC: A discriminative and flexible framework for clustering,” in Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press, 2008, pp. 49–56. [16] B. Taskar, “Learning structured prediction models: A large margin approach,” Ph.D. thesis, Dept. Comput. Sci., Stanford Univ., Palo Alto, CA, 2004. [17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” J. Mach. Learn. Res., vol. 6, pp. 1453–1484, Dec. 2005.

ZHONG et al.: INCORPORATING THE LOSS FUNCTION INTO DISCRIMINATIVE CLUSTERING OF STRUCTURED OUTPUTS

[18] K. Fukumizu, F. R. Bach, and M. I. Jordan, “Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces,” J. Mach. Learn. Res., vol. 5, pp. 73–99, Dec. 2004. [19] L. Song, “Learning via Hilbert space embedding of distributions,” Ph.D. thesis, School of Information Technologies, Univ. Sydney, NSW, Australia, 2008. [20] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, “Spectral relaxation for K -means clustering,” in Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002. [21] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems 13, T. Leen, T. Dietterich, and V. Tresp, Eds. Cambridge, MA: MIT Press, 2001, pp. 556–562. [22] C. Saunders, A. Gammerman, and V. Vovk, “Ridge regression learning algorithm in dual variables,” in Proc. 15th Int. Conf. Mach. Learn., San Francisco, CA, 1998, pp. 515–521. [23] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, Jun. 1999. [24] L. Xu, A. Krzyzak, and E. Oja, “Rival penalized competitive learning for clustering analysis, RBF net, and curve detection,” IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 636–649, Jul. 1993. [25] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: A structural classification of proteins database for the investigation of sequences and structures,” J. Molecular Biol., vol. 247, no. 4, pp. 536– 540, Apr. 1995. [26] S. Bhattacharya, C. Bhattacharyya, and N. R. Chandra, “Structural alignment based kernels for protein structure classification,” in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, Jun. 2007, pp. 73–80. [27] D. Haussler, “Convolution kernels on discrete structures,” Dept. Comput. Sci., Univ. California, Santa Cruz, CA, Tech. Rep. UCSC-CRL-99-10, 1999. [28] L. Cai and T. Hofmann, “Hierarchical document categorization with support vector machines,” in Proc. ACM 13th Conf. Inf. Knowl. Manage., Washington D.C., Nov. 2004, pp. 78–87. [29] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor, “Kernelbased learning of hierarchical multilabel classification models,” J. Mach. Learn. Res., vol. 7, pp. 1601–1626, Dec. 2006. [30] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley-Interscience, 2001. [31] J. H. Ward, Jr., “Hierarchical grouping to optimize an objective function,” J. Amer. Statist. Assoc., vol. 58, no. 301, pp. 236–244, Mar. 1963. [32] K. Q. Weinberger and L. K. Saul, “An introduction to nonlinear dimensionality reduction by maximum variance unfolding,” in Proc. 21st Nat. Conf. Artificial Intell., Boston, MA, Jul. 2006, pp. 1683–1686. [33] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: The Johns Hopkins Univ. Press, 1996.

Wenliang Zhong received the B.Sc. and M.Sc. degrees in computer science from the Sun Yat-sen University, Guangdong, China, in 2007 and 2009, respectively. Currently, he is pursuing the Ph.D. degree in computer science at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China. His current research interests include kernel methods, machine learning, and computational intelligence.

1575

Weike Pan received the B.S. degree from the College of Computer Science, Zhejiang University, Hangzhou, China, in 2005. He is a Ph.D. candidate in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China. His current research interests include transfer learning, recommender systems, and statistical machine learning. Mr. Pan is a Student Member of the Association for the Advancement of Artificial Intelligence.

James T. Kwok received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Hong Kong, China, in 1996. He was with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, as an Assistant Professor. Since 2000, he has been an Associate Professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His current research interests include kernel methods, machine learning, pattern recognition, and artificial neural networks. Dr. Kwok received the IEEE T RANSACTIONS ON N EURAL N ETWORKS Outstanding Paper Award for 2004 in 2006, and the Natural Science Award from the Ministry of Education, China, for 2008 in 2009. He is currently serving as an Associate Editor for the IEEE T RANSACTIONS ON N EURAL N ETWORKS and the Neurocomputing Journal.

Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Hong Kong, China, in 2007. He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University (NTU), Singapore. He is also the Deputy Director of the Center for Computational Intelligence, NTU. His current research interests include machine learning, kernel methods, large-scale optimization and its applications to data mining, and pattern recognitions. Dr. Tsang was awarded the prestigious IEEE T RANSACTIONS ON N EURAL N ETWORKS Outstanding Paper Award for 2004 in 2006. He clinched the second-class prize of the National Natural Science Award from the Ministry of Education, China, for 2008 in 2009. His work on transfer learning for visual event recognition was awarded the Best Student Paper Prize at the 22nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition in 2010. His work on speech adaptation earned him the Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum in 2006. He was also awarded the Microsoft Fellowship in 2005.