2010 IEEE International Conference on Data Mining

A Conscience On-line Learning Approach for Kernel-Based Clustering

Chang-Dong Wang∗, Jian-Huang Lai†, Jun-Yong Zhu‡
∗ School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China. Email: [email protected]
† School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China. Email: [email protected]. Corresponding author
‡ School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, P. R. China. Email: [email protected]


Abstract—Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable datasets. However, exhaustive search for the global optimum is NP-hard. Iterative procedures such as 𝑘-means can be used to seek one of the local minima. Unfortunately, they are easily trapped into degenerate local minima when the prototypes of clusters are ill-initialized. In this paper, we restate the optimization problem of kernel-based clustering in an on-line learning framework, whereby a conscience mechanism is easily integrated to tackle the ill-initialization problem and a faster convergence rate is achieved. Thus, we propose a novel approach termed conscience on-line learning (COLL). For each randomly taken data point, our method selects the winning prototype based on the conscience mechanism, which biases the ill-initialized prototypes to avoid degenerate local minima, and efficiently updates the winner by the on-line learning rule. Therefore, it obtains a smaller distortion error than 𝑘-means with the same initialization, and does so more efficiently. Experimental results on synthetic and large-scale real-world datasets, as well as in the application of video clustering, demonstrate significant improvement over existing kernel clustering methods.

Keywords-kernel-based clustering; conscience mechanism; on-line learning; 𝑘-means

I. INTRODUCTION

Data clustering plays an indispensable role in various fields such as computer science, medical science, social science and economics [1]. In real-world applications, it is important to identify nonlinear clusters, and kernel-based clustering is a popular method for this purpose [2]. The basic idea of kernel-based clustering is to seek an assignment of each point to one of some clusters in the kernel feature space such that the within-cluster similarity is high and the between-cluster similarity is low. However, exhaustive search for the optimal assignment of the data points in the projected space is computationally infeasible [2], since the number of all possible partitions of a dataset grows exponentially with the number of data points. There is therefore an urgent need for an efficient approach to find satisfactory sub-optimal solutions. 𝑘-means is such an iterative method [3], [4]. Despite its great success, one serious drawback is that its performance easily degenerates in the case of ill-initialization [5]. For instance, in a randomly ill-initialized assignment, some cluster may be assigned only a small number of remote and isolated points, such that the prototype of this cluster is relatively far away from any points. As a result, in the later iterations, this prototype would never get a chance to be assigned any point (Figure 1(b)).

Several methods have been developed to overcome the ill-initialization problem. Bradley and Fayyad [5] proposed to compute a refined initialization from a given one by estimating the modes of a distribution based on a subsampling technique. Another approach for refining the initial cluster prototypes is based on the observation that some patterns are very similar to each other, so that they have the same cluster membership irrespective of the choice of initial cluster prototypes [6]. Zhang et al. [7] computed a lower bound on the cost of the local optimum from the current prototype set and proposed a BoundIterate method. Evolutionary algorithms such as the genetic algorithm have also been applied to 𝑘-means, aiming to avoid degenerate local minima [8], [9]. Very recently, Likas et al. proposed the global 𝑘-means, which is deterministic and does not rely on any initial conditions [10], [11]. Although these methods have eliminated the sensitivity to ill-initialization, most of them are computationally expensive.

In this paper, we propose a novel approach termed conscience on-line learning (COLL) to solve the optimization problem associated with kernel-based clustering in the on-line learning framework. The proposed method starts with a random guess of the assignment, the same as 𝑘-means. However, unlike 𝑘-means, in each iteration, for each randomly taken data point, the method selects the winning prototype based on the conscience mechanism [12] and updates the winner by the on-line learning rule [13]. The procedure requires only one winning prototype to be updated slightly towards the new point rather than re-computing the mean of each cluster at each step; hence a much faster convergence rate is achieved, and other competitive mechanisms like conscience can be easily integrated. The advantage of the conscience mechanism is that, by reducing the winning rate of the frequent winners, all the prototypes are quickly brought into the solution and the ill-initialized prototypes are biased so that each prototype can win the competition with almost the same probability [12].


Consequently, two contributions are made in this paper. 1) The proposed COLL method is insensitive to ill-initialization and can be generalized to tackle other degenerate problems associated with random initialization. 2) Compared with other techniques aimed at tackling the ill-initialization problem, such as global search strategies, our approach achieves a faster convergence rate due to both the on-line learning and the conscience mechanism. Experimental results on synthetic and large-scale real-world datasets, as well as in the application of video clustering, demonstrate significant improvement over existing kernel clustering methods.

The remainder of the paper is organized as follows. Section II formulates the optimization problem of kernel clustering and reviews related work. In Section III, we describe in detail the proposed conscience on-line learning method. Experimental results and applications are reported in Section IV. We conclude the paper in Section V.

II. THE PROBLEM AND RELATED WORK

A. Problem Formulation

Given an unlabelled dataset 𝒳 = {x1, . . . , xn} of 𝑛 data points in ℝ^𝑑, which is projected into a kernel space 𝒴 by a mapping 𝜙, and the number of clusters 𝑐, we wish to find an assignment of each data point to one of the 𝑐 clusters such that, in the kernel space 𝒴, the within-cluster similarity is high and the between-cluster similarity is low. That is, we seek a map 𝜈 : 𝒳 → {1, . . . , 𝑐} to optimize [2]

    \nu = \arg\min_{\nu} \Big\{ \sum_{i,j:\, \nu_i = \nu_j} \| \phi(x_i) - \phi(x_j) \|^2    (1)
          - \lambda \sum_{i,j:\, \nu_i \neq \nu_j} \| \phi(x_i) - \phi(x_j) \|^2 \Big\},    (2)

where 𝜆 > 0 is some parameter and we use the short notation 𝜈𝑖 = 𝜈(x𝑖).

Theorem 1. The optimization criterion (2) is equivalent to the criterion

    \nu = \arg\min_{\nu} \sum_{i=1}^{n} \| \phi(x_i) - \mu_{\nu_i} \|^2,    (3)

where 𝝁𝑘 is the mean of the data points assigned to cluster 𝑘,

    \mu_k = \frac{1}{|\nu^{-1}(k)|} \sum_{i \in \nu^{-1}(k)} \phi(x_i), \quad \forall k = 1, \ldots, c,    (4)

and 𝜈𝑖 satisfies

    \nu_i = \arg\min_{k=1,\ldots,c} \| \phi(x_i) - \mu_k \|^2, \quad \forall i = 1, \ldots, n.    (5)

Proof: See [2].

Thus the goal of kernel clustering is to solve the optimization problem in (3). The objective term

    \sum_{i=1}^{n} \| \phi(x_i) - \mu_{\nu_i} \|^2    (6)

is known as the distortion error [13]. Ideally, all possible assignments of the data into clusters should be tested and the one with the smallest distortion error selected. This procedure is unfortunately computationally infeasible even for a very small dataset, since the number of all possible partitions of a dataset grows exponentially with the number of data points. Hence, efficient algorithms are required.

In practice, the mapping function 𝜙 is often not known or hard to obtain, and the dimensionality of 𝒴 is quite high. The feature space 𝒴 is instead characterized by the kernel function 𝜅 and the corresponding kernel matrix 𝐾 [2].

Definition 1. A kernel is a function 𝜅 such that 𝜅(x, z) = ⟨𝜙(x), 𝜙(z)⟩ for all x, z ∈ 𝒳, where 𝜙 is a mapping from 𝒳 to an (inner product) feature space 𝒴. A kernel matrix is a square matrix 𝐾 ∈ ℝ^{𝑛×𝑛} such that 𝐾_{𝑖,𝑗} = 𝜅(x𝑖, x𝑗) for some x1, . . . , x𝑛 ∈ 𝒳 and some kernel function 𝜅.

Thus, for an efficient approach, a computation procedure that uses only the kernel matrix is also required.

B. Batch Kernel 𝑘-means

The 𝑘-means algorithm [3] is one of the most popular iterative methods for solving the optimization problem (3). It begins by initializing a random assignment 𝜈 and seeks to minimize the distortion error by iteratively updating the assignment 𝜈,

    \nu_i \leftarrow \arg\min_{k=1,\ldots,c} \| \phi(x_i) - \mu_k \|^2, \quad \forall i = 1, \ldots, n,    (7)

and the prototypes 𝝁,

    \mu_k \leftarrow \frac{1}{|\nu^{-1}(k)|} \sum_{i \in \nu^{-1}(k)} \phi(x_i), \quad \forall k = 1, \ldots, c,    (8)

until all prototypes converge or the number of iterations reaches a prespecified value 𝑡_max. Let 𝝁̂ denote the old prototypes before the 𝑡-th iteration; the convergence of all prototypes is characterized by the convergence criterion

    \mathfrak{e}^{\phi} = \sum_{k=1}^{c} \| \mu_k - \hat{\mu}_k \|^2 \leq \epsilon,    (9)

where 𝜖 is some very small positive value, e.g., 10^{-4}. Since in practice only the kernel matrix 𝐾 is available, the update of the assignment (7) is computed based on the kernel trick [14]:

    \nu_i \leftarrow \arg\min_{k=1,\ldots,c} \| \phi(x_i) - \mu_k \|^2
          = \arg\min_{k=1,\ldots,c} \Big\{ K_{i,i} + \frac{\sum_{h \in \nu^{-1}(k)} \sum_{l \in \nu^{-1}(k)} K_{h,l}}{|\nu^{-1}(k)|^2} - \frac{2 \sum_{j \in \nu^{-1}(k)} K_{i,j}}{|\nu^{-1}(k)|} \Big\}.    (10)
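To make the kernel-trick updates (7) and (10) concrete, the following sketch implements batch kernel 𝑘-means using only the kernel matrix. It is an illustrative Python/NumPy sketch written for this exposition, not the authors' implementation; the function and variable names are our own.

```python
import numpy as np

def kernel_kmeans_assign(K, labels, c):
    """One assignment update (7)/(10): reassign every point using only the kernel matrix K."""
    n = K.shape[0]
    dist = np.zeros((n, c))
    for k in range(c):
        idx = np.where(labels == k)[0]
        if len(idx) == 0:
            # Empty cluster: leave its distance at +inf so no point is reassigned to it.
            # This mirrors the degeneracy discussed in the paper for ill-initialized prototypes.
            dist[:, k] = np.inf
            continue
        # ||phi(x_i) - mu_k||^2 = K_ii + sum_{h,l in C_k} K_hl / |C_k|^2 - 2 sum_{j in C_k} K_ij / |C_k|
        within = K[np.ix_(idx, idx)].sum() / len(idx) ** 2
        cross = K[:, idx].sum(axis=1) / len(idx)
        dist[:, k] = np.diag(K) + within - 2 * cross
    return dist.argmin(axis=1)

def batch_kernel_kmeans(K, c, t_max=100, seed=0):
    """Batch kernel k-means: iterate (7) until the assignment stops changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, c, size=K.shape[0])  # random initial assignment
    for _ in range(t_max):
        new_labels = kernel_kmeans_assign(K, labels, c)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```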


Figure 1. Illustration of ill-initialization and comparison of kernel 𝑘-means and COLL. Different clusters are plotted in different colors and marks, and the initial prototypes are plotted in “×” while the final in “*”. (a) Data points in the feature space. (b) A random but ill initialization. Each prototype is computed as the mean of points randomly assigned to the corresponding cluster. 𝝁2 is ill-initialized, i.e., it is relatively far away from any points. (c) The degenerate result by kernel 𝑘-means. The ill-initialization makes the final 𝝁2 assigned with no point. (d) The procedure and result of COLL. The updating procedure of prototypes is plotted in thinner “×”. The conscience mechanism successfully makes 𝝁2 win in the iterations, leading to a satisfying result.

Then the updated prototypes 𝝁 are implicitly expressed by the assignment 𝜈, which is further used in the next iteration. Let 𝜈̂ denote the old assignment before the 𝑡-th iteration; the convergence criterion (9) is then computed as

    \mathfrak{e}^{\phi} = \sum_{k=1}^{c} \| \mu_k - \hat{\mu}_k \|^2
        = \sum_{k=1}^{c} \Big\{ \frac{\sum_{h \in \nu^{-1}(k)} \sum_{l \in \nu^{-1}(k)} K_{h,l}}{|\nu^{-1}(k)|^2} + \frac{\sum_{h \in \hat{\nu}^{-1}(k)} \sum_{l \in \hat{\nu}^{-1}(k)} K_{h,l}}{|\hat{\nu}^{-1}(k)|^2} - \frac{2 \sum_{h \in \nu^{-1}(k)} \sum_{l \in \hat{\nu}^{-1}(k)} K_{h,l}}{|\nu^{-1}(k)| \, |\hat{\nu}^{-1}(k)|} \Big\}.    (11)

The above procedures constitute the well-known kernel 𝑘-means [4]. Given an appropriate initialization, it can find a sub-optimal solution (a local minimum). Despite its great success in practice, it is quite sensitive to the initial positions of the cluster prototypes, which can lead to degenerate local minima.

Example 1. In Figure 1(b), 𝝁2 is ill-initialized relatively far away from any points, such that it gets no chance to be assigned points and updated in the iterations of 𝑘-means. As a result, the clustering result by 𝑘-means is trapped in a degenerate local minimum as shown in Figure 1(c), where 𝝁2 is assigned no point.

To overcome this problem, the current 𝑘-means algorithm and its variants are usually run many times with different initial prototypes, and the result with the smallest distortion error (6) is selected [11]; this is, however, computationally too expensive. In this paper, we propose an efficient and effective approach to the optimization problem (3), which outputs a smaller distortion error and is insensitive to the initialization.

III. CONSCIENCE ON-LINE LEARNING

A. The Conscience On-line Learning Model

Let 𝑛𝑘 denote the cumulative winning number of the 𝑘-th prototype, and 𝑓𝑘 = 𝑛𝑘 / Σ_{𝑙=1}^{𝑐} 𝑛𝑙 the corresponding winning frequency. In the beginning, they are initialized as

    n_k = |\nu^{-1}(k)|, \quad f_k = n_k / n, \quad \forall k = 1, \ldots, c.    (12)

The proposed conscience on-line learning (COLL) is performed as follows: initialize the same random assignment 𝜈 as 𝑘-means, and iteratively update the assignment 𝜈 and the prototypes 𝝁 based on the frequency-sensitive (conscience) on-line learning rule. That is, in the 𝑡-th iteration, for one randomly taken data point 𝜙(x𝑖), select the winning prototype 𝝁𝜈𝑖 guided by the winning frequency 𝑓𝑘, i.e., the conscience based winner selection rule:

    \nu_i = \arg\min_{k=1,\ldots,c} \{ f_k \| \phi(x_i) - \mu_k \|^2 \},    (13)

and update the winner 𝝁𝜈𝑖 with learning rate 𝜂𝑡, i.e., the on-line winner updating rule:

    \mu_{\nu_i} \leftarrow \mu_{\nu_i} + \eta_t \big( \phi(x_i) - \mu_{\nu_i} \big),    (14)

as well as update the winning frequency:

    n_{\nu_i} \leftarrow n_{\nu_i} + 1, \quad f_k = n_k \Big/ \sum_{l=1}^{c} n_l, \quad \forall k = 1, \ldots, c.    (15)

The iteration procedure continues until all prototypes converge or the number of iterations reaches a prespecified value 𝑡_max. The convergence criterion identical to that of 𝑘-means (9) is used. The learning rates {𝜂𝑡} satisfy the conditions [13]

    \lim_{t \to \infty} \eta_t = 0, \quad \sum_{t=1}^{\infty} \eta_t = \infty, \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.    (16)

In practice, 𝜂𝑡 = const/𝑡, where const is some small constant, e.g., 1. By reducing the winning rate of the frequent winners according to (13), the goal of the conscience mechanism is to bring all the prototypes into the solution quickly and to bias the competitive process so that each prototype can win the competition with almost the same probability. In this way, the method is insensitive to ill-initialization and thus prevents the result from being trapped in degenerate local minima; meanwhile, it converges much faster than other iterative methods [12].

Example 2. The same ill-initialization as for kernel 𝑘-means in Example 1 is used by COLL. However, since the winning frequency of the ill-initialized 𝝁2 becomes smaller in the later iterations, it gets the chance to win according to (13). Finally, an almost perfect clustering with a small distortion error is obtained, as shown in Figure 1(d).
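Before turning to the kernel-only computation in the next subsection, rules (13)-(15) can be sketched directly in an explicit feature space to fix ideas. This is our own illustrative NumPy sketch, assuming the points are already given as explicit vectors standing in for 𝜙(x𝑖); the paper itself never forms 𝜙 explicitly, and the kernelized version follows in Section III-B.

```python
import numpy as np

def coll_epoch_explicit(X, prototypes, wins, eta):
    """One COLL epoch in an explicit feature space.

    X          : (n, d) data matrix (stands in for phi(x_1), ..., phi(x_n))
    prototypes : (c, d) current prototypes mu_k
    wins       : (c,) cumulative winning counts n_k
    eta        : learning rate eta_t for this epoch
    """
    n = X.shape[0]
    for i in np.random.permutation(n):
        f = wins / wins.sum()                            # winning frequencies f_k
        d2 = ((X[i] - prototypes) ** 2).sum(axis=1)      # ||phi(x_i) - mu_k||^2
        k = int(np.argmin(f * d2))                       # conscience winner selection (13)
        prototypes[k] += eta * (X[i] - prototypes[k])    # on-line winner update (14)
        wins[k] += 1                                     # winning frequency update (15)
    return prototypes, wins
```

In the full algorithm this epoch is repeated with 𝜂𝑡 = 1/𝑡 until the convergence criterion (9) drops below 𝜖 or 𝑡 reaches 𝑡_max; the next subsection shows how the same rules are carried out when only the kernel matrix 𝐾 is available.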

B. The Computation of COLL

As discussed before, any effective algorithm in the kernel space must compute with only the kernel matrix 𝐾. To this end, we devise an efficient framework for the computation of the proposed COLL based on a novel representation of prototypes termed the prototype descriptor.

Let ℝ̄^{𝑐×(𝑛+1)} denote the 𝑐×(𝑛+1) matrix space satisfying ∀𝐴 ∈ ℝ̄^{𝑐×(𝑛+1)}, 𝐴_{:,𝑛+1} ≥ 0, i.e., the last column of matrix 𝐴 is nonnegative. We define the prototype descriptor based on the kernel trick as follows.

Definition 2 (Prototype descriptor). A prototype descriptor is a matrix 𝑊^𝜙 ∈ ℝ̄^{𝑐×(𝑛+1)} such that the 𝑘-th row represents prototype 𝝁𝑘 by

    W^{\phi}_{k,i} = \langle \mu_k, \phi(x_i) \rangle, \ \forall i = 1, \ldots, n, \quad W^{\phi}_{k,n+1} = \langle \mu_k, \mu_k \rangle,    (17)

i.e.,

    W^{\phi} = \begin{pmatrix} \langle \mu_1, \phi(x_1) \rangle & \ldots & \langle \mu_1, \phi(x_n) \rangle & \langle \mu_1, \mu_1 \rangle \\ \langle \mu_2, \phi(x_1) \rangle & \ldots & \langle \mu_2, \phi(x_n) \rangle & \langle \mu_2, \mu_2 \rangle \\ \vdots & \ddots & \vdots & \vdots \\ \langle \mu_c, \phi(x_1) \rangle & \ldots & \langle \mu_c, \phi(x_n) \rangle & \langle \mu_c, \mu_c \rangle \end{pmatrix}.    (18)

With this definition, the computation of the distortion error (6) becomes

    \sum_{i=1}^{n} \| \phi(x_i) - \mu_{\nu_i} \|^2 = \sum_{i=1}^{n} \big( K_{i,i} + W^{\phi}_{\nu_i, n+1} - 2 W^{\phi}_{\nu_i, i} \big).    (19)

Let us consider the computation of the four ingredients of the proposed COLL model.

Theorem 2 (Initialization). The random initialization can be realized in the way of

    W^{\phi}_{:,1:n} = A K, \quad W^{\phi}_{:,n+1} = \mathrm{diag}(A K A^{\top}),    (20)

where diag(𝑀) denotes the main diagonal of a matrix 𝑀 and the nonnegative matrix 𝐴 = [𝐴_{𝑘,𝑖}]_{𝑐×𝑛} ∈ ℝ_+^{𝑐×𝑛} has the form

    A_{k,i} = \begin{cases} \frac{1}{|\nu^{-1}(k)|} & \text{if } i \in \nu^{-1}(k), \\ 0 & \text{otherwise.} \end{cases}    (21)

That is, the matrix 𝐴 reflects the initial assignment 𝜈.

Proof: Assume the assignment is randomly initialized as 𝜈. Substituting the computation of the prototypes (4) into the definition of 𝑊^𝜙_{𝑘,:} in (17), we get

    W^{\phi}_{k,i} = \langle \mu_k, \phi(x_i) \rangle = \Big\langle \frac{\sum_{j \in \nu^{-1}(k)} \phi(x_j)}{|\nu^{-1}(k)|}, \phi(x_i) \Big\rangle = \frac{1}{|\nu^{-1}(k)|} \sum_{j \in \nu^{-1}(k)} K_{j,i} = \sum_{j=1}^{n} A_{k,j} K_{j,i} = A_{k,:} K_{:,i}, \quad \forall i = 1, \ldots, n,    (22)

and

    W^{\phi}_{k,n+1} = \langle \mu_k, \mu_k \rangle = \Big\langle \frac{\sum_{h \in \nu^{-1}(k)} \phi(x_h)}{|\nu^{-1}(k)|}, \frac{\sum_{l \in \nu^{-1}(k)} \phi(x_l)}{|\nu^{-1}(k)|} \Big\rangle = \sum_{h \in \nu^{-1}(k)} \sum_{l \in \nu^{-1}(k)} \frac{K_{h,l}}{|\nu^{-1}(k)|^2} = \sum_{h=1}^{n} \sum_{l=1}^{n} A_{k,h} A_{k,l} K_{h,l} = A_{k,:} K A_{k,:}^{\top}.    (23)

Thus we obtain the initialization of 𝑊^𝜙 as (20). The proof is finished.

Theorem 3 (Conscience based winner selection rule). The conscience based winner selection rule (13) can be realized in the way of

    \nu_i = \arg\min_{k=1,\ldots,c} \big\{ f_k \cdot ( K_{i,i} + W^{\phi}_{k,n+1} - 2 W^{\phi}_{k,i} ) \big\}.    (24)

Proof: Considering the winner selection rule (13), one gets

    \nu_i = \arg\min_{k=1,\ldots,c} \{ f_k \| \phi(x_i) - \mu_k \|^2 \} = \arg\min_{k=1,\ldots,c} \big\{ f_k \cdot ( K_{i,i} + W^{\phi}_{k,n+1} - 2 W^{\phi}_{k,i} ) \big\}.    (25)

Thus we get the required formula.

Theorem 4 (On-line winner updating rule). The on-line winner updating rule (14) can be realized in the way of

    W^{\phi}_{\nu_i, j} \leftarrow \begin{cases} (1-\eta_t) W^{\phi}_{\nu_i, j} + \eta_t K_{i,j} & j = 1, \ldots, n, \\ (1-\eta_t)^2 W^{\phi}_{\nu_i, j} + \eta_t^2 K_{i,i} + 2 (1-\eta_t) \eta_t W^{\phi}_{\nu_i, i} & j = n+1. \end{cases}    (26)

Proof: Although we do not know the expression of 𝝁𝜈𝑖 exactly, we can simply take 𝝁𝜈𝑖 as a symbol of this prototype and denote its updated version as 𝝁́𝜈𝑖. Substituting the on-line winner updating rule (14) into the winning prototype 𝑊^𝜙_{𝜈𝑖,:}, we have

    W^{\phi}_{\nu_i, j} \leftarrow \langle \acute{\mu}_{\nu_i}, \phi(x_j) \rangle = \langle \mu_{\nu_i} + \eta_t (\phi(x_i) - \mu_{\nu_i}), \phi(x_j) \rangle = (1-\eta_t) \langle \mu_{\nu_i}, \phi(x_j) \rangle + \eta_t \langle \phi(x_i), \phi(x_j) \rangle, \quad \forall j = 1, \ldots, n,    (27)

and, for 𝑗 = 𝑛+1,

    W^{\phi}_{\nu_i, n+1} \leftarrow \langle \acute{\mu}_{\nu_i}, \acute{\mu}_{\nu_i} \rangle = \langle \mu_{\nu_i} + \eta_t (\phi(x_i) - \mu_{\nu_i}), \mu_{\nu_i} + \eta_t (\phi(x_i) - \mu_{\nu_i}) \rangle = (1-\eta_t)^2 \langle \mu_{\nu_i}, \mu_{\nu_i} \rangle + \eta_t^2 \langle \phi(x_i), \phi(x_i) \rangle + 2 (1-\eta_t) \eta_t \langle \mu_{\nu_i}, \phi(x_i) \rangle.    (28)

Then we get the on-line winner updating rule as (26).

It is a bit complicated to compute the convergence criterion without an explicit expression of {𝝁1, . . . , 𝝁𝑐}. Notice that, in one iteration, each point 𝜙(x𝑖) is assigned to one and only one winning prototype. Let the array 𝜋^𝑘 = [𝜋^𝑘_1, 𝜋^𝑘_2, . . . , 𝜋^𝑘_{𝑚𝑘}] store the indices of the 𝑚𝑘 ordered points assigned to the 𝑘-th prototype in one iteration. For instance, if 𝜙(x1), 𝜙(x32), 𝜙(x8), 𝜙(x20), 𝜙(x15) are 5 ordered points assigned to the 2-nd prototype in the 𝑡-th iteration, then the index array of the 2-nd prototype is 𝜋^2 = [𝜋^2_1, 𝜋^2_2, . . . , 𝜋^2_{𝑚2}] = [1, 32, 8, 20, 15] with 𝜋^2_1 = 1, 𝜋^2_2 = 32, 𝜋^2_3 = 8, 𝜋^2_4 = 20, 𝜋^2_5 = 15 and 𝑚2 = 5. The following lemma formulates the cumulative update of the 𝑘-th prototype based on the array 𝜋^𝑘 = [𝜋^𝑘_1, 𝜋^𝑘_2, . . . , 𝜋^𝑘_{𝑚𝑘}].

Lemma 1. In the 𝑡-th iteration, the relationship between the updated prototype 𝝁𝑘 and the old 𝝁̂𝑘 is

    \mu_k = (1-\eta_t)^{m_k} \hat{\mu}_k + \sum_{l=1}^{m_k} (1-\eta_t)^{m_k - l} \eta_t \phi(x_{\pi^k_l}),    (29)

where the array 𝜋^𝑘 = [𝜋^𝑘_1, 𝜋^𝑘_2, . . . , 𝜋^𝑘_{𝑚𝑘}] stores the indices of the 𝑚𝑘 ordered points assigned to the 𝑘-th prototype in this iteration.

Proof: To prove this relationship, we use the principle of mathematical induction. One can easily verify that (29) is true for 𝑚𝑘 = 1 directly from (14),

    \mu_k = \hat{\mu}_k + \eta_t \big( \phi(x_{\pi^k_1}) - \hat{\mu}_k \big) = (1-\eta_t) \hat{\mu}_k + \eta_t \phi(x_{\pi^k_1}).    (30)

Assume that it is true for 𝑚𝑘 = 𝑚, that is, for the first 𝑚 ordered points,

    \mu_k = (1-\eta_t)^{m} \hat{\mu}_k + \sum_{l=1}^{m} (1-\eta_t)^{m-l} \eta_t \phi(x_{\pi^k_l}).    (31)

Then for 𝑚𝑘 = 𝑚+1, i.e., the (𝑚+1)-th point, from (14) we have

    \mu_k \leftarrow \mu_k + \eta_t \big( \phi(x_{\pi^k_{m+1}}) - \mu_k \big) = (1-\eta_t) \mu_k + \eta_t \phi(x_{\pi^k_{m+1}}) = (1-\eta_t)^{m+1} \hat{\mu}_k + \sum_{l=1}^{m+1} (1-\eta_t)^{m+1-l} \eta_t \phi(x_{\pi^k_l}).    (32)

This expression shows that (29) is true for 𝑚𝑘 = 𝑚+1. Therefore, by mathematical induction, it is true for all positive integers 𝑚𝑘.

Theorem 5 (Convergence criterion). The convergence criterion can be computed by

    \mathfrak{e}^{\phi} = \sum_{k=1}^{c} \Bigg( \Big( 1 - \frac{1}{(1-\eta_t)^{m_k}} \Big)^2 W^{\phi}_{k,n+1} + \eta_t^2 \sum_{h=1}^{m_k} \sum_{l=1}^{m_k} \frac{K_{\pi^k_h, \pi^k_l}}{(1-\eta_t)^{h+l}} + 2 \eta_t \Big( 1 - \frac{1}{(1-\eta_t)^{m_k}} \Big) \sum_{l=1}^{m_k} \frac{W^{\phi}_{k, \pi^k_l}}{(1-\eta_t)^{l}} \Bigg).    (33)

Proof: According to Lemma 1, the old 𝝁̂𝑘 can be recovered from the updated 𝝁𝑘 as

    \hat{\mu}_k = \frac{\mu_k}{(1-\eta_t)^{m_k}} - \eta_t \sum_{l=1}^{m_k} \frac{\phi(x_{\pi^k_l})}{(1-\eta_t)^{l}}.    (34)

Substituting it into 𝔢^𝜙 = Σ_{𝑘=1}^{𝑐} ‖𝝁𝑘 − 𝝁̂𝑘‖², we have

    \mathfrak{e}^{\phi} = \sum_{k=1}^{c} \Big\| \mu_k - \frac{\mu_k}{(1-\eta_t)^{m_k}} + \eta_t \sum_{l=1}^{m_k} \frac{\phi(x_{\pi^k_l})}{(1-\eta_t)^{l}} \Big\|^2
        = \sum_{k=1}^{c} \Bigg( \Big( 1 - \frac{1}{(1-\eta_t)^{m_k}} \Big)^2 \langle \mu_k, \mu_k \rangle + \eta_t^2 \sum_{h=1}^{m_k} \sum_{l=1}^{m_k} \frac{\langle \phi(x_{\pi^k_h}), \phi(x_{\pi^k_l}) \rangle}{(1-\eta_t)^{h+l}} + 2 \eta_t \Big( 1 - \frac{1}{(1-\eta_t)^{m_k}} \Big) \sum_{l=1}^{m_k} \frac{\langle \mu_k, \phi(x_{\pi^k_l}) \rangle}{(1-\eta_t)^{l}} \Bigg).    (35)

Thus 𝔢^𝜙 can be computed by (33). This ends the proof.

For clarity, Algorithm 1 summarizes the proposed conscience on-line learning (COLL) method.
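The following sketch shows how Theorems 2-4 and the convergence criterion (33) translate into kernel-only operations on the prototype descriptor 𝑊^𝜙. It is an illustrative Python/NumPy sketch written under the notation above, not the authors' released code; the function names and the assumption that every initial cluster is non-empty are ours.

```python
import numpy as np

def init_descriptor(K, labels, c):
    """Theorem 2: initialize W^phi = [A K | diag(A K A^T)] from a random assignment (20)-(21)."""
    n = K.shape[0]
    A = np.zeros((c, n))
    for k in range(c):
        idx = np.where(labels == k)[0]
        A[k, idx] = 1.0 / len(idx)                 # assumes every cluster receives at least one point
    W = np.zeros((c, n + 1))
    W[:, :n] = A @ K                               # W_{k,i} = <mu_k, phi(x_i)>
    W[:, n] = np.sum((A @ K) * A, axis=1)          # W_{k,n+1} = <mu_k, mu_k> = A_k K A_k^T
    return W

def select_winner(W, K, i, f):
    """Theorem 3: conscience based winner selection (24)."""
    n = K.shape[0]
    d2 = K[i, i] + W[:, n] - 2 * W[:, i]           # ||phi(x_i) - mu_k||^2 for all k
    return int(np.argmin(f * d2))

def update_winner(W, K, i, k, eta):
    """Theorem 4: on-line winner update (26), applied to the k-th row of W^phi."""
    n = K.shape[0]
    old_Wki = W[k, i]                              # <mu_k, phi(x_i)> before the update
    W[k, :n] = (1 - eta) * W[k, :n] + eta * K[i, :]
    W[k, n] = (1 - eta) ** 2 * W[k, n] + eta ** 2 * K[i, i] + 2 * (1 - eta) * eta * old_Wki

def convergence_criterion(W, K, pi, eta):
    """Theorem 5: e^phi computed from the per-cluster index arrays pi[k] via (33)."""
    c, n = W.shape[0], K.shape[0]
    e = 0.0
    for k in range(c):
        idx = np.asarray(pi[k], dtype=int)
        m = len(idx)
        if m == 0:
            continue
        a = 1.0 - 1.0 / (1 - eta) ** m
        w = (1 - eta) ** -np.arange(1, m + 1)      # weights 1/(1-eta)^l, l = 1..m_k
        e += a ** 2 * W[k, n]
        e += eta ** 2 * (w @ K[np.ix_(idx, idx)] @ w)
        e += 2 * eta * a * (w * W[k, idx]).sum()
    return e
```

A full run would initialize 𝑊^𝜙 from a random assignment, then repeat the permutation, selection, update and frequency steps of Algorithm 1 with 𝜂𝑡 = 1/𝑡, evaluating (33) at the end of each epoch until it falls below 𝜖 or 𝑡 reaches 𝑡_max.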


Figure 2. Original two moons dataset and clustering results by kernel 𝑘-means and the proposed COLL method. (a) Original two moons dataset. (b) Kernel 𝑘-means result. It is obvious that degenerate local optimum is obtained due to the sensitivity to ill initialization. (c) Clustering result by COLL with the same initialization. An almost perfect clustering is obtained with only a few points being erroneously assigned.

Algorithm 1: Conscience on-line learning (COLL)
Input: kernel matrix 𝐾 ∈ ℝ^{𝑛×𝑛}, 𝑐, {𝜂𝑡}, 𝜖, 𝑡_max.
Output: cluster assignment 𝜈 subject to (3).
1: Randomly initialize the assignment 𝜈 and set 𝑡 = 0; initialize the winning frequencies {𝑓𝑘} by (12); initialize the prototype descriptor 𝑊^𝜙 ∈ ℝ̄^{𝑐×(𝑛+1)} by (20).
2: repeat
3:   Get 𝑐 empty index arrays {𝜋^𝑘 = ∅ : 𝑘 = 1, . . . , 𝑐} and a random permutation {𝐼1, . . . , 𝐼𝑛 : 𝐼𝑖 ∈ {1, . . . , 𝑛}, s.t. 𝐼𝑖 ≠ 𝐼𝑗, ∀𝑖 ≠ 𝑗}; set 𝑡 = 𝑡 + 1.
4:   for 𝑙 = 1, . . . , 𝑛 do
5:     Select the winning prototype 𝑊^𝜙_{𝜈𝑖,:} of the 𝑖-th point (𝑖 = 𝐼𝑙) by (24), and append 𝑖 to the 𝜈𝑖-th index array 𝜋^{𝜈𝑖}.
6:     Update the winning prototype 𝑊^𝜙_{𝜈𝑖,:} with learning rate 𝜂𝑡 by (26) and the winning frequency by (15).
7:   end for
8:   Compute 𝔢^𝜙 via (33).
9: until 𝔢^𝜙 ≤ 𝜖 or 𝑡 ≥ 𝑡_max
10: Obtain the cluster assignment 𝜈𝑖 by (24), ∀𝑖 = 1, . . . , 𝑛.

C. Computational Complexity

The computation of the proposed COLL method consists of two parts: initialization of 𝑊^𝜙 and iterations to update 𝑊^𝜙. From (20), the initialization of 𝑊^𝜙 takes O(𝑐𝑛²) operations. For each iteration, the computational complexity is O(𝑛(𝑐 + (𝑛+1)) + (𝑐 + 𝑛² + 𝑛)), since O(𝑛(𝑐 + (𝑛+1))) operations are needed to perform the iteration (for each point, O(𝑐) to select a winner and O(𝑛+1) to update the winner, there being 𝑛 points) and O(𝑐 + 𝑛² + 𝑛) operations are needed to compute the convergence criterion 𝔢^𝜙 (the first term of (33) taking O(𝑐) operations, the second term at most O(𝑛²) operations and the third term O(𝑛) operations). Assuming the number of iterations is 𝑡_max, and since in general 1 < 𝑐 < 𝑛, the computational complexity of the iteration procedure is O(𝑡_max(𝑛(𝑐 + (𝑛+1)) + (𝑐 + 𝑛² + 𝑛))) = O(𝑡_max 𝑛²). Consequently, the total computational complexity of the proposed COLL method is O(𝑐𝑛² + 𝑡_max 𝑛²) = O(max(𝑐, 𝑡_max) 𝑛²), which is the same as that of kernel 𝑘-means if the same number of iterations is used. However, due to the conscience mechanism [12] and the on-line learning rule [13], the proposed COLL achieves a faster convergence rate than its counterpart; thus fewer iterations are needed for COLL to converge. This is especially beneficial in large-scale data clustering.

IV. EXPERIMENTS AND APPLICATIONS

In this section, we first performed an experiment on the synthetic two moons dataset to illustrate the effectiveness of the proposed COLL method in handling the ill-initialization problem. Then we performed experiments on four digit datasets to compare the convergence property of COLL with kernel 𝑘-means [4] and global kernel 𝑘-means [11], in terms of both the convergence rate, which is characterized by the convergence criterion (9), and the final distortion error (6). The comparison results reveal that the proposed COLL achieves a faster convergence rate while outputting a much smaller distortion error than the compared methods. Additionally, to demonstrate the effectiveness of COLL in data clustering, another well-known clustering evaluation measurement, namely normalized mutual information (NMI) [15], was used to evaluate the clustering results generated by the compared methods. Finally, we tested the proposed method in the application of video clustering, where COLL generates the best segmentation with the least computational cost.

In all experiments, the Gaussian kernel 𝜅(x𝑖, x𝑗) = exp(−‖x𝑖 − x𝑗‖²/2𝜎²) is used to construct the kernel matrix 𝐾. To obtain a meaningful comparison, on each dataset the same 𝜎 value is used for all the compared methods. Note that appropriate kernel selection is out of the scope of this paper; we only use the Gaussian kernel with some appropriate 𝜎 value chosen according to the distance matrix 𝐷 = [‖x𝑖 − x𝑗‖]_{𝑛×𝑛}. The iteration stopping criteria are set as 𝜖 = 10⁻⁴ and 𝑡_max = 100. The learning rate is set to 𝜂𝑡 = 1/𝑡. All the experiments are implemented in Matlab 7.8.0.347 (R2009a), 64-bit edition, on a workstation (Windows 64-bit, 8 Intel 2.00 GHz processors, 16 GB of RAM).
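As a concrete illustration of this setup, the sketch below builds the Gaussian kernel matrix from a data matrix and, if no 𝜎 is supplied, picks one from the pairwise distance matrix 𝐷. The median off-diagonal distance is our own illustrative heuristic; the paper only states that 𝜎 is chosen according to 𝐷, without giving the exact rule.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=None):
    """Build K with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    If sigma is None, pick it from the pairwise distance matrix D
    (here: the median off-diagonal distance, an illustrative heuristic).
    """
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances ||x_i - x_j||^2
    np.maximum(D2, 0, out=D2)                      # guard against tiny negative values
    if sigma is None:
        D = np.sqrt(D2)
        sigma = np.median(D[np.triu_indices_from(D, k=1)])
    return np.exp(-D2 / (2 * sigma ** 2))
```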

Table I
SUMMARY OF DIGIT DATASETS. 𝑛 IS THE NUMBER OF DATA POINTS; 𝑐 IS THE NUMBER OF CLASSES; 𝑑 IS THE DIMENSIONALITY; "BALANCED" MEANS WHETHER ALL CLASSES ARE OF THE SAME SIZE. THE 𝜎 IS FIXED FOR ALL COMPARED KERNEL-BASED METHODS.

Dataset     𝑛      𝑐   𝑑    Balanced   𝜎
Pendigits   10992  10  16   ×          60.53
Mfeat       2000   10  649  √          809.34
USPS        11000  10  256  √          1286.70
MNIST       5000   10  784  √          2018.30

Table II
AVERAGE DISTORTION ERROR (DE) OVER 100 RUNS FOR THE FOUR DIGIT DATASETS BY THE THREE KERNEL CLUSTERING ALGORITHMS.

Dataset     Kernel 𝑘-means   Global kernel 𝑘-means   Proposed COLL
Pendigits   6704.0           6664.9                  6619.3
Mfeat       1405.5           1363.3                  1324.2
USPS        8004.7           7782.0                  7561.0
MNIST       3034.3           2894.1                  2754.9

A. Synthetic Dataset

We first performed synthetic dataset clustering to demonstrate the effectiveness of the proposed COLL in handling the ill-initialization problem under a nonlinear clustering condition. The classic two moons dataset consisting of 𝑛 = 2000 points was generated as 2 half-circles in ℝ². The 𝜎 for constructing the Gaussian kernel matrix was set at 𝜎 = 0.15 and fixed for all the compared kernel methods.

Figure 2 shows the original two moons dataset and the clustering results obtained by kernel 𝑘-means and COLL respectively. Due to the sensitivity to ill-initialization, kernel 𝑘-means has been trapped into a degenerate local optimum as shown in Figure 2(b). However, since the conscience mechanism is capable of handling the ill-initialization problem by reducing the winning rate of the frequent winners, the clustering result by the proposed COLL with the same initialization is promising, as shown in Figure 2(c). This demonstrates the effectiveness of COLL in tackling the ill-initialization problem.

B. Digit Clustering

For digit clustering, we selected four widely tested digit datasets, including the Pen-based recognition of handwritten digits dataset (Pendigits), the Multi-feature digit dataset (Mfeat), USPS [16] and MNIST [17]. The first 2 datasets are from the UCI repository [18]. Table I summarizes the properties of the four datasets, as well as the 𝜎 used in constructing the Gaussian kernel matrix.

1) Convergence Analysis: We first analyzed the convergence property of the proposed COLL. Figure 3 plots the logarithm of the convergence criterion 𝔢^𝜙 value as a function of the iteration step obtained by COLL and kernel 𝑘-means when the actual number of underlying clusters 𝑐 (i.e., 𝑐 = 10) is provided. Since the main procedure of the global kernel 𝑘-means is to run kernel 𝑘-means many times, it does not output a single convergence criterion. One can observe from Figure 3 that the log(𝔢^𝜙) values on all datasets except USPS (Figure 3(c)) monotonically decrease from relatively large values to small values as the iteration step increases, which implies that the solutions tend to converge. In Figure 3(c), the exception on the USPS dataset is that in the 18-th iteration the prototype descriptor 𝑊^𝜙 (i.e., the prototypes) makes a larger change than in the 21-th iteration, which does not affect the overall convergence since the changes in general become smaller and smaller as the iteration continues. The algorithm converges when the changes reach almost zero, where the minimal distortion error (i.e., Σ_{𝑖=1}^{𝑛} ‖𝜙(x𝑖) − 𝝁𝜈𝑖‖²) is obtained. Moreover, compared with kernel 𝑘-means, the plotted curves reveal that COLL achieves a much faster convergence rate. That is, many fewer iterations are needed in our method to achieve 𝔢^𝜙 < 𝜖, as shown in Figure 3. Please note that the log scale is used, which implies that a small difference between the plotted log values is indeed quite large.

For further comparing the performances of the 3 methods in minimizing the distortion error, two types of comparisons were carried out. One is to compare their performances when the number of clusters 𝑐 is set to different values, and the other is performed when the actual number of underlying clusters 𝑐 is prespecified and fixed (i.e., 𝑐 = 10) for the four digit datasets.

Figure 4 plots the distortion error as a function of the preselected number of clusters obtained by COLL, kernel 𝑘-means and global kernel 𝑘-means. It is obvious that COLL achieves the smallest distortion errors among the compared methods. In particular, it outperforms the latest development of kernel 𝑘-means, i.e., global kernel 𝑘-means, on all four datasets. It should be pointed out that, ideally, the globally minimal distortion error, e.g., the one obtained by testing all possible assignments 𝜈 with the best one selected, should monotonically decrease as the number of clusters increases. However, since all the compared methods are just heuristic methods that achieve local optima, the curves need not be monotonically decreasing.

Table II lists the average values of the final distortion error (DE) with the actual number of clusters 𝑐 preselected. On all the datasets, the proposed COLL has generated the smallest distortion errors among the compared kernel methods. This again validates that COLL has a relatively better convergence property, i.e., it can converge to more appropriate implicit prototypes {𝝁𝑘 : 𝑘 = 1, . . . , 𝑐} such that a much smaller distortion error is obtained.


Figure 3. The log(𝔢^𝜙) value as a function of the iteration step by COLL and kernel 𝑘-means on (a) Pendigits, (b) Mfeat, (c) USPS and (d) MNIST. Note that the log scale is used, which implies that a small difference between the plotted log values is indeed quite large. COLL converges much faster than the compared method, i.e., fewer iterations are needed to achieve 𝔢^𝜙 < 𝜖.

Figure 4. The distortion error as a function of the preselected number of clusters obtained by COLL, kernel 𝑘-means and global kernel 𝑘-means on (a) Pendigits, (b) Mfeat, (c) USPS and (d) MNIST. The proposed COLL achieves the smallest distortion errors on the four digit datasets.

Table III
AVERAGE NORMALIZED MUTUAL INFORMATION (NMI) OVER 100 RUNS FOR THE FOUR DIGIT DATASETS BY THE THREE KERNEL CLUSTERING ALGORITHMS.

Dataset     Kernel 𝑘-means   Global kernel 𝑘-means   Proposed COLL
Pendigits   0.715            0.736                   0.753
Mfeat       0.533            0.542                   0.604
USPS        0.354            0.369                   0.461
MNIST       0.441            0.472                   0.520

In addition to internal validation such as the distortion error, we also tested the compared methods in terms of external clustering validation, with the number of underlying clusters 𝑐 prespecified. Although there exist many external clustering evaluation measurements, such as clustering error, average purity, entropy-based measures [19] and pair-counting based indices [20], as pointed out by Strehl and Ghosh [21], the mutual information provides a sound indication of the shared information between a pair of clusterings. The mutual information based measurements include the variation of information (VI) [22], normalized mutual information (NMI) [21], adjusted mutual information (AMI) [23], etc., and normalized mutual information (NMI) [21] is widely used in measuring how closely the clustering and the underlying class labels match. Given a dataset 𝒳 of size 𝑛, the clustering labels 𝜋 of 𝑐 clusters and the actual class labels 𝜁 of 𝑐̂ classes, a confusion matrix is formed first, where entry (𝑖, 𝑗), denoted 𝑛_𝑖^{(𝑗)}, gives the number of points in cluster 𝑖 and class 𝑗. Then NMI can be computed from the confusion matrix [15]:

    NMI = \frac{2 \sum_{l=1}^{c} \sum_{h=1}^{\hat{c}} \frac{n_l^{(h)}}{n} \log \frac{n_l^{(h)} \, n}{\sum_{i=1}^{\hat{c}} n_l^{(i)} \sum_{i=1}^{c} n_i^{(h)}}}{H(\pi) + H(\zeta)},    (36)

where H(\pi) = -\sum_{i=1}^{c} \frac{n_i}{n} \log \frac{n_i}{n} and H(\zeta) = -\sum_{j=1}^{\hat{c}} \frac{n^{(j)}}{n} \log \frac{n^{(j)}}{n} are the Shannon entropies of the cluster labels 𝜋 and the class labels 𝜁 respectively, with 𝑛𝑖 and 𝑛^{(𝑗)} denoting the number of points in cluster 𝑖 and class 𝑗. A high NMI value indicates that the clustering and the underlying class labels match well. See [15] for further details.

Table III lists the average NMI values over 100 runs for the four digit datasets obtained by the three algorithms. On all datasets, the proposed COLL generates the best results among the compared algorithms. In particular, on the Mfeat dataset COLL obtains a 0.062 higher average NMI than the global kernel 𝑘-means, while on the intractable USPS dataset a 0.092 higher average NMI is achieved, which are great improvements. The comparison results demonstrate the effectiveness of COLL in the application of digit clustering.
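As an illustration of how the NMI measure (36) can be evaluated from a confusion matrix, consider the following sketch. It is our own illustrative implementation of the formula above, not code from the paper; the function name is hypothetical.

```python
import numpy as np

def nmi_from_labels(cluster_labels, class_labels):
    """Normalized mutual information (36) computed from the confusion matrix."""
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    n = len(cluster_labels)
    # confusion matrix: entry (i, j) counts points in cluster i and class j
    C = np.array([[np.sum((cluster_labels == u) & (class_labels == v)) for v in classes]
                  for u in clusters], dtype=float)
    n_i = C.sum(axis=1)       # cluster sizes
    n_j = C.sum(axis=0)       # class sizes
    # mutual information term (summed only over nonzero cells)
    mi = 0.0
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i, j] > 0:
                mi += (C[i, j] / n) * np.log(C[i, j] * n / (n_i[i] * n_j[j]))
    h_pi = -np.sum((n_i / n) * np.log(n_i / n))
    h_zeta = -np.sum((n_j / n) * np.log(n_j / n))
    return 2 * mi / (h_pi + h_zeta)
```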

C. Video Clustering

In this subsection, we report the experimental results in the application of video clustering (automatic segmentation).

Figure 5. Clustering frames of the video sequence ANNI002 into 16 scenes using the proposed COLL (Cluster 1: frames 0001-0061; Cluster 2: 0062-0163; Cluster 3: 0164-0289; Cluster 4: 0290-0400; Cluster 5: 0401-0504; Cluster 6: 0505-0699; Cluster 7: 0700-0827; Cluster 8: 0828-1022; Cluster 9: 1023-1083; Cluster 10: 1084-1238; Cluster 11: 1239-1778; Cluster 12: 1779-1983; Cluster 13: 1984-2119; Cluster 14: 2120-2243; Cluster 15: 2244-2400; Cluster 16: 2401-2492). A satisfying segmentation has been achieved.

Video clustering plays an important role in automatic video summarization/abstraction as a preprocessing step [24]. Consider a video sequence in which the camera is fading/switching/cutting among a number of scenes; the goal of automatic video clustering is to cluster the video frames according to the different scenes. The gray-scale values of the raw pixels were used as the feature vector for each frame. For one video sequence, the frames {f𝑖 ∈ ℝ^𝑑}_{𝑖=1}^{𝑛} are taken as the dataset, where 𝑑 = width × height and 𝑛 is the length of the video sequence. We selected 11 video sequences from the Open Video website [25], which are 11 segments of the whole "NASA 25th Anniversary Show", with 𝑑 = 320 × 240 = 76800 and 𝑛 (i.e., the duration of the sequence) varying from one sequence to another.

Figure 5 illustrates the clustering result of one video sequence, "NASA 25th Anniversary Show, Segment 2" (ANNI002), by the proposed COLL. 2492 frames have been clustered into 16 scenes. Except for the frames from 400 to 405 and 694 to 701, as well as the last two clusters, where the separation boundaries are not so clear, a satisfactory segmentation has been obtained. For comparison, a "ground truth" segmentation of each video sequence has been manually obtained, through which NMI values are computed to compare COLL with kernel 𝑘-means and global kernel 𝑘-means. Table IV lists the average values (over 10 runs) of NMI and computational time in seconds on the 11 video sequences. Additionally, the length of each video sequence (i.e., the number of frames) is also listed. The results in terms of average NMI and computational time reveal that the proposed COLL generates the best segmentation among the compared methods with the least computational time, which is a significant improvement.

Table IV
THE MEANS (OVER 10 RUNS) OF NMI AND COMPUTATIONAL TIME IN SECONDS ON THE 11 VIDEO SEQUENCES. #FRAMES DENOTES THE NUMBER OF FRAMES OF EACH VIDEO SEQUENCE.

Video sequence (#frames)   Kernel 𝑘-means     Global kernel 𝑘-means   Proposed COLL
                           NMI     Time       NMI     Time            NMI     Time
ANNI001  (914)             0.781   72.2       0.801   94.0            0.851   70.4
ANNI002  (2492)            0.705   94.7       0.721   126.4           0.741   89.0
ANNI003  (4265)            0.712   102.2      0.739   139.2           0.762   99.5
ANNI004  (3897)            0.731   98.3       0.750   121.6           0.759   93.6
ANNI005  (11361)           0.645   152.2      0.656   173.3           0.680   141.2
ANNI006  (16588)           0.622   193.0      0.638   255.5           0.642   182.3
ANNI007  (1588)            0.727   81.1       0.740   136.7           0.770   79.1
ANNI008  (2773)            0.749   95.9       0.771   119.0           0.794   81.5
ANNI009  (12304)           0.727   167.0      0.763   184.4           0.781   160.4
ANNI010  (30363)           0.661   257.2      0.709   426.4           0.734   249.0
ANNI011  (1987)            0.738   85.4       0.749   142.7           0.785   83.7

V. CONCLUSIONS

Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable datasets. However, exhaustive search for the global optimum is NP-hard. In this paper, we have presented an efficient and effective approach termed conscience on-line learning (COLL) for solving this optimization problem in the on-line learning framework. Unlike the classic 𝑘-means method, the proposed approach is insensitive to the initial positions of the cluster prototypes. Compared with other techniques aiming at tackling ill-initialization problems, the COLL method achieves a much faster convergence rate, due to both the on-line learning and the conscience mechanism.

To validate the effectiveness and efficiency of the proposed method, three kinds of experiments have been carried out. Experimental results reveal that the proposed COLL method is capable of handling the ill-initialization problem, has a faster convergence rate and obtains a much smaller distortion error than other kernel-based clustering methods, and can thus be applied to real-world applications such as large-scale digit clustering and video clustering.

ACKNOWLEDGMENT

This project was supported by the NSFC-GuangDong (U0835005), NSFC (60633030, 60803083), the 973 Program (2006CB303104) in China, and the GuangDong Program (2010B031000004).

REFERENCES

[1] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645-678, May 2005.

[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, 1967, pp. 281-297.

[4] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.

[5] P. S. Bradley and U. M. Fayyad, "Refining initial points for k-means clustering," in Proc. of the 15th Int. Conf. on Machine Learning, 1998.

[6] S. S. Khan and A. Ahmad, "Cluster center initialization algorithm for k-means clustering," Pattern Recognition Letters, vol. 25, pp. 1293-1302, 2004.

[7] Z. Zhang, B. T. Dai, and A. K. Tung, "On the lower bound of local optimums in k-means algorithm," in Proc. of the 6th Int. Conf. on Data Mining, 2006.

[8] K. Krishna and M. N. Murty, "Genetic k-means algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 3, pp. 433-439, June 1999.

[9] B. Abolhassani, J. E. Salt, and D. E. Dodds, "A two-phase genetic k-means algorithm for placement of radioports in cellular networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, pp. 533-538, 2004.

[10] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, pp. 451-461, 2003.

[11] G. F. Tzortzis and A. C. Likas, "The global kernel 𝑘-means algorithm for clustering in feature space," IEEE Trans. Neural Netw., vol. 20, no. 7, pp. 1181-1194, July 2009.

[12] D. DeSieno, "Adding a conscience to competitive learning," in IEEE Int. Conf. on Neural Netw., 1988.

[13] C. M. Bishop, Pattern Recognition and Machine Learning, M. Jordan, J. Kleinberg, and B. Schölkopf, Eds. Springer, 2006.

[14] B. Schölkopf, "The kernel trick for distances," in Advances in Neural Information Processing Systems, 2000.

[15] I. S. Dhillon, Y. Guan, and B. Kulis, "Kernel k-means, spectral clustering and normalized cuts," in Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2004.

[16] J. J. Hull, "A database for handwritten text recognition research," IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 5, pp. 550-554, May 1994.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998. http://yann.lecun.com/exdb/mnist/

[18] A. Asuncion and D. Newman, "UCI machine learning repository," 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html

[19] A. Strehl, J. Ghosh, and R. J. Mooney, "Impact of similarity measures on web-page clustering," in Proc. AAAI Workshop on AI for Web Search (AAAI 2000), 2000.

[20] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, pp. 193-218, 1985.

[21] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.

[22] M. Meilă, "Comparing clusterings: an axiomatic view," in Proc. of the 22nd International Conference on Machine Learning (ICML 2005), 2005.

[23] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: is a correction for chance necessary?" in Proc. of the 26th International Conference on Machine Learning (ICML 2009), 2009.

[24] B. T. Truong and S. Venkatesh, "Video abstraction: A systematic review and classification," ACM Trans. Multimedia Comput. Commun. Appl., vol. 3, no. 1, Feb. 2007.

[25] The Open Video Project, http://www.open-video.org. Managed at the Interaction Design Laboratory, School of Information and Library Science, University of North Carolina at Chapel Hill.

