Knowl Inf Syst (2012) 31:79–104 DOI 10.1007/s10115-011-0416-2 REGULAR PAPER
Conscience online learning: an efficient approach for robust kernel-based clustering Chang-Dong Wang · Jian-Huang Lai · Jun-Yong Zhu
Received: 24 December 2010 / Revised: 13 March 2011 / Accepted: 11 May 2011 / Published online: 24 May 2011 © Springer-Verlag London Limited 2011
Abstract Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable datasets. However, exhaustive search for the global optimum is NP-hard. Iterative procedures such as k-means can be used to seek one of the local minima. Unfortunately, they are easily trapped into degenerate local minima when the prototypes of clusters are ill-initialized. In this paper, we restate the optimization problem of kernel-based clustering in an online learning framework, whereby a conscience mechanism is easily integrated to tackle the ill-initialization problem and a faster convergence rate is achieved. Thus, we propose a novel approach termed conscience online learning (COLL). For each randomly taken data point, our method selects the winning prototype based on the conscience mechanism, which biases the ill-initialized prototype so as to avoid degenerate local minima, and efficiently updates the winner by the online learning rule. Therefore, it obtains a smaller distortion error than k-means with the same initialization, and does so more efficiently. The rationale of the proposed COLL method is experimentally analyzed. Then, we apply the COLL method to digit clustering and video clustering. The experimental results demonstrate significant improvement over existing kernel-based clustering methods.

Keywords Kernel-based clustering · Conscience mechanism · Online learning · COLL · K-means
C.-D. Wang · J.-H. Lai (B) School of Information Science and Technology, Sun Yat-sen University, Guangzhou, People’s Republic of China e-mail:
[email protected] J.-Y. Zhu School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, People’s Republic of China
1 Introduction

As a fundamental problem, data clustering plays an indispensable role in various fields such as computer science, medical science, social science, and economics [39]. In the literature of data mining and information systems, it has been widely used in the applications of adaptive web user interface design [26], XML document analysis [20], sequence clustering [6], text clustering [13], multi-agent system analysis [27], business intelligence [32], image segmentation [30], etc. Many approaches have been developed from different viewpoints. Wang et al. proposed a large-item-based clustering method for clustering transactions such that there would be many large items within a cluster and little overlapping of such items across clusters [33]. In [18], Liu et al. developed the CLTree (CLustering based on decision Trees) algorithm, which for the first time employs the supervised learning technique of decision tree construction in the clustering methodology. In [37], clustering with pattern preservation was studied and two clustering algorithms were proposed, namely HIerarchical Clustering with pAttern Preservation (HICAP) and bisecting k-means Clustering with pAttern Preservation (K-CAP). From the viewpoint of efficiency, Jin et al. developed a fast and exact k-means clustering (FEKM) that requires only one or a small number of passes over the entire dataset, so that the computational cost is greatly reduced [12]. In real-world applications, it is important to identify nonlinear clusters, and kernel-based clustering is a popular method for this purpose [23]. The basic idea of kernel-based clustering is to seek an assignment of each point to one of several clusters in the kernel feature space such that the within-cluster similarity is high and the between-cluster similarity is low. However, exhaustive search for the optimal assignment of the data points in the projected space is computationally infeasible [23].
Since the number of all possible partitions of a dataset grows exponentially with the number of data points, there is an urgent need for an efficient approach to find satisfactory sub-optimal solutions. The classical k-means is such an iterative method [19,34,35]. Due to its significant influence in the data mining community, the k-means algorithm was identified as one of the top 10 algorithms in data mining [36]. Despite great success, k-means has one serious drawback: its performance easily degenerates in the case of ill-initialization [36]. For instance, in a randomly ill-initialized assignment, some cluster is assigned a small number of remote and isolated points, so that the prototype of this cluster is relatively far away from any points. As a result, in later iterations, this prototype never gets a chance to be assigned any point (Fig. 1b). Several methods have been developed to overcome the ill-initialization problem. Bradley and Fayyad [4] proposed to compute a refined initialization from a given one by estimating the modes of a distribution based on a subsampling technique. Another approach for refining the initial cluster prototypes is based on the observation that some patterns are very similar to each other, so that they share the same cluster membership irrespective of the choice of initial cluster prototypes [14]. Zhang et al. [41] computed a lower bound on the cost of the local optimum from the current prototype set and proposed a BoundIterate method. Evolutionary algorithms such as the genetic algorithm have also been applied to k-means, aiming at avoiding degenerate local minima [1,15]. Very recently, Likas et al. proposed the global k-means, which is deterministic and does not rely on any initial conditions [17,29]. Although these methods reduce the sensitivity to ill-initialization, most of them are computationally expensive.
In this paper, we propose a novel approach termed conscience online learning (COLL) to solve the optimization problem associated with kernel-based clustering in the online learning framework. The proposed method starts with a random guess of the assignment the same as
Fig. 1 Illustration of ill-initialization and comparison of kernel k-means and COLL. Different clusters are plotted with different markers; the initial prototypes are plotted as "+" while the final ones as "∗". a Data points in the feature space. b A random but ill initialization. Each prototype is computed as the mean of the points randomly assigned to the corresponding cluster. µ2 is ill-initialized, i.e., it is relatively far away from any points. c The degenerate result of kernel k-means. The ill-initialization leaves the final µ2 assigned no point. d The procedure and result of COLL. The updating trajectory of the prototypes is plotted in thinner "+". The conscience mechanism successfully makes µ2 win in the iterations, leading to a satisfying result
k-means. However, unlike k-means, in each iteration, for each randomly taken data point, the method selects the winning prototype based on the conscience mechanism [7] and updates the winner by the online learning rule [3]. The procedure requires only one winning prototype to be updated slightly toward the new point, rather than re-computing the mean of each cluster at every step; hence a much faster convergence rate is achieved, and other competitive mechanisms such as conscience can be easily integrated. The advantage of the conscience mechanism is that, by reducing the winning rate of frequent winners, all prototypes are quickly brought into the solution and the ill-initialized prototypes are biased so that each prototype can win the competition with almost the same probability [7]. This paper therefore makes two contributions.
1. The COLL method is insensitive to ill-initialization and can be generalized to tackle other degenerate problems associated with random initialization.
2. Compared with other techniques aimed at tackling the ill-initialization problem, such as global search strategies, our approach achieves a faster convergence rate due to both online learning and the conscience mechanism.
The remainder of the paper is organized as follows. Section 2 formulates the optimization problem of kernel-based clustering and reviews related work. In Sect. 3, we describe the proposed conscience online learning method. The proposed method is experimentally
analyzed in Sect. 4. In Sects. 5 and 6, we apply the proposed method to digit clustering and video clustering, respectively. Finally, we conclude the paper and present future work in Sect. 7. The main results in this paper were first presented in [31].
2 The problem and related work

In this section, the optimization problem of kernel-based clustering is first formulated. Then, we review the commonly used batch kernel k-means and demonstrate that when the prototypes are ill-initialized, the clustering results degenerate seriously.

2.1 Problem formulation

Given an unlabelled dataset X = {x_1, . . . , x_n} of n data points in R^d, which is projected into a kernel space Y by a mapping φ, and the number of clusters c, we wish to find an assignment of each data point to one of the c clusters such that, in the kernel space Y, the within-cluster similarity is high and the between-cluster similarity is low. That is, we seek a map

ν : X → {1, . . . , c}    (1)

to optimize [23]

ν = arg min_ν { Σ_{i,j: ν_i = ν_j} ‖φ(x_i) − φ(x_j)‖² − λ Σ_{i,j: ν_i ≠ ν_j} ‖φ(x_i) − φ(x_j)‖² },    (2)
where λ > 0 is some parameter and we use the short notation ν_i = ν(x_i).

Theorem 1 The optimization criterion (2) is equivalent to the criterion

ν = arg min_ν Σ_{i=1}^n ‖φ(x_i) − µ_{ν_i}‖²,    (3)

where µ_k is the mean of the data points assigned to cluster k,

µ_k = (1/|ν⁻¹(k)|) Σ_{i∈ν⁻¹(k)} φ(x_i), ∀k = 1, . . . , c,    (4)

and ν_i satisfies

ν_i = arg min_{k=1,...,c} ‖φ(x_i) − µ_k‖², ∀i = 1, . . . , n.    (5)

Proof See [23].
Thus, the goal of kernel-based clustering is to solve the optimization problem in (3). The objective term

Σ_{i=1}^n ‖φ(x_i) − µ_{ν_i}‖²    (6)

is known as the distortion error [3]. Ideally, all possible assignments of the data into clusters should be tested and the one with the smallest distortion error selected. This procedure
is unfortunately computationally infeasible for even a very small dataset, since the number of all possible partitions of a dataset grows exponentially with the number of data points. Hence, efficient algorithms are required. In practice, the mapping function φ is often not known or hard to obtain, and the dimensionality of Y is quite high. The feature space Y is characterized by the kernel function κ and the corresponding kernel matrix K [23].

Definition 1 A kernel is a function κ such that κ(x, z) = ⟨φ(x), φ(z)⟩ for all x, z ∈ X, where φ is a mapping from X to an (inner product) feature space Y. A kernel matrix is a square matrix K ∈ R^{n×n} such that K_{i,j} = κ(x_i, x_j) for some x_1, . . . , x_n ∈ X and some kernel function κ.

Thus, an efficient approach must also carry out its computation using only the kernel matrix.

2.2 Batch kernel k-means

The k-means [19] algorithm is one of the most popular iterative methods for solving the optimization problem (3). It begins by initializing a random assignment ν and seeks to minimize the distortion error by iteratively updating the assignment ν,

ν_i ← arg min_{k=1,...,c} ‖φ(x_i) − µ_k‖², ∀i = 1, . . . , n,    (7)

and the prototypes µ,

µ_k ← (1/|ν⁻¹(k)|) Σ_{i∈ν⁻¹(k)} φ(x_i), ∀k = 1, . . . , c,    (8)
until all prototypes converge or the number of iterations reaches a prespecified value t_max. Let µ̂ denote the old prototypes before the tth iteration; the convergence of all prototypes is characterized by the convergence criterion

e^φ = Σ_{k=1}^c ‖µ_k − µ̂_k‖² ≤ ε,    (9)

where ε is some very small positive value, e.g., 10⁻⁴. Since in practice only the kernel matrix K is available, the updating of the assignment (7) is computed based on the kernel trick [21]:

ν_i ← arg min_{k=1,...,c} ‖φ(x_i) − µ_k‖²
    = arg min_{k=1,...,c} ‖φ(x_i) − (1/|ν⁻¹(k)|) Σ_{j∈ν⁻¹(k)} φ(x_j)‖²
    = arg min_{k=1,...,c} ( K_{i,i} + Σ_{h∈ν⁻¹(k)} Σ_{l∈ν⁻¹(k)} K_{h,l} / |ν⁻¹(k)|² − 2 Σ_{j∈ν⁻¹(k)} K_{i,j} / |ν⁻¹(k)| ).    (10)
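The assignment update (10) uses nothing but entries of K. As an illustration, a minimal NumPy sketch might read as follows (the function name and calling convention are ours, not the paper's):

```python
import numpy as np

def assign_kernel_kmeans(K, labels, c):
    """One assignment update of batch kernel k-means, following Eq. (10).

    K      : (n, n) kernel matrix
    labels : (n,) current cluster assignment with values in 0..c-1
    c      : number of clusters
    Returns the new assignment. Distances to each implicit prototype are
    computed from K alone, without ever forming the mapping phi.
    """
    n = K.shape[0]
    dist = np.full((n, c), np.inf)
    for k in range(c):
        members = np.flatnonzero(labels == k)
        if members.size == 0:          # empty cluster: leave its distance infinite
            continue
        # ||phi(x_i) - mu_k||^2 = K_ii - 2 * mean_j K_ij + mean_{h,l} K_hl
        second = K[:, members].sum(axis=1) / members.size
        third = K[np.ix_(members, members)].sum() / members.size ** 2
        dist[:, k] = np.diag(K) - 2.0 * second + third
    return dist.argmin(axis=1)
```

Alternating this step with the implicit prototype update (8), i.e., simply re-reading the new labels, until the labels stop changing reproduces batch kernel k-means.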
Then, the updated prototypes µ are implicitly expressed by the assignment ν, which is further used in the next iteration. Let ν̂ denote the old assignment before the tth iteration; the convergence criterion (9) is then computed as
e^φ = Σ_{k=1}^c ‖µ_k − µ̂_k‖²
    = Σ_{k=1}^c ( Σ_{h∈ν⁻¹(k)} Σ_{l∈ν⁻¹(k)} K_{h,l} / |ν⁻¹(k)|² − 2 Σ_{h∈ν⁻¹(k)} Σ_{l∈ν̂⁻¹(k)} K_{h,l} / (|ν⁻¹(k)||ν̂⁻¹(k)|) + Σ_{h∈ν̂⁻¹(k)} Σ_{l∈ν̂⁻¹(k)} K_{h,l} / |ν̂⁻¹(k)|² ).    (11)
The above procedure leads to the well-known kernel k-means [22]. Given an appropriate initialization, it can find a sub-optimal solution (local minimum). Despite great success in practice, it is quite sensitive to the initial positions of the cluster prototypes, which can lead to degenerate local minima.

Example 1 In Fig. 1b, µ2 is ill-initialized relatively far away from any point, so that it gets no chance to be assigned points and updated in the iterations of k-means. As a result, the clustering result of k-means is trapped in the degenerate local minimum shown in Fig. 1c, where µ2 is assigned no point while µ3 is inappropriately located between two classes.

To overcome this problem, the k-means algorithm and its variants are usually run many times with different initial prototypes, and the result with the smallest distortion error (6) is selected [29]; this is, however, computationally expensive. In this paper, we propose an efficient and effective approach to the optimization problem (3) that outputs a smaller distortion error and is insensitive to the initialization.
3 Conscience online learning

In this section, we first introduce the proposed conscience online learning (COLL) model, which is insensitive to ill-initialization. Then, the computation of the COLL model with only the kernel matrix K, rather than the kernel mapping φ, is provided.

3.1 The conscience online learning model

Let n_k denote the cumulative winning number of the kth prototype, and f_k = n_k / Σ_{l=1}^c n_l the corresponding winning frequency. In the beginning, they are initialized as

n_k = |ν⁻¹(k)|, f_k = n_k / n, ∀k = 1, . . . , c.    (12)
The proposed conscience online learning (COLL) is performed as follows: initialize the same random assignment ν as k-means, and iteratively update the assignment ν and the prototypes µ based on the frequency-sensitive (conscience) online learning rule. That is, in the tth iteration, for one randomly taken data point φ(x_i), select the winning prototype µ_{ν_i} guided by the winning frequency f_k, i.e.,

conscience-based winner selection rule: ν_i = arg min_{k=1,...,c} { f_k ‖φ(x_i) − µ_k‖² },    (13)

and update the winner µ_{ν_i} with learning rate η_t, i.e.,

online winner updating rule: µ_{ν_i} ← µ_{ν_i} + η_t (φ(x_i) − µ_{ν_i}),    (14)

as well as update the winning frequency

n_{ν_i} ← n_{ν_i} + 1, f_k = n_k / Σ_{l=1}^c n_l, ∀k = 1, . . . , c.    (15)
The iteration procedure continues until all prototypes converge or the number of iterations reaches a prespecified value t_max. The convergence criterion identical to that of k-means (9) is used. The learning rates {η_t} satisfy the conditions [3]

lim_{t→∞} η_t = 0, Σ_{t=1}^∞ η_t = ∞, Σ_{t=1}^∞ η_t² < ∞.    (16)
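For intuition, rules (13)–(15) can be sketched in input space, i.e., taking φ as the identity map; the kernel-matrix form used by COLL proper is derived in Sect. 3.2. The helper below is a hypothetical illustration, with η_t = const/t as discussed next:

```python
import numpy as np

def coll_step(x, prototypes, n_wins, t, const=0.5):
    """One conscience online learning step, following Eqs. (13)-(15),
    sketched in input space (phi taken as the identity map).

    x          : (d,) the randomly drawn data point
    prototypes : (c, d) current prototypes, updated in place
    n_wins     : (c,) cumulative winning counts n_k, updated in place
    t          : current iteration index (>= 1); learning rate eta_t = const/t
    Returns the index of the winning prototype.
    """
    freq = n_wins / n_wins.sum()              # winning frequencies f_k, Eq. (15)
    d2 = ((prototypes - x) ** 2).sum(axis=1)  # ||x - mu_k||^2
    winner = int(np.argmin(freq * d2))        # conscience-based selection, Eq. (13)
    eta = const / t
    prototypes[winner] += eta * (x - prototypes[winner])  # online update, Eq. (14)
    n_wins[winner] += 1
    return winner
```

In the test scenario below, the point x = 4 is nearest to prototype 0, yet prototype 1 wins because prototype 0 has already won far more often: this is exactly the bias toward infrequent winners described above.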
In practice, η_t = const/t, where const is some small constant, e.g., 1. By reducing the winning chance of the frequent winners according to (13), the goal of the conscience mechanism is to bring all prototypes into the solution quickly and to bias the competitive process so that each prototype can win the competition with almost the same probability. In this way, the proposed COLL model is insensitive to ill-initialization and thus prevents the result from being trapped in degenerate local minima, while converging much faster than other iterative methods [7].

Example 2 COLL uses the same ill-initialization as kernel k-means in Example 1. However, since the winning frequency of the ill-initialized µ2 becomes smaller in later iterations, it gets the chance to win according to (13). Finally, an almost perfect clustering with small distortion error is obtained, as shown in Fig. 1d.

3.2 The computation of COLL

As discussed before, any effective algorithm in the kernel space must compute with only the kernel matrix K. To this end, we devise an efficient framework for the computation of the proposed COLL based on a novel representation of prototypes termed the prototype descriptor. Let R̄^{c×(n+1)} denote the c × (n+1) matrix space satisfying ∀A ∈ R̄^{c×(n+1)}, A_{:,n+1} ≥ 0, i.e., the last column of matrix A is nonnegative. We define the prototype descriptor based on the kernel trick as follows.

Definition 2 (Prototype descriptor) A prototype descriptor is a matrix W^φ ∈ R̄^{c×(n+1)} such that the kth row represents prototype µ_k by

W^φ_{k,i} = ⟨µ_k, φ(x_i)⟩, ∀i = 1, . . . , n,  W^φ_{k,n+1} = ⟨µ_k, µ_k⟩,    (17)

i.e.,

W^φ = ( ⟨µ_1, φ(x_1)⟩ . . . ⟨µ_1, φ(x_n)⟩ ⟨µ_1, µ_1⟩
        ⟨µ_2, φ(x_1)⟩ . . . ⟨µ_2, φ(x_n)⟩ ⟨µ_2, µ_2⟩
        . . .
        ⟨µ_c, φ(x_1)⟩ . . . ⟨µ_c, φ(x_n)⟩ ⟨µ_c, µ_c⟩ ).    (18)
With this definition, the computation of the distortion error (6) becomes

Σ_{i=1}^n ‖φ(x_i) − µ_{ν_i}‖² = Σ_{i=1}^n ( ⟨φ(x_i), φ(x_i)⟩ + ⟨µ_{ν_i}, µ_{ν_i}⟩ − 2⟨µ_{ν_i}, φ(x_i)⟩ )
                             = Σ_{i=1}^n ( K_{i,i} + W^φ_{ν_i,n+1} − 2W^φ_{ν_i,i} ).    (19)
Let us consider the computation of the four ingredients of the proposed COLL model.

Theorem 2 (Initialization) The random initialization can be realized as

W^φ_{:,1:n} = AK,  W^φ_{:,n+1} = diag(AKA^T),    (20)

where diag(M) denotes the main diagonal of a matrix M and the nonnegative matrix A = [A_{k,i}]_{c×n} ∈ R_+^{c×n} has the form

A_{k,i} = { 1/|ν⁻¹(k)| if i ∈ ν⁻¹(k);  0 otherwise.    (21)

That is, the matrix A reflects the initial assignment ν.

Proof Assume the assignment is randomly initialized as ν. Substituting the computation of the prototypes (4) into the definition of W^φ_{k,:} in (17), we get
W^φ_{k,i} = ⟨µ_k, φ(x_i)⟩, ∀i = 1, . . . , n
         = ⟨ Σ_{j∈ν⁻¹(k)} φ(x_j) / |ν⁻¹(k)|, φ(x_i) ⟩
         = Σ_{j∈ν⁻¹(k)} K_{j,i} / |ν⁻¹(k)| = Σ_{j=1}^n A_{k,j} K_{j,i} = A_{k,:} K_{:,i},    (22)

W^φ_{k,n+1} = ⟨µ_k, µ_k⟩
           = ⟨ Σ_{h∈ν⁻¹(k)} φ(x_h) / |ν⁻¹(k)|, Σ_{l∈ν⁻¹(k)} φ(x_l) / |ν⁻¹(k)| ⟩
           = Σ_{h∈ν⁻¹(k)} Σ_{l∈ν⁻¹(k)} K_{h,l} / |ν⁻¹(k)|²
           = Σ_{h=1}^n Σ_{l=1}^n A_{k,h} A_{k,l} K_{h,l} = A_{k,:} K A_{k,:}^T.    (23)

Thus, we obtain the initialization of W^φ as (20). The proof is finished.
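A direct NumPy transcription of (20)–(21) might look as follows, assuming a 0-based label vector in place of the assignment ν (the function name is ours):

```python
import numpy as np

def init_prototype_descriptor(K, labels, c):
    """Initialize the prototype descriptor W^phi from a random assignment,
    following Theorem 2, Eqs. (20)-(21). Assumes every cluster is non-empty.

    Returns a (c, n+1) matrix whose first n columns are A K and whose
    last column is diag(A K A^T).
    """
    n = K.shape[0]
    A = np.zeros((c, n))
    for k in range(c):
        members = np.flatnonzero(labels == k)
        A[k, members] = 1.0 / members.size   # Eq. (21)
    W = np.empty((c, n + 1))
    W[:, :n] = A @ K                         # Eq. (20), first n columns
    W[:, n] = np.diag(A @ K @ A.T)           # Eq. (20), last column
    return W
```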
Theorem 3 (Conscience-based winner selection rule) The conscience-based winner selection rule (13) can be realized as

ν_i = arg min_{k=1,...,c} { f_k · (K_{i,i} + W^φ_{k,n+1} − 2W^φ_{k,i}) }.    (24)

Proof Considering the winner selection rule (13), one gets

ν_i = arg min_{k=1,...,c} { f_k ‖φ(x_i) − µ_k‖² }
    = arg min_{k=1,...,c} { f_k · (K_{i,i} + W^φ_{k,n+1} − 2W^φ_{k,i}) }.    (25)

Thus, we get the required formula.
Theorem 4 (Online winner updating rule) The online winner updating rule (14) can be realized as

W^φ_{ν_i,j} ← (1 − η_t) W^φ_{ν_i,j} + η_t K_{i,j},  j = 1, . . . , n,
W^φ_{ν_i,n+1} ← (1 − η_t)² W^φ_{ν_i,n+1} + η_t² K_{i,i} + 2(1 − η_t)η_t W^φ_{ν_i,i}.    (26)

Proof Although we do not know the expression of µ_{ν_i} exactly, we can simply take µ_{ν_i} as a symbol of this prototype and denote its updated version by µ́_{ν_i}. Substituting the online winner updating rule (14) into the winning prototype W^φ_{ν_i,:}, we have

W^φ_{ν_i,j} ← ⟨µ́_{ν_i}, φ(x_j)⟩, ∀j = 1, . . . , n
           = ⟨µ_{ν_i} + η_t (φ(x_i) − µ_{ν_i}), φ(x_j)⟩
           = (1 − η_t)⟨µ_{ν_i}, φ(x_j)⟩ + η_t ⟨φ(x_i), φ(x_j)⟩,    (27)

W^φ_{ν_i,n+1} ← ⟨µ́_{ν_i}, µ́_{ν_i}⟩
             = ⟨µ_{ν_i} + η_t (φ(x_i) − µ_{ν_i}), µ_{ν_i} + η_t (φ(x_i) − µ_{ν_i})⟩
             = (1 − η_t)² ⟨µ_{ν_i}, µ_{ν_i}⟩ + η_t² ⟨φ(x_i), φ(x_i)⟩ + 2(1 − η_t)η_t ⟨µ_{ν_i}, φ(x_i)⟩.    (28)

Then, we get the online winner updating rule as (26).
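A sketch of rule (26) in NumPy (the function name is ours). Note that the last column must be updated before the first n columns, since its right-hand side uses the old value W^φ_{ν_i,i}:

```python
import numpy as np

def update_winner_row(W, K, winner, i, eta):
    """Online winner updating rule in descriptor form (Theorem 4, Eq. (26)).
    Updates row `winner` of W in place after point i wins with rate eta."""
    n = K.shape[0]
    # Last column first: it depends on the OLD value W[winner, i].
    W[winner, n] = ((1 - eta) ** 2 * W[winner, n]
                    + eta ** 2 * K[i, i]
                    + 2 * (1 - eta) * eta * W[winner, i])
    # First n columns, Eq. (26) for j = 1..n.
    W[winner, :n] = (1 - eta) * W[winner, :n] + eta * K[i, :]
```

The test below checks the rule against an explicitly updated prototype under a linear kernel on 1-D data, where the inner products in (17) can be computed directly.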
It is a bit complicated to compute the convergence criterion without an explicit expression of {µ_1, . . . , µ_c}. Notice that, in one iteration, each point φ(x_i) is assigned to one and only one winning prototype. Let the array π^k = [π^k_1, π^k_2, . . . , π^k_{m_k}] store the indices of the m_k ordered points assigned to the kth prototype in one iteration. For instance, if φ(x_1), φ(x_32), φ(x_8), φ(x_20), φ(x_15) are the 5 ordered points assigned to the 2nd prototype in the tth iteration, then the index array of the 2nd prototype is π² = [π²_1, π²_2, . . . , π²_{m_2}] = [1, 32, 8, 20, 15], with π²_1 = 1, π²_2 = 32, π²_3 = 8, π²_4 = 20, π²_5 = 15 and m_2 = 5. The following lemma formulates the cumulative update of the kth prototype based on the array π^k.

Lemma 1 In the tth iteration, the relationship between the updated prototype µ_k and the old µ̂_k is

µ_k = (1 − η_t)^{m_k} µ̂_k + Σ_{l=1}^{m_k} (1 − η_t)^{m_k − l} η_t φ(x_{π^k_l}),    (29)

where the array π^k = [π^k_1, π^k_2, . . . , π^k_{m_k}] stores the indices of the m_k ordered points assigned to the kth prototype in this iteration.
Proof We use the principle of mathematical induction. One can easily verify that (29) is true for m_k = 1, directly from (14):

µ_k = µ̂_k + η_t (φ(x_{π^k_1}) − µ̂_k)
    = (1 − η_t)¹ µ̂_k + Σ_{l=1}^1 (1 − η_t)⁰ η_t φ(x_{π^k_l}).    (30)

Assume that it is true for m_k = m, that is, for the first m ordered points,

µ_k = (1 − η_t)^m µ̂_k + Σ_{l=1}^m (1 − η_t)^{m−l} η_t φ(x_{π^k_l}).    (31)

Then for m_k = m + 1, i.e., after the (m+1)th point, from (14) we have

µ_k = µ_k + η_t (φ(x_{π^k_{m+1}}) − µ_k)
    = (1 − η_t) µ_k + η_t φ(x_{π^k_{m+1}})
    = (1 − η_t) ( (1 − η_t)^m µ̂_k + Σ_{l=1}^m (1 − η_t)^{m−l} η_t φ(x_{π^k_l}) ) + η_t φ(x_{π^k_{m+1}})
    = (1 − η_t)^{m+1} µ̂_k + Σ_{l=1}^{m+1} (1 − η_t)^{m+1−l} η_t φ(x_{π^k_l}).    (32)

This shows that (29) is true for m_k = m + 1. Therefore, by mathematical induction, it is true for all positive integers m_k.

Theorem 5 (Convergence criterion) The convergence criterion can be computed by

e^φ = Σ_{k=1}^c W^φ_{k,n+1} (1 − 1/(1 − η_t)^{m_k})² + η_t² Σ_{k=1}^c Σ_{h=1}^{m_k} Σ_{l=1}^{m_k} K_{π^k_h, π^k_l} / (1 − η_t)^{h+l}
      + 2η_t Σ_{k=1}^c (1 − 1/(1 − η_t)^{m_k}) Σ_{l=1}^{m_k} W^φ_{k,π^k_l} / (1 − η_t)^l.    (33)
Proof According to Lemma 1, the old µ̂_k can be recovered from the updated µ_k as

µ̂_k = µ_k / (1 − η_t)^{m_k} − η_t Σ_{l=1}^{m_k} φ(x_{π^k_l}) / (1 − η_t)^l.    (34)

Substituting it into e^φ = Σ_{k=1}^c ‖µ_k − µ̂_k‖², we have

e^φ = Σ_{k=1}^c ‖ µ_k − µ_k/(1 − η_t)^{m_k} + η_t Σ_{l=1}^{m_k} φ(x_{π^k_l})/(1 − η_t)^l ‖²
    = Σ_{k=1}^c ⟨µ_k, µ_k⟩ (1 − 1/(1 − η_t)^{m_k})² + η_t² Σ_{k=1}^c Σ_{h=1}^{m_k} Σ_{l=1}^{m_k} ⟨φ(x_{π^k_h}), φ(x_{π^k_l})⟩ / (1 − η_t)^{h+l}
      + 2η_t Σ_{k=1}^c (1 − 1/(1 − η_t)^{m_k}) Σ_{l=1}^{m_k} ⟨µ_k, φ(x_{π^k_l})⟩ / (1 − η_t)^l.    (35)

Thus, e^φ can be computed by (33). This ends the proof.
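Lemma 1 is easy to check numerically: applying rule (14) point by point and evaluating the closed form (29) must give the same prototype. A small sketch (function names are ours):

```python
import numpy as np

def sequential_update(mu_old, points, eta):
    """Apply the online rule (14) to each point in order."""
    mu = np.array(mu_old, dtype=float)
    for x in points:
        mu = mu + eta * (x - mu)
    return mu

def closed_form_update(mu_old, points, eta):
    """Cumulative update of Lemma 1, Eq. (29):
    mu = (1-eta)^m mu_old + sum_{l=1}^m (1-eta)^(m-l) eta x_l."""
    m = len(points)
    mu = (1 - eta) ** m * np.array(mu_old, dtype=float)
    for l, x in enumerate(points, start=1):
        mu = mu + (1 - eta) ** (m - l) * eta * np.asarray(x, dtype=float)
    return mu
```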
Fig. 2 Illustration of the winner updating procedure of COLL. The function φ embeds the data into a feature space where the nonlinear pattern becomes linear. COLL is then performed in this feature space. The new inputs are denoted x and φ(x), while the arrows plot the update of the winning prototype, i.e., of W^φ_{ν,:}. The linear separator in the feature space is actually nonlinear in the input space, and so is the update of the winner
For clarity, Algorithm 1 summarizes the proposed conscience online learning (COLL) method. Figure 2 illustrates the winner updating procedure of COLL. The flowchart describing the detailed steps of the COLL algorithm is shown in Fig. 3.

3.3 Computational complexity

The computation of the proposed COLL method consists of two parts: the initialization of W^φ and the iterations that update W^φ. From (20), the initialization of W^φ takes O(cn²) operations. For each iteration, the computational complexity is O(n(c + (n+1)) + (c + n² + n)): O(n(c + (n+1))) operations are needed for the sweep over the points (for each point, O(c) to select a winner and O(n+1) to update it, there being n points), and O(c + n² + n) operations are needed to compute the convergence criterion e^φ (the first term of (33) taking O(c) operations, the second at most O(n²), and the third O(n)). Assuming the iteration number is t_max, and since in general 1 < c < n, the computational complexity of the iteration procedure is O(t_max (n(c + (n+1)) + (c + n² + n))) = O(t_max n²). Consequently, the total computational complexity of the proposed COLL method is O(cn² + t_max n²) = O(max(c, t_max) n²), which is the same as that of kernel k-means when the same number of iterations is used. However, due to the conscience mechanism [7] and the online learning rule [3], the proposed COLL achieves a faster convergence rate than its counterpart; thus, fewer iterations are needed for COLL to converge. This is especially beneficial in large-scale data clustering.
4 Experimental analysis

In all experiments except the one on the linearly separable dataset containing 33 2-dimensional points, the Gaussian kernel κ(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)) is used to construct the kernel matrix K. To obtain a meaningful comparison, on each dataset the same σ value is used for all compared methods. Note that appropriate kernel selection is out of the scope of this paper; we simply use the Gaussian kernel with an appropriate σ value chosen according to the distance matrix D = [‖x_i − x_j‖]_{n×n}. The iteration stopping criteria are set as ε = 10⁻⁴ and t_max = 100. The learning rate is set to η_t = 1/t. Many restarts have been conducted, and in each run the same random initial prototypes are used for all compared methods. All the experiments are implemented in Matlab 7.8.0.347 (R2009a) 64-bit edition on a workstation (Windows 64-bit, 8 Intel 2.00 GHz processors, 16 GB of RAM).
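For reference, the kernel matrix used throughout the experiments can be built as follows (a straightforward NumPy sketch; the function name is ours):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)),
    as used in the experiments. X has shape (n, d)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))
```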
Fig. 3 The flowchart of the proposed COLL algorithm
In this section, by conducting simulation experiments, we show how the conscience mechanism works to avoid degenerate clustering results in the case of ill-initialization. Then, on the synthetic two moons dataset, we demonstrate the effectiveness of the proposed COLL in avoiding degenerate results in kernel-based clustering.
Algorithm 1: Conscience online learning (COLL)
Input: kernel matrix K ∈ R^{n×n}, c, {η_t}, ε, t_max.
Output: cluster assignment ν subject to (3).
1: Randomly initialize the assignment ν and set t = 0; initialize the winning frequencies {f_k} by (12); initialize the prototype descriptor W^φ ∈ R̄^{c×(n+1)} by (20).
2: repeat
3:   Get c empty index arrays {π^k = ∅ : k = 1, . . . , c} and a random permutation {I_1, . . . , I_n : I_i ∈ {1, . . . , n}, s.t. I_i ≠ I_j, ∀i ≠ j}; set t = t + 1.
4:   for p = 1, . . . , n do
5:     Select the winning prototype W^φ_{ν_i,:} of the ith point (i = I_p) by (24), and append i to the ν_i th index array π^{ν_i}.
6:     Update the winning prototype W^φ_{ν_i,:} with learning rate η_t by (26) and the winning frequencies by (15).
7:   end for
8:   Compute e^φ via (33).
9: until e^φ ≤ ε or t ≥ t_max
10: Obtain the cluster assignment ν_i by (24), ∀i = 1, . . . , n.
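A compact NumPy sketch of Algorithm 1, driven only by the kernel matrix, is given below. It follows (20), (24), (26), and (33); the default const = 0.5 (chosen so that η_t < 1 and the denominators (1 − η_t)^l in (33) stay away from zero at t = 1), the random seed, and the redraw of empty initial clusters are our implementation choices, not part of the paper's pseudocode:

```python
import numpy as np

def coll(K, c, eps=1e-4, t_max=100, const=0.5, seed=0):
    """Kernel-matrix-only sketch of Algorithm 1 (COLL)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, c, size=n)          # random initial assignment nu
    while len(np.unique(labels)) < c:            # keep Eq. (21) well defined
        labels = rng.integers(0, c, size=n)
    n_wins = np.bincount(labels, minlength=c).astype(float)
    A = np.zeros((c, n))
    for k in range(c):
        A[k, labels == k] = 1.0 / n_wins[k]
    W = np.hstack([A @ K, np.diag(A @ K @ A.T)[:, None]])    # Eq. (20)
    dK = np.diag(K)
    for t in range(1, t_max + 1):
        eta = const / t
        wins = [[] for _ in range(c)]            # index arrays pi^k
        for i in rng.permutation(n):
            f = n_wins / n_wins.sum()
            k = int(np.argmin(f * (dK[i] + W[:, n] - 2 * W[:, i])))  # Eq. (24)
            wins[k].append(int(i))
            # Eq. (26): last column first, since it uses the old W[k, i]
            W[k, n] = ((1 - eta) ** 2 * W[k, n] + eta ** 2 * K[i, i]
                       + 2 * (1 - eta) * eta * W[k, i])
            W[k, :n] = (1 - eta) * W[k, :n] + eta * K[i, :]
            n_wins[k] += 1
        e = 0.0                                   # convergence criterion, Eq. (33)
        for k in range(c):
            idx = np.array(wins[k], dtype=int)
            m = idx.size
            if m == 0:
                continue
            a = 1.0 - 1.0 / (1 - eta) ** m
            pw = (1 - eta) ** -np.arange(1.0, m + 1.0)
            e += (W[k, n] * a ** 2
                  + eta ** 2 * pw @ K[np.ix_(idx, idx)] @ pw
                  + 2 * eta * a * (pw @ W[k, idx]))
        if e <= eps:
            break
    f = n_wins / n_wins.sum()
    d = dK[None, :] + W[:, [n]] - 2 * W[:, :n]    # (c, n) squared distances
    return np.argmin(f[:, None] * d, axis=0)      # final assignment, Eq. (24)
```

On two well-separated blobs with a Gaussian kernel, the returned assignment separates the groups even from a random initialization, as the conscience mechanism forces both prototypes into play.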
Table 1 Data points of three classes in the synthetic dataset

Class 1 (x, y)       Class 2 (x, y)      Class 3 (x, y)
(2.9218, 8.8770)     (4.1444, 2.5162)    (10.5870, 5.1868)
(1.9110, 10.1535)    (4.2405, 1.3046)    (9.0512, 4.2030)
(1.5779, 8.0387)     (4.5244, 2.0368)    (9.6120, 4.6046)
(0.9247, 9.6863)     (5.5208, 2.6206)    (8.7061, 4.4281)
(2.4751, 9.1016)     (4.2434, 2.7171)    (8.2116, 5.8210)
(2.0162, 9.0584)     (5.2367, 1.9404)    (9.4845, 5.4187)
(1.1672, 9.5068)     (5.2912, 0.6894)    (9.7337, 5.1716)
(1.7634, 9.8439)     (5.1090, 1.2244)    (10.2885, 5.1700)
(0.7064, 8.2426)     (5.3921, 2.1546)    (8.6259, 4.4191)
(2.3040, 9.9326)     (4.1707, 2.5616)    (7.8489, 5.7022)
                     (5.7125, 2.3275)    (10.1436, 5.2449)
                     (5.0803, 1.5668)
The synthetic dataset contains 33 2-dimensional points belonging to three classes, as shown in Table 1. We randomly located three initial prototypes µ1, µ2, µ3 at

µ1 = (4.4198, 6.2457)^T,  µ2 = (1.7387, 9.2810)^T,  µ3 = (6.9966, 3.4642)^T.    (36)
Figures 4 and 5 demonstrate the online learning procedures and results of COLL and online learning without conscience mechanism (denoted as OLL for short), respectively. In both cases, the prototypes randomly initialized in (36) are so inappropriate that the central one is relatively far away from any data points. By using the conscience mechanism, the COLL approach can bias the ill-initialized prototype (red one) to be selected as the winning prototype, such that it would move toward one cluster as shown in Fig. 4. After seven iterations, the 3 prototypes converged to the
Fig. 4 Demonstration of conscience on-line learning (COLL) iteration procedure
appropriate values:

µ1 = (4.8643, 2.1817)^T,  µ2 = (1.7662, 9.2323)^T,  µ3 = (9.1317, 4.9178)^T.    (37)
Figure 6 plots the winning frequencies of the three prototypes as a function of the competition index over the seven iterations. As the iterations continued, the winning frequencies converged to stable values that characterize the proportions of the corresponding clusters. However, for online learning without the conscience mechanism (OLL), the ill-initialized prototype never gets a chance to be selected as the winner, leading to degenerate clustering. The three prototypes learned by OLL were trapped in the degenerate result

µ1 = (4.4198, 6.2457)^T,  µ2 = (1.7845, 9.2629)^T,  µ3 = (6.9738, 3.4361)^T,    (38)

where one prototype is assigned no point while another is located between two classes, as shown in Fig. 5.

We further performed synthetic dataset clustering to demonstrate the effectiveness of COLL in handling the ill-initialization problem in kernel-based clustering. The classic two moons dataset consisting of n = 2,000 points was generated as two half circles in R². The σ for constructing the Gaussian kernel matrix κ(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)) was set to 0.15. Figure 7 shows the original two moons dataset and the clustering results obtained by kernel k-means [22]
[Figure: scatter-plot panels titled "Initialization", "Procedure of iteration 1" through "Procedure of iteration 7", and "Results", each showing the three clusters and three prototypes.]
Fig. 5 Demonstration of the online learning without conscience mechanism (OLL) iteration procedure
and COLL, respectively. Due to its sensitivity to ill-initialization, kernel k-means is trapped in a degenerate local optimum, as shown in Fig. 7b. In contrast, since the conscience mechanism handles ill-initialization by reducing the winning rate of the frequent winner, the clustering result of the proposed COLL with the same initialization is promising, as shown in Fig. 7c. This demonstrates the effectiveness of COLL in tackling the ill-initialization problem in kernel-based clustering.
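The conscience mechanism discussed above can be illustrated with a short sketch. The following is a simplified Euclidean-space version in the spirit of DeSieno's conscience [7], not the paper's kernel-space COLL algorithm; the bias form, learning rate, and function names are our own assumptions:

```python
import numpy as np

def conscience_learning(X, c, lr=0.05, gamma=2.0, n_epochs=7, seed=0):
    """Conscience-guided competitive learning (Euclidean sketch).

    For each randomly taken point, the winner minimizes squared distance
    plus a bias growing with its winning frequency, so a frequently
    winning prototype eventually yields to a starved, ill-initialized one.
    """
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    wins = np.zeros(c)  # per-prototype win counts
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            freq = wins / max(wins.sum(), 1.0)  # winning frequencies
            d2 = np.sum((prototypes - x) ** 2, axis=1)
            winner = int(np.argmin(d2 + gamma * (freq - 1.0 / c)))
            wins[winner] += 1
            prototypes[winner] += lr * (x - prototypes[winner])  # online update
    return prototypes, wins / wins.sum()

# Three well-separated blobs, 11 points each (33 points, as in the toy example).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(11, 2))
               for m in ([5.0, 2.0], [2.0, 9.0], [9.0, 5.0])])
protos, freqs = conscience_learning(X, c=3)
```

Without the `gamma * (freq - 1/c)` term, the loop reduces to plain online competitive learning (OLL), which is exactly the variant that leaves an ill-initialized prototype starved.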
5 Applications in digit clustering

This section reports experimental results in the application of digit clustering on four widely tested digit datasets. We compared the proposed COLL method with classical kernel k-means [22] and two variants proposed to overcome the ill-initialization problem, namely kernel k-means based on the genetic algorithm (genetic kernel k-means) [15] and global kernel k-means [29]. Comparison results on the convergence rate in terms of the convergence criterion (9) and the final distortion error (6) reveal that the proposed COLL achieves a faster convergence rate while producing much smaller distortion error than the compared methods. Additionally, given the "ground-truth" cluster labels, two widely used external clustering evaluations, namely classification rate (CR) and normalized mutual information (NMI) [8]
Fig. 6 The winning frequencies of the three prototypes as a function of the competition index over seven iterations. In each iteration, there are 33 competitions (i.e., 33 data points), so the total number of competitions is 231. According to the definition and updating of the winning frequency, i.e., (12) and (15), in the first four iterations the denominator, i.e., the number of competitions, is not large (since there are only 33 points in this synthetic dataset), so incrementing the numerator by one has a big influence on the value of the winning frequency. Therefore, the winning frequencies change considerably in the first four iterations. As the competition continues, the denominator becomes much larger and the fraction is less affected by the increments of the numerator and denominator. Therefore, the amount of change becomes smaller and smaller, and the winning frequencies converge to characterize the proportions of the corresponding clusters
Fig. 7 Original two moons dataset and clustering results by kernel k-means and the proposed COLL method. a Original two moons dataset. b Degenerate clustering by kernel k-means: a degenerate local optimum is obtained due to the sensitivity to ill-initialization. c Clustering result by the proposed COLL with the same initialization: an almost perfect clustering is obtained, with only a few points erroneously assigned
were computed to compare the four algorithms. Comparison results demonstrate the effectiveness of the proposed COLL method in digit clustering.

5.1 Digit datasets

For digit clustering, we selected four widely tested digit datasets: the Pen-based recognition of handwritten digits dataset (Pendigits), the Multi-feature digit dataset (Mfeat), USPS [11], and MNIST [16]. The first two datasets are from the UCI repository [2]. The Pendigits dataset contains 10,992 digits of 10 unbalanced classes {0, 1, . . . , 9}; the number of attributes is 16. The Mfeat dataset contains 2,000 handwritten numerals of 10 balanced classes {0, . . . , 9}, represented by 6 feature sets, including mfeat-fou (76 Fourier coefficients of the character shapes),
Table 2 Summary of digit datasets

Dataset     n       c    d    Balanced   σ
Pendigits   10,992  10   16   ×          60.53
Mfeat       2,000   10   649  √          809.34
USPS        11,000  10   256  √          1,286.70
MNIST       5,000   10   784  √          2,018.30

n is the number of data points; c is the number of classes; d is the dimensionality; "Balanced" means whether all classes are of the same size. The σ is fixed for all compared kernel-based methods
Fig. 8 Some samples of MNIST dataset
mfeat-fac (216 profile correlations), mfeat-kar (64 Karhunen-Loève coefficients), mfeat-pix (240 pixel averages in 2 × 3 windows), mfeat-zer (47 Zernike moments), and mfeat-mor (6 morphological features). The USPS dataset contains 11,000 scaled handwritten digit images of size 16 × 16, with 1,100 images for each digit category. The MNIST dataset used in this paper contains 5,000 scaled handwritten digit images of size 28 × 28, with 500 images for each digit category. Table 2 summarizes the properties of the four datasets, as well as the σ used in constructing the Gaussian kernel matrix. Figure 8 shows some samples of the MNIST dataset.

5.2 Convergence analysis

We first analyzed the convergence property of the proposed COLL. Figure 9 plots the logarithm of the convergence criterion value $e^{\phi}$ as a function of the iteration step obtained by COLL and kernel k-means when the actual number of underlying clusters c (i.e., c = 10) is provided. Since the main procedures of genetic kernel k-means and global kernel k-means involve running kernel k-means many times, they do not output a single convergence criterion. One can observe from Fig. 9 that $\log(e^{\phi})$ on all datasets except USPS (Fig. 9c) decreases monotonically from relatively large values to small values as the iteration step increases, which implies that the solution tends to converge. The exception on the USPS dataset (Fig. 9c) is that in the 18th iteration the prototype descriptor $W^{\phi}$ (i.e., the prototypes) makes a larger change than in the 21st iteration; this does not affect the overall convergence, since the changes in general become smaller and smaller as the iterations continue. The algorithm converges when the changes reach almost zero, where the minimal distortion error (i.e., $\sum_{i=1}^{n} \|\phi(x_i) - \mu_{\nu_i}\|^2$) is obtained. Moreover, compared with kernel k-means, the plotted curves reveal that COLL achieves a much faster convergence rate.
That is, many fewer iterations are needed in our method to achieve $e^{\phi} < \epsilon$, as shown in Fig. 9. Please note that the logarithm is used, so a small difference between the plotted log values is in fact quite large.
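The distortion error above can be evaluated directly from the kernel matrix, since the implicit prototypes never need to be formed explicitly: expanding $\|\phi(x_i) - \mu_k\|^2$ leaves only kernel entries. The following is a minimal sketch of that standard kernel k-means identity (the function name is ours, not necessarily the authors' implementation):

```python
import numpy as np

def kernel_distortion(K, labels):
    """Distortion sum_i ||phi(x_i) - mu_{nu_i}||^2 from the kernel matrix only.

    Expanding the squared norm gives, per cluster C_k,
    sum_{i in C_k} K_ii - (1/|C_k|) * sum_{i,j in C_k} K_ij.
    """
    err = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        Kc = K[np.ix_(idx, idx)]
        err += np.trace(Kc) - Kc.sum() / len(idx)
    return err

# Linear kernel sanity check: phi is the identity map, so the kernel
# distortion must equal the ordinary Euclidean k-means distortion.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
K = X @ X.T
labels = np.array([0, 0, 1, 1])
err = kernel_distortion(K, labels)  # 0.5 per cluster, 1.0 in total
```

With a nonlinear kernel, the same function applies unchanged, which is what makes the criterion computable even though the feature-space prototypes are implicit.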
Fig. 9 The $\log(e^{\phi})$ value as a function of the iteration step by COLL and kernel k-means. Note that the logarithm is used, so a small difference between the plotted log values is in fact quite large. COLL converges much faster than the compared method, i.e., fewer iterations are needed to achieve $e^{\phi} < \epsilon$. a Pendigits. b Mfeat. c USPS. d MNIST
5.3 Comparing using internal measurement

We compared the performances of the four methods in terms of minimizing the distortion error, which is a widely used internal measurement. Two types of comparisons were carried out: one compares their performances when the number of clusters c is set to different values, and the other is performed when the actual number of underlying clusters is prespecified and fixed (i.e., c = 10) for the four digit datasets. Figure 10 plots the distortion error as a function of the preselected number of clusters obtained by the four methods. It is obvious that the proposed COLL achieves the smallest distortion errors among the compared methods. In particular, it outperforms the latest development of kernel k-means, i.e., global kernel k-means, on all four datasets. It should be pointed out that, ideally, the globally minimal distortion error, e.g., the one obtained by testing all possible assignments ν and selecting the best, would decrease monotonically as the number of clusters increases. However, since all the compared methods are heuristic methods that achieve local optima, their curves need not be monotonically decreasing. Table 3 lists the average final distortion errors (DE) when the actual number of underlying clusters is prespecified and fixed. On all datasets, the proposed COLL generates the smallest distortion errors among the compared kernel methods. This again validates that COLL holds a relatively better convergence property, i.e., it can converge to more appropriate implicit prototypes {µk : k = 1, . . . , c} such that a much smaller distortion error is obtained.
Fig. 10 The distortion error as a function of the preselected number of clusters obtained by the four methods. The proposed COLL achieves the smallest distortion errors on the four digit datasets. a Pendigits. b Mfeat. c USPS. d MNIST

Table 3 Average distortion error (DE) over 100 runs for the four digit datasets by the four kernel-based clustering algorithms

Dataset     Kernel k-means   Genetic kernel k-means   Global kernel k-means   COLL
Pendigits   6,704.0          6,680.4                  6,664.9                 6,619.3
Mfeat       1,405.5          1,397.0                  1,363.3                 1,324.2
USPS        8,004.7          7,905.9                  7,782.0                 7,561.0
MNIST       3,034.3          2,913.5                  2,894.1                 2,754.9
5.4 Comparing using external measurement

Apart from internal evaluations like the distortion error, we also tested the compared methods in terms of external clustering evaluations with the number of underlying clusters c prespecified. Two widely adopted external evaluations, namely classification rate (CR) and normalized mutual information (NMI) [24], were used. Although there exist many external clustering evaluation measurements, such as clustering errors, average purity, entropy-based measures [25], and pair-counting-based indices [10], as pointed out by Strehl and Ghosh [24], the mutual information provides a sound indication of the shared information between a pair of clusterings, and CR and NMI are widely used in measuring how closely the clustering and the underlying class labels match. For computing the classification rate, each learned category is first associated with the "ground-truth" category that accounts for the largest number of samples in the learned category; then, the classification rate (CR) is computed as the ratio of the number of correctly
Table 4 Average classification rate (CR) and normalized mutual information (NMI) over 100 runs for the four digit datasets by the four kernel-based clustering algorithms

              Kernel k-means   Genetic kernel   Global kernel    COLL
                               k-means          k-means
Digit dataset CR (%)  NMI      CR (%)  NMI      CR (%)  NMI      CR (%)  NMI
Pendigits     71.0    0.715    72.8    0.730    73.3    0.736    75.0    0.753
Mfeat         53.1    0.533    53.5    0.537    53.8    0.542    60.1    0.604
USPS          35.2    0.354    36.0    0.363    36.6    0.369    45.8    0.461
MNIST         43.7    0.441    45.2    0.455    47.0    0.472    51.9    0.520
classified samples to the size of the dataset. That is,
$$\mathrm{CR} = \frac{\#\text{correctly classified samples}}{\#\text{samples in the dataset}} \times 100\% \tag{39}$$
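Equation (39), together with the majority-label association described above, can be sketched as follows (an illustrative implementation; function and variable names are ours):

```python
import numpy as np

def classification_rate(pred, truth):
    """CR of Eq. (39): associate each learned cluster with its majority
    ground-truth class, then count the correctly classified samples."""
    correct = 0
    for k in np.unique(pred):
        members = truth[pred == k]
        correct += np.bincount(members).max()  # size of the majority class
    return 100.0 * correct / len(truth)

pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])
truth = np.array([1, 1, 0, 2, 2, 0, 0, 1])
cr = classification_rate(pred, truth)  # 6 of 8 correct -> 75.0
```

Note that the mapping is many-to-one: two learned clusters may map to the same class, which is exactly what happens under a degenerate clustering.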
Obviously, a higher classification rate indicates a more accurate clustering. Given a dataset of size n, the clustering labels π of c clusters, and the actual class labels ζ of ĉ classes, a confusion matrix is formed first, where entry (i, j), $n_i^{(j)}$, gives the number of points in cluster i and class j. Then, NMI can be computed from the confusion matrix [8]
$$\mathrm{NMI} = \frac{2 \sum_{l=1}^{c} \sum_{h=1}^{\hat{c}} \frac{n_l^{(h)}}{n} \log \frac{n_l^{(h)}\, n}{\sum_{i=1}^{c} n_i^{(h)} \sum_{i=1}^{\hat{c}} n_l^{(i)}}}{H(\pi) + H(\zeta)} \tag{40}$$
where $H(\pi) = -\sum_{i=1}^{c} \frac{n_i}{n} \log \frac{n_i}{n}$ and $H(\zeta) = -\sum_{j=1}^{\hat{c}} \frac{n^{(j)}}{n} \log \frac{n^{(j)}}{n}$ are the Shannon entropies of the cluster labels π and class labels ζ, respectively, with $n_i$ and $n^{(j)}$ denoting the number of points in cluster i and class j. A high NMI value indicates that the clustering and the underlying class labels match well.

Table 4 lists the average CR and NMI values over 100 runs for the four digit datasets obtained by the four algorithms. On all datasets, the proposed COLL generates the best results among the compared algorithms. In particular, on the Mfeat dataset, COLL obtains 0.062 higher average NMI and 6.3% higher average CR than global kernel k-means, while on the more intractable USPS dataset, 0.092 higher average NMI and 9.2% higher average CR are achieved, which are great improvements.

Figure 11 plots the average (over 100 runs) computational time in seconds on the four digit datasets by the four algorithms. That is, we ran each method on each dataset and recorded the computational time used to output the clustering results from the input kernel matrix. This was repeated 100 times, and the average time for each method on each dataset is plotted. From the figure, the proposed method is the fastest among the compared methods, which validates that the COLL approach achieves a much faster convergence rate than the other methods. The main reasons are as follows. Although the computational complexities of COLL and kernel k-means are the same, i.e., $O(\max(c, t_{\max}) n^2)$, due to the conscience mechanism [7] and the online learning rule [3], COLL achieves a faster convergence rate than its counterpart. Additionally, the computational complexity of global kernel k-means is $O(c\, t_{\max} n^2)$ [29], which is much larger than that of COLL and kernel k-means.
The most time-consuming algorithm is genetic kernel k-means, whose computational complexity is $O(\max(c, t_{\max})\, n^2 g_{\max})$, where $g_{\max}$ is the maximum number of generations, typically set to 50 [15].
Fig. 11 The average (running 100 times) computational time in seconds on the four digit datasets by the four algorithms
6 Applications in video clustering

In this section, we report experimental results in the application of video clustering (automatic segmentation). Video clustering plays an important role in automatic video summarization/abstraction as a preprocessing step [28]. Consider a video sequence in which the camera fades/switches/cuts among a number of scenes; the goal of automatic video clustering is to cluster the video frames according to the different scenes. The difference between digit clustering and video clustering lies in the feature spaces of the digit dataset and the video sequence, especially at the class boundaries. The class boundaries between different digits are clearer than those between different scenes, since the camera may fade/switch/cut among different scenes. However, the within-class variation of a digit dataset is much larger than that of a video sequence. In video clustering, the grayscale values of the raw pixels were used as the feature vector for each frame. For one video sequence, the frames $\{f_i \in \mathbb{R}^d\}_{i=1}^{n}$ are taken as the dataset, where d = width × height and n is the length of the video sequence. We selected 11 video sequences from the open-video website [9], which are 11 segments of the whole "NASA 25th Anniversary Show", with d = 320 × 240 = 76,800 and n (i.e., the duration of the sequence) varying from one sequence to another. Figure 12 illustrates the clustering result on one video sequence, "NASA 25th Anniversary Show, Segment 2" (ANNI002), by the proposed COLL. Its 2,492 frames have been clustered into 16 scenes, and each scene is plotted with three representative frames. From the figure, we can see that the camera fades/switches/cuts among a number of scenes in this video sequence. Except for the frames from 400 to 405 and 694 to 701, as well as the last two clusters, where the separation boundaries are not so clear, satisfactory segmentation has been obtained.
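The frame representation described above, each frame flattened into a d = width × height grayscale vector, can be sketched as follows (frame decoding itself, e.g., with a video library, is omitted; the helper name and toy frames are our own):

```python
import numpy as np

def frames_to_dataset(frames):
    """Flatten grayscale frames (height x width) into the n x d data matrix,
    d = width * height, as used for video clustering."""
    return np.stack([np.asarray(f, dtype=float).ravel() for f in frames])

# Toy stand-ins for decoded 320 x 240 grayscale frames.
frames = [np.zeros((240, 320)), np.ones((240, 320))]
X = frames_to_dataset(frames)  # shape (2, 76800)
```

The resulting matrix feeds the same kernel-matrix construction used for the digit datasets; only the feature vectors differ.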
For comparison, “ground-truth” segmentation of each video sequence has been manually obtained, through which the CR and NMI values are computed to compare COLL with kernel k-means, genetic kernel k-means, and global kernel k-means. Table 5 lists the average values (running 10 times) of CR and NMI on the 11 video sequences. Additionally, the length of each video sequence (i.e., number of the frames) is also listed. The results in terms of average values of CR and NMI reveal that the proposed COLL generates the best
123
100
C.-D. Wang et al.
Cluster 1: 0001-0061
Cluster 2: 0062-0163
Cluster 3: 0164-0289
Cluster 4: 0290-0400
Cluster 5: 0401-0504
Cluster 6: 0505-0699
Cluster 7: 0700-0827
Cluster 8: 0828-1022
Cluster 9: 1023-1083
Cluster 10: 1084-1238
Cluster 11: 1239-1778
Cluster 12: 1779-1983
Cluster 13: 1984-2119
Cluster 14: 2120-2243
Cluster 15: 2244-2400
Cluster 16: 2401-2492
Fig. 12 Clustering frames of the video sequence ANNI002 into 16 scenes using the proposed COLL. The first, middle, and last frames of each segment are plotted, and the beginning and ending frames of each segment are shown as in "Cluster 2: 0062-0163", which means the beginning and ending frames of cluster 2 are Frame 62 and Frame 163, respectively. Except for four boundaries (between clusters 4 and 5, clusters 6 and 7, clusters 14 and 15, and clusters 15 and 16), the obtained cluster boundaries are the same as the true class boundaries. These misclassified boundaries are marked in red

Table 5 The means (running 10 times) of CR and NMI on 11 video sequences

                   Kernel k-means   Genetic kernel   Global kernel    COLL
                                    k-means          k-means
Video (#frames)    CR (%)  NMI      CR (%)  NMI      CR (%)  NMI      CR (%)  NMI
ANNI001 (914)      77.9    0.781    78.5    0.788    80.0    0.801    85.0    0.851
ANNI002 (2492)     70.2    0.705    71.4    0.715    71.9    0.721    73.8    0.741
ANNI003 (4265)     70.9    0.712    72.2    0.724    73.5    0.739    76.0    0.762
ANNI004 (3897)     72.8    0.731    73.7    0.740    74.8    0.750    75.7    0.759
ANNI005 (11361)    64.4    0.645    64.8    0.649    65.2    0.656    67.7    0.680
ANNI006 (16588)    62.0    0.622    62.8    0.630    63.5    0.638    64.0    0.642
ANNI007 (1588)     72.5    0.727    72.7    0.729    73.8    0.740    76.8    0.770
ANNI008 (2773)     74.8    0.749    75.1    0.753    76.8    0.771    79.3    0.794
ANNI009 (12304)    72.4    0.727    74.2    0.744    76.0    0.763    78.0    0.781
ANNI010 (30363)    65.9    0.661    69.0    0.691    70.8    0.709    73.2    0.734
ANNI011 (1987)     73.6    0.738    74.1    0.743    74.7    0.749    78.4    0.785

#frames denotes the number of frames of each sequence
segmentation among the compared methods, which is a significant improvement. Additionally, Fig. 13 plots the average (over 10 runs) computational time in seconds on the eleven video sequences by the four algorithms. We can see that the proposed COLL method is quite efficient in this case as well, which validates its faster convergence rate. In summary, the experimental results have demonstrated the effectiveness of the COLL method in video clustering.
Fig. 13 The average (running 10 times) computational time in seconds on the eleven video sequences by the four algorithms. a ANNI001∼ANNI004, b ANNI005∼ANNI008, c ANNI009∼ANNI011
7 Conclusions and future work

Kernel-based clustering is one of the most popular methods for partitioning nonlinearly separable datasets. However, exhaustive search for the global optimum is NP-hard. In this paper, we have presented an efficient and effective approach termed conscience online learning (COLL) for solving this optimization problem in an online learning framework. Unlike the classical k-means method, the proposed approach is insensitive to the initial positions of the cluster prototypes. Compared with other techniques aiming to tackle the ill-initialization problem, the COLL method achieves a much faster convergence rate, due to both the online learning rule and the conscience mechanism. The rationale of the proposed COLL method has been experimentally analyzed on synthetic datasets. We have then applied the COLL method to the applications of digit clustering and video clustering. The experimental results have demonstrated significant improvement over existing kernel-based clustering methods.

In our future work, we will consider automatic cluster number estimation in kernel-based clustering. In this paper, we have presented an online learning framework for kernel-based clustering and utilized the conscience mechanism to tackle the ill-initialization problem. In the literature of data clustering, automatically estimating the number of clusters is another very important research topic [5,38,40]. Due to the specific computation of kernel-based clustering, it is much more difficult to realize automatic cluster number selection there. Fortunately, based on the proposed online learning framework, we can incorporate existing model-selection strategies such as rival penalization [5,38] and self-splitting [40] into the online learning framework to realize automatic cluster number selection in kernel-based clustering.
In this way, it becomes possible to develop a clustering method with simultaneous capabilities of automatically estimating the cluster number and identifying nonlinearly separable clusters. Acknowledgments This project was supported by the NSFC-GuangDong (U0835005), NSFC (60803083), 973 Program (2006CB303104) in China, and GuangDong Program (2010B031000004). The authors would like to thank the PC chairs of ICDM 2010 for selecting the paper for publication in KAIS and thank the ICDM 2010 reviewers for their comments that are very helpful in extending the paper.
References 1. Abolhassani B, Salt JE, Dodds DE (2004) A two-phase genetic k-means algorithm for placement of radioports in cellular networks. IEEE Trans Syst Man Cybern B Cybern 34:533–538
2. Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
3. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
4. Bradley PS, Fayyad UM (1998) Refining initial points for k-means clustering. In: Proceedings of the 15th international conference on machine learning
5. Cheung Y-M (2005) On rival penalization controlled competitive learning for clustering with automatic cluster number selection. IEEE Trans Knowl Data Eng 17:1583–1588
6. Denton AM, Besemann CA, Dorr DH (2009) Pattern-based time-series subsequence clustering using radial distribution functions. Knowl Inf Syst 18:1–27
7. DeSieno D (1988) Adding a conscience to competitive learning. In: Proceedings of the IEEE international conference on neural networks
8. Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining
9. http://www.open-video.org (n.d.) The Open Video Project is managed at the Interaction Design Laboratory, at the School of Information and Library Science, University of North Carolina at Chapel Hill
10. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
11. Hull JJ (1994) A database for handwritten text recognition research. IEEE Trans Pattern Anal Mach Intell 16(5):550–554
12. Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10:17–40
13. Jing L, Ng MK, Huang JZ (2010) Knowledge-based vector space model for text clustering. Knowl Inf Syst 25:35–55
14. Khan SS, Ahmad A (2004) Cluster center initialization algorithm for k-means clustering. Pattern Recognit Lett 25:1293–1302
15. Krishna K, Murty MN (1999) Genetic k-means algorithm. IEEE Trans Syst Man Cybern B Cybern 29(3):433–439
16. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. http://yann.lecun.com/exdb/mnist/
17.
Likas A, Vlassis N, Verbeek JJ (2003) The global k-means clustering algorithm. Pattern Recognit 36:451–461
18. Liu B, Xia Y, Yu PS (2000) Clustering through decision tree construction. In: Proceedings of the 9th international conference on information and knowledge management
19. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, California, pp 281–297
20. Nayak R (2008) Fast and effective clustering of XML data using structural information. Knowl Inf Syst 14:197–215
21. Schölkopf B (2000) The kernel trick for distances. Adv Neural Inf Process Syst 301–307
22. Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
23. Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
24. Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
25. Strehl A, Ghosh J, Mooney RJ (2000) Impact of similarity measures on web-page clustering. In: Proceedings of the AAAI workshop on AI for web search (AAAI 2000)
26. Su Z, Yang Q, Zhang H, Xu X, Hu Y-H, Ma S (2002) Correlation-based web document clustering for adaptive web interface design. Knowl Inf Syst 4:151–167
27. Takacs B, Demiris Y (2010) Spectral clustering in multi-agent systems. Knowl Inf Syst 25:607–622
28. Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimed Comput Commun Appl 3(1):1–37
29. Tzortzis GF, Likas AC (2009) The global kernel k-means algorithm for clustering in feature space. IEEE Trans Neural Netw 20(7):1181–1194
30. Wang C-D, Lai J-H (2011) Energy based competitive learning. Neurocomputing 74:2265–2275
31.
Wang C-D, Lai J-H, Zhu J-Y (2010) A conscience on-line learning approach for kernel-based clustering. In: Proceedings of the 10th international conference on data mining, pp 531–540
32. Wang J, Wu X, Zhang C (2005) Support vector machines based on k-means clustering for real-time business intelligence systems. Int J Bus Intell Data Min 1:54–64
33. Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Proceedings of the 8th international conference on information and knowledge management
34. Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining
35. Wu J, Xiong H, Chen J, Zhou W (2007) A generalization of proximity functions for k-means. In: Proceedings of the 7th international conference on data mining
36. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14:1–37
37. Xiong H, Steinbach M, Ruslim A, Kumar V (2009) Characterizing pattern preserving clustering. Knowl Inf Syst 19:311–336
38. Xu L, Krzyżak A, Oja E (1993) Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans Neural Netw 4(4):636–649
39. Xu R, Wunsch D II (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
40. Zhang Y-J, Liu Z-Q (2002) Self-splitting competitive learning: a new on-line clustering paradigm. IEEE Trans Neural Netw 13(2):369–380
41. Zhang Z, Dai BT, Tung AK (2006) On the lower bound of local optimums in k-means algorithm. In: Proceedings of the 6th international conference on data mining
Author Biographies

Chang-Dong Wang received the B.S. degree in applied mathematics in 2008 and the M.Sc. degree in computer science in 2010 from Sun Yat-sen University, Guangzhou, P. R. China. He began pursuing the Ph.D. degree at Sun Yat-sen University in September 2010. His current research interests include machine learning, pattern recognition, data mining, and computer vision, with a particular focus on data clustering. He was selected for the IEEE TCII Student Travel Award and gave a 20-minute oral presentation at the 10th IEEE International Conference on Data Mining, December 14–17, 2010, Sydney, Australia. His ICDM paper titled "A Conscience Online Learning Approach for Kernel-Based Clustering" was awarded as one of the best papers invited for further publication in the international journal Knowledge and Information Systems.
Jian-Huang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an Assistant Professor, where he is currently a Professor with the Department of Automation of the School of Information Science and Technology and vice dean of the School of Information Science and Technology. Dr. Lai successfully organized the International Conference on Advances in Biometric Personal Authentication 2004, which was also the Fifth Chinese Conference on Biometric Recognition (Sinobiometrics'04), Guangzhou, in December 2004. He has taken charge of more than five research projects, including NSF-Guangdong (No. U0835005), NSFC (No. 60144001, 60373082, 60675016), the Key (Keygrant) Project of the Chinese Ministry of Education (No. 105134), and NSF of Guangdong, China (No. 021766, 06023194). He has published over 80 scientific papers in international journals and conferences on image processing and pattern recognition. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, wavelets, and their applications. Prof. Lai serves as a standing member of the Image and Graphics Association of China and also serves as a standing director of the Image and Graphics Association of Guangdong.
Jun-Yong Zhu received the B.S. and M.S. degrees in the School of Mathematics and Computational Science from Sun Yat-sen University, Guangzhou, P. R. China, in 2008 and 2010, respectively. He is currently working toward the Ph.D. degree in the Department of Mathematics, Sun Yat-sen University. His current research interests include machine learning, transfer learning using auxiliary data, and pattern recognition, such as heterogeneous face recognition.