Neural Processing Letters (2005) 22:249–262 DOI 10.1007/s11063-005-8016-3

© Springer 2005

TCSOM: Clustering Transactions Using Self-Organizing Map

ZENGYOU HE, XIAOFEI XU and SHENGCHUN DENG
Department of Computer Science and Engineering, Harbin Institute of Technology, 92 West Dazhi Street, P.O. Box 315, Harbin 150001, P. R. China. e-mail: [email protected], {xiaofei, dsc}@hit.edu.cn

Abstract. Self-Organizing Map (SOM) networks have been successfully applied as a clustering method to numeric datasets. However, it is not feasible to directly apply SOM to transactional data. This paper proposes the Transactions Clustering using SOM (TCSOM) algorithm for clustering binary transactional data. In the TCSOM algorithm, a dissimilarity measure based on a normalized dot product is utilized for measuring the distance between an input vector and an output neuron, and a modified weight adaptation function is employed for adjusting the weights of the winner and its neighbors. More importantly, TCSOM is a one-pass algorithm, which makes it extremely suitable for data mining applications. Experimental results on real datasets show that the TCSOM algorithm is superior to state-of-the-art transactional data clustering algorithms with respect to clustering accuracy.

Key words. clustering, self-organizing map, transactions, categorical data, data mining

1. Introduction

Clustering algorithms partition a data set into several disjoint groups such that points in the same group are similar to each other according to some similarity metric. Recently, increasing attention has been paid to clustering categorical data [1–21], where records are made up of non-numerical attributes. Fast and accurate clustering of categorical data has many potential applications in customer relationship management, e-commerce intelligence, etc.

This work focuses on clustering binary transactional datasets. Binary data sets are interesting and useful for a variety of reasons [22]. They are the simplest form of data available in a computer and can be used to represent categorical data. From a clustering point of view, they offer several advantages: there is no concept of noise as in quantitative data, and they can be efficiently stored, indexed and retrieved [22]. Since all dimensions share the same scale, there is no need to transform the dataset.

The Self-Organizing Map (SOM) is a robust form of unsupervised Neural Network (NN), first introduced by Teuvo Kohonen [23]. The creation of a SOM requires two layers of processing nodes: the first is an input layer containing a processing node for each component of the input vector; the second is an output layer of processing nodes associated with those of the input layer. The number of processing nodes in the output layer is determined by the programmer, based on the envisaged shape and size of the map and on the number of independent inputs. There are also algorithms that can automatically grow a map to an 'optimal size'. In a SOM network there are no hidden layers or hidden processing nodes.

The SOM is a competitive NN. When an input is presented to the network, each output unit competes to match the input pattern, and the output that is closest to the input pattern is declared the winner. The weights of the winning unit are then adjusted, i.e., moved in the direction of the input pattern by a factor determined by the learning rate. This is the basic nature of typical competitive NNs. The SOM differs from other unsupervised algorithms in that it creates a topological map based on patterns of correlated similarity. This map is dynamic and continually adjusts itself, updating not only the winner's weights but also the weights of the output nodes in the neighborhood of the winning node. Thus the output nodes, which start with randomized weight values, slowly align themselves with one another based on perceived patterns of similarity in the input data. When an input pattern is presented, a whole neighborhood of nodes responds to it.

SOM networks have been successfully applied as a clustering method to numeric datasets [24, 25]. However, it is not feasible to directly apply SOM to transactional data. This paper investigates the feasibility of clustering binary transactional datasets using SOM. In particular, we propose the Transactions Clustering using SOM (TCSOM) algorithm, which utilizes a normalized dot-product dissimilarity for measuring the distance between an input vector and an output neuron, together with a modified weight adaptation function for adjusting the weights of the winner and its neighbors. More importantly, TCSOM is a one-pass algorithm, which makes it extremely suitable for data mining applications.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces basic concepts of SOM clustering. Section 4 presents the TCSOM algorithm. Experimental results are given in Section 5, and Section 6 concludes the paper.

2. Related Work

A number of algorithms have been proposed in recent years for clustering categorical data [1–21]. In [1], the problem of clustering customer transactions in a market database is addressed. STIRR, an iterative algorithm based on non-linear dynamical systems, is presented in [2]. The approach used in [2] can be mapped to a certain type of non-linear system; if the dynamical system converges, the categorical database can be clustered. More recent research [3] shows that the known dynamical systems cannot guarantee convergence, and proposes a revised dynamical system in which convergence can be guaranteed.

K-modes, an algorithm extending the k-means paradigm to the categorical domain, is introduced in [4, 5]. New dissimilarity measures for categorical data replace means with modes, and a frequency-based method is used to update modes in the clustering process so as to minimize the clustering cost function. Based on the k-modes algorithm, Jollois and Nadif [6] propose an adapted mixture model for categorical data, which gives a probabilistic interpretation of the criterion optimized by the k-modes algorithm. A fuzzy k-modes algorithm is presented in [7], and a tabu search technique is applied in [8] to improve the fuzzy k-modes algorithm. An iterative initial-points refinement algorithm for categorical data is presented in [9]. The work in [19] can be considered an extension of the k-modes algorithm to the transaction domain.

In [10], the authors introduce a novel formalization of a cluster for categorical data by generalizing a definition of cluster for numerical data, and present a fast summarization-based algorithm, CACTUS, consisting of three phases: summarization, clustering, and validation. ROCK, an adaptation of an agglomerative hierarchical clustering algorithm, is introduced in [11]. This algorithm starts by assigning each tuple to a separate cluster, and then merges clusters repeatedly according to the closeness between clusters, defined as the sum of the number of 'links' between all pairs of tuples, where the number of 'links' is the number of common neighbors of two tuples.

In [12], the authors propose the notion of a large item: an item is large in a cluster of transactions if it is contained in a user-specified fraction of the transactions in that cluster. An allocation-and-refinement strategy, as adopted in partitioning algorithms such as k-means, is used to cluster transactions by minimizing a criterion function defined with the notion of large item. Following the large-item method of [12], a new measurement, the small-large ratio, is proposed and utilized to perform the clustering in [13]. In [14], the authors consider the item taxonomy in performing cluster analysis, while the work in [15] proposes an algorithm based on 'caucuses', i.e., fine-partitioned demographic groups based on the purchase features of customers.

Squeezer, a one-pass algorithm, is proposed in [16]. Squeezer reads tuples from the dataset one by one; the first tuple forms a cluster by itself, and each subsequent tuple is either put into an existing cluster or, if it is rejected by all existing clusters, forms a new cluster according to the given similarity function. COOLCAT, an entropy-based algorithm for categorical clustering, is proposed in [17]. Starting from a heuristic method of increasing the height-to-width ratio of the cluster histogram, the authors in [18] develop the CLOPE algorithm. Cristofor and Simovici [20] introduce a distance measure between partitions based on the notion of generalized conditional entropy, and a genetic algorithm is utilized for discovering the median partition. In [21], the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply a cluster ensemble approach to clustering categorical data.

3. Self-Organizing Map (SOM) Clustering

The SOM network typically has two layers of nodes, the input layer and the output layer (Kohonen layer). The input layer is fully connected to a two-dimensional output layer. The input nodes form a vector of the same length as the input vector. During training, input data are fed to the network through the processing nodes of the input layer. An input pattern x is denoted by a vector of length n, x = (x_1, x_2, ..., x_n), where x_i is the ith input signal in the input pattern and n is the number of input signals in each input pattern. The output layer is usually a two-dimensional array of M nodes (M = R × C). Since each node in the two-dimensional output layer is uniquely determined by its position, we can flatten the two-dimensional array into a one-dimensional vector of length M; for convenience of description, we use a single index to refer to an output node. An input pattern is simultaneously incident on the nodes of the two-dimensional output layer. Associated with each node j in the output layer is a weight vector (also of length n), denoted by w_j = (w_{1j}, w_{2j}, ..., w_{nj}), where w_{ij} is the weight value associated with node j corresponding to the ith signal of an input vector. Figure 1 presents an example of a SOM network.

Figure 1. Self-organizing map (Image source: http://www.cies.staffs.ac.uk/somview/som-vect.gif).

The SOM algorithm follows the standard neural network pattern: initial weight assignment followed by iterative training. For initial weight assignment, three methods are frequently used: random initialization, which assigns random values to the weight vectors; (random) sample initialization, which randomly chooses sample vectors to assign to the weight vectors; and linear initialization, where the weight vectors are initialized in an orderly fashion along the linear subspace spanned by the two principal eigenvectors of the input data set.

For iterative training, the winner (suppose node j is the winner) among the output nodes for the input pattern is first located. The winner is determined by a function of the weight vector and the input vector.

The function most often used in the traditional SOM is the Euclidean distance. Then, the weight vectors of the nodes around the winner j are updated by the following formula:

w_i^{t+1} = w_i^t + α(t) · D(t, i, j) · (x^t − w_i^t)    (1)

In this formula, w_i^t and w_i^{t+1} are the weight vectors of output node i at time t and t + 1. α(t) is the learning-rate function; generally it is a monotonically decreasing function of t (e.g., α(t) = A/(t + B)). D(t, i, j) is the neighborhood function, where i and j are neurons in the output layer. The function D takes the parameter t because the radius of the neighborhood usually shrinks as time goes by. Two forms of neighborhood function are often used: the bubble function, which equals 1 within the neighborhood of j and 0 elsewhere, and the Gaussian function, which is a little more accurate but involves considerably more computation:

D(t, i, j) = exp( −dist(i, j) / (2σ²(t)) )    (2)

where dist(i, j) is the topological distance between node i and node j, and σ(t) decreases monotonically with time t.
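For concreteness, the standard training step described by formulas (1) and (2) can be sketched in Python/NumPy as follows. This is an illustrative sketch only; the function name som_train_step and the parameter defaults (A, B, sigma0) are our own choices, not values prescribed by this paper.

```python
import numpy as np

def som_train_step(weights, grid_pos, x, t, A=10.0, B=10.0, sigma0=2.0):
    """One training step of a standard SOM.

    weights  : (M, n) array, one weight vector per output node
    grid_pos : (M, 2) array, each node's position on the 2-D output grid
    x        : (n,) input pattern
    t        : current time step
    """
    # Winner search: the node whose weight vector is closest to x (Euclidean).
    j = int(np.argmin(np.linalg.norm(weights - x, axis=1)))

    alpha = A / (t + B)          # learning rate alpha(t), decreasing in t
    sigma = sigma0 / (1.0 + t)   # shrinking neighborhood radius sigma(t)

    # Gaussian neighborhood of formula (2), based on topological grid distance.
    d = np.linalg.norm(grid_pos - grid_pos[j], axis=1)
    D = np.exp(-d / (2.0 * sigma**2))

    # Formula (1): move every node toward x, scaled by alpha(t) and D(t, i, j).
    weights += alpha * D[:, None] * (x - weights)
    return j
```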

4. The TCSOM Algorithm

The TCSOM algorithm is a scalable algorithm that clusters transactions with only one pass over the dataset. For initial weight assignment, the TCSOM algorithm selects the first M input vectors (M is the number of output nodes) from the dataset to construct the initial weight vector of each output node, i.e., each initial weight vector is set to the input vector initially assigned to it. In this way, these vectors are put into distinct clusters. Moreover, we maintain a variable c_k that records the number of times output node k has been identified as a winner so far; in this phase, c_k is initialized to 1 for 1 ≤ k ≤ M.

For each subsequent input pattern x = (x_1, x_2, ..., x_n), its distance to an output node k (denoted by its weight vector w_k = (w_{1k}, w_{2k}, ..., w_{nk})) is computed using the following formula:

dist(x, w_k) = 1 − ( Σ_{i=1}^{n} x_i · w_{ik} ) / (n · c_k)    (3)

In formula (3), n is the number of dimensions of the input vector and c_k is the number of times output node k has been identified as a winner so far. The dissimilarity function over the weight vector and the input vector is formulated as (3) based on the following considerations:

(1) In binary 0-1 data, '1' indicates the presence of an item or a categorical attribute value. Hence, our aim is to find clusters with a high density of '1's. The weight value w_{ik} is proportional to the number of '1's contained in the ith component of the data patterns in the kth cluster (the weight updating rule is discussed below). Thus, w_{ik}/c_k approximately represents the density of '1's on the ith component of the cluster. From this viewpoint, x_i · (w_{ik}/c_k) is the similarity between the input vector x and the weight vector w_k on the ith component. Considering all components simultaneously, we obtain

Σ_{i=1}^{n} ( x_i · (w_{ik}/c_k) ) / n = ( Σ_{i=1}^{n} x_i · w_{ik} ) / (n · c_k)

as a measure of the similarity between x and w_k. Hence, formula (3) is derived as the distance function.

(2) Beyond point (1), introducing c_k in formula (3) prevents all input vectors from being placed into a single cluster: without c_k, remaining input vectors would very likely be absorbed into one cluster of very large size. Conversely, one might suspect that introducing c_k forces clusters of similar size; the empirical results in the experimental section show that our algorithm can in fact produce clusters of significantly varied sizes.

After computing the distances between x and all the weight vectors, we locate the best-matching output neuron. The winner, denoted here by j, is the output node whose weight vector is closest to x:

dist(x, w_j) = min_k { dist(x, w_k) }    (4)
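A minimal sketch of this dissimilarity computation and of the winner selection of formula (4) is given below (Python/NumPy; the function name tcsom_distance is ours and is used only for illustration):

```python
import numpy as np

def tcsom_distance(x, W, c):
    """Normalized dot-product dissimilarity of formula (3).

    x : (n,) binary input vector
    W : (M, n) array of weight vectors, one row per output node
    c : (M,) win counts c_k for each output node
    Returns an (M,) array of distances dist(x, w_k).
    """
    n = x.shape[0]
    return 1.0 - (W @ x) / (n * c)

# Formula (4): the winner j is the node with minimal distance.
# j = int(np.argmin(tcsom_distance(x, W, c)))
```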

For winner training, the weight vectors of the nodes around node j are updated by the following formula:

w_i^{t+1} = w_i^t + α(t) · D(t, i, j) · x^t    (5)

In formula (5), w_i^t and w_i^{t+1} are the weight vectors of output node i at time t and t + 1. In our current implementation, for the sake of efficiency, we let

D(t, i, j) = 1 if i = j, and 0 if i ≠ j    (6)

and

α(t) = 1/(t + 1).    (7)

The neighborhood function (6) is a bubble function with neighborhood radius equal to 0; more precisely, only the weight vector of the winner, node j, is updated. Since we use a single pass over the dataset, t = 0 and hence α(t) = 1. Therefore, the weight adaptation function (5) can be re-formulated as:

w_j^{t+1} = w_j^t + x^t    (8)

In formula (8), the weight vector of the winner node j is updated to the sum of x^t and w_j^t.


Meanwhile, we output j as the cluster label of this input vector and increment c_j by 1. After one scan over the dataset, all input patterns have been distributed into their corresponding clusters. Therefore, the TCSOM algorithm is very efficient in handling large transactional datasets; the complete one-pass procedure is sketched at the end of this section. In the following, we present the time and space complexities of the TCSOM algorithm in detail.

The time and space complexities of the TCSOM algorithm depend on the size of the dataset (m), the number of dimensions (n) and the number of output nodes (M). For each input vector, computing its distances to all output nodes requires O(n · M) operations; hence, the algorithm has time complexity O(m · n · M). The algorithm only needs to store M weight vectors (i.e., n · M weight values) and M variables (each recording the size of a single cluster) in main memory, so the space complexity of our algorithm is O(n · M).

From the above analysis, we can see that the TCSOM algorithm is especially well qualified for clustering binary data streams [22]. In the data stream model, data points can only be accessed in the order of their arrival and random access is not allowed. The space available is often insufficient to store the stream because of the unbounded volume of streaming data points. To process high-volume, open-ended data streams, an algorithm has to meet some stringent criteria. In [26], Domingos presents a series of design criteria for such algorithms, which are summarized as follows:

1. The time needed by the algorithm to process each data record in the stream must be small and constant; otherwise, the algorithm cannot keep up with the pace of the data.
2. Regardless of the number of records the algorithm has seen, the amount of main memory used must be fixed.
3. It must be a one-pass algorithm, since in most applications either the data is not yet available or there is no time to revisit old data.
4. It must be able to make a usable model available at any time, since we may never reach the end of the stream.
5. The model must be up-to-date at any point in time; that is, it must keep up with changes in the data.

From the description and analysis of the TCSOM algorithm, it is easy to see that it satisfies all five criteria. Hence, it is also a qualified streaming clustering algorithm for binary data streams.
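Putting formulas (3), (4) and (8) together, the whole one-pass procedure can be sketched as follows. This is our own illustrative Python/NumPy rendering of the algorithm as described above, not the authors' released implementation; the function name tcsom is hypothetical.

```python
import numpy as np

def tcsom(data, M):
    """One-pass TCSOM clustering of a binary (m, n) dataset into M clusters.

    Returns the cluster label assigned to each input vector, in input order.
    """
    data = np.asarray(data, dtype=float)
    m, n = data.shape

    # Initial weight assignment: the first M input vectors seed the M nodes,
    # so each of them starts in a distinct cluster.
    W = data[:M].copy()
    c = np.ones(M)                  # c_k: times node k has won so far
    labels = list(range(M))

    for x in data[M:]:
        dist = 1.0 - (W @ x) / (n * c)   # formula (3)
        j = int(np.argmin(dist))         # formula (4): winner selection
        W[j] += x                        # formula (8): radius-0 bubble update
        c[j] += 1
        labels.append(j)
    return labels
```

The loop body touches only the M weight vectors and win counts, which makes the O(m·n·M) time and O(n·M) space bounds stated above immediate.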

5. Experimental Results

A performance study has been conducted to evaluate our method. In this section, we describe those experiments and the results. We ran our algorithm on real-life datasets obtained from the UCI Machine Learning Repository [27] to test its clustering performance against other algorithms.

5.1. Real Life Datasets and Evaluation Method

We experimented with two real-life datasets, the Congressional Votes dataset and the Mushroom dataset, both obtained from the UCI Machine Learning Repository [27]. We now give a brief introduction to these datasets.

• Congressional Votes dataset: the United States Congressional Voting Records of 1984. Each record represents one Congressman's votes on 16 issues. All attributes are Boolean with Yes (denoted y) and No (denoted n) values. A classification label of Republican or Democrat is provided with each record. The dataset contains 435 records: 168 Republicans and 267 Democrats.

• Mushroom dataset: it has 22 attributes and 8124 records. Each record represents the physical characteristics of a single mushroom. A classification label of poisonous or edible is provided with each record. The dataset contains 4208 edible and 3916 poisonous mushrooms.

Validating clustering results is a non-trivial task. In the presence of true labels, as is the case for the datasets we used, the clustering accuracy was computed as follows. Given the final number of clusters k, the clustering accuracy r is defined as

r = ( Σ_{i=1}^{k} a_i ) / n,

where n is the number of records in the dataset and a_i is the number of instances occurring in both cluster i and its corresponding class, taken to be the class with the maximal such value. In other words, a_i is the number of records carrying the class label that dominates cluster i. Consequently, the clustering error is defined as e = 1 − r.
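As an illustration, the accuracy measure r (and hence the error e = 1 − r) can be computed directly from the cluster labels and the true class labels. The sketch below is ours; the function name clustering_accuracy is hypothetical.

```python
from collections import Counter

def clustering_accuracy(cluster_labels, class_labels):
    """Clustering accuracy r = (sum of a_i) / n, where a_i is the count of the
    class label that dominates cluster i; the clustering error is e = 1 - r."""
    n = len(class_labels)
    total = 0
    for cluster in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels)
                   if cl == cluster]
        total += Counter(members).most_common(1)[0][1]  # a_i for this cluster
    return total / n
```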

5.2. Experiment Design

We studied the clusterings found by four algorithms: our TCSOM algorithm, the Squeezer algorithm introduced in [16], the GAClust algorithm proposed in [20], and the ccdByEnsemble algorithm in [21]. To date, there is no well-recognized standard methodology for categorical data clustering experiments. However, we observed that most clustering algorithms require the number of clusters as an input parameter, so in our experiments we cluster each dataset into different numbers of clusters, varying from 2 to 9. For each fixed number of clusters, the clustering errors of the different algorithms are compared.

In all experiments, except for the number of clusters, all parameters required by the ccdByEnsemble algorithm are set to their defaults as in [21]. The Squeezer algorithm requires only a similarity threshold as an input parameter, so we set this parameter to an appropriate value to obtain the desired number of clusters. For the GAClust algorithm, we set the population size to 50 and other parameters to their default values. (The source code for GAClust is publicly available at http://www.cs.umb.edu/~dana/GAClust/index.html; readers may refer to that site for details of the other parameters.)


Figure 2. Clustering error versus different numbers of clusters (votes dataset).

Table I. Relative performance of different clustering algorithms (votes dataset).

Ranking          1    2    3    4    Average clustering error
Squeezer         0    2    1    5    0.163
GAClust          0    3    2    3    0.136
ccdByEnsemble    0    4    4    0    0.115
TCSOM            8    0    0    0    0.079

Moreover, since the clustering results of the TCSOM, ccdByEnsemble and Squeezer algorithms are fixed for a particular dataset once the parameters are fixed, only one run is used for these three algorithms. The GAClust algorithm is a genetic algorithm, so its outputs differ across runs. However, we observed in the experiments that its clustering error is very stable, so the clustering error of this algorithm is reported from its first run. In summary, we use one run to obtain the clustering errors for all four algorithms.

5.3. Clustering Results on Congressional Voting (Votes) Data

Figure 2 shows the results of the different clustering algorithms on the votes dataset. From Figure 2, we can summarize the relative performance of these algorithms as in Table I. In Table I, the number in the column labelled k (k = 1, 2, 3 or 4) is the number of times an algorithm achieved rank k among the four algorithms. For instance, over the eight experiments, the Squeezer algorithm performed second best in two cases, i.e., it was ranked 2 twice. Compared to the other three algorithms, the TCSOM algorithm performed best in all cases, and its average clustering error was significantly smaller than that of the other algorithms. Thus, the clustering performance of TCSOM on the votes dataset is superior to that of all three other algorithms. Table II describes the sizes of the clusters produced by TCSOM.


Table II. The distribution of cluster sizes (votes dataset).

Number of clusters    Cluster sizes
2                     234, 201
3                     228, 190, 17
4                     214, 165, 42, 14
5                     210, 160, 27, 25, 13
6                     207, 156, 36, 17, 12, 7
7                     206, 144, 32, 19, 4, 14, 6
8                     205, 143, 21, 17, 16, 15, 12, 6
9                     205, 134, 30, 12, 12, 11, 8, 7, 6

Some important observations from Table II are summarized as follows:

(1) The TCSOM algorithm can produce clusters of significantly varied sizes. This empirically verifies that the distance function used in TCSOM is not biased toward producing clusters of similar size. Furthermore, it is natural to conclude that the TCSOM algorithm is also suitable for clustering datasets with unbalanced class distributions.

(2) Another important observation is that, although the number of clusters varies from two to nine, there are always two 'bigger' clusters in the clustering output of TCSOM. Examining these two 'bigger' clusters (the size of the first is always larger than 200, and the size of the second ranges from 134 to 201), we found that the first is made up mostly of Democrats and the second mostly of Republicans. This observation reveals that TCSOM is robust to the input number of clusters in finding the true clustering structure of the underlying dataset, which is very important when the true number of clusters is unknown or hard to determine.

(3) Examining the data in the extremely small clusters, we found that most of them are outliers. Hence, as a by-product of cluster analysis, TCSOM can be used for detecting outliers by treating data objects in extremely small clusters as outliers (a small code sketch of this idea follows below). Related research on outlier detection [28, 29] has empirically verified the effectiveness of such clustering-based outlier detection methods.
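As a simple illustration of the outlier-detection by-product mentioned in observation (3), objects in extremely small clusters can be flagged directly from the label assignment. The threshold min_size below is an arbitrary illustrative value, not one used in the paper.

```python
from collections import Counter

def small_cluster_outliers(labels, min_size=10):
    """Flag data objects in extremely small clusters as candidate outliers."""
    sizes = Counter(labels)
    return [idx for idx, lab in enumerate(labels) if sizes[lab] < min_size]
```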


Figure 3. Clustering error versus different input orders (votes dataset).

Since the TCSOM algorithm is a one-pass algorithm, we also tested its sensitivity to the order in which the input vectors are presented to the network. To find out how the input sequence of the data affects the TCSOM algorithm, we produced five new datasets in which the input vectors are placed in random order; by executing TCSOM on these datasets, we obtain results for different input orders. In these experiments, the number of clusters ranges from 2 to 5.

First, experiments were conducted to assess the impact on clustering accuracy. As Figure 3 shows, clustering errors under different input orders did not change significantly, which provides evidence that the processing order of the input vectors does not have a major impact on the clustering error. A careful analysis of the clustering results under different processing orders reveals that the two 'bigger' clusters (as discussed above) are relatively stable; that is, the clustering results are approximately the same except for the movement of some data objects from one cluster to another. Hence, the experimental results on this dataset indicate the algorithm's robustness with respect to input order.

5.4. Clustering Results on Mushroom Data

The experimental results on the mushroom dataset are shown in Figure 4, and the relative performance of the algorithms is summarized in Table III. As shown in Figure 4 and Table III, our algorithm beats all the other algorithms in average clustering error. Furthermore, although the TCSOM algorithm did not always perform best on this dataset, it performed best in five cases and never performed worst; that is, the TCSOM algorithm performed best in the majority of cases. Table IV presents the distribution of cluster sizes on the mushroom dataset. Apparently, on this dataset, the TCSOM algorithm also produces clusters of significantly varied sizes.

Table III. Relative performance of different clustering algorithms (mushroom dataset).

Ranking          1    2    3    4    Average clustering error
Squeezer         1    5    0    2    0.206
GAClust          0    1    3    4    0.393
ccdByEnsemble    2    1    3    2    0.315
TCSOM            5    1    2    0    0.182


Table IV. The distribution of cluster sizes (mushroom dataset).

Number of clusters    Cluster sizes
2                     4752, 3372
3                     3862, 2453, 1810
4                     3250, 2365, 1427, 1082
5                     2351, 2194, 1320, 1173, 1086
6                     2202, 2076, 1472, 1080, 1036, 258
7                     2202, 1930, 1472, 1076, 696, 498, 250
8                     2202, 1887, 1465, 1063, 486, 419, 392, 210
9                     2126, 1882, 1429, 1050, 479, 407, 387, 211, 153

Figure 4. Clustering error versus different numbers of clusters (mushroom dataset).

In particular, combining two or more clusters from a clustering with a larger number of clusters approximately yields a clustering with a smaller number of clusters. That is, TCSOM is robust to the input number of clusters in finding meaningful clustering results.

Figure 5 shows the order-sensitivity test on the mushroom dataset. As Figure 5 shows, on average, the clustering errors under different input orders are relatively stable.

Figure 5. Clustering error versus different input orders (mushroom dataset).

5.5. Summary

The above experimental results on the two datasets demonstrate the effectiveness of the TCSOM algorithm. One may argue that these results cannot precisely establish that our method has better performance. However, from these results we are confident in claiming that our method provides at least the same level of performance as other popular methods.

6. Conclusions

In this paper, we propose a SOM-based clustering algorithm called TCSOM for binary transactional data. Empirical evidence shows that our method is effective in practice. Furthermore, the TCSOM algorithm is especially suitable for cluster analysis in data stream applications.

The executable program of TCSOM is available for free download at: http://software.hit.edu.cn/home/zengyouhe/software/TCSOM.html

Acknowledgements

The comments and suggestions from the anonymous reviewers greatly improved the paper. We would also like to thank Mr. Danish Irfan for his help with proofreading. This work was supported by the High Technology Research and Development Program of China (Grant No. 2003AA4Z2170, Grant No. 2003AA413021), the National Natural Science Foundation of China (Grant No. 40301038) and the IBM SUR Research Fund.

References

1. Han, E. H., Karypis, G., Kumar, V. and Mobasher, B.: Clustering based on association rule hypergraphs. In: SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 9–13, 1997.
2. Gibson, D., Kleinberg, J. and Raghavan, P.: Clustering categorical data: an approach based on dynamical systems. In: Proceedings of VLDB'98, pp. 311–323, 1998.
3. Zhang, Y., Fu, A. W., Cai, C. H. and Heng, P. A.: Clustering categorical data. In: Proceedings of ICDE'00, pp. 305–305, 2000.
4. Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. In: SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 1–8, 1997.
5. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3) 1998, 283–304.
6. Jollois, F. and Nadif, M.: Clustering large categorical data. In: Proceedings of PAKDD'02, pp. 257–263, 2002.
7. Huang, Z. and Ng, M. K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems, 7(4) 1999, 446–452.
8. Ng, M. K. and Wong, J. C.: Clustering categorical data sets using tabu search techniques. Pattern Recognition, 35(12) 2002, 2783–2790.
9. Sun, Y., Zhu, Q. and Chen, Z.: An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recognition Letters, 23(7) 2002, 875–884.
10. Ganti, V., Gehrke, J. and Ramakrishnan, R.: CACTUS - clustering categorical data using summaries. In: Proceedings of KDD'99, pp. 73–83, 1999.
11. Guha, S., Rastogi, R. and Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of ICDE'99, pp. 512–521, 1999.
12. Wang, K., Xu, C. and Liu, B.: Clustering transactions using large items. In: Proceedings of CIKM'99, pp. 483–490, 1999.
13. Yun, C. H., Chuang, K. T. and Chen, M. S.: An efficient clustering algorithm for market basket data based on small large ratios. In: Proceedings of COMPSAC'01, pp. 505–510, 2001.
14. Yun, C. H., Chuang, K. T. and Chen, M. S.: Using category based adherence to cluster market-basket data. In: Proceedings of ICDM'02, pp. 546–553, 2002.
15. Xu, J. and Sung, S. Y.: Caucus-based transaction clustering. In: Proceedings of DASFAA'03, pp. 81–88, 2003.
16. He, Z., Xu, X. and Deng, S.: Squeezer: an efficient algorithm for clustering categorical data. Journal of Computer Science & Technology, 17(5) 2002, 611–624.
17. Barbara, D., Li, Y. and Couto, J.: COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of CIKM'02, pp. 582–589, 2002.
18. Yang, Y., Guan, S. and You, J.: CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of KDD'02, pp. 682–687, 2002.
19. Giannotti, F., Gozzi, G. and Manco, G.: Clustering transactional data. In: Proceedings of PKDD'02, pp. 175–187, 2002.
20. Cristofor, D. and Simovici, D.: Finding median partitions using information-theoretical-based genetic algorithms. Journal of Universal Computer Science, 8(2) 2002, 153–172.
21. He, Z., Xu, X. and Deng, S.: A cluster ensemble method for clustering categorical data. Information Fusion, 6(2) 2005, 143–151.
22. Ordonez, C.: Clustering binary data streams with k-means. In: SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2003.
23. Kohonen, T.: Self-Organization and Associative Memory, 2nd ed., Springer-Verlag, Berlin, 1984.
24. Flexer, A.: On the use of self-organizing maps for clustering and visualization. Intelligent Data Analysis, 5(5) 2001, 373–384.
25. Shum, W.-H., Jin, H., Leung, K.-S. and Wong, M. L.: A self-organizing map with expanding force for data clustering and visualization. In: Proceedings of ICDM'02, pp. 434–441, 2002.
26. Domingos, P. and Hulten, G.: Catching up with the data: research issues in mining data streams. In: 2001 SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001.
27. Merz, C. J. and Murphy, P.: UCI Repository of Machine Learning Databases, 1996. (http://www.ics.uci.edu/~mlearn/MLRepository.html).
28. Jiang, M. F., Tseng, S. S. and Su, C. M.: Two-phase clustering process for outliers detection. Pattern Recognition Letters, 22(6–7) 2001, 691–700.
29. He, Z., Xu, X. and Deng, S.: Discovering cluster based local outliers. Pattern Recognition Letters, 24(9–10) 2003, 1641–1650.
