
Data & Knowledge Engineering xxx (2007) xxx–xxx www.elsevier.com/locate/datak

Rough clustering of sequential data

Pradeep Kumar a,b,*, P. Radha Krishna a, Raju S. Bapi b, Supriya Kumar De c

a Business Intelligence Lab, Institute for Development and Research in Banking Technology (IDRBT), 1, Castle Hills, Masab Tank, Hyderabad 500057, India
b Computational Intelligence Lab, Department of Computer and Information Sciences, University of Hyderabad, Gachibowli, Hyderabad 500046, India
c XLRI Jamshedpur, C.H. Area, Jamshedpur 831001, India

Received 7 February 2006; received in revised form 10 October 2006; accepted 19 January 2007

Abstract

This paper presents a new indiscernibility-based rough agglomerative hierarchical clustering algorithm for sequential data. In this approach, the indiscernibility relation is extended to a tolerance relation, with the transitivity property relaxed. Initial clusters are formed using a similarity upper approximation; subsequent clusters are formed using the concept of constrained-similarity upper approximation, wherein a condition of relative similarity is used as a merging criterion. We report results of experimentation on the msnbc web navigation dataset, which is intrinsically sequential in nature, and compare the results of the proposed approach with those of the traditional hierarchical clustering algorithm using vector coding of sequences. The results establish the viability of the proposed approach. The rough clusters resulting from the proposed algorithm provide interpretations of the different navigation orientations of users present in the sessions, without having to fit each object into only one group. Such descriptions can help web miners identify potential and meaningful groups of users.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Clustering; Rough sets; Constrained-similarity upper approximation; Web mining; Similarity metric; Sequential data

1. Introduction

Clustering is an initial and fundamental step in data analysis. Although finer details of the data are lost when it is summarized into a few groups, clustering yields simple and understandable groups. Clustering has been studied in the fields of machine learning and pattern recognition [1] and plays an important role in data mining applications such as scientific data exploration, information retrieval and text mining.

It also plays a significant role in spatial database applications, web analysis, customer relationship management (CRM), marketing, computational biology and many other related areas.

Clustering algorithms have been classified using different taxonomies based on various important issues such as algorithmic structure, the nature of the clusters formed, the use of feature sets, etc. [2]. Broadly speaking, clustering algorithms can be divided into two types: partitional and hierarchical. Partitional algorithms construct a partition of a database D of n objects into a set of k clusters, where k is an input parameter. Setting the value of k requires domain knowledge that, unfortunately, is not available for many applications. Hierarchical algorithms create a hierarchical decomposition of the database D; they can be either agglomerative or divisive. Agglomerative hierarchical clustering algorithms begin with each object in a separate group and successively combine groups based on a distance measure, until only one group remains or a specified termination condition is satisfied. Divisive hierarchical clustering begins with all the data in one big cluster and progressively splits it into smaller clusters based on distance measures. Unlike partitional algorithms, hierarchical algorithms do not require the number of clusters k as an input parameter; however, a termination condition must be defined to indicate when the merging or division process should end. Finding an efficient termination condition for the merging or division process is still an open research problem [2].

Partitional clustering algorithms are less affected by an uneven distribution of the data than hierarchical clustering techniques, but they still suffer from many of the problems associated with traditional statistical analysis methods, such as assumptions of normality and equivariance [3].

Clusters can be hard or soft in nature. In conventional clustering, objects that are similar are allocated to the same cluster, while objects that differ significantly are put in different clusters; such clusters are disjoint and are called hard clusters. In soft clustering, an object may be a member of two or more clusters, and soft clusters may have fuzzy or rough boundaries [4]. In fuzzy clustering, each object is characterized by partial membership, whereas in rough clustering objects are characterized using the concept of a boundary region. In fuzzy clustering, the degree of membership of a pattern in a given cluster decreases as its distance from the cluster's center increases; this observation led to the fuzzy c-means algorithm, which has many applications in pattern recognition [5].

A rough cluster is defined in a manner similar to a rough set. The lower approximation of a rough cluster contains objects that belong only to that cluster; the upper approximation contains objects of the cluster that are also members of other clusters [6]. The advantage of using rough sets is that, unlike other techniques, rough set theory does not require any prior information about the data, such as a priori probabilities in statistics or a membership function in fuzzy set theory.

In this paper, we present a hierarchical clustering algorithm that uses the similarity upper approximation derived from a tolerance (similarity) relation. The presented approach results in rough clusters, wherein an object may be a member of more than one cluster.
1.1. Web mining and its current approaches

Joshi and Krishnapuram [7] argued that the clustering operation in many applications involves modelling an unknown number of overlapping sets; that is, the clusters do not necessarily have crisp boundaries. Web mining is one such area where overlapping clusters are required. Since web users are highly mobile, it is important to understand and group users according to their needs and characteristics. Clusters of web users' sessions differ from traditional clusters in that a user may belong to more than one group: a clear distinction among users' visiting behaviors may not exist, and hence these clusters tend to have vague or imprecise boundaries. Thus, the clustering algorithm should be able to put a user into two or more groups based on his or her needs or interests. This suggests the use of soft clustering approaches that allow a user session to appear in multiple clusters; rough clustering can help researchers discover multiple needs and interests in a session by looking at the multiple clusters to which the session belongs.

Hirano and Tsumoto [8,9] proposed an indiscernibility-based clustering method that can handle relative proximity. In [10,11], automatic personalization of a web site from user transactions is presented using the fuzzy proximity relations described in [12]. Recently, rough set theory has been used for clustering [4,8,9,11,13–16]. Lingras proposed two different models based on properties of rough sets for developing interval representations of data [4,16].

These models are very useful for dealing with soft boundaries. A clustering algorithm using rough approximation to cluster web transactions from web access logs has been attempted [11,13]. Moreover, fuzzy sets and rough sets have been integrated in a soft computing framework to develop stronger models of uncertainty [13,17–19].

1.2. Sequential data

Reliable clustering of web user sessions can be achieved if both the content and the order of page visits are considered; in this way, both the actual pages a user visits and the user's preferences and requirements are captured. However, most approaches in web mining do not utilize the sequential nature of user sessions. Usually, sessions are modelled in an n-dimensional vector space of web pages. These n-dimensional vectors may be binary, indicating whether a particular web page was visited within a session, or they may carry the frequency counts of web page visits within a session. Thus, depending on the nature of the values associated with these n dimensions, only limited kinds of user behavior analysis can be performed.

Generally, clustering algorithms make use of either distance functions or similarity functions for comparing pairs of sequences. Many of the similarity functions proposed for sequences do not fully satisfy the properties of a metric. In Section 2, we provide a brief introduction to the S3M similarity function [20], which considers both the set composition and the sequential order shared by two sequences.

This paper presents a new clustering technique for sequences using the concept of constrained-similarity upper approximation. The key idea of our approach is to find a set of features that captures both the sequential information and the content information of the data sequences. These feature sets are projected into an upper approximation space, and the constrained-similarity upper approximation technique is applied to obtain the upper approximations of rough clusters, wherein one element may belong to more than one cluster. The current work uses web mining as an example to demonstrate the usefulness of the proposed rough clustering algorithm.

1.3. Organization of the paper

The rest of the paper is organized as follows. Section 2 presents a brief introduction to the S3M similarity measure for sequential data, Section 3 gives an overview of rough set theory, and Section 4 presents the clustering approach that uses the constrained-similarity upper approximation. In Section 5, we present a brief description of the web usage data used for experimentation, together with a step-by-step illustration on a small set of 10 web navigation transactions to provide insight into the working of the proposed algorithm. We present the experimental results on the msnbc web data in Section 6, where we compare the results of the proposed algorithm with those of the traditional hierarchical clustering algorithm using vector coding of sequences. Finally, we discuss the results and conclude in Section 7. Table 1 lists the symbols used in this paper.

Table 1. Table of notations

U — Universe
X — Set X, a subset of U
R — Equivalence relation
s — Tolerance relation
R̄(X) — Upper approximation of set X
R(X) — Lower approximation of set X
d — Threshold value
r — Relative similarity
p — Sequence similarity weight
q — Set similarity weight
A — Approximation space
k — Total number of clusters
S3M(A, B) — Similarity between sequences A and B using the S3M similarity measure
LD(A, B) — Levenshtein distance between sequences A and B
Ĉj — Cluster representative of the jth cluster
|Cj| — Size of the jth cluster
Cjl — lth object of the jth cluster


2. S3M – Similarity measure for sequences

A sequence is made up of a set of items that may occur in time or one after another, that is, in position but not necessarily in relation to time. We can therefore say that a sequence is an ordered set of items. Typically, a sequence is denoted as S = ⟨a1, a2, ..., an⟩, where a1, a2, ..., an are the ordered item sets in sequence S. The length of the sequence, denoted |S|, is defined as the number of item sets present in the sequence. In order to find patterns in sequences, it is necessary to look not only at the items contained in sequences but also at the order of their occurrence.

A new measure, called the sequence and set similarity measure (S3M), was introduced for the network security domain [20]. The S3M measure consists of two parts: one that quantifies the composition of the sequence (set similarity) and one that quantifies its sequential nature (sequence similarity).

Sequence similarity quantifies the amount of similarity in the order of occurrence of item sets within two sequences. The length of the longest common subsequence (LLCS), normalized by the length of the longer sequence, determines the sequence similarity between two sequences A and B:

$$\mathrm{SeqSim}(A, B) = \frac{\mathrm{LLCS}(A, B)}{\max(|A|, |B|)} \qquad (1)$$

Set similarity (the Jaccard similarity measure) is defined as the ratio of the number of common item sets to the number of unique item sets in the two sequences:

$$\mathrm{SetSim}(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$

Let us consider two sequences A = ⟨a, b, c, d⟩ and B = ⟨d, c, b, a⟩. The set similarity measure for these two sequences is 1, indicating that their composition is identical, yet they are not at all similar when the order of occurrence of item sets is considered. This aspect is quantified by the sequence similarity component, which is 0.25 for these sequences; LLCS keeps track of the position of occurrence of item sets in the sequence. For two sequences C = ⟨a, b, c, d⟩ and D = ⟨b, a, k, c, t, p, d⟩, LLCS(C, D) works out to be 3 and, after normalization, the sequence similarity component turns out to be 0.43, while the set similarity for these two sequences is 0.57. These two examples illustrate the need to combine the set similarity and sequence similarity components into one function in order to capture both the content-based and the position-based similarity aspects. Thus, the S3M measure for two sequences A and B is given by

$$S^3M(A, B) = p \cdot \frac{\mathrm{LLCS}(A, B)}{\max(|A|, |B|)} + q \cdot \frac{|A \cap B|}{|A \cup B|} \qquad (3)$$

Here p + q = 1 and p, q ≥ 0; p and q determine the relative weights given to the order of occurrence (sequence similarity) and to the content (set similarity), respectively. In practical applications, the user can specify these parameters. The LLCS between two sequences can be found by the dynamic programming approach [21].

Let S be a set of finite sequences generated from a given set of symbols Σ, and let ℝ denote the set of real numbers. Then Sim(si, sj) : S × S → ℝ is called an index of similarity between sequences si, sj ∈ S if it satisfies the following properties [22]:

(1) Non-negativity: Sim(si, sj) ≥ 0 for all si, sj ∈ S.
(2) Symmetry: Sim(si, sj) = Sim(sj, si) for all si, sj ∈ S.
(3) Normalization: Sim(si, sj) ≤ 1 for all si, sj ∈ S.

Clearly, the S3M similarity measure satisfies all of the above properties and hence qualifies as a proper index of similarity.
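Since the paper gives no implementation, the following is a minimal Python sketch of Eqs. (1)–(3); the function names are our own, and the LLCS is computed with the standard dynamic programming recurrence mentioned above.

```python
def llcs(a, b):
    """Length of the longest common subsequence, by dynamic programming [21]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def s3m(a, b, p=0.5):
    """S3M similarity (Eq. 3): p * SeqSim (Eq. 1) + q * SetSim (Eq. 2), q = 1 - p."""
    seq_sim = llcs(a, b) / max(len(a), len(b))
    sa, sb = set(a), set(b)
    set_sim = len(sa & sb) / len(sa | sb)
    return p * seq_sim + (1 - p) * set_sim

# The worked examples from the text:
A, B = list("abcd"), list("dcba")
print(s3m(A, B, p=1.0))            # 0.25 -> pure sequence similarity
print(s3m(A, B, p=0.0))            # 1.0  -> pure set similarity
C, D = list("abcd"), list("bakctpd"))
print(round(s3m(C, D, p=1.0), 2))  # 0.43
print(round(s3m(C, D, p=0.0), 2))  # 0.57
```

With p = q = 0.5 the measure balances order and content, which is the setting used in most of the experiments reported later in the paper.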


3. Rough set theory

Rough set theory, introduced by Pawlak [23], deals with uncertainty and vagueness. It became popular among scientists around the world due to its fundamental importance in the fields of artificial intelligence and cognitive science. The building block of rough set theory is the assumption that with every object of the universe of discourse we associate some information in the form of data and knowledge. Objects characterized by the same information are similar with respect to the available information about them, and the similarity generated in this way forms the basis of rough set theory. In rough set theory, a set can be described using a pair of crisp sets, the lower and upper approximations of the set. Incorporating classical axioms into rough set theory yields rough set based logics with new features that are very useful in intelligent decision-making: based on uncertain and inconsistent data, rough set logic allows correct reasoning and the discovery of hidden associations.

Let U be the universe and let R ⊆ U × U be an equivalence relation on U, called an indiscernibility relation. The pair A = (U, R) is called an approximation space. The lower and upper approximations of a set X with respect to R are

$$\underline{R}(X) = \{x \in U : [x]_R \subseteq X\}$$

$$\overline{R}(X) = \{x \in U : [x]_R \cap X \neq \emptyset\}$$

where [x]_R = {y ∈ U : x R y} is the equivalence class of x. If [x]_R ⊆ X, then it is certain that x ∈ X; if [x]_R ⊆ U − R̄(X), then it is certain that x ∉ X. X is called rough with respect to R iff R(X) ≠ R̄(X); otherwise X is R-discernible. The rough set of X is defined by the pair of lower and upper approximations (R(X), R̄(X)). The set BN_R(X) = R̄(X) − R(X) is called the rough boundary of X, and the set U − R̄(X) is called the negative region of X. Fig. 1 shows the approximation regions of a rough set.

A basic granule of knowledge is called an elementary set; it consists of a well defined collection of similar objects. A crisp boundary is defined as the union of some elementary sets, and the inability to group elementary sets into distinct partitions leads to rough boundaries. A rough cluster is defined in a manner similar to a rough set, that is, with a lower and an upper approximation: the lower approximation of a rough cluster contains objects that belong only to that cluster, while the upper approximation contains objects of the cluster that are also members of other clusters.

In order to carry out rough clustering, two additional requirements need to be specified: an ordered value set Vn of each attribute n, and a distance measure for clustering. Ordering the value set of an attribute enables measurement of the distance between objects. A distance measure used in rough clustering should be defined such that the strict requirements of the indiscernibility relation of canonical rough set theory are relaxed. Thus, rough clustering allows objects to be grouped based on a notion of a similarity relation rather than on an equivalence relation.

Fig. 1. Rough approximation space.
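As a concrete illustration of the definitions above (not code from the paper), the following sketch computes the lower and upper approximations of a set from the equivalence classes induced by a single attribute; the toy universe is our own assumption.

```python
from collections import defaultdict

def approximations(universe, key, X):
    """Lower and upper approximations of X under the indiscernibility
    relation induced by `key`: objects with equal key values fall into
    the same equivalence class [x]_R."""
    classes = defaultdict(set)
    for obj in universe:
        classes[key(obj)].add(obj)
    lower, upper = set(), set()
    for eq in classes.values():
        if eq <= X:      # [x]_R subset of X: certainly in X
            lower |= eq
        if eq & X:       # [x]_R meets X: possibly in X
            upper |= eq
    return lower, upper

# Toy universe where the only available attribute is string length.
U = {"ab", "cd", "efg", "hij", "k"}
X = {"ab", "efg", "hij"}
low, up = approximations(U, len, X)
print(low)       # {'efg', 'hij'}
print(up)        # {'ab', 'cd', 'efg', 'hij'}
print(up - low)  # boundary region BN_R(X): {'ab', 'cd'}
```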


An important distinction between rough clustering and traditional clustering approaches is that, with rough clustering, an object can belong to more than one cluster.

4. Proposed rough clustering algorithm

''The concept must have a sharp boundary. To the concept without a sharp boundary there would correspond an area that does not have a sharp boundary-line all around.'' — Frege, as translated in Pawlak [23].

In many data mining applications, the class attributes of most objects are not distinct but vague. Vagueness in data has attracted mathematicians, philosophers, logicians and, more recently, computer scientists. Rough set theory is one approach to dealing with vagueness, and considerable work has been carried out in the literature using rough sets to handle this kind of uncertainty [10–14,19,24,25]. As discussed in Section 3, the core concept of rough set theory is the indiscernibility relation, which possesses the reflexivity, symmetry and transitivity properties. The indiscernibility relation partitions the universe into equivalence classes, which form the basic granules of knowledge.

Rough set theory can be generalized in three main ways: a set-theoretic framework using non-equivalence binary relations, a granule based definition using coverings, and a subsystem based definition using other subsystems [26–28]. In the set-theoretic framework, the element based definition is generalized using a non-equivalence binary relation. The granule based definition is used to generalize the approximation operators of rough set theory; a covering provides a natural generalization of a partition. In the subsystem based approach, two subsystems, one related to the lower and the other to the upper approximation, are used; this approach mainly generalizes the algebraic system. In the work reported here, we use a generalization of rough sets based on non-equivalence binary relations.

When the objects cannot be partitioned using a strict indiscernibility relation, it is necessary to extend it to a weaker binary relation such as a tolerance relation. For incomplete information systems, reduction of knowledge has been accomplished using a tolerance relation [29]. A tolerance based rough set model has been developed for document clustering and information retrieval [30,31]; later, Ngo and Nguyen [32] extended this work to clustering web search results. Attempts have also been made to define rough sets on similarity relations [33,34]. In general, similarity relations do not give the same kind of partition of the universe as the indiscernibility relation. The similarity classes of each object x in the universe U provide the similarity information for that object, and an object from one similarity class may be similar to objects from other similarity classes; the basic granules of knowledge are therefore intermixed.

Extending indiscernibility to a similarity relation requires weakening some of the properties of the binary relation in terms of reflexivity, symmetry and transitivity [23,35]. The reflexivity property cannot be relaxed, as an object is trivially similar to itself. Researchers dealing with similarity relations usually impose the symmetry property, although some relax it [33]; however, most researchers extending indiscernibility relations to similarity relations relax the transitivity property [34].

Let X ⊆ U. A relation s ⊆ X × U is a tolerance relation on U if

(1) s is reflexive, that is, for any x ∈ U, x s x holds.
(2) s is symmetric, that is, for any pair x, y ∈ U, x s y implies y s x.

Definitions of the lower and upper approximations of a set can now be formulated using tolerance classes: we simply substitute tolerance classes for indiscernibility classes in the basic definitions of the lower and upper approximations of a set. Thus, the tolerance approximations of a given subset X of the universe U are defined as in Definition 1.

Definition 1 [13]. Let X ⊆ U and let a binary tolerance relation R be defined on U. The lower approximation of X, denoted R(X), and the upper approximation of X, denoted R̄(X), are respectively defined as

$$\underline{R}(X) = \{x \in X : R(x) \subseteq X\} \quad\text{and}\quad \overline{R}(X) = \bigcup_{x \in X} R(x)$$


In this work, we propose an agglomerative clustering algorithm using rough sets for clustering web user transactions. Let xi ∈ U be a user transaction consisting of a sequence of web page visits. For clustering, each transaction is initially taken as a singleton cluster; let the ith cluster be Ci = {xi}. Clearly, Ci is a subset of U. The upper approximation of Ci, denoted R̄(Ci), is a set of transactions similar to xi; that is, a user visiting the web pages in xi may also visit other web pages present in the transactions belonging to R̄(Ci).

For any non-negative threshold value d ∈ (0, 1] and any two objects x, y ∈ U, a binary relation s on U, written x s y, is defined by x s y iff Sim(x, y) ≥ d. This relation is a tolerance relation: it is reflexive and symmetric, but transitivity may not always hold. The first upper approximation R̄(xi) is the set of objects most similar to xi and is defined as follows:

Definition 2. For a given non-negative threshold value d ∈ (0, 1] and a set X = {x1, x2, ..., xn}, X ⊆ U, the first upper approximation is

$$\overline{R}(\{x_i\}) = \{x_j : \mathrm{Sim}(x_i, x_j) \geq d\}$$

Some sets in the collection resulting from the first upper approximation may share elements (the so-called boundary elements), and these boundary elements can guide the clustering process. An element shared after the first upper approximation is a potential candidate for the collections formed in the second or higher upper approximations; whether it joins is decided by calculating the strength of the shared element with respect to each cluster to which it belongs. This strength is measured by a parameter called relative similarity, and the second and higher similarity upper approximations are computed under the condition of relative similarity. For two intersecting sets, the relative similarity is given by

$$\mathrm{RelSim}(x_i, x_j) = \frac{|\overline{R}(x_i) \cap \overline{R}(x_j)|}{|\overline{R}(x_i) - \overline{R}(x_j)|} \qquad (4)$$

where R̄(xi) ⊄ R̄(xj). The relative similarity defined above measures the ratio of the size of the shared boundary between two sets to the number of elements that belong exclusively to the set under consideration. The set difference in the denominator may be empty, in which case the relative similarity would be undefined; hence, for the relative similarity to take a definite value in the positive real domain, the first set must not be a proper subset of the second when computing the relative similarity (as the side condition specifies). We now define the proposed constrained-similarity upper approximation:

Definition 3. Let X = {x1, x2, ..., xn}, X ⊆ U. For a fixed non-negative value r ∈ (0, 1], the constrained-similarity upper approximation of xi is given by

$$\overline{R}\overline{R}(\{x_i\}) = \Big\{\, x_j \in \bigcup_{x_l \in \overline{R}(x_i)} \overline{R}(x_l) \;:\; \mathrm{RelSim}(x_i, x_j) \geq r \,\Big\}$$

where R̄(xi) ⊄ R̄(xj). In other words, all sequences xj belonging to the similarity upper approximations of the elements of R̄(xi) that are relatively similar to xi are constrained (or merged) into the next similarity upper approximation of xi. For a given r, we repeat the process of computing successive constrained-similarity upper approximations until two consecutive constrained-similarity upper approximations are identical. Here, r is a user-defined parameter, the relative similarity, used to merge two upper approximations when forming the second and higher upper approximations, while d is a user-defined threshold parameter that defines the similarity between two objects and is used to find the first upper approximation. The constrained-similarity upper approximation is computed for all transactions of U. The complete algorithm for the rough set based agglomerative clustering is given in Algorithm 1.


Unlike other traditional agglomerative algorithms, in our approach more than two transactions may merge to form a cluster. Also, the number of upper approximation computations for the similarity sets decreases as the number of iterations increases. Thus, the proposed rough agglomerative clustering converges quickly.

Algorithm 1. Rough set based agglomerative clustering

Input: T: a set of n transactions ∈ U; threshold d ∈ (0, 1]; relative similarity r ∈ (0, 1]
Output: Cluster scheme C

Begin
Step 1: Construct the similarity matrix using the S3M measure.
Step 2: For each xi ∈ U, compute Si = R̄(xi) using Definition 2 for the given threshold d.
Step 3: Let US = ∪i Si and C = ∅.
Step 4: For each Si ∈ US:
          Compute the next constrained-similarity upper approximation S′i using Definition 3 for relative similarity r.
          If Si = S′i then
            C = C ∪ {S′i}
            US = US \ {Si}
          End if
Step 5: Repeat Step 4 until US = ∅.
Step 6: Return C.
End
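For illustration, here is a compact Python sketch of Algorithm 1 built directly on Definitions 2 and 3. It assumes a precomputed similarity matrix `sim` (Step 1, e.g., via S3M) and treats RelSim as +∞ when the set difference in Eq. (4) is empty; the authors' handling of such boundary cases (and of values falling exactly at r) may differ, so this should be read as a transcription of the definitions as stated, not as the reference implementation.

```python
def first_upper(sim, d):
    """Definition 2: R(x_i) = {x_j : Sim(x_i, x_j) >= d}."""
    n = len(sim)
    return [frozenset(j for j in range(n) if sim[i][j] >= d)
            for i in range(n)]

def rel_sim(Ri, Rj):
    """Eq. (4); treated as +inf when Ri - Rj is empty (Ri a subset of Rj)."""
    diff = Ri - Rj
    return len(Ri & Rj) / len(diff) if diff else float("inf")

def constrained_upper(i, current, first, r):
    """Definition 3 applied to the current approximation of x_i."""
    candidates = frozenset().union(*(first[l] for l in current))
    return frozenset(j for j in candidates
                     if rel_sim(first[i], first[j]) >= r)

def rough_agglomerative(sim, d, r, max_iter=100):
    first = first_upper(sim, d)          # Step 2
    approx = list(first)                 # S_i for every object
    unstable = set(range(len(sim)))      # US, Step 3
    clusters = set()                     # C
    for _ in range(max_iter):            # safety cap; the paper loops until US is empty
        if not unstable:
            break
        for i in list(unstable):         # Step 4
            nxt = constrained_upper(i, approx[i], first, r)
            if nxt == approx[i]:         # two consecutive approximations agree
                clusters.add(nxt)
                unstable.discard(i)
            approx[i] = nxt
    return clusters                      # Step 6: rough (overlapping) clusters
```

Because every set whose consecutive approximations already agree is moved into C and removed from US, the number of upper approximation computations shrinks at each iteration, which is the source of the fast convergence noted above.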

4.1. Complexity of the algorithm

The complexity analysis of our approach is as follows. Let N be the total number of transactions in T and let L be the average length of a transaction. The complexity of the similarity computation is of the order of O(N² log₂L). Let R be the relation defined over T; then the complexity of computing an upper approximation is of the order of O(|T|/|R|) [36], which is the same as O(N/|R|). Merging of clusters takes place at each iteration, based on the similarity upper approximation. Let k be the average number of clusters merging in each iteration. The complexity of merging k clusters is of the order of O(k log k) [37], and there may be at most N/k iterations; thus, the complexity of the merging process is O((N/k) · k log k) = O(N log k). Overall, the complexity of rough agglomerative clustering is of the order of O(N² log₂L) + O(N/|R|) + O(N log k).

5. Web usage data

In this section, we describe the necessary preprocessing steps on the dataset and give a description of the web usage data used for the experiments.

5.1. Data preprocessing

A prerequisite step in all techniques for providing users with recommendations is the identification of a set of user sessions from the raw usage data provided by the web server. Ideally, each user session gives an exact account of who accessed the web site, what pages were requested and in what order, and how long each page was viewed.


In addition to identifying user sessions, the raw log must also be cleaned, or transformed into a list of page views. Cleaning the server log involves removing all of the file accesses that are redundant, leaving only one entry per page view. This includes handling page views that have multiple frames and dynamic pages that have the same template name for multiple page views. It may also be necessary to filter the log files by mapping the references to the site topology induced by the physical links between pages. This is particularly important for usage-based personalization, since the recommendation engine should not provide dynamic links to "out-of-date" or non-existent pages.

Each user session can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions, each consisting of a single page reference. The goal of transaction identification is to dynamically create meaningful clusters of references for each user. Based on an underlying model of the user's browsing behavior, each page reference can be categorized as a content reference, an auxiliary (or navigational) reference, or a hybrid. In this way, different types of transactions can be obtained from the user session file, including content-only transactions involving references to content pages and navigation-content transactions involving a mix of page types. The details of methods for transaction identification are discussed in Cadez et al. [38]. For the purposes of this paper, we assume that each user session is viewed as a single transaction containing references to multiple pages in a session. Finally, the session file may be filtered to remove very small transactions.

5.2. Description of the dataset

We collected data from the UCI dataset repository [http://kdd.ics.uci.edu/] consisting of Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999 (Pacific Standard Time). Each sequence in the dataset corresponds to the page views of a user during that twenty-four hour period, and each event in a sequence corresponds to the user's request for a page. Requests are recorded not at the finest level of detail but at the level of page categories, as determined by the site administrator. There are 17 page categories, namely 'frontpage', 'news', 'tech', 'local', 'opinion', 'on-air', 'misc', 'weather', 'health', 'living', 'business', 'sports', 'summary', 'bbs' (bulletin board service), 'travel', 'msn-news' and 'msn-sports'. Table 2 shows the characteristics of the dataset. Each page category is represented by an integer label; for example, 'frontpage' is coded as 1, 'news' as 2 and 'tech' as 3. Each row describes the hits of a single user; for example, as shown in Fig. 2, the fourth user hits 'frontpage' twice and the second user hits 'news' once.

Table 2. Description of the msnbc dataset

Number of users: 989,818
Minimum session length: 1
Maximum session length: 500
Average number of visits per user: 5.7
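For concreteness, the integer coding of page categories described above can be sketched as follows; the decoding helper and the exact input format are illustrative assumptions, not part of the dataset's documentation.

```python
# The 17 page categories, in the order of their integer labels (1-17).
CATEGORIES = ["frontpage", "news", "tech", "local", "opinion", "on-air",
              "misc", "weather", "health", "living", "business", "sports",
              "summary", "bbs", "travel", "msn-news", "msn-sports"]

def decode(line):
    """Turn one space-separated row of integer labels into category names."""
    return [CATEGORIES[int(tok) - 1] for tok in line.split()]

print(decode("1 1"))  # ['frontpage', 'frontpage'] -- a user who hit 'frontpage' twice
print(decode("2"))    # ['news']                   -- a user who hit 'news' once
```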

Fig. 2. Example web navigation data.


Table 3. Similarity matrix using the proposed metric with p = 0.5

      T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
T1    1     0     0     0     0.21  0.29  0     0     0     0
T2    0     1     0     0.47  0.17  0.17  0.17  0     0.15  0.15
T3    0     0     1     0     0     0.25  0.33  0.33  0     0.21
T4    0     0.47  0     1     0.17  0     0.45  0.27  0.24  0.5
T5    0.21  0.17  0     0.17  1     0.18  0     0     0     0
T6    0.29  0.17  0.25  0     0.18  1     0.18  0.21  0     0.17
T7    0     0.17  0.33  0.45  0     0.18  1     0.58  0.17  0.62
T8    0     0     0.33  0.27  0     0.21  0.58  1     0     0.5
T9    0     0.15  0     0.24  0     0     0.17  0     1     0.24
T10   0     0.15  0.21  0.5   0     0.17  0.62  0.5   0.24  1

In the total dataset, the length of user sessions ranges from 1 to 500 and the average session length is 5.7. Keeping this in mind, we randomly selected 5000 sessions, all of length 6, from the preprocessed dataset (consisting of 44,062 sessions) for our experimentation.

5.3. An example

To illustrate our approach, we considered the 10 data sequences shown in Fig. 2. The similarity matrix was computed using the S3M similarity metric with p = 0.5 (Table 3). The first similarity upper approximation at threshold value d = 0.2 is given by R̄(Ti) for i = 1, 2, ..., 10 as follows:

R̄(T1) = {T1, T5, T6}
R̄(T2) = {T2, T4}
R̄(T3) = {T3, T6, T7, T8, T10}
R̄(T4) = {T2, T4, T7, T8, T9, T10}
R̄(T5) = {T1, T5}
R̄(T6) = {T1, T3, T6, T8}
R̄(T7) = {T3, T4, T7, T8, T10}
R̄(T8) = {T3, T4, T6, T7, T8, T10}
R̄(T9) = {T4, T9, T10}
R̄(T10) = {T3, T4, T7, T8, T9, T10}

In the first step, the (unconstrained) second similarity upper approximation of R̄(T1) is given by

R̄R̄′(T1) = {T1, T3, T5, T6, T8}

Now, the constrained-similarity upper approximation is applied on R̄R̄′ using Definition 3 with r = 1. It can be seen that only the elements T1, T5 and T6 qualify to be in R̄R̄(T1). For example, consider the element T3: R̄(T1) ∩ R̄(T3) = {T6} and R̄(T1) − R̄(T3) = {T1, T5}. Thus, the relative similarity between T1 and T3 is

$$\mathrm{RelSim}(T1, T3) = \frac{|\overline{R}(T1) \cap \overline{R}(T3)|}{|\overline{R}(T1) - \overline{R}(T3)|} = \frac{1}{2} < r\ (= 1)$$

Hence, T3 will not merge with R̄(T1).
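The first upper approximations listed above can be reproduced mechanically from Table 3. The following small check (our own, not from the paper) hardcodes the symmetric matrix, with rows and columns T1–T10, and applies Definition 2 with d = 0.2:

```python
SIM = [  # Table 3
    [1, 0, 0, 0, .21, .29, 0, 0, 0, 0],
    [0, 1, 0, .47, .17, .17, .17, 0, .15, .15],
    [0, 0, 1, 0, 0, .25, .33, .33, 0, .21],
    [0, .47, 0, 1, .17, 0, .45, .27, .24, .5],
    [.21, .17, 0, .17, 1, .18, 0, 0, 0, 0],
    [.29, .17, .25, 0, .18, 1, .18, .21, 0, .17],
    [0, .17, .33, .45, 0, .18, 1, .58, .17, .62],
    [0, 0, .33, .27, 0, .21, .58, 1, 0, .5],
    [0, .15, 0, .24, 0, 0, .17, 0, 1, .24],
    [0, .15, .21, .5, 0, .17, .62, .5, .24, 1],
]

d = 0.2
for i, row in enumerate(SIM):
    approx = [f"T{j + 1}" for j, s in enumerate(row) if s >= d]
    print(f"R(T{i + 1}) = {{{', '.join(approx)}}}")
# e.g. R(T1) = {T1, T5, T6}, R(T6) = {T1, T3, T6, T8}, ...
```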


Thus, the family of constrained-similarity upper approximations is as follows:

R̄R̄(T1) = {T1, T5, T6}
R̄R̄(T2) = {T2, T4}
R̄R̄(T3) = {T3, T6, T7, T8, T10}
R̄R̄(T4) = {T2, T4, T7, T8, T9, T10}
R̄R̄(T5) = {T1, T5}
R̄R̄(T6) = {T3, T6, T8}
R̄R̄(T7) = {T3, T4, T7, T8, T10}
R̄R̄(T8) = {T3, T4, T6, T7, T8, T10}
R̄R̄(T9) = {T4, T9, T10}
R̄R̄(T10) = {T3, T4, T7, T8, T9, T10}

In the above family, for every set except R̄R̄(T6) the two consecutive similarity upper approximations are the same, that is, the first and the second similarity upper approximations coincide. For example,

R̄(T1) = R̄R̄(T1) = {T1, T5, T6}

The third similarity upper approximation therefore needs to be computed only for those elements whose consecutive similarity upper approximations differ, in this case only T6:

R̄R̄R̄(T6) = {T3, T6, T8}

Since there is now no change in the constrained-similarity upper approximations of any element, the algorithm has converged. The final cluster family is given by

{T1, T5, T6}, {T2, T4}, {T3, T6, T7, T8, T10}, {T2, T4, T7, T8, T9, T10}, {T1, T5}, {T3, T4, T7, T8, T10}, {T3, T4, T6, T7, T8, T10}, {T4, T9, T10} and {T3, T4, T7, T8, T9, T10}

Figs. 3–5 show the first, second and third similarity upper approximations, respectively, for the 10-sequence example set. Fig. 5 shows the final overlapping clusters formed by the proposed method. For the sake of clarity, we removed from Fig. 5 those clusters that are proper subsets of another cluster; that is, Cl is removed if and only if Cl is a subset of some Cm. In Fig. 4, transaction T1 lies within the cluster shown by the dotted oval formed during the second upper approximation, whereas by the third upper approximation (Fig. 5) T1 has left the dotted cluster and is no longer a member of it, owing to the constrained merging. It can clearly be seen that many transactions belong to multiple clusters; for example, T1 belongs to the first as well as the fourth cluster.

6. Experimental results

We implemented our approach in Java 1.4 and performed the experiments on a 2.4 GHz Pentium-IV machine with 256 MB RAM running Microsoft Windows XP 2002.


Fig. 3. First upper approximation.

Fig. 4. Second upper approximation.

Fig. 5. Third upper approximation.

Experiments were conducted on randomly selected samples of 100, 200, 300, 400 and 500 records from the preprocessed dataset using the S3M metric with p = 0.5. The value of relative similarity (r) used to define the constrained-similarity upper approximation was taken as 1. Fig. 6 shows the number of clusters formed for 100, 200, 300, 400 and 500 records at threshold values 0.8 and 0.9. Table 4 shows the size of the lower approximation sets for our pilot dataset consisting of 100, 200, 300, 400 and 500 records at d values of 0.8 and 0.9. As the number of records increases, the number of clusters formed also increases, with the same amount of variation for both threshold values. In order to capture the implicit grouping of the dataset, the size of the lower approximation must be as small as possible. From Table 4, it is evident that for the 500-record dataset the size of the lower approximation is 30 and 31 at d values of 0.8 and 0.9, respectively. The nature of several clusters formed by the proposed approach is such that all the members of a cluster Ci also belong to other clusters Cj and Ck, as shown in Fig. 7. Such clusters retain the ambiguity of the clustering scheme, justifying the application of rough set based clustering to web transaction data. Table 5 summarizes the number of clusters whose lower approximation is empty for our pilot dataset at d values of 0.8 and 0.9.


Fig. 6. Number of clusters formed for different numbers of records at threshold values 0.8 and 0.9.

Table 4. Lower approximation size for different numbers of records at different threshold values

Number of records    d = 0.8    d = 0.9
100                   9          9
200                  22         23
300                  25         28
400                  27         31
500                  30         31

Fig. 7. Cluster example scenario in rough clustering. Here Ci has a null lower approximation.

Table 5. Number of clusters with null lower approximation for different numbers of records at different threshold values

Number of records    d = 0.8    d = 0.9
100                   2          2
200                   6          5
300                   6          7
400                   8         11
500                   9          9

Experiments were also performed on larger datasets to gauge the performance of the proposed clustering algorithm. We recorded both the number of overlapping clusters formed and the time taken to execute the program on various samples of randomly selected data sequences. The threshold value d and the p value of S3M used in these experiments were 0.9 and 0.5, respectively, and the relative similarity r was chosen as 1. Table 6 shows the number of overlapping clusters formed and the execution time for the various sizes of randomly selected data sequences. A linear increase in the time taken was observed as the number of data sequences increased. Further, we randomly selected 5000 transactions from the preprocessed dataset and executed our proposed approach. We obtained 2674 overlapping clusters at threshold value (d) 0.8 and relative similarity (r) 1; for this experiment, the p value was 0.8. The size of the lower approximation set is 603, and 1628 clusters in total did not have any lower approximation.


Table 6. Number of overlapping clusters formed for different sizes of data sequences

Size of data sequences    Number of overlapping clusters formed    Time taken (s)
1000                        103                                      1082
2000                        316                                      1564
3000                        692                                      2823
4000                       1571                                      3665
5000                       2674                                      5643
6000                       2912                                      9421
7000                       3549                                     12,723
8000                       4389                                     15,063
9000                       4771                                     16,381
10,000                     5512                                     18,214
11,000                     5623                                     19,532
12,000                     6192                                     21,365
13,000                     6923                                     22,381
14,000                     7459                                     24,572
15,000                     7601                                     25,902

The effectiveness of the clustering technique was also measured quantitatively. We calculated the cluster representative by maximizing the average pairwise similarity among the members of the jth cluster using the formula

$$\hat{C}_j = \arg\max_{C_{jl}} \left\{ \frac{1}{|C_j|} \sum_{j_s = 1}^{|C_j|} S^3M(C_{jl}, C_{js}) \right\} \qquad (5)$$

where j = 1, 2, ..., k. This formula finds the cluster representative that maximizes the average intra-cluster similarity under the S3M measure. Since these clusters are composed of sessions that are sequential in nature, the cost of converting the sequences within a cluster into the cluster representative should be minimal (intra-cluster distance), while the cost of converting sequences from two different clusters should be high (inter-cluster distance). We computed the conversion cost of sequences using the Levenshtein distance (LD) for each cluster. The average Levenshtein distance (ALD) reflects the goodness of the clusters with respect to the intra-cluster distance and is expressed as

$$\mathrm{ALD} = \frac{1}{k} \sum_{j=1}^{k} \frac{\sum_{j_l = 1}^{|C_j|} LD(\hat{C}_j, C_{jl})}{|C_j|} \qquad (6)$$

The ALD computation was carried out on the seven big clusters (those whose size exceeds 10% of the dataset size). The ALD for the seven rough clusters is 3.756, which means that sessions preserving the sequential nature have been grouped together. The inter-cluster distances (LD) for the seven clusters are shown in Table 7. Together, these results suggest that the rough set based clustering produced clusters with desirable intra-cluster distance (ALD = 3.756) and inter-cluster distances (LD = 3–6).

We compared our results with the traditional complete linkage based hierarchical clustering algorithm. To compare the algorithms, the sessions falling in the big clusters (1587 sessions) were clustered using the complete linkage clustering algorithm with a frequency-based encoding of the sessions. The ALD value for the complete linkage based hierarchical clustering algorithm is 4.178. Since the length of the sessions is 6, this indicates that the cost of converting two sequences within a cluster is more than 65%; thus, complete linkage hierarchical clustering results in poor intra-cluster distances. Table 8 shows the inter-cluster distances obtained using the complete linkage based hierarchical algorithm: the inter-cluster LD values range from 2 to 5, whereas for rough clustering they range from 3 to 6. It is interesting to note that the maximum inter-cluster distance of 6 was observed among four cluster pairs in the rough set based clustering, whereas the maximum value achieved by the complete linkage clustering algorithm was 5, observed between only two clusters.
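A minimal sketch of Eqs. (5) and (6) follows; here `sim` stands for any pairwise similarity (such as the S3M function sketched in Section 2), sessions are plain Python sequences, and the function names are our own.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

def representative(cluster, sim):
    """Eq. (5): the member with the highest average pairwise similarity."""
    return max(cluster,
               key=lambda x: sum(sim(x, y) for y in cluster) / len(cluster))

def ald(clusters, sim):
    """Eq. (6): average Levenshtein distance to each cluster representative."""
    total = 0.0
    for c in clusters:
        rep = representative(c, sim)
        total += sum(levenshtein(rep, x) for x in c) / len(c)
    return total / len(clusters)
```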


Table 7. Inter-cluster distance in rough set based clustering

      C1   C2   C3   C4   C5   C6   C7
C1     0    3    4    3    3    5    4
C2     3    0    3    5    3    4    6
C3     4    3    0    3    4    6    4
C4     3    5    3    0    6    6    5
C5     3    3    4    6    0    4    3
C6     5    4    6    6    4    0    4
C7     4    6    4    5    3    4    0

Table 8. Inter-cluster distance in complete linkage based hierarchical clustering

      C1   C2   C3   C4   C5   C6   C7
C1     0    3    4    4    3    2    4
C2     3    0    3    5    4    2    3
C3     4    3    0    3    3    4    2
C4     4    5    3    0    3    4    3
C5     3    4    3    3    0    3    2
C6     2    2    4    4    3    0    3
C7     4    3    2    3    2    3    0

These results establish the desirability of the proposed rough set based clustering over complete linkage based hierarchical clustering.

7. Discussion and conclusions

Rough set theory, originally proposed by Pawlak in 1982, has attracted many researchers from various domains and has led to successful applications in many fields. In this paper, we applied the concept of rough sets to cluster objects using the notion of similarity upper approximations. The clusters resulting from web usage mining algorithms do not necessarily have crisp boundaries; rather, they may have fuzzy or rough boundaries [4,39]. Membership of a record in a cluster may not be precisely defined, that is, a record may be a candidate for more than one cluster. The approach presented in this paper results in rough clusters, and the experiments on user navigation data produced meaningful clusters that enable the discovery of navigation patterns. Rough clusters are also helpful for obtaining early warnings of potentially significant changes in the clustering patterns.

Traditional clustering methods such as the k-means approach generate groups by describing the members of each cluster, whereas clustering techniques based on rough set theory generate clusters by describing the main characteristics of each cluster [6,16]. These descriptions provide interpretations of different web page visit sessions and can aid web miners attempting to describe potentially new groups of web users. Rough clustering produces more clusters than standard cluster analysis [6]; the number of clusters required to describe the data is a function of the inter-object distance, and more clusters mean that an object has a higher chance of being in more than one cluster, moving from the lower approximation to the boundary region and reducing the size of the lower approximation.

In this work, we introduced the S3M similarity measure for web data, which explicitly exploits the sequential nature of the data: S3M considers both the order of occurrence of items and the content of the item set. Further, we introduced the concept of constrained-similarity upper approximation and presented a new rough agglomerative clustering approach for sequential data. The proposed approach enables the merging of two or more clusters at each iteration and hence facilitates fast hierarchical clustering. We evaluated our approach on a web navigation dataset collected from the UCI dataset repository and successfully formed rough clusters, reporting the sizes of the lower approximations as well as the number of clusters with an empty lower approximation. The proposed clustering approach is useful for mining sequential data and helps web miners describe the characteristics of potentially new groups of web users.


The effectiveness of the proposed clustering approach has been established by comparing it with the traditional hierarchical clustering algorithm that uses vector encoding of sequences.

Acknowledgement

The authors are thankful to the anonymous referees for their valuable comments, which resulted in an improved presentation of the paper.

References

[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[2] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[3] K.E. Voges, N.K.L. Pope, Generating compact rough cluster descriptions using an evolutionary algorithm, in: Genetic and Evolutionary Computation Conference, LNCS, Springer-Verlag, London, UK, 2004, pp. 1332–1333.
[4] P. Lingras, C. West, Interval set clustering of web users with rough k-means, Journal of Intelligent Information Systems 23 (1) (2004) 5–16.
[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[6] K.E. Voges, N.K.L. Pope, M.R. Brown, Cluster analysis of marketing data examining online shopping orientation: a comparison of k-means and rough clustering approaches, in: H.A. Abbass, R.A. Sarker, C.S. Newton (Eds.), Heuristics and Optimization for Knowledge Discovery, Idea Group Publishing, Hershey, 2002, pp. 207–224.
[7] A. Joshi, R. Krishnapuram, Robust fuzzy clustering methods to support web mining, in: Workshop in Data Mining and Knowledge Discovery, SIGMOD, 1998, pp. 15-1–15-8.
[8] S. Hirano, S. Tsumoto, An indiscernibility-based clustering method with iterative refinement of equivalence relations – rough clustering, Journal of Advanced Computational Intelligence and Intelligent Informatics 7 (2) (2003) 169–177.
[9] S. Hirano, S. Tsumoto, Indiscernibility-based clustering: rough clustering, in: International Fuzzy Systems Association World Congress, LNCS, Springer-Verlag, Heidelberg, 2003, pp. 378–386.
[10] S.K. De, A rough set theoretic approach to clustering, Fundamenta Informaticae 7 (2) (1999) 335–344.
[11] P. Kumar, P.R. Krishna, S.K. De, R.S. Bapi, Web usage mining using rough agglomerative clustering, in: Seventh International Conference on Enterprise Information Systems, LNCS, Springer-Verlag, London, UK, 2005, pp. 315–320.
[12] D. Dubois, H. Prade, Putting rough sets and fuzzy sets together, in: R. Slowinski (Ed.), Intelligent Decision Support – Handbook of Applications and Advances of the Rough Set Theory, Kluwer Academic, 1992, pp. 203–232.
[13] S.K. De, P.R. Krishna, Clustering web transactions using rough approximation, Fuzzy Sets and Systems 148 (1) (2004) 131–138.
[14] S.K. De, R. Biswas, A.R. Roy, Finding dependency of attributes in information system, Journal of Fuzzy Mathematics 64 (3–4) (2004) 409–417.
[15] S. Hirano, S. Tsumoto, Rough clustering and its application to medicine, Information Sciences 124 (2000) 125–137.
[16] P. Lingras, Y.Y. Yao, Time complexity of rough clustering: GAs versus k-means, in: Third International Conference on Rough Sets and Current Trends in Computing, LNCS, Springer-Verlag, London, UK, 2002, pp. 263–270.
[17] S.K. Pal, P. Mitra, Case generation using rough sets with fuzzy representation, IEEE Transactions on Knowledge and Data Engineering 16 (3) (2004) 292–300.
[18] S.K. Pal, A. Skowron, Rough Fuzzy Hybridization: New Trends in Decision Making, Springer-Verlag, Singapore, 1999.
[19] M. Sarkar, Rough-fuzzy functions in classification, Fuzzy Sets and Systems 132 (3) (2002) 353–369.
[20] P. Kumar, M.V. Rao, P.R. Krishna, R.S. Bapi, A. Laha, Intrusion detection system using sequence and set preserving metric, in: Proceedings of the IEEE International Conference on Intelligence and Security Informatics, LNCS, Springer-Verlag, Atlanta, 2005, pp. 498–504.
[21] L. Bergroth, H. Hakonen, T. Raita, A survey of longest common subsequence algorithms, in: Seventh International Symposium on String Processing and Information Retrieval, Atlanta, 2000, pp. 39–48.
[22] M.H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, NJ, 2003.
[23] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356.
[24] S. Greco, B. Matarazzo, R. Slowinski, Rough set processing of vague information using fuzzy similarity relations, in: C.S. Calude, G. Paun (Eds.), Finite Versus Infinite – Contributions to an Eternal Dilemma, LNCS, Springer-Verlag, London, 2000, pp. 149–173.
[25] L. Polkowski, A. Skowron, J. Zytkow, Tolerance based rough sets, in: T.Y. Lin, A.M. Wildberger (Eds.), Soft Computing: Rough Sets, Fuzzy Logic, Neural Networks, Uncertainty Management, Simulation Councils, Inc., San Diego, 1995, pp. 55–58.
[26] Y.Y. Yao, T.Y. Lin, Generalization of rough sets using modal logics, Intelligent Automation and Soft Computing (1996) 103–120.
[27] Y.Y. Yao, Generalized rough set models, in: Rough Sets in Knowledge Discovery, Physica-Verlag, Heidelberg, 1998, pp. 286–318.
[28] Y.Y. Yao, On generalizing rough set theory, in: Ninth International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 2003, pp. 44–51.
[29] M. Kryszkiewicz, Rough set approach to incomplete information systems, Information Sciences 112 (1–4) (1998) 39–49.
[30] T.B. Ho, N.B. Nguyen, Nonhierarchical document clustering based on a tolerance rough set model, International Journal of Intelligent Systems 17 (2) (2002) 199–212.


[31] S. Kawasaki, N.B. Nguyen, T.B. Ho, Hierarchical document clustering based on a tolerance rough set model, in: Fourth European Conference on Principles of Data Mining and Knowledge Discovery, LNCS, Springer-Verlag, London, UK, 2000, pp. 458–463.
[32] C.L. Ngo, H.S. Nguyen, A method of web search result clustering based on rough sets, in: Web Intelligence, 2005, pp. 673–679.
[33] R. Slowinski, D. Vanderpooten, Similarity relations as a basis for rough approximations, ICS Research Report 53/95, Warsaw University of Technology, 1995.
[34] R. Slowinski, D. Vanderpooten, A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge and Data Engineering 12 (2000) 331–336.
[35] Z. Pawlak, Rough Sets – Theoretical Aspects of Reasoning about Data, Kluwer Academic, 1991.
[36] S. Jamil, S.D. Jitender, Concept approximations based on rough sets and similarity measures, International Journal of Applied Mathematics and Computer Science 11 (3) (2001) 655–674.
[37] M. Dash, H. Liu, P. Scheuermann, K.L. Tan, Fast hierarchical clustering and its validation, Data and Knowledge Engineering 44 (1) (2003) 109–138.
[38] I.V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of navigation patterns on a web site using model-based clustering, in: Knowledge Discovery and Data Mining, 2000, pp. 280–284.
[39] P. Lingras, R. Yan, A. Jain, Web usage mining: comparison of conventional, fuzzy, and rough set clustering, in: Y. Zhang, J. Liu, Y. Yao (Eds.), Computational Web Intelligence: Intelligent Technology for Web Applications, Springer, 2004, pp. 133–148.

