IJRIT International Journal of Research in Information Technology, Volume 2, Issue 2, February 2014, Pg: 155-158

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Web page clustering using Query Directed Clustering algorithm: A Review

Ms. Priya S.Yadav1, Ms. Pranali G. Wadighare2, Ms. Sneha L. Pise3, Ms. Sonali M. Latelwar4, Ms. Ravina S.Sheikh5
Information Technology, DBACER, Wanadongari, Nagpur, Maharashtra
Email id. : [email protected] , [email protected] , [email protected] , [email protected] , [email protected]

Ms. Sumedha C. Chokhandre
Lecturer, Information Technology, DBACER, Wanadongari, Nagpur, Maharashtra
Email id. : [email protected]

Abstract Web page clustering is one of the major preprocessing steps in web mining analysis. Because the amount of data to process is potentially infinite when dynamic web pages are considered, this information must be preprocessed to keep the computation tractable. Moreover, considering pages individually may not provide additional information. To deal with this issue, this contribution proposes the use of web clustering techniques. We describe a web page clustering algorithm, QDC, which uses the user's query as part of a reliable measure of cluster quality. This algorithm has five key innovations: a query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining (cluster drifting) problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster.

Keywords : Web mining, Web page clustering, Query Directed Clustering (QDC)

I. INTRODUCTION Web page clustering methods categorize and organize search results into semantically meaningful clusters that assist users with search refinement. Web page clustering puts web pages together in groups, based on similarity or other relationship measures. Tightly coupled pages, that is, pages in the same cluster, are treated as a single item in the following data analysis steps. A complete data mining analysis could be performed using web page information as it appears in web logs, but when the number of pages to take into account increases (for example, in a large-scale corporate web server or a server using dynamic web pages) this process becomes very hard or even infeasible. Web page clustering is a reasonable solution to this issue.

Web page clustering is also one approach for assisting users both to comprehend the result set and to refine the query. Web page clustering algorithms identify semantically meaningful groups of web pages and present these to the user as clusters. The clusters provide an overview of the contents of the result set, and when a cluster is selected the result set is refined to just the relevant pages in that cluster. Clustering performance is very important for usability. If cluster quality is poor, the clusters will be semantically meaningless or will contain many irrelevant pages. If cluster coverage is poor, then clusters representing useful groups of pages will be missing or the clusters will be missing many relevant pages. Improving the performance of web page clustering algorithms is therefore worthwhile.

This paper presents QDC, a query directed web page clustering algorithm which its authors report gives better clustering performance than other clustering algorithms [1]. QDC has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining (drifting) problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster.

II. WEB TEXT MINING MODEL The proposed web text mining model consists of several phases. First, the query is given to a search engine, which returns 100 URLs related to that query. The summaries of the extracted URLs are then passed through various preprocessing phases. Next, clusters are formed using the query directed web page clustering algorithm [4]. Finally, the vector space model is used to express the similarity between the query and each document.

Figure 1 : Web Text Mining Model
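As an illustration of the last phase of the model, the following minimal sketch computes a vector space similarity between a query and a result summary. It assumes raw term-frequency vectors and cosine similarity; the paper does not specify the term weighting, and the names `tokenize`, `cosine_similarity` and `rank_snippets` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(query, document):
    """Cosine of the angle between raw term-frequency vectors (assumed weighting)."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(document))
    # A Counter returns 0 for terms it has not seen, so only query terms matter here.
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(count * count for count in q.values()))
    norm_d = math.sqrt(sum(count * count for count in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank_snippets(query, snippets):
    """Return (score, snippet) pairs, most similar to the query first."""
    scored = [(cosine_similarity(query, s), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A summary that shares no term with the query scores 0, and one with the same term distribution as the query scores approximately 1.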

III. MOST IMPORTANT WEB PAGE CLUSTERING TECHNIQUES Web page clustering takes the set of web pages hosted on a web server and produces a collection of web page sets (clusters). These clusters are used in the following steps of the mining process instead of the original pages. There are three web clustering criteria: semantic, structure based, and usage based.

1. SEMANTIC CLUSTERING Semantic web page clustering is based on the concept of web page hierarchies. The lowest-level leaves in these hierarchies are web pages, which are grouped into higher-level nodes based on semantic affinities. For example, product web pages are clustered into several product families that are later grouped into a cluster for all products; other clusters, for corporate or support information, can be defined alongside it. Semantic hierarchies can be defined following many different criteria, depending on the objectives and strategies of the analysis, and hence many different collections of clusters can be produced [5]. This web page clustering technique always requires some domain information, obtained either from domain experts or from a semantic repository.

2. GRAPH PARTITIONING FOR WEB PAGE CLUSTERING Structure-based and usage-based page clustering are very similar. Both approaches build a web page graph in which the nodes are the web pages and the arcs are the links among these pages. The arcs can be defined by the actual web links, when only web structure is considered, or can be weighted by how often users follow these transitions.
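The web page graph described above can be sketched as follows. This is only the graph construction step, under assumed inputs (a list of hyperlinks and, optionally, page sequences taken from web logs); the partitioning step itself is not shown, and `build_page_graph` is a hypothetical name.

```python
from collections import defaultdict


def build_page_graph(links, sessions=None):
    """Return {(source, target): weight} for a directed web page graph.

    links    -- iterable of (source, target) hyperlinks (structure-based view)
    sessions -- optional list of page sequences from web logs; when given,
                each arc is weighted by how often users followed it
                (usage-based view) instead of the unit structural weight.
    """
    graph = {arc: 1.0 for arc in links}
    if sessions is None:
        return graph

    # Count every consecutive page-to-page transition seen in the sessions.
    counts = defaultdict(int)
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[(src, dst)] += 1

    # Keep only arcs that exist as links; arcs nobody followed get weight 0.
    return {arc: float(counts[arc]) for arc in graph}
```

A graph partitioning method would then take these weighted arcs as input and cut the graph into groups of strongly connected pages.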

IV. ALGORITHM – QDC This section describes the query directed web page clustering algorithm, which has five stages.

1. FIND BASE CLUSTERS A base cluster is described by a single word and consists of all the pages containing that word. Base clusters are single-word search refinements based on the current search results. After standard page pre-processing, QDC constructs a collection of base clusters, one for every word.

2. MERGE CLUSTERS QDC constructs larger clusters by merging clusters together. Each cluster is constructed from a set of base clusters, and a cluster is described by the word that describes the cluster's largest base cluster.
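The construction of base clusters in stage 1 can be sketched as an inverted index from words to pages. The tokenization and the small stop-word list below stand in for the "standard page pre-processing" and are assumptions, as is the name `find_base_clusters`.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; real pre-processing would be richer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def find_base_clusters(pages):
    """Map each word to the set of indices of the pages that contain it."""
    base_clusters = defaultdict(set)
    for index, text in enumerate(pages):
        # Use a set so that a word repeated within one page is counted once.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            if word not in STOP_WORDS:
                base_clusters[word].add(index)
    return dict(base_clusters)
```

For the pages "Jaguar car dealer", "Jaguar the big cat" and "Car rental", the base cluster for "jaguar" holds pages 0 and 1, and the one for "car" holds pages 0 and 2.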

Figure 2 : Five Steps Of QDC
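The merging stage can be illustrated with a minimal single-link sketch over base clusters. It assumes that two base clusters are related when their page overlap reaches a threshold; the overlap measure, the `min_overlap` value and the `descriptions_related` hook (a placeholder for the cluster description similarity, whose definition is not given here) are all assumptions rather than the measures used by QDC.

```python
def overlap(a, b):
    """Fraction of the smaller page set shared with the other (sets must be non-empty)."""
    return len(a & b) / min(len(a), len(b))


def _always_related(word1, word2):
    """Placeholder for a description similarity test; accepts every pair."""
    return True


def merge_clusters(base_clusters, min_overlap=0.6, descriptions_related=_always_related):
    """Single-link merge of base clusters given as {word: set of page ids}.

    Returns a list of (description, member_words, pages) tuples.
    """
    words = sorted(base_clusters)
    parent = {w: w for w in words}  # union-find forest over base clusters

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    # Single link: one sufficiently related pair is enough to join two groups.
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if (overlap(base_clusters[w1], base_clusters[w2]) >= min_overlap
                    and descriptions_related(w1, w2)):
                parent[find(w1)] = find(w2)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)

    clusters = []
    for members in groups.values():
        pages = set().union(*(base_clusters[w] for w in members))
        # The cluster is described by the word of its largest base cluster.
        description = max(members, key=lambda w: len(base_clusters[w]))
        clusters.append((description, members, pages))
    return clusters
```

Because a single related pair joins two groups, long chains of pairwise-related base clusters can end up in one cluster, which is the chaining effect that the splitting stage addresses.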

3. SPLIT CLUSTERS Each cluster now contains at least all the base clusters that relate to one idea; this is assured because single-link clustering merges all related clusters. However, single-link clustering can produce clusters containing multiple ideas and irrelevant base clusters due to cluster chaining (drifting). Such clusters need to be split. QDC identifies the sub-cluster structure within each cluster. The algorithm uses a distance measure to build a dendrogram for each cluster, starting from the base clusters in the cluster. Each cluster is split by cutting its dendrogram at an appropriate point: when the distance between the closest pair of sub-clusters falls below a threshold. This threshold means that any groups of base clusters that are not tightly interconnected with each other will be split.

4. SELECT CLUSTERS At this stage, QDC has a small set of coherent clusters. However, there will still be more clusters than can be presented to the user. QDC needs to select the best subset of the clusters to present to the user. Ideally, these clusters should be high-quality clusters that cover all the pages in the original set with minimal overlap.

5. CLEAN CLUSTERS Base clusters are sometimes formed from polysemous words, and therefore clusters can contain pages that cover different topics. Since each cluster should relate to only one topic, pages from other topics are irrelevant. QDC computes the relevance of each page in each cluster and removes irrelevant pages.

V. CONCLUSION This paper has presented a web text mining model that incorporates web page clustering with the QDC algorithm. The clustering algorithm has five key innovations. Firstly, it identifies better clusters using a query directed cluster quality guide that considers the relationship between a cluster's descriptive terms and the query terms. Secondly, it increases the merging of semantically related clusters and decreases the merging of semantically unrelated clusters by comparing the descriptions of clusters in addition to comparing the overlap of page contents between clusters. Thirdly, it fixes the cluster chaining (drifting) problem using a new cluster splitting method. Fourthly, it chooses better clusters to show the user by improving the ESTC cluster selection heuristic to consider the number of clusters to select and cluster quality. Finally, it improves the clusters by ranking the pages according to cluster relevance. A phrase can also be given as the query to this model.

VI. REFERENCES
[1] Daniel Crabtree, Peter Andreae, Xiaoying Gao, "Query Directed Web Page Clustering", Proceedings of the 2006 IEEE/WIC/ACM, IEEE, 2006.
[2] Ms. Chhaya M. Meshram, Prof. Rahila Sheikh (B.D.C.O.E. Sevagram, R.G.C.E.R.T. Chandrapur), "A mining technique for Web Data using Clustering", ISSN: 2231-2803, pp. 240-244, 2011.
[3] Antonio LaTorre, Jose M. Pena, Víctor Robles, María S. Perez, "A Survey in Web Page Clustering Techniques".
[4] A. Strehl, "Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining", PhD thesis, Faculty of the Graduate School of The University of Texas at Austin, 2002.
[5] Y. Wang and M. Kitsuregawa, "On combining link and contents information for web page clustering", in 13th International Conference on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence, France, pp. 902-913, September 2002.
[6] Rakesh Agrawal and Ramakrishnan Srikant, "Fast algorithms for mining association rules", in Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[7] P. Atzeni, G. Mecca, P. Merialdo, "Managing web-based data: Database models and transformations", IEEE Internet Computing 6 (4), pp. 33-37, 2002.
