IJRIT International Journal of Research in Information Technology, Volume 2, Issue 2, February 2014,Pg: 159-164

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Study of Basics of Web Mining and Markov Models for Personalization

Varun Hooda1 and Mamta Kathuria2

1 Deptt. of Computer Engineering, YMCA University of Science and Technology, Faridabad, Haryana, INDIA, [email protected]

2 Deptt. of Computer Engineering, YMCA University of Science and Technology, Faridabad, Haryana, INDIA, [email protected]

Abstract- Web mining is an important technique in the field of IT for extracting useful data from the Internet. Effective implementation of web mining techniques helps in predicting a user's web accesses. Users' navigational patterns help in decisions about site restructuring or modification. This paper presents various techniques of web mining and their applications in personalization. The user's next action in web access can be predicted using a Markov Model, which takes the user's past visits to various web pages into consideration. Users' inclinations and trends are also helpful from a business point of view.

Keywords- Web mining, Hypertext, Data mining, Navigational pattern, Dimensionality.

I. INTRODUCTION

The term Web Mining combines two terms: Web and mining. The Web is the World Wide Web (WWW), a system of interlinked hypertext documents accessed via the Internet. Extracting useful data from the World Wide Web is called "mining". Today the WWW has become one of the most complete resources of information, providing the information required by almost every user. Exploration of the WWW has grown exponentially as its size has increased. Ease of use and the growing use of computers and the Internet in daily life have increased the number of users. Meeting the search requirements of every user in terms of relevancy of data is a challenging task. Web mining provides various techniques for dealing effectively with the search queries issued by users.

Web mining is the extraction of useful web data from the Internet and its analysis for a particular purpose. It can be considered the application of data mining techniques to web data, where data mining is a process of knowledge discovery from large databases. Web data is typically unlabelled, heterogeneous, high dimensional, semi-structured, and distributed [1]. Hence any human interface must handle context-sensitive and imprecise queries effectively. The data relevant to a given user is very small compared to the whole of the data on the World Wide Web, so most of the data is irrelevant to that user.

Varun Hooda, IJRIT


With careful analysis of web logs, hidden knowledge can be extracted and used for business through web sites. User profiles and past web accesses may help in gauging the interest of the customer. From the business point of view, web mining can be helpful in knowing the inclinations of users. In this paper we discuss various web mining techniques and their applications. The paper is organized as follows: Section 2 discusses web mining and its techniques, Section 3 discusses the Markov Model, and Section 4 concludes the study.

II. WEB MINING AND ITS TECHNIQUES

Web mining includes various techniques for selecting useful data from the WWW. Being a broad technique, web mining is classified into three types [2; 3]:
a) Web Structure Mining
b) Web Content Mining
c) Web Usage Mining

A. Web Structure Mining

A special feature of the World Wide Web, which distinguishes it from other databases, is that it is strongly connected by hyperlinks. Hyperlinks connect the whole WWW through links among web sites. The WWW can be considered a directed labeled graph in which documents are the nodes and the hyperlinks between them are the edges. The Web is a huge structure and is growing very rapidly. It lacks organization and structure, but is held together by the hyperlinks between web sites. Dealing with the structure of the hyperlinks within the Web itself is the challenge for Web Structure Mining.

Link analysis is an established area of research, but growing interest in web mining has increased research on structure analysis. This has resulted in a new emerging research area called Link Mining [4]. The content on the Web today is much greater than that of a traditional collection of text documents, and differences in authoring style mean that the Web has no unifying structure. Web pages, links, and co-citations (two pages that are both linked to by the same page) are considered the objects of the WWW, and its attributes are HTML tags, word appearances and anchor tags.

Various tasks related to link mining are applicable to Web structure mining. The most recent extension of a classic data mining task to linked domains is link-based classification of Web pages. The category of a web page is predicted from the words that occur on the page, the links between pages, anchor text, HTML tags and other attributes found on the page.

1. HITS Concept

The HITS concept [5] is useful in understanding the intrinsic social organization of the Web, and it uses that organization to determine the importance of web pages. On the basis of their content and links, pages are of two kinds: hubs and authorities. Authorities are pages that are good sources of content. Hubs are pages that are good sources of links. According to Kleinberg [5], “Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.”
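The mutually reinforcing relationship quoted above can be illustrated with a small iterative computation. The following is a minimal sketch, not Kleinberg's published implementation: the function name `hits`, the toy link graph and the fixed iteration count are assumptions made for illustration only.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as page -> linked pages."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth


# Hypothetical toy graph: three link pages pointing at three content pages.
toy_links = {
    "portal": ["paper_a", "paper_b"],
    "directory": ["paper_a", "paper_b", "paper_c"],
    "blog": ["paper_a"],
}
hub_scores, auth_scores = hits(toy_links)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph the page that links to the most content pages ends up as the strongest hub, and the page linked to by every hub ends up as the strongest authority.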


B. Web Content Mining

The exponential growth in the size of the World Wide Web has been matched by growth in its usage. The WWW has become a comprehensive source of information for a large number of users. Retrieval of information from the web has become very difficult because of its dynamic and heterogeneous nature. Web Content Mining enables effective retrieval, organization, management, and discovery of the large amount of data and web resources.

C. Web Usage Mining

Identification of navigational patterns is an important aspect of web usage mining. It is achieved by analyzing the web logs of sites, which record users' activity while browsing. The quantity of information hidden in the web logs of a site is very high, and it is meaningless unless properly processed by efficient web usage mining techniques. Users' patterns and interesting trends can be identified using statistical methods and various data mining techniques. The interests and behaviors of a site's visitors can be used as input for redesigning or customizing the site. From the business point of view, users' web access history can be used to create consumer profiles and obtain the benefits of market segmentation. Personalization of a site can be achieved by adjusting its content to the needs of a particular group of users.

Web usage mining is simply the analysis of users' behavior based on their web accesses. It can be considered a three-stage process consisting of data preparation [6], pattern discovery and pattern analysis [7].

[Figure 1: Data Preprocessing -> Pattern Discovery -> Pattern Analysis]

Figure 1. Three Staged Web Usage Mining Process

The input to the Data Preprocessing stage is raw web data and its output is pre-processed data. The output of the first stage serves as input to the second stage, and the output of the Pattern Discovery stage is fed as input to the final stage (Pattern Analysis). The overall three-stage process shown in Figure 1 converts unusable raw data into useful data that can help in various web usage applications. In the data preparation stage, web data is pre-processed so that it can be used efficiently in the later stages of the web mining process. In the second stage, pattern discovery, various statistical methods and data mining techniques are used to obtain the patterns hidden in web logs. These patterns are stored so that they can be analyzed further in the pattern analysis stage. Table 1 specifies various challenges and applications of the different web mining techniques.

Table 1. Challenges and Applications of Web Mining Techniques

Web Mining Type: Structure Mining; Content Mining; Usage Mining

Challenges: Semi-structured; No unifying structure; Dynamic & heterogeneous nature of Web; User identification

Applications: Navigational Pattern; Retrieval of large amount of data; Personalization; Web site restructuring or customization

1. Data Preprocessing


Data mining results are directly affected by the quality of the data used for analysis. Raw data is highly inconsistent, so it is first pre-processed to improve its quality and the ease and efficiency of the mining process. Data pre-processing is a crucial step in the data mining process.

The initial task of the pre-processing phase is data preparation. Web log entries for pages that returned errors need to be cleaned, depending on the application. Crawler activity should be filtered from the web log, as it provides no useful information about the usability of the web site.

An important issue in data pre-processing is caching: visits to cached pages are not recorded in the web log. The number of pages served from caches is quite high because of multilevel caching, so information about those accesses is missing from the web log. Dealing with caching is not easy because it depends on client-side technology.

User identification must also be handled carefully. There are various methods for identifying individual visitors. One basic solution is to assume that each IP address signifies a single visitor [8]. This is not a precise method, because multiple users may access the web from the same IP address. A more accurate way to identify unique visitors is the use of cookies. User registration is also an effective approach to identifying users uniquely. The problem with approaches based on user identification is that users may be reluctant to share personal information. After a user is identified, that user's accesses to web pages are stored, and further analysis is performed to find hidden patterns in the pattern discovery stage.

2. Pattern Discovery

Log analysis tools take raw web data as input and process it to extract statistical information.
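To make the input to such log analysis concrete, the cleaning and user-identification steps of the pre-processing stage can be sketched as follows. This is an illustration under stated assumptions: the record layout, the sample entries, the crawler heuristic in `CRAWLER_MARKERS` and the one-visitor-per-IP rule (the basic solution described above, with its known imprecision) are all hypothetical choices, not a prescribed method.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (ip, user_agent, url, status).
RAW_LOG = [
    ("10.0.0.1", "Mozilla/5.0", "/home", 200),
    ("10.0.0.1", "Mozilla/5.0", "/products", 200),
    ("10.0.0.2", "Googlebot/2.1", "/home", 200),   # crawler activity
    ("10.0.0.3", "Mozilla/5.0", "/missing", 404),  # page returning an error
    ("10.0.0.3", "Mozilla/5.0", "/home", 200),
]

# Assumed heuristic: substrings that mark a user agent as a crawler.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def is_crawler(user_agent):
    """Return True if the user agent string looks like a crawler."""
    agent = user_agent.lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)


def preprocess(records):
    """Drop error responses and crawler hits, then group page visits by IP."""
    visits = defaultdict(list)
    for ip, user_agent, url, status in records:
        if status >= 400:           # clean entries for pages returning errors
            continue
        if is_crawler(user_agent):  # filter crawler activity
            continue
        visits[ip].append(url)      # treat each IP address as one visitor
    return dict(visits)


visits_by_ip = preprocess(RAW_LOG)
```

Accesses served from caches never reach the log, so no amount of filtering in a sketch like this can recover them.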
The statistical information obtained includes the number of visits, the number of hits, the average view time and the average path length through a site. Statistics on errors, such as server errors or page-not-found errors, are included, as are client statistics such as the users' web browser, operating system, cookies and other applications. As discussed, the basic method applied to pre-processed data to discover patterns is statistical analysis.

Association rule mining is a technique used to find correlations and frequent patterns among sets of items. Here it is used to determine correlations between the web pages accessed in a single session. The interests of a user or a particular group of users are inferred by studying the relationships between pages viewed together. Web site restructuring or customization can be guided by these interrelations between accessed pages. With knowledge of the user's interests, system performance can be enhanced by pre-fetching web data, which makes better use of system cycles.

The sequence in which web pages are accessed one after another is useful for determining user trends and predicting future patterns. A better way of discovering patterns is to form clusters of pages that are related according to users' accesses. Clustering can also be applied to users, forming clusters of users with similar navigational patterns on a web site. The most important method for extracting usage patterns from web logs is Markov Models.

3. Pattern Analysis

After the hidden patterns are discovered, they are analysed to study user trends. Various visualization methods are used for better interpretation of the results. Associating these results with the structure and content information of the web site yields the information needed to guide modification of the site. Restructuring or customizing the web site in this way provides a better interface to the user. Table 2 summarizes the input and output of the different web usage mining techniques.


Table 2. Input and Output of Web Usage Mining Techniques

Technique | Input | Output
Data Preprocessing | Raw Data | Pre-processed Data
Pattern Discovery | Pre-processed Data | Statistical Information
Pattern Analysis | Obtained Pattern | Various presentations of interesting pattern
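As an illustration of the pattern discovery stage, the association between pages viewed in the same session can be measured with support and confidence. The sketch below is a simplified pairwise version rather than a full association rule miner; the session data, the function name `pair_rules` and the 0.5 support threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pre-processed sessions: pages accessed within one session.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]


def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules 'a is viewed with b'."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # each page counts once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n           # share of sessions containing both pages
        if support < min_support:
            continue
        # Confidence of a -> b: sessions with both, among sessions with a.
        rules[(a, b)] = (support, count / page_count[a])
        rules[(b, a)] = (support, count / page_count[b])
    return rules


rules = pair_rules(SESSIONS)
```

With these sessions, every session that contains `/cart` also contains `/products`, so the rule from `/cart` to `/products` has confidence 1.0, which is the kind of interrelation that could guide restructuring or pre-fetching.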

III. MARKOV MODEL

Markov Models [9] can be considered an application of Markov chains to the navigation of a user. A Markov Model is useful for predicting the next action of a user based on previous actions. In the context of web page accesses, the model can be used to predict the probability of access to a web page based on the history of page accesses. This approach differs from other pattern discovery approaches in that it is based on probability rather than on data mining techniques. Prediction of users' actions and personalization of a site can both be achieved using Markov Models.

The pre-processing stage for Markov Models is similar to that of the data mining approaches. The input data for Markov Models takes the form of a weighted transition graph. The transition graph [9] represents the web accesses of a user or a particular group of users. The nodes of the graph represent web pages, and the edges between nodes represent the hyperlinks between those pages. The weight of an edge represents the number of visits from one node to another.

A. Prediction

The likelihood that a user will access a page in the future is calculated using the navigational pattern revealed by the transition graph. The weights of the edges between nodes are used to calculate the probability of visiting page j from page i. A transition probability matrix is created by calculating the one-step transition probabilities of moving from one web page to another.

B. Limitations of Markov Models

Markov Models are useful for predicting accesses of web pages. Their main problem is that they cannot predict pages that users have not visited earlier, so prediction cannot be done for pages that are new to the users.
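The one-step transition probabilities, and the reason unvisited pages cannot be predicted, can be written compactly. The notation is an assumption introduced here rather than taken from the paper: $w_{ij}$ denotes the weight of the edge from page $i$ to page $j$ in the transition graph, and $p_{ij}$ the estimated probability of visiting $j$ next from $i$. One common estimate normalizes each weight by the total outgoing weight of page $i$:

```latex
p_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}}, \qquad \sum_{j} p_{ij} = 1 .
```

Under this estimate, a page $j$ that has never been visited has $w_{ij} = 0$ for every $i$, so $p_{ij} = 0$ and the model never predicts it.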
It can be concluded that two major shortcomings of Markov Models are:

1) Large input data requirement
Markov Models are based on statistical processes, so the quality of prediction depends on the quantity of web log data available. The need for a very large amount of input data is therefore a shortcoming of Markov Models.

2) Dimensionality problem
The transition matrix created from the transition graph is usually very large because of the large number of web pages. The dimensionality problem can be reduced by clustering similar web pages.
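The prediction step and the limitations discussed in this section can be illustrated with a small first-order sketch. It is not the method of [9]: the session data and the function names `transition_counts`, `transition_probabilities` and `predict_next` are hypothetical, and the edge weights are simply counts of consecutive page pairs.

```python
from collections import Counter, defaultdict

# Hypothetical pre-processed sessions: ordered page accesses per visitor.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/about"],
    ["/home", "/about"],
    ["/products", "/cart"],
]


def transition_counts(sessions):
    """Weighted transition graph: counts[i][j] = number of moves from i to j."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def transition_probabilities(counts):
    """Normalize each page's outgoing weights into one-step probabilities."""
    probs = {}
    for page, followers in counts.items():
        total = sum(followers.values())
        probs[page] = {nxt: c / total for nxt, c in followers.items()}
    return probs


def predict_next(probs, page):
    """Most probable next page, or None if no transition from `page` was seen."""
    followers = probs.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


probs = transition_probabilities(transition_counts(SESSIONS))
likely_after_home = predict_next(probs, "/home")
unseen_page_prediction = predict_next(probs, "/new-page")
```

Only observed transitions are stored here, which keeps the sketch small; a dense matrix over n pages would need n x n entries, which is the dimensionality problem noted above. A page absent from the log, such as the hypothetical `/new-page`, yields no prediction at all.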


IV. CONCLUSION

In this paper, we discussed the basics of web mining and its types. Various challenges and applications of all types of web mining were surveyed. The HITS concept, which helps in understanding the intrinsic structure of the World Wide Web (WWW), was included. The paper focuses mainly on web usage mining and its applications. Markov Models, an important method of extracting usage patterns, were discussed. The result of a Markov Model, the prediction of a user's next action, can be used for web personalization.

REFERENCES

[1] Miguel Gomes da Costa Junior, Zhiguo Gong, Web Structure Mining: An Introduction, IEEE International Conference on Information Acquisition, June 27-July 3, 2005.
[2] Raymond Kosala, Hendrik Blockeel, Web Mining Research: A Survey, ACM SIGKDD Explorations Newsletter, Volume 2, Issue 1, June 2000.
[3] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, ACM SIGKDD Explorations Newsletter, Volume 1, Issue 2, January 2000.
[4] L. Getoor, Link Mining: A New Data Mining Challenge, SIGKDD Explorations, vol. 4, issue 2, 2003.
[5] J.M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 668-677.
[6] R. Nicole, “The Last Word on Decision Theory,” J. Computer Vision, submitted for publication. (Pending Publication)
[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[8] C.J. Kaufman, Rocky Mountain Research Laboratories, Boulder, Colo., personal communication, 1992. (Personal Communication)
[9] J. Zhu, J. Hong, J.G. Hughes, Using Markov Chains for Link Prediction in Adaptive Web Sites, in Proceedings of the First International Conference on Computing in an Imperfect World, 2002.
