1 of 18

Web Usage Mining: A Review

Presented by: Urvek Shah (SVNIT), M. A. Zaveri (SVNIT)

National Conference on “Emerging Trends in Computer Technology”, SCET, Surat, 26/6/2008

2 of 18

Web Mining

The extraction of unknown, interesting knowledge from the WWW

A Taxonomy of Web Mining

Web data: content, structure, usage

Web mining: content mining, structure mining, usage mining

3 of 18

Web Mining – Categories

Content Mining: The discovery of useful information from Web contents

Structure Mining: The discovery of useful information from the Web structure

Web Usage Mining: The discovery of interesting user access patterns from Web server logs

4 of 18

WUM – Server logs

123.456.78.9 - - [25/Apr/2008:19:13:44 -0400] "GET /depht.php/coed.php HTTP/1.0" 200 1849 http://www.svnit.ac.in/ "Mozilla/4.51 [en] (Win98;I)"

IP Address   | Time                         | Method/URL/Protocol | Status | Size | Referred | Agent
123.456.78.9 | [25/Apr/2008:03:04:41 -0500] | GET A.html HTTP/1.0 | 200    | 3290 | -        | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:05:34 -0500] | GET B.html HTTP/1.0 | 200    | 2050 | A.html   | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:05:39 -0500] | GET L.html HTTP/1.0 | 200    | 4130 | -        | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:06:02 -0500] | GET F.html HTTP/1.0 | 200    | 5096 | B.html   | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:06:58 -0500] | GET A.html HTTP/1.0 | 200    | 3290 | -        | Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 | [25/Apr/2008:03:07:42 -0500] | GET B.html HTTP/1.0 | 200    | 2050 | A.html   | Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 | [25/Apr/2008:03:07:55 -0500] | GET R.html HTTP/1.0 | 200    | 8180 | L.html   | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:09:50 -0500] | GET C.html HTTP/1.0 | 200    | 1820 | A.html   | Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 | [25/Apr/2008:03:10:02 -0500] | GET O.html HTTP/1.0 | 200    | 2270 | F.html   | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:03:10:45 -0500] | GET J.html HTTP/1.0 | 200    | 9430 | C.html   | Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 | [25/Apr/2008:03:12:23 -0500] | GET G.html HTTP/1.0 | 200    | 7220 | B.html   | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:05:05:22 -0500] | GET A.html HTTP/1.0 | 200    | 3290 | -        | Mozilla/3.01 (Win95, I)
123.456.78.9 | [25/Apr/2008:05:06:03 -0500] | GET D.html HTTP/1.0 | 200    | 1680 | A.html   | Mozilla/3.01 (Win95, I)
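
A minimal sketch of how one such log entry could be split into the fields listed above (IP address, time, method/URL/protocol, status, size, referrer, agent). The regular expression and the name `parse_entry` are illustrative assumptions, not part of the presentation; the pattern accepts the referrer either bare, as in the sample entry, or quoted, as many servers write it.

```python
import re
from datetime import datetime

# Assumed pattern for the entry layout of the sample line; the referrer
# may appear bare or inside double quotes.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"?(?P<referrer>[^"\s]+)"? '
    r'"(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that later phases compare numerically.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('123.456.78.9 - - [25/Apr/2008:19:13:44 -0400] '
          '"GET /depht.php/coed.php HTTP/1.0" 200 1849 '
          'http://www.svnit.ac.in/ "Mozilla/4.51 [en] (Win98;I)"')
entry = parse_entry(sample)
print(entry["ip"], entry["url"], entry["status"], entry["referrer"])
```

Lines that do not fit the pattern return `None`, so a caller can count or discard them during pre-processing.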

5 of 18

Web Usage Mining (WUM)

Possible data sources:
Server-side collection
Client-side collection
Proxy-side collection

6 of 18

WUM – Three Phases

Raw server log → Pre-Processing → User session file → Pattern Discovery → Rules and patterns → Pattern Analysis → Interesting knowledge

7 of 18

WUM – Pre-Processing

Pre-processing includes the following tasks:

Data Cleaning: removes log entries that are not needed for the mining process

User Identification: identifies different users (by IP address)

Session Identification: groups each user’s page references into user sessions
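
The data cleaning task can be sketched as a simple filter. The presentation does not say which entries count as "not needed"; the rules below (drop embedded resources such as images and style sheets, and drop failed requests) are a common choice but are an assumption here, as are the suffix list, the function name `clean_log`, and the sample entries.

```python
# Assumed cleaning rules: these suffixes mark embedded resources that a
# browser fetches automatically, not pages the user chose to visit.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean_log(entries):
    """Keep only successful requests for pages."""
    kept = []
    for e in entries:
        # Ignore any query string when checking the file suffix.
        path = e["url"].lower().split("?", 1)[0]
        if path.endswith(IGNORED_SUFFIXES):
            continue
        if not 200 <= e["status"] < 300:
            continue
        kept.append(e)
    return kept

# Hypothetical parsed entries, for illustration only.
raw = [
    {"ip": "10.0.0.1", "url": "/index.php", "status": 200},
    {"ip": "10.0.0.1", "url": "/images/logo.gif", "status": 200},
    {"ip": "10.0.0.1", "url": "/missing.php", "status": 404},
    {"ip": "10.0.0.2", "url": "/coed.php?id=3", "status": 200},
]
cleaned = clean_log(raw)
print([e["url"] for e in cleaned])
```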

9 of 18

WUM – Issues in User Session Identification

A single IP address is used by many users: different users → proxy server → Web server

Different IP addresses in a single session: single user → ISP server → Web server

Missing cache hits in the server logs

10 of 18

WUM – Solutions

Remote Agent: A remote agent is implemented as a Java applet. It is loaded into the client only once, when the first page is accessed. Subsequent requests are captured and sent back to the server.

Modified Browser: The source code of an existing browser can be modified to collect user-specific data at the client side.

Heuristics: Use a set of assumptions to identify user sessions and find the missing cache hits in the server log.

11 of 18

WUM – Heuristics

Session identification heuristics:

Timeout: If the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.

IP/Agent: Each different agent type for an IP address represents a different session.

Referring page field: If the referring page for a request is not part of an open session, it is assumed that the request is coming from a different session.
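
The three heuristics can be combined in a short sketch and applied to the thirteen sample requests from the server log slide. Several details are assumptions rather than part of the presentation: the 30-minute timeout (the slide only says "a certain limit"), treating a "-" referrer as the start of a new session, the function name `identify_sessions`, and the shortened page and agent labels.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed limit; the slide gives no value

# (time, page, referrer, agent) for the sample log rows. All requests share
# one IP address, so it is omitted; ".html" is dropped from page names.
LOG = [
    ("03:04:41", "A", "-", "Win95"), ("03:05:34", "B", "A", "Win95"),
    ("03:05:39", "L", "-", "Win95"), ("03:06:02", "F", "B", "Win95"),
    ("03:06:58", "A", "-", "X11"),   ("03:07:42", "B", "A", "X11"),
    ("03:07:55", "R", "L", "Win95"), ("03:09:50", "C", "A", "X11"),
    ("03:10:02", "O", "F", "Win95"), ("03:10:45", "J", "C", "X11"),
    ("03:12:23", "G", "B", "Win95"), ("05:05:22", "A", "-", "Win95"),
    ("05:06:03", "D", "A", "Win95"),
]

def identify_sessions(log, timeout=TIMEOUT):
    """Group time-ordered requests into sessions using the three heuristics."""
    sessions = []  # each session: {"agent", "pages", "last"}
    for time_str, page, referrer, agent in log:
        when = datetime.strptime(time_str, "%H:%M:%S")
        target = None
        for s in reversed(sessions):  # newest sessions first
            if (s["agent"] == agent                    # IP/Agent heuristic
                    and when - s["last"] <= timeout    # Timeout heuristic
                    and referrer in s["pages"]):       # Referring page heuristic
                target = s
                break
        if target is None:  # no open session fits: start a new one
            target = {"agent": agent, "pages": [], "last": when}
            sessions.append(target)
        target["pages"].append(page)
        target["last"] = when
    return [s["pages"] for s in sessions]

sessions = identify_sessions(LOG)
for pages in sessions:
    print(" -> ".join(pages))
```

Under these assumptions the single IP address yields four sessions: A, B, F, O, G and L, R for the Win95 agent, A, B, C, J for the X11 agent, and A, D after the two-hour gap.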

12 of 18

WUM – Pattern Discovery

The pattern discovery process applies data mining techniques to generate rules and patterns.

Data mining techniques: association rule generation, clustering, sequential patterns

13 of 18

WUM – Association Rule Generation

Discovers correlations between pages that are most often referenced together in a single server session. It provides answers to questions such as:
What sets of pages are frequently accessed together by Web users?
What page will be fetched next?
What paths are frequently accessed by Web users?

Association rule: A ⇒ B [Support = 60%, Confidence = 80%]

Example: “50% of visitors who accessed URLs /index.php and coed.php also visited coed_faculty.php”
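
A small sketch of how support and confidence could be computed for a rule over user sessions, using the usual definitions: support is the fraction of sessions containing every page of the rule, and confidence is the fraction of sessions containing the antecedent that also contain the consequent. The sessions below are hypothetical and chosen so that the rule matches the 50% figure in the example; `rule_metrics` is an illustrative name.

```python
def count_containing(sessions, pages):
    """Number of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(1 for s in sessions if pages <= set(s))

def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    n_antecedent = count_containing(sessions, antecedent)
    n_both = count_containing(sessions, set(antecedent) | set(consequent))
    support = n_both / len(sessions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical sessions, for illustration only.
sessions = [
    ["/index.php", "coed.php", "coed_faculty.php"],
    ["/index.php", "coed.php", "coed_faculty.php"],
    ["/index.php", "coed.php"],
    ["/index.php", "coed.php", "contact.php"],
    ["/index.php"],
]
support, confidence = rule_metrics(
    sessions, ["/index.php", "coed.php"], ["coed_faculty.php"])
print(f"support={support:.0%} confidence={confidence:.0%}")
```

Here four of the five sessions contain both antecedent pages and two of those also contain coed_faculty.php, giving a support of 40% and a confidence of 50%.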

14 of 18

WUM – Clustering

Groups together a set of items having similar characteristics.

User clusters: Discover groups of users exhibiting similar browsing patterns. For page recommendation, the user’s partial session is classified into a single cluster, and the links contained in that cluster are recommended.

Page clusters: Discover groups of pages having related content. For page recommendation, links are presented based on how often URL references occur together across user sessions.
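
The co-occurrence idea behind page-based recommendation can be sketched as follows. This covers only the counting step (how often two pages appear in the same session) and a naive ranking; it is not a full clustering algorithm, and the function names and sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sessions):
    """Count, for each unordered page pair, the sessions containing both."""
    counts = Counter()
    for s in sessions:
        for a, b in combinations(sorted(set(s)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(counts, page, top=1):
    """Pages that most often share a session with `page`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == page:
            scores[b] += n
        elif b == page:
            scores[a] += n
    return [p for p, _ in scores.most_common(top)]

# Hypothetical user sessions, for illustration only.
sessions = [
    ["A", "B", "F", "O", "G"],
    ["L", "R"],
    ["A", "B", "C", "J"],
    ["A", "D"],
]
counts = cooccurrence(sessions)
print(counts[("A", "B")], recommend(counts, "A"))
```

Page B shares two sessions with page A, more than any other page, so it is the first link suggested to a user currently viewing A.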

15 of 18

WUM – Sequential Patterns

Mining sequential patterns (SP) is very similar to mining association rules. The only difference is that the time element (the order of events) is taken into account.

Example: 15% of accesses follow the sequence “a.html, then b.html, then c.html”.

Applications: Web site structure modification, Web personalization, Web access pattern

Algorithms: AprioriAll, GSP, WAP-tree.
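
The difference from association rules can be shown by counting support for an ordered pattern. The sketch below only tests whether a session contains the pages in the given order (other pages may come in between); it is not an implementation of AprioriAll, GSP, or WAP-tree, which additionally generate and prune candidate patterns. The sessions and function names are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if the pages of `pattern` occur in `session` in the same order."""
    remaining = iter(session)
    # Each membership test consumes the iterator up to the match, so later
    # pattern pages must appear after earlier ones.
    return all(page in remaining for page in pattern)

def sequence_support(sessions, pattern):
    """Fraction of sessions that contain `pattern` as an ordered subsequence."""
    hits = sum(1 for s in sessions if contains_in_order(s, pattern))
    return hits / len(sessions)

# Hypothetical sessions, for illustration only.
sessions = [
    ["a.html", "b.html", "c.html"],
    ["a.html", "x.html", "b.html", "c.html"],
    ["c.html", "b.html", "a.html"],
    ["a.html", "b.html"],
]
pattern = ["a.html", "b.html", "c.html"]
ordered = sequence_support(sessions, pattern)
unordered = sum(1 for s in sessions if set(pattern) <= set(s)) / len(sessions)
print(ordered, unordered)
```

Three of the four sessions contain all three pages, but only two visit them in the order a.html, b.html, c.html, so the ordered support is 0.5 while the unordered support is 0.75.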

16 of 18

Conclusion

Web Usage Mining is one of the top research areas today. User access patterns found by the WUM process can be helpful for:
Predicting users’ next requests, which is used for prefetching
Web site restructuring
Web personalization

17 of 18

References

Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. In International Conference on Tools with Artificial Intelligence, pages 558-567, Newport Beach, 1997. IEEE.

B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web mining: Pattern discovery from world wide web transactions. Technical Report TR 96-050, Department of Computer Science, University of Minnesota, Minneapolis, 1996.

Boris Diebold and Michael Kaufmann. Usage-based visualization of web localities. In Australian Symposium on Information Visualisation, pages 159-164, 2001.

R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, University of Minnesota, 2000.

Bettina Berendt, Bamshad Mobasher, Miki Nakagawa, and Myra Spiliopoulou. The impact of site structure and user environment on session reconstruction in web usage analysis. In Proceedings of the 4th WebKDD 2002 Workshop, at the ACM-SIGKDD Conference on Knowledge Discovery in Databases (KDD’2002), 2002.

Wang Tong and He Pi-lian. Web Log Mining by an Improved AprioriAll Algorithm. In Proceedings of World Academy of Science, Engineering and Technology, pages 97-100, 2005.

Y. Lu and C. I. Ezeife. Position Coded Pre-order Linked WAP-Tree for Web Log Sequential Pattern Mining. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 337-349, Seoul, Korea, 2003.

18 of 18

Thank You
