Web Spoofing Detection Systems Using Machine Learning Techniques

A Thesis Submitted to the Council of the College of Science at the University of Sulaimani in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer

By
Shaida Juma Sayda
B.Sc Computer Science (2006), University of Kirkuk

Supervised by
Dr. Sozan A. Mahmood, Assistant Professor
Dr. Noor Ghazi M. Jameel, Lecturer

May 2017
Jozardan 2717

Supervisor Certification

I certify that this thesis, entitled "Web Spoofing Detection System Using Machine Learning Techniques" and accomplished by (Shaida Juma Sayda), was prepared under my supervision in the College of Science at the University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in (Computer).

Signature:
Name: Dr. Sozan A. Mahmood
Title: Assistant Professor
Date:   /   / 2017

Signature:
Name: Dr. Noor Ghazi M. Jameel
Title: Lecturer
Date:   /   / 2017

In view of the available recommendation, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Aree Ali Mohammed
Title: Professor
Date:   /   / 2017

Linguistic Evaluation Certification

I hereby certify that this thesis, titled "Web Spoofing Detection System Using Machine Learning Techniques" and prepared by (Shaida Juma Sayda), has been read and checked. After all the grammatical and spelling mistakes were indicated, the thesis was returned to the candidate to make the necessary corrections. After the second reading, I found that the candidate had corrected the indicated mistakes. Therefore, I certify that this thesis is free from mistakes.

Signature:
Name: Soma Nawzad Abubakr
Position: English Department, College of Languages, University of Sulaimani
Date:   /   / 2017

Examining Committee Certification

We certify that we have read this thesis, entitled "Web Spoofing Detection System Using Machine Learning Techniques" and prepared by (Shaida Juma Sayda), and that, as the Examining Committee, we have examined the student in its content and in what is connected with it. In our opinion, it meets the basic requirements for the degree of Master of Science in Computer.

Signature:
Name: Dr. Soran A. Saeed
Title: Assistant Professor
Date:   /   / 2017
(Chairman)

Signature:
Name: Dr. Aysar A. Abdulrahman
Title: Lecturer
Date:   /   / 2017
(Member)

Signature:
Name: Dr. Akar H. Taher
Title: Lecturer
Date:   /   / 2017
(Member)

Signature:
Name: Dr. Sozan A. Mahmood
Title: Assistant Professor
Date:   /   / 2017
(Member-Supervisor)

Signature:
Name: Dr. Noor Ghazi Mohammed
Title: Lecturer
Date:   /   / 2017
(Member-Co-Supervisor)

Approved by the Dean of the College of Science.

Signature:
Name: Dr. Bakhtiar Q. Aziz
Title: Professor
Date:   /   / 2017

Acknowledgements

First of all, my great thanks go to Allah, who helped me and gave me the ability to complete this work. I thank everybody who helped me to complete this project, especially my supervisors, Lecturer Dr. Noor Ghazi and Assistant Professor Dr. Sozan Abdullah, for their help. Special thanks to my husband, Mr. Ribwar, for the encouragement, scientific notes, and support that he has shown during my study and in finalizing this work. Special thanks to my father, my mother and all my family for their endless support, understanding and encouragement. They have taken their share of suffering and sacrifice during the entire phase of my research. Special thanks go to those who helped me during this work. I am glad to have this work done. Thanks to my colleagues and all the faculty members.

Dedication

This thesis is dedicated to

My mother and father

my son

our families.

our friends.

Computer Department.

all who shared by any support.

Shaida

Abstract

With the spread of the internet, various online attacks have increased, and among the most well known are spoofing attacks. Web spoofing is the type of spoofing in which fraudsters create fake websites that copy real ones. Spoofing websites impersonate legitimate websites and lure users into visiting them in order to steal the users' sensitive personal information or to install malware on their devices. The stolen information is then used by the scammers for illegal purposes. The specific goal of this thesis is to build an intelligent system that distinguishes trusted websites from spoofing websites that try to mimic them, because it is very difficult to recognize visually whether a site is spoofing or legitimate. This thesis deals with the detection of spoofing websites using a Neural Network (NN) trained with the Particle Swarm Optimization (PSO) algorithm. The information gain algorithm is used for feature selection, which is a useful step for removing unnecessary features. Information gain appears to improve the classification accuracy by reducing the number of extracted features that are used as input for training the NN with PSO. Training the neural network with PSO takes less training time and achieves an accuracy of 99%, compared with an NN trained with the back propagation algorithm, which takes more training time and achieves a lower accuracy of 98.1%. The proposed technique is evaluated with a dataset of 2500 spoofing sites and 2500 legitimate sites. The results show that the technique can detect over 99% of spoofing sites with an NN trained using PSO.

CONTENTS

Abstract
Contents
List of Tables
List of Figures
List of Abbreviations

Chapter One: General Introduction
1.1 Introduction
1.2 Spoofing Attack
1.3 Web Spoofing or Internet Con Game
1.3.1 Spoofing Websites
1.3.2 The Impact of Spoofing Websites
1.4 Literature Review
1.5 The Aim of the Thesis
1.6 Thesis Layout

Chapter Two: Theoretical Background
2.1 Introduction
2.2 Web Spoofing Attack
2.2.1 Steps Involved in Web Spoofing
2.2.2 How Web Spoofing Attack Works
2.2.3 Spoofing Websites Features
2.3 Feature Selection
2.3.1 Feature Selection Approaches
2.3.2 Information Gain (IG)
2.4 Artificial Neural Network (ANN)
2.5 Particle Swarm Optimization (PSO)
2.6 Training Neural Networks with PSO

Chapter Three: The Proposed System
3.1 Introduction
3.2 The Proposed System Architecture
3.3 Preprocessing
3.3.1 Dataset Preparation
3.3.2 Data Cleaning and Source Code Retrieval
3.3.3 Feature Extraction
3.4 Features Selection using Information Gain Algorithm
3.5 Classification of Spoofing and Legitimate Websites
3.5.1 Training Neural Network (NN) with Particle Swarm Optimization (PSO) Algorithm
3.5.2 Testing Using Neural Network

Chapter Four: Results and Experiments
4.1 Introduction
4.2 Involved System Parameters
4.3 Training and Testing Phases
4.4 Information Gain Result for Features Selection
4.5 The System Results for Training NN using PSO and Testing Using Feed Forward Neural Network
4.5.1 Training NN Results Using PSO
4.5.1.1 The Effect of Different Number of Hidden Neurons
4.5.1.2 The Effect of Different Number of Particles
4.5.2 Testing Using Feed Forward Neural Network
4.6 The Effect of Number of Hidden Neurons
4.6.1 Testing Using Feed Forward Neural Network (21 features)
4.7 Comparison between Training NN with PSO and Training NN with Back Propagation
4.7.1 Training Time
4.7.2 Test Accuracy

Chapter Five: Conclusions and Suggestions for Future Work
5.1 Conclusions
5.2 Suggestions for Future Work

References

List of Tables

Table No.   Table Title
2.1    Some Examples of HTML Commands That Have URLs
3.1    Attributes and Column Names of Spoofing and Legitimate Website Dataset
4.1    Performance Calculation Formula
4.2    Training NN Using PSO With 36 Features
4.3    Information Gain Values in Descending Order
4.4    Training NN Using PSO With Different Number of Features and 9 Nodes in Hidden Layer
4.5    Training Phase of the NN With PSO With 21 Features
4.6    The Effect of Different Number of Particles With 9 Nodes in Hidden Layer
4.7    Confusion Matrix of Testing Phase
4.8    The Effect of Different Number of Hidden Nodes
4.9    Testing Using Feed Forward Neural Network
4.10   Training Time Comparison
4.11   Comparison Between NN Trained With PSO and NN Trained With Back Propagation in Testing Accuracy

List of Figures

Figure No.   Figure Title
1.1    Spoofing Attack (Man in Middle Attack)
1.2    Real Websites for TCF Online Banking
1.3    Spoofing Websites for TCF Online Banking
2.1    Steps Involved in Web Spoofing
2.2    Flow of Feature Selection
2.3    A Simple Neural Network
2.4    Activation Functions
2.5    Flowchart of Particle Swarm Optimization (PSO)
2.6    Flowchart for Training the Neural Network Using PSO Algorithm
3.1    Block Diagram of Spoofing URL Detection Framework
3.2    The Flowchart of the Proposed System
3.3    Creating a Legitimate Dataset
3.4    Example of Legitimate Dataset URLs
3.5    Example of Spoofing Dataset URLs
3.6    Redundant Legitimate URLs
3.7    Redundant Spoofing URLs
3.8    Flowchart for Downloading HTML Source Code for URL
3.9    HTML Source Code and Whois for URL
3.10   Flowchart of Features Extraction Process
3.11   Extraction of Legitimate URL Features
3.12   Extraction of Spoofing URL Features
3.13   Flowchart of Feature Selection
3.14   Information Gain Value for the Features
3.15   Neural Network Training Using Particle Swarm Optimization
3.16   Testing Using Neural Network
4.1    The Effect of Different Number of Hidden Nodes on the Training Accuracy
4.2    Information Gain Value for 21 Features
4.3    Training Accuracy of NN Using PSO With Different Number of Features
4.4    The Effect of Different Number of Particles
4.5    Testing Using Feed Forward Neural Network

List of Abbreviations

ADSI     Automatic Detecting Security Indicator
BPNN     Back Propagation Neural Network
BPSO     Binary Particle Swarm Optimization
CSS      Cascading Style Sheets
DNS      Domain Name System
FNN      Feed Forward Neural Network
FNR      False Negative Rate
FPR      False Positive Rate
gTLD     Generic Top-Level Domain
HTTP     Hypertext Transfer Protocol
HTTPS    Hypertext Transfer Protocol Secure
MLP      Multilayer Perceptron
NN       Neural Network
PC       Personal Computer
PSO      Particle Swarm Optimization
SEO      Search Engine Optimization
SSL      Secure Sockets Layer
SVM      Support Vector Machines
TCP      Transmission Control Protocol
TF-IDF   Term Frequency/Inverse Document Frequency
TN       True Negative
TNR      True Negative Rate
TP       True Positive
TPR      True Positive Rate
URL      Uniform Resource Locator
WWW      World Wide Web

Chapter One
General Introduction

1.1 Introduction

The World Wide Web is a global information network that users can access through the Internet, and this network consists of a collection of web sites. An individual web site is a collection of related text pages, videos, images and other resources that are hosted on a web server. Typically, users access web sites through browsers, client software that fetches and renders the text, images and other content associated with a site (examples of popular contemporary browsers are Firefox, Internet Explorer, Chrome and Safari). However, the browser must locate the desired site before fetching it, and uniform resource locators (URLs) are the standard way of naming locations on the web [1].

The idea of online spoofing originated in the 1980s with the discovery of a security hole in the Transmission Control Protocol (TCP). On the internet, spoofing takes various forms. Generally, spoofing means the false representation of some information. The aim of spoofing is to fool users and gain unauthorized access to their private information, such as passwords and account numbers. Spoofing may lead to theft, vindication and other malicious goals. Thus, spoofing is a major security problem for online internet services [2].

Web site spoofing is the act of replacing a World Wide Web site with a forged, probably altered, copy on a different computer. The key to this attack is for the attacker's web server to sit between the victim and the rest of the web. This kind of arrangement is called a "man in the middle" attack [3].


1.2 Spoofing Attack

Computers and the internet have become an integral part of our lives, and spoofing has become one of the most feared threats to computer systems. Various types of spoofing attacks can be carried out on the present-day internet, such as IP spoofing, email spoofing, profile spoofing, web spoofing, and many others, and each kind presents a unique threat to a person, business or society. Spoofing on the internet has become very common nowadays and leads to many criminal activities such as identity theft and fraud. Spoofing is the action of making something look like something that it is not, in order to gain unauthorized access to a user's resources [4].

A spoofing attack is a situation in which one person or program successfully masquerades as another by falsifying data, thereby gaining an illegitimate advantage. In a spoofing attack, the attacker creates a misleading context in order to trick the victim into making an inappropriate security-relevant decision [5][2]. The main aim of spoofing is to hide the sender's identity. In this case, the attacker gains unauthorized access to the computer or network by making it appear that a malicious message came from a trusted machine, which is done by spoofing that machine's address [5].

Spoofing attacks usually involve the following elements, which are shown in figure (1.1) [2]:

1. Client machine: requests a service from the original server machine.
2. Internet: the transaction is carried out over the internet.
3. False server: before reaching the original server, the data requested by the client is captured by the attacker at his/her false server. The captured data can be not only read by the attacker but also modified. Thus a middleman sits between the original client and the server, controls the transaction, and spoofs both the original user and the server. This is known as a man in the middle attack.
4. Original server machine: the real server from which the client machine wants a service. The false server fools both the client and the server without their knowledge.

Figure (1.1) Spoofing Attack (man in middle attack) [2].

1.3 Web Spoofing or Internet Con Game

In web spoofing, an attacker creates a "shadow copy" of the whole World Wide Web (WWW). It is like an electronic con game in which the attacker forms a realistic but fake copy of the whole web. The attacker controls the fake web, so all traffic between the victim's browser and the web goes through the attacker [6]. Thus, it is a security attack that allows an adversary to observe and modify all web pages sent to the victim's machine and to observe all information entered into forms by the victim.

Web spoofing is an internet con game in which the attacker creates a mirror image of the entire World Wide Web that looks like the real one and has all the links and web pages, so that the victim carries out his/her transactions on the spoofing web site. The attacker uses the URL rewriting method to implement this attack. During this attack, the attacker sits between the authorized user and the rest of the web [2].

Web spoofing, or hyperlink spoofing, provides victims with false information. It is an attack that allows someone to view and modify all web pages sent to a victim's machine and to observe any information that the victim enters into forms. This is particularly dangerous because of the nature of the information entered into forms, such as addresses, credit card numbers, bank account numbers, and the passwords that access these accounts [7].

1.3.1 Spoofing Websites

Spoofing sites are imitations of real commercial sites, intended to deceive the authentic sites' customers. The objective of a spoofing site is identity theft: capturing users' account information by having them log in to a fake site. Commonly spoofed websites include eBay, PayPal, and various banking and escrow service providers. Customers of the authentic sites are deceived into providing their information to the fraudster-operated spoofs. Hundreds of new spoofing sites are detected daily, and these sites are used to attack millions of internet users [8]. Figure (1.2) and figure (1.3) show an example of a real website and its spoofed counterpart.


Figure (1.2) Real Websites for TCF Online Banking [57]

Figure (1.3) Spoofing Websites for TCF Online Banking [57]


1.3.2 The Impact of Spoofing Websites

Web spoofing allows the attacker to create a "shadow copy" of any legitimate website. Access to the shadow web is funneled through the attacker's machine, allowing the attacker to monitor all of the victim's activities, including any passwords or account numbers the victim enters. The attacker can also cause false or misleading data to be sent to web servers in the victim's name, or to the victim in the name of any web server. Cyber criminals also use spoofed websites to deploy malware onto the visitor's personal computer (PC), making it part of their botnet. In spoofing, an attacker gains unauthorized access to a computer or a network by making it appear that a malicious message has come from a trusted machine, by "spoofing" the address of that machine [9].

1.4 Literature Review

This section reviews related work on the detection and classification of web spoofing attacks using machine learning and non-machine learning approaches. Some of the works address the detection of web spoofing in general, while others treat web spoofing as part of one of the most dangerous types of attack nowadays, the phishing attack. A phishing attack consists of two parts: email spoofing and web spoofing. The related works are briefly presented and discussed below.

Qi et al. [10] (2006) proposed a countermeasure, Automatic Detecting Security Indicator (ADSI), an automatic anti-spoofing tool that can function independently and can also be combined with other anti-spoofing techniques to form a stronger defense. ADSI relieves the user's burden by automating the detection and recognition of web spoofing for Secure Sockets Layer (SSL)-enabled communication. The solution is less intrusive on the browser, whereas other countermeasures may disable JavaScript or pop-up windows, or change the color of the window boundaries. It can defend against the browser spoofing attack with the lowest security requirement level, which only requires the PC to be trusted, as described in its trust model. The solution requires neither a Logo Certification Authority nor personal folders with individually chosen background bitmaps.

Garera et al. [11] used logistic regression over 18 hand-selected features to classify phishing URLs. The features include the presence of certain red-flag keywords in the URL and some proprietary features based on Google's PageRank and webpage quality guidelines. Even though they did not analyze the page contents to use as features, they used precomputed page-based features from Google's proprietary infrastructure, which they call the Crawl Database. They achieved a classification accuracy of 97.3% over a set of 2,500 URLs. Direct comparison with our approach, however, is difficult without access to the same datasets or features.

Zhang et al. [12] presented CANTINA, a content-based approach to detecting phishing websites, based on the TF-IDF information retrieval algorithm and the Robust Hyperlinks algorithm. Using a weighted sum of 8 features (4 content-related, 3 lexical, and 1 WHOIS-related), they showed that CANTINA can correctly detect approximately 95% of phishing sites. The goal of our approach is to avoid downloading the actual web pages and thus reduce the potential risk of analyzing malicious content on the user's system.


Ma et al. [13] used four data sets, each pairing 15,000 URLs from a benign source (either Yahoo or DMOZ) with URLs from a malicious source (5,500 from PhishTank and 15,000 from Spamscatter). Their work achieved a classification accuracy of around 95% by extracting lexical and host-based features from URLs.

Nguyen et al. [14] proposed an efficient approach for detecting phishing websites based on a single-layer neural network. Specifically, the proposed technique calculates the values of the heuristics objectively, and the weights of the heuristics are then generated by the single-layer neural network. The technique was evaluated with a dataset of 11,660 phishing sites and 10,000 legitimate sites, and the results showed that it can detect over 98% of phishing sites.

Rajaram and Patil [15] proposed a novel approach for classifying web pages as malicious or benign based on supervised machine learning. They extracted domain-based features, such as the IP address space of external sites, the number of suspicious external sites, the local domain gTLD, the external domain gTLDs and typical suspicious features, and HTTP session header-based features, such as the TCP port number, the number of page redirection steps, the number of different server headers, the number of requests with common MIME types, the number of local requests, the number of requests to suspicious external sites and the number of requests with incomplete headers. They used machine learning classifiers such as Naïve Bayes, C4.5 and SVM for the experimental evaluation. With a corpus of 50,000 benign web pages and 500 malicious web pages, they achieved a detection rate of 92.2% of the malicious web pages with a low false positive rate of 0.1%.

Feroz and Mengel [16] collected benign URLs from the DMOZ open directory project and phishing URLs from PhishTank. The phishing URLs were classified based on their lexical and host-based features and their URL ranking. The classifier achieved 93-98% accuracy by detecting a large number of phishing hosts.

Sananse and Sarode [17] collected phishing URLs from PhishTank, a community-based phish confirmation system on the internet. Developers and researchers can download verified phishing URL lists, which are available in various file formats, with the help of an API key, but only after signing up. Non-phishing URLs were collected from various credible sources and the Google search engine. In this phase, 24 lexical features, 48 WHOIS features, PageRank, Alexa Rank and PhishTank-based features were extracted. The URLs were classified using both the Random Forest algorithm and a content-based algorithm. By applying web mining heuristics with the Random Forest algorithm, a precision of more than 90% was achieved, with FNR and FPR below 1%. With the content-based algorithm, however, the precision achieved was less than 65%.

Pradeepthi et al. [18] collected the dataset for their system from the public repository DMOZ, which has a large collection of genuine URLs from different domains, and collected the phishing URLs from PhishTank, which is a collection of phishing URLs. A total of 10,000 URLs were collected, of which 6,000 were genuine and 4,000 were fake. There were a total of 27 features belonging to various categories: lexical, domain-based (collected from the DNS server), network-based and URL feature-based. The Binary Particle Swarm Optimization (BPSO) technique was used for the detection of phishing URLs, and an accuracy of 98.7% was achieved on this dataset.

1.5 The Aim of the Thesis

The aim of this thesis is to present an intelligent approach for classifying a website as spoofing or not, using an NN trained with PSO. The system uses a minimum number of features and achieves high accuracy in a short training time. The proposed approach classifies websites based on 21 features selected from 36 features using the Information Gain feature selection algorithm. The particle swarm optimization algorithm is used to train the neural network to obtain the optimal set of weights, and a feed forward NN is then applied for web spoofing detection.
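As an illustration of the feature selection step, information gain can be computed as the reduction in the entropy of the class label once the value of a feature is known, IG = H(class) - H(class | feature). The following Python fragment is a minimal sketch under stated assumptions and is not the implementation used in this thesis: the features are assumed to be discrete, and the feature names (long_url, has_port) and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        # Labels of the samples that share this feature value.
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def select_top_k(features, labels, k):
    """Rank named feature columns by information gain and keep the best k."""
    ranked = sorted(features,
                    key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: 1 = spoofing, 0 = legitimate.
labels = [1, 1, 0, 0]
features = {
    "long_url": [1, 1, 0, 0],   # perfectly separates the two classes
    "has_port": [1, 0, 1, 0],   # carries no information about the class
}
best = select_top_k(features, labels, k=1)
```

In this toy data the first feature alone determines the class and the second is independent of it, so ranking by information gain keeps the first. Selecting 21 of 36 features, as stated above, would correspond to calling the hypothetical `select_top_k` with `k=21` on the full feature table.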

1.6 Thesis Layout

The remaining parts of this thesis consist of four chapters:

Chapter Two: presents the background theory and the concepts related to web spoofing and its detection, and explains the theoretical concepts of the methods used for the detection and classification of spoofing websites.

Chapter Three: presents the details of implementing the system, which trains an NN using PSO and uses information gain for feature selection.

Chapter Four: presents the set of tests performed to evaluate the system performance. The results of the experimental tests are listed and discussed, and the effects of the involved system parameters are illustrated.

Chapter Five: presents the derived conclusions and recommendations for future work.


Chapter Two
Theoretical Background

2.1 Introduction

This chapter discusses web spoofing and the steps involved in a web spoofing attack. It also explains the theoretical background: the basics of artificial neural networks, feature selection using information gain, and training a neural network with particle swarm optimization.

2.2 Web Spoofing Attack

Web spoofing is the process of creating a shadow of an original web site that a user requests to access. The fraudulent web site looks similar, if not identical, to an actual site, such as a bank web site. The shadow is created by an attacker who intercepts the request to a web site and replaces the site with a modified one. When a victim is at the spoofed site, not only can the attacker see the information that the victim types, such as an internet banking username, password, credit card information, and social security number, but the attacker can also make changes to the data that the victim receives [19]. Web spoofing occurs when a user demands access to a web page and an attacker blocks the request and creates a shadow copy of the requested web page [20].

Web spoofing is a kind of electronic con game in which the attacker creates a convincing but false copy of the entire World Wide Web. The false web looks just like the real one: it has all the same pages and links. However, the attacker controls the false web, so that all network traffic between the victim's browser and the web goes through the attacker [21].


2.2.1 Steps Involved in Web Spoofing

A web spoofing attack involves the following steps, shown in figure (2.1) [2]:

1. Request rewritten URL address for service: the rewritten URL is the spoofed address that looks like the real URL but leads to the attacker's website. This address is provided by the attacker to gain illegal access to the user's account information and other data.

2. Request real URL address: when the user requests the spoofed address, the request goes to the attacker's server, through which the attacker receives the user's information needed for requesting the original server. The attacker then requests the service from the real server.

3. Real page contents: the attacker receives the original page document from the original server.

4. Attacker modifies the contents: once the attacker has received the real page document, he/she can change the contents of the page.

5. Receive rewritten document: the attacker's server sends the rewritten document, or modified page content, to the authorized user, who thinks that it comes from the real server and hence is easily spoofed by the attacker.

Figure (2.1) Steps involved in web spoofing [2].

2.2.2 How Web Spoofing Attack Works

Generally, people request access to a web site through a web browser, such as Netscape, Firefox or Microsoft Internet Explorer, by typing the URL (Uniform Resource Locator) of the desired web site, e.g. www.google.com. The first part of the URL consists of the host name and the second part is the domain name. In the case of "http://www.google.com", the host name is "www" and the domain is "google.com". When users enter this in the browser's address field, the browser typically uses the DNS (Domain Name System) resolver on the system to determine the IP address of host "www" in domain "google.com".

The above process is a normal user web page interaction and assumes that everything works smoothly. However, sometimes when a client types a URL in the browser to request a web site, the request may go through a "middleman" instead of going directly to the requested site's server. The middleman can change the URL and send it back to the client. For example, if the actual URL is http://www.good.com, the middleman changes it to http://middleman/http://www.good.com. As a result, the browser thinks that http://middleman is the web server location and http://www.good.com is the content the client is trying to get. The middleman web server sees the requested URL, knows that http://www.good.com is where the client wants to go, and calls that server on the client's behalf. After it makes a copy of all the pages the client requested, the middleman rewrites all the HTML commands that may reference a URL before giving the pages back to the client. Table (2.1) shows some examples of the HTML commands that have URLs [19]. The key to this attack is for the attacker's web server to sit between the victim and the rest of the web. This kind of arrangement is called a "man in the middle attack".

Table (2.1) Some examples of HTML commands that have URLs [19]

URL

Description



A link to something



To define a java applet location



To define the area of a section



To define the background image



To insert an object into a page



To define a form



To define the source for a frame



To display an image



To define the source for input



To perform a client side pull
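The way a rewritten URL is interpreted can be illustrated with a short Python sketch. It is not part of the system described in this thesis. It only shows, using the standard urllib.parse module, that a URL parser takes "middleman" as the server and the original address as the path. The function names describe and embedded_target are hypothetical names chosen for this illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (host, path) as a standard URL parser sees them."""
    parts = urlsplit(url)
    return parts.hostname, parts.path


def embedded_target(url):
    """Return the URL embedded in the path, or None if there is none.

    Assumption: a rewritten URL carries the original address directly
    after the first '/', as in the example given in the text.
    """
    candidate = urlsplit(url).path.lstrip("/")
    if candidate.startswith(("http://", "https://")):
        return candidate
    return None


if __name__ == "__main__":
    rewritten = "http://middleman/http://www.good.com"
    print(describe(rewritten))         # host is the middleman, not good.com
    print(embedded_target(rewritten))  # the address the client asked for
```

A check of this kind only recognises the specific rewriting pattern used in the example; a real middleman may encode the target address in other ways.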


2.2.3 Spoofing Websites Features

This section discusses a set of features that distinguish spoofing web sites from legitimate ones. The features are of two kinds: lexical features and source code features. Lexical features analyse the format of the URL. They include the length of the host name, the length of the URL, the number of dots, and the presence of suspicious characters such as the '@' symbol, hexadecimal characters and other special characters ('.', '=', '/', etc.) in either the host or the path name. IP addresses and hexadecimal characters are used to hide the actual URLs [35]. In some references, spoofing web sites are called phishing websites if their purpose is to steal users' confidential and personal information and they are accessed through spoofed emails. In this study, 36 features are used; they are briefly described below.

1. Port number: The port number in the URL is checked against the list of well-known ports such as 80, 8080, 21, 443, 70, and 1080. If the port number does not belong to the list, the URL is possibly a spoofing URL [35].

2. Length of URL: Long URLs are commonly used to hide the doubtful part in the address bar [38]. A long URL hides the suspicious part of the address, which may redirect the information submitted by the users or redirect the uploaded page to a suspicious domain [37].
Rule: If URL length <= 54, feature = Legitimate
Else if URL length > 54, feature = Spoofing

3. '.' in path: A secure web page link contains at most 5 dots. If there are more than 5 dots in the path, the link may be recognized as a spoofing link [39].

4. '/' in URL: If the number of '/' within the URL is greater than five, the URL is spoofing; otherwise it is legitimate. Attackers try to trick web users by making a doubtful URL look legitimate, and one technique used in scamming is the addition of slashes to the URL. The present study therefore considers the number of slashes in a URL as a feature for identifying spoofing and examines the number of slashes ('/') in legitimate and spoofing URLs [41].
Rule: If number of '/' in URL >= 5 is spoofing
Otherwise legitimate

5. '=' within URL: The URL should not contain many '=' characters. If it contains more than one '=', the URL is considered to be a spoofing URL [22].
Rule: If number of '=' in URL > 1 is spoofing
Otherwise legitimate

6. '@' in URL: A URL containing the '@' symbol leads the browser to ignore everything before the symbol, since the real address often follows it, and redirects the user to the link typed after it [43].
Rule: If URL has '@' is spoofing
Otherwise legitimate

7. '%' in URL: The URL should not contain too many '%' characters. If it contains more than six '%', the URL is considered to be a spoofing URL [22].
Rule: If number of '%' > 6 is spoofing
Otherwise legitimate

8. '-' symbol in domain (adding a prefix or suffix separated by '-' to the domain): Attackers use the dash to create malicious URLs, so users should be aware of this symbol. The dash is rarely used in legitimate URLs [43].
Rule: If domain part has '-' is spoofing
Otherwise legitimate

9. ',' in path: If the path contains one or more ',' characters, the URL is considered to be a spoofing URL [55].
Rule: If number of ',' in path > 0 is spoofing
Otherwise legitimate

10. ';' in path: If the path part of the URL contains more than four ';' characters, the URL is considered to be a spoofing URL [42].
Rule: If number of ';' in path > 4 is spoofing
Otherwise legitimate

11. '.' in host: A legitimate URL generally has two dots in the host when "www." is ignored. If the number of dots is equal to three or more, the URL is classified as spoofing [42].
Rule: If dots in domain <= 3 is legitimate
Otherwise spoofing

12. Length of host: If the length of the host is greater than 22, the URL is considered to be a spoofing URL [54].
Rule: Length of host > 22 is spoofing
Otherwise legitimate

13. Length of path: If the length of the path is more than 152, the URL is considered to be a spoofing URL [42].
Rule: Length of path > 152 is spoofing
Otherwise legitimate
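The threshold rules of features 2 to 13 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study. Where a description and its "Rule" line differ (features 4 and 11), the sketch follows the "Rule" line. The host and the path are taken from urllib.parse.urlsplit, and the function name lexical_features and the dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit

SPOOFING, LEGITIMATE = "spoofing", "legitimate"


def lexical_features(url):
    """Apply the threshold rules of features 2 to 13 to one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # assumption: host as parsed by urlsplit
    path = parts.path            # assumption: path excludes the query string

    # Each entry is True when the rule marks the URL as spoofing.
    rules = {
        "url_length": len(url) > 54,                  # feature 2
        "dots_in_path": path.count(".") > 5,          # feature 3
        "slashes_in_url": url.count("/") >= 5,        # feature 4 (Rule line)
        "equals_in_url": url.count("=") > 1,          # feature 5
        "at_in_url": "@" in url,                      # feature 6
        "percent_in_url": url.count("%") > 6,         # feature 7
        "dash_in_domain": "-" in host,                # feature 8
        "commas_in_path": path.count(",") > 0,        # feature 9
        "semicolons_in_path": path.count(";") > 4,    # feature 10
        "dots_in_host": host.count(".") > 3,          # feature 11 (Rule line)
        "host_length": len(host) > 22,                # feature 12
        "path_length": len(path) > 152,               # feature 13
    }
    return {name: SPOOFING if hit else LEGITIMATE for name, hit in rules.items()}


if __name__ == "__main__":
    for name, label in lexical_features("http://www.example.com/index.html").items():
        print(f"{name}: {label}")
```

Each rule yields one feature value per URL; the values would then be combined into a feature vector for a classifier rather than used as a verdict on their own.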

14. Suspicious "//" in URL path: The existence of "//" within the URL path means that the user will be redirected to another website. An example of such a URL is "http://www.legitimate.com//http://www.bad.com". The location of the "//" is examined. If the URL starts with "HTTP", the "//" should appear in the sixth position. However, if the URL employs "HTTPS", the "//" should appear in the seventh position [44].

15. Length of subdomain: If the subdomain is longer than 28 characters, the URL is considered to be a spoofing URL [35].
Rule: Length of subdomain > 28 is spoofing
Otherwise legitimate

16. Dash within hostname: If the hostname contains more than two dashes ('-'), the URL is considered to be a spoofing URL [54].
Rule: If number of '-' in host > 2 is spoofing
Otherwise legitimate

17. Subdomain: The "www." prefix is removed from the URL and the remaining dots are counted. If the number of dots is three, the URL is classified as spoofing; if it is less than three, the URL is considered legitimate [56].
Rule: Number of dots in the domain part < 3 is legitimate
Otherwise spoofing

18. Google index page: Google indexed pages give an indication of the number of pages of a site that are indexed by Google and available on Google servers. Various on-page search engine optimization (SEO) factors help a site achieve higher search engine rankings, including the number of its web pages indexed by Google and other search engines (indexation). Indexation plays an important role in the SEO score of a particular site. In many cases, the winning factor of a site compared to its competitor is the number of its pages indexed by Google or other search engines. This feature examines whether a website is in Google's index or not. Spoofing web pages are usually accessible only for a short period and, as a result, many of them may not be found in the Google index [46].
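Features 14, 16 and 17 above depend only on the text of the URL and can be sketched as follows. This is an illustrative reading of the rules, not the code used in this study. The position of "//" is counted from 1, so the scheme separator sits at position 6 for "http" and position 7 for "https". All function names are hypothetical, and the host is assumed to be the one returned by urllib.parse.urlsplit.

```python
from urllib.parse import urlsplit


def double_slash_redirect(url):
    """Feature 14: True if a '//' occurs after the scheme separator."""
    expected = 7 if url.lower().startswith("https") else 6  # 1-based position
    return url.rfind("//") + 1 > expected


def dashes_in_host_is_spoofing(url):
    """Feature 16: True if the host contains more than two dashes."""
    host = urlsplit(url).hostname or ""
    return host.count("-") > 2


def dots_without_www(url):
    """Count the dots in the host after removing a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.count(".")


def subdomain_is_spoofing(url):
    """Feature 17: fewer than three dots is legitimate, otherwise spoofing."""
    return dots_without_www(url) >= 3


if __name__ == "__main__":
    url = "http://www.legitimate.com//http://www.bad.com"
    print(double_slash_redirect(url))
    print(subdomain_is_spoofing(url))
```

Feature 15 is left out of the sketch because separating the subdomain from the registered domain requires a list of public suffixes, which the text does not specify.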


19. Disable right click: JavaScript is used to disable the right click function so that users cannot view and save the source code [22].
Rule: Right click disabled is spoofing
Otherwise legitimate

20. Age of domain: This feature can be extracted from the WHOIS database. If the domain age is more than one year and less than two years, the website is classified as spoofing. The website is considered legitimate if the domain is aged two years or more [36].
Rule: Age of domain >= 2 is legitimate
Otherwise spoofing

21. Domain token count: If the domain part of the URL contains more than four tokens, the URL is considered to be a spoofing URL [49].
Rule: Domain token count > 4 is spoofing
Otherwise legitimate

22. Path token count: If the path part of the URL contains more than 21 tokens, the URL is considered to be a spoofing URL [49].
Rule: Path token count > 21 is spoofing
Otherwise legitimate

23. '?' in URL: If the URL contains more than one '?', the URL is considered to be a spoofing URL [53].
Rule: Number of '?' in URL > 1 is spoofing
Otherwise legitimate

24. Google page rank: Google PageRank is one of the methods used by Google to estimate the relevance or importance of a page. Important pages are considered to have a higher PageRank and a higher probability of appearing at the top of search results. The value of Google PageRank ranges from 0 to 10. If the PageRank value for a given URL is less than 5, the URL is classified as a spoofing URL [50].
Rule: Google Page Rank < 5 is spoofing
Otherwise legitimate

25. Alexa rank: This is a ranking system set by alexa.com that audits the frequency of visits to numerous websites and makes it public. The Alexa ranking is computed from the volume of traffic recorded from users who have had the Alexa toolbar installed for a period of more than 3 months. The traffic is measured by two parameters: reach and page views. The number of Alexa users who visit a particular site in one day is referred to as reach, while the number of times a particular URL is viewed by Alexa users is known as page views. This feature evaluates how popular the website is by determining the number of visitors and the number of pages they visit. Some spoofing websites have a short lifetime, so they may not be acknowledged by the Alexa database. By analyzing the dataset, it is found that in the worst case legitimate websites are ranked among the top 150,000. If the Alexa rank of a URL exceeds this threshold value, it is classified as spoofing [17].
Rule: If rank <= 150000, feature is legitimate
Otherwise spoofing

26. Digit [0-9] in host: If the host contains a digit [0-9], the URL is considered spoofing; otherwise it is legitimate [55].

27. Keyword-based URL: Many spoofing URLs are found to contain eye-catching word tokens (e.g., login, sign in, confirm, verify, etc.) to attract users' attention [42].

28. IP based host: An IP address is used instead of the domain name in the URL, such as "http://125.98.3.123/fake.html". Legitimate websites generally use domain names rather than IP addresses in the URL [43].
Rule: If the domain part has an IP address is spoofing
Otherwise legitimate

29. Hex based host: If hexadecimal codes are present in the host, the URL is considered spoofing; if not, it is considered legitimate [50].

30. Redirect page: This feature is commonly used to hide the real link and direct the user to a spoofing website [36].

31. Request URL: A webpage usually consists of text and some objects such as images and videos. Typically, these objects are loaded into the webpage from the same domain where the webpage exists. If the objects are loaded from a domain that is different from the domain typed in the URL address, the webpage is spoofing [36].

32. Script: Sometimes scripts are used to send personal information or PC information to attackers, and some scripts send viruses or are loaded from external websites. The script tag is used to include an external file in the page, such as jQuery or Cascading Style Sheets (CSS). If it has both a start tag and an end tag, it is legal, because this is the correct and standard form of the script tag. Example: , which loads a file to improve the appearance of the page. When there are tags like this
