A MULTI-MODE INTERNET PROTOCOL INTRUSION DETECTION SYSTEM

A THESIS SUBMITTED TO THE COUNCIL OF THE FACULTY OF SCIENCE AND SCIENCE EDUCATION, SCHOOL OF SCIENCE, UNIVERSITY OF SULAIMANI, IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

BY DEEMAN YOUSIF MAHMOOD B.SC. COMPUTER SCIENCE (2008), UNIVERSITY OF KIRKUK

SUPERVISED BY DR. MOHAMMED ABDULLAH HUSSEIN ASSISTANT PROFESSOR

June (2014 A.D)

Pushpar (2714 K)

In the name of God, the Most Gracious, the Most Merciful

"And of knowledge you have been given only a little."

God Almighty has spoken the truth

Al-Isra, 85

Supervisor Certification

I certify that this thesis entitled "A Multi-mode Internet Protocol Intrusion Detection System" by Deeman Yousif Mahmood was prepared under my supervision at the School of Science, Faculty of Science and Science Education at the University of Sulaimani, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Signature:
Name: Ass. Prof. Dr. Mohammed Abdullah Hussein
University of Sulaimani, Electrical Engineering Department
Date: 25 / 03 / 2014

In view of the available recommendation, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Kamaran HamaAli Faraj
University of Sulaimani, Head of Computer Science Department
Date: 25 / 03 / 2014

Linguistic Evaluation Certification

I hereby certify that this thesis entitled "A Multi-mode Internet Protocol Intrusion Detection System", prepared by Deeman Yousif Mahmood, has been read and checked. After all the grammatical and spelling mistakes were indicated, the thesis was returned to the candidate to make the necessary corrections. After the second reading, I found that the candidate had corrected the indicated mistakes. Therefore, I certify that this thesis is free from mistakes.

Signature:
Name: Jutiar Omer Salih
Position: English Department, School of Languages, University of Sulaimani
Date: 14 / 04 / 2014

Examining Committee Certification

We certify that we have read this thesis entitled "A Multi-mode Internet Protocol Intrusion Detection System" prepared by (Deeman Yousif Mahmood), and as Examining Committee, examined the student in its content and in what is connected with it, and in our opinion it meets the basic requirements toward the degree of Master of Science in Computer Science.

Signature:
Name: Dr. Subhi R. M. Zebari
Title: Assistant Professor
Date: 20 / 7 / 2014
(Chairman)

Signature:
Name: Dr. Suzan Abdulla Mahmood
Title: Assistant Professor
Date: 17 / 7 / 2014
(Member)

Signature:
Name: Dr. Kamaran HamaAli Faraj
Title: Lecturer
Date: 21 / 7 / 2014
(Member)

Signature:
Name: Dr. Mohammed A. Hussein
Title: Assistant Professor
Date: 17 / 7 / 2014
(Supervisor-Member)

Approved by the Dean of the Faculty of Science.

Signature:
Name: Dr. Bakhtiar Qader Aziz
Title: Professor
Date: 7 / 8 / 2014
(The Dean)

Dedication

This thesis is dedicated to: My parents for their endless love, support and encouragement, source of motivation and strength during moments of despair and discouragement.

Acknowledgments

Behind every successful work there is a great deal of devotion, hard work, effort, and sacrifice. Thanks to Allah for giving me this opportunity, and the strength and patience to finally complete my dissertation after all the challenges and difficulties.

This work would not have reached this stage without the guidance of Dr. Mohammed Abdullah Hussein; I would like to thank him for introducing me to this interesting problem of network security. His knowledge, support, and guidance contributed greatly to the success of this work.

I would also like to express my gratitude to all the teaching staff at the University of Sulaimani, School of Science, Computer Science Department, who taught me during my Master's courses; I really appreciate your efforts, encouragement, and valuable instruction.

Profound thanks to Prof. Dr. Hussein H. Khanaqa, former president of Kirkuk University, for his encouragement during my work in the rector's office at the presidency of Kirkuk University (2009-2012), and for his valuable advice during my studies, which drew on more than 35 years of experience in directing and supervising.

I also have to thank all my friends for their support, encouragement, and assistance in more ways than I can list.

Finally, I take this opportunity to thank my family for their moral support throughout my life, in particular my parents, who stood behind me and inspired me throughout my studies. Their support and guidance gave me the power to struggle and survive during hard times.

Abstract

Intrusion Detection Systems (IDS) are gaining more and more scope in the field of secure networks, and new ideas and concepts regarding intrusion detection processes keep surfacing. Various services offered on the Internet become unavailable to authorized users because of Denial-of-Service (DoS) attacks. These attacks are the main concern of this thesis, which implements a semi-supervised hybrid IDS that uses machine learning techniques to judge whether network traffic is normal or abnormal (attack). To show the applicability of the proposed intrusion detection approach, the Knowledge Discovery and Data Mining (KDD) Cup 99 dataset is used; it is considered a standard dataset for evaluating security detection mechanisms and has served well to demonstrate that machine learning can be useful in intrusion detection.

Two machine learning algorithms are applied to the basic security model to construct a semi-supervised hybrid technique for detecting intrusions: K-means clustering (unsupervised learning) and the Decision Tree algorithm (supervised learning). These algorithms, together with information gain attribute ranking, are used to filter and classify network packets. Although K-means has been used previously for detecting intrusions, the addition of feature ranking gave better results than K-means alone. With K-means, packets are classified as either normal or DoS packets; the DoS cluster then feeds the Decision Tree (DT) algorithm, which makes classification by attack type possible. Through the DT a hybrid system has been established. The result is an IDS that is effective in detecting network intrusions, as shown by the high detection rates (DR) and low error rates obtained: DR = 98.2143% and Error Rate = 1.7857% for K-means, and DR = 99.9136% and Error Rate = 0.0864% for the C4.5 Decision Tree.


CONTENTS

Abstract
Contents
List of Tables
List of Figures
List of Abbreviations

Chapter One: Introduction
1.1 Overview
1.2 Literature Survey
1.3 Aim of the Thesis
1.4 Thesis Outlines

Chapter Two: Intrusion Detection and Data Mining
2.1 Introduction
2.2 Definitions and Terminology
2.3 Intrusion Detection System (IDS)
2.4 Types of Intrusion Detection System
    2.4.1 Host-Based IDS
    2.4.2 Network-Based IDS
2.5 Intrusion Detection System Components and Requirements
2.6 Intrusion Detection Techniques
    2.6.1 Anomaly Intrusion Detection
    2.6.2 Misuse Intrusion Detection
2.7 Learning Procedures
2.8 Common Attacks and Vulnerabilities in NIDS
2.9 Technical Discussion
    2.9.1 Internet Protocol – IP
    2.9.2 Transmission Control Protocol – TCP
2.10 IP Spoofing
    2.10.1 Denial of Service Attack
2.11 Data Mining and Intrusion Detection System
2.12 Feature Selection (FS)
    2.12.1 General Methods for Feature Selection
    2.12.2 Information Gain (IG) Feature Selection
2.13 Clustering Algorithms
    2.13.1 Classification of Clustering Algorithms
    2.13.2 K-means Algorithm
2.14 Decision Tree
    2.14.1 C4.5 Decision Tree Algorithm
2.15 Dataset Collection
    2.15.1 Attacks in KDD Cup 99 Dataset
    2.15.2 Features of KDD Cup 99 Dataset

Chapter Three: Proposed System Methodology
3.1 Introduction
3.2 Dataset Pre-Processing
    3.2.1 Dataset Transformation
    3.2.2 Dataset Normalization
3.3 Proposed Detection Model
3.4 Information Gain Feature Selection
3.5 K-means Clustering for the Proposed System
    3.5.1 Distance Calculation
3.6 Decision Trees as a Model for Intrusion Detection

Chapter Four: Implemented Results and Discussions
4.1 Introduction
4.2 Training and Testing the Dataset
4.3 Experiment 1: Results of Pre-processing
    4.3.1 Transformation and Normalization
    4.3.2 Features Ranking and Subset Selection
4.4 Experiment 2: K-means Clustering (First Layer)
4.5 Experiment 3: C4.5 Decision Tree (Second Layer)
4.6 The Graphical User Interface (GUI)

Chapter Five: Conclusions and Future Works
5.1 Conclusions
5.2 Future Works

References

Appendices

List of Tables

Table No.   Table Title
2.1         Confusion Matrix
2.2         Comparison of Intrusion Detection Techniques
2.3         Basic Features of TCP Connection
2.4         Content Features of the TCP Connection
2.5         Time Based Features of the TCP Connection
3.1         Transformation Table for Different Values of Protocols, Flag and Services
4.1         Sample Records of KDD Cup 99
4.2         Transformed Nominal Data and Normalized Numeric Data Samples of KDD Cup 99 Dataset
4.3         Proportions of the Normal and DoS Classes in the Data Subset
4.4         Attribute Ranking by Information Gain
4.5         Attribute Ranking Using GainR for C4.5 DT
4.6         Attributes Centroid Using Euclidean Distance Metric for 20 Features with Highest Ranking
4.7         Attributes Centroid Using Manhattan Distance Metric for 20 Features with Highest Ranking
4.8         Evaluation and Results of K-means with Distance Functions Using the Full Dataset
4.9         Evaluation and Results of K-means with Distance Functions Using the Highest 10 Features Ranked by IG
4.10        Evaluation and Results of K-means with Distance Functions Using the Highest 20 Features Ranked by IG
4.11        Evaluation and Results of C4.5 Algorithm

List of Figures

Figure No.  Figure Title
2.1         OSI Model
2.2         IP Packet Header
2.3         TCP Packet Header
2.4         Types of Clustering Methods
2.5         Example of Decision Tree for IDS Classification
3.1         Records of the KDD Cup 99 Dataset
3.2         Records of the KDD Cup 99 Dataset After Transformation
3.3         Proposed Detection Model Structure
3.4         First Layer of Proposed Detection Model
3.5         K-means Clustering Flowchart
3.6         Euclidean Distance between Two Points
3.7         Manhattan Distance between Two Points
3.8         Decision Tree Structure for DoS Attack Classification
4.1         Comparative Chart of Distance Functions Values Using K-means
4.2         Main GUI of the Detection Model
4.3         Capturing and Classification of Network Traffics by the System
4.4         Extracting Normal and Attack Packets from Captured Packets
4.5         Log File of Captured Packets

List of Abbreviations

Abbreviation   Description
Acc            Accuracy
ACK            Acknowledge
ATM            Automated Teller Machine
CFS            Correlation-based Feature Selection
DDoS           Distributed Denial of Service attack
DNS            Domain Name System
DoS            Denial of Service attack
DR             Detection Rate
DT             Decision Tree
ES             Expert System
FCBF           Fast Correlation-Based Feature selection
FN             False Negative
FNR            False Negative Rate
FP             False Positive
FPR            False Positive Rate
FS             Feature Selection
FSA            Feature Selection Algorithm
FTP            File Transfer Protocol
GainR          Gain Ratio
GUI            Graphical User Interface
HIDS           Host-based Intrusion Detection System
HTTP           Hyper Text Transfer Protocol
ICMP           Internet Control Message Protocol
IDE            Integrated Development Environment
IDS            Intrusion Detection System
IG             Information Gain
IP             Internet Protocol
JDK            Java Development Kit
KDD            Knowledge Discovery in Databases
MAE            Mean Absolute Error
MITM           Man In The Middle
ML             Machine Learning
MSE            Mean Square Error
NIDES          Next generation of Intrusion Detection Expert System
NIDS           Network-based Intrusion Detection System
OSI            Open Systems Interconnection
PCA            Principal Component Analysis
PoD            Ping of Death
PPV            Positive Predictive Value
R2L            Remote to Local
RMSE           Root Mean Squared Error
SOM            Self-Organizing Maps
SQL            Structured Query Language
SVM            Support Vector Machines
Sr. No.        Source Number
SYN            Synchronize
TCP            Transmission Control Protocol
TN             True Negative
TNR            True Negative Rate
TP             True Positive
TPR            True Positive Rate
U2R            User to Root

Chapter One Introduction


1.1 Overview

The world has seen rapid advances in science and technology in the last two decades. These advances have made it possible to meet a wide spectrum of human needs effectively, including simple day-to-day needs such as online shopping, online ticket booking, online banking, and e-libraries [1]. These technologies have made life easier for average people but harder for security experts and network administrators. Alongside them, a parallel and worrying technology has risen and grown: that of compromising security, with effects detrimental to the use of technology. These include attacks on information such as stealing private information, hacking, and outage of services [2]. The media and the network security literature report the possible existence of underground anonymous attack networks that can effectively attack any given target at any time [3].

An intrusion into a computer system does not need to be executed manually by a person; it may be executed automatically by engineered software. A well-known example is the Slammer worm (also known as Sapphire), which performed a global Denial of Service (DoS) attack in 2003. The worm exploited a vulnerability in Microsoft's SQL Server, which allowed it to disable database servers and overload networks. Slammer was the fastest computer worm in history and affected approximately 75,000 computer systems around the world within 10 minutes. Not only did the Slammer worm restrict general Internet traffic, it also caused network outages and unforeseen consequences such as canceled airline flights, interference with elections, and ATM failures [4].


There are several mechanisms that can be adopted to increase the security of computer systems. A commonly used three-level protection scheme is [5]:
• Attack prevention: firewalls, user names and passwords, and user rights.
• Attack avoidance: encryption.
• Attack detection: intrusion detection systems.

Despite the adoption of mechanisms such as cryptography and protocols to control the communication between computers (and users), it is impossible to prevent all intrusions. Firewalls serve to block and filter certain types of data or services from users on a host computer or a network of computers, aiming to stop some potential misuse by enforcing restrictions. However, firewalls are unable to handle any form of misuse occurring within the network or on a host computer. Furthermore, intrusions can occur in traffic that appears normal [6]. IDSs do not replace the other security mechanisms but complement them by attempting to detect when malicious behavior occurs.

The purpose of an IDS, in general terms, is to detect network traffic in which the behavior of a user conflicts with the intended use of the computer or computer network, e.g., committing fraud, hacking into the system to steal information, or conducting an attack to prevent the system from functioning properly or even to bring it down. Before the 1990s, intrusion detection was performed by system administrators who manually analyzed logs of user behavior and system messages, with poor chances of detecting intrusions in progress [7]. With the increased use of computers, the magnitude of data in contemporary computer networks still renders this a significant challenge.

While the range of attacks that can be performed on targets is as broad as the spectrum of constructive technology itself, this thesis deals with a particular class of attacks known as Denial of Service (DoS) attacks, which mostly use IP spoofing. DoS attacks are a class of attacks that aim at exhausting the target's resources, thereby denying service to valid users [3].


1.2 Literature Survey

As networks have expanded dramatically, security has become a major issue. Internet attacks are increasing and use a variety of attack methods, which researchers and companies have analyzed. Below is a survey of some related research:

In 1980, the concept of intrusion detection began with Anderson's seminal paper [8]; he introduced a threat classification model for developing a security monitoring and surveillance system based on detecting anomalies in user behavior.

In 1995, Anderson et al. [9] designed the Next generation of Intrusion Detection Expert System (NIDES) to operate in real time and detect intrusions as they occur. NIDES is a comprehensive system that uses innovative statistical algorithms for anomaly detection, as well as an expert system that encodes known intrusion scenarios.

Also in 1995, Kummer [10] classified intrusions based on the "signatures" (patterns) they leave in the audit trail of the system. The classification is intended for use in intrusion detection systems based on pattern matching.

In 2002, Andrew et al. [11] used the KDD Cup 1999 dataset for training and testing their model. Data were classified into two classes: Normal (+1) and Attack (-1). They used the SVM light freeware package. For data reduction, they applied SVMs to identify the most significant features for detecting attack patterns. The procedure is to delete one feature at a time and train SVMs with the same dataset. By this process, 13 of the 41 features of the KDD Cup 1999 dataset were identified as most significant: 1, 2, 3, 5, 6, 9, 23, 24, 29, 32, 33, 34, and 36. Training was done using the RBF (Radial Basis Function) kernel option. In their experiment, the authors obtained 98.9% accuracy for the true negative case and 99.7% accuracy for the true positive case.

In 2005, Mitrokotsa and Douligeris [12] proposed an approach that detects DoS attacks using Emergent Self-Organizing Maps. The approach is based on classifying "normal" traffic against "abnormal" traffic in the sense of DoS attacks. It permits the automatic classification of events contained in logs and the visualization of network traffic. Extensive simulations show the effectiveness of this approach compared to previously proposed approaches with regard to false alarms and detection rates.

In 2008, Rajesh and Shina [13] analyzed which feature selection method is best for a network intrusion detection model. Their paper analyzed three measures for feature selection: Chi-square, Information Gain, and the Gini Index. These are among the various filter-based approaches that have been used. Of the filter-based approaches provided with the open source Windows version 3.4, three were tested. The results showed that Information Gain, when used for feature selection, produces accurate results by accurately detecting the least prominent attack in the dataset.

In 2009, Bian et al. [14] used the K-means algorithm to cluster and analyze the data of the KDD Cup 99 dataset. Simulation results on this dataset showed that K-means is an effective algorithm for partitioning large datasets and can detect unknown intrusions with a detection rate of 96%.

In 2010, Affendey et al. [15] compared the efficiency of machine learning methods in intrusion detection systems, including the Classification Tree and Support Vector Machines (SVMs). They found that the Classification Decision Tree algorithm detects attacks at a much greater rate than SVMs when the same dataset is evaluated with the two data mining approaches. The correlation between the samples was measured using min-max normalization. The results show that the C4.5 Classification Decision Tree algorithm gives a lower false alarm rate than SVM.

Also in 2010, Bharti et al. [16] used the fuzzy k-means clustering algorithm and random forest tree classification techniques for assigning a cluster to a particular class. Their experimental results show that for two-class datasets the combination of clustering and random forest tree gives better results than clustering alone.

In 2012, Bhaskar and Kumar [17] presented an approach for identifying network anomalies by visualizing network flow data stored in weblogs. Various clustering techniques can be used to identify different anomalies in the network. They presented a new approach based on simple K-means for analyzing network flow data using attributes such as IP address, protocol, and port number to detect anomalies. By using visualization, they can identify which sites are accessed most frequently by users. Their approach provides an overview of a given dataset by studying key network parameters, and uses preprocessing techniques to eliminate unwanted attributes from the weblog data.

It is challenging for IDSs to maintain high accuracy, and an IDS that uses attack signatures to detect intrusions cannot discover new attacks. Such IDSs are becoming incapable of protecting computer systems; a detection approach that is able to detect new attacks is therefore necessary for building a reliable and efficient IDS. For this purpose, the proposed IDS model deploys an unsupervised data mining approach, the K-means clustering algorithm, in its first layer; it is self-administering and can learn new patterns within the dataset without any outside interference (i.e., from an administrator). The C4.5 DT, a very accurate and easy-to-use classifier, is deployed in the second layer for classifying DoS attack types.
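To make the first-layer idea concrete, the following is a minimal sketch of K-means clustering with Euclidean distance on toy two-dimensional records. It is an illustration only: the points, the choice of two clusters, and the fixed starting centroids are assumptions made for this example and are not taken from the thesis, whose actual features and settings are described in later chapters.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its members, until stable."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid (Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids did not move
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical normalized records (two features each), for illustration only.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
initial = [(0.0, 0.0), (1.0, 1.0)]  # assumed starting centroids
centroids, clusters = kmeans(points, initial)
print(centroids)
print(clusters)
```

In a two-layer design of the kind described above, the records falling in the cluster identified as DoS would then be passed to a supervised classifier; that second stage is not shown here.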


1.3 Aim of the Thesis

The aim of this thesis is to design an efficient IDS to detect DoS attacks in a network-based setting (NIDS). The thesis surveys the state of the art in hybrid approaches applied to IDSs and ends by implementing a system that uses the unsupervised K-means and supervised Decision Tree algorithms. Additionally, by focusing only on DoS attacks, it shows that each class of attacks can be treated separately: at least one algorithm can be assigned to detect each class of attacks instead of using a single algorithm to detect all classes.

1.4 Thesis Outlines

The rest of the thesis is organized as follows:
• Chapter Two (Intrusion Detection and Data Mining): This chapter deals with the concept of intrusion detection systems. It also covers the different types of IDSs and explains what a network-based IDS is, the types of machine learning, the algorithms used, the different types of attack, and the concepts of IP spoofing.
• Chapter Three (Proposed System Methodology): This chapter covers the overall design of the IDS, including the pre-processing, the algorithms, and the overall structure of the proposed detection model.
• Chapter Four (Implemented Results and Discussions): This chapter presents the results of functionality and efficiency tests of the implemented IDS model.
• Chapter Five (Conclusions and Future Works): This chapter gives concluding remarks on the IDS and on the work of this thesis as a whole, and suggests some possibilities for future work.


Chapter Two Intrusion Detection and Data Mining


2.1 Introduction

Computer networks have expanded significantly in use and number. This expansion makes them a target for different attacks [18]. In today's era of Information Technology, the sharing of resources and information in interconnected networks is essential, but to secure this information from unauthorized use and manipulation it is necessary to impose some restrictions. Some of the tools developed for this purpose are firewalls, anti-virus programs, and intrusion detection programs [19]. The use of intrusion detection systems is becoming common due to the increase in attack complexity and the evolution of computer systems.

Generally, an intrusion detection system works in a pre-defined manner regardless of the implementation mechanism selected. These are some common steps followed by an intrusion detection system [20]:
• Data are captured, often in the form of IP packets.
• The data are decoded and transformed into a uniform format through the process of feature extraction.
• The data are then analyzed in a manner specific to the individual IDS and classified as threatening or not.
• Alerts are generated if a threatening pattern is encountered.

Computer and data security is a complex topic. The goals of computer security are [21]:
1. Data Confidentiality: protection of data so that they are not disclosed in an unauthorized fashion.
2. Data Integrity: protection against unauthorized modification of data.
3. Data Availability: protection from unauthorized attempts to withhold information or computer resources.

This chapter starts with an introduction to the concept of intrusion detection systems and their components. The IDS algorithms and techniques used in this thesis are then discussed.
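The four common IDS processing steps listed in this section (capture, feature extraction, analysis, alerting) can be sketched as a small pipeline. This is a hedged illustration, not the system implemented in the thesis: the record fields (`src`, `proto`, `count`, `window_s`), the `Features` class, and the packets-per-second threshold rule are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Features:
    src: str
    protocol: str
    packets_per_second: float


def extract_features(raw):
    """Step 2: decode a raw record into a uniform feature format."""
    return Features(
        src=raw["src"],
        protocol=raw["proto"].lower(),
        packets_per_second=raw["count"] / raw["window_s"],
    )


def classify(features, threshold=1000.0):
    """Step 3: toy analysis rule (hypothetical); a real IDS would use its
    own detection technique here."""
    return "threat" if features.packets_per_second > threshold else "normal"


def run_pipeline(raw_records):
    """Steps 1 to 4: iterate captured data, extract, classify, alert."""
    alerts = []
    for raw in raw_records:  # Step 1: data already captured
        features = extract_features(raw)
        if classify(features) == "threat":  # Step 4: alert on threats
            alerts.append(
                f"ALERT: {features.src} ({features.protocol}) "
                f"{features.packets_per_second:.0f} pkt/s"
            )
    return alerts


# Hypothetical captured records, for illustration only.
captured = [
    {"src": "10.0.0.5", "proto": "TCP", "count": 120, "window_s": 2},
    {"src": "10.0.0.9", "proto": "ICMP", "count": 9000, "window_s": 2},
]
alerts = run_pipeline(captured)
print(alerts)
```

Keeping the analysis step behind a single `classify` function reflects the point made in the list: capture, decoding, and alerting are broadly similar across systems, while the analysis is specific to each IDS.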

2.2 Definitions and Terminology

An Intrusion Detection System (IDS) employs techniques for modeling and recognizing intrusive behavior in a computer system. When referring to the performance and measurement factors of IDSs, the following terms are often used:
• Alarm: a signal suggesting that a system has been or is being attacked.
• True positive (TP): correctly classifying an intrusion as an intrusion. The true positive rate is synonymous with detection rate, sensitivity, and recall, which are terms often used in the literature.
• False positive (FP): incorrectly classifying normal data as an intrusion, also known as a false alarm.
• True negative (TN): correctly classifying normal data as normal. The true negative rate is also referred to as specificity.
• False negative (FN): incorrectly classifying an intrusion as normal.

In particular, the following measures are used to assess the performance of the IDS. The performance metrics are calculated as follows:

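\[ \text{True Positive Rate (TPR)} = \frac{TP}{TP + FN} = \frac{\#\text{Correct Intrusions}}{\#\text{Intrusions}} \qquad \text{(Eq. 2.1)} \]

\[ \text{False Positive Rate (FPR)} = \frac{FP}{FP + TN} = \frac{\#\text{Normal as Intrusions}}{\#\text{Normal}} \qquad \text{(Eq. 2.2)} \]

\[ \text{True Negative Rate (TNR)} = \frac{TN}{TN + FP} = \frac{\#\text{Correct Normal}}{\#\text{Normal}} \qquad \text{(Eq. 2.3)} \]

\[ \text{False Negative Rate (FNR)} = \frac{FN}{FN + TP} = \frac{\#\text{Intrusion as Normal}}{\#\text{Intrusions}} \qquad \text{(Eq. 2.4)} \]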

True Positive Rate is also referred to as Sensitivity or Recall, and precision is also referred to as Positive Predictive Value (PPV). True Negative Rate is also called Specificity.

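Additional performance metrics that are commonly used are accuracy, error rate, precision, and F-measure:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\#\text{Correct Classification}}{\#\text{All Instances}} \qquad \text{(Eq. 2.5)} \]

\[ \text{Error rate} = 1 - \text{Accuracy} \qquad \text{(Eq. 2.6)} \]

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{\#\text{Correct Intrusions}}{\#\text{Instances Classified as Intrusion}} \qquad \text{(Eq. 2.7)} \]

\[ \text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad \text{(Eq. 2.8)} \]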

Accuracy is the most basic measure of the performance of a learning method. It gives the percentage of correctly classified instances, i.e., the overall classification rate. The F-measure is a measure of a test's accuracy that considers both the precision and the recall of the test. It can be interpreted as a weighted average (the harmonic mean) of precision and recall, reaching its best value at 1 and its worst at 0. These metrics are derived from a basic data structure known as the confusion matrix [22;23], which contains information about the actual and predicted classifications made by a classification system. A sample confusion matrix for a two-class case is shown in Table 2.1.

Table 2.1: Confusion Matrix

Activity                  Predicted Class: Attack    Predicted Class: Normal
Actual Class: Attack      TP                         FN
Actual Class: Normal      FP                         TN
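As a worked illustration of Eq. 2.1 to Eq. 2.8, the sketch below computes the metrics from the four cells of a two-class confusion matrix laid out as in Table 2.1. The function name `ids_metrics` and the counts used in the example are hypothetical and are not results from the thesis; the function also assumes that no denominator is zero.

```python
def ids_metrics(tp, fn, fp, tn):
    """Compute the metrics of Eq. 2.1 to 2.8 from confusion-matrix counts.
    Assumes every denominator is non-zero."""
    tpr = tp / (tp + fn)          # Eq. 2.1, also recall / detection rate
    fpr = fp / (fp + tn)          # Eq. 2.2, false alarm rate
    tnr = tn / (tn + fp)          # Eq. 2.3, specificity
    fnr = fn / (fn + tp)          # Eq. 2.4
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. 2.5
    error_rate = 1 - accuracy                    # Eq. 2.6
    precision = tp / (tp + fp)                   # Eq. 2.7
    f_measure = 2 * precision * tpr / (precision + tpr)  # Eq. 2.8
    return {
        "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
        "accuracy": accuracy, "error_rate": error_rate,
        "precision": precision, "F_measure": f_measure,
    }


# Hypothetical counts: 100 attacks (90 detected), 100 normal (5 false alarms).
m = ids_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that TPR and FNR sum to 1, as do TNR and FPR, so in practice reporting one of each pair is sufficient.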

Another evaluation method is to calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values. Small values indicate a classification of higher quality. MAE is the average absolute difference between the classifier's predicted output and the actual output, while RMSE is the square root of the Mean Square Error (MSE), which is the average of the squared differences between the classifier's predicted output and the actual output, taken over the N instances:

\[ MAE = \frac{1}{N} \sum_{i=1}^{N} \left| Desired_i - Actual_i \right| \qquad \text{(Eq. 2.9)} \]

\[ MSE = \frac{1}{N} \sum_{i=1}^{N} \left( Desired_i - Actual_i \right)^2 \qquad \text{(Eq. 2.10)} \]

\[ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Desired_i - Actual_i \right)^2} \qquad \text{(Eq. 2.11)} \]
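The three error measures of Eq. 2.9 to Eq. 2.11 can be checked with a few lines of code. This is a minimal sketch; the function names and the 0/1 label sequences are assumptions made for illustration, with 1 standing for attack and 0 for normal.

```python
import math


def mae(desired, actual):
    """Eq. 2.9: mean absolute difference between desired and actual outputs."""
    return sum(abs(d - a) for d, a in zip(desired, actual)) / len(desired)


def mse(desired, actual):
    """Eq. 2.10: mean of the squared differences."""
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / len(desired)


def rmse(desired, actual):
    """Eq. 2.11: square root of the MSE."""
    return math.sqrt(mse(desired, actual))


# Hypothetical labels: two of the four outputs differ from the desired ones.
desired = [1, 0, 1, 1]
actual = [1, 1, 1, 0]
print(mae(desired, actual), mse(desired, actual), rmse(desired, actual))
```

For 0/1 outputs such as these, MAE and MSE coincide because each difference is either 0 or 1; they differ once outputs are real-valued, where MSE and RMSE penalize large deviations more heavily.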


2.3 Intrusion Detection System (IDS)

An intrusion can be defined as any set of actions that attempt to compromise the integrity, confidentiality, or availability of resources. Intrusion detection is therefore required as an additional wall for protecting systems [24]. An intrusion detection system (IDS) is a security layer used to discover ongoing intrusive attacks and anomalous activities in information systems, and it usually works in a dynamically changing environment. There are two types of intrusion detection systems, host-based and network-based, and they usually differ in the detection techniques they use. These techniques range from misuse detection and anomaly detection to supervised and unsupervised learning [24,25]. IDSs perform the following operations in order to identify an intrusion [26]:

• Manual log examination.
• Automated log examination.
• Host-based intrusion detection software.
• Network-based intrusion detection software.
• Audit of system structure and faults.
• Audit trail management of the operating system and recognition of user behavior that violates the security policy of an organization.
• Statistical analysis of abnormal activities.
• Monitoring and analyzing user and system activities.
• Recognition of activity patterns to identify known attacks and generate an alarm as an indication of attack.
• Measuring the confidentiality and integrity of the system and data files.

Manual log examination can be effective, but it can also be time-consuming and prone to error. Human beings are just not good at manually reviewing computer logs. A better form of log examination would be to create programs or scripts that can search through computer logs looking for potential anomalies.

Intrusion detection systems were once touted as the solution to the entire security problem. No longer would we need to protect our files and systems; we could just identify when someone was doing something wrong and stop them [26]. In fact, some intrusion detection systems were marketed with the ability to stop attacks before they were successful. Strictly speaking, an IDS does not prevent an intrusion from occurring; it detects the intrusion and reports it to the system operator. No intrusion detection system is foolproof, and thus they cannot replace a good security program or good security practice. They will also not detect legitimate users who may have incorrect access to information. The implementation of intrusion detection mechanisms should not be considered until the majority of high-risk areas are addressed [26].

Intrusion detection is broadly considered to be a classification problem. The main issue in a standard classification problem lies in minimizing the probability of error while making the classification decision; hence the key point is how to choose an effective classification method to construct an accurate intrusion detection system with a high detection rate and a low false alarm rate [27,28].
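The automated log examination mentioned in this section can be illustrated with a short Python sketch. It assumes a simplified, hypothetical log format in which failed logins appear as "Failed password for <user> from <IP>"; real log formats differ between systems, and the threshold of three failures is an arbitrary choice for the example.

```python
import re
from collections import Counter

# Assumed (hypothetical) log line: "Failed password for root from 10.0.0.5 port 22"
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")

def suspicious_sources(log_lines, threshold=3):
    """Return {ip: count} for source addresses with at least `threshold` failed logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}

sample_log = [
    "Failed password for root from 10.0.0.5 port 22",
    "Accepted password for alice from 10.0.0.9 port 22",
    "Failed password for admin from 10.0.0.5 port 22",
    "Failed password for root from 10.0.0.9 port 22",
    "Failed password for guest from 10.0.0.5 port 22",
]

print(suspicious_sources(sample_log))
```

A script of this kind only flags candidates for review; as the text notes, it reports possible intrusions rather than preventing them.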

2.4 Types of Intrusion Detection Systems

There are several types of intrusion detection systems, and the choice of which one to use depends on the overall risks to the organization and the resources available [22]. One classification of IDSs is based on the resource they monitor. According to this classification, IDSs are divided into two primary types according to their location: host-based (HIDS) and network-based (NIDS). As the name suggests, a HIDS is located on the host computer. It analyzes audit trail data such as user logs and system calls (calls to functions provided by the operating system kernel) on the host where it is installed and looks for indications of attacks on that host [29].


A NIDS, on the other hand, resides on a separate system that watches network traffic. It intercepts packets passing through the network in order to analyze them, looking for indications of attacks that traverse that portion of the network. The current trend in intrusion detection is to combine both host-based and network-based information to develop hybrid systems [26,30].

2.4.1 Host-Based IDS

A host-based IDS operates on data collected from a single computer system (host). These data can come from the innermost part of the host's operating system (audit data) or from system logs, and the host-based IDS uses them to detect traces of an attack. Because a HIDS is deployed on the host system and uses the host's computational resources, it can degrade the host's performance. Deployment on individual hosts also makes configuration difficult, as different hosts may have different behaviors and usage [31]. A HIDS has access to detailed information on system events, but it may be disabled or made useless by an attacker who successfully gains administrative privileges on the protected machine. An intrusion that installs a rootkit (a piece of software that installs itself as part of the operating system kernel) is able to hide traces of anomalous system activities [32]. Once the rootkit is installed, it enables the attacker to cover the traces of malicious activities by cleaning system logs and hiding information about malicious processes at the kernel level.

2.4.2 Network-Based IDS

A network-based IDS acquires and examines network traffic packets for signs of intrusion. It comprises a set of dedicated sensors or hosts that scan network traffic to detect attacks or intrusive behavior and protect the hosts connected to the network [31].


The major advantages of a network-based IDS include its ability to scan large networks transparently, without affecting the normal operation of the network. Because it scans traffic passively, it is also invisible to attackers, which makes the network more secure [34]. A NIDS analyzes packets crossing an entire network segment and can therefore protect a larger number of hosts at the same time. However, it can suffer from performance problems due to the large amount of traffic it needs to analyze in real time. In addition, it can be the target of attacks that exploit ambiguities in network protocols and exhaust the memory and computational resources of the IDS [33].

The major disadvantages of a network-based IDS are its inability to handle encrypted data, its inability to report whether an attack was successful, and its inability to handle fragmented packets (which make the IDS unstable); it can report only the initiation of an attack [34]. Furthermore, a network-based IDS is inherently unable to monitor intrusive activities that do not produce externally observable evidence.

2.5 Intrusion Detection System Components and Requirements

IDS components can be summarized from two perspectives [35]:

1. From an algorithmic perspective:
• Features: capture intrusion evidence from audit data.
• Models: infer attacks from evidence.

2. From a system architecture perspective:
• Audit data processor, knowledge base, decision engine, alarm generation and responses.

The requirements for developing an IDS can be listed at two levels of abstraction [36]:

1. High Level Requirements:
• Develop an application that can sniff the traffic to and from the host machine.
• Develop an application that is capable of analyzing the network traffic and detecting numerous pre-defined intrusion attacks and mappings.
• Develop an application that warns the owner of the host machine about the likely occurrence of an intrusion attack.
• The application should block traffic to and from a machine identified as potentially malicious; such machines are usually defined by the owner of the host machine.

2. Low Level Requirements:
• Develop an application capable of displaying the incoming and outgoing traffic of the host machine, in the form of packets, to the owner of the host.
• An application is required that detects the occurrence of Denial of Service (DoS) attacks such as Smurf and SYN flood.
• Develop an application that detects attempts to map the network of the host using techniques such as Efficient Mapping and Cerebral Mapping.
• An application is required that detects attempts to gain unauthorized access to the services provided by the host machine using techniques such as Port Scanning.
• An application is required that maintains a "Log Record" of the intrusion attacks identified on the host in the present session and displays it upon request.
• Activation or deactivation of each of the attack detection methods should be possible.
• Provide a selection procedure that lets the user of the host frame rules which explicitly specify the set of IP addresses to be blocked or allowed. These rules shall determine the flow of traffic at the host.

2.6 Intrusion Detection Techniques

The techniques for intrusion detection can be divided into two categories:

• Anomaly Intrusion Detection
• Misuse Intrusion Detection

These techniques are categorized based upon different approaches like Statistics, Data Mining, and Neural Network. Table 2.2 shows a comparison between different intrusion detection techniques [26].

Table 2.2: Comparison of Intrusion Detection Techniques

No.  Detection Technique       Approach           Detection of Known Attack  Detection of Unknown Attack
1    Misuse Based Detection    Genetic Algorithm  Yes                        No
2    Misuse Based Detection    Expert system      Yes                        No
3    Misuse Based Detection    State Transition   Yes                        No
4    Anomaly Based Detection   Data Mining        Yes                        Yes
5    Anomaly Based Detection   Rule Based         Yes                        Yes
6    Anomaly Based Detection   Decision Tree      Yes                        Yes
7    Anomaly Based Detection   Statistical        Yes                        Yes
8    Anomaly Based Detection   Signature          Yes                        Yes
9    Anomaly Based Detection   Neural network     Yes                        Yes


Intrusion detection methods may also be based on supervised or unsupervised learning. Supervised learning methods for intrusion detection can only detect known intrusions, while unsupervised learning methods can detect intrusions that have not been learned previously. Examples of unsupervised learning for intrusion detection include K-means-based approaches and self-organizing feature maps.

2.6.1 Anomaly Intrusion Detection

This method works from the definition "anomalies are not normal" [37,38]. Anomaly detection tries to determine whether deviations from established normal usage patterns can be flagged as intrusions, and it assumes that all intrusive activities are anomalous. Many anomaly detection techniques work on the principle of detecting deviations from normal behavior: a normal activity profile is established for a system, and any system state that varies from the established profile is classified as an intrusion [38]. Anomaly detection techniques include Statistical, Neural Network, Immune System, File Checking and Data Mining [26]. The main methods are briefly described below:

• Statistical based methods: Statistical methods monitor user or network behavior by measuring certain variable statistics over time.
• Distance based methods: These methods try to overcome the limitations of the statistical approach when the multidimensional distributions of the data are difficult to estimate.
• Rule based: A rule based system uses a set of "if-then" implication rules to characterize computer attacks. State transition is used to identify an intrusion by using a finite state machine that is deduced from the network. IDS states correspond to different states of the network, and an event makes a transition in this finite state machine. An activity is identified as an intrusion if the state transitions in the finite state machine of the network reach a sequel state.
• Profile based methods: This method is similar to the rule based method. Here, profiles of normal behavior are built for different types of network traffic, users, and devices. Deviations from these profiles indicate an intrusion.
• Model based methods: This approach is based on the differences between normal and abnormal behavior, which are modeled without creating several profiles of them. In model based methods, researchers attempt to model the normal or abnormal behavior, and a deviation from this model indicates an intrusion.
• Signature based: Signatures available in a database are matched against data collected from activities in order to identify intrusions.
• Neural Network based: A trained neural network model can distinguish between normal and attack patterns, and it can also identify the type of the attack.
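The statistical and profile based ideas in the list above can be sketched in a few lines of Python. The sketch assumes that a single numeric measurement (for example packets per second) is monitored, that the normal profile is summarized by its mean and standard deviation, and that a deviation of more than k standard deviations is flagged. The baseline values, the function names and the choice k = 3 are illustrative assumptions, not parameters from the thesis.

```python
import statistics

def build_profile(normal_values):
    """Summarize normal behavior by the mean and sample standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, k=3.0):
    """Flag a measurement that deviates from the profile by more than k deviations."""
    mean, std = profile
    if std == 0:
        return value != mean
    return abs(value - mean) / std > k

# Made-up baseline of packets per second observed during normal operation.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
profile = build_profile(baseline)

print(is_anomalous(105, profile))  # close to the baseline
print(is_anomalous(500, profile))  # far outside the baseline
```

The threshold k trades detection rate against false alarms: a smaller k flags more intrusions but also more legitimate variations.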

2.6.2 Misuse Intrusion Detection

Misuse detection is the most common approach used in commercial IDSs. Misuse intrusion detection uses the patterns of known attacks or weak spots of the system to match and identify attacks [26]. Attacks are represented in the form of patterns or attack signatures, so that even variations of the same attack can be detected. Misuse detection mainly relies on an expert system to identify intrusions based on an available knowledge base. This approach detects known attacks by trying to recognize known bad behavior [38]. Misuse detection techniques include genetic algorithm, expert system, pattern matching, state transition analysis and keystroke monitoring [26]. The main techniques are briefly described below:

• Genetic Algorithm based Detection (GAD): Many researchers have used GAD in IDSs to detect malicious intrusions. The genetic algorithm provides the necessary population breeding, randomizing, and statistics gathering functions.
• Expert System based Detection: An expert system is software, or a combination of software and hardware, capable of competently executing a specific task usually performed by a human expert. Expert systems are highly specialized computer systems capable of simulating human specialist knowledge and reasoning by using a knowledge base. An expert system is characterized by a set of facts and heuristic rules. Heuristic rules are rules of thumb accumulated by a human expert through intensive problem solving in the domain of a particular task.
• State Transition based Detection: In this approach the IDS identifies an intrusion by using a finite state machine that is deduced from the network. IDS states correspond to different states of the network, and an event generates a transition in this finite state machine. An activity is identified as an intrusion if the state transition in the finite state machine reaches an abnormal state. The main problem in this technique is to find known signatures that include all the possible variations of the pertinent attack and that do not match non-intrusive activity.

2.7 Learning Procedures

Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or the type of input available during training [39,40]:

• Supervised learning algorithms are trained on labeled examples. The supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs which can then be used to speculatively generate an output for previously unseen inputs.

• Unsupervised learning algorithms operate on unlabeled examples. Here the objective is to discover structure in the data (e.g. through a cluster analysis) for inputs where the desired output is unknown.

• Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier.

• Reinforcement learning is concerned with how intelligent agents ought to act in an environment to maximize some notion of reward. The agent executes actions which cause the observable state of the environment to change. Through a sequence of actions, the agent attempts to gather knowledge about how the environment responds to its actions, and attempts to synthesize a sequence of actions that maximize a cumulative reward.

The learning procedure used in this thesis falls into the semi-supervised learning category.

2.8 Common Attacks and Vulnerabilities in NIDS

Current NIDSs require a substantial amount of human intervention by administrators in order to operate effectively. It is therefore important for network administrators to understand the architecture of a NIDS, the well-known attacks, and the mechanisms used to detect them, so that damage can be contained. This section discusses some well-known attack types, exploits, and vulnerabilities (in the end host operating systems). The attack categories are [41]:

1. Confidentiality: In such kinds of attacks, the attacker gains access to confidential and otherwise inaccessible data.
2. Integrity: In such kinds of attacks, the attacker can modify the system state and alter the data without proper authorization from the owner.
3. Availability: In such kinds of attacks, the system is either shut down by the attacker or made unavailable to general users. Denial of Service attacks fall into this category.
4. Control: In such attacks, the attacker gains full control of the system and can alter the access privileges of the system, thereby potentially triggering all three of the attacks above.

2.9 Technical Discussion

To completely understand how these attacks take place, one must examine the structure of the TCP/IP protocol suite against the OSI model (Figure 2.1). A basic understanding of the protocol headers and network exchanges is crucial to the process.

OSI Model

Group         Data unit        Layer            Function
Host layers   Data             7. Application   Network process to application
Host layers   Data             6. Presentation  Data representation, encryption and decryption, convert machine dependent data to machine independent data
Host layers   Data             5. Session       Inter host communication, managing sessions between applications
Host layers   Segments         4. Transport     End-to-end connections, reliability and flow control
Media layers  Packet/Datagram  3. Network       Path determination and logical addressing
Media layers  Frame            2. Data link     Physical addressing
Media layers  Bit              1. Physical      Media, signal and binary transmission

Figure 2.1: OSI Model


2.9.1 Internet Protocol – IP

Internet Protocol (IP) is a network protocol operating at layer 3 (network) of the OSI model and is used to route packets on a network. It follows a connectionless model, meaning that no information regarding transaction state is kept [42]. Additionally, there is no method in place to ensure that a packet is properly delivered to the destination. Examining the IP header (Figure 2.2), the first 12 bytes (the top 3 rows of the header) contain various information about the packet. The next 8 bytes (the next 2 rows) contain the source and destination IP addresses. Using one of several tools (HPing, NMap, PacketExcalibur, Scapy, etc.) [43], an attacker can easily modify these addresses, specifically the "source address" field. It is important to note that each datagram is sent independently of all others due to the stateless nature of IP.
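The header layout described above can be made concrete with a short Python sketch that unpacks the fixed 20-byte IPv4 header using the standard `struct` and `socket` modules. The function name and the sample addresses (taken from documentation address ranges) are assumptions for this example; IP options and checksum verification are deliberately ignored.

```python
import socket
import struct

# Fixed IPv4 header: 12 bytes of packet information, then 4 + 4 bytes of addresses.
IPV4_FORMAT = "!BBHHHBBH4s4s"

def parse_ipv4_header(raw):
    """Unpack the first 20 bytes of an IPv4 packet into a dictionary."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack(IPV4_FORMAT, raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 = TCP, 17 = UDP, 1 = ICMP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hand-built sample header with made-up field values (checksum left at 0).
sample = struct.pack(IPV4_FORMAT, 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("192.0.2.1"),
                     socket.inet_aton("198.51.100.7"))

header = parse_ipv4_header(sample)
print(header["src"], "->", header["dst"])
```

Nothing in these 20 bytes authenticates the source address, so a receiver that parses the header in this way simply reads whatever value the sender chose to write.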

Figure 2.2: IP Packet Header

2.9.2 Transmission Control Protocol – TCP

IP can be thought of as a routing wrapper for layer 4 (transport) of the OSI model, which contains the Transmission Control Protocol (TCP). Unlike IP, TCP uses a connection-oriented design. This means that the participants in a TCP session must first build a connection via the 3-way handshake (SYN-SYN/ACK-ACK) and then update one another on progress via sequence and acknowledgement numbers [42]. This "conversation" ensures data reliability, since the sender receives an OK from the recipient after each packet exchange [44]. A TCP header is very different from an IP header (Figure 2.3). The concern here is with the first 12 bytes of the TCP header, which contain port and sequencing information. Much like an IP datagram, TCP packets can be manipulated using software. The source and destination ports normally depend on the network application in use (for example, HTTP via port 80). What is important for understanding spoofing are the sequence and acknowledgement numbers. The data contained in these fields ensure packet delivery by determining whether or not a packet needs to be resent [42].

Figure 2.3: TCP Packet Header

The sequence number is the number of the first byte in the current packet, relative to the data stream. The acknowledgement number, in turn, contains the value of the next expected sequence number in the stream. This relationship confirms, on both ends, that the proper packets were received. This is quite different from IP, since transaction state is closely monitored [42].
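The relationship between the two numbers can be written as a one-line rule: the acknowledgement a receiver returns equals the sender's sequence number plus the number of sequence positions the segment consumed. The Python sketch below illustrates this; the function name and the sample numbers are assumptions for this example. It relies on the standard TCP behavior that the SYN and FIN flags each consume one sequence number and that sequence numbers are 32-bit values that wrap around.

```python
def expected_ack(seq, payload_len, syn=False, fin=False):
    """Next sequence number the receiver expects, modulo 2**32."""
    consumed = payload_len + (1 if syn else 0) + (1 if fin else 0)
    return (seq + consumed) % 2**32

# Made-up initial sequence number chosen by a client.
client_isn = 1000

# The SYN carries no data but consumes one sequence number.
print(expected_ack(client_isn, 0, syn=True))

# A later segment carrying 100 bytes starting at sequence number 1001.
print(expected_ack(1001, 100))
```

An attacker who spoofs a source address and cannot see the replies must guess these numbers correctly for the forged segments to be accepted.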


2.10 IP Spoofing

The basic protocol for sending data over the Internet and many other computer networks is the Internet Protocol (IP) [44]. The header of each IP packet contains, among other things, the numerical source and destination addresses of the packet. The source address is normally the address that the packet was sent from. By forging the header so that it contains a different address, an attacker can make it appear that the packet was sent by a different machine. The machine that receives spoofed packets will send its response back to the forged source address. This means that the technique is mainly used when the attacker does not care about the response or has some way of guessing it [45].

IP spoofing, or Internet Protocol address spoofing, is the method of creating an IP packet using a fake IP address that impersonates a legitimate IP address. IP spoofing is a method of attacking a network in order to gain unauthorized access [46]. The attack is based on the fact that Internet communication between distant computers is routinely handled by routers, which find the best route by examining the destination address but generally ignore the origination address. The origination address is only used by the destination machine when it responds back to the source [47].

In a spoofing attack, the intruder sends messages to a computer indicating that the message has come from a trusted system. To be successful, the intruder must first determine the IP address of a trusted system and then modify the packet headers so that the packets appear to be coming from the trusted system [47]. The purposes of spoofing include obscuring the true source of the attack, implicating another site as the attack origin, pretending to be a trusted host, hijacking or intercepting network traffic, or causing replies to target another system.


IP spoofing is most frequently used in denial-of-service attacks which will be addressed in the next section of this chapter.

2.10.1 Denial of Service Attack

IP spoofing is almost always used in what is currently one of the most difficult attacks to defend against: denial of service (DoS) attacks. Since attackers are concerned only with consuming bandwidth and resources, they need not worry about properly completing handshakes and transactions. Rather, they wish to flood the victim with as many packets as possible in a short amount of time [48]. In order to prolong the effectiveness of the attack, they spoof source IP addresses to make tracing and stopping the DoS as difficult as possible. When multiple compromised hosts participate in the attack, all sending spoofed traffic, it is very challenging to block the traffic quickly [49].

A denial-of-service (DoS) attack or distributed denial-of-service (DDoS) attack is an attempt to make a computer resource unavailable to its intended users. Although the means to carry out, motives for, and targets of a DoS attack may vary, it generally consists of the efforts of a person or persons to prevent an Internet site or service from functioning efficiently, temporarily or indefinitely [50]. Perpetrators of DoS attacks typically target sites or services hosted on high-profile web servers such as banks, credit card payment gateways, and even DNS root servers [51].

One common method of attack involves saturating the target (victim) machine with external communication requests, such that it cannot respond to legitimate traffic, or responds so slowly as to be rendered effectively unavailable. In general terms, DoS attacks are implemented by forcing the targeted computer(s) to reset, by consuming their resources so that they can no longer provide the intended service, or by obstructing the communication media between the intended users and the victim so that they can no longer communicate adequately [52]. The main types of DoS attack are listed below:

• Smurf Attack: A Smurf attack exploits the target by sending repeated ping requests to the broadcast address of the target network. The ping request packet often uses a forged source IP address (return address), which is that of the target site that is to receive the denial of service attack. The result is a large number of ping replies flooding back to the innocent, spoofed host. If the number of hosts replying to the ping request is large enough, the network will no longer be able to receive real traffic [52,53].

• SYN Floods (Neptune): When a session is established between a TCP client and server, a handshaking message exchange occurs between the server and the client. A session setup packet contains a SYN field that identifies the sequence in the message exchange. An attacker may send a flood of connection requests and not respond to the replies. This leaves the request packets in the buffer so that legitimate connection requests cannot be accommodated [44].

• Ping of Death (PoD): Ping of Death is caused by an attacker overwhelming the victim network with Internet Control Message Protocol (ICMP) echo request packets. This is a fairly easy attack to perform without extensive network knowledge, as many ping utilities support this operation. A flood of ping traffic can consume significant bandwidth on low to mid speed networks, slowing a network down to a crawl. A ping of death is also known as "long ICMP" [53].

• Teardrop Attack: A teardrop attack works by sending IP fragment packets that are difficult to reassemble. A fragment packet carries an offset that the receiving system uses to reassemble the entire packet. In the teardrop attack, the attacker puts a confusing offset value in the subsequent fragments, and if the receiving system does not know how to handle such a situation, it may crash [53].

• Back: This type of DoS attack works against the Apache web server. An attacker submits requests with URLs containing many front slashes. As the server tries to process these requests, it slows down and becomes unable to process other requests [54].
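From the detection side, the SYN flood described in the list above leaves a recognizable trace: many handshakes that were opened with a SYN but never completed with the final ACK. The Python sketch below counts such half-open handshakes per target. The packet representation (tuples with a set of flag names), the function names and the threshold are hypothetical simplifications for illustration; this is not the detection model developed in the thesis.

```python
from collections import defaultdict

def half_open_counts(packets):
    """Count handshakes per (dst, dport) that saw a SYN but no final ACK.

    `packets` is an iterable of (src, sport, dst, dport, flags) tuples, where
    `flags` is a set such as {"SYN"} or {"ACK"} (an assumed, simplified format).
    """
    pending = set()
    for src, sport, dst, dport, flags in packets:
        key = (src, sport, dst, dport)
        if flags == {"SYN"}:
            pending.add(key)        # client opened a handshake
        elif flags == {"ACK"}:
            pending.discard(key)    # client completed the handshake
    counts = defaultdict(int)
    for _, _, dst, dport in pending:
        counts[(dst, dport)] += 1
    return dict(counts)

def flooded_targets(packets, threshold=100):
    """Return the targets whose number of half-open handshakes reaches the threshold."""
    return [target for target, n in half_open_counts(packets).items() if n >= threshold]

# One legitimate handshake followed by 150 SYNs from made-up spoofed sources.
traffic = [("192.0.2.10", 5000, "10.0.0.1", 80, {"SYN"}),
           ("192.0.2.10", 5000, "10.0.0.1", 80, {"ACK"})]
traffic += [(f"203.0.113.{i % 250}", 1024 + i, "10.0.0.1", 80, {"SYN"})
            for i in range(150)]

print(flooded_targets(traffic))
```

A practical detector would also expire old entries over a time window; that is omitted here to keep the sketch short.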

This thesis focuses on the detection of the DoS attack class and its types. System training and testing are carried out on normal packets and DoS packets in order to construct a model for DoS detection.

2.11 Data Mining and Intrusion Detection System

The term data mining is frequently used to designate the process of extracting useful information from large databases. The term knowledge discovery in databases (KDD) is used to denote the process of extracting useful knowledge from large datasets. Data mining, by contrast, refers to one particular step in this process, which ensures that the extracted patterns actually correspond to useful knowledge [55]. Data mining refers to a set of procedures for excavating previously unknown but potentially valuable information from large stores of past data. Data mining techniques basically correspond to pattern discovery algorithms, but most of them are drawn from related fields like machine learning or pattern recognition [56]. In this thesis two machine learning techniques have been used: the unsupervised K-means algorithm and the supervised Decision Tree (C4.5).

2.12 Feature Selection (FS)

Feature selection is an important topic in data mining, especially for high-dimensional datasets [57]. Multiple dimensions are hard to think in and impossible to visualize, and because the number of possible values grows exponentially with each dimension, complete enumeration of all subspaces becomes intractable as dimensionality increases. This problem is known as the curse of dimensionality [58]. Feature selection (also known as subset selection) is the process of selecting a group of useful features from the original feature space [59]. This process is commonly used in machine learning, wherein subsets of the features available from the data are selected for application of a learning algorithm. The best subset contains the smallest number of dimensions that contribute most to accuracy; the remaining, unimportant dimensions are discarded. Feature selection is an important stage of preprocessing and is one of the ways of avoiding the curse of dimensionality, which refers to how certain learning algorithms may perform poorly on high-dimensional data. Usually, features are specified or chosen before data are collected. Features can be discrete, continuous, or nominal. Generally, features are characterized as [60]:

1. Relevant: Features which have an influence on the output and whose role cannot be assumed by the rest.
2. Irrelevant: Features that do not have any influence on the output, and whose values are generated at random.
3. Redundant: A redundancy exists whenever a feature can take the role of another (the simplest way to model redundancy).

Feature selection is an essential data processing step prior to applying a learning algorithm [61]. Not all features are useful in constructing the system model; some features may be redundant or irrelevant and thus do not contribute to the learning process. The main aim of the feature selection process is to determine a minimal feature subset from the problem domain while retaining a suitably high accuracy in representing the original features.

There are two approaches to feature selection, known as Forward Selection and Backward Selection. Forward Selection starts with no variables and adds them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error. Backward Selection starts with all the variables and removes them one by one, at each step removing the one whose removal decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. To reduce overfitting, the error referred to above is the error on a validation set that is distinct from the training set [60].

The main idea of the FS process is to choose a subset of input variables by eliminating features with little or no predictive information. The advantages of FS can be listed as:

• It reduces the dimensionality of the feature space, to limit storage requirements and increase algorithm speed.
• It removes redundant, irrelevant or noisy data.
• It speeds up the running time of the learning algorithms, which is the immediate effect for data analysis tasks.
• It improves the data quality.
• It increases the accuracy of the resulting model.

• Feature set reduction, to save resources in the next round of data collection or during utilization.
• Performance improvement, to gain in predictive accuracy.
• Data understanding, to gain knowledge about the processes that generated the data or simply to visualize the data in a better way.

Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction and how these features are related. The removal of irrelevant and redundant information often improves the performance of the machine learning algorithm.
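The Forward Selection procedure described in this section can be sketched as a greedy loop. In the Python sketch below, `error_fn` stands for any caller-supplied function that returns the validation error of a model trained on a given feature subset; the toy error function, the feature names and the stopping tolerance are hypothetical values chosen only to make the example runnable.

```python
def forward_selection(features, error_fn, min_improvement=1e-3):
    """Greedy forward selection.

    Starts with no features and repeatedly adds the one that lowers the
    validation error the most, stopping when the best remaining addition
    improves the error by less than `min_improvement`.
    """
    selected = []
    remaining = list(features)
    best_error = error_fn(tuple(selected))
    while remaining:
        candidate_errors = {f: error_fn(tuple(selected + [f])) for f in remaining}
        best_feature = min(candidate_errors, key=candidate_errors.get)
        if best_error - candidate_errors[best_feature] < min_improvement:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = candidate_errors[best_feature]
    return selected, best_error

# Toy stand-in for a validation error: each feature removes a fixed amount of error.
GAINS = {"duration": 0.2, "src_bytes": 0.15, "flag": 0.0005, "noise": 0.0}

def toy_error(subset):
    return 0.5 - sum(GAINS[f] for f in subset)

selected, error = forward_selection(GAINS, toy_error)
print(selected, round(error, 4))
```

Backward Selection follows the same pattern in reverse: it starts from the full set and removes, at each step, the feature whose removal hurts the validation error the least.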

2.12.1 General Methods for Feature Selection

The relationship between a feature selection algorithm (FSA) and the inducer chosen to evaluate the usefulness of the feature selection process can be classified into two types: Wrapper and Filter methods.

The Wrapper approach uses the classification method itself to measure the importance of the feature set; hence the features selected depend on the classifier model used. Wrapper methods generally result in better performance than filter methods because the feature selection process is optimized for the classification algorithm to be used. However, wrapper methods are too expensive for high-dimensional databases in terms of computational complexity and time, since each feature set considered must be evaluated with the classifier algorithm used.

The Filter approach precedes the actual classification process. It is independent of the learning algorithm, computationally simple, fast and scalable. With the Filter method, feature selection is done only once, and its result can then be provided as input to different classifiers. Various feature ranking and feature selection techniques have been proposed, such as Correlation-based Feature Selection (CFS), Principal Component Analysis (PCA), Gain Ratio (GainR) attribute evaluation, Chi-square Feature Evaluation, Fast Correlation-Based Filter (FCBF), Information Gain (IG), Euclidean distance, I-test and Markov blanket filter. Some of these filter methods do not perform feature selection but only feature ranking; hence they are combined with a search method when one needs to find the appropriate number of attributes. Such filters are often used with forward selection (which considers only additions to the feature subset), backward elimination, bi-directional search, best-first search, and genetic search.

2.12.2 Information Gain (IG) Feature Selection

Information Gain (IG) is an entropy-based feature evaluation method, widely used in the field of machine learning. When used in feature selection, Information Gain is defined as the amount of information provided by the feature items for the IDS [62]. It measures how much a feature (a term, in text classification) contributes to the classification decision, and so quantifies the importance of that feature. With Information Gain, the features are filtered to create the most prominent feature subset before the start of the learning process. It takes number and size of branches into account when choosing an attribute as it corrects the information gain by taking the intrinsic information of a split into account [22]. The procedure for computing the information gain is as follows.

Let S be a set of training samples with their corresponding labels. Suppose there are m classes, the training set contains s_i samples of class i, and S also denotes the total number of samples in the training set. The expected information needed to classify a given sample is calculated as in Eq. 2.12:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{S} \log_2 \frac{s_i}{S} \qquad \text{Eq. 2.12}$$

A feature F with values {f1, f2, … , fv} can divide the training set into v subsets { S1, S2, …, Sv } where Sj is the subset which has the value fj for feature F.


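Furthermore, let S_j contain s_ij samples of class i. The entropy of the feature F is calculated as in Eq. 2.13, where I(s_1j, ..., s_mj) is the expected information of subset S_j, obtained from Eq. 2.12 using the class counts within S_j:

$$E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{S} \cdot I(s_{1j}, s_{2j}, \ldots, s_{mj}) \qquad \text{Eq. 2.13}$$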

Information gain for feature F can be calculated as in Eq. 2.14:

$$IG(F) = I(s_1, s_2, \ldots, s_m) - E(F) \qquad \text{Eq. 2.14}$$
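As a worked illustration of Eqs. 2.12 to 2.14, consider a hypothetical training set (the numbers are invented for this example and are not taken from KDD Cup 99) with S = 10 samples and m = 2 classes: 5 normal and 5 DoS. Suppose a feature F takes two values, where f_1 covers 4 normal and 1 DoS samples and f_2 covers 1 normal and 4 DoS samples. The entropy of F weights the expected information of each subset, computed from the class counts inside that subset, by the subset's share of the samples:

```latex
\begin{align*}
I(s_1, s_2) &= -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \\
I(s_{11}, s_{21}) &= -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} \approx 0.722 \\
I(s_{12}, s_{22}) &= -\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{4}{5}\log_2\tfrac{4}{5} \approx 0.722 \\
E(F) &= \tfrac{5}{10}(0.722) + \tfrac{5}{10}(0.722) \approx 0.722 \\
IG(F) &= 1 - 0.722 \approx 0.278
\end{align*}
```

A feature that separated the two classes perfectly would give E(F) = 0 and IG(F) = 1, while a feature unrelated to the class would give IG(F) close to 0.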

2.13 Clustering Algorithms

Clustering, or cluster analysis, groups data objects based on the information found in the data, which describes the objects and their relationships. The goal is to make objects within a group similar (or related) to one another and different from (or unrelated to) objects in other groups. The quality of a clustering is determined by the distinctiveness of these groups, as well as by the homogeneity within a single group [63]. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity [64]. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of data into subsets (clusters), so that the data in each subset (ideally) share some common trait of proximity according to some defined distance measure [65]. By clustering, one can spot dense and sparse regions and consequently discover overall distribution patterns and interesting relationships among the data attributes. Clustering algorithms are used extensively not only to organize and categorize data, but also for data compression and model construction. By finding similarities in data, one can represent similar data with fewer symbols [66,67].


Also, by finding groups of data, a model of the problem can be built based on those groupings. Another reason for clustering is its descriptive nature, which can be used to discover relevant knowledge in huge datasets [67]. Clustering is a challenging field of research, as it can be used as a separate tool to gain insight into the distribution of data, to observe the characteristic features of each cluster, and to focus on a particular set of clusters for further analysis. The advantage of applying Data Mining technology to Intrusion Detection Systems lies in its ability to mine succinct and precise characteristics of intrusions automatically from large quantities of information. It can overcome the difficulties of extracting and coding rules in the traditional Intrusion Detection System [56].

2.13.1 Classification of Clustering Algorithms

There are essentially two types of clustering methods (Figure 2.4): hierarchical clustering and partitioning clustering. In hierarchical clustering, once groups are found and objects are assigned to the groups, this assignment cannot be changed. In partitioning clustering, the assignment of objects to groups may change while the algorithm runs. Partitioning clustering is further categorized into hard clustering and soft clustering. Hard clustering is based on mathematical set theory, i.e. a data point either belongs to a particular cluster or does not. K-means clustering is a type of hard clustering. Soft clustering is based on fuzzy set theory, i.e. a data point may partially belong to a cluster [56].

Clustering algorithms can also be classified according to whether the number of clusters to be formed is known in advance ("priory") or not known in advance ("a-priory"). Since the number of clusters is known in advance, priory algorithms try to partition the data into the given number of clusters. K-means and fuzzy c-means clustering algorithms need prior knowledge of the number of clusters, so they belong to the priory type. In the a-priory case, since the number of clusters is not known in advance, the algorithm starts by finding the first large cluster, then goes on to find the second, and so on. Mountain and Subtractive clustering algorithms are examples of this type [56].

[Figure 2.4: Types of Clustering Methods. Data Clustering divides into Hierarchical Clustering and Partitional Clustering; Partitional Clustering divides into Hard Clustering (K-means) and Soft Clustering (Fuzzy C-means).]

The K-means clustering algorithm has been used in this thesis. It partitions the combined normal and Denial of Service (DoS) records into two clusters: a normal cluster and a DoS attack cluster.

2.13.2 K-means Algorithm

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem, and it is applied in many fields. K-means is an iterative algorithm in which the number of clusters must be determined before execution. The K-means algorithm partitions n data points into K clusters, where the number of clusters K is decided in advance by the user [68]. At the beginning, K centroids are initialized according to some rule (usually at random from the data points); they represent the centers of the corresponding clusters. For each data point in the set, the closest centroid is computed, so that clusters of points are created. The assignment of data points to clusters depends on the distance between the cluster centroid and the data point [69].


In the next step, all data points assigned to a given cluster are used to recalculate the centroid. The procedure is repeated until a certain termination condition is met. The general steps of the K-means algorithm are as follows:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
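The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the names `kmeans` and `squared_euclidean`, the `max_iter` cap and the fixed random `seed` are introduced here, the squared Euclidean distance is used for the assignment step, and an empty cluster simply keeps its previous centroid.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, max_iter: int = 100, seed: int = 0
           ) -> Tuple[List[List[float]], List[int]]:
    rng = random.Random(seed)
    # Step 1: pick K data points at random as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no assignment changes (centroids no longer move).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

For example, `kmeans(records, k=2)` on a list of numeric feature vectors returns two centroids and a cluster index (0 or 1) for every record.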

2.14 Decision Tree

A Decision Tree is defined as a predictive modeling technique from the fields of machine learning and statistics that builds a simple tree-like structure to model the underlying pattern of data [70]. Decision Trees are one example of a classification algorithm. Classification is a data mining technique that assigns objects to one of several predefined categories. Classification algorithms recognize distinctive patterns in a dataset and classify activity based on this information [63]. A Decision Tree is a collection of if-then conditional rules for assigning class labels to instances of a dataset. Decision Trees consist of nodes that specify a particular attribute of the data, branches that represent a test on each attribute value, and leaves that correspond to the terminal decision [71]. Decision Trees are a well-known machine learning technique, composed of three basic elements [72]:
● A decision node specifying a test attribute.
● An edge or branch corresponding to one of the possible attribute values.
● A leaf, usually named an answer node, which contains the class to which the object belongs.

In Decision Trees, two major phases should be ensured:
● Building the tree: based on a given training set.
● Classification: to classify a new instance, start at the root of the tree and test the attribute specified by that node. The result of the test determines which branch to follow down the tree for the attribute value of the given instance. This process is repeated until a leaf is encountered. The instance is then assigned the class associated with that leaf [73].

In summary, Decision Trees provide a simple set of rules that can categorize new data. Creating Decision Trees requires a pre-classified dataset so that the algorithms can learn patterns in the data. The training dataset is made up of features, which are quantifiable characteristics of the data. Once the Decision Tree is built from these features, its rules can be used to identify and classify new data of interest by incorporating the logic into existing defenses, such as IDSs, firewalls, custom-built detection scripts, or classification software [74].

2.14.1 C4.5 Decision Tree Algorithm

The C4.5 Decision Tree algorithm has been used in this thesis. C4.5 is an algorithm, developed by Ross Quinlan, that is used to generate a Decision Tree [73]. C4.5 is an extension of Quinlan's earlier ID3 algorithm [75]. The Decision Trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier [76].


The pseudo code for building C4.5 Decision Trees is written below [23]:
1. Check for a base case.
2. For each attribute, find the normalized information gain ratio.
3. Let a_best be the attribute with the highest normalized information gain ratio.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add the obtained nodes as children of the a_best node.

Decision Tree algorithms build the tree top-down, from the root to the leaves. To guide this process, an attribute selection measure is used, which takes into account the discriminative power of each attribute over the classes in order to choose the "best" one as the root of the (sub) Decision Tree [77]. In other words, the best attribute should be used as the root node for splitting the tree. An objective criterion for judging the efficiency of a split is needed, and the information gain measure is used to select the test attribute at each node in the tree [23]. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node [78]. This attribute minimizes the information needed to classify samples in the resulting partitions. C4.5 uses an extension of information gain known as gain ratio for attribute ranking, which applies normalization to information gain [79]. Gain ratio (GainR) should be large when data is evenly spread and small when all data belong to one branch of the attribute. GainR for set S split on feature F is:

$$GainR(S, F) = \frac{IG(S, F)}{E(F)} \qquad \text{Eq. 2.15}$$

where the Information Gain IG(S,F) and the Entropy E(F) are calculated using Eqs. 2.14 and 2.13, respectively.

From an intrusion detection perspective, classification algorithms can characterize network data as normal or attack using information such as source/destination ports, IP addresses, and the number of bytes sent during a connection. Classification algorithms create a Decision Tree like the one presented in Figure 2.5 by identifying patterns in an existing dataset and using that information to create the tree. The algorithms take pre-classified data as input. They learn the patterns in the data and create simple rules to differentiate between the various types of data in the pre-classified dataset.

Figure 2.5: Example of Decision Tree for IDS Classification
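To make the gain ratio used by C4.5 for attribute ranking more concrete, the following Python sketch computes it for one nominal feature. It is an illustration only, and it makes one substitution that should be stated plainly: in Quinlan's usual formulation of C4.5, the information gain is divided by the split information (the entropy of the partition sizes produced by the feature), and that is the denominator used here, whereas Eq. 2.15 above is written with E(F). The function names `entropy` and `gain_ratio` are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the empirical label distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(feature_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    total = len(labels)
    # Group the class labels by feature value.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # Information gain: class entropy minus the weighted entropy after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes (assumed denominator).
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0
```

A feature with a single value produces no split, so the sketch returns 0.0 for it rather than dividing by zero.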

2.15 Dataset Collection

To verify the effectiveness and feasibility of the proposed IDS, the KDD Cup 99 dataset has been used [80]. This dataset is considered a standard dataset and is the most widely used dataset for the evaluation of intrusion detection methods [22,29]. A connection is a sequence of TCP packets to and from some IP addresses, starting and ending at some well-defined times. The dataset contains seven weeks of network traffic, processed into about five million connection records, and two weeks of test data with around two million connection records. The KDD Cup 99 training dataset consists of approximately 4,900,000 single connection vectors, each of which is a vector of extracted feature values of that network connection and contains 41 features [Appendix A, Table A1].


2.15.1 Attacks in KDD Cup 99 Dataset

The simulated attacks in the KDD Cup 99 dataset fall into one of the following four categories [81]:
● Denial of Service (DoS): Attacker tries to prevent legitimate users from using a service.
● Remote to Local (R2L): Attacker does not have an account on the victim machine, hence tries to gain access.
● User to Root (U2R): Attacker has local access to the victim machine and tries to gain super user privileges.
● Probe: Attacker tries to gain information about the target host.

2.15.2 Features of KDD Cup 99 Dataset

In KDD Cup 99, the original TCP dump files were pre-processed for use in the Intrusion Detection System benchmark of the International Knowledge Discovery and Data Mining Tools Competition [81]. Packet information in the TCP dump file is summarized into connections. Specifically, a connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows from a source IP address to a target IP address under some well-defined protocol, with 41 features for each connection. The features are grouped into three categories:
● Basic Features: Basic features can be derived from TCP/IP connection packet headers without inspecting the payload. Basic features are listed in Table 2.3.
● Content Features: Domain knowledge is used to assess the payload of the original TCP packets. This includes features such as the number of failed login attempts, as shown in Table 2.4.
● Traffic Features: This category includes features that are computed with respect to a window interval and is divided into two groups:
  - "Same Host" Features: Examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behaviour, service, etc.
  - "Same Service" Features: Examine only the connections in the past 2 seconds that have the same service as the current connection.

The two aforementioned types of "Traffic" features are called time-based and are listed in Table 2.5.

Table 2.3: Basic Features of TCP Connection

No. | Feature | Description
1 | Duration | Length of the connection (No. of seconds)
2 | Protocol_type | Type of connection protocol (tcp, udp)
3 | Service | Network service on the destination (telnet, ftp)
4 | Flag | Status flag of the connection
5 | Src_bytes | No. of data bytes sent from source to destination
6 | Dst_bytes | No. of data bytes sent from destination to source
7 | Land | 1 if connection is from/to the same host/port; 0 otherwise
8 | Wrong_fragment | No. of wrong fragments
9 | Urgent | No. of urgent packets

The protocol type feature has 3 different values: icmp, tcp and udp. Likewise, the service feature has 70 different values and the flag feature has 11 different values. The description of the different flag values is listed in [Appendix A, Table A2]. These 3 features and their different values play a significant role in constructing the grammars of the proposed method.


Table 2.4: Content Features of the TCP Connection

No. | Feature | Description
10 | Hot | No. of “hot” indicators
11 | Num_failed_logins | No. of failed logins
12 | Logged_in | 1 if successfully logged in; 0 otherwise
13 | Num_compromised | No. of “compromised” conditions
14 | Root_shell | 1 if root shell is obtained; 0 otherwise
15 | Su_attempted | 1 if “su root” command attempted; 0 otherwise
16 | Num_root | No. of “root” accesses
17 | Num_file_creations | No. of file creation operations
18 | Num_shells | No. of shell prompts
19 | Num_access_files | No. of operations on access control files
20 | Num_outbound_cmds | No. of outbound commands in an ftp session
21 | Is_host_login | 1 if the login belongs to the “hot” list; 0 otherwise
22 | Is_guest_login | 1 if the login is a “guest” login; 0 otherwise

Table 2.5: Time Based Features of the TCP Connection

No. | Feature | Description
23 | Count | No. of connections to the same host as the current connection in the past two seconds
24 | Srv_count | No. of connections to the same service as the current connection in the past two seconds
25 | Serror_rate | % of connections that have “SYN” errors
26 | Srv_serror_rate | % of connections that have “SYN” errors
27 | Rerror_rate | % of connections that have “REJ” errors
28 | Srv_rerror_rate | % of connections that have “REJ” errors
29 | Same_srv_rate | % of connections to the same service
30 | Diff_srv_rate | % of connections to different services
31 | Srv_diff_host_rate | % of connections to different hosts


Chapter Three Proposed System Methodology


3.1 Introduction

This chapter describes the architecture and workflow of the proposed IDS. It explains the pre-processing of the dataset used for the experiments, including feature transformation and normalization, and the selection of optimal features using information gain. The proposed hybrid model is then presented with a block diagram of its basic architecture, followed by the details of each part.

3.2 Dataset Pre-Processing

The first part of the analysis engine component of the hybrid IDS model is dataset pre-processing. Pre-processing is of great importance because it increases the efficiency of the intrusion detection mechanism during training, testing, and clustering of network activity into normal and abnormal. The original KDD Cup 99 dataset must be pre-processed to make it suitable for the IDS structure. Dataset pre-processing is achieved by applying:
● Dataset transformation for nominal features
● Dataset normalization for numeric features

3.2.1 Dataset Transformation

The training dataset of KDD Cup 99 consists of approximately 4,900,000 single connection instances. Each connection instance contains 42 features, including the target class (attack or normal). The nominal features of these labelled connection instances have to be transformed into numeric values to be a suitable input for clustering by the K-means algorithm. For this transformation, Table 3.1 is used. In this step, some useless data is filtered and modified. For example, some text items need to be converted into numeric values. There are several nominal values, such as HTTP, TCP and SF, and it is necessary to transform them into numeric values in advance. For example, the protocol type "tcp" is mapped to 1, "udp" is mapped to 2 and "icmp" is mapped to 3. The keys in Table 3.1 are followed to transform the nominal values of the dataset features into numeric values.

Table 3.1: Transformation Table for Different Values of Protocols, Flag and Services

Feature | Value | Code
Protocol Type | TCP | 1
Protocol Type | UDP | 2
Protocol Type | ICMP | 3
Flag | OTH | 1
Flag | REJ | 2
Flag | RSTO | 3
Flag | RSTOS0 | 4
Flag | RSTR | 5
Flag | S0 | 6
Flag | S1 | 7
Flag | S2 | 8
Flag | S3 | 9
Flag | SF | 10
Flag | SH | 11
Service | All services | 1 to 70

An example of original KDD Cup 99 dataset records is shown in Figure 3.1.

0 tcp ftp_data SF 491 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 150 25 0.17 0.03 0.17 0 0 0 0.05 0 normal
0 udp other SF 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 1 0 0 0 0 0.08 0.15 0 255 1 0 0.6 0.88 0 0 0 0 0 normal

Figure 3.1: Records of the KDD Cup 99 Dataset


After transformation, the original KDD Cup 99 dataset records become as shown in Figure 3.2.

0,1,30,10,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,1,0,0,150,25,0.17,0.03,0.17,0,0,0,0.05,0,0
0,2,40,10,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0,0,0,0,0.08,0.15,0,255,1,0,0.6,0.88,0,0,0,0,0,0

Figure 3.2: Records of the KDD Cup 99 Dataset After Transformation
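The transformation from Figure 3.1 to Figure 3.2 can be sketched as a simple lookup over the nominal fields. The protocol and flag codes below follow Table 3.1. The service codes are only the two that can be read off the figures (ftp_data as 30, other as 40), and the label code for "normal" (0) is likewise inferred from Figure 3.2; the complete 1-to-70 service mapping and the codes for attack labels are not reproduced here. The function name `transform_record` is an assumption of this sketch.

```python
# Protocol and flag codes follow Table 3.1.
PROTOCOL_CODES = {"tcp": 1, "udp": 2, "icmp": 3}
FLAG_CODES = {"OTH": 1, "REJ": 2, "RSTO": 3, "RSTOS0": 4, "RSTR": 5, "S0": 6,
              "S1": 7, "S2": 8, "S3": 9, "SF": 10, "SH": 11}
# Partial, illustrative mapping: only the codes visible in Figures 3.1 and 3.2.
SERVICE_CODES = {"ftp_data": 30, "other": 40}
# Assumption: only the label code visible in Figure 3.2.
LABEL_CODES = {"normal": 0}


def transform_record(line: str) -> str:
    """Convert one whitespace-separated record into its numeric, comma-separated form."""
    fields = line.split()
    fields[1] = str(PROTOCOL_CODES[fields[1]])   # protocol_type
    fields[2] = str(SERVICE_CODES[fields[2]])    # service
    fields[3] = str(FLAG_CODES[fields[3]])       # flag
    fields[-1] = str(LABEL_CODES[fields[-1]])    # class label
    return ",".join(fields)
```

A record with a service or label that is missing from these partial tables raises a `KeyError`, which makes gaps in the mapping visible instead of silently producing a wrong code.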

3.2.2 Dataset Normalization

Dataset normalization is essential to enhance the performance of the intrusion detection system when datasets are very large. The first step is to normalize the continuous attributes so that their values fall within the range 0 to 1. Here, the Min-Max method of normalization has been used, with the following equation [82]:

$$x_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)} \qquad \text{Eq. 3.1}$$

where x_i is the normalized value, v_i is the actual value of the attribute, and the maximum and minimum are taken over all values of the attribute. Normally x_i is set to zero if the maximum is equal to the minimum.
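A minimal Python sketch of the Min-Max normalization in Eq. 3.1, applied to the values of a single attribute, is shown below. The function name `min_max_normalize` is an assumption of this sketch; the handling of a constant attribute follows the convention stated above.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale the values of one attribute into the range [0, 1] (Eq. 3.1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value is mapped to zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the function would be applied column by column, once for each continuous feature of the dataset.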

3.3 Proposed Detection Model

This thesis aims at building and simulating an intelligent IDS that can detect known and unknown network intrusions automatically. Under a machine learning framework, the IDS is trained with an unsupervised learning algorithm, namely the K-means algorithm.


With K-means, two clusters are obtained: normal and DoS attacks. No action is taken for the normal cluster. The DoS cluster obtained with the Manhattan distance is passed as input to the second-layer classifier, the C4.5 DT. At this stage the tree has already been constructed and trained, and it generates rules that classify the DoS attacks as Smurf, Neptune, Pod, Back or Teardrop. Figure 3.3 shows the structure of the proposed system.

[Figure 3.3: Proposed Detection Model Structure. Blocks: KDD Cup dataset (normal and DoS records); Information Gain (IG) feature selection (pre-processing); training set (60%) and testing set (40%); K-means clustering algorithm with K=2 using the Euclidean distance metric; K-means clustering algorithm with K=2 using the Manhattan distance metric; normal cluster and DoS cluster outputs; results comparison and evaluation; Decision Tree (C4.5) classification; results and performance evaluation.]

3.4 Information Gain Feature Selection

The dataset used as input for the proposed IDS consists of a huge amount of data with normal and DoS attack records, and each record has numerous attributes associated with it, which means that it needs a lot of processing. A classification process that considers all these attributes needs a lot of processing time, which leads to an increase in the error rate and a decrease in the efficiency of the classification process. The proposed system overcomes this problem by using the Information Gain feature selection process. The Information Gain (IG) algorithm is described in Algorithm 3.1.

Algorithm 3.1: Information Gain
Input: Number of samples in training set S. Number of classes m.
Output: A value representing the information gain for feature F.

Step 1: [Divide Training Set]
Divide the training set into v subsets {S_1, S_2, ..., S_v}, where S_j is the subset which has the value f_j for feature F.

Step 2: [Compute Information Needed for Classifying S]
$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{S} \log_2 \frac{s_i}{S}$$

Step 3: [Compute the Entropy of Feature F]
$$E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{S} \cdot I(s_{1j}, s_{2j}, \ldots, s_{mj})$$

Step 4: [Compute Information Gain for Feature F]
$$IG(F) = I(s_1, s_2, \ldots, s_m) - E(F)$$
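The four steps of the information gain algorithm can be sketched in Python for a single nominal feature. This is an illustrative sketch rather than the code used in this thesis: the function names `expected_information` and `information_gain` are assumptions, the feature is taken to be nominal (one subset per distinct value), and the entropy in Step 3 is computed from the class counts inside each subset S_j.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def expected_information(class_counts: Sequence[int]) -> float:
    """I(s_1, ..., s_m): expected information for the given class counts."""
    total = sum(class_counts)
    return -sum((s / total) * math.log2(s / total) for s in class_counts if s > 0)


def information_gain(feature_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    total = len(labels)
    # Step 1: divide the training set into subsets S_j, one per feature value f_j.
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    # Step 2: information needed to classify a sample of the whole set.
    info_s = expected_information(list(Counter(labels).values()))
    # Step 3: entropy of F, weighting each subset's information by its size.
    entropy_f = sum(
        len(subset) / total * expected_information(list(Counter(subset).values()))
        for subset in subsets.values()
    )
    # Step 4: information gain of F.
    return info_s - entropy_f
```

Calling `information_gain` once per feature and sorting the results in descending order gives a ranking from which the highest-scoring features can be kept.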


3.5 K-means Clustering for the Proposed System

The general structure of the first layer of the proposed IDS is presented in Figure 3.4.

[Figure 3.4: First Layer of Proposed Detection Model. Blocks: subset of KDD Cup 99 dataset; transformation and normalization; IG feature selection; training set (60%) and testing set (40%); K-means clustering algorithm with K=2; normal cluster and DoS cluster.]

K-means clustering includes procedures and steps to determine the centroid of each cluster, as shown in Figure 3.5. The K-means training phase determines the centroids of both the normal and the attack clusters. Each centroid is then used in the distance calculation for any incoming packet, which is classified as either normal or attack based on the minimum distance to a cluster centroid. Two distance metrics, the Euclidean and the Manhattan, have been used, and the results and performance of K-means clustering with both metrics have been evaluated. The Manhattan distance metric showed much higher detection rates with reasonable true positive rates when compared to the Euclidean distance on the subset of the KDD Cup 99 dataset.

[Figure 3.5: K-means Clustering Flowchart. Start; set the number of clusters K; select K points at random from the data as initial centroids; calculate the distance of objects to centroids; group based on minimum distance; calculate centroid; is there any object movement between groups? If yes, repeat; if no, store the centroid; End.]


3.5.1 Distance Calculation

The assignment of data points to clusters depends on the distance between the cluster centroid and the data point, so a distance function is required to compute the distance between two objects. Distance functions also affect the size and membership of a cluster, because different distance functions measure the distance between data objects in different ways, and this measurement is the most important step in the creation of clusters. Distance functions should therefore be chosen carefully and according to the dataset. The K-means algorithm generally uses the Euclidean distance. Two distance metrics are used with K-means in this thesis: Euclidean distance and Manhattan distance.

● Euclidean Distance Metric: In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula [83]. By using this formula for distance, Euclidean space becomes a metric space, as shown in Figure 3.6.

Figure 3.6: Euclidean Distance between Two Points

The Euclidean distance between points x and y is the length of the line segment connecting them ($\overline{xy}$). For two input vectors with m quantitative features, x = (x_1, ..., x_m) and y = (y_1, ..., y_m), the distance is:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad \text{Eq. 3.2}$$

● Taxicab Geometry (Manhattan): Taxicab geometry is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The taxicab metric is also known as rectilinear distance, L1 distance or l1 norm, Manhattan distance, or Manhattan length, with corresponding variations in the name of the geometry [84]. The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed, as shown in Figure 3.7.

Figure 3.7: Manhattan Distance Between Two Points

The formula for this distance between a point X = (X_1, X_2, ..., X_n) and a point Y = (Y_1, Y_2, ..., Y_n) is:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad \text{Eq. 3.5}$$

where n is the number of variables, and X_i and Y_i are the values of the i-th variable at points X and Y respectively.
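The two distance metrics, and their use for assigning a record to the nearest cluster centroid, can be sketched in Python as follows. The function names (`euclidean`, `manhattan`, `nearest_centroid`) and the choice of Manhattan distance as the default argument are assumptions of this sketch; the records and centroids are assumed to be numeric vectors of equal length, for example after transformation and normalization.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(x: Vector, y: Vector) -> float:
    """Square root of the sum of squared coordinate differences (Eq. 3.2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x: Vector, y: Vector) -> float:
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_centroid(record: Vector, centroids: Sequence[Vector],
                     distance: Callable[[Vector, Vector], float] = manhattan) -> int:
    """Index of the centroid closest to the record under the given metric."""
    return min(range(len(centroids)), key=lambda i: distance(record, centroids[i]))
```

With two centroids, one for the normal cluster and one for the DoS cluster, the returned index tells which cluster a new record is assigned to; passing `distance=euclidean` switches the metric without changing the rest of the code.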


3.6 Decision Trees as a Model for Intrusion Detection

Intrusion detection can be considered a classification problem in which each connection or user is identified either as one of the attack types or as normal, based on some existing data. Decision Trees can solve this classification problem, as they learn a model from the dataset and can classify new data items into one of the classes specified in the dataset. Decision Trees can be used for misuse intrusion detection, as they learn a model from the training data and predict future data as one of the attack types or as normal based on the learned model. A DT constructs easily interpretable models, which are useful for a security officer to inspect and edit. In this thesis, different sets of if-then rules based on the GainR attribute ranking have been used to construct the DT, and the rule set with the highest detection rate for known and unknown attacks is adopted as the second layer of the proposed IDS.

Rule 1: Root node = flag
If flag = SF and protocol_type = tcp and dst_host_same_srv_rate < 0.94 Then Classification = unknown
If flag = SF and protocol_type = tcp and dst_host_same_srv_rate >= 0.94 Then Classification = back
If flag = SF and protocol_type = udp Then Classification = teardrop
If flag = SF and protocol_type = icmp and src_bytes < 1256 Then Classification = smurf
If flag = SF and protocol_type = icmp and src_bytes >= 1256 Then Classification = pod
If flag = RSTO or SH or OTH or RSTOS0 or S1 or S0 or REJ Then Classification = back

Rule 2: Root node = protocol_type
If protocol_type = tcp and serror_rate <= 0.02 and dst_host_diff_srv_rate <= 0.01 Then Classification = back
If protocol_type = tcp and serror_rate <= 0.02 and dst_host_diff_srv_rate > 0.01 Then Classification = unknown
If protocol_type = tcp and serror_rate > 0.02 Then Classification = neptune
If protocol_type = udp Then Classification = teardrop
If protocol_type = icmp and src_bytes <= 1235 Then Classification = smurf
If protocol_type = icmp and src_bytes > 1235 Then Classification = pod


Rule 3: Root node = srv_serror_rate
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes <= 186 and src_bytes <= 39 Then Classification = unknown
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes <= 186 and src_bytes > 39 Then Classification = smurf
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes > 186 Then Classification = back
If srv_serror_rate <= 0 and wrong_fragment > 0 and protocol_type = tcp or udp Then Classification = teardrop
If srv_serror_rate <= 0 and wrong_fragment > 0 and protocol_type = icmp Then Classification = pod
If srv_serror_rate > 0 Then Classification = neptune

Rule 3, with the feature srv_serror_rate as its root node, showed much higher detection rates than Rule 1 and Rule 2 on the DoS cluster from the first layer. Rule 3 is therefore used to construct the classification Decision Tree for the second layer of the proposed IDS model, as shown in Figure 3.8.
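The Rule 3 conditions translate directly into nested if-then tests. The following Python sketch is an illustration only: the function name `classify_dos` and the representation of a record as a dictionary keyed by KDD feature names are assumptions, and the thresholds (186 and 39 bytes) are taken to apply to the raw, un-normalized byte counts.

```python
from typing import Mapping, Union

Value = Union[str, float]


def classify_dos(record: Mapping[str, Value]) -> str:
    """Apply the Rule 3 if-then conditions to one connection record."""
    if record["srv_serror_rate"] > 0:
        return "neptune"
    if record["wrong_fragment"] > 0:
        # icmp with wrong fragments -> pod; tcp or udp -> teardrop
        return "pod" if record["protocol_type"] == "icmp" else "teardrop"
    if record["dst_bytes"] > 186:
        return "back"
    return "smurf" if record["src_bytes"] > 39 else "unknown"
```

For example, a record with `srv_serror_rate` 0, `wrong_fragment` 0, `dst_bytes` 0 and `src_bytes` 1032 is labelled "smurf" by this sketch.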

[Figure 3.8: Decision Tree Structure for DoS Attack Classification. Root node: srv_serror_rate. srv_serror_rate > 0 leads to Neptune; srv_serror_rate <= 0 leads to wrong_fragment. wrong_fragment > 0 leads to protocol_type (icmp: Pod; tcp or udp: Teardrop); wrong_fragment <= 0 leads to dst_bytes (dst_bytes > 186: Back; otherwise src_bytes, with src_bytes <= 39: Unknown and src_bytes > 39: Smurf).]

Chapter Four Implemented Results and Discussions


4.1 Introduction

This chapter presents the results of a set of tests conducted on the proposed system described in the previous chapter. The test results were obtained on a computer running Windows 8 Pro (64-bit) with an Intel® Core™ i7-3537U CPU @ 2.00 GHz and 8 GB of RAM.

4.2 Training and Testing the Dataset

The KDD (Knowledge Discovery and Data Mining) Cup dataset is divided into two parts, a training dataset and a testing dataset. The training dataset is used to tune the cluster centroids of the K-means clustering for intrusion detection (i.e., to generate normal and attack signatures) and to construct the Decision Tree rules. The testing dataset is used to evaluate the performance of the proposed hybrid system. A subset of 100000 records, extracted from the whole KDD Cup 99 dataset and including both normal and DoS attack records, was used to train and test the system. The training set makes up 60% of the extracted subset, and the remaining 40% is used for testing and validating the system.

4.3 Experiment 1: Results of Pre-processing

The results of this stage cover all preprocessing phases.

4.3.1 Transformation and Normalization

A sample of the KDD Cup 99 dataset is presented in Table 4.1, and the transformation and normalization results for this sample are presented in Table 4.2.

Table 4.1: Sample Records of KDD Cup 99 Dataset

[Table 4.1: ten sample connection records, each with protocol_type = Tcp, service = http, and flag = SF; the remaining cell values are scrambled in the extracted text and their row and column positions cannot be recovered]

Table 4.2: Transformed Nominal Data and Normalized Numeric Data Samples of KDD Cup 99 Dataset

[Table 4.2: the same ten sample records after transformation of the nominal fields to numeric codes and normalization of the numeric fields; the cell values are scrambled in the extracted text and their row and column positions cannot be recovered]

Table 4.3 presents the proportions of the Normal and DoS attack classes in the 100000-record subset of the KDD Cup 99 dataset used to train and test the proposed IDS.

Table 4.3: Proportions of the Normal and DoS Classes in the Data Subset

Class     Full Subset (100000 Records)   Training (60%)   Test (40%)
Normal    19600 (19.6%)                  11760            7840
DoS       80400 (80.4%)                  48240            32160
Total     100000 (100%)                  60000            40000

The proportions of each class are calculated as follows:

1- Normal Class:
Proportion of Normal records in the Full Subset = $\frac{19600}{100000} \times 100\% = 19.6\%$
No. of Normal records in the Training set (60%) = $\frac{19600 \times 60\%}{100\%} = 11760$
No. of Normal records in the Test set (40%) = $\frac{19600 \times 40\%}{100\%} = 7840$

2- DoS Class:
Proportion of DoS records in the Full Subset = $\frac{80400}{100000} \times 100\% = 80.4\%$
No. of DoS records in the Training set (60%) = $\frac{80400 \times 60\%}{100\%} = 48240$
No. of DoS records in the Test set (40%) = $\frac{80400 \times 40\%}{100\%} = 32160$
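The same arithmetic can be checked programmatically. The sketch below assumes only the class counts of Table 4.3 and a 60% training fraction; the function names are illustrative.

```python
def class_proportions(class_counts):
    """Return the percentage share of each class in the subset."""
    total = sum(class_counts.values())
    return {label: round(100 * count / total, 1)
            for label, count in class_counts.items()}


def split_counts(class_counts, train_fraction=0.6):
    """Return (training, test) record counts per class for a fixed split."""
    split = {}
    for label, count in class_counts.items():
        train = round(count * train_fraction)
        split[label] = (train, count - train)  # the remainder goes to testing
    return split


subset = {"Normal": 19600, "DoS": 80400}
proportions = class_proportions(subset)
counts = split_counts(subset)
```

Applying the same fraction to every class keeps the Normal-to-DoS ratio identical in the training and test sets.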


4.3.2 Features Ranking and Subset Selection

Information Gain (IG) is used to rank the features, and the feature subset is then selected from the highest-ranked features. The results of attribute ranking by IG over a sample of the KDD dataset consisting of normal and DoS attack records are shown in Table 4.4.

Table 4.4: Attribute Ranking by Information Gain

Attribute Rank  Sr. No.  Attribute name
0.6832          5        src_bytes
0.67616         6        dst_bytes
0.67143         37       dst_host_srv_diff_host_rate
0.50454         23       Count
0.49037         3        Service
0.46333         12       logged_in
0.27043         33       dst_host_srv_count
0.25604         32       dst_host_count
0.23577         34       dst_host_same_srv_rate
0.23241         35       dst_host_diff_srv_rate
0.22553         36       dst_host_same_src_port_rate
0.19651         25       serror_rate
0.19651         30       diff_srv_rate
0.19651         29       same_srv_rate
0.18667         4        Flag
0.18086         38       dst_host_serror_rate
0.18086         39       dst_host_srv_serror_rate
0.18086         26       srv_serror_rate
0.13657         24       srv_count
0.10854         2        protocol_type
0.05489         31       srv_diff_host_rate
0.02458         8        wrong_fragment
0.01921         41       dst_host_srv_rerror_rate
0.01921         40       dst_host_rerror_rate
0.01877         10       Hot
0.01678         13       num_compromised
0.00583         27       rerror_rate
0.00583         28       srv_rerror_rate
0               7        Land
0               9        Urgent
0               1        Duration
0               19       num_access_files
0               18       num_shells
0               22       is_guest_login
0               20       num_outbound_cmds
0               21       is_host_login
0               14       root_shell
0               11       num_failed_logins
0               17       num_file_creations
0               15       su_attempted
0               16       num_root
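To illustrate how an Information Gain score of the kind listed in Table 4.4 is obtained, the following minimal sketch computes IG for one discrete attribute against the class labels. It assumes attribute values are already discrete (continuous KDD features would need to be discretized first), and the toy records are hypothetical rather than taken from the dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the entropy left after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy sample: four records with two candidate attributes.
labels = ["normal", "normal", "dos", "dos"]
protocol = ["tcp", "tcp", "icmp", "icmp"]   # separates the classes perfectly
flag = ["SF", "SF", "SF", "SF"]             # constant, carries no information
```

Here `information_gain(protocol, labels)` evaluates to 1 bit and `information_gain(flag, labels)` to 0, which mirrors why attributes such as Land or Urgent receive a rank of 0 when they do not help separate the classes.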

Gain Ratio (GainR) attribute ranking is used as a preprocessing step for constructing the C4.5 Decision Tree, to determine which features to build the tree from based on the amount of information each feature carries. The attribute with the highest GainR is selected as the splitting attribute. The results of attribute ranking by Gain Ratio are presented in Table 4.5.

Table 4.5: Attribute Ranking Using GainR for C4.5 DT

Attribute Rank  Sr. No.  Attribute name
1               2        protocol_type
1               26       srv_serror_rate
1               12       logged_in
1               5        src_bytes
1               6        dst_bytes
1               37       dst_host_srv_diff_host_rate
1               40       dst_host_rerror_rate
1               41       dst_host_srv_rerror_rate
1               38       dst_host_serror_rate
0.987           8        wrong_fragment
0.969           10       Hot
0.955           4        Flag
0.93            25       serror_rate
0.93            30       diff_srv_rate
0.93            29       same_srv_rate
0.884           13       num_compromised
0.841           35       dst_host_diff_srv_rate
0.815           23       Count
0.789           3        Service
0.718           31       srv_diff_host_rate
0.679           33       dst_host_srv_count
0.611           1        Duration
0.611           24       srv_count
0.577           28       srv_rerror_rate
0.577           27       rerror_rate
0.486           34       dst_host_same_srv_rate
0.454           36       dst_host_same_src_port_rate
0.29            32       dst_host_count
0               7        Land
0               9        Urgent
0               19       num_access_files
0               18       num_shells
0               22       is_guest_login
0               20       num_outbound_cmds
0               21       is_host_login
0               14       root_shell
0               11       num_failed_logins
0               17       num_file_creations
0               15       su_attempted
0               16       num_root
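Gain Ratio, as used by C4.5, divides the information gain of an attribute by the entropy of the attribute's own value distribution (the split information), which penalizes attributes that fragment the data into many small groups. The sketch below is a minimal illustration under the assumption of discrete attribute values; the toy data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    total = len(items)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of the split on values, normalized by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g)
                                 for g in groups.values())
    split_info = entropy(values)
    # A constant attribute has zero split information and is given ratio 0.
    return gain / split_info if split_info > 0 else 0.0


labels = ["normal", "normal", "dos", "dos"]
protocol = ["tcp", "tcp", "icmp", "icmp"]   # clean two-way split
record_id = ["a", "b", "c", "d"]            # unique per record
```

Both toy attributes have an information gain of 1 bit, but `gain_ratio(protocol, labels)` is 1.0 while `gain_ratio(record_id, labels)` drops to 0.5, because the identifier-like attribute pays for its four-way split.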

4.4 Experiment 2: K-means Clustering (First Layer)

In this stage, the K-means clustering algorithm is implemented with two distance metrics, Euclidean and Manhattan, to determine which metric yields the lowest error rate. K-means was run on the subset of the KDD Cup 99 dataset in three ways: with the full feature set, with the 10 highest-ranked features by IG, and with the 20 highest-ranked features by IG.


Experiments show that the Manhattan distance function is more accurate than the Euclidean distance function in terms of detection and false alarm rates. Furthermore, the Manhattan distance requires less computation than the Euclidean distance, which reduces the running time of K-means. The results obtained with the 20 highest-ranked features by IG were more accurate, with fewer false alarms, than those obtained with the other two feature sets. The centroids for K-means clustering using the Euclidean and Manhattan metrics are presented in Tables 4.6 and 4.7, respectively.

Table 4.6: Attributes Centroid Using Euclidean Distance Metric for 20 Features with Highest Ranking

Feature Name                   Normal      DoS
src_bytes                      6518.4059   0
dst_bytes                      1692.6453   0
dst_host_srv_diff_host_rate    0.0185      0
Count                          214.5045    120.8646
Service                        80          67
logged_in                      0           0
dst_host_srv_count             178.2433    10.0378
dst_host_count                 162.4904    204
dst_host_same_srv_rate         0.8297      0.0781
dst_host_diff_srv_rate         0.0062      0.083
dst_host_same_src_port_rate    0.4122      0.0097
serror_rate                    0.0005      1
diff_srv_rate                  0.0012      0.0654
same_srv_rate                  0.9995      0.0841
Flag                           13          9
dst_host_serror_rate           0           1
dst_host_srv_serror_rate       0           1
srv_serror_rate                0           1
srv_count                      214.585     10.4016
protocol_type                  1           1


Table 4.7: Attributes Centroid Using Manhattan Distance Metric for 20 Features with Highest Ranking

Feature Name                   Normal   DoS
src_bytes                      1032     0
dst_bytes                      0        0
dst_host_srv_diff_host_rate    0        0
Count                          25       120
Service                        80       67
logged_in                      0        0
dst_host_srv_count             255      10
dst_host_count                 186      255
dst_host_same_srv_rate         1        0.05
dst_host_diff_srv_rate         0        0.07
dst_host_same_src_port_rate    0.14     0
serror_rate                    0        1
diff_srv_rate                  0        0.07
same_srv_rate                  1        0.09
Flag                           13       9
dst_host_serror_rate           0        1
dst_host_srv_serror_rate       0        1
srv_serror_rate                0        1
srv_count                      25       10
protocol_type                  1        1
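Once the centroids are fixed, the first layer labels a record by the nearest centroid under the chosen distance function. The sketch below is a simplified illustration, assuming records reduced to three rate features that already lie in [0, 1]; the centroid values are the Normal and DoS entries of Table 4.7 for serror_rate, same_srv_rate, and dst_host_same_srv_rate, and the function names are illustrative rather than taken from the thesis implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan distance: summed absolute differences (no squares or roots)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# (serror_rate, same_srv_rate, dst_host_same_srv_rate) from Table 4.7.
CENTROIDS = {
    "Normal": (0.0, 1.0, 1.0),
    "DoS": (1.0, 0.09, 0.05),
}


def assign_cluster(record, distance=manhattan):
    """Return the name of the centroid closest to the record."""
    return min(CENTROIDS, key=lambda name: distance(record, CENTROIDS[name]))
```

For example, `assign_cluster((0.9, 0.1, 0.1))` selects the DoS centroid, and passing `distance=euclidean` swaps the metric without changing the rest of the procedure. The Manhattan function avoids the squaring and square-root steps, which is the source of its lower per-comparison cost.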

The evaluation results of the K-means clustering tests with the Euclidean and Manhattan distance functions on the three data sets (full dataset, 10 features, and 20 features) are presented in Tables 4.8, 4.9, and 4.10, respectively.


Table 4.8: Evaluation and Results of K-means with Distance Functions Using the Full Dataset

Parameter                      K-means with Euclidean Function   K-means with Manhattan Function
Accuracy                       77.6786%                          89.404%
Error Rate                     22.3214%                          10.596%
Average True Positive Rate     77.7%                             89.4%
Average False Positive Rate    24.9%                             9.9%
Average Precision              84.3%                             90.5%
Average Recall                 77.7%                             89.4%
Average F-Measure              76.2%                             89.4%
Mean absolute error            0.2232                            0.1455
Root mean squared error        0.4725                            0.2769

Table 4.9: Evaluation and Results of K-means with Distance Functions Using the Highest 10 Features Ranked by IG

Parameter                      K-means with Euclidean Function   K-means with Manhattan Function
Accuracy                       86%                               93.8492%
Error Rate                     14%                               6.1508%
Average True Positive Rate     86%                               93.8%
Average False Positive Rate    15.2%                             5.9%
Average Precision              89%                               94%
Average Recall                 86%                               93.8%
Average F-Measure              85.6%                             93.9%
Mean absolute error            0.14                              0.0966
Root mean squared error        0.3742                            0.2116


Table 4.10: Evaluation and Results of K-means with Distance Functions Using the Highest 20 Features Ranked by IG

Parameter                      K-means with Euclidean Function   K-means with Manhattan Function
Accuracy                       90.0875%                          98.2143%
Error Rate                     9.9125%                           1.7857%
Average True Positive Rate     90.1%                             98.2%
Average False Positive Rate    9.8%                              1.9%
Average Precision              90.8%                             98.2%
Average Recall                 90.8%                             98.2%
Average F-Measure              90.8%                             98.2%
Mean absolute error            0.1247                            0.0366
Root mean squared error        0.243                             0.1241

Figure 4.1 compares the two distance functions graphically for K-means on the different sets of data. The figure clearly indicates that, in terms of accuracy and error rate, the Manhattan function with the data subset of the 20 highest-ranked features performs better than the Euclidean function and the other feature sets.

[Figure 4.1: bar chart on a 0% to 100% scale showing Accuracy, Error Rate, TPR, FPR, Precision, and F-Measure for K-means with the Euclidean and Manhattan functions on the full dataset, the 10 highest-ranked features, and the 20 highest-ranked features]

Figure 4.1: Comparative Chart of Distance Functions Values Using K-means


4.5 Experiment 3: C4.5 Decision Tree (Second Layer)

This stage implements the C4.5 algorithm to generate a set of rules from a given training set containing the DoS attack types, so that the attack types can be correctly classified. Because it is a supervised algorithm, C4.5 is very accurate in classification and in reducing false alarms, and its rules are easy for users to understand. Three sets of rules generated by C4.5 were tested and evaluated to determine the best tree structure for the second layer of the proposed IDS. The detection rates and performance of the C4.5 Decision Tree with the three rules are presented in Table 4.11. Rule 3 shows very high Accuracy with a low Error Rate and False Positive Rate.

Table 4.11: Evaluation and Results of C4.5 Algorithm

Parameter                      Rule 1      Rule 2      Rule 3
Accuracy                       93.5644%    98.5149%    99.9136%
Error Rate                     6.4356%     1.4851%     0.0864%
Average True Positive Rate     93.6%       98.5%       99.9%
Average False Positive Rate    6.7%        1.5%        0%
Average Precision              94.1%       98.5%       99.9%
Average Recall                 93.6%       98.5%       99.9%
Average F-Measure              93.5%       98.5%       99.9%
Mean absolute error            0.0669      0.0163      0.0003
Root mean squared error        0.2415      0.1218      0.0186
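The accuracy, error rate, true and false positive rates, precision, recall, and F-measure reported in the evaluation tables can all be derived from a confusion matrix. The sketch below shows one standard way to do this; the confusion counts are hypothetical and are not the thesis results, and how the per-class values are averaged into the "Average" rows is not specified here.

```python
def evaluate(confusion):
    """Compute overall and per-class metrics from confusion[actual][predicted]."""
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in labels)
    report = {"accuracy": correct / total,
              "error_rate": (total - correct) / total,
              "per_class": {}}
    for c in labels:
        tp = confusion[c][c]
        fn = sum(confusion[c].values()) - tp
        fp = sum(confusion[a][c] for a in labels) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0   # same as the TPR
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        report["per_class"][c] = {
            "tpr": recall,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "precision": precision,
            "f_measure": f_measure,
        }
    return report


# Hypothetical counts for illustration only.
example = {"normal": {"normal": 90, "dos": 10},
           "dos": {"normal": 5, "dos": 95}}
result = evaluate(example)
```

With these example counts, 185 of 200 records are classified correctly, so the accuracy is 0.925 and the error rate is 0.075; accuracy and error rate always sum to 1, as they do in Tables 4.8 to 4.11.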


4.6 The Graphical User Interface (GUI)

The core of the designed system uses unsupervised learning to detect network intrusions. The detection techniques based on unsupervised learning algorithms were implemented and tested, showing a very high detection rate with a low false positive rate. The designed GUI enables user-friendly handling of the system core. The main system window is shown in Figure 4.2. The model was implemented on the Java platform with JDK 7 under the NetBeans 7.4 IDE, a powerful integrated development environment for developing applications on Java platforms. The main GUI of the detection model consists of a menu bar, several sub-windows, and a number of buttons. The IDS menu contains buttons to capture and analyze packets, stop capturing, and extract normal and attack packets. The Log menu covers operations on the log file, such as activate log, open log, and clear log. The Exit menu contains an Exit command and an Exit with clear log file command. The Start Capture and Analyze button is linked directly to the IDS analysis algorithms. Normal packets are the outcome of the K-means normal cluster, while the classified DoS attack types are the outcome of the K-means DoS cluster, which is passed to the C4.5 algorithm for further classification. The analysis of packets appears in the packet analysis window, as shown in Figure 4.3.


Figure 4.2: Main GUI of the Detection Model

Figure 4.3: Capturing and Classification of Network Traffic by the System

The End Capture button stops capturing network traffic. The Normal Packets and Attack Packets buttons extract the captured and classified packets with their types, presenting the normal packets in the normal packets window and the attack packets in the attack packets window, as shown in Figure 4.4.


Also, the main window contains an Exit command to terminate the execution of the application.

Figure 4.4: Extracting Normal and Attack Packets from Captured Packets

One of the most important functions of this system is the ability to activate a log file that records all actions captured by the IDS. All captured packets are stored in the log file with their classification type, time, and date, as shown in Figure 4.5. The user can open the log file with the Open Log button or clear it with the Clear Log button. The Exit & Clear Log button gives the user the option to terminate the application and clear the log file content.


Chapter Five Conclusions and Future Works


5.1 Conclusions

This work has produced an IDS that can efficiently detect DoS attacks by outside and inside intruders in a network-based system. The IDS monitors all network traffic behavior and tests it to determine whether it is normal or a DoS attack. Combining unsupervised learning (K-means clustering) with supervised learning (the C4.5 decision tree algorithm) made it possible to achieve the proposed goals.

Analysis of the results presented in Chapter 4 leads to the following conclusions:

1. To avoid poor feature selection, which would negatively affect the overall performance of the presented IDS model, the Information Gain algorithm is used in this research to find suitable subsets of relevant features with optimal sensitivity and the highest discriminatory power for the selected attack category within the subset of the KDD dataset.

2. The K-means algorithm was chosen to evaluate the performance of an unsupervised learning method for anomaly detection. The evaluation of K-means with feature selection confirms that a high detection rate can be achieved while maintaining a low false alarm rate (DR = 98.2143%, Error Rate = 1.7857%), compared with the results reported in [14], which uses K-means alone to analyze and partition the KDD Cup 99 dataset (DR ≈ 96%, Error Rate ≈ 4%).


3. K-means clusters data using similarity measures based on a distance function, so the metric used to calculate the distance affects both the overall performance and the processing time. The results show that the Manhattan function achieves a higher detection rate with fewer false alarms than the Euclidean function (Manhattan distance: DR = 98.2143%, Error Rate = 1.7857%; Euclidean distance: DR = 90.0875%, Error Rate = 9.9125%).

4. Classification of DoS attack types was made possible by applying the C4.5 Decision Tree (a supervised learning algorithm) to the outcome of the K-means clustering algorithm. Once the tree structure is built, C4.5 can classify traffic with a detection rate (DR) approaching 100% (DR = 99.9136%, Error Rate = 0.0864%), because its supervised nature makes it very accurate in detecting and classifying the known patterns it has learned.

5. Because an unsupervised learning algorithm is used as the detection model in the first layer, the system does not depend on prior knowledge of attack types, and its performance is not reduced when the IDS encounters unknown attacks. With the unsupervised layer there is no need for daily updates or for inserting new attack types into the IDS database, since the system clusters data based on similarity to a cluster centroid.


5.2 Future Works

This research can be extended in several ways, listed as follows:

1- The presented IDS model classifies network packets into two classes: normal and DoS. The model can therefore be extended in the future to classify network activities based on the intrusion categories.

2- The accuracy of the model may be further improved by using an extended version of the K-means algorithm called the K-modes algorithm. The K-modes algorithm extends the K-means paradigm to cluster large categorical data by using:
• A simple matching dissimilarity measure for categorical objects.
• Modes instead of means for clusters.

3- A graphical user interface (GUI) has been designed as part of implementing the system. Although the GUI of the implemented system currently takes its input data from the KDD dataset, it could be amended to read real packet data by adding and using some Java packages and classes.
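The two K-modes ingredients named in suggestion 2 can be sketched briefly. The following is a minimal illustration of the simple matching dissimilarity and of the cluster mode, assuming records are tuples of categorical values; the sample records and function names are hypothetical, and the sketch is not a full K-modes implementation.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attribute positions at which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(records):
    """Return the most frequent value of each attribute across the records."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*records))


# Hypothetical records: (protocol_type, service, flag).
cluster = [("tcp", "http", "SF"),
           ("tcp", "http", "S0"),
           ("icmp", "ecr_i", "SF")]
center = cluster_mode(cluster)
```

Here the mode of the cluster is `("tcp", "http", "SF")`, and `matching_dissimilarity(("tcp", "http", "SF"), ("tcp", "smtp", "S0"))` is 2. These two functions would take the place of the mean and the numeric distance in the K-means update and assignment steps.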


References:

[1] Ashok K. Sahu, and Gulam Rasul, (2011), "Use of IT and Its Impact on Service Quality in an Academic Library", Library Philosophy and Practice (e-journal), Libraries at University of Nebraska-Lincoln, ISSN: 1522-0222.
[2] Dhruv A. Patel, and Prof. Hasmukh Patel, (2014), "Detection and Mitigation of DDOS Attack against Web Server", International Journal of Engineering Development and Research (IJEDR), Volume 2, Issue 2, ISSN: 2321-9939.
[3] R. Vijayasarathy, (Feb. 2012), "A Systems Approach to Network Modelling for DDoS Attack Detection using Naïve Bayes Classifier", Master thesis, Department of Computer Science and Engineering, Indian Institute of Technology, Madras.
[4] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, (Aug. 2003), "Inside the Slammer Worm", IEEE Security and Privacy, vol. 1, no. 4, pp. 33-39.
[5] C. Kruegel, F. Valeur and G. Vigna, (2004), Intrusion Detection and Correlation: Challenges and Solutions, ©2005 Springer Science + Business Media, Inc. Boston, ISBN: 978-0-387-23398-7, pp. 11-12.
[6] D. Gollmann, (2006), Computer Security. Wiley, 2nd edition, ISBN 0470862939, John Wiley & Sons, New York, NY, pp. 251-252.
[7] G. Vigna, and R.A. Kemmerer, (April 2002), "Intrusion Detection: A Brief History and Overview", Security and Privacy, supplement to IEEE Computer, pp. 27-30.
[8] James P. Anderson, (April 1980), "Computer Security Threat Monitoring and Surveillance", Technical report, James P. Anderson Co., Fort Washington, Pennsylvania.
[9] Anderson D., Frivold T., and Valdes A., (May 1995), "Next Generation Intrusion Detection Expert System", Computer Science Laboratory.


[10] Kumar, S., (1995), "Classification and detection of computer Intrusion", Ph.D. Thesis, Department of Computer Science, Purdue University.
[11] Andrew Sung, Guadalupe Janoski, and Srinivas Mukkamala, (2002), "Intrusion detection using neural networks and support vector machines", Proceedings of the International Joint Conference on Neural Networks, IJCNN, Honolulu, HI, ISSN: 1098-7576, DOI: 10.1109/IJCNN.2002.1007774, pp. 1702-1707.
[12] Aikaterini Mitrokotsa, and Christos Douligeris, (Dec. 2005), "Detecting Denial of Service Attacks Using Emergent Self-Organizing Maps", Signal Processing and Information Technology. Proceedings of the Fifth IEEE International Symposium.
[13] R Rajesh, and Shina Sheen, (2008), "Network Intrusion Detection using Feature Selection and Decision tree classifier", TENCON IEEE Region 10 Conference, Hyderabad, DOI: 10.1109/TENCON.2008.4766847.
[14] Bian Ling, Meng Jianliang, and Shang Haikun, (2009), "The Application on Intrusion Detection Based on K-means Cluster Algorithm", Information Technology and Applications. IFITA International Forum IEEE.
[15] Affendey, Ektefa, Memar, and Sidi, (March 2010), "Intrusion Detection Using Data Mining Techniques", Information Retrieval and Knowledge Management (CAMP), IEEE International Conference, Shah Alam, Selangor, pp. 200-203, DOI: 10.1109/INFRKM.2010.5466919.
[16] Bharti K., Jain S., and Shukla S., (2010), "Fuzzy K-mean Clustering via Random Forest for Intrusion Detection System", International Journal on Computer Science and Engineering, Vol. 02(6), pp. 2197-2200.
[17] A. Bhaskar, and B. K. Kumar, (June 2012), "Identifying Network Anomalies Using Clustering Technique in Weblog Data", International Journal of Computers & Technology, Volume 2, No. 3.
[18] Reyadh Sh. Naoum, and Wafa' S. Al-Sharafat, (April 2009), "Adaptive Framework for Network Intrusion Detection by Using Genetic-Based Machine Learning Algorithm", IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 4.
[19] Munish Sharma, and Tajinder Kaur, (2014), "A Study on Network Intrusion Detection Based on Proactive Mechanism", International Journal of Emerging Research in Management & Technology, Volume 3, Issue 1, ISSN: 2278-9359.
[20] D.P. Gaikwad, Kunal Thakare, Sonali Jagtap, and Vaishali Budhawant, (Nov. 2012), "Anomaly Based Intrusion Detection System Using Artificial Neural Network and Fuzzy Clustering", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 9, ISSN: 2278-0181.
[21] Matt Bishop, (2002), Computer Security: Art and Science (1st ed.), Addison-Wesley Professional, ISBN: 0-201-44099-7, pp. 10-11.
[22] Deeman Y. Mahmood, and Dr. Mohammed A. Hussein, (Dec. 2013), "Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction", International Organization of Scientific Research Journal of Computer Engineering (IOSR-JCE), Volume 15, Issue 5, DOI: 10.9790/0661-155107112, pp. 107-112.
[23] Deeman Yousif Mahmood, and Dr. Mohammed Abdulla Hussein, (2014), "Analyzing NB, DT and NBTree Intrusion Detection Algorithms", Journal of Zankoy Sulaimani - Part A (JZS-A), Volume 16, No. 1.
[24] Ajith Abraham, Crina Grosan, and Yuehui Chen, (2005), "Cyber Security and the Evolution in Intrusion Detection Systems", Journal of Engineering and Technology, Volume 1, Issue 1, pp. 74-81.
[25] R. Bace, (1999), "An introduction to intrusion Detection Assessment for system and network security management", Infidel Inc., prepared for ICSA Inc.
[26] Ganesh Prasad, Sandip Sonawane, and Shailendra Pardeshi, (2012), "A survey on intrusion detection techniques", World Journal of Science and Technology, 2(3), ISSN: 2231-2587, pp. 127-133.


[27] Neethu B, (March 2013), "Adaptive Intrusion Detection Using Machine Learning", IJCSNS International Journal of Computer Science and Network Security, Volume 13, No. 3, pp. 118-124.
[28] Abraham A., Panda M., and Patra M. R., (2010), "Discriminative Multinomial Naïve Bayes for Network Intrusion Detection", IEEE Sixth International Conference on Information Assurance and Security (IAS), DOI: 10.1109/ISIAS.2010.5604193, pp. 5-10.
[29] B. Pearlmutter, C. Warrender, and S. Forrest, (1999), "Detecting intrusions using system calls: alternative data models", IEEE Symposium on Security and Privacy, pp. 133-145.
[30] Dr. Sameer Shrivastava, (April 2012), "Case Study on JAVA based IDS", International Journal of Scientific & Engineering Research, Volume 3, No. 4.
[31] Henok Alene, (Oct. 2011), "Graph Based Clustering for Anomaly Detection in IP Networks", Master thesis, Department of Information and Computer Science, School of Science, Aalto University.
[32] C. Verbowski, H. J. Wang, J. R. Lorch, P. M. Chen, S. T. King, and Y. Wang, (2006), "SubVirt: Implementing malware with virtual machines", IEEE Symposium on Security and Privacy, pp. 314-327.
[33] C. Kreibich, M. Handley, and V. Paxson, (2001), "Network intrusion detection: Evasion, traffic normalization, and end-to-end protocol semantics", USENIX Security Symposium, pp. 115-131.
[34] Rebecca Bace, and Peter Mell, (2001), "NIST Special Publication on Intrusion Detection Systems", Infidel, Inc., Scotts Valley, CA, National Institute of Standards and Technology.
[35] Leonard J. LaPadula, and Therese R. Metcalf, (2000), "Intrusion Detection System Requirements", Center for Integrated Intelligence Systems, Bedford, Massachusetts, © MITRE Corporation.


[36] A. Appa Rao, B. Chakravarthy, K. Marx, P. Kiran and P. Srinivas, (2006), "A Java Based Network Intrusion Detection System (IDS)", Proceedings of the 2006 IJME - INTERTECH Conference.
[37] A. Movaghar, and F. Sabahi, (2008), "Intrusion Detection: A Survey", IEEE Third International Conference on Systems and Networks Communications, DOI: 10.1109/ICSNC.2008.44.
[38] Anil A. Ahlawat, and Brijpal Singh, (2013), "Intrusion detection of Network Attacks Using Artificial Neural Networks & Fuzzy Logic", International Journal of Engineering & Management Technology, ISSN: 2320-7043, Volume 1, Issue 1, pp. 53-66.
[39] Elham Hormozi, Hadi Hormozi, and Hamed Rahimi Nohooji, (2012), "The Classification of the Applicable Machine Learning Methods in Robot Manipulators", International Journal of Machine Learning and Computing, Volume 2, No. 5, pp. 560-563.
[40] Taiwo Oladipupo Ayodele, (2010), New Advances in Machine Learning, Chapter three: Types of Machine Learning Algorithms, pp. 20-23, ISBN: 978-953-307-034-6, InTech, University of Portsmouth, United Kingdom.
[41] Salahedin Ali Namroush, and Shauki Abdusalam Fatshul, (2006), "security issues, attack Trends related to the confidentiality, integrity, and availability of information assets on an organization's computer system", Proceedings of the Postgraduate Annual Research Seminar, Center of Advanced Software Engineering (CASE), University Technology Malaysia.
[42] Yash Batra, (2013), "IP Spoofing", International Indexed & Refereed Research Journal, ISSN: 0974-2832, Volume V, Issue 59, pp. 44-46.
[43] D.K. Bhattacharyya, J.K. Kalita, Monowar H. Bhuyan, N. Hoque, and R.C. Baishya, (2014), "Network attacks: Taxonomy, tools and systems", ELSEVIER Journal of Network and Computer Applications, pp. 307-324.


[44] Steven J. Templeton, and Karl E. Levitt, (2003), "Detecting spoofed packets", IEEE DARPA Information Survivability Conference and Exposition Proceedings, Vol. 1, DOI: 10.1109/DISCEX.2003.1194882, pp. 164-175.
[45] Sharmin Rashid, and Subhra Prosun Paul, (2013), "Proposed Methods of IP Spoofing Detection & Prevention", International Journal of Science and Research (IJSR), India Online ISSN: 2319-7064, Volume 2, Issue 8, pp. 438-444.
[46] Mridu Sahu, and Rainey C. Lal, (2012), "CONTROLLING IP SPOOFING THROUGH PACKET FILTERING", International Journal of Computer Technology & Applications (IJCTA), Volume 3 (1), pp. 155-159.
[47] Victor Velasco, (2000), "Introduction to IP Spoofing", SANS Institute InfoSec Reading Room, URL: http://www.sans.org/reading-room/whitepapers/threats/introduction-ip-spoofing-959
[48] Srinivas Aluvala, (2011), "Inter-domain Packet Filters To Control IPForging", Research Journal of Computer Systems Engineering - An International Journal, ISSN: 2230-8563, Volume 2, Issue 2, pp. 67-72.
[49] M. Rajasekhar, and S. Kishor Kumar, (2012), "GROUP SIGNATURE PROTOCOL FOR SECURITY TREATMENT FOR BLOCKING MISBEHAVING USERS", International Journal of Research Sciences and Advanced Engineering, ISSN: 2319-6106, Volume 2 (5), pp. 30-34.
[50] G. Sindhuri, K. Sachin, K. Sravani, and Y. Madhavi Latha, (2012), "A novel approach for the detection of SYN Flood Attack", International Journal of Computer Trends and Technology, ISSN: 2231-2803, Volume 3, Issue 2, pp. 286-289.
[51] Gurjinder Kaur, V. K. Jain, and Yogesh Chaba, (2011), "Distributed Denial of Service Attacks in Mobile Adhoc Networks", World Academy of Science, Engineering and Technology (WASET), Volume 5, No. 1, pp. 591-593.
[52] A. Vasavi, B V Ramana Murthy, and Vuppu Padmakar, (2014), "Significances and Issues of Network Security", International Journal of Advanced Research in Computer and Communication Engineering, ISSN: 2319-5940, Volume 3, Issue 6.
[53] Jon Erickson, (2008), HACKING the art of exploitation (2nd ed.), San Francisco, ISBN: 1-59327-144-1, pp. 256-258.
[54] DARPA Intrusion Detection Evaluation, Lincoln Laboratory Massachusetts Institute of Technology (MIT), Article about Back DoS attack, URL: http://www.ll.mit.edu/mission/communications/cyber/CSTcorpora/ideval/docs/attackDB.html#back
[55] Daniel Barbará, and Sushil Jajodia, Application of Data Mining in Computer Security, George Mason University, Kluwer Academic Publishers, Boston, Dordrecht, London, pp. 34-35.
[56] A. M. Chandrashekhar, and K. Raghuveer, (2012), "Performance evaluation of data clustering techniques using KDD Cup-99 Intrusion detection data set", International Journal of Information & Network Security (IJINS), ISSN: 2089-3299, Volume 1, No. 4, pp. 294-305.
[57] Khoshgoftaar M., Napolitano A., Van J., and Wald R., (2009), "Feature Selection with High-Dimensional Imbalanced Data", IEEE International Conference on Data Mining Workshops, ICDMW '09, DOI: 10.1109/ICDMW.2009.35, pp. 507-514.
[58] Article about: Clustering high-dimensional data, URL: http://en.wikipedia.org/wiki/Clustering_high-dimensional_data.
[59] GU Chun-hua, LIN Jia-jun, and ZHANG Xue-qin, (2006), "INTRUSION DETECTION SYSTEM BASED ON FEATURE SELECTION AND SUPPORT VECTOR MACHINE", IEEE, Communications and Networking in China, 2006. ChinaCom. First International Conference, DOI: 10.1109/CHINACOM.2006.344739.
[60] L. Ladha, and T. Deepa, (2011), "Feature Selection Methods and algorithms", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Volume 3, No. 5.

80

[61] Lioyd A. Smith, and Mark A. Hall, (1999) "Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper", American Association for Artificial Intelligence. [62] Cheng-Hong Yang, Cheng-San Yang, Jung-Chike Li, and Li-Yeh Chuang, (2008),

"Information Gain with Chaotic Genetic Algorithm for Gene

Selection and Classification Problem", Systems, Man and Cybernetics, IEEE International Conference, DOI: 10.1109/ICSMC.2008.4811433 [63] Kumar V., Steinbach M., and Tan P.N, (2006) Introduction to Data Mining, Addison-Wesley. [64] Jain A. K., (2010), "Data clustering: 50 years beyond K-means in Pattern Recognition Letters", Elsevier B.V., Volume 31, Issue 8, pp. 651-666. [65] P.G Student, (2014), "Evaluation of Similarities Measure in Document Clustering", International Journal of Science and Research (IJSR), ISSN: 2319-7064, Volume 3 Issue 1, pp. 39-41. [66] Jayna Shah, Neha Soni, and Rimi Gupta, (2012),"Analytical Comparison of Some Traditional Partitioning based and Incremental Partitioning based Clustering Methods", International Journal of Computer Applications, ISSN: 0975-8887, Volume 59, No.10, pp. 8-12. [67] Dheeraj Panwar, Himadri Chauhan, and Vipin Kumar, (2013), "K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 3, Issue 4. [68] D.K.Ghosh, and P.Indira Priya, (2012), "K-means Clustering Algorithm Characteristics Differences based on Distance Measurement", International Journal of Computer Applications ISSN: 0975-8887, Volume 59, No.14. [69] Ahmed Hassan, A. M. Riad, Ibrahim Elhenawy , and Nancy Awadallah, (2013) "VISUALIZE NETWORK ANOMALY DETECTION BY USING K-MEANS CLUSTERING ALGORITHM", International Journal of Computer Networks & Communications (IJCNC) Vol.5, No.5.

81

[70] Neha Jain, and Shikha Sharma, (2012), "The Role of Decision Tree Technique for Automating Intrusion Detection System", International Journal of Computational Engineering Research, Volume 2, Issue 4. [71] Upendra, and Yogendra Kumar, (2012), "An Efficient Intrusion Detection Based on Decision Tree Classifier Using Feature Reduction", International Journal of Scientific and Research Publications, ISSN 2250-3153, Volume 2, Issue 1. [72] Article

about

Decision

Trees,

URL:

http://en.wikipedia.org/wiki/Decision_tree. [73] J. Rose Quinlan, (1993), C4.5: programs for machine learning, Morgan Kaufmann Publishers, Inc. San Mateo, CA. [74] Jeff Markey, (2011), "Using Decision Tree Analysis for Intrusion Detection: A How-To Guide", SANS Institute InfoSec Reading Room, URL: http://www.sans.org/reading-room/whitepapers/detection/decision-treeanalysis-intrusion-detection-how-to-guide-33678 [75] ID3 (Iterative Dichotomiser 3) is an algorithm invented by J. Ross Quinlan, URL: http://en.wikipedia.org/wiki/ID3_algorithm. [76] J. L. Rana, Manasi Gyanchandani, and R. N. Yadav, (2010) "Intrusion Detection

using

C4.5:

Performance

Enhancement

by

Classifier

Combination", ACEEE Int. J. on Signal & Image Processing, Volume 01, No. 03. [77] Eibe Frank, Ian H. Witten, and Mark A. Hall, (2011), Data Mining Practical Machine Learning Tools and Techniques, Copyright © Elsevier Inc. [78] Gaffney John E., and Ulvila, J.W., (2001), "Evaluation of intrusion detectors: a decision theory approach", Security and Privacy, S&P Proceedings, IEEE Symposium on. [79] Asha

Gowda

Karegowda,

A.

S.

Manjunath,

and

M.A.Jayaram,

(2010),"COMPARATIVE STUDY OF ATTRIBUTE SELECTION USING GAIN RATIO AND CORRELATION BASED FEATURE SELECTION",

82

International

Journal

of

Information

Technology

and

Knowledge

Management, Volume 2, No. 2, pp. 271-277. [80] Knowledge Discovery in Database, KDD Cup 99 benchmark dataset, URL: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. [81] Ali A. Ghorbani, Ebrahim Bagheri, Mahbod Tavallaee, and Wei Lu, (2009), "A Detailed Analysis of the KDD CUP 99 Dataset", proceeding of the IEEE symposium on computational Intelligence in security

and defense

application. [82] Svein J. Knapskog, Sylvain Gombault, and Wei Wang, (2009),"Attribute Normalization in Network Intrusion Detection", IEEE 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN), DOI: 10.1109/I-SPAN.2009.49, pp. 448-453. [83] Article about: Euclidean Distance, URL: http://en.wikipedia.org/wiki/Euclidean_distance. [84] Article about: Manhattan Distance, URL: http://en.wikipedia.org/wiki/Manhattan_distance.

83

Appendices


Appendix A

Table A1: Description of the 41 Features of a TCP Connection

No. | Feature | Description | Type
1 | Duration | Length of the connection (No. of seconds) | Continuous
2 | Protocol_type | Type of connection protocol (tcp, udp) | Discrete
3 | Service | Network service on the destination (telnet, ftp) | Discrete
4 | Flag | Status flag of the connection | Discrete
5 | Src_bytes | No. of data bytes sent from source to destination | Continuous
6 | Dst_bytes | No. of data bytes sent from destination to source | Continuous
7 | Land | 1 if connection is from/to the same host/port; 0 otherwise | Discrete
8 | Wrong_fragment | No. of wrong fragments | Continuous
9 | Urgent | No. of urgent packets | Continuous
10 | Hot | No. of "hot" indicators | Continuous
11 | Num_failed_logins | No. of failed logins | Continuous
12 | Logged_in | 1 if successfully logged in; 0 otherwise | Discrete
13 | Num_compromised | No. of "compromised" conditions | Continuous
14 | Root_shell | 1 if root shell is obtained; 0 otherwise | Discrete
15 | Su_attempted | 1 if "su root" command attempted; 0 otherwise | Discrete
16 | Num_root | No. of "root" accesses | Continuous
17 | Num_file_creations | No. of file creation operations | Continuous
18 | Num_shells | No. of shell prompts | Continuous
19 | Num_access_files | No. of operations on access control files | Continuous
20 | Num_outbound_cmds | No. of outbound commands in an ftp session | Continuous
21 | Is_host_login | 1 if the login belongs to the "hot" list; 0 otherwise | Discrete
22 | Is_guest_login | 1 if the login is a "guest" login; 0 otherwise | Discrete
23 | Count | No. of connections to the same host as the current connection in the past two seconds | Continuous
24 | Srv_count | No. of connections to the same service as the current connection in the past two seconds | Continuous
25 | Serror_rate | % of connections that have "SYN" errors | Continuous
26 | Srv_serror_rate | % of connections that have "SYN" errors | Continuous
27 | Rerror_rate | % of connections that have "REJ" errors | Continuous
28 | Srv_rerror_rate | % of connections that have "REJ" errors | Continuous
29 | Same_srv_rate | % of connections to the same service | Continuous
30 | Diff_srv_rate | % of connections to different services | Continuous
31 | Srv_diff_host_rate | % of connections to different hosts | Continuous
32 | Dst_host_count | Count of connections having the same destination host | Continuous
33 | Dst_host_srv_count | Count of connections having the same destination host and using the same service | Continuous
34 | Dst_host_same_srv_rate | % of connections having the same destination host and using the same service | Continuous
35 | Dst_host_diff_srv_rate | % of different services on the current host | Continuous
36 | Dst_host_same_src_port_rate | % of connections to the current host having the same src port | Continuous
37 | Dst_host_srv_diff_host_rate | % of connections to the same service coming from different hosts | Continuous
38 | Dst_host_serror_rate | % of connections to the current host that have an S0 error | Continuous
39 | Dst_host_srv_serror_rate | % of connections to the current host and specified service that have S0 error | Continuous
40 | Dst_host_rerror_rate | % of connections to the current host that have an RST error | Continuous
41 | Dst_host_srv_rerror_rate | % of connections to the current host and specified service that have RST error | Continuous
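As an illustration of how the 41 features of Table A1 might be used in practice, the following minimal Python sketch parses one comma-separated connection record into named fields. It is not the implementation used in this thesis. The lower-cased feature names, the `parse_record` helper, and the sample record are assumptions made for the example; the record layout (41 comma-separated values followed by a label ending in a period) is the one used by the KDD Cup 99 files.

```python
# Feature names and types as listed in Table A1 (names lower-cased for this sketch).
C, D = "continuous", "discrete"

FEATURES = [
    ("duration", C), ("protocol_type", D), ("service", D), ("flag", D),
    ("src_bytes", C), ("dst_bytes", C), ("land", D), ("wrong_fragment", C),
    ("urgent", C), ("hot", C), ("num_failed_logins", C), ("logged_in", D),
    ("num_compromised", C), ("root_shell", D), ("su_attempted", D),
    ("num_root", C), ("num_file_creations", C), ("num_shells", C),
    ("num_access_files", C), ("num_outbound_cmds", C),
    ("is_host_login", D), ("is_guest_login", D),
    ("count", C), ("srv_count", C), ("serror_rate", C),
    ("srv_serror_rate", C), ("rerror_rate", C), ("srv_rerror_rate", C),
    ("same_srv_rate", C), ("diff_srv_rate", C), ("srv_diff_host_rate", C),
    ("dst_host_count", C), ("dst_host_srv_count", C),
    ("dst_host_same_srv_rate", C), ("dst_host_diff_srv_rate", C),
    ("dst_host_same_src_port_rate", C), ("dst_host_srv_diff_host_rate", C),
    ("dst_host_serror_rate", C), ("dst_host_srv_serror_rate", C),
    ("dst_host_rerror_rate", C), ("dst_host_srv_rerror_rate", C),
]


def parse_record(line):
    """Split one record into a {feature: value} dict and its class label."""
    fields = line.strip().split(",")
    if len(fields) != len(FEATURES) + 1:
        raise ValueError("expected 41 features plus a label")
    record = {}
    for (name, kind), raw in zip(FEATURES, fields):
        # Continuous features become floats; discrete ones stay as strings.
        record[name] = float(raw) if kind == C else raw
    label = fields[-1].rstrip(".")  # labels in the data files end with "."
    return record, label


# Illustrative record in the dataset's comma-separated format.
SAMPLE = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,"
          "0.00,0.00,0.00,0.00,normal.")

record, label = parse_record(SAMPLE)
print(record["protocol_type"], record["src_bytes"], label)
```

Keeping the type next to each name makes it straightforward to select only the continuous features when a distance-based method such as K-means is applied.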

Table A2: Description of Flag Values

Flag | Description
RSTOS0 | Originator sent a SYN followed by a RST, never saw a SYN ACK from the responder
RSTR | Established, responder aborted
RSTO | Connection established, originator aborted (sent a RST)
OTH | No SYN seen, just midstream traffic (a "partial connection" that was not later closed)
REJ | Connection attempt rejected
S0 | Connection attempt seen, no reply
S1 | Connection established, not terminated
S2 | Connection established and close attempt by originator seen (but no reply from responder)
S3 | Connection established and close attempt by responder seen (but no reply from originator)
SF | Normal establishment and termination
SH | Originator sent a SYN followed by a FIN (finish "flag"), never saw a SYN ACK from the responder (hence the connection was "half" open)
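The flag values of Table A2 are what the error-rate features of Table A1 (for example Serror_rate and Rerror_rate) are derived from. The short Python sketch below shows one way such rates could be computed over a window of connections. It is a hedged illustration only: the `error_rates` function is hypothetical, and it assumes that a "SYN" error corresponds to flag S0 (as the descriptions of features 38 and 39 suggest) and that a "REJ" error corresponds to flag REJ. The rates are returned as fractions between 0 and 1 rather than percentages.

```python
def error_rates(flags):
    """Return (serror_rate, rerror_rate) for a list of connection flags.

    Assumption for this sketch: S0 counts as a "SYN" error and
    REJ counts as a "REJ" error.
    """
    total = len(flags)
    if total == 0:
        return 0.0, 0.0
    syn_errors = sum(1 for f in flags if f == "S0")
    rej_errors = sum(1 for f in flags if f == "REJ")
    return syn_errors / total, rej_errors / total


# Four connections observed in a window: one normal, two unanswered, one rejected.
window = ["SF", "S0", "S0", "REJ"]
serror, rerror = error_rates(window)
print(serror, rerror)  # prints: 0.5 0.25
```

A window dominated by S0 flags, as in a SYN flood, would push the first rate towards 1.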

Appendix B

List of Publications: Most of the work presented in this thesis has been published or accepted for publication:

1- Deeman Y. Mahmood, Dr. Mohammed A. Hussein, (Dec. 2013), "Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction", International Organization of Scientific Research/ Journal of Computer Engineering (IOSR/JCE), ISSN: 2278-8727, Volume 15, Issue 5. The paper has been indexed in:

- ANED (American National Engineering Database), with ANEDDDL (Digital Data link) number: 11.0661/iosr-jce-R0155107112.
- Crossref, with DOI (Digital Object Identifier) number: 10.9790/0661155107112

2- Deeman Yousif Mahmood, and Dr. Mohammed Abdulla Hussein, (2014), "Analyzing NB, DT and NBTree Intrusion Detection Algorithms", Journal of Zankoy Sulaimani – Part A (JZS-A), Volume 16, No. 1.

Kurdish title page and abstract (translated)

A Multi-mode Internet Protocol Intrusion Detection System

A thesis submitted to the Council of the Faculty of Science and Science Education, School of Science, University of Sulaimani, in partial fulfillment of the requirements for the degree of Master of Science in Computer

By Deeman Yousif Mahmood, B.Sc. Computer (2008), University of Kirkuk

Supervised by Dr. Mohammed Abdullah Hussein, Assistant Professor

Pushpar (2714 K)

June (2014 A.D)

Abstract

Intrusion detection systems (IDS) have recently gained great importance and necessity in the field of network security, especially after the increase in the capability of attackers and the advancement of technology and techniques. Most services offered on the Internet, of all kinds, suffer from the problem that they are not readily available to the users authorized to use them because of denial of service attacks, which are the main subject of this research.

The KDD Cup 99 data was used as the main source of network packets. This data has been used widely with machine learning techniques. It is considered benchmark data that works well in this field, and it has been shown that machine learning techniques can be useful in intrusion detection.

In this work, two machine learning algorithms were applied to the proposed protection model: the clustering algorithm (K-means) for the unsupervised learning phase and the Decision Tree algorithm for the supervised learning phase. These algorithms were used together with the Information Gain algorithm to classify and rank the features of the packets. Although the K-means clustering algorithm has previously been used for intrusion detection, adding feature selection and ranking made it possible to obtain better results. Using the K-means algorithm made it possible to classify network packets into normal packets and denial of service packets, and adding the Decision Tree algorithm made it possible to classify the denial of service attacks.

The result of this work is an efficient system, or implementation, for detecting intrusions and attacks on the Internet with a high detection rate and a low false alarm rate (detection rate = [illegible]%, false alarm rate = [illegible]% for the K-means algorithm) and (detection rate = [illegible]%, false alarm rate = [illegible]% for the Decision Tree algorithm).

Arabic title page and abstract (translated)

A Multi-mode Internet Protocol Intrusion Detection System

A thesis submitted to the Council of the Faculty of Science and Science Education, School of Science, University of Sulaimani, in partial fulfillment of the requirements for the degree of Master of Science in Computer

By Deeman Yousif Mahmood, B.Sc. Computer Science (2008), University of Kirkuk

Supervised by Dr. Mohammed Abdullah Hussein, Assistant Professor

Sha'ban (1435 A.H)

June (2014 A.D)

Abstract

Intrusion detection systems (IDS) have recently gained more importance and necessity in the field of network security, especially with the increasing competence of attackers and the development of technology and techniques. The enormous growth of data traffic on the Internet has made it difficult for any system to detect all existing types of intrusion, which are continuously evolving. Most services offered on the Internet, of all kinds, suffer from the problem of being unavailable to authorized and licensed users because of denial of service attacks, which are the main subject of this research.

To demonstrate the capability of the proposed system, the KDD Cup 99 data was used as the main source of network packets. This data has been widely used with machine learning techniques over the past two decades to evaluate intrusion detection systems. It is considered benchmark data that works well in this field, and it has shown that machine learning techniques can be useful in intrusion detection.

In this work, two machine learning algorithms were applied to the proposed protection model: the clustering algorithm (K-means) for the unsupervised learning phase and the Decision Tree algorithm for the supervised learning phase. These algorithms were used together with the Information Gain algorithm to classify and rank the features of the packets. Although the K-means clustering algorithm has previously been used for intrusion detection, adding feature selection and ranking enabled us to obtain better results with a shorter processing time.

Using the K-means algorithm enabled us to classify network packets into normal packets or denial of service packets, and adding the Decision Tree algorithm made it possible to classify the denial of service attacks.

The result of this work is an efficient system, or application, for detecting intrusions and attacks on the Internet with a high detection rate and a low false alarm rate, according to the results obtained (detection rate = [illegible]%, false alarm rate = [illegible]% for the K-means algorithm) and (detection rate = [illegible]%, false alarm rate = [illegible]% for the Decision Tree algorithm).