Universiteit van Amsterdam — IAS technical report IAS-UVA-09-03

Statistical decision making for authentication and intrusion detection

Christos Dimitrakakis¹ and Aikaterini Mitrokotsa²
¹ Intelligent Systems Laboratory Amsterdam, University of Amsterdam, The Netherlands
² Faculty of EEMCS, TU Delft, The Netherlands

User authentication and intrusion detection differ from standard classification problems in that while we have data generated from legitimate users, impostor or intrusion data is scarce or non-existent. We review existing techniques for dealing with this problem and propose alternatives based on a principled decision-making viewpoint. We examine the general technique in some toy problems, and then validate it on real-world data from an RFID access control system. The results indicate that the approach could be useful in other decision-making scenarios where there is a lack of adversary data.

Keywords: classification, adversarial, authentication, intrusion detection, empirical Bayes.


Contents

1 Introduction
  1.1 Classification
  1.2 Related work

2 The proposed model framework
  2.1 The oracle decision rule
  2.2 Maximum likelihood adversary model
  2.3 Maximum a posteriori adversary model
      2.3.1 Practical consistency of the bound
  2.4 Bayesian adversary model
  2.5 Prior and user model estimation

3 Experimental evaluation
  3.1 Synthetic experiments
      3.1.1 Known user and prior over users
      3.1.2 Unknown user and priors
  3.2 Real data
      3.2.1 Experiments

4 Conclusion

A Proof

Intelligent Autonomous Systems
Informatics Institute, Faculty of Science, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Tel (fax): +31 20 525 7461 (7490)
http://www.science.uva.nl/research/ias/

Corresponding author: C. Dimitrakakis tel: +31 20 525 7517 [email protected] http://www.science.uva.nl/~dimitrak/

Copyright IAS, 2009

1 Introduction

Classification is the problem of categorising data into one of two or more possible classes. In the classical supervised learning framework, examples of each class have already been obtained, and the task of the decision maker is to accurately categorise new observations, whose class is unknown. The accuracy is measured either in terms of the rate of misclassification or, for problems where different types of errors carry different costs, in terms of the average cost. In that setting, the problem has three phases: firstly, the collection of training data; secondly, the estimation of a decision rule based on the training data; and thirdly, the application of the decision rule to new data. Typically, the decision rule remains fixed after the second step. Thus, the problem becomes that of finding the decision rule with minimum risk from the training data.

Unfortunately, some problems are structured in such a way that it is not possible to obtain data from all categories to form the decision rule. Novelty detection, user authentication, network intrusion detection and spam filtering all belong to this type of decision problem: while the data of the “normal” class is relatively easily characterised, the data of the other class, which we wish to detect, is not. This is partially due to the potentially adversarial nature of the process that generates the data of the alternative class. As an example, consider being asked to decide whether a particular voice sample belongs to a specific person, given a set of examples of his voice and your overall experience concerning the voices of other persons. Intuitively, this means deciding whether the voice belongs to the specific person, or to any other from all possible persons.

In this paper, we shall separate the data into two conceptual classes: the “user” and the “adversary”. The main distinction is that while we shall always have examples of instances of the user class, we may not have any data from the adversary class. A simple solution is to estimate the density of the known class and then assume that any new data lying in regions of low density is generated by some other class, i.e. the adversary class. Such methods range from simple outlier detection to more sophisticated clustering techniques. The main difficulty is that a threshold must somehow be set.

This problem is alleviated in authentication settings, where we must separate accesses by a specific user from accesses by an adversary. Such problems contain additional information: data which we have obtained from other people. This can be used to create a world model, which can then act as an adversary model, and has been used with state-of-the-art results in authentication [6, 12, 25]. Since there is no explicit adversary model, however, the probability of an attack cannot be estimated.

Our main contribution is a decision-making principle which employs a pessimistic estimate of the probability of an attack. Intuitively, this is done by conditioning the adversary model on the current observations, whose class is unknown. This enables us to place an upper bound on the probability of the adversary class, in a Bayesian framework. To the best of our knowledge, this is the first time that a Bayesian worst-case approach has been described in the literature for this problem. The proposed method is compared with both an oracle and the world model approach on a test-bench. This shows that our approach can outperform the world model under a variety of conditions. This result is validated on the real-world problem of detecting unauthorised accesses in a building.

The remainder of this section discusses related work. The model framework is introduced in Sec. 2, with the proposed Bayesian estimates discussed in Sec. 2.4 and methods for estimating the prior in Sec. 2.5. The conclusion is preceded by Sec. 3, which presents experiments and results.

1.1 Classification

In standard classification problems, there exists a prior, well-defined set of possible classes Y, and a set of observations X. We initially are presented with a set of examples (x, y) generated from a distribution µ ∈ M with conditional densities µ(x|y) ≡ p(x|y, µ).¹ We usually have example pairs (x, y) for all y ∈ Y, and as the number of examples increases we are able to construct a better approximation of the actual generating distribution µ ∈ M. In the Bayesian framework [c.f. 8, 14] this is done simply by starting with a (usually subjective) initial belief about which are the most probable models. This is expressed as an a priori density ξ₀ over M and can be updated simply by using Bayes' rule:

\[
\xi_t(\mu) \triangleq \xi_{t-1}(\mu \mid x_t, y_t) = \frac{\mu(x_t, y_t)\,\xi_{t-1}(\mu)}{\int_{\mathcal{M}} \xi_{t-1}(x_t, y_t \mid \mu = u)\,\xi_{t-1}(\mu = u)\,du} \qquad (1)
\]

Note that for any belief ξ over M, it holds that ξ(µ, x, y) = ξ(x, y|µ) ξ(µ) = µ(x, y) ξ(µ), since we assume that p(x, y|µ, ξ) = p(x, y|µ), i.e. that given the model, the data density does not depend on our belief about which model is correct. Our estimate of µ at time t is simply µ̂_t(·) = ∫_M µ(·) ξ_t(dµ).

Finally, given a cost matrix describing the cost of each possible decision given the actual class label of an unknown example, we can derive the decision function that minimises the Bayes risk. Let Z be the set of possible decisions² and let L(z, y) be the loss of taking decision z ∈ Z when the class is in fact y. Then, the Bayes risk of z would be ρ_t(z|·) = E[L(z, y)|z, ξ_t] = Σ_{y∈Y} L(z, y) P(y|·, ξ_t), wherein we can substitute any observables, i.e.

\[
\rho_t(z \mid x) = \sum_{y \in \mathcal{Y}} L(z, y)\,\mathbb{P}(y \mid x, \xi_t), \qquad (2)
\]

where P(y|x, ξ_t) = µ̂_t(y|x) = ∫ µ(y|x) ξ_t(dµ). The optimal statistical decision z, in that case, is the decision which minimises the Bayes risk.

¹ We shall frequently use one of the conditioning variables as a function symbol for clarity. With this notation, a(b|c) ≡ c(b|a) ≡ p(b|a, c).
² This does not even have to be the same as the set of classes. However, in most classification problems Z ≡ Y and L(i, j) = 0 for i = j.
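As an illustration of (1) and (2), the sketch below maintains a posterior over a finite model set and then picks the decision with minimum Bayes risk. It is only an illustration of these two equations, not the report's implementation; the candidate models, loss matrix and class probabilities are made-up values.

```python
import numpy as np

def posterior_update(xi, likelihoods):
    """One step of Bayes' rule (Eq. 1) over a finite model set M.

    xi          -- current belief over models, shape (|M|,)
    likelihoods -- p(x_t, y_t | mu) for each candidate model, shape (|M|,)
    """
    post = xi * likelihoods
    return post / post.sum()

def bayes_risk_decision(loss, class_probs):
    """Decision minimising the Bayes risk (Eq. 2).

    loss        -- loss[z, y] of taking decision z when the true class is y
    class_probs -- P(y | x, xi_t) for each class y
    """
    risk = loss @ class_probs          # rho_t(z | x) for every decision z
    return int(np.argmin(risk)), risk

# Toy run: two candidate Bernoulli models and a 2x2 loss matrix.
models = np.array([0.2, 0.7])          # p(x = 1 | mu) for each candidate model
xi = np.array([0.5, 0.5])              # prior belief xi_0 over the two models
for x in [1, 0, 1, 1]:                 # observed data
    xi = posterior_update(xi, np.where(x == 1, models, 1.0 - models))

loss = np.array([[0.0, 5.0],           # accepting an attack is expensive,
                 [1.0, 0.0]])          # rejecting a legitimate user is cheap
z, risk = bayes_risk_decision(loss, class_probs=np.array([0.6, 0.4]))
print(xi, z, risk)
```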

1.2 Related work

Classification algorithms have been extensively used for the detection of intrusions, not only in wired [5, 21] but also in wireless networks [9, 16, 20]. Their main disadvantage is that labelled normal and attack data must be available for training. After the training phase, the classifier's learnt model is used to predict the labels of new, unknown data. However, labelled attack data is very hard to obtain and often unreliable. Thus, in many cases, simulation is used to generate artificial data, which unfortunately can be quite different from real data. Finally, there will always exist new, unknown attacks for which training data is not available at all.

In simple classification problems, the decision concerning the class of new data is based on minimising the probability of error. Nevertheless, in many cases, such as intrusion detection, authentication and fraud detection, the goal is not to predict the class with the highest probability but to take the decision with the lowest expected cost. For instance, in authentication applications, allowing unauthorised access to a restricted place is associated with a higher cost than denying access to authorised individuals. The same applies to intrusion detection, where not detecting an attack may have a much more severe cost than triggering a false alarm. In cost-sensitive classification, the decision concerning the class label of new data is based on the minimum expected cost rather than the lowest error probability.


While there has been an extensive body of work on cost-sensitive classification for optimal statistical decisions [8], very limited research has been done on cost-sensitive classification for intrusion detection [11, 19, 22]. Nevertheless, cost-sensitive classification also assumes the availability of labelled datasets that include normal and attack data for training the classifier. While we do not consider general cost matrices in this paper, the suggested approach can be used to minimise bounds on the expected cost.

Clustering [23, 27] uses unlabelled data and is able to detect unknown types of attacks. Similar data instances are grouped using distance metrics: data of the same type lie close to each other in the feature space, while different instances are far apart. Clustering is closely related to outlier detection [4, 24]. From the clustering point of view, outlier detection involves detecting data instances that are situated outside the clusters created by normal data. Outlier detection is usually based on probabilistic models. Its main disadvantage is that its efficiency decreases for multidimensional distributions of data points [4, 15, 24].

An alternative framework is the world model approach [6, 12, 25]. This is extensively used in speech and image authentication problems, where data from a considerable number of users are collected to create a world model (also called a universal background model). This approach is closely related to the model examined in this paper, since it originates in the seminal work of [13], which employed an empirical Bayes technique for estimating a prior over models. Thus, the world model is a distribution over models, although due to computational considerations a point estimate is used instead in practice [25].

The adversary may actively try to avoid detection through knowledge of the detection method. In essence, this changes the setting from a statistical to an adversarial one. For such problems, game-theoretic approaches are frequently used. Dalvi et al. [7] investigated the adversarial classification problem as a two-person game. More precisely, they examined the optimal strategy of an adversary against a standard (adversary-unaware) classifier, as well as that of an (adversary-aware) classifier against a rational adversary, under the assumption that the adversary has complete knowledge of the detection algorithm. In a similar vein, Lowd and Meek [17] investigated algorithms for reverse engineering linear classifiers, which allows them to retrieve sufficient information to mount effective attacks. Barreno et al. [1] described an analytical model that lower bounds an adversary's effort to manipulate a naive outlier detection algorithm. Finally, Biggio et al. [3] investigated a strategy for hiding information about the classifier from the adversary by introducing randomness in the decisions.

In this paper we do not consider repeated interactions and thus we do not follow a game-theoretic approach. We instead consider how to model the adversary when we have a lot of data from legitimate users, but no data from the adversary. The first contribution of this paper is a simple upper bound on attack probabilities without any knowledge of the adversarial model. This can be obtained simply by using the current (unlabelled) observations to create a worst-case model of the adversary: the model from the set of allowed models which results in the highest estimated probability of attack. As will be seen, however, this naive approach can lead to a large number of false alarms, since the bound is not very tight. Our second contribution is to extend the approach to Bayesian estimation. This is done by conditioning the adversary model's prior on the data of the remaining population of users. This results in an empirical Bayes estimate of the prior [26], which is the essence of the world model approach proposed in Gauvain and Lee [13]. The prior then acts as a soft constraint when selecting the worst-case adversary model. Our final contribution is an experimental analysis on a synthetic problem, as well as on some real-world data, with promising results.

2 The proposed model framework

In the framework we consider, we assume that the set of all possible models is M. Each model µ ∈ M is associated with a probability measure over the set of observations X, which will be denoted by µ(x) for x ∈ X, µ ∈ M, so long as there is no ambiguity. We must decide whether some observations x ∈ X have been generated by a model q (the user) or a model w (the adversary) in M. Throughout the paper, we assume a prior probability P(q) that the user has generated the data, with a complementary prior P(w) = 1 − P(q) for the adversary.

In the easiest scenario, we have perfect knowledge of q, w ∈ M. It is then trivial to calculate the probability that the user has generated the data, P(q|x). This is the oracle decision rule, defined in Section 2.1. However, such a decision rule is not realisable: though we could accurately estimate q with enough data, in general there is no way to estimate the adversary model w.

Section 2.2 discusses a model employing maximum likelihood (ML) estimation on the currently seen observations to make a worst-case guess about the adversary model. This enables us to obtain a lower bound on P(q|x), the probability that the data was generated by the user, or equivalently, an upper bound on the probability of an attack. To keep the exposition simple, we initially consider that we know the user model q. This maximum likelihood model suffers from extreme pessimism. For this reason, Section 2.3 introduces a maximum a posteriori (MAP) model for the adversary. This starts with a prior density ξ(w) over the possible adversary models w ∈ M. Currently seen observations are then used to form a posterior density ξ′, from which we select the model with the maximum density. Naturally, if we start with a prior over possible models then we can in fact employ the full posterior distribution for the adversary model. Section 2.4 considers the case where the user model is known and where we are given a prior density ξ(w) over the possible adversary models w ∈ M. Currently seen observations are then used to form a pessimistic posterior ξ′ for the adversary.

Section 2.5 discusses the more practical case where neither the user model q nor a prior ξ over models M is known, but must be estimated from data. More precisely, the section discusses methods for utilising other user data to obtain a prior distribution over models. This amounts to an empirical Bayes estimate of the prior distribution [26]. It is then possible to estimate q by conditioning the prior on the user data. This is closely related to the adapted world model approach [25] used in authentication applications, which however usually employs a point approximation to the prior (see for example [6]).

2.1 The oracle decision rule

We shall measure the performance of all the models against that of the oracle decision rule. The oracle enjoys perfect information about the distribution of both the user and the adversary, as well as the a priori probability of an attack. Thus, on average, no other decision rule can do better.

As before, let M be the space of all models. Let the adversary's model be w and the user's model be q, with q, w ∈ M. Given some data x, we would like to determine the probability that the data x has been generated by the user, P(q|x). The oracle model has knowledge of w, q and P(q), and so enables us to calculate:

\[
\mathbb{P}(q \mid x) = \frac{q(x)\,\mathbb{P}(q)}{q(x)\,\mathbb{P}(q) + w(x)\,(1 - \mathbb{P}(q))}. \qquad (3)
\]

However, we usually have uncertainty about both the adversary and the user model. Concerning the adversary, the uncertainty is much more pronounced. For this reason, we examine a model for the probability of an attack when the user model is perfectly known but we only have a prior ξ(w) for the adversary model.

2.2 Maximum likelihood adversary model

Let us now assume that we only know the user model q. We shall need only one additional assumption: that we know the prior probability of an attack, 1 − P(q), or that we have a subjective estimate of it. The maximum likelihood adversary model is based on a simple observation. If we recalculate a new model ŵ = arg max_w p(x|w) for each new set of observations x that we want to classify, assuming they are all generated from the same model, then it is easy to see that:

\[
\mathbb{P}(q \mid x) \;\geq\; \hat{\mathbb{P}}_{ML}(q \mid x) \triangleq \frac{p(x \mid q)\,\mathbb{P}(q)}{p(x \mid q)\,\mathbb{P}(q) + p(x \mid \hat{w})\,(1 - \mathbb{P}(q))}. \qquad (4)
\]

We have thus obtained a lower bound³ on the probability of the user having generated the new data, by conditioning the adversary model on the new data. This is clearly a worst-case scenario. In order to see how tight this bound actually is, we present a very simple test with binomial data. We consider observations x_{1:t} ≜ (x_1, . . . , x_t), with x_i drawn from a Bernoulli distribution with parameter either q or w, fixed for all i, and we calculate P̂_ML(q|x_{1:t}) and P(q|x_{1:t}) for t = 1, . . . , 100.
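A minimal sketch of this Bernoulli test, with the oracle posterior following (3) and the ML bound following (4). The parameter values q = 0.7 and w = 0.3 are illustrative (the report does not state which values were used); the prior P(q) = 0.9 matches Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
q, w, P_q = 0.7, 0.3, 0.9                 # user model, adversary model, prior P(q)

def bernoulli_lik(x, theta):
    k, n = x.sum(), len(x)
    return theta**k * (1.0 - theta)**(n - k)

def oracle_posterior(x):
    """Eq. (3): P(q | x) when the adversary model w is known."""
    lq, lw = bernoulli_lik(x, q), bernoulli_lik(x, w)
    return lq * P_q / (lq * P_q + lw * (1.0 - P_q))

def ml_bound(x):
    """Eq. (4): lower bound on P(q | x) with a maximum-likelihood adversary."""
    w_hat = np.clip(x.mean(), 1e-6, 1.0 - 1e-6)   # arg max_w p(x | w) for a Bernoulli
    lq, lw = bernoulli_lik(x, q), bernoulli_lik(x, w_hat)
    return lq * P_q / (lq * P_q + lw * (1.0 - P_q))

x = rng.binomial(1, q, size=100)          # data actually generated by the user
for t in (10, 50, 100):
    print(t, oracle_posterior(x[:t]), ml_bound(x[:t]))
```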


Figure 1: Demonstration of the maximum likelihood adversary model. The adversary and the legitimate user generate data x ∈ {0, 1} with mean w and q respectively. Setting a prior probability of attack to 0.1, we plot the evolution of the posterior probability for the data being generated from the user when the adversary model is known (P), and a lower bound on this probability when the adversary model is unknown (B), for the two cases where the data is generated by either the user model (q) or the adversary (w).

The results are shown in Figure 1, which compares the oracle posterior probability with the bound, for both cases: when the data has been generated by the user and when the data has been generated by the adversary. As can be seen, the bound on the user's probability (B) tracks the oracle's probability (P) nicely when the data has been generated by the adversary (w), as we continue to collect data. However, the bound significantly underestimates the user probability when it is in fact the user (q) that has been generating the data, even though our prior is quite high at P(q) = 0.9. As can be seen, the oracle's posterior probability quickly converges to 1, while the bound oscillates around 0.8. Thus, it is to be expected that an application of such a method would quickly lead to a lot of false alarms.

³ This lower bound is only subjective if the attack probability is a subjective estimate.


Figure 2: Classification performance comparison between ML and oracle models for varying dimensionality of observations.


Indeed, due to a complete lack of information on the adversary, the maximum likelihood (ML) model can overfit the data for the adversary. We noticed in further experiments (not shown) that this behaviour becomes worse when we perform the experiment with a multinomial distribution of degree K, especially for large K. Intuitively, the reason is that in high-dimensional spaces the likelihood function becomes extremely peaky. Then, max_w p(x|w) can be much larger than p(x|q), even if the data x has been generated from q.

2.3 Maximum a posteriori adversary model

The extreme pessimism of the previous approach causes a lot of false alarms compared to the oracle. This is mainly due to the extreme bias of the adapted model towards the current observations. One possibility to reduce this bias is to start with a subjective prior ξ(w) over adversary models. We can condition this, via Bayes' rule, on the current observations,

\[
\xi(w \mid x) = \frac{p(x \mid w)\,\xi(w)}{\int p(x \mid w = u)\,\xi(w = u)\,du}, \qquad (5)
\]

and select the model with the highest posterior probability: ŵ = arg max_w ξ(w|x). We can now plug the estimated model into (4) to obtain an approximate (subjective) bound on the user's probability. This bound should be tight as long as ξ(w) is close to the actual distribution over adversary models w.

To make this more concrete, consider that the adversary generates multinomial observations of degree K. Our initial belief ξ(w) is a Dirichlet prior with parameters Φ ≜ (φ_1, . . . , φ_K) over adversary models:

\[
\xi(w) = \frac{1}{B(\Phi)} \prod_{i=1}^{K} w_i^{\phi_i - 1}, \qquad (6)
\]

which is conjugate to the multinomial [8]. Given a sequence of observations x_1, . . . , x_n, with x_t ∈ {1, . . . , K}, where each outcome i has fixed probability w_i, the count c_i = Σ_{t=1}^n I(x_t = i), where I is an indicator function, is multinomial and the posterior distribution over the parameters w_i is also Dirichlet, with parameters φ′_i = φ_i + c_i. In order to obtain the MAP estimate, we note that this posterior is maximised for ŵ_i = (φ′_i − 1)/(Σ_j φ′_j − K), if φ′_i > 1.

2.3.1 Practical consistency of the bound

Now we can apply our model to the same problem as before. For the experiments, we use a matching prior for the user and the adversary (i.e. they are both multinomial distributions drawn from the same Dirichlet prior). For each observation sequence x_{1:t} we start with the same prior ξ and calculate ξ(w|x_{1:t}) to obtain the MAP model ŵ, which is then used to calculate

\[
\hat{\mathbb{P}}_{MAP}(q \mid x) = \frac{p(x \mid q)\,\mathbb{P}(q)}{p(x \mid q)\,\mathbb{P}(q) + p(x \mid \hat{w})\,(1 - \mathbb{P}(q))}.
\]

The results are summarised in Figure 3, which shows oracle probabilities (P) and maximum a posteriori (MAP) probability bounds (B) for K = 16 outcomes, as the number of observations increases, averaged over 10,000 trials. We see that the user probability bound tracks that of the oracle both when the data has been generated by the user (q) and when it has been generated by the adversary (w). Thus, the behaviour of the MAP model is much more stable, even with more complex distributions, than that of the ML model with K = 2.
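A sketch of this MAP construction for Dirichlet–multinomial data. The dimension K = 16, the symmetric prior and the value P(q) = 0.9 are illustrative choices; the report only states that user and adversary share the same Dirichlet prior.

```python
import numpy as np

def map_adversary_bound(counts, q_probs, phi, P_q=0.9):
    """MAP variant of the bound (4) for multinomial observations.

    counts  -- c_i, observed counts per outcome, shape (K,)
    q_probs -- known user model q, shape (K,)
    phi     -- Dirichlet prior parameters over adversary models, shape (K,)
    """
    post = phi + counts                              # Dirichlet posterior parameters
    w_map = (post - 1.0) / (post.sum() - len(post))  # MAP estimate (valid when post > 1)
    log_q = counts @ np.log(q_probs)                 # multinomial coefficient cancels
    log_w = counts @ np.log(w_map)
    lq, lw = np.exp(log_q), np.exp(log_w)
    return lq * P_q / (lq * P_q + lw * (1.0 - P_q))

rng = np.random.default_rng(1)
K = 16
phi = np.full(K, 2.0)                 # matching prior for user and adversary
q = rng.dirichlet(phi)                # user model drawn from the prior
x = rng.multinomial(50, q)            # 50 observations generated by the user
print(map_adversary_bound(x, q, phi))
```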


Figure 3: Probability estimation for MAP and oracle models for varying dimensionality of observations.

2.4 Bayesian adversary model

We can use a subjective prior ξ(w) over possible adversary models to calculate the probability of observations given that they have been generated by the adversary: ξ(x) = ∫_M w(x) ξ(w) dw.⁴ We can express the probability, under the belief ξ concerning the adversary model, of the user q given the observation x as:

\[
\xi(q \mid x) \triangleq \mathbb{P}(q \mid x, \xi) = \frac{q(x)\,\mathbb{P}(q)}{q(x)\,\mathbb{P}(q) + \xi(x)\,(1 - \mathbb{P}(q))}. \qquad (7)
\]

The difference with (3) is that, instead of w(x), we have the marginal density ξ(x). If ξ(w) represents our subjective belief about the adversary model w, then by employing (7) we are in fact performing the Bayesian equivalent of the world model approach, where the prior over w plays the role of the world model.

Now let ξ′(w) ≜ ξ(w|x) be the model posterior for some observations x. We shall need the following lemma:

Lemma 2.1 For any probability measure ξ on M, where M is a space of probability distributions on X, such that each µ ∈ M defines a probability (density) µ(x) with x ∈ X, with admissible posteriors ξ′(µ) ≜ ξ(µ|x), the marginal likelihood satisfies ξ′(x) ≥ ξ(x), for all x ∈ X.

A simple proof, using the Cauchy–Schwarz inequality on the norm induced by the measure ξ, is presented in the Appendix. From the above lemma, it immediately follows that

\[
\xi(q \mid x) \;\geq\; \xi'(q \mid x) = \frac{q(x)\,\mathbb{P}(q)}{q(x)\,\mathbb{P}(q) + (1 - \mathbb{P}(q)) \int_{\mathcal{M}} w(x)\,\xi'(w)\,dw}, \qquad (8)
\]

since ξ′(x) = ∫_M w(x) ξ′(w) dw ≥ ∫_M w(x) ξ(w) dw = ξ(x). Thus (8) gives us a subjective upper bound on the probability of the data x having been generated by the adversary, which can be used to make decisions. The performance of the Bayesian models in the multinomial tests gave results similar (not shown) to those of the MAP models. Finally, note that we can form ξ′(w) on a subset of x. This possibility is explored in the experiments.

⁴ Here we used the fact that ξ(x|w) = w(x), since the probability of the observations given a specific model w no longer depends on our belief ξ about which model w is correct.
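For Dirichlet–multinomial models both marginals ξ(x) and ξ'(x) have closed forms, so Lemma 2.1 and the relation between (7) and (8) can be checked numerically. A sketch under those assumptions, using SciPy's gammaln; the prior parameters, dimension and P(q) = 0.5 are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, phi):
    """log Dirichlet-multinomial marginal of a count vector (multinomial coefficient omitted)."""
    return (gammaln(phi.sum()) - gammaln(phi.sum() + counts.sum())
            + np.sum(gammaln(phi + counts) - gammaln(phi)))

rng = np.random.default_rng(2)
K, P_q = 8, 0.5
phi = np.full(K, 1.5)                       # prior xi(w) over adversary models
q = rng.dirichlet(phi)                      # known user model
x = rng.multinomial(30, q)                  # new observations of unknown origin

lq    = np.exp(x @ np.log(q))               # q(x), multinomial coefficient omitted throughout
xi_x  = np.exp(log_marginal(x, phi))        # xi(x):  prior predictive
xi2_x = np.exp(log_marginal(x, phi + x))    # xi'(x): predictive after conditioning on x

world_prob = lq * P_q / (lq * P_q + xi_x  * (1.0 - P_q))   # Eq. (7)
bound      = lq * P_q / (lq * P_q + xi2_x * (1.0 - P_q))   # Eq. (8)
assert xi2_x >= xi_x                        # Lemma 2.1
print(world_prob, bound)                    # the bound is never above the world-model probability
```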

2.5 Prior and user model estimation

Specifically for user authentication, we have data from two sources. The first is data collected from the user which we wish to identify. The second is data collected from other people.⁵ Now, consider that the user can be fully specified in terms of a model q ∈ M, with q drawn from some unknown distribution γ over M. If we had the models µ_i ∈ M for all the other people in our dataset, then we would in fact obtain an empirical estimate γ̂ of the prior distribution of models. Empirical Bayes methods for prior estimation [26] extend this procedure to the case where we only observe x ∼ µ_i.

Once we have an estimate γ̂ of γ, and some data x ∼ µ with µ ∼ γ, we can form a posterior for µ using Bayes' rule: γ̂(µ|x) = µ(x) γ̂(µ) / ∫_M γ̂(x|µ) γ̂(dµ), over all µ ∈ M. For a specific user k with data x_k, we calculate the posterior ψ_k(µ) ≜ γ̂(µ|x_k). Whenever we must decide the class of a new observation x, we set the prior over the adversary models to ξ = γ̂ and then condition on part, or all, of x to obtain the posterior ξ′(w). We then calculate

\[
\mathbb{P}(q_k \mid x, \xi', \psi_k) = \frac{\psi_k(x)\,\mathbb{P}(q_k)}{\psi_k(x)\,\mathbb{P}(q_k) + (1 - \mathbb{P}(q_k))\,\xi'(x)}, \qquad (9)
\]

the posterior probability of the k-th user given the observations x and our beliefs ξ′ and ψ_k over adversary and user models respectively. When ξ′ = ξ, we obtain an equivalent to the world model approach of [25], which is an approximate form of the empirical Bayes procedure suggested in [13].

In our case, since we consider multinomial distributions drawn from a Dirichlet density, we can use a maximum likelihood estimate based on Polya distributions for γ. More specifically, we use the fixed-point approach suggested in [18] to estimate the Dirichlet parameters Φ from a set of multinomial observations. To make this more concrete, consider multinomial observations of degree K. Our initial belief ξ(µ) is a Dirichlet prior with parameters Φ ≜ (φ_1, . . . , φ_K) over models, ξ(µ) = (1/B(Φ)) ∏_{i=1}^K µ_i^{φ_i − 1}, which is conjugate to the multinomial [8]. Given a sequence of observations x_1, . . . , x_n, with x_t ∈ {1, . . . , K}, where each outcome i has fixed probability µ_i, the count c_i = Σ_{t=1}^n I(x_t = i), where I is an indicator function, is multinomial and the posterior distribution over the parameters µ_i is also Dirichlet, with parameters φ′_i = φ_i + c_i. The approach suggested in [18] uses the following fixed-point iteration for the parameters:

\[
\phi_i^{\text{new}} = \phi_i\, \frac{\sum_k \left[ \Psi(c_{ik} + \phi_i) - \Psi(\phi_i) \right]}{\sum_k \left[ \Psi(c_k + \sum_i \phi_i) - \Psi(\sum_i \phi_i) \right]},
\]

where c_{ik} is the count of outcome i in the k-th observation sequence, c_k = Σ_i c_{ik}, and Ψ(·) is the digamma function.

⁵ These are not necessarily other users.
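A sketch of this fixed-point iteration of [18] for fitting the Dirichlet parameters Φ to a population of count vectors, which is how the empirical Bayes prior γ̂ is obtained here. The initialisation, stopping rule and the simulated check are our own choices.

```python
import numpy as np
from scipy.special import digamma

def fit_dirichlet(counts, n_iter=500, tol=1e-8):
    """Maximum-likelihood Dirichlet parameters for rows of multinomial counts
    (Polya likelihood), via the fixed-point iteration of Minka [18].

    counts -- array of shape (n_users, K); counts[k, i] is the count of outcome i for user k
    """
    counts = np.asarray(counts, dtype=float)
    n_k = counts.sum(axis=1)                        # total observations per user
    phi = np.ones(counts.shape[1])                  # simple positive initialisation
    for _ in range(n_iter):
        s = phi.sum()
        num = (digamma(counts + phi) - digamma(phi)).sum(axis=0)
        den = (digamma(n_k + s) - digamma(s)).sum()
        phi_new = phi * num / den
        if np.max(np.abs(phi_new - phi)) < tol:
            return phi_new
        phi = phi_new
    return phi

# Check: recover the prior from 1000 simulated users, 50 observations each.
rng = np.random.default_rng(3)
true_phi = np.array([0.5, 1.0, 2.0, 4.0])
users = np.stack([rng.multinomial(50, rng.dirichlet(true_phi)) for _ in range(1000)])
print(fit_dirichlet(users))                         # should be close to true_phi
```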


Figure 4: The evolution of error rates as more data becomes available, when the user model and prior are either (a) known or (b) estimated. The points indicate means from 10⁴ runs and the lines the top and bottom 5% percentiles from a bootstrap sample.

3 Experimental evaluation

We have performed a number of experiments in order to evaluate the proposed approach and compare it to the well-known world model approach. These experiments are divided into two groups: experiments performed on synthetic data, and experiments performed on real data.

The first group contains two types of experiments. In the first type, we consider the set of K-degree multinomial models M and assume that the user prior γ is a known Dirichlet density and that the user model is a known multinomial model q. We then compare the oracle and the world model approach (based on γ) with a number of differently biased MAP models. In the second type, we again assume multinomial models, but rather than knowing γ, we use data from other users to form an empirical estimate γ̂. Furthermore, q is itself unknown and is estimated via Bayesian updating from γ̂ and some data specific to the user. We then perform the same comparison, but this time the world model is based on the estimate γ̂. In both cases, the adversary model uses the world model (γ or γ̂) as the adversary prior ξ.

The second group concerns experiments on data gathered from an access control system. The data has been discretised into 1320 integer variables, in order for it to be modelled with multinomials. The models are of course not available, so we must estimate the priors: the data of a subset of users is used to estimate γ̂, while the remaining users alternately take on the roles of legitimate users and adversaries.

We compare the following types of models, which correspond to the legends in the figures of the experimental results:
(a) the oracle model, which enjoys perfect information concerning adversary and user distributions;
(b) the world model, which uses the prior over user models as a surrogate for the adversary model;
(c) the bias world model, which uses all but the last observation to obtain a posterior over adversary models;
(d) the f bias world model, which uses all observations;
(e) the p bias world model, which weighs the observations by 1/2; and
(f) the n bias world model, which uses the first half of the observations.
In all cases, we used percentile calculations based on multiple runs and/or bootstrap replicates [10] to assess the significance of the results.

3.1 Synthetic experiments

For this evaluation, we ran 10⁴ independent experiments and employed multinomial models. For each experiment, we first generated the true prior distribution over user models, γ. This was created by drawing Dirichlet parameters φ_i independently from a Gamma distribution. We also generated the true prior distribution over adversary models, γ′, by drawing from the same Gamma distribution. Then, a user model q was drawn from γ and an adversary model w was drawn from γ′. Finally, by flipping a coin, we generated data x_1, . . . , x_n from either q or w. Assuming equal prior probabilities of user and adversary, we predicted the most probable class and recorded the error. This was done for all subsequences of the observation sequence x. Thus, the experiment measures the performance of the methods as the amount of data that informs our decision increases.
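A sketch of one such synthetic run, reduced to the oracle decision at a single sequence length. The Gamma hyperparameters, K and n are illustrative assumptions, since the report does not specify them, and the full experiment also evaluates the world-model and biased MAP variants on every prefix of the sequence.

```python
import numpy as np

rng = np.random.default_rng(4)
K, n = 4, 30

def loglik(x, p):
    return x @ np.log(np.clip(p, 1e-300, None))

def one_run():
    gamma_user = rng.gamma(shape=1.0, scale=1.0, size=K)   # true prior over user models
    gamma_adv  = rng.gamma(shape=1.0, scale=1.0, size=K)   # true prior over adversary models
    q = rng.dirichlet(gamma_user)                           # user model
    w = rng.dirichlet(gamma_adv)                            # adversary model
    is_user = rng.random() < 0.5                            # equal class priors
    x = rng.multinomial(n, q if is_user else w)
    # Oracle decision: compare the exact likelihoods under q and w.
    decide_user = loglik(x, q) > loglik(x, w)
    return decide_user == is_user

error = 1.0 - np.mean([one_run() for _ in range(10_000)])
print("oracle error rate:", error)
```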

3.1.1 Known user and prior over users

For these experiments, we use the actual Dirichlet distribution γ as the world model, and the actual user parameters q as the user model. The results for K = 4 are summarised in Figure 4a. It can be seen that, while the oracle has a substantial advantage over the other models, the simple world model is consistently outperformed by the biased models. For larger K (not shown), the performance differences are more pronounced, while for more data the bias and f bias models started to perform worse than the world model.

3.1.2 Unknown user and priors

For these experiments, we estimate the actual Dirichlet distribution with γ̂. This estimation is performed via empirical Bayes, using data from 1000 users drawn from the actual prior γ. At the k-th run, we draw a user model q_k ∼ γ and subsequently draw x_k ∼ q_k. We then use γ̂ and the user data x_k ∈ ℕ^K to estimate a posterior over user models for the k-th user, ψ_k(q) ≜ γ̂(q|x_k). The estimated prior γ̂ is also used as the world model and as the prior over adversary models. As shown in Figure 4b, the biased models consistently outperform the classic world model approach, while the partially biased models become significantly better than the fully biased models as the amount of observations increases. The results in Figure 4b do not exhibit a substantial difference compared to the case where the user is known. This is encouraging for the application to real-world data.

3.2 Real data

The real-world data were collected from an RFID-based access control system used in two buildings of the TNO organisation (Netherlands Organisation for Applied Scientific Research). The data were collected during a three-and-a-half-month period, and they include the successful accesses of 882 users, collected from 55 RFID readers granting access to users attempting to pass through doors in the buildings. The initial data included three fields: the time and date that access was granted, the reader that was used to get access, and the ID of the RFID tag used.⁶

In order to use the data in the experimental evaluation of the proposed model framework, we discretised the time into hour-long intervals and counted the number of accesses, per hour, per door, for each user, in each day. This resulted in a total of ≈ 2 · 10⁵ records. Since there are 24 hour-long slots in a day and a total of 55 reader-equipped doors, this discretisation allowed us to model each user by a 1320-degree multinomial/Dirichlet model. Thus, even though the underlying Dirichlet/multinomial model framework is simple, the very high dimensionality of the observations makes the estimation and decision problem particularly taxing.

⁶ The data were sanitised to avoid privacy issues.

3.2.1 Experiments

We performed 10 independent runs. For the k-th run, we selected a random subset U_γ of the complete set of users U, such that |U_γ|/|U| = 2/3. We used U_γ to estimate the world model γ̂. The remaining users U_T = U \ U_γ were used to estimate the error rate over 10³ repetitions. For the j-th repetition, we randomly selected a user i ∈ U_T with at least 10 records D_i. We used half of those records, D̄_i, to obtain ψ_i(q) ≜ γ̂(q|D̄_i). By flipping a coin, we obtain either (a) one record from D_i \ D̄_i, or (b) data from some other user in U_T. Let us call that data x_j. For the biased models, we set ξ = γ̂ and then used x_j to obtain ξ(w|f(x_j)), where f(·) denotes the appropriate transformation.

Figure 5 shows results for the baseline world model approach (world), where f(x) = ∅, as the unmodified world model is used for the adversary; the full bias approach (f bias), where f(x) = x, since all the data is used; and finally the partial bias approach (p bias), where f(x) = x/2. The other approaches are not examined, as the oracle is not realisable, while the half-data and the all-but-last-data biased models are equivalent to the baseline world model, since we do not have a sequence of observations, but only a single record.

As can be seen in Figure 5, the baseline world model always performs worse than the biased models, though in two runs the full bias model is close. Finally, though the two biased models are not distinguishable performance-wise, we noted a difference in the ratio of false positives to false negatives. Over the 10 runs, this was 0.2 ± 0.1 for the world model approach, 2.5 ± 0.5 for the fully biased model, and 0.9 ± 0.2 for the partially biased model.

Figure 5: Error rates for 10 runs on the TNO door data. The error bars indicate the top and bottom 5% percentiles from 100 bootstrap samples from 10³ repetitions per run.
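A sketch of the discretisation described in Section 3.2, mapping raw access records into the 24 × 55 = 1320-dimensional count vectors (one per user per day) that the multinomial/Dirichlet models operate on. The record layout and field names below are hypothetical; the TNO data format is not specified beyond the three fields mentioned above.

```python
from collections import defaultdict
from datetime import datetime

N_DOORS, N_HOURS = 55, 24              # 55 RFID readers, 24 hour-long slots per day

def build_count_vectors(records):
    """records: iterable of (iso_timestamp, door_id, tag_id) tuples (hypothetical layout).

    Returns one 1320-dimensional count vector per (tag, day); entry
    hour * N_DOORS + door counts the accesses at that door in that hour.
    """
    vectors = defaultdict(lambda: [0] * (N_HOURS * N_DOORS))
    for ts, door, tag in records:
        t = datetime.fromisoformat(ts)
        vectors[(tag, t.date())][t.hour * N_DOORS + door] += 1
    return vectors

example = [("2009-02-03T08:12:00", 4, "user17"),
           ("2009-02-03T08:45:00", 4, "user17"),
           ("2009-02-03T17:30:00", 12, "user17")]
vec = build_count_vectors(example)[("user17", datetime(2009, 2, 3).date())]
print([(i, c) for i, c in enumerate(vec) if c])   # -> [(444, 2), (947, 1)]
```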

4 Conclusion

We have presented a very simple, yet effective, approach for classification problems where one class has no data. In particular, we define a prior over models which can be estimated from population data. This is adapted, as in the standard world model approach, to a specific user. We introduce the idea of creating an adversary model, for which no labelled data exists, from the prior and the currently seen data. Within the subjective Bayesian framework, this allows us to obtain a subjective upper bound on the probability of an attack. Experimentally, it is shown that (a) we outperform the classical world model approach, while (b) a Bayesian framework is essential for such a scheme to work, in which case (c) we outperform a baseline world model approach, while (d) it is always better to only partially condition the models on the new observations.

It is possible to extend the approach to the cost-sensitive case. Since we already have bounds on the probability of each class, together with a given cost matrix we can also calculate bounds on the expected cost. This will allow us to make cost-sensitive decisions. A related issue is whether to alter the a priori class probabilities; in our comparative experiments we used equal fixed values of 0.5. It is possible to utilise the population data to tune them in order to achieve some desired false positive/negative ratio. Such an automatic procedure would be useful for an expected performance curve [2] comparison between the various approaches. Finally, since the experiments on this relatively complex problem gave promising results, we plan to evaluate the approach on other problems that exhibit a lack of adversarial data.

A Proof

Lemma 2.1 For discrete M, the marginal prior ξ(x) can be re-written as follows:

\[
\xi(x) = \sum_{\mu} \xi(x, \mu) = \sum_{\mu} \xi(x \mid \mu)\,\xi(\mu) = \sum_{\mu} \mu(x)\,\xi(\mu),
\]

and similarly:

\[
\xi'(x) = \frac{1}{\sum_{\mu} \mu(x)\,\xi(\mu)} \sum_{\mu} \mu(x)^2\,\xi(\mu).
\]

Thus, to prove the required statement, it is sufficient to show that

\[
\left( \sum_{\mu} \mu(x)^2\,\xi(\mu) \right)^{1/2} \;\geq\; \sum_{\mu} \mu(x)\,\xi(\mu). \qquad (10)
\]

Similarly, for continuous M, we obtain:

\[
\left( \int \mu(x)^2\,d\xi(\mu) \right)^{1/2} \;\geq\; \int \mu(x)\,d\xi(\mu). \qquad (11)
\]

In both cases, the norm induced by the probability measure ξ on M is ‖f‖₂ = (∫_M |f(µ)|² dξ(µ))^{1/2}, which allows us to apply the Hölder (Cauchy–Schwarz) inequality

\[
\| f g \|_1 \;\leq\; \| f \|_p\,\| g \|_q, \qquad \text{with } 1 \leq p, q \leq \infty,\; 1/p + 1/q = 1. \qquad (12)
\]

By setting p = q = 2, f(µ) = µ(x) and g(µ) = 1, we obtain the required result, since ‖g‖₂ = (∫_M dξ(µ))^{1/2} = 1, as ξ is a probability measure.

References

[1] M. Barreno, B. Nelson, R. Sears, A.D. Joseph, and J.D. Tygar. Can machine learning be secure? In 2006 ACM Symposium on Information, Computer and Communications Security (ASIACCS '06), pages 16–25, Taipei, Taiwan, 21-24 March 2006.

[2] S. Bengio, J. Mariéthoz, and M. Keller. The expected performance curve. In ICML Workshop on ROC Analysis in Machine Learning, 2005.

[3] B. Biggio, G. Fumera, and F. Roli. Adversarial pattern classification using multiple classifiers and randomisation. In N. da Vitoria Lobo, T. Kasparis, M. Georgiopoulos, F. Roli, J. Kwok, G.C. Anagnostopoulos, and M. Loog, editors, Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR International Workshop, SSPR & SPR 2008, volume 5342 of Lecture Notes in Computer Science, pages 500–509, Orlando, USA, 4-6 December 2008.

[4] M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander. LOF: Identifying density-based local outliers. ACM SIGMOD Record, 29(2):93–104, 2000.

[5] A.A. Cárdenas and J.D. Tygar. Statistical classification and computer security. In Proceedings of the Workshop on Machine Learning in Adversarial Environments for Computer Security (NIPS 2007), Whistler, BC, Canada, December 2007.

[6] F. Cardinaux, C. Sanderson, and S. Bengio. User authentication via adapted statistical models on face images. IEEE Transactions on Signal Processing, 54(1), January 2005.

[7] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108, Seattle, WA, USA, 2004. ACM Press.

[8] M.H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970. Republished in 2004.

[9] H. Deng, Q. Zeng, and D.P. Agrawal. SVM-based intrusion detection system for wireless ad hoc networks. In Proceedings of the 58th IEEE Vehicular Technology Conference (VTC '03), volume 3, pages 2147–2151, Orlando, FL, USA, 6-9 October 2003.

[10] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics & Applied Probability. Chapman & Hall, November 1993. ISBN 0412042312.

[11] W. Fan, W. Lee, S.J. Stolfo, and M. Miller. A multiple model cost-sensitive approach for intrusion detection. In Proceedings of the 11th European Conference on Machine Learning (ECML '00), volume 1810 of Lecture Notes in Computer Science, pages 142–153, Barcelona, Catalonia, Spain, 2000.

[12] S. Furui. Robust speech recognition. In K. Ponting, editor, Computational Models of Speech Pattern Processing, NATO ASI Series, pages 132–142. Springer-Verlag, Berlin, 1999.

[13] J. Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2:291–298, 1994.

[14] M.I. Jordan. Learning in Graphical Models. Kluwer Academic Publishers, 1998.

[15] E.M. Knorr and R.T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Databases (VLDB 1998), pages 392–403, New York, NY, USA, 24-27 August 1998.

[16] Y. Liu, Y. Li, and H. Man. MAC layer anomaly detection in ad hoc networks. In Proceedings of the 6th Annual IEEE SMC Information Assurance Workshop (IAW '05), pages 402–409, West Point, NY, USA, 15-17 June 2005.

[17] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD '05), pages 641–647, Chicago, IL, USA, 2005.

[18] T. Minka. Estimating a Dirichlet distribution, 2003.

[19] A. Mitrokotsa, C. Dimitrakakis, and C. Douligeris. Intrusion detection using cost-sensitive classification. In Proceedings of the 3rd European Conference on Computer Network Defense (EC2ND '07), LNEE (Lecture Notes in Electrical Engineering), pages 35–46, Heraklion, Greece, 4-5 October 2007. Springer-Verlag.

[20] A. Mitrokotsa, N. Komninos, and C. Douligeris. Intrusion detection with neural networks and watermarking techniques for MANET. In Proceedings of the IEEE International Conference on Pervasive Services (ICPS '07), pages 118–127, Istanbul, Turkey, 15-20 July 2007.

[21] S. Mukkamala, A.H. Sung, and B. Abraham. Intrusion detection using an ensemble of intelligent paradigms. Journal of Network and Computer Applications, Special Issue on Computational Intelligence on the Internet, 28(2):167–182, 2005.

[22] P. Pietraszek. Using adaptive alert classification to reduce false positives in intrusion detection. In Proceedings of Recent Advances in Intrusion Detection, 7th International Symposium (RAID 2004), volume 3224 of Lecture Notes in Computer Science, pages 102–124, Sophia Antipolis, France, 2004. Springer-Verlag.

[23] L. Portnoy, E. Eskin, and S.J. Stolfo. Intrusion detection with unlabeled data using clustering. In Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA '01), pages 5–8, Philadelphia, PA, USA, 5-8 November 2001.

[24] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 427–438, Dallas, TX, USA, 14-19 May 2000.

[25] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10:19–41, 2000.

[26] H. Robbins. An empirical Bayes approach to statistics. In J. Neyman, editor, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 157–163. University of California Press, Berkeley, CA, 1955.

[27] K. Sequeira and M. Zaki. ADMIT: Anomaly-based data mining for intrusions. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pages 386–395, Edmonton, Alberta, Canada, 23-26 July 2002.

Acknowledgements

This work was partially supported by the Netherlands Organization for Scientific Research (NWO) under the RUBICON grant "Intrusion Detection in Ubiquitous Computing Technologies" awarded to A. Mitrokotsa, and by the ICIS project, supported by the Dutch Ministry of Economic Affairs, grant nr: BSIK03024. Many thanks to Norman Poh and Zhou Fang for comments and discussions, and to Karel van Houten and Dr. Thijs Veugen for their help in obtaining the real-world data.

