Combining Time and Space Similarity for Small Size Learning under Concept Drift

Indrė Žliobaitė
Vilnius University, Faculty of Mathematics and Informatics, Naugarduko st. 24, LT-03225 Vilnius, Lithuania, [email protected]

Abstract. We present a concept drift responsive method for training classifiers on sequential data. Training instances are selected based on their similarity to the target observation, combining similarity in space and similarity in time. The algorithm determines an optimal training set size and can be used with different plug-in base classifiers. The proposed algorithm shows the best accuracy in its peer group, and its complexity is reasonable for field applications.

1 Introduction

Classification of sequential data is often a non-stationary problem. Changes in the underlying distribution are referred to as concept drift. Concept drift occurs when one data source S1 (an underlying probability distribution) is replaced by another source S2. For example, in spam categorization new types of spam are invented, and the personal interpretation of what counts as spam might change. Concept drift is assumed to be unpredictable with certainty, but it can be expected depending on the data domain.

An optimal classifier after the drift would model S2, yet data from the source S2 is scarce right after the drift. If concept drift is observed, a common heuristic suggests dropping the old data (originating from S1) from the training sample, leaving a training window of the N most recent sequential instances. Another approach is to relearn, selecting only the relevant instances into the training set. We introduce a method which builds the classifier using the nearest neighbors (in time and space) of the observation in question. The method is expected to demonstrate a competitive advantage under gradual, non-uniform drift scenarios in small and moderate size data sequences.

In Section 2 we outline related work and position the proposed method. In Section 3 the proposed method is presented in detail. Section 4 gives the experimental setup and the results. Sections 5 and 6 conclude.

2 Related work

Since concept drift is assumed not to be predictable with confidence, classification methods need adaptation mechanisms. One group of concept drift responsive methods is based on change detection [22, 6, 16, 4, 9]. After a drift is detected, a portion of the old data is dropped from the training sample. These methods are usually based on a sudden drift scenario.

Another group of methods builds multiple classifiers and then tries to replicate the distribution of concepts by fusing or selecting the classifier outputs. This group includes heuristic window search algorithms [11, 17] as well as classifier ensembles [18, 21, 12, 19]. The training data is formed from instances which are sequential in time. Unselective use of the old data might lead to a gain in accuracy by chance, depending on the consistency between the old and the new concept. When concept drift is gradual, several data sources might be active over a time interval, so systematic training data selection is needed.

The issue of systematic data selection has been brought up in [7, 5, 15, 19, 3]. Ganti et al. [7] describe a generic state-of-the-art framework for systematic training data selection, without an actual plug-and-play algorithm. Lazarescu et al. [15] extend their previous algorithm with relevance-based (similarity to the current concept) forgetting. Tsymbal et al. [19] test ensemble classifiers on the nearest neighbors of the target observation and select, based on this test, the classifier which makes the final decision; they do not use instance selection for building the classifiers. Fan [5] proposes a decision-tree-specific plug-and-play method, which builds multiple classifiers and includes systematic training set selection. Beringer and Hüllermeier [3] organize training data into prototype clusters, referred to as case bases; in contrast to the above works, they exclude the instances which are too similar to the ones already present in the training set.

In this study we propose an algorithm called FISH, which systematically selects training instances, addressing similarity in time and in space. The method can be used with different base classifiers. In the FISH method the size of the training set is fixed and set in advance, while the extension FISH2 operates with a variable training set size, which is reset at each step of the training process. FISH builds a classifier based on similarity to the target instance. Other multiple classifier methods [19] employ the similarity aspect, but only in the classifier selection phase; the needed classifier might not be present among the ensemble members. Our approach is similar to lazy learning [2], the main difference being that the latter does not construct an explicit generalization and classifies by directly comparing the target and training instances.

3 The FISH methods

3.1 Scenario set-up

Consider streaming data for classification. One data point x ∈ ℜ^p is received at a time. At time t + 1 the task is to predict the class label for the target observation x_{t+1}. Any selected subset or all of the historical labeled data x_1, . . . , x_t can be used to build a classifier. At time t + 2, after the classification decision has been cast and the true label received, we can add x_{t+1} to the training data. At time t + 1 one data source is active, but a data instance may come from any of several sources with some probability. We assume the prior probabilities of the classes are equal.
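A minimal sketch of this sequential (test-then-train) protocol, assuming hypothetical select_training_set and build_classifier helpers that stand in for the instance selection of Section 3.3 and for an arbitrary base classifier with a predict method:

    # Test-then-train protocol sketch. `select_training_set` and
    # `build_classifier` are hypothetical placeholders for the instance
    # selection of Section 3.3 and for any base classifier.
    def run_stream(stream, select_training_set, build_classifier):
        X, y, mistakes, tested = [], [], 0, 0
        for x_new, y_new in stream:
            if X:                                    # classify x_{t+1} from x_1..x_t
                idx = select_training_set(X, y, x_new)
                clf = build_classifier([X[i] for i in idx], [y[i] for i in idx])
                mistakes += clf.predict([x_new])[0] != y_new
                tested += 1
            X.append(x_new)                          # the true label arrives at t+2,
            y.append(y_new)                          # so x_{t+1} joins the history
        return mistakes / max(tested, 1)             # sequential testing error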

Fig. 1. (a) Gradual drift scenario. (b) Electricity data example (Euclidean distance against time); ◦ and ∗ mark the classes.

Concept drift means that the i.i.d. assumption may not hold for the data x_1, . . . , x_t. Consider, for example, the gradual drift scenario illustrated in Figure 1. Up to time t_1 the data generating source S1 is active. In the time interval t_1 + 1, . . . , t_2 both sources are active, and an instance comes from either one or the other source with some probability; the probability of sampling from S2 increases with time. The designer does not know when the sources switch. We aim to select a training set consisting of the instances which are most similar to the target observation. We can measure how similar x_{t+1} is to the available training instances, even though the label of x_{t+1} is not known.

3.2 Similarity in Time and Space

Similarity between two instances is usually defined as a function of their distance in space. If the problem is non-stationary, similarity in time is relevant as well. Consider the rotating hyperplane example in Figure 2. A binary classification problem is represented by black and gray dots. There are three data generating sources S1, S2 and S3. In (a) the instances coming from the source S1 are depicted. Later, in (b), the source S2 becomes active; the discriminant line for these new instances has rotated by 45°. Finally, in (c), the source S3 is active, corresponding to another rotation of 45°. A circle in each of the subfigures defines a neighborhood (similarity) in space. In (a) only instances from class 2 are within the circle, in (b) there is a mix of both classes, while in (c) only class 1 instances are within the circle. The circle is fixed in space, but the nearest neighbors within the circle depend on which data source was active at the time.

Let D_ij be a similarity measure between two instances x_i and x_j. We combine similarities in space and time as a weighted sum of the distances:

    D_ij = a1 * d_ij^(s) + a2 * d_ij^(t),        (1)

where d^(s) denotes the distance in space, d^(t) the distance in time, and a1, a2 are the weight coefficients; for simplicity A = a2/a1 can be used, which gives D*_ij = d_ij^(s) + A * d_ij^(t).

In order to balance the time and space distances, d^(s) and d^(t) need to be normalized. We suggest scaling the values of each feature to the interval [0, 1].

Fig. 2. Rotating hyperplane example: (a) initial source S1, (b) source S2 after 45° rotation, (c) source S3 after 90° rotation. Black and gray dots represent the two classes.

Similarity in time between the instances x_i and x_j is defined here as d_ij^(t) = |i − j|. We scale the time distances so that d_ij^(t) ∈ [0, 1]. We justify this simple setting by the following property: if only similarity in space is considered in Equation (1) (a2 = 0), the measure reduces to plain instance selection; if only similarity in time is considered (a1 = 0), we end up with the commonly used training window approach.
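As an illustration, a minimal sketch of the combined distance under these conventions (numpy is assumed; a1 = 1 and a2 = A, as in the simplified form D*):

    import numpy as np

    def combined_distances(X_hist, x_target, A=1.0):
        # X_hist: historical instances x_1..x_t, shape (t, p); x_target: x_{t+1}.
        # Returns D*_j = d_space + A * d_time for j = 1..t (Equation (1)).
        X = np.vstack([X_hist, x_target])
        lo, hi = X.min(axis=0), X.max(axis=0)
        X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # features scaled to [0, 1]
        d_space = np.sqrt(((X[:-1] - X[-1]) ** 2).sum(axis=1))  # Euclidean
        t = len(X_hist)
        d_time = np.arange(t, 0, -1) / t                 # |i - j| scaled to (0, 1]
        return d_space + A * d_time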

3.3 The FISH algorithm

FISH selects training instances based on their similarity to the target observation. We present two versions of the algorithm: FISH requires the training sample size as an input, while the extension FISH2 allows a variable training sample size. To implement the variable sample size we incorporate principles from two windowing methods, [11] (KLI) and [19] (TSY). In this presentation we focus on the more advanced FISH2; more details on FISH can be found in the technical report [20].

FISH2 is presented in Figure 3. First we calculate the combined time and space similarities between the target observation and the historical instances (Equation (1)) and sort them from minimum to maximum. As a simple measure of the distance in space we use the Euclidean distance d(x, y) = sqrt( Σ_{i=1}^{p} |x_i − y_i|^2 ); the effect of different similarity measures is a subject for further investigation. The features are scaled to the interval [0, 1].

Next, the k instances closest to the target observation are selected as a validation set. Candidate training sets are formed from the closest s instances. The method works similarly to the windowing in [11] (KLI), except that KLI uses instances that are sequential in time to form the windows, whereas we use the combined time and space similarity. We select the training size L which gives the best accuracy on the validation set. Cross-validation needs to be employed when testing on the validation set: the validation instances are tested one by one, and at testing time the tested instance is excluded from the training set. Without cross-validation the training set of size k would be likely to give the best accuracy, because in that case the training set would be equal to the validation set.

The outcome of the algorithm is a set of L training indices {i_1, . . . , i_L}.

The Instance Selection algorithm (FISH2)

Input: labeled observations x_1, . . . , x_t in ℜ^p; target observation x_{t+1} with unknown label; neighborhood size k; time/space similarity weight A (Equation (1)).

1. Calculate the distances D_j = d(x_j, x_{t+1}) for j = 1, . . . , t (Equation (1)).
2. Sort the distances from minimum to maximum: D_{z1} < D_{z2} < . . . < D_{zt}.
3. For s = k : step : t
   (a) select the s instances having the smallest distances D;
   (b) using cross-validation*, build a classifier C_s on the instances indexed {i_1, . . . , i_s} and test it on the k nearest neighbors indexed {i_1, . . . , i_k}; record the testing error e_s.
4. Find the minimum error classifier C_L, where L = arg min_{s = k, ..., t} e_s.
5. Output the indices {i_1, . . . , i_L}, which form a training set from the observations x ∈ ℜ^p.

* When testing on an instance from the validation set, this instance is excluded from the training set.

Fig. 3. The Instance Selection algorithm (FISH2)

For the final classification decision regarding the observation x_{t+1}, the L original training instances {x_{i_1}, . . . , x_{i_L}} are then used. The method can be used with various plug-in base classifiers.

FISH2 differs from FISH in two main aspects. First, FISH uses a prefixed training size N, while the training set size in FISH2 is variable. Second, FISH forms the training set handling the instances coming from different classes separately, while FISH2 collects the training sample from all the closest instances.
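The following is a compact sketch of FISH2, not the authors' MATLAB implementation; it assumes the combined_distances helper above and a scikit-learn-style base classifier:

    import numpy as np
    from sklearn.base import clone

    def fish2_select(X_hist, y_hist, x_target, base_clf, k=7, A=1.0, step=1):
        """Return the training set indices chosen by FISH2 (Figure 3)."""
        D = combined_distances(X_hist, x_target, A)   # step 1: Equation (1)
        order = np.argsort(D)                         # step 2: sort distances
        best_err, best_size = np.inf, k
        for s in range(k, len(X_hist) + 1, step):     # step 3: candidate sizes
            err = 0
            for v in order[:k]:                       # leave-one-out over the k
                train = [i for i in order[:s] if i != v]  # nearest neighbors
                clf = clone(base_clf).fit(X_hist[train], y_hist[train])
                err += clf.predict(X_hist[[v]])[0] != y_hist[v]
            if err / k < best_err:                    # step 4: minimum error size
                best_err, best_size = err / k, s
        return order[:best_size]                      # step 5: training indices

The final decision is then cast by a classifier trained on X_hist[fish2_select(...)]; with step > 1 the search over candidate sizes is coarser but cheaper.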

4 Experimental evaluation

To support the viability of FISH2/FISH, we test them on real datasets along with a peer group of algorithms: Klinkenberg et al. [11] (KLI) and Tsymbal et al. [19] (TSY). The Klinkenberg algorithm tries out a set of different training windows and selects the one which shows the best accuracy on the most recent training data. The Tsymbal algorithm builds a number of classifiers on different consecutive training subsets and uses similarity to the target observation to select the final classifier. Both use windowing to form the base classifiers; in contrast, FISH2 builds the multiple classifiers using systematic instance selection. The motivation for choosing this peer group is to highlight the effect of the systematic instance selection performed by FISH and FISH2: the chosen algorithms use a similar framework based on multiple classifiers, employ no explicit change detection, and are base classifier independent.

We include the all-the-history method (ALL) as a benchmark: at every time step the classifier is retrained using all the past data. If the data happens to be stationary, ALL should be the most accurate.

We test the methods using four base classifiers: the parametric Nearest Mean Classifier (NMC), the non-parametric k Nearest Neighbors classifier (kNN), for which we take k = 7, the Parzen Window Classifier (PWC), and an unpruned decision tree (tree).

4.1 Datasets

We use three real datasets with expected concept drift, three real datasets with artificially introduced drift, and one synthetic dataset for illustration of the concept.

Luxembourg data (LU) was constructed using the European Social Survey¹ [10], 2002–2007. The task is to classify a subject with respect to internet usage, 'high' or 'low'. We use 20 features (31 after transformation of the categorical variables), selected as a general demographic representation. The set is balanced, 977 + 924.

Ozone level detection [1] (Ozone) represents local ozone peak prediction based on eight-hour measurements. The data size is 24 × 2534. The set is highly imbalanced, 160 + 2374, with only 160 ozone peaks.

Electricity market data (Elec) was first described by Harries [8]. We use the time period with no missing values, comprising 6 × 2956 instances collected from May 11 to July 11, 1997. The labels 'up' and 'down' indicate the change of the price. The set is moderately balanced, 1673 + 1283.

German credit approval (Cred) [1] classifies customers as having good or bad credit risk. Following [13], a gradual concept change was introduced artificially as a hidden context: we sort the data using one of the features (the feature 'age' was chosen) and then eliminate this feature from the dataset, as illustrated in the sketch below. The data size is 23 × 1000. The set is imbalanced, 700 + 300.

Vote data [1] (Vote) represents the 1984 United States Congressional Voting Records. The instances represent 435 congressmen (267 Democrats, 168 Republicans). There are 16 features (votes). The data is categorical; we coded 'no' as −1, 'yes' as 1, and a missing value as 0.

Ionosphere data [1] (Iono) is the Johns Hopkins University Ionosphere dataset: a binary classification task of radar returns into 'good' and 'bad' signals. The data size is 43 × 351. The set is moderately balanced, 136 + 215.

In the Iono and Vote datasets the concept drift is artificial, since we assume the data is presented in a time order and use a sequential training-testing procedure. How can we know that there is concept drift in these datasets? If, using this procedure, the concept drift responsive methods outperform the all-the-history method, this can be treated as evidence of drift (non-stationarity of the sequential data).

We generate the Hyperplane data (Hyp) by rotating the decision boundary, as shown in Section 3.2. We use a 30° rotation for each concept and have five concepts in total, 0°, . . . , 90°. We generate 10 instances from each concept and 40 instances from each mix of neighboring concepts, with linearly increasing sampling probability. The size of the data is 65 + 95.

¹ Norwegian Social Science Data Services (NSD) acts as the data archive and distributor of the ESS data.
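The hidden-context construction used for Cred (following [13]) amounts to ordering by a feature and then hiding it; a minimal sketch, assuming numpy arrays and the column index of the sorting feature:

    import numpy as np

    def hidden_context_drift(X, y, feat):
        # Order the stream by the chosen feature (e.g. 'age'), so the hidden
        # context drifts gradually, then remove that feature from the data.
        order = np.argsort(X[:, feat], kind="stable")
        return np.delete(X[order], feat, axis=1), y[order]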

4.2 Implementation details

For FISH and TSY we use training set size N = 40; FISH2 and KLI have adaptable set sizes. KLI and TSY operate in batch mode; we use batch size 15 for both. For TSY we use the following parameters: maximum ensemble size M = 7 and the number of nearest neighbors k = 7 for error estimation. The weights used for the time and space similarity in FISH/FISH2 were a1 : a2 = 1 : 1 for all the data. For the Ozone and Elec data the backward search for FISH, FISH2, KLI and TSY was limited to 1100 instances to reduce the complexity of the experiment. For testing with the decision tree on the Elec and Ozone data we subsampled, taking every 5th instance, to speed up the experiments.

4.3 Algorithm Evaluation

We evaluate the performance of FISH based on the testing error and the complexity. To evaluate the accuracy, we calculate the ranks of the peer methods: the best method for a given dataset is ranked 1, the worst is ranked 5, so the ranks for each dataset sum to 15. An average rank over all the datasets is calculated for each base classifier and used as the performance measure. We exclude the synthetic dataset (Hyperplane) from the ranking.

To evaluate the applicability, we calculate the worst case and the average complexity of all five peer methods, counting the number of data passes required to make a classification decision for one observation at time t. The results (approximations) are presented in Table 1, together with the hyperparameters that need to be preset for each algorithm.

Table 1. Algorithm complexities. b - batch size; M - ensemble size; k - testing neighborhood size; N - training set size; A - time/space weight; t - time since the start of the sequence.

Method   Worst case            Average           Hyperparameters
ALL      t                     the same          −
KLI      tb/2 + t^2/2          t/2 + t^2/(2b)    b
TSY      t(k+1) + N            t(k+1) + N/b      b, M, k, N
FISH     t(N+2) + N(2−N)/4     the same          A, N
FISH2    t(k+2)/2 + k*t^2/2    the same          A, k
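The ranking described above is plain arithmetic; a sketch reproducing the rank columns of Table 2 from a matrix of testing errors (ties averaged, which is one reasonable convention — the paper does not state its tie rule):

    import numpy as np
    from scipy.stats import rankdata

    def average_ranks(errors):
        # errors: (n_datasets, n_methods) testing errors for one base
        # classifier; rank 1 is best (lowest error), ties share the mean rank
        return rankdata(errors, axis=1).mean(axis=0)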

The run time is reasonable for sequential data: for all five algorithms it takes up to 1 minute to cast a classification decision for one time observation with NMC, kNN and PWC, and about 5 times longer with the decision tree, on a 1.46 GHz PC with 1 GB RAM. MATLAB 7.5 was used for the implementation.

Finance and biomedical applications are domains where the data is scarce and imbalanced, while concept drift is very relevant. For example, in bankruptcy prediction an observation might be received once per day or even per week, while the model needs to be constantly updated and economic cycles imply concept drift.

Table 2. Testing errors. Symbol '•' indicates that the method is significantly worse than FISH2, '◦' indicates that the method is significantly better than FISH2, and '−' indicates no significant difference (at α = 0.05).

base  method  Hyp    LU        Ozone     Elec      Cred      Vote      Iono      rank
NMC   FISH2   13.21   8.05     20.77     17.60     29.03      7.60     16.29     1.67
      FISH    11.95  10.37 •    8.49 ◦   17.66 −   28.83 −    7.14 −   16.57 −   1.67
      KLI     15.09  29.63 •   32.04 •   19.22 •   36.24 •   11.29 •   21.71 •   3.25
      TSY     13.84  35.89 •   56.45 •   15.47 ◦   40.64 •   11.29 •   23.43 •   3.58
      ALL     14.47  39.68 •   88.24 •   24.84 •   37.84 •   11.52 •   31.71 •   4.83
kNN   FISH2   13.21  13.58      6.83     17.70     29.03      8.53     21.71     2.42
      FISH    12.58  16.47 •    6.83 −   17.70 −   28.93 −    8.06 −   21.71 −   2.25
      KLI     16.35  16.26 •    7.11 −   17.56 −   30.03 −    9.86 −   22.86 −   3.67
      TSY     14.47  28.79 •    7.03 −   13.16 ◦   31.43 −   10.60 −   23.14 −   4.17
      ALL     15.72  11.84 ◦    6.99 −   19.86 •   28.83 −    8.29 −   22.29 −   2.50
PWC   FISH2   13.21  11.63     77.99     41.18     34.33      8.53     12.86     1.67
      FISH    12.58  15.16 •   86.89 •   43.62 −   34.43 −    8.76 −   12.57 −   3.08
      KLI     16.35  14.16 •   65.08 ◦   44.53 •   34.63 −   10.37 −   15.14 •   3.67
      TSY     15.72  26.42 •   73.33 ◦   43.62 −   36.54 −    9.68 −   19.71 •   4.00
      ALL     15.72  11.68 −   86.42 •   43.62 −   34.43 −    8.53 −   12.86 •   2.58
tree  FISH2   15.09   0.37     13.24     22.00     31.33      7.14     15.43     2.00
      FISH    15.72   0.37 −   17.57 •   23.01 −   32.13 −    7.14 −   21.43 •   3.58
      KLI     16.98   0.37 −   15.12 −   21.66 −   36.34 •    9.68 •   20.86 •   3.67
      TSY     15.72   0.37 −   14.82 −   21.15 −   37.04 •   10.14 •   20.57 •   3.33
      ALL     17.61   0.37 −   13.24 •   21.32 −   32.83 −    7.83 −   18.57 −   2.42

In supermarket stock management, the stock quantity needs to be predicted once per week, so only 52 observations are received per year. In such applications even one hour of algorithm run time per decision would not be an issue.

5 Results and Discussion

The experimental results of the five algorithms using four alternative base classifiers on seven datasets (5 × 4 × 7 = 140 experiments) are provided in Table 2. In order to estimate the statistical significance of the differences between the error rates of two methods, we used the McNemar [14] paired test, which does not require the assumption of an i.i.d. origin of the data. The five methods were ranked with respect to each dataset as described in Section 4.3, and the ranks were then averaged (last column in Table 2).

FISH2 has the best rank by a large margin with the PWC and tree classifiers; with kNN, FISH prevails, and with NMC, FISH and FISH2 perform equally. The final scores averaged over all four base classifiers are: 1.94 for FISH2, 2.65 for FISH, 3.56 for KLI, 3.77 for TSY and 3.08 for ALL.

Using kNN, PWC and tree as base classifiers, the ALL method outperforms TSY and KLI according to the rank score. This implies that, under this setting,

there would be little point in employing those concept drift responsive methods and increasing the complexity, as simple retraining (ALL) would do well. FISH is outperformed by ALL when using a tree; its results are worst on the Ozone and Iono datasets, which are highly imbalanced. FISH handles class imbalance explicitly, but the tree classifier itself already handles class imbalance well. The results in favor of FISH2 are mostly significant using PWC and tree as base classifiers. Some of the results indicate no statistical difference, as the datasets are not large.

The FISH2 method is designed to work where concept drift is not clearly expressed, i.e., in situations of gradual drift and reoccurring contexts. The ALL method outperforms all the drift responsive methods except FISH2 when kNN is the base classifier. The Ozone data is highly imbalanced; the FISH method selects training samples from the two classes separately, and it significantly outperforms the other methods when using a parametric base classifier (NMC). Classifying Ozone according to the highest prior would give a 6.74% misclassification rate; however, this would make no sense, as no ozone peaks would be identified at all. Windowing methods work well on the Elec data, because the drifts in this data are more sudden. The Elec data also shows the biggest need for concept drift adaptive methods, because on this dataset the ALL method performs relatively the worst in the peer group.

The credit for the performance of FISH2 in the peer group should be given to the similarity-based training set selection. KLI addresses only similarity in time (a training window). TSY uses only similarity in time for classifier training, and then similarity in space for classifier selection. We employ a combination of time and space similarity already in the classifier building phase.
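For reference, the McNemar paired test on two prediction sequences reduces to a few lines; a sketch with the common continuity-corrected statistic (whether [14] applies the correction is not stated here):

    from scipy.stats import chi2

    def mcnemar_p(y_true, pred_a, pred_b):
        # b: only method A errs; c: only method B errs (discordant pairs)
        b = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
        c = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
        if b + c == 0:
            return 1.0                                # identical error patterns
        stat = (abs(b - c) - 1) ** 2 / (b + c)        # continuity-corrected
        return 1 - chi2.cdf(stat, df=1)               # p-value, 1 dof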

6 Conclusion

We presented a concept drift responsive method for classifier training based on selecting instances similar to the target observation to form a training set. The integration of time and space similarity is a conceptual contribution to data mining under concept drift: giving zero weight to the distance in space yields the sliding window approach, while giving zero weight to the distance in time yields the instance selection approach used in multiple classifier systems for stationary cases. In this study a 1 : 1 balance was used; weight selection is a subject for further investigation. In the future, the role of the similarity measure in the FISH algorithm family will be addressed as well.

The FISH2 method shows the best accuracy in the peer group on the datasets exhibiting gradual drifts and a mixture of several concepts, and its complexity is reasonable for field applications. In this pilot study we did not focus on the distance measure; the study can be extended by examining which similarity measures are particularly suitable for drifting data.

References

1. A. Asuncion and D. Newman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
2. C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
3. J. Beringer and E. Hüllermeier. Efficient instance-based learning on data streams. Intell. Data Anal., 11(6):627–650, 2007.
4. A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In SDM, pages 443–448. SIAM, 2007.
5. W. Fan. Systematic data selection to mine concept-drifting data streams. In Proc. 10th ACM SIGKDD, pages 128–137. ACM, 2004.
6. J. Gama, P. Medas, G. Castillo, and P. P. Rodrigues. Learning with drift detection. In 17th Brazilian Symposium on AI, LNCS, pages 286–295, 2004.
7. V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining data streams under block evolution. SIGKDD Explor. Newsl., 3(2):1–10, 2002.
8. M. Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
9. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD '01: Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, New York, NY, USA, 2001. ACM.
10. R. Jowell and the Central Co-ordinating Team. European social survey 2002/2003; 2004/2005; 2006/2007. Technical Reports, London: Centre for Comparative Social Surveys, City University, 2003, 2005, 2007.
11. R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proc. 17th ICML, pages 487–494. Morgan Kaufmann, 2000.
12. J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. J. of Machine Learning Research, 8:2755–2790, Dec 2007.
13. I. Koychev. Gradual forgetting for adaptation to concept drift. In ECAI 2000 Workshop on Current Issues in Spatio-Temporal Reasoning, pages 101–106, 2000.
14. L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.
15. M. Lazarescu and S. Venkatesh. Using selective memory to track concept effectively. In Proc. of the Int. Conf. on Intelligent Systems and Control, pages 14–20, 2003.
16. K. Nishida and K. Yamauchi. Detecting concept drift using statistical testing. In Discovery Science, LNCS, pages 264–269, 2007.
17. S. Raudys and A. Mitasiunas. Multi-agent system approach to react to sudden environmental changes. In Proc. 5th MLDM, LNCS, pages 810–823, 2007.
18. W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proc. 7th ACM SIGKDD, pages 377–382. ACM, 2001.
19. A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Dynamic integration of classifiers for handling concept drift. Inf. Fusion, 9(1):56–68, 2008.
20. I. Žliobaitė. Instance selection method (FISH) for classifier training under concept drift. Technical report 2009-01, Vilnius University, 2009.
21. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. 9th ACM SIGKDD, pages 226–235. ACM, 2003.
22. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
