Learning theory and algorithms for shapelets and other local features Daiki Suehiro Kyushu University and AIP, RIKEN Fukuoka, Japan, 8190395 [email protected] Eiji Takimoto Kyushu University Fukuoka, Japan, 8190395 [email protected] Kenichi Bannai Keio University and AIP, RIKEN Kanagawa, Japan, 2238522 [email protected]

Kohei Hatano Kyushu University and AIP, RIKEN Fukuoka, Japan, 8190395 [email protected] Shuji Yamamoto Keio University and AIP, RIKEN Kanagawa, Japan, 2238522 [email protected]

Akiko Takeda The Institute of Statistical Mathematics and AIP, RIKEN Tokyo, Japan, 1908562 [email protected]

Abstract We consider binary classification problems using local features of objects. One of the motivating applications is time-series classification, where features reflecting some local similarity measure between a time-series and a pattern sequence called a shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of the resulting hypotheses is not fully understood, and previous work relies on some heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of the features is controlled by kernels. We derive generalization bounds for sparse ensembles over the class which are exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach, and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves accuracy competitive with state-of-the-art algorithms with a small parameter-tuning cost.

1

Introduction

Classifying objects using their “local” patterns is often effective in various applications. For example, in time-series classification problems, a local feature called a shapelet is shown to be quite powerful in the data mining literature [13, 7, 5, 3]. More precisely, a shapelet $z = (z_1, \ldots, z_\ell)$ is a real-valued “short” sequence in $\mathbb{R}^\ell$ for some $\ell > 1$. Given a time-series $x = (x_1, \ldots, x_L) \in \mathbb{R}^L$, a typical measure of similarity between the time-series $x$ and the shapelet $z$ is $\min_{j=1}^{Q} \|x_{j:j+\ell-1} - z\|_2$, where $Q = L - \ell + 1$ and $x_{j:j+\ell-1} = (x_j, \ldots, x_{j+\ell-1})$. Here, the measure focuses on “local” similarity between the time-series and the shapelet. In many time-series classification problems, sparse combinations of features based on the similarity to some shapelets are useful [4, 8, 6]. Similar situations could happen in other applications. Say, for image classification problems, template

matching is a well-known technique to measure the similarity between an image and a “(small) template image” in a local sense. Despite the empirical success of applying local features, theoretical guarantees of such approaches are not fully investigated. In particular, trade-offs between the richness of such local features and the generalization ability are not characterized yet. In this paper, we formalize a class of hypotheses based on some local similarity. Here, the richness of the class is controlled by associated kernels. We show generalization bounds for ensembles of such local classifiers. Our bounds are exponentially tighter in terms of some parameter than typical bounds obtained by a standard analysis. Further, for learning ensembles of the kernelized hypotheses, our theoretical analysis suggests an optimization problem with infinitely many parameters, for which the dual problem is categorized as a semi-infinite program [see 10]. To obtain approximate solutions of the problem efficiently, we take the approach of boosting [9]. In particular, we employ LPBoost [2], which solves soft margin optimization with 1-norm regularization via a column generation approach. As a result, our approach has two stages, where the master problem is a linear program and the sub-problems are difference of convex programs (DC programs), which are non-convex. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques have been investigated for DC programs and we can find good approximate solutions efficiently in many cases in practice. In preliminary experiments on time-series data sets, our method achieves accuracy competitive with state-of-the-art algorithms using shapelets. While the previous algorithms need careful parameter tuning and heuristics, our method needs less parameter tuning. Our technical contributions are as follows: • We give a general framework for learning classification models based on local features.

• We give a theoretical generalization bound for hypothesis classes based on local features. • We give a formulation and an algorithm for learning local features.

• We show competitive performance through experiments on time-series classification.

2

Preliminaries

Let $\mathcal{P} \subseteq \mathbb{R}^\ell$ be a set, in which an element is called a pattern. Our instance space $\mathcal{X}$ is a set of sequences of patterns in $\mathcal{P}$. For simplicity, we assume that every sequence in $\mathcal{X}$ is of the same length. That is, $\mathcal{X} \subseteq \mathcal{P}^Q$ for some integer $Q$. We denote by $x = (x^{(1)}, \ldots, x^{(Q)})$ an instance sequence in $\mathcal{X}$, where every $x^{(j)}$ is a pattern in $\mathcal{P}$. The learner receives a labeled sample $S = (((x_1^{(1)}, \ldots, x_1^{(Q)}), y_1), \ldots, ((x_m^{(1)}, \ldots, x_m^{(Q)}), y_m)) \in (\mathcal{X} \times \{-1, +1\})^m$ of size $m$, where each labeled instance is independently drawn according to some unknown distribution $D$ over $\mathcal{X} \times \{-1, +1\}$. Let $K$ be a kernel over $\mathcal{P}$, which is used to measure the similarity between patterns, and let $\phi: \mathcal{P} \to \mathcal{H}$ denote a feature map associated with the kernel $K$ for a Hilbert space $\mathcal{H}$. That is, $K(z, z') = \langle \phi(z), \phi(z') \rangle$ for patterns $z, z' \in \mathcal{P}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product over $\mathcal{H}$. The norm induced by the inner product is denoted by $\|\cdot\|_{\mathcal{H}}$ and satisfies $\|u\|_{\mathcal{H}} = \sqrt{\langle u, u \rangle}$ for $u \in \mathcal{H}$.

For each $u \in \mathcal{H}$, we define the base classifier (or the feature), denoted by $h_u$, as the function that maps a given sequence $x = (x^{(1)}, \ldots, x^{(Q)}) \in \mathcal{X}$ to the maximum of the similarity scores between $u$ and $x^{(j)}$ over all patterns $x^{(j)}$ in $x$. More specifically, $h_u(x) = \max_{j \in [Q]} \langle u, \phi(x^{(j)}) \rangle$, where $[Q]$ denotes the set $\{1, 2, \ldots, Q\}$. For a set $U \subseteq \mathcal{H}$, we define the class of base classifiers as $H_U = \{h_u \mid u \in U\}$, and we denote by $\mathrm{conv}(H_U)$ the set of convex combinations of base classifiers in $H_U$. More precisely,
$$\mathrm{conv}(H_U) = \left\{ \sum_{u \in U'} w_u h_u(x) \;\middle|\; \forall u \in U',\ w_u \ge 0,\ \sum_{u \in U'} w_u = 1,\ U' \subseteq U \text{ is a finite support} \right\}.$$
The goal of the learner is to find a final hypothesis $g \in \mathrm{conv}(H_U)$, so that its generalization error $E_D(g) = \Pr_{(x,y) \sim D}[\mathrm{sign}(g(x)) \ne y]$ is small.

Example: Learning with time-series shapelets. For a typical setting of learning with time-series shapelets, an instance is a sequence of real numbers $x = (x_1, x_2, \ldots, x_L) \in \mathbb{R}^L$ and a base classifier $h_s$, which is associated with a shapelet (i.e., a “short” sequence of real numbers) $s = (s_1, s_2, \ldots, s_\ell) \in \mathbb{R}^\ell$, is defined as $h_s(x) = \max_{1 \le j \le L-\ell+1} K(s, x^{(j)})$, where $x^{(j)} = (x_j, x_{j+1}, \ldots, x_{j+\ell-1})$ is the subsequence of $x$ of length $\ell$ that begins at the $j$-th index. In our framework, this corresponds to the case where $Q = L - \ell + 1$, $\mathcal{P} \subseteq \mathbb{R}^\ell$, and $U = \{\phi(s) \mid s \in \mathcal{P}\}$.
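As a concrete illustration of the base classifier above, the short sketch below computes $h_s(x)$ for a single shapelet $s$ under a Gaussian kernel. The function names, the bandwidth value, and the toy data are our own choices for illustration and are not part of the paper's specification.

```python
import numpy as np

def sliding_windows(x, ell):
    """All length-ell subsequences x^(1), ..., x^(Q) of a time series x, with Q = L - ell + 1."""
    L = len(x)
    return np.array([x[j:j + ell] for j in range(L - ell + 1)])

def gaussian_kernel(z, zp, sigma=1.0):
    """K(z, z') = exp(-sigma * ||z - z'||^2)."""
    return np.exp(-sigma * np.sum((np.asarray(z) - np.asarray(zp)) ** 2))

def base_classifier(s, x, sigma=1.0):
    """h_s(x) = max_j K(s, x^(j)): the best local similarity between shapelet s and series x."""
    return max(gaussian_kernel(s, w, sigma) for w in sliding_windows(x, len(s)))

# toy usage: score one candidate shapelet against one time series
x = np.sin(np.linspace(0, 6 * np.pi, 100))   # time series of length L = 100
s = np.sin(np.linspace(0, np.pi, 10))        # candidate shapelet of length ell = 10
print(base_classifier(s, x, sigma=0.5))
```

Stacking such scores for a pool of candidate patterns yields the feature representation on which the ensembles of the later sections are learned.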

3

Risk bounds of the hypothesis classes

We show our main theoretical result on the generalization bound of $\mathrm{conv}(H_U)$. The details of our theorem (lemmas and proofs) are shown in the supplementary material. Given a sample $S$, let $\mathcal{P}_S$ be the set of patterns appearing in $S$ and let $\phi(\mathcal{P}_S) = \{\phi(z) \mid z \in \mathcal{P}_S\}$. Let $\phi_{\mathrm{diff}}(\mathcal{P}_S) = \{\phi(z) - \phi(z') \mid z, z' \in \mathcal{P}_S, z \ne z'\}$. By viewing each instance $v \in \phi_{\mathrm{diff}}(\mathcal{P}_S)$ as a hyperplane $\{u \mid \langle v, u \rangle = 0\}$, we can naturally define a partition of the Hilbert space $\mathcal{H}$ by the set of all hyperplanes $v \in \phi_{\mathrm{diff}}(\mathcal{P}_S)$. Let $\mathcal{I}$ be the set of all cells of the partition. Each cell $I \in \mathcal{I}$ is a polyhedron defined by a minimal set $V_I \subseteq \phi_{\mathrm{diff}}(\mathcal{P}_S)$ that satisfies $I = \bigcap_{v \in V_I}\{u \mid \langle u, v \rangle \ge 0\}$. Let $\mu^* = \min_{I \in \mathcal{I}} \max_{u \in I \cap U} \min_{v \in V_I} |\langle u, v \rangle|$. Let $d^*_{\phi,S}$ be the VC dimension of the set of linear classifiers over the finite set $\phi_{\mathrm{diff}}(\mathcal{P}_S)$, given by $F_U = \{f: v \mapsto \mathrm{sign}(\langle u, v \rangle) \mid u \in U\}$. Then, our main result is stated as follows:

Theorem 1. Suppose that for any $z \in \mathcal{P}$, $\|\phi(z)\|_{\mathcal{H}} \le R$. Then, for any $\rho > 0$ and $\delta$ ($0 < \delta < 1$), with probability at least $1 - \delta$, the following holds for any $g \in \mathrm{conv}(H_U)$ with $U \subseteq \{u \in \mathcal{H} \mid \|u\|_{\mathcal{H}} \le \Lambda\}$:
$$E_D(g) \le E_\rho(g) + O\left( \frac{R\Lambda\sqrt{d^*_{\phi,S}\log(mQ)}}{\rho\sqrt{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{m}} \right), \qquad (1)$$
where $E_\rho(f)$ is the empirical margin loss of $f$ over $S$ of size $m$, i.e., the ratio of examples for which $f$ has margin $y_i f(x_i) < \rho$. In particular, (i) for any $\phi$, $d^*_{\phi,S} \le R\Lambda/\mu^*$ holds in general; (ii) if $\phi$ is the identity mapping (i.e., the associated kernel is the linear kernel), or (iii) if $\phi$ satisfies that $\langle \phi(z), \phi(x) \rangle$ is monotone decreasing with respect to $\|z - x\|_2$ (e.g., the mapping defined by the Gaussian kernel) and $U = \{\phi(z) \mid z \in \mathcal{P} \subseteq \mathbb{R}^\ell, \|\phi(z)\|_{\mathcal{H}} \le \Lambda\}$, then $d^*_{\phi,S}$ can be upper bounded by $\ell$.

The above bound is exponentially tighter than typical bounds obtained by a standard analysis (see the supplementary material). This result shows that a large $Q$ does not significantly affect the generalization performance, because the dependence on $Q$ is only logarithmic. For example, in time-series classification problems, the length of the time series does not affect generalization much.

4

Optimization problem formulation

In this section, we formulate an optimization problem to learn ensembles in $\mathrm{conv}(H_U)$ for $U \subseteq \{u : \|u\|_{\mathcal{H}} \le \Lambda\}$ using hypotheses in $H_U$. In particular, the problem formulation we suggest in this paper is based on soft margin maximization with 1-norm regularization. The soft margin optimization with 1-norm regularization is a standard formulation which tries to minimize the generalization bound [see, e.g., 2]. The optimization problem is a linear programming problem as follows:
$$\min_{\gamma, d}\ \gamma \quad \text{sub.to} \quad \sum_{i=1}^{m} y_i d_i h_u(x_i) \le \gamma,\ u \in U, \quad 0 \le d_i \le 1/\nu m\ (i \in [m]), \quad \sum_{i=1}^{m} d_i = 1, \quad \gamma \in \mathbb{R}. \qquad (2)$$

Our approach is to approximately solve the problem for $U = \{u : \|u\|_{\mathcal{H}} \le \Lambda\}$ by solving sub-problems over a finite subset $U' \subset U$. Such an approach is called column generation in linear programming, which adds a new constraint (column) to the dual problem and solves it iteratively. LPBoost [2] is a well-known example of this approach in the boosting setting. At each iteration, LPBoost chooses a hypothesis $h_u$ so that $h_u$ maximally violates the constraints in the current dual problem. This sub-problem is called weak learning in the boosting setting and is formulated as follows:

$$\max_{\alpha}\ \sum_{p=1}^{m} y_p d_p \max_{j \in [Q]} \sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right) \quad \text{sub.to} \quad \sum_{i,k=1}^{m}\sum_{j,l=1}^{Q}\alpha_{ij}\alpha_{kl} K\!\left(x_i^{(j)}, x_k^{(l)}\right) \le \Lambda^2. \qquad (3)$$

The optimization problem (3) is a difference of convex functions (DC) programming problem, and we can obtain a local optimum $\epsilon$-approximately by using the DC Algorithm [11]. The details of the derivation of our formulation and of our algorithms are shown in the supplementary material.
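For concreteness, the sketch below solves the restricted version of problem (2), i.e., the LPBoost master problem over a finite set of already-generated base hypotheses, with scipy.optimize.linprog. The matrix layout and the use of SciPy are our own choices (the experiments in this paper use CPLEX), so this is a sketch rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_master(H, y, nu):
    """Restricted dual LP (2): minimize gamma subject to
         sum_i y_i d_i H[t, i] <= gamma  for every generated hypothesis t,
         0 <= d_i <= 1/(nu*m),  sum_i d_i = 1.
    H is a (T, m) matrix of base-hypothesis outputs, H[t, i] = h_t(x_i)."""
    T, m = H.shape
    c = np.zeros(m + 1)
    c[-1] = 1.0                                            # objective: gamma
    A_ub = np.hstack([H * y, -np.ones((T, 1))])            # sum_i y_i d_i h_t(x_i) - gamma <= 0
    b_ub = np.zeros(T)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum_i d_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0 / (nu * m))] * m + [(None, None)]  # box on d, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    d, gamma = res.x[:m], res.x[-1]
    # the ensemble weights w_t correspond to the dual multipliers of the T inequality rows
    # (available as res.ineqlin.marginals with the HiGHS backend, up to sign convention)
    return d, gamma, res
```

At each boosting round the weak learner of problem (3) is solved with the current distribution d, the returned hypothesis is appended as a new row of H, and the LP is re-solved; this is exactly the column generation loop.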

5

Experiments

We use the UCR datasets [1], which are often used as benchmark datasets for time-series classification methods. The detailed information of the datasets is described in the left side of Table 1. The details of our experiments are shown in the supplementary material.

Classification accuracy. The results on classification accuracy are shown in the right side of Table 1. We refer to the experimental results of [6] with regard to the accuracies of the following three state-of-the-art algorithms: LTS [3], IG-SVM [5], and FLAG [6]. It is reported that these algorithms achieve high accuracy in practice. As shown in Table 1, our algorithm had performance competitive with these algorithms.

Parameter-tuning cost. We briefly mention parameter-tuning costs. For example, as seen in Table 1, LTS is known as one of the most accurate algorithms in practice. LTS requires seven input hyper-parameters; however, it is said that LTS is highly sensitive to several of them [see also 12]. LTS is a gradient-based optimization algorithm, which needs proper initial shapelets; these are heuristically calculated by the k-means of subsequences. It is quite unstable since the initial shapelets depend on the k-means algorithm. We also need to fix k, which specifies the number of shapelets. On the other hand, in our algorithm, the shapelet-like local patterns are automatically computed via the theoretically motivated formulation. Furthermore, the number of iterations and the learning rate of the gradient descent in LTS are also delicate parameters. These parameters are hard to control since the gradient can explode or vanish. Our optimization problem can be solved stably because the boosting approach always converges to a solution even if the DC weak learning algorithm returns local optima. Our algorithm requires only three parameters: $\ell$ (in common with LTS), $\nu$, and $\sigma$ of the Gaussian kernel ($\Lambda$ and $\epsilon$ are fixed). The parameter $\nu$ is easy to tune since it gives an upper bound on the training error. Thus, we can say that our algorithm runs stably with a lower parameter-tuning cost.

Table 1: The detailed information of the datasets and classification accuracies (%). The accuracies of IG-SVM, LTS, and FLAG are taken from the results of [6].

dataset        #train  #test  length  #classes  IG-SVM  LTS    FLAG   our method
Adiac          390     391    176     37        23.5    51.9   75.2   64.9
Beef           30      30     470     5         90.0    76.7   83.3   66.7
Chlorine.      467     3840   166     3         57.1    73.0   76.0   68.7
Coffee         28      28     286     2         100.0   100.0  100.0  100.0
ECGFiveDays    23      861    136     2         99.0    100.0  92.0   100.0
FaceFour       24      88     350     4         97.7    94.3   90.9   89.1
Gun-Point      50      150    150     2         100.0   99.6   96.7   99.3
ItalyPower.    67      1029   24      2         93.7    95.8   94.6   93.8
Lightning7     70      73     319     7         63.0    79.0   76.7   57.3
MedicalImages  381     760    99      10        52.2    71.3   71.4   62.7
MoteStrain     20      1252   84      2         88.7    90.0   88.8   76.5
Sony.          20      601    70      2         92.7    91.0   92.9   94.8
Symbols        25      995    398     6         84.6    94.5   87.5   92.3
SyntheticC.    300     300    60      6         87.3    97.3   99.7   97.8
Trace          100     100    275     4         98.0    100.0  99.0   98.6
TwoLeadECG     23      1139   82      2         100.0   100.0  99.0   92.7

References
[1] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
[2] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254, 2002.

[3] Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, 2014. [4] Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time-series shapelets. CoRR, abs/1503.03238, 2015. [5] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851– 881, July 2014. [6] Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,, pages 1209–1215, 2016. [7] Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, pages 668–676, 2013. [8] Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA’2015), pages 1–10. IEEE, 2015. [9] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wen Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. [10] Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality conditions. Optimization, 58(2):133–161, 2009. [11] Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkhäuser Basel, Basel, 1988. [12] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018, 2015. [13] Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956. ACM, 2009.


Supplementary material (full version): Learning theory and algorithms for shapelets and other local features Daiki Suehiro Kyushu University and AIP, RIKEN Fukuoka, Japan, 8190395 [email protected] Eiji Takimoto Kyushu University Fukuoka, Japan, 8190395 [email protected] Kenichi Bannai Keio University and AIP, RIKEN Kanagawa, Japan, 2238522 [email protected]

Kohei Hatano Kyushu University and AIP, RIKEN Fukuoka, Japan, 8190395 [email protected] Shuji Yamamoto Keio University and AIP, RIKEN Kanagawa, Japan, 2238522 [email protected]

Akiko Takeda The Institute of Statistical Mathematics and AIP, RIKEN Tokyo, Japan, 1908562 [email protected]

Abstract We consider binary classification problems using local features of objects. One of the motivating applications is time-series classification, where features reflecting some local similarity measure between a time-series and a pattern sequence called a shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of the resulting hypotheses is not fully understood, and previous work relies on some heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of the features is controlled by kernels. We derive generalization bounds for sparse ensembles over the class which are exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach, and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves accuracy competitive with state-of-the-art algorithms with a small parameter-tuning cost.

1

Introduction

Classifying objects using their “local” patterns is often effective in various applications. For example, in time-series classification problems, a local feature called a shapelet is shown to be quite powerful in the data mining literature [23, 10, 7, 5]. More precisely, a shapelet $z = (z_1, \ldots, z_\ell)$ is a real-valued “short” sequence in $\mathbb{R}^\ell$ for some $\ell > 1$. Given a time-series $x = (x_1, \ldots, x_L) \in \mathbb{R}^L$, a typical measure of similarity between the time-series $x$ and the shapelet $z$ is $\min_{j=1}^{Q} \|x_{j:j+\ell-1} - z\|_2$, where $Q = L - \ell + 1$ and $x_{j:j+\ell-1} = (x_j, \ldots, x_{j+\ell-1})$. Here, the measure focuses on “local” similarity between the time-series and the shapelet. In many time-series classification problems, sparse combinations of features based on the similarity to some shapelets are useful [6, 14, 8]. Similar

situations could happen in other applications. Say, for image classification problems, template matching is a well-known technique to measure the similarity between an image and a “(small) template image” in a local sense. Despite the empirical success of applying local features, theoretical guarantees of such approaches are not fully investigated. In particular, trade-offs between the richness of such local features and the generalization ability are not characterized yet. In this paper, we formalize a class of hypotheses based on some local similarity. Here, the richness of the class is controlled by associated kernels. We show generalization bounds for ensembles of such local classifiers. Our bounds are exponentially tighter in terms of some parameter than typical bounds obtained by a standard analysis. Further, for learning ensembles of the kernelized hypotheses, our theoretical analysis suggests an optimization problem with infinitely many parameters, for which the dual problem is categorized as a semi-infinite program [see 18]. To obtain approximate solutions of the problem efficiently, we take the approach of boosting [15]. In particular, we employ LPBoost [4], which solves soft margin optimization with 1-norm regularization via a column generation approach. As a result, our approach has two stages, where the master problem is a linear program and the sub-problems are difference of convex programs (DC programs), which are non-convex. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques have been investigated for DC programs and we can find good approximate solutions efficiently in many cases in practice. In preliminary experiments on time-series data sets, our method achieves accuracy competitive with the state-of-the-art algorithms using shapelets. While the previous algorithms need careful parameter tuning and heuristics, our method needs less parameter tuning and its parameters can be determined in an organized way. In addition, our solutions tend to be sparse and could be useful for domain experts to select good local features. 1.1

Related work

Approaches based on similarity between instances. Classification methods based on similarity between instances are quite standard. Such methods include the nearest neighbors using the Euclidean distance, Dynamic Time Warping (DTW), and so on. Kernel-based methods such as SVMs are also examples of the similarity-based approach. However, in many domains it is not the similarity between instances, but the similarity between a “subsequence” of the instance and a pattern, that matters for classification. For example, template matching is a common way to classify image data, and the bag-of-words model is often used for document classification. In such cases, classification methods based on similarity between instances might not work well. There are previous results using kernels between local features [e.g., 13, 19, 24]. However, their approaches use the sum of kernels over all local features as the similarity between instances, which might not work well when a particular pattern in the instance is relevant.

Shapelet-based approaches. As a method based on local features for time-series classification, the concept of time-series shapelets was first introduced by [23]. Their algorithm finds shapelets by using the information gains of potential candidates associated with all the subsequences of the given time series and constructs a decision tree. Shapelet transform [7] is a technique combining shapelets with machine learning. The authors treat the time-series examples as feature vectors defined by the set of local similarities to some shapelets, and in order to obtain a classification rule, they employ effective learning algorithms such as linear SVMs or random forests. Note that shapelet transform completely separates the phase of searching for shapelets from the phase of creating classification rules. Afterward, many algorithms have been proposed to search for good shapelets efficiently while keeping high prediction accuracy in practice [10, 6, 14, 9]. These algorithms are based on the idea that discriminative shapelets are contained in the training data. This approach, however, might overfit without regularization. The Learning Time-Series Shapelets (LTS) algorithm [5] takes a different approach from such subsequence-based algorithms. LTS approximately solves an optimization problem of learning the best shapelets directly, without searching subsequences in a brute-force way. In contrast to the above subsequence-based methods, LTS finds nearly optimal shapelets and achieves higher prediction accuracy than the other existing methods in practice. However, there is no theoretical guarantee of its generalization error.

1.2

Our contributions

Our technical contributions are as follows: • We give a general framework for learning classification models based on local features. • We give a theoretical generalization bound for hypothesis classes based on local features, the proof of which involves a non-trivial argument. • We give a formulation and an algorithm for learning local features, which contain an interesting collaboration between boosting and kernels. • We introduce several techniques for the efficiency of our algorithm, and show competitive performance through experiments on time-series classification problems.

2

Preliminaries

Let $\mathcal{P} \subseteq \mathbb{R}^\ell$ be a set, in which an element is called a pattern. Our instance space $\mathcal{X}$ is a set of sequences of patterns in $\mathcal{P}$. For simplicity, we assume that every sequence in $\mathcal{X}$ is of the same length. That is, $\mathcal{X} \subseteq \mathcal{P}^Q$ for some integer $Q$. We denote by $x = (x^{(1)}, \ldots, x^{(Q)})$ an instance sequence in $\mathcal{X}$, where every $x^{(j)}$ is a pattern in $\mathcal{P}$. The learner receives a labeled sample $S = (((x_1^{(1)}, \ldots, x_1^{(Q)}), y_1), \ldots, ((x_m^{(1)}, \ldots, x_m^{(Q)}), y_m)) \in (\mathcal{X} \times \{-1, +1\})^m$ of size $m$, where each labeled instance is independently drawn according to some unknown distribution $D$ over $\mathcal{X} \times \{-1, +1\}$. Let $K$ be a kernel over $\mathcal{P}$, which is used to measure the similarity between patterns, and let $\phi: \mathcal{P} \to \mathcal{H}$ denote a feature map associated with the kernel $K$ for a Hilbert space $\mathcal{H}$. That is, $K(z, z') = \langle \phi(z), \phi(z') \rangle$ for patterns $z, z' \in \mathcal{P}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product over $\mathcal{H}$. The norm induced by the inner product is denoted by $\|\cdot\|_{\mathcal{H}}$ and satisfies $\|u\|_{\mathcal{H}} = \sqrt{\langle u, u \rangle}$ for $u \in \mathcal{H}$.

For each $u \in \mathcal{H}$, we define the base classifier (or the feature), denoted by $h_u$, as the function that maps a given sequence $x = (x^{(1)}, \ldots, x^{(Q)}) \in \mathcal{X}$ to the maximum of the similarity scores between $u$ and $x^{(j)}$ over all patterns $x^{(j)}$ in $x$. More specifically,
$$h_u(x) = \max_{j \in [Q]} \left\langle u, \phi\!\left(x^{(j)}\right)\right\rangle,$$
where $[Q]$ denotes the set $\{1, 2, \ldots, Q\}$. For a set $U \subseteq \mathcal{H}$, we define the class of base classifiers as $H_U = \{h_u \mid u \in U\}$, and we denote by $\mathrm{conv}(H_U)$ the set of convex combinations of base classifiers in $H_U$. More precisely,
$$\mathrm{conv}(H_U) = \left\{ \sum_{u \in U'} w_u h_u(x) \;\middle|\; \forall u \in U',\ w_u \ge 0,\ \sum_{u \in U'} w_u = 1,\ U' \subseteq U \text{ is a finite support} \right\}.$$
The goal of the learner is to find a final hypothesis $g \in \mathrm{conv}(H_U)$, so that its generalization error $E_D(g) = \Pr_{(x,y) \sim D}[\mathrm{sign}(g(x)) \ne y]$ is small.

Example: Learning with time-series shapelets. For a typical setting of learning with time-series shapelets, an instance is a sequence of real numbers $x = (x_1, x_2, \ldots, x_L) \in \mathbb{R}^L$ and a base classifier $h_s$, which is associated with a shapelet (i.e., a “short” sequence of real numbers) $s = (s_1, s_2, \ldots, s_\ell) \in \mathbb{R}^\ell$, is defined as
$$h_s(x) = \max_{1 \le j \le L-\ell+1} K\!\left(s, x^{(j)}\right),$$
where $x^{(j)} = (x_j, x_{j+1}, \ldots, x_{j+\ell-1})$ is the subsequence of $x$ of length $\ell$ that begins at the $j$-th index¹. In our framework, this corresponds to the case where $Q = L - \ell + 1$, $\mathcal{P} \subseteq \mathbb{R}^\ell$, and $U = \{\phi(s) \mid s \in \mathcal{P}\}$.

¹In previous work, $K$ is not necessarily a kernel. For instance, the negative of the Euclidean distance $-\|s - x^{(j)}\|$ is often used as a similarity measure $K$.

3

Risk bounds of the hypothesis classes

In this section, we give generalization bounds for the hypothesis classes $\mathrm{conv}(H_U)$ for various $U$ and $K$. To derive the bounds, we use the Rademacher and the Gaussian complexity [2].

Definition 1 (The Rademacher and the Gaussian complexity [2]). Given a sample $S = (x_1, \ldots, x_m) \in \mathcal{X}^m$, the empirical Rademacher complexity $R_S(H)$ of a class $H \subset \{h: \mathcal{X} \to \mathbb{R}\}$ w.r.t. $S$ is defined as $R_S(H) = \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{h \in H}\sum_{i=1}^{m}\sigma_i h(x_i)\right]$, where $\sigma \in \{-1, 1\}^m$ and each $\sigma_i$ is an independent uniform random variable in $\{-1, 1\}$. The empirical Gaussian complexity $G_S(H)$ of $H$ w.r.t. $S$ is defined similarly, but each $\sigma_i$ is drawn independently from the standard normal distribution.
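Both quantities in Definition 1 can be estimated numerically for a finite pool of base classifiers by sampling $\sigma$ and averaging the supremum. The sketch below does this on synthetic values and is meant only to illustrate the definition; the pool and the data are random and not taken from the paper.

```python
import numpy as np

def empirical_complexity(h_values, n_trials=2000, gaussian=False, seed=0):
    """Monte Carlo estimate of the empirical Rademacher (or Gaussian) complexity of a
    finite hypothesis pool.  h_values is an (n_hypotheses, m) matrix with entries h(x_i)."""
    rng = np.random.default_rng(seed)
    _, m = h_values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.standard_normal(m) if gaussian else rng.choice([-1.0, 1.0], size=m)
        total += np.max(h_values @ sigma) / m   # sup over the pool of (1/m) sum_i sigma_i h(x_i)
    return total / n_trials

# toy usage: a random pool of 50 hypotheses evaluated on m = 30 examples
h_values = np.random.default_rng(1).uniform(-1.0, 1.0, size=(50, 30))
print(empirical_complexity(h_values))                  # Rademacher
print(empirical_complexity(h_values, gaussian=True))   # Gaussian
```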

The following bounds are well known.

Lemma 1 ([2, Lemma 4]). $R_S(H) = O(G_S(H))$.

Lemma 2 ([12, Corollary 6.1]). For fixed $\rho, \delta > 0$, the following bound holds with probability at least $1 - \delta$: for all $f \in \mathrm{conv}(H)$,
$$E_D(f) \le E_\rho(f) + \frac{2}{\rho} R_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$
where $E_\rho(f)$ is the empirical margin loss of $f$ over $S$ of size $m$, i.e., the ratio of examples for which $f$ has margin $y_i f(x_i) < \rho$.

Deriving generalization bounds based on the Rademacher or the Gaussian complexity is quite standard in the statistical learning theory literature and applicable to our classes of interest as well. However, a standard analysis provides us with sub-optimal bounds. For example, let us consider the simple case where the class $H_U$ of base classifiers is defined by the linear kernel with $U$ the set of vectors in $\mathbb{R}^\ell$ of bounded norm. In this case, $H_U$ can be viewed as $H_U = \{\max\{h_1, \ldots, h_Q\} \mid h_1 \in H_1, \ldots, h_Q \in H_Q\}$, where $H_j = \{h: x \mapsto \langle u, x^{(j)} \rangle \mid u \in U\}$ for every $j \in [Q]$. Then, by a standard analysis [see, e.g., 12, Lemma 8.1], we have $R_S(H_U) \le \sum_{j=1}^{Q} R_S(H_j) = O\!\left(\frac{Q}{\sqrt{m}}\right)$. However, this bound is weak since it is linear in the number $Q$ of local patterns. In the following, we will give an improved bound of $\tilde{O}(\log Q / \sqrt{m})$. The key observation is that if the base classes $H_1, \ldots, H_Q$ are “correlated” somehow, one could obtain better Rademacher bounds. In fact, we will exploit some geometric properties among these base classes.

3.1

Main theorem

First, we show our main theoretical result on the generalization bound of $\mathrm{conv}(H_U)$. Given a sample $S$, let $\mathcal{P}_S$ be the set of patterns appearing in $S$ and let $\phi(\mathcal{P}_S) = \{\phi(z) \mid z \in \mathcal{P}_S\}$. Let $\phi_{\mathrm{diff}}(\mathcal{P}_S) = \{\phi(z) - \phi(z') \mid z, z' \in \mathcal{P}_S, z \ne z'\}$. By viewing each instance $v \in \phi_{\mathrm{diff}}(\mathcal{P}_S)$ as a hyperplane $\{u \mid \langle v, u \rangle = 0\}$, we can naturally define a partition of the Hilbert space $\mathcal{H}$ by the set of all hyperplanes $v \in \phi_{\mathrm{diff}}(\mathcal{P}_S)$. Let $\mathcal{I}$ be the set of all cells of the partition. Each cell $I \in \mathcal{I}$ is a polyhedron defined by a minimal set $V_I \subseteq \phi_{\mathrm{diff}}(\mathcal{P}_S)$ that satisfies $I = \bigcap_{v \in V_I}\{u \mid \langle u, v \rangle \ge 0\}$. Let
$$\mu^* = \min_{I \in \mathcal{I}} \max_{u \in I \cap U} \min_{v \in V_I} |\langle u, v \rangle|.$$
Let $d^*_{\phi,S}$ be the VC dimension of the set of linear classifiers over the finite set $\phi_{\mathrm{diff}}(\mathcal{P}_S)$, given by $F_U = \{f: v \mapsto \mathrm{sign}(\langle u, v \rangle) \mid u \in U\}$. Then, our main result is stated as follows:

Theorem 1. Suppose that for any $z \in \mathcal{P}$, $\|\phi(z)\|_{\mathcal{H}} \le R$. Then, for any $\rho > 0$ and $\delta$ ($0 < \delta < 1$), with probability at least $1 - \delta$, the following holds for any $g \in \mathrm{conv}(H_U)$ with $U \subseteq \{u \in \mathcal{H} \mid \|u\|_{\mathcal{H}} \le \Lambda\}$:
$$E_D(g) \le E_\rho(g) + O\left( \frac{R\Lambda\sqrt{d^*_{\phi,S}\log(mQ)}}{\rho\sqrt{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{m}} \right). \qquad (1)$$
In particular, (i) for any $\phi$, $d^*_{\phi,S} \le R\Lambda/\mu^*$ holds in general; (ii) if $\phi$ is the identity mapping (i.e., the associated kernel is the linear kernel), or (iii) if $\phi$ satisfies that $\langle \phi(z), \phi(x) \rangle$ is monotone decreasing with respect to $\|z - x\|_2$ (e.g., the mapping defined by the Gaussian kernel) and $U = \{\phi(z) \mid z \in \mathcal{P} \subseteq \mathbb{R}^\ell, \|\phi(z)\|_{\mathcal{H}} \le \Lambda\}$, then $d^*_{\phi,S}$ can be upper bounded by $\ell$.

In order to prove Theorem 1, we give several definitions, lemmas, and proofs in the following subsection.

3.2

Proof sketch

Definition 2 (The set $\Theta$ of mappings from an instance to a pattern). Given a sample $S = (x_1, \ldots, x_m)$ and the feature map $\phi$ over $\mathcal{P}_S$, for any $u \in U$ let $\theta_{u,\phi}: [m] \to [Q]$ be the mapping defined by
$$\theta_{u,\phi}(i) := \arg\max_{j \in [Q]} \left\langle u, \phi\!\left(x_i^{(j)}\right)\right\rangle, \quad \forall i \in [m],$$
and we define the set of all $\theta_{u,\phi}$ for $S$ as $\Theta_{S,\phi} = \{\theta_{u,\phi} \mid u \in U\}$. For simplicity, we write $\theta_u$ and $\Theta$.

Lemma 3. Suppose that for any $z \in \mathcal{P}$, $\|\phi(z)\|_{\mathcal{H}} \le R$. Then, the empirical Gaussian complexity of $H_U$ with respect to $S$ for $U \subseteq \{u \mid \|u\|_{\mathcal{H}} \le \Lambda\}$ is bounded as follows:
$$G_S(H_U) \le \frac{R\Lambda\sqrt{(\sqrt{2}-1) + 2\ln|\Theta|}}{\sqrt{m}}.$$

Proof. Since $U$ can be partitioned into $\bigcup_{\theta \in \Theta}\{u \in U \mid \theta_u = \theta\}$,
$$G_S(H_U) = \frac{1}{m}\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\,\sup_{u \in U:\theta_u = \theta}\sum_{i=1}^{m}\sigma_i\left\langle u, \phi\!\left(x_i^{(\theta(i))}\right)\right\rangle\right]
= \frac{1}{m}\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\,\sup_{u \in U:\theta_u = \theta}\left\langle u, \sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\rangle\right]$$
$$\le \frac{1}{m}\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\,\sup_{u \in U}\left\langle u, \sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\rangle\right]
\le \frac{\Lambda}{m}\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\|_{\mathcal{H}}\right]$$
$$= \frac{\Lambda}{m}\,\mathbb{E}_{\sigma}\!\left[\sqrt{\sup_{\theta \in \Theta}\left\|\sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\|_{\mathcal{H}}^{2}}\right]
\le \frac{\Lambda}{m}\sqrt{\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\|_{\mathcal{H}}^{2}\right]}. \qquad (2)$$
The first inequality is derived from the relaxation of $u$, the second inequality is due to the Cauchy–Schwarz inequality and the fact that $\|u\|_{\mathcal{H}} \le \Lambda$, and the last inequality is due to Jensen's inequality. We denote by $K^{(\theta)}$ the kernel matrix such that $K^{(\theta)}_{ij} = \langle\phi(x_i^{(\theta(i))}), \phi(x_j^{(\theta(j))})\rangle$. Then, we have
$$\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^{m}\sigma_i\,\phi\!\left(x_i^{(\theta(i))}\right)\right\|_{\mathcal{H}}^{2}\right] = \mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right]. \qquad (3)$$
We now derive an upper bound of the r.h.s. as follows. For any $c > 0$,
$$\exp\!\left(c\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right]\right)
\le \mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\exp\!\left(c\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right)\right]
\le \sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\!\left[\exp\!\left(c\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right)\right].$$
The first inequality is due to Jensen's inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By the symmetry of $K^{(\theta)}$, we have $\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij} = \sigma^{\top}K^{(\theta)}\sigma$, which is rewritten as
$$\sigma^{\top}K^{(\theta)}\sigma = (V^{\top}\sigma)^{\top}\begin{pmatrix}\lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_m\end{pmatrix}V^{\top}\sigma,$$
where $\lambda_1 \ge \cdots \ge \lambda_m \ge 0$ are the eigenvalues of $K^{(\theta)}$ and $V = (v_1, \ldots, v_m)$ is the orthonormal matrix such that $v_k$ is the eigenvector corresponding to the eigenvalue $\lambda_k$. By the reproductive property of the Gaussian distribution, $V^{\top}\sigma$ obeys the same Gaussian distribution as well. So,
$$\sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\!\left[\exp\!\left(c\,\sigma^{\top}K^{(\theta)}\sigma\right)\right]
= \sum_{\theta \in \Theta}\prod_{k=1}^{m}\mathbb{E}_{\sigma}\!\left[\exp\!\left(c\lambda_k\frac{\sigma^2}{2}\right)\right]
= \sum_{\theta \in \Theta}\prod_{k=1}^{m}\int_{-\infty}^{\infty}\frac{\exp\!\left(-(1-c\lambda_k)\frac{\sigma^2}{2}\right)}{\sqrt{2\pi}}\,d\sigma
= \sum_{\theta \in \Theta}\prod_{k=1}^{m}\frac{1}{\sqrt{1-c\lambda_k}},$$
where the last equality follows from the substitution $\sigma' = \sigma\sqrt{1-c\lambda_k}$. Now, by letting $c = \frac{1}{2\max_{k,\theta}\lambda_k} = \frac{1}{2\lambda_1}$ and applying the inequality $\frac{1}{\sqrt{1-x}} \le 1 + 2(\sqrt{2}-1)x$, which holds for $0 \le x \le \frac{1}{2}$, the bound becomes
$$\exp\!\left(c\,\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right]\right) \le \sum_{\theta \in \Theta}\prod_{k=1}^{m}\left(1 + 2(\sqrt{2}-1)c\lambda_k\right). \qquad (4)$$
Further, taking the logarithm, dividing both sides by $c$, and applying $\ln(1+x) \le x$, we get
$$\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right]
\le (\sqrt{2}-1)\sum_{k=1}^{m}\lambda_k + 2\lambda_1\ln|\Theta|
= (\sqrt{2}-1)\,\mathrm{tr}(K^{(\theta)}) + 2\lambda_1\ln|\Theta|
\le (\sqrt{2}-1)mR^2 + 2mR^2\ln|\Theta|, \qquad (5)$$
where the last inequality holds since $\mathrm{tr}(K^{(\theta)}) \le mR^2$ and $\lambda_1 = \|K^{(\theta)}\|_2 \le m\|K^{(\theta)}\|_{\max} \le mR^2$. By equations (2) and (5), we have
$$G_S(H_U) \le \frac{\Lambda}{m}\sqrt{\mathbb{E}_{\sigma}\!\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^{m}\sigma_i\sigma_j K^{(\theta)}_{ij}\right]} \le \frac{\Lambda R\sqrt{(\sqrt{2}-1) + 2\ln|\Theta|}}{\sqrt{m}}.$$

Thus, it suffices to bound the size $|\Theta|$. Naively, the size $|\Theta|$ is at most $Q^m$, since there are $Q^m$ possible mappings from $[m]$ to $[Q]$. However, this naive bound is too pessimistic. The basic idea for getting better bounds is the following. Fix any $i \in [m]$ and consider the points $\phi(x_i^{(1)}), \ldots, \phi(x_i^{(Q)})$. Then, we define equivalence classes of $u$ such that $\theta_u(i)$ is the same, which define a Voronoi diagram for the points $\phi(x_i^{(1)}), \ldots, \phi(x_i^{(Q)})$. Note here that the similarity is measured by the inner product, not a distance. More precisely, let $\mathcal{V}_i = (V_i^{(1)}, \ldots, V_i^{(Q)})$ be the Voronoi diagram defined as $V_i^{(j)} = \{u \in \mathcal{H} \mid \theta_u(i) = j\}$. Let us consider the set of intersections $\bigcap_{i \in [m]} V_i^{(j_i)}$ for all combinations of $(j_1, \ldots, j_m) \in [Q]^m$. The key observation is that each non-empty intersection corresponds to a mapping $\theta_u \in \Theta$. Thus, we obtain $|\Theta| = $ (the number of non-empty intersections $\bigcap_{i \in [m]} V_i^{(j_i)}$). In other words, the size of $\Theta$ is exactly the number of cells defined by the intersections of the $m$ Voronoi diagrams $\mathcal{V}_1, \ldots, \mathcal{V}_m$. From now on, we will derive upper bounds based on this observation.

Lemma 4. $|\Theta| = O((mQ)^{2d^*_{\phi,S}})$.

Proof. We will reduce the problem of counting intersections of the Voronoi diagrams to that of counting possible labelings by hyperplanes for some set. Note that for each pair of neighboring Voronoi regions, the border is a part of a hyperplane, since the closeness is defined in terms of the inner product. So, by simply extending each border to a hyperplane, we obtain intersections of halfspaces defined by the extended hyperplanes. Note that the number of these intersections gives an upper bound on the number of intersections of the Voronoi diagrams. More precisely, we draw hyperplanes for each pair of points in $\phi(\mathcal{P}_S) = \{\phi(x_i^{(j)}) \mid i \in [m], j \in [Q]\}$ so that each point on the hyperplane has the same inner product with the two points. Note that for each pair $z, z' \in \phi(\mathcal{P}_S)$, the normal vector of the hyperplane is given as $z - z'$ (by fixing the sign arbitrarily). So, the set of hyperplanes obtained by this procedure is exactly $\phi_{\mathrm{diff}}(\mathcal{P}_S)$. The size of $\phi_{\mathrm{diff}}(\mathcal{P}_S)$ is $\binom{mQ}{2}$, which is at most $m^2 Q^2$. Now, we consider a “dual” space by viewing each hyperplane as a point and each point in $U$ as a hyperplane. Note that points $u$ (hyperplanes in the dual) in an intersection give the same labeling on the points in the dual domain. Therefore, the number of intersections in the original domain is the same as the number of possible labelings on $\phi_{\mathrm{diff}}(\mathcal{P}_S)$ by hyperplanes in $U$. By the classical Sauer's Lemma and the VC dimension of hyperplanes (see, e.g., Theorem 5.5 in [16]), this size is at most $O((m^2 Q^2)^{d^*_{\phi,S}})$.

Theorem 2.
(i) For any $\phi$, $|\Theta| = O((mQ)^{R\Lambda/\mu^*})$.
(ii) If $\phi$ is the identity mapping over $\mathcal{P}$, then $|\Theta| = O((mQ)^{2\ell})$.
(iii) If $\phi$ satisfies that $\langle\phi(z), \phi(x)\rangle$ is monotone decreasing with respect to $\|z - x\|_2$ (e.g., the mapping defined by the Gaussian kernel) and $U = \{\phi(z) \mid z \in \mathcal{P} \subseteq \mathbb{R}^\ell, \|\phi(z)\|_{\mathcal{H}} \le \Lambda\}$, then $|\Theta| = O((mQ)^{2\ell})$.

Proof. (i) We follow the argument in Lemma 4. For the set of classifiers $F = \{f: \phi_{\mathrm{diff}}(\mathcal{P}_S) \to \{-1, 1\} \mid f = \mathrm{sign}(\langle u, z\rangle),\ \|u\|_{\mathcal{H}} \le \Lambda,\ \min_{z \in \phi_{\mathrm{diff}}(\mathcal{P}_S)}|\langle u, z\rangle| = \mu\}$, its VC dimension is known to be at most $R\Lambda/\mu$ for $\phi_{\mathrm{diff}}(\mathcal{P}_S) \subseteq \{z \mid \|z\|_{\mathcal{H}} \le R\}$ (see, e.g., [16]). By the definition of $\mu^*$, for each intersection given by the hyperplanes, there always exists a point $u$ whose inner product with each hyperplane is at least $\mu^*$. Therefore, the number of intersections is bounded by the number of possible labelings in the dual space by $U'' = \{u \in \mathcal{H} \mid \|u\|_{\mathcal{H}} \le \Lambda,\ \min_{z \in \phi_{\mathrm{diff}}(\mathcal{P}_S)}|\langle u, z\rangle| = \mu^*\}$. Thus we obtain that $d^*_{\phi,S}$ is at most $R\Lambda/\mu^*$, and by Lemma 4, we complete the proof of case (i).
(ii) In this case, the Hilbert space $\mathcal{H}$ is contained in $\mathbb{R}^\ell$. Then, by the fact that the VC dimension $d^*_{\phi,S}$ is at most $\ell$ and Lemma 4, the statement holds.
(iii) If $\langle\phi(z), \phi(x)\rangle$ is monotone decreasing in $\|z - x\|$, then the following holds:
$$\arg\max_{x \in X}\langle\phi(z), \phi(x)\rangle = \arg\min_{x \in X}\|z - x\|_2.$$
Therefore, $\max_{u:\|u\|_{\mathcal{H}} = 1}\langle u, \phi(x)\rangle = \|\phi(x)\|_{\mathcal{H}}$, where $u = \frac{\phi(x)}{\|\phi(x)\|_{\mathcal{H}}}$. This indicates that the number of Voronoi cells made by $V_i^{(j)} = \{z \in \mathbb{R}^\ell \mid j = \arg\max_{k \in [Q]} z \cdot x_i^{(k)}\}$ corresponds to that made by $\hat{V}_i^{(j)} = \{\phi(z) \in \mathcal{H} \mid j = \arg\max_{k \in [Q]}\langle\phi(z), \phi(x_i^{(k)})\rangle\}$. Then, by following the same argument as for the linear kernel case, we get the same statement.

Now we are ready to prove Theorem 1.

Proof of Theorem 1. By using Lemmas 1 and 2, we obtain the generalization bound in terms of the Gaussian complexity of $H_U$. Then, by applying Lemma 3 and Theorem 2, we complete the proof.
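The key quantity $|\Theta|$ can also be probed numerically: draw random $u \in U$, record the induced argmax mapping $(\theta_u(1), \ldots, \theta_u(m))$, and count the distinct mappings that appear. The sketch below does this for the linear-kernel case on synthetic patterns; it only gives a sampled lower estimate of $|\Theta|$, and the data layout is our own assumption.

```python
import numpy as np

def sample_theta_count(patterns, n_draws=5000, seed=0):
    """Count distinct mappings theta_u(i) = argmax_j <u, x_i^(j)> over random unit vectors u.
    patterns has shape (m, Q, ell): Q local patterns of dimension ell for each of m instances."""
    rng = np.random.default_rng(seed)
    m, Q, ell = patterns.shape
    mappings = set()
    for _ in range(n_draws):
        u = rng.standard_normal(ell)
        u /= np.linalg.norm(u)
        mappings.add(tuple(int(np.argmax(patterns[i] @ u)) for i in range(m)))
    return len(mappings)

m, Q, ell = 20, 30, 5
patterns = np.random.default_rng(2).standard_normal((m, Q, ell))
print(sample_theta_count(patterns), "distinct mappings observed; the naive bound is Q^m =", Q ** m)
```

Even on random data the observed count (at most the number of draws) stays far below $Q^m$, which is the gap that Lemma 4 and Theorem 2 quantify.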

4

Optimization problem formulation

In this section, we formulate an optimization problem to learn ensembles in $\mathrm{conv}(H_U)$ for $U \subseteq \{u : \|u\|_{\mathcal{H}} \le \Lambda\}$ using hypotheses in $H_U$. In particular, the problem formulation we suggest in this paper is based on soft margin maximization with 1-norm regularization². The soft margin optimization with 1-norm regularization is a standard formulation which tries to minimize the generalization bound [see, e.g., 4]. The optimization problem is as follows:
$$\max_{\rho, w, \xi}\ \rho - \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \qquad (6)$$
$$\text{sub.to} \quad \int_{u \in U} y_i w_u h_u(x_i)\,du \ge \rho - \xi_i,\ i \in [m], \quad \int_{u \in U} w_u\,du = 1, \quad w \ge 0,\ \xi \ge 0,\ \rho \in \mathbb{R},$$

where $h_u \in H_U$ are the given base classifiers, $\nu \in [0, 1]$ is a constant parameter, $\rho$ is a target margin, and $\xi_i$ is a slack variable. The dual formulation of problem (6) is given as follows:
$$\min_{\gamma, d}\ \gamma \qquad (7)$$
$$\text{sub.to} \quad \sum_{i=1}^{m} y_i d_i h_u(x_i) \le \gamma,\ u \in U, \quad 0 \le d_i \le 1/\nu m\ (i \in [m]), \quad \sum_{i=1}^{m} d_i = 1, \quad \gamma \in \mathbb{R}.$$

The dual problem is categorized as a semi-infinite program (SIP), since it contains possibly infinitely many constraints. Note that the duality gap is zero since problem (7) is convex and the optimum is finite [18, Theorem 2.2]. Our approach is to approximately solve the primal and the dual problems for $U = \{u : \|u\|_{\mathcal{H}} \le \Lambda\}$ by solving sub-problems over a finite subset $U' \subset U$. Such an approach is called column generation in linear programming, which adds a new constraint (column) to the dual problem and solves it iteratively. LPBoost [4] is a well-known example of this approach in the boosting setting. At each iteration, LPBoost chooses a hypothesis $h_u$ so that $h_u$ maximally violates the constraints in the current dual problem. This sub-problem is called weak learning in the boosting setting and is formulated as follows:
$$\max_{u \in \mathcal{H}}\ \sum_{i=1}^{m} y_i d_i \max_{j \in [Q]}\left\langle u, \phi\!\left(x_i^{(j)}\right)\right\rangle \quad \text{sub.to} \quad \|u\|_{\mathcal{H}}^2 \le \Lambda^2. \qquad (8)$$

However, problem (8) cannot be solved directly, since we only have access to $U$ through the associated kernel. Fortunately, the optimal solution $u^*$ can be written as a linear combination of the functions $K(x_i^{(j)}, \cdot)$ because of the following representer theorem.

Theorem 3 (Representer Theorem). The optimal solution $u^*$ of optimization problem (8) has the form $u^* = \sum_{i=1}^{m}\sum_{j=1}^{Q}\alpha_{ij} K(x_i^{(j)}, \cdot)$.

Proof. We can rewrite the optimization problem (8) by using $\theta \in \Theta$ defined in Subsection 3.2 as follows:
$$\max_{\theta \in \Theta}\ \max_{u \in \mathcal{H}:\theta_u = \theta}\ \sum_{i=1}^{m} y_i d_i\left\langle u, \phi\!\left(x_i^{(\theta(i))}\right)\right\rangle \quad \text{sub.to} \quad \|u\|_{\mathcal{H}}^2 \le \Lambda^2. \qquad (9)$$

²Of course, soft margin optimization with “2-norm” regularization such as SVMs is naturally considered. The reason not to employ the “2-norm” is described in Section 7.


Thus, if we fix $\theta \in \Theta$, we have a sub-problem. Since the constraint $\theta = \theta_u$ can be written as linear constraints, each sub-problem is equivalent to a convex optimization. Indeed, each sub-problem can be written as the equivalent unconstrained minimization (neglecting constants in the objective)
$$\min_{u \in \mathcal{H}}\ \lambda\|u\|_{\mathcal{H}}^2 - \sum_{i=1}^{m} y_i d_i\left\langle u, \phi\!\left(x_i^{(\theta(i))}\right)\right\rangle - \sum_{i=1}^{m}\sum_{j=1}^{Q}\lambda_{i,j}\left\langle u, \phi\!\left(x_i^{(\theta(i))}\right) - \phi\!\left(x_i^{(j)}\right)\right\rangle,$$
where $\lambda$ and $\lambda_{i,j}$ ($i \in [m]$, $j \in [Q]$) are the corresponding positive constants. Now, for each sub-problem, we can apply the standard Representer Theorem argument (see, e.g., [12]). Let $\mathcal{H}_1$ be the subspace $\{u \in \mathcal{H} \mid u = \sum_{i=1}^{m}\sum_{j=1}^{Q}\alpha_{ij}\phi(x_i^{(j)}),\ \alpha_{ij} \in \mathbb{R}\}$. We denote by $u_1$ the orthogonal projection of $u$ onto $\mathcal{H}_1$, so that any $u \in \mathcal{H}$ has the decomposition $u = u_1 + u^{\perp}$. Since $u^{\perp}$ is orthogonal to $\mathcal{H}_1$, $\|u\|_{\mathcal{H}}^2 = \|u_1\|_{\mathcal{H}}^2 + \|u^{\perp}\|_{\mathcal{H}}^2 \ge \|u_1\|_{\mathcal{H}}^2$. On the other hand, $\langle u, \phi(x_i^{(j)})\rangle = \langle u_1, \phi(x_i^{(j)})\rangle$. Therefore, the optimal solution of each sub-problem has to be contained in $\mathcal{H}_1$. This implies that the optimal solution, which is the maximum over all solutions of the sub-problems, is contained in $\mathcal{H}_1$ as well.

By Theorem 3, we can design a weak learner by solving the following equivalent problem:
$$\min_{\alpha}\ -\sum_{p: y_p = +1} d_p \max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right) + \sum_{q: y_q = -1} d_q \max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_q^{(j)}\right) \qquad (10)$$
$$\text{sub.to} \quad \sum_{i,k=1}^{m}\sum_{j,l=1}^{Q}\alpha_{ij}\alpha_{kl} K\!\left(x_i^{(j)}, x_k^{(l)}\right) \le \Lambda^2.$$

5

Algorithm

The optimization problem (10) is a difference of convex functions (DC) programming problem, and we can obtain a local optimum $\epsilon$-approximately by using the DC Algorithm [20]. For the above optimization problem, we replace $\max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K(x_i^{(k)}, x_q^{(j)})$ with $\gamma_q$, and then we get the equivalent optimization problem below:
$$\min_{\alpha, \gamma}\ -\sum_{p: y_p = +1} d_p \max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right) + \sum_{q: y_q = -1} d_q \gamma_q \qquad (11)$$
$$\text{sub.to} \quad \sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_q^{(j)}\right) \le \gamma_q \quad (j \in [Q],\ \forall q: y_q = -1), \qquad \sum_{i,k=1}^{m}\sum_{j,l=1}^{Q}\alpha_{ij}\alpha_{kl} K\!\left(x_i^{(j)}, x_k^{(l)}\right) \le \Lambda^2.$$

We show the pseudo code of LPBoost with column generation and of our weak learning algorithm using DC programming in Algorithms 1 and 2, respectively. In Algorithm 2, the optimization problem (12) can be solved by a standard QP solver.
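The following is a rough Python sketch of the Algorithm 2 loop. To keep the convex step simple we use the 1-norm relaxation $\|\alpha\|_1 \le \Lambda$ described in Section 6.1 in place of the quadratic constraint of problem (12), and we model the subproblem with cvxpy; the data layout (a precomputed kernel tensor Kmat) and all names are our own assumptions, not the paper's code.

```python
import numpy as np
import cvxpy as cp

def weak_learn(Kmat, y, d, Lam=1.0, eps=1e-4, max_iter=50, seed=0):
    """DC-style weak learner (simplified sketch of Algorithm 2).
    Kmat has shape (m, Q, N): Kmat[i, j, r] = K(z_r, x_i^(j)) for N candidate patterns z_r.
    The 1-norm constraint ||alpha||_1 <= Lam replaces the quadratic constraint (Section 6.1)."""
    m, Q, N = Kmat.shape
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    rng = np.random.default_rng(seed)
    alpha_val = rng.standard_normal(N)
    alpha_val *= Lam / np.abs(alpha_val).sum()              # feasible starting point
    prev_obj = np.inf
    for _ in range(max_iter):
        # (a) fix the maximizing window j* for every positive example at the current alpha
        j_star = {p: int(np.argmax(Kmat[p] @ alpha_val)) for p in pos}
        # (b) solve the convex subproblem in (alpha, gamma); it is an LP under the relaxation
        alpha = cp.Variable(N)
        gamma = cp.Variable(len(neg))
        obj = (-sum(d[p] * (Kmat[p, j_star[p]] @ alpha) for p in pos)
               + sum(d[q] * gamma[k] for k, q in enumerate(neg)))
        cons = [cp.norm1(alpha) <= Lam]
        cons += [Kmat[q, j] @ alpha <= gamma[k] for k, q in enumerate(neg) for j in range(Q)]
        prob = cp.Problem(cp.Minimize(obj), cons)
        prob.solve()
        alpha_val = alpha.value
        # (c) stop when the objective no longer decreases by more than eps
        if prev_obj - prob.value <= eps:
            break
        prev_obj = prob.value
    # returned hypothesis: h(z) = max_j sum_r alpha_r K(z_r, z^(j)) for a new sequence z
    return alpha_val
```

The exact subproblem (12) would instead keep the quadratic constraint and hand the problem to a QP solver, as stated above.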

6

Experiments

In the following experiments, we show that our method is practically effective for time-series classification, one of its target applications.

Algorithm 1 LPBoost with WeakLearn
1. Input: $S$, $K$, $\Lambda$, $\epsilon > 0$
2. Initialize: $d_0 \leftarrow (\frac{1}{m}, \ldots, \frac{1}{m})$
3. For $t = 1, \ldots$
   (a) $h_t \leftarrow$ Run WeakLearn($S$, $K$, $d_{t-1}$, $\Lambda$, $\epsilon$)
   (b) Solve the optimization problem:
       $(\gamma, d_t) \leftarrow \arg\min_{\gamma, d}\ \gamma$
       sub.to $\sum_{i=1}^{m} y_i d_i h_j(x_i) \le \gamma$ ($j \in [t]$), $0 \le d_i \le 1/\nu m$ ($i \in [m]$), $\sum_{i=1}^{m} d_i = 1$, $\gamma \in \mathbb{R}$.
4. $w \leftarrow$ Lagrangian multipliers of the last solution
5. $g \leftarrow \sum_{j=1}^{T} w_j h_j$
6. Output: $\mathrm{sign}(g)$

6.1

Heuristics for improving efficiency

In practice, we need to reduce the computational cost of the weak learning in Algorithm 2. In this subsection, we introduce three heuristic options for improving efficiency.

First, in Algorithm 2, we fix an initial shapelet-like $\alpha_0$. More precisely, we initially solve
$$\max_{\alpha}\ \max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right) \quad \text{sub.to} \quad \alpha \text{ is a unit vector},$$
and use the solution as $\alpha_0$. The $x_i^{(k)}$ such that $\alpha_{ik} = 1$ indicates a discriminative subsequence, i.e., a shapelet. We expect that this speeds up the convergence of the loop in line 3, and that the local optimum is better than the shapelets. For the Gun-Point data described in Table 1, this method obtains the solution about 7 times faster than when random vectors are used as the initial $\alpha_0$.

Second, we focus on the quadratic normalization constraint of problem (12), which has a high computational cost. Thus, in practice, we replace the constraint with the 1-norm regularization $\|\alpha\|_1 \le \Lambda$ and solve the resulting linear program (LP). Although the solution space is smaller than the original one, we expect to obtain sparse solutions and reduce the complexity.

Finally, we reduce the high computational cost induced by considering all subsequences. For time-series classification, when we consider subsequences as patterns, we have a large computational cost due to the number of subsequences of the training data (e.g., about $10^6$ for only 1000 instances of length 1000, which results in a similarity matrix of size $10^{12}$). However, in most cases, many subsequences in time-series data are similar to each other. Therefore, we only use representative patterns instead of all subsequences for a large time-series dataset. The representative patterns can be extracted by a clustering method such as k-means, as sketched below. Although this approach may decrease classification accuracy, it drastically decreases the computation cost for a large dataset. For the Gun-Point dataset, this approach with $k = 10$ achieves 98.6% classification accuracy and is over 2000 times faster on average than when all subsequences are used. In the following experiments, we only follow this clustering approach for multi-class datasets.
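The clustering heuristic just described can be sketched with scikit-learn's KMeans as follows; the helper and its defaults are hypothetical and only illustrate replacing all training subsequences by k representative centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_patterns(X, ell, k=100, seed=0):
    """Collect every length-ell subsequence of the training series in X (an array of shape (m, L))
    and return the k cluster centers, used as candidate patterns instead of all subsequences."""
    windows = np.vstack([[x[j:j + ell] for j in range(len(x) - ell + 1)] for x in X])
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(windows)
    return km.cluster_centers_
```

The returned centers then play the role of the candidate patterns indexed by $\alpha$ in the weak learner.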

6.2

Results for time-series classification

We use the UCR datasets [3], which are often used as benchmark datasets for time-series classification methods. The detailed information of the datasets is described in the left side of Table 1. In our experiments, we assume that the patterns are the subsequences of length $\ell$ of a time series.

Algorithm 2 WeakLearn by DC Algorithm
1. Input: $S$, $K$, $d$, $\Lambda$, $\epsilon$ (convergence parameter)
2. Initialize: $\alpha_0 \in \mathbb{R}^{m \times Q}$, $f_0 \leftarrow \infty$
3. For $t = 1, \ldots$
   (a) Get the maximizer $j^*$ for the current $\alpha_{t-1}$:
       $j_p^* \leftarrow \arg\max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{t-1,ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right)$ ($\forall p: y_p = +1$)
   (b) Solve the optimization problem:
       $f \leftarrow \min_{\alpha, \gamma}\ -\sum_{p: y_p = +1} d_p \sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j_p^*)}\right) + \sum_{q: y_q = -1} d_q \gamma_q \qquad (12)$
       sub.to $\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_q^{(j)}\right) \le \gamma_q$ ($j \in [Q]$, $\forall q: y_q = -1$), $\sum_{i,k=1}^{m}\sum_{j,l=1}^{Q}\alpha_{ij}\alpha_{kl} K\!\left(x_i^{(j)}, x_k^{(l)}\right) \le \Lambda^2$.
       $\alpha_t \leftarrow \alpha$, $f_t \leftarrow f$
   (c) If $f_{t-1} - f_t \le \epsilon$, then break.
4. Output: $h(z) = \max_j \sum_{i,k=1}^{m,Q}\alpha_{ik} K\!\left(x_i^{(k)}, z^{(j)}\right)$

We set the hyper-parameters as follows. The length $\ell$ of the pattern was searched in $\{0.1, 0.2, 0.3, 0.4\} \times L$, where $L$ is the length of each time series in the dataset, and we chose $\nu \in \{0.4, 0.3, 0.2, 0.1\}$ for the 1-norm constrained soft margin optimization. We found good $\ell$ and $\nu$ through a grid search via 5-fold cross validation. We use the Gaussian kernel $K(x, x') = \exp(-\sigma\|x - x'\|^2)$. For each $\ell$, we choose $\sigma$ from $\{0.0001, 0.001, \ldots, 10000\}$ so as to maximize the variance of the values of $K$. We fixed $\Lambda = 1$. LPBoost and the DC programming iterate until the difference in the solution becomes sufficiently small. For multi-class datasets, we took the one-vs-one approach and our clustering technique using the k-means algorithm with $k = 100$ for efficiency, and evaluated the average accuracy over five runs. As an LP solver for WeakLearn and LPBoost we used the CPLEX software.

Classification accuracy. The results on classification accuracy are shown in the right side of Table 1. We refer to the experimental results of [8] with regard to the accuracies of the following three state-of-the-art algorithms: LTS [5], IG-SVM [7], and FLAG [8]. It is reported that these algorithms achieve high accuracy in practice. As shown in Table 1, our algorithm had performance competitive with these algorithms.

Parameter-tuning cost. We briefly mention parameter-tuning costs. For example, as seen in Table 1, LTS is known as one of the most accurate algorithms in practice. LTS requires seven input hyper-parameters; however, it is said that LTS is highly sensitive to several of them [see also 22]. LTS is a gradient-based optimization algorithm, which needs proper initial shapelets; these are heuristically calculated by the k-means of subsequences³. It is quite unstable since the initial shapelets depend on the k-means algorithm. We also need to fix k, which specifies the number of

³Note that the k-means for our method is an option for efficiency.


shapelets. On the other hand, in our algorithm, the shapelet-like local patterns are automatically computed via the theoretically motivated formulation. Furthermore, the number of iterations and the learning rate of the gradient descent in LTS are also delicate parameters. These parameters are hard to control since the gradient can explode or vanish. Our optimization problem can be solved stably because the boosting approach always converges to a solution even if the DC weak learning algorithm returns local optima. Our algorithm requires only three parameters: $\ell$ (in common with LTS), $\nu$, and $\sigma$ of the Gaussian kernel ($\Lambda$ and $\epsilon$ are fixed in the experiment). The parameter $\nu$ is easy to tune since it gives an upper bound on the training error. Thus, we can say that our algorithm runs stably with a lower parameter-tuning cost.

Table 1: The detailed information of the used datasets and classification accuracies (%). The accuracies of IG-SVM, LTS, and FLAG are taken from the results of [8].

dataset        #train  #test  length  #classes  IG-SVM  LTS    FLAG   our method
Adiac          390     391    176     37        23.5    51.9   75.2   64.9
Beef           30      30     470     5         90.0    76.7   83.3   66.7
Chlorine.      467     3840   166     3         57.1    73.0   76.0   68.7
Coffee         28      28     286     2         100.0   100.0  100.0  100.0
ECGFiveDays    23      861    136     2         99.0    100.0  92.0   100.0
FaceFour       24      88     350     4         97.7    94.3   90.9   89.1
Gun-Point      50      150    150     2         100.0   99.6   96.7   99.3
ItalyPower.    67      1029   24      2         93.7    95.8   94.6   93.8
Lightning7     70      73     319     7         63.0    79.0   76.7   57.3
MedicalImages  381     760    99      10        52.2    71.3   71.4   62.7
MoteStrain     20      1252   84      2         88.7    90.0   88.8   76.5
Sony.          20      601    70      2         92.7    91.0   92.9   94.8
Symbols        25      995    398     6         84.6    94.5   87.5   92.3
SyntheticC.    300     300    60      6         87.3    97.3   99.7   97.8
Trace          100     100    275     4         98.0    100.0  99.0   98.6
TwoLeadECG     23      1139   82      2         100.0   100.0  99.0   92.7
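The bandwidth selection rule used in the setup above (pick $\sigma$ from a grid so that the kernel values have maximal variance) can be written in a few lines; the function below reflects our reading of that rule and is not the authors' code.

```python
import numpy as np

def select_sigma(patterns, grid=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0, 1000.0, 1e4)):
    """Choose the Gaussian-kernel parameter sigma maximizing the variance of the kernel
    values K(z, z') = exp(-sigma * ||z - z'||^2) over all pairs of candidate patterns."""
    sq_dists = np.sum((patterns[:, None, :] - patterns[None, :, :]) ** 2, axis=-1)
    best_sigma, best_var = None, -1.0
    for sigma in grid:
        var = np.var(np.exp(-sigma * sq_dists))
        if var > best_var:
            best_sigma, best_var = sigma, var
    return best_sigma
```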

6.3

Sparsity and visualization analysis

It is said that shapelet-based methods have great visibility [see, e.g., 23] and are easily interpreted by domain experts. Now, we show that the solution obtained by our method has high sparsity, which induces this visibility, even though it uses a non-linear map with a kernel. Let us explain the sparsity using the Gun-Point data as an example. We focus on the final hypothesis $g = \sum_{t=1}^{T} w_t h_t$, where $h_t = \max_{j \in [Q]}\sum_{k=1}^{m}\sum_{l=1}^{Q}\alpha_{t,kl} K(x_k^{(l)}, \cdot^{(j)})$. The number of final hypotheses $h_t$ such that $w_t \ne 0$ is 7, and the number of non-zero elements $\alpha_{t,kl}$ of such $h_t$'s is 30 out of 34000 (0.6%). For the other datasets, the obtained numbers of hypotheses range from 1 to 40, and the percentages of non-zero elements $\alpha_{t,kl}$ range from 0.08% to 6%. It is clear that the solution obtained by our method has high sparsity.

Figure 1 is an example of the visualization of a final hypothesis obtained by our method for the Gun-Point data. The colored lines show all of the $x_k^{(l)}$'s in $g$ for which both $w_t$ and $\alpha_{t,kl}$ are non-zero. Each value in the legend shows the product of $w_t$ and $\alpha_{t,kl}$ corresponding to $x_k^{(l)}$. Since it is hard to visualize the local features over the Hilbert space, we plotted each of them so as to match the original time series based on the Euclidean distance. In contrast to visualization analyses by shapelets [e.g., 23], our visualization (colored lines and plotted positions) does not strictly represent the meaning of the final hypothesis because of the non-linear feature map. However, we can say that the colored lines represent “discriminative patterns” and certainly make important contributions to classification. Thus, we believe that our solutions are useful for domain experts to interpret important patterns.

7

Discussion

Justification for shapelet-based methods. Our theoretical results justify the shapelet-based approaches in some sense. Theorem 1 guarantees the generalization ability of combined classifiers based

Figure 1: Visualization of discriminative patterns for the Gun-Point data (negative class). The black line is the original time series. A positive value of a colored line (red to yellow) indicates the contribution rate for the positive class, and a negative value (blue to purple) indicates the contribution rate for the negative class.

on the class $H_U$ of infinitely many base hypotheses. Further, under our formulation (6), Theorem 3 guarantees that a “good” base hypothesis consists of kernels with patterns in the training sample. So, simply put, if the similarity measure between patterns is given as a kernel, finding a good combination of patterns in the sample is reasonable under some formulation.

Extensions. Our general formulation has many possible extensions and applications. For example, shapelets with the DTW distance measure [see also 8, 17] can easily be implemented by using the DTW kernel [1]. Using local patterns of various lengths also seems to be a good extension, because the previous methods find shapelets of various lengths; we think the weak learner can choose a hypothesis from the set of hypotheses defined by patterns of various lengths. Anomaly detection problems and semi-supervised and unsupervised learning problems [see, e.g., 21, 25] are also attractive applications. If we want to learn a one-class model from only positive examples, as in the One-Class SVM [11], we can easily formulate the weak learning problem with a small modification of problem (11) as follows:
$$\min_{\alpha}\ \sum_{p=1}^{m} d_p \max_{j \in [Q]}\sum_{i=1}^{m}\sum_{k=1}^{Q}\alpha_{ik} K\!\left(x_i^{(k)}, x_p^{(j)}\right) \quad \text{sub.to} \quad \sum_{i,k=1}^{m}\sum_{j,l=1}^{Q}\alpha_{ij}\alpha_{kl} K\!\left(x_i^{(j)}, x_k^{(l)}\right) \le \Lambda^2,$$
and fortunately the above optimization problem is convex and is expected to be solved more efficiently.

1 versus 2-norm regularization. One might wonder if we could formulate the problem using 2-norm regularization (as in SVMs), not 1-norm. We are aware that we can construct a similar generalization bound in terms of the 2-norm for the Gaussian kernel. However, the formulation suggested by the theory involves heavy integrations over Voronoi regions, and we have not yet found a method for obtaining the solution efficiently. On the other hand, the current formulation based on 1-norm regularization leads us to an organized optimization strategy with a theoretical guarantee.

References [1] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. On-line handwriting recognition with support vector machines - a kernel approach. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR’02), pages 49–54, 2002. [2] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003. 13

[3] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The ucr time series classification archive, July 2015. www.cs.ucr.edu/ ~eamonn/time_series_data/. [4] A Demiriz, K P Bennett, and J Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254, 2002. [5] Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, 2014. [6] Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time-series shapelets. CoRR, abs/1503.03238, 2015. [7] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851– 881, July 2014. [8] Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,, pages 1209–1215, 2016. [9] Isak Karlsson, Panagiotis Papapetrou, and Henrik Boström. Generalized random shapelet forests. Data Min. Knowl. Discov., 30(5):1053–1085, 2016. [10] Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, pages 668–676, 2013. [11] Larry M. Manevitz and Malik Yousef. One-class svms for document classification. J. Mach. Learn. Res., 2:139–154, 2002. [12] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. [13] Francesca Odone, Annalisa Barla, and Alessandro Verri. Building kernels from binary strings for image matching. IEEE Trans. Image Processing, 14(2):169–180, 2005. [14] Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA’2015), pages 1–10. IEEE, 2015. [15] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wen Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998. [16] B. Schölkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002. [17] Mit Shah, Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning dtw-shapelets for time-series classification. In Proceedings of the 3rd IKDD Conference on Data Science, 2016, CODS ’16, pages 3:1–3:8, 2016. [18] Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality conditions. Optimization, 58(2):133–161, 2009. [19] Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama. Dynamic timealignment kernel in support vector machine. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 921–928, 2001. [20] Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkhäuser Basel, Basel, 1988. 14

[21] Li Wei and Eamonn Keogh. Semi-supervised time series classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 748–753, 2006. [22] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018, 2015. [23] Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956. ACM, 2009. [24] D. Zhang, W. Zuo, D. Zhang, and H. Zhang. Time series classification using support vector machine with gaussian elastic metric kernel. In 2010 20th International Conference on Pattern Recognition, pages 29–32, 2010. [25] Qin Zhang, Jia Wu, Hong Yang, Yingjie Tian, and Chengqi Zhang. Unsupervised feature learning from time series. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 2322–2328. AAAI Press, 2016.
