Multitask Generalized Eigenvalue Program

Boyu Wang, Borja Balle, Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada
[email protected], {bballe,jpineau}@cs.mcgill.ca

Abstract

We present a novel multitask learning framework called the multitask generalized eigenvalue program (MTGEP), which jointly solves multiple related generalized eigenvalue problems (GEPs). The core assumption of our approach is that the eigenvectors of related GEPs lie in a subspace that can be approximated by a sparse linear combination of basis vectors. As a result, these GEPs can be jointly solved by a sparse coding approach. This framework is quite general and can be applied to many eigenvalue problems in machine learning and pattern recognition, ranging from supervised to unsupervised learning, such as principal component analysis (PCA) and Fisher discriminant analysis (FDA). Empirical evaluation with both synthetic and benchmark real-world datasets validates the efficacy and efficiency of the proposed techniques, especially for grouped multitask GEPs.

1 Introduction

The generalized eigenvalue problem (GEP) requires finding the solution of the system of equations

$$ A w = \lambda B w, \qquad (1) $$

with respect to the pair (λ, w), where λ is the generalized eigenvalue, w ∈ R^d, w ≠ 0, is the corresponding generalized eigenvector, and A, B ∈ R^{d×d}. The GEP is useful as it provides an efficient approach to optimize the Rayleigh quotient

$$ \max_{w \neq 0} \frac{w^\top A w}{w^\top B w}, \qquad (2) $$

which arises in many pattern recognition and machine learning tasks. For example, both principal component analysis (PCA) [7] and Fisher discriminant analysis (FDA) [4] can be formulated as special cases of this problem. In most machine learning applications, A and B are estimated from data; in PCA, B = I, the identity matrix, and A is the covariance matrix estimated from the data.
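Numerically, a symmetric-definite GEP of this form can be solved with standard linear algebra routines. The following minimal sketch (not from the paper; it simply illustrates Eqs. 1 and 2 with the PCA special case on made-up data) uses SciPy's generalized symmetric eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative data: solve Aw = lambda * Bw for symmetric A and
# symmetric positive definite B; eigh handles this pencil directly.
d = 5
rng = np.random.default_rng(0)
X = rng.standard_normal((100, d))
A = np.cov(X, rowvar=False)      # e.g. a covariance matrix (PCA case)
B = np.eye(d)                    # B = I recovers ordinary PCA

eigvals, eigvecs = eigh(A, B)    # generalized eigenvalues in ascending order
w = eigvecs[:, -1]               # leading generalized eigenvector
print(eigvals[-1], w)
```

The leading generalized eigenvector returned here is the maximizer of the Rayleigh quotient in Eq. 2.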

Although the GEP has been well studied over the years [3], to the best of our knowledge no one has tackled the problem of how to jointly solve multiple related GEPs by sharing common knowledge, so that learning performance is better than when each GEP is solved independently. This issue is especially important when the data for each GEP is insufficient, resulting in unreliable estimates of A and B and therefore a poor estimate of the eigenvector w. Such a scenario may arise in many machine learning applications, such as principal component analysis (PCA) and Fisher discriminant analysis (FDA). Some of these problems can be handled using existing multitask learning techniques [5]. However, most previous work on multitask learning focuses only on supervised learning [6, 1, 14] and has not been extended to the GEP setting.

2 Method

2.1 Problem Formulation

Let S = {(A_1, B_1), ..., (A_K, B_K)} be the matrix pairs of K related GEPs, where A_k, B_k ∈ R^{d×d}. In our application, we assume that A_k ∈ S^d_+ and B_k ∈ S^d_{++} for all k ∈ {1, ..., K}, where S^d_+ (S^d_{++}) denotes the set of symmetric positive semidefinite (definite) d × d matrices over R. The objective is to maximize the sum of K Rayleigh quotients:

$$ \max_{w_1, \ldots, w_K} \; \frac{1}{K} \sum_{k=1}^{K} \frac{w_k^\top A_k w_k}{w_k^\top B_k w_k}. \qquad (3) $$

As Eq. 3 is decoupled with respect to w_k, it can be maximized by solving the K GEPs individually. However, if the data available for each task is small compared to its dimension, the estimates of A and B will be unreliable. In the PCA problem, for example, where B = I and A is the estimated covariance matrix, if the number of data points N_k ≪ d for each task, then A_k cannot properly represent the covariance of that task, and the eigenvector found by the GEP is unlikely to correctly maximize the variance of the data. We tackle this problem by assuming that the K GEPs are related in such a way that their eigenvectors lie in a subspace that can be approximated by a sparse linear combination of a small number of basis vectors. More formally, assume that there is a dictionary D ∈ R^{d×M} (M < K) such that the eigenvector of each task can be represented by a subset of the basis vectors of D. In other words, letting γ_k ∈ R^M be the sparse representation of the kth task with respect to D, the objective function Eq. 3 can be formulated as:

$$ \max_{D} \max_{\gamma_1, \ldots, \gamma_K} \; \frac{1}{K} \sum_{k=1}^{K} \frac{\gamma_k^\top D^\top A_k D \gamma_k}{\gamma_k^\top D^\top B_k D \gamma_k} - \rho \|\gamma_k\|_0, \quad \text{s.t. } \|D\|_F \le \mu, \qquad (4) $$

where ||γ||_0 is the ℓ_0-norm of γ, denoting the number of nonzero elements of γ, and ||D||_F = (tr(DD^⊤))^{1/2} is the Frobenius norm of the matrix D. The ℓ_0 regularizer encourages γ to be sparse so that the knowledge embedded in D can be selectively shared. The norm constraint on D prevents the dictionary from being too large and overfitting the available data.

We see in Eq. 4 that the K GEPs are coupled via the dictionary D that is shared across tasks, and therefore the K GEPs can be jointly learned in the context of multitask learning.

2.2 Multitask Generalized Eigenvalue Program

The objective function (Eq. 4) is not concave; we therefore adopt an alternating optimization approach to obtain a local maximum [2]. We apply the following two optimization steps alternately:

1. Sparse coding: given a fixed dictionary D, update the sparse representation γ_k for each task.
2. Dictionary update: given fixed Γ = [γ_1, ..., γ_K], update the dictionary D.

The proposed multitask generalized eigenvalue program (MTGEP) algorithm is outlined in Algorithm 1, with the details of each optimization step described next.

2.2.1 Sparse Coding

Given a fixed dictionary D, Eq. 4 is decoupled and can be optimized by solving K individual problems:

$$ \gamma_k = \arg\max_{\gamma} \; \frac{\gamma^\top P_k \gamma}{\gamma^\top Q_k \gamma} - \rho \|\gamma\|_0, \qquad (5) $$

where P_k = D^⊤ A_k D and Q_k = D^⊤ B_k D. Eq. 5 is called a sparse generalized eigenvalue problem (SGEP) and has been studied in [10, 13, 12]. In this work, we adopt the bi-directional search [10] and the iteratively reweighted quadratic minorization (IRQM) algorithm [12] to solve Eq. 5, and report the better empirical result of the two in the experimental section.
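As a rough illustration of what the sparse-coding step computes (this is our own simplified stand-in, not the bi-directional search or IRQM solvers used in the paper), the following sketch restricts Eq. 5 to candidate supports obtained by truncating the dense leading generalized eigenvector and keeps the support with the best ℓ_0-penalized score:

```python
import numpy as np
from scipy.linalg import eigh

def sparse_gep_truncated(P, Q, rho):
    """Crude stand-in for the SGEP in Eq. 5: sweep over sparsity levels,
    keep the largest-magnitude entries of the dense leading generalized
    eigenvector, re-solve the GEP restricted to that support, and score
    each candidate with the l0 penalty."""
    M = P.shape[0]
    _, V = eigh(P, Q)
    dense = V[:, -1]                        # dense leading eigenvector
    order = np.argsort(-np.abs(dense))      # indices by decreasing magnitude
    best_gamma, best_score = None, -np.inf
    for s in range(1, M + 1):
        idx = order[:s]
        vals, vecs = eigh(P[np.ix_(idx, idx)], Q[np.ix_(idx, idx)])
        gamma = np.zeros(M)
        gamma[idx] = vecs[:, -1]
        score = vals[-1] - rho * s          # Rayleigh quotient minus l0 penalty
        if score > best_score:
            best_gamma, best_score = gamma, score
    return best_gamma
```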


Algorithm 1 Multitask Generalized Eigenvalue Program
Input: {(A_1, B_1), ..., (A_K, B_K)}, maxIter, number of basis vectors M, regularization parameter ρ
1: Solve each GEP to obtain {w_1, ..., w_K}
2: Initialize t = 0, W^(0) = [w_1, ..., w_K]
3: Initialize D^(0) to the first M columns of U, where U is obtained from the singular value decomposition W^(0) = U S V^⊤
4: while t < maxIter do
5:   for k = 1, ..., K do
6:     Solve the kth SGEP (Eq. 5) to obtain γ_k^(t)
7:   end for
8:   Update D^(t) by solving Eq. 7
9:   Normalize D^(t) such that ||D^(t)||_F = M
10:  t = t + 1
11:  if converged then
12:    break
13:  end if
14: end while
Output: D = D^(t), Γ = [γ_1^(t), ..., γ_K^(t)], W = DΓ
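For concreteness, here is a minimal Python sketch of Algorithm 1 under our own assumptions: it reuses the `sparse_gep_truncated` stand-in from the previous snippet for step 6, and plain gradient ascent with renormalization on the sum of Rayleigh quotients for the dictionary update in step 8 (the paper only asks for a standard gradient-based method there); the learning rate and iteration counts are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def mtgep(A_list, B_list, M, rho, max_iter=20, tol=1e-5, lr=1e-2, grad_steps=50):
    """Sketch of Algorithm 1 (MTGEP). Assumes sparse_gep_truncated is defined."""
    K, d = len(A_list), A_list[0].shape[0]
    # Steps 1-3: initialize D from the SVD of the single-task eigenvectors.
    W0 = np.column_stack([eigh(A, B)[1][:, -1] for A, B in zip(A_list, B_list)])
    U, _, _ = np.linalg.svd(W0, full_matrices=False)
    D = U[:, :M]
    D *= M / np.linalg.norm(D)                       # enforce ||D||_F = M
    prev = -np.inf
    for _ in range(max_iter):
        # Sparse coding: one SGEP per task with the current dictionary.
        Gammas = [sparse_gep_truncated(D.T @ A @ D, D.T @ B @ D, rho)
                  for A, B in zip(A_list, B_list)]
        # Dictionary update: gradient ascent on the sum of Rayleigh quotients.
        for _ in range(grad_steps):
            grad = np.zeros_like(D)
            for A, B, g in zip(A_list, B_list, Gammas):
                Dg = D @ g
                num, den = Dg @ A @ Dg, Dg @ B @ Dg
                grad += 2 * (A @ Dg - (num / den) * (B @ Dg))[:, None] * g[None, :] / den
            D = D + lr * grad
            D *= M / np.linalg.norm(D)               # renormalize after each step
        obj = np.mean([(D @ g) @ A @ (D @ g) / ((D @ g) @ B @ (D @ g))
                       for A, B, g in zip(A_list, B_list, Gammas)])
        if abs(obj - prev) < tol:
            break
        prev = obj
    Gamma = np.column_stack(Gammas)
    return D, Gamma, D @ Gamma
```

A typical call would be `D, Gamma, W = mtgep(A_list, B_list, M=5, rho=0.01)`, where `A_list` and `B_list` hold the per-task matrix pairs.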

2.2.2 Dictionary Update

We initialize D using the approach proposed in [9]. We first solve each GEP individually to obtain K leading eigenvectors {w_1, ..., w_K}, one for each task. The dictionary D is then initialized as the first M left singular vectors of W^(0) ∈ R^{d×K}, whose columns are w_1, ..., w_K. Given a fixed Γ = [γ_1, ..., γ_K], the optimization problem (Eq. 4) becomes

$$ D^{(t)} = \arg\max_{D} \sum_{k=1}^{K} \frac{\gamma_k^{(t)\top} D^\top A_k D \gamma_k^{(t)}}{\gamma_k^{(t)\top} D^\top B_k D \gamma_k^{(t)}}. \qquad (6) $$

By applying the property of the vectorization operator that γ^⊤ D^⊤ Σ D γ = vec(D)^⊤ (Σ ⊗ γγ^⊤) vec(D) to Eq. 6, we obtain the following equivalent objective function:

$$ D^{(t)} = \arg\max_{D} \sum_{k=1}^{K} \frac{\mathrm{vec}(D)^\top \left( A_k \otimes \gamma_k^{(t)} \gamma_k^{(t)\top} \right) \mathrm{vec}(D)}{\mathrm{vec}(D)^\top \left( B_k \otimes \gamma_k^{(t)} \gamma_k^{(t)\top} \right) \mathrm{vec}(D)}, \qquad (7) $$

where vec(·) is the vectorization operator and ⊗ is the Kronecker product. Eq. 7 is a nonconcave unconstrained optimization problem, but a local maximum can be found by a standard gradient-based algorithm, using D^{(t−1)} as a warm start for computing D^{(t)}. Since the Rayleigh quotient is invariant to scaling of its argument, we simply normalize D after each update step so that ||D||_F = µ with µ = M, i.e., vec(D) ← M · vec(D) / ||D||_F.
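The paper does not spell out the gradient used in this step; for reference, a standard quotient-rule computation (our addition) gives, for a single term of Eq. 7 with v = vec(D), P_k = A_k ⊗ γ_k^{(t)} γ_k^{(t)⊤}, and Q_k = B_k ⊗ γ_k^{(t)} γ_k^{(t)⊤},

$$ \nabla_v \left( \frac{v^\top P_k v}{v^\top Q_k v} \right) = \frac{2}{v^\top Q_k v} \left( P_k v - \frac{v^\top P_k v}{v^\top Q_k v} \, Q_k v \right), $$

which is summed over k and followed by the renormalization of vec(D) described above.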

3 Experiments

3.1 MultiPCA for Multitask Dimensionality Reduction

We first evaluate MTGEP in the context of multitask PCA (MultiPCA) using a synthetic dataset. We generate G = 5 disjoint groups of d-dimensional Gaussian distributed random variables. Within each group, we generate J tasks, each with 500 instances (N_train instances for training, the rest for testing). For all tasks, the leading eigenvalue of the covariance matrix is 5; the remaining eigenvalues are sampled from a one-sided normal distribution. The eigenvectors of the covariance matrices are randomly generated and are the same within each group. We compare the average variance explained by the leading eigenvectors found by MTGEP against three baselines: SinglePCA, which applies traditional PCA to each task individually; PoolPCA, which applies traditional PCA jointly to all tasks; and SVDPCA, which uses the first column of the initial dictionary D^(0) of MTGEP.
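The following sketch shows how one group of such synthetic tasks could be generated; the exact sampling scales are our assumption, as they are not specified in the paper.

```python
import numpy as np

def make_group_tasks(d=30, J=50, n=500, leading_var=5.0, seed=0):
    """Hypothetical generator for one group of the synthetic MultiPCA data:
    all J tasks in a group share randomly drawn eigenvectors; the leading
    eigenvalue is 5 and the remaining eigenvalues come from a one-sided
    (half-)normal distribution."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # shared eigenvectors
    tasks = []
    for _ in range(J):
        eigvals = np.abs(rng.standard_normal(d))        # one-sided normal
        eigvals[0] = leading_var                         # leading eigenvalue = 5
        cov = Q @ np.diag(eigvals) @ Q.T
        X = rng.multivariate_normal(np.zeros(d), cov, size=n)
        tasks.append(X)
    return tasks, Q[:, 0]                                # data and true leading PC
```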

[Figure 1: three panels, each plotting the variance in the first component (y-axis) for Multi, Single, SVD, Pool, and the true value, against (a) dimension, (b) number of training samples, and (c) number of tasks per group.]

Figure 1: Learning performance under different settings: (a) different feature dimensions, (b) different numbers of training instances, (c) different numbers of tasks per group.

Fig. 1 shows how the learning performance of the different algorithms varies across settings. In Fig. 1(a) we set J = 50 and N_train = 5, and vary the dimension d of the data instances. We observe that the performance of SinglePCA decreases as the feature dimension increases, due to less reliable estimates of the principal component, while MultiPCA is robust to the increase in dimension. In Fig. 1(b) we set J = 50 and d = 30, and vary the amount of training data N_train. We observe that MultiPCA benefits significantly from the other tasks when the number of training instances per task is small. Finally, we consider the performance of MultiPCA with different numbers of tasks per group. We set N_train = 5, d = 30, and vary J from 1 to 100. Fig. 1(c) shows that MultiPCA does not improve the learning performance when J = 1, since in this case there is no common knowledge to be shared among tasks. As the number of tasks per group increases, the performance of MultiPCA improves by leveraging knowledge from other tasks within each group.
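The quantity plotted in Fig. 1 is, as we read it, the held-out variance captured by each method's estimated first component; the paper does not state the metric explicitly, so the following is only a plausible sketch.

```python
import numpy as np

def variance_in_first_component(w, X_test):
    """Presumed evaluation metric for Fig. 1: the test-set variance captured
    by the estimated first component, i.e. w^T Sigma_test w for unit-norm w."""
    w = w / np.linalg.norm(w)
    Sigma_test = np.cov(X_test, rowvar=False)
    return float(w @ Sigma_test @ w)
```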

3.2 MultiFDA for Multitask Classification

Next, we evaluate the MultiFDA algorithm on three common multitask learning benchmarks: the landmine dataset [14], and the USPS and MNIST datasets [8]. For a more detailed description of the datasets and experimental settings, see [8, 11]. Besides the baseline approach (SingleFDA), we also include a comparison with the grouping and overlapping multitask learning (GO-MTL) algorithm [9], an existing multitask supervised learning approach. Table 1 summarizes the results. We see that MultiFDA outperforms single-task learning and performs comparably to GO-MTL. We note that for the digit datasets, the improvement of the multitask approaches over the single-task approach is not significant, which is consistent with previous analyses [8, 9].

Table 1: Results on multitask classification tasks: area under the ROC curve (AUC) for the landmine dataset, and classification accuracy (%) for the USPS and MNIST datasets.

             SingleFDA   MultiFDA   GO-MTL
Landmine     74.9        77.8       78.0
USPS         90.9        91.9       92.8
MNIST        89.1        89.8       86.6
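For concreteness, each FDA task enters the MTGEP framework as a matrix pair (A_k, B_k) via the standard scatter-matrix construction [4]; a minimal sketch for the binary case follows (the regularizer and helper name are our own illustrative choices, not taken from the paper).

```python
import numpy as np
from scipy.linalg import eigh

def fda_as_gep(X, y, reg=1e-6):
    """Binary FDA as a GEP: A is the between-class scatter, B the
    (regularized) within-class scatter; the leading generalized eigenvector
    gives the discriminant direction. Used here only to illustrate how FDA
    tasks supply (A_k, B_k) pairs to MTGEP."""
    d = X.shape[1]
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    diff = (mu1 - mu0)[:, None]
    A = diff @ diff.T                                   # between-class scatter
    B = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    B += reg * np.eye(d)                                # keep B positive definite
    _, V = eigh(A, B)
    return V[:, -1]                                     # leading discriminant direction
```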

4 Conclusions

This paper introduces a new framework for solving multitask generalized eigenvalue problems, which can be used within a wide variety of machine learning approaches, such as multitask PCA and multitask FDA. The proposed algorithm is validated on several task categories (both unsupervised and supervised) using both synthetic and real datasets. The empirical results show that solving related GEPs jointly indeed benefits from our MTGEP approach, especially for GEPs with well-grouped structure.


References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, December 2008.
[2] J. C. Bezdek and R. J. Hathaway. Convergence of alternating optimization. Neural, Parallel and Scientific Computations, 11:351–368, December 2003.
[3] T. D. Bie, N. Cristianini, and R. Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–170. 2005.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[5] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, July 1997.
[6] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, April 2005.
[7] I. Jolliffe. Principal Component Analysis. Springer, New York, 2nd edition, 2002.
[8] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In ICML, 2011.
[9] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In ICML, 2012.
[10] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In ICML, 2006.
[11] P. Ruvolo and E. Eaton. ELLA: An efficient lifelong learning algorithm. In ICML, 2013.
[12] J. Song, P. Babu, and D. P. Palomar. Sparse generalized eigenvalue problem via smooth optimization. arXiv preprint arXiv:1408.6686, 2014.
[13] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. Sparse eigen methods by d.c. programming. In ICML, 2007.
[14] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, January 2007.
