Ivan Paskov
Google Science Fair
Nuclear Norm Multi-Task Learning

We now describe multitask learning, which is a form of machine learning motivated by the idea that better predictions can be made when separate tasks are learned jointly. In other words, multitask learning assumes that there exist commonalities between tasks that can be exploited to facilitate better learning and therefore better predictions.
We are given $T$ regression tasks, each of which has a feature matrix $X^{(1)}, X^{(2)}, \dots, X^{(T)}$ and corresponding response variables $y^{(1)}, y^{(2)}, \dots, y^{(T)}$. For $t = 1, \dots, T$ assume that
$$
X^{(t)} = \begin{pmatrix} x_{11}^{(t)} & \cdots & x_{1d}^{(t)} \\ \vdots & \ddots & \vdots \\ x_{n_t 1}^{(t)} & \cdots & x_{n_t d}^{(t)} \end{pmatrix}
$$
is an $n_t \times d$ dimensional matrix of feature variables and $y^{(t)} = (y_1^{(t)}, \dots, y_{n_t}^{(t)})$ is the corresponding $n_t$-dimensional response variable. Note that the $j$th column in each matrix $X^{(t)}$, $t = 1, \dots, T$, corresponds to the same $j$th feature. It is exactly the fact that all tasks share the same feature set that allows us to share information between them.
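This data layout can be sketched in numpy as lists of per-task arrays that share the same columns; the task count, sample sizes, and feature count below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                # number of shared features (illustrative)
n = [30, 45, 20]     # n_t: sample counts may differ per task
T = len(n)           # number of tasks

# Each task t has its own feature matrix X^(t) (n_t x d) and response y^(t),
# but column j means the same feature in every task.
X = [rng.standard_normal((n_t, d)) for n_t in n]
y = [rng.standard_normal(n_t) for n_t in n]

for t in range(T):
    # Rows vary per task; the d columns are shared across tasks.
    assert X[t].shape == (n[t], d)
    assert y[t].shape == (n[t],)
```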
For $t = 1, \dots, T$ the predictions $\hat{y}^{(t)} = (\hat{y}_1^{(t)}, \dots, \hat{y}_{n_t}^{(t)})$ are computed by
$$
\hat{y}_i^{(t)} = x_{i1}^{(t)} w_1^{(t)} + x_{i2}^{(t)} w_2^{(t)} + \cdots + x_{id}^{(t)} w_d^{(t)} + b^{(t)}, \quad i = 1, \dots, n_t.
$$
Here $w^{(t)} = (w_1^{(t)}, \dots, w_d^{(t)})$ and $b^{(t)}$ are the regression coefficients for each task. Denote by
$$
W = \begin{pmatrix} | & & | \\ w^{(1)} & \cdots & w^{(T)} \\ | & & | \end{pmatrix}
$$
the $d \times T$ dimensional matrix where each $d$-dimensional $w^{(t)}$ is placed as a column, and
$$
b = (b^{(1)}, \dots, b^{(T)})
$$
the $T$-dimensional vector of the offsets. The multitask learning loss function becomes
$$
L(W, b) = \sum_{t=1}^{T} \sum_{i=1}^{n_t} \left( y_i^{(t)} - \hat{y}_i^{(t)} \right)^2.
$$
For multitask learning we try to find a solution $W, b$ that minimizes the sum
$$
\min_{W \in \mathbb{R}^{d \times T},\; b \in \mathbb{R}^{T}} \left\{ L(W, b) + \lambda \Omega(W) \right\},
$$
where $\Omega(W)$ is the regularization penalty to be described below.
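The objective can be sketched in numpy as follows; the penalty is passed in as a function, since its definition (the nuclear norm) only comes later in the text, and the tiny exact-fit check at the end is illustrative:

```python
import numpy as np

def multitask_loss(W, b, X, y):
    """Sum of squared errors over all tasks: L(W, b) in the text.

    W : (d, T) coefficient matrix, column t holds w^(t)
    b : (T,)  per-task offsets
    X : list of (n_t, d) feature matrices
    y : list of (n_t,) response vectors
    """
    return sum(
        np.sum((y[t] - (X[t] @ W[:, t] + b[t])) ** 2)
        for t in range(len(X))
    )

def objective(W, b, X, y, lam, penalty):
    # Regularized objective L(W, b) + lambda * penalty(W).
    return multitask_loss(W, b, X, y) + lam * penalty(W)

# Tiny check: an exact fit gives zero loss.
X = [np.array([[1.0, 0.0], [0.0, 1.0]])]
y = [np.array([3.0, 4.0])]
W = np.array([[2.0], [3.0]])
b = np.array([1.0])
print(multitask_loss(W, b, X, y))   # 0.0
```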
Let
$$
W = U \Sigma V^{*}
$$
be the singular value decomposition (SVD) of the matrix $W$; see [Wall et al., 2003]. The matrix $V^{*}$ denotes the transpose of $V$. In the reduced version of the SVD, which we always use
because $d > T$, $U$ is a $d \times T$ dimensional matrix, $\Sigma$ is a $T \times T$ diagonal matrix, and $V$ is a $T \times T$ dimensional matrix. The columns of $U$ and $V$ are denoted by $u_j \in \mathbb{R}^{d}$ and $v_j \in \mathbb{R}^{T}$, $j = 1, \dots, T$, and are known as the left and right singular vectors, respectively. All the left singular vectors $u_j$ are orthonormal, i.e., unit vectors that are orthogonal to each other. Similarly, the right singular vectors $v_j$ are also orthonormal.

The diagonal elements $\sigma_1, \dots, \sigma_T$ of the diagonal matrix $\Sigma$ are known as singular values; each $\sigma_j \geq 0$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_T$. The rank of $W$ is the number of singular values greater than 0. With this in mind, the nuclear norm regularization penalty is defined as:
$$
\Omega(W) = \sum_{j=1}^{T} \sigma_j.
$$
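A minimal numpy sketch of this penalty; the example matrix is illustrative, and the last line cross-checks against NumPy's built-in nuclear norm:

```python
import numpy as np

def nuclear_norm(W):
    # Omega(W): the sum of the singular values of W.
    return np.linalg.svd(W, compute_uv=False).sum()

W = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])            # singular values are 4 and 3

print(nuclear_norm(W))                # 7.0
print(np.linalg.norm(W, ord="nuc"))   # same value via NumPy's built-in
```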
As suggested by the definition, the nuclear norm regularization penalty penalizes the singular values of $W$, and as with the Lasso penalty, small singular values are set to 0. This means that the nuclear norm penalty pushes $W$ to have a smaller rank. And since we believe that tasks are related because numerous drugs share similar targets, $W$ should have a rank smaller than $T$. It can be shown directly from $W = U \Sigma V^{*}$ that
$$
w^{(t)} = \sum_{j=1}^{T} \sigma_j v_j^{(t)} u_j,
$$
where $v^{(t)} = (v_1^{(t)}, \dots, v_T^{(t)})$ is the $t$th row of $V$. Therefore, each task's regression coefficients are a weighted combination of the orthonormal basis vectors $u_j$. Since each task represents a drug, each $u_j$ can be interpreted as representing a pathway that some of the drugs operate on. Consequently each drug can be represented as a weighted combination of the pathways it acts on – thereby potentially providing novel insights into these drugs' mechanisms of action as well as their efficacy. Another useful interpretation that can be deduced from the SVD is the relatedness of drugs. Since each $\sigma_j v_j^{(t)}$ dictates the weight that $u_j$ (ideally some pathway) contributes to the regression coefficients for task $t$, it is reasonable that tasks with similar $\sigma_j v_j^{(t)}$ values are likely to be related. In other words, each drug can be represented via its $v^{(t)}$, and hierarchical clustering
analysis can be performed to figure out the relatedness of these drugs. This is useful for a number of reasons. First, this information can help validate our current understanding of these drugs as
we know which drugs are related. Second, it could suggest novel relationships. And third, it represents a novel way of extracting information from genomic data. Typically, clustering analysis is performed on the response vector, but this proves to be limited in its usefulness, as only cell lines that have pharmacological data for all 24 drugs can be used – rendering significant portions of the data useless. Our nuclear norm approach to clustering, however, allows us to take full advantage of all the data, because we train on all of the data available for each task and use the internal representation for each task given by $V$ to perform hierarchical clustering – thereby significantly increasing the chances of discovering something meaningful.
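The decomposition of each task's coefficients and the clustering of tasks via their rows of $V$ can be sketched as follows. This is a toy illustration under synthetic data, not the actual 24-drug pipeline: the two "pathways", the 6-task setup, and the cluster structure are all assumptions made up for the example, and scipy's standard hierarchical clustering routines stand in for whatever clustering tool was actually used:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic low-rank W: two hypothetical "pathways" (columns of U_true)
# shared across 6 tasks; tasks 0-2 load on the first, tasks 3-5 on the second.
d, T = 50, 6
U_true = np.linalg.qr(rng.standard_normal((d, 2)))[0]
V_true = np.repeat(np.eye(2), 3, axis=0)          # task-to-pathway loadings
W = U_true @ np.diag([5.0, 3.0]) @ V_true.T       # d x T, rank 2

# Reduced SVD: column t of W equals sum_j sigma_j * V[t, j] * U[:, j].
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T
task = 0
w_task = (s * V[task]) @ U.T                      # reconstruct w^(task)
assert np.allclose(w_task, W[:, task])

# Cluster tasks/drugs via their rows of V, weighted by the singular values.
Z = linkage(V * s, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # groups {0,1,2} vs {3,4,5}
```

The `assert` verifies the identity $w^{(t)} = \sum_j \sigma_j v_j^{(t)} u_j$ numerically, and the clustering step recovers the two planted task groups from their $\sigma_j v_j^{(t)}$ representations.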