Journal of Machine Learning Research 1–8

ICML 2017 AutoML Workshop

Hyperparameter Learning for Kernel Embedding Classifiers with Rademacher Complexity Bounds

Kelvin Y. S. Hsu, Richard Nock, Fabio Ramos

[email protected] [email protected] [email protected]

The University of Sydney and Data61, CSIRO, Australia

Abstract

We propose learning-theoretic bounds for hyperparameter learning of conditional kernel embeddings in the probabilistic multiclass classification context. Kernel embeddings are nonparametric methods to represent probability distributions directly through observed data in a reproducing kernel Hilbert space (RKHS). This property forms the core of modern kernel methods, yet hyperparameter learning for kernel embeddings remains challenging, often relying on heuristics and domain expertise. We begin by developing the kernel embedding classifier (KEC), and prove that its expected classification error can be bounded with high probability using Rademacher complexity bounds. This bound is used to propose a scalable hyperparameter learning algorithm for conditional embeddings with batch stochastic gradient descent. We verify our learning algorithm on standard UCI datasets, and also use it to learn the feature representations of a convolutional neural network with improved accuracy, demonstrating the generality of this approach.

1. Introduction

Kernel embeddings are principled methods to represent probability distributions in a nonparametric setting. By transforming distributions into mean embeddings within a reproducing kernel Hilbert space (RKHS), distributions can be represented directly from data without assuming a parametric structure (Song et al., 2013). Consequently, nonparametric probabilistic inference can be carried out entirely within the RKHS, where difficult marginalisation integrals become simple linear algebra (Muandet et al., 2016). In this framework, positive definite kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ provide a coherent sense of similarity between two elements of the same space by implicitly defining higher dimensional features. However, kernel hyperparameters are often selected a priori and not learned. In this paper, we take a learning theoretic approach to learn the hyperparameters of a conditional kernel embedding in a supervised manner. We begin by proposing the kernel embedding classifier (KEC), a principled framework for inferring multiclass probabilistic outputs using conditional embeddings, and provide a proof of its stochastic convergence. We then employ Rademacher complexity as a data dependent model complexity measure, and prove that the expected classification risk can be bounded, with high probability, by a combination of the empirical risk and the conditional embedding norm. We use this bound to propose a learning objective that balances data fit against model complexity in a way that does not rely on priors.


2. Hilbert Space Embeddings of Conditional Probability Distributions

To construct a conditional embedding map $U_{Y|X}$ corresponding to the distribution $P_{Y|X}$, where $X : \Omega \to \mathcal{X}$ and $Y : \Omega \to \mathcal{Y}$ are measurable random variables, we choose a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for the input space $\mathcal{X}$ and another kernel $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ for the output space $\mathcal{Y}$. These kernels $k$ and $l$ each describe how similarity is measured within their respective domains $\mathcal{X}$ and $\mathcal{Y}$, and are symmetric and positive definite such that they uniquely define the RKHSs $\mathcal{H}_k$ and $\mathcal{H}_l$. We then define $U_{Y|X} := C_{YX} C_{XX}^{-1}$, where $C_{YX} := \mathbb{E}[l(Y, \cdot) \otimes k(X, \cdot)]$ and $C_{XX} := \mathbb{E}[k(X, \cdot) \otimes k(X, \cdot)]$ (Song et al., 2009). The conditional embedding map can be seen as an operator from $\mathcal{H}_k$ to $\mathcal{H}_l$. In this sense, it sweeps out a family of conditional embeddings $\mu_{Y|X=x}$ in $\mathcal{H}_l$, each indexed by the input variable $x$, via the property $\mu_{Y|X=x} := \mathbb{E}[l(Y, \cdot)|X = x] = U_{Y|X} k(x, \cdot)$. Under the assumption that $\mathbb{E}[g(Y)|X = \cdot] \in \mathcal{H}_k$, Song et al. (2009, Theorem 4) proved that the conditional expectation of a function $g \in \mathcal{H}_l$ can be expressed as an inner product, $\mathbb{E}[g(Y)|X = x] = \langle \mu_{Y|X=x}, g \rangle$. While the assumptions that $\mathbb{E}[g(Y)|X = \cdot] \in \mathcal{H}_k$ and $k(x, \cdot) \in \mathrm{image}(C_{XX})$ hold for finite input domains $\mathcal{X}$ and characteristic kernels $k$, they are not necessarily true when $\mathcal{X}$ is a continuous domain (Fukumizu et al., 2004), which is the scenario for many classification problems. In this case, $C_{YX} C_{XX}^{-1}$ becomes only an approximation to $U_{Y|X}$, and we instead regularise the inverse and use $C_{YX} (C_{XX} + \lambda I)^{-1}$, which also serves to avoid overfitting (Song et al., 2013). In practice, we do not have access to the distribution $P_{XY}$ to analytically derive the conditional embedding. Instead, we have a finite collection of observations $\{x_i, y_i\} \in \mathcal{X} \times \mathcal{Y}$, $i \in \mathbb{N}_n := \{1, \dots, n\}$, for which the conditional embedding map $U_{Y|X}$ can be estimated by

$$\hat{U}_{Y|X} = \Psi (K + n\lambda I)^{-1} \Phi^T, \qquad (1)$$

where $K := \{k(x_i, x_j)\}_{i=1,j=1}^{n,n}$, $\Phi := [\phi(x_1) \cdots \phi(x_n)]$, $\Psi := [\psi(y_1) \cdots \psi(y_n)]$, $\phi(x) := k(x, \cdot)$, and $\psi(y) := l(y, \cdot)$ (Song et al., 2013). The empirical conditional embedding $\hat{\mu}_{Y|X=x} := \hat{U}_{Y|X} k(x, \cdot)$ then stochastically converges to $\mu_{Y|X=x}$ in the RKHS norm at a rate of $O_p((n\lambda)^{-\frac{1}{2}} + \lambda^{\frac{1}{2}})$, under the assumption that $k(x, \cdot) \in \mathrm{image}(C_{XX})$ (Song et al., 2009, Theorem 6). This allows us to approximate the conditional expectation with $\langle \hat{\mu}_{Y|X=x}, g \rangle$ instead, where $\mathbf{g} := \{g(y_i)\}_{i=1}^n$ and $\mathbf{k}(x) := \{k(x_i, x)\}_{i=1}^n$,

$$\mathbb{E}[g(Y)|X = x] \approx \langle \hat{\mu}_{Y|X=x}, g \rangle = \mathbf{g}^T (K + n\lambda I)^{-1} \mathbf{k}(x). \qquad (2)$$
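As a concrete illustration, the estimate in (2) reduces to a regularised linear solve against the kernel Gram matrix. The sketch below is our own minimal example, not the authors' code: the Gaussian kernel, the toy data, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def conditional_expectation(x_train, y_train, g, x_query, lam=1e-2):
    """Estimate E[g(Y) | X = x_query] via equation (2):
    g^T (K + n*lambda*I)^{-1} k(x)."""
    n = x_train.shape[0]
    K = gaussian_kernel(x_train, x_train)       # n x n Gram matrix
    k_x = gaussian_kernel(x_train, x_query)     # n x n_query cross-kernel
    g_vec = g(y_train)                          # the vector {g(y_i)}_{i=1}^n
    weights = np.linalg.solve(K + n * lam * np.eye(n), k_x)
    return g_vec @ weights                      # one estimate per query point

# Toy usage: Y = sin(X) + noise, g is the identity, so we recover E[Y | X = x].
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.standard_normal(200)
x_query = np.array([[0.0], [1.5]])
print(conditional_expectation(x_train, y_train, lambda y: y, x_query))
```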

3. Kernel Embedding Classifier

In the multiclass setting, the output label space is finite and discrete, taking values only in $\mathcal{Y} = \mathbb{N}_m := \{1, \dots, m\}$. Naturally, we first choose the Kronecker delta kernel $\delta : \mathbb{N}_m \times \mathbb{N}_m \to \{0, 1\}$ as the output kernel $l$, where labels that are the same have unit similarity and labels that are different have no similarity. That is, for all pairs of labels $y_i, y_j \in \mathcal{Y}$, $\delta(y_i, y_j) = 1$ if $y_i = y_j$ and $0$ otherwise. As $\delta$ is an integrally strictly positive definite kernel on $\mathbb{N}_m$, it is characteristic (Sriperumbudur et al., 2010, Theorem 7). As such, by the definition of characteristic kernels (Fukumizu et al., 2004), $\delta$ uniquely defines an RKHS $\mathcal{H}_\delta = \overline{\mathrm{span}}\{\delta(y, \cdot) : y \in \mathcal{Y}\}$, the closure of the span of its kernel induced features (Xu and Zhang, 2009). For $\mathcal{Y} = \mathbb{N}_m$, this means that any real-valued function $g : \mathbb{N}_m \to \mathbb{R}$ that is bounded on its discrete domain $\mathbb{N}_m$ is in the RKHS of $\delta$, because we can always write


$g = \sum_{y=1}^m g(y)\,\delta(y, \cdot) \in \mathrm{span}\{\delta(y, \cdot) : y \in \mathcal{Y}\}$. In particular, indicator functions on $\mathbb{N}_m$ are in the RKHS $\mathcal{H}_\delta$, since $\mathbb{1}_c(y) := \mathbb{1}_{\{c\}}(y) = \delta(c, y)$ are simply the canonical kernel induced features of $\mathcal{H}_\delta$. Such properties do not necessarily hold for continuous domains in general, and they allow consistent estimation of the decision probabilities used in multiclass classification. Let $p_c(x) := \mathbb{P}[Y = c | X = x]$ be the decision probability function for class $c \in \mathbb{N}_m$, that is, the probability of the class label $Y$ being $c$ when the example $X$ is $x$. We begin by writing this probability as an expectation of indicator functions,

$$p_c(x) := \mathbb{P}[Y = c | X = x] = \mathbb{P}[Y \in \{c\} | X = x] = \mathbb{E}[\mathbb{1}_c(Y) | X = x]. \qquad (3)$$

With $\mathbb{1}_c \in \mathcal{H}_\delta$, we let $g = \mathbb{1}_c$ in (2) and $\mathbf{1}_c := \{\mathbb{1}_c(y_i)\}_{i=1}^n$ to estimate the right hand side by

$$\hat{p}_c(x) = f_c(x) := \mathbf{1}_c^T (K + n\lambda I)^{-1} \mathbf{k}(x). \qquad (4)$$

Therefore, the vector of empirical decision probability functions over the classes $c \in \mathbb{N}_m$ is

$$\hat{\mathbf{p}}(x) = f(x) := \mathbf{Y}^T (K + n\lambda I)^{-1} \mathbf{k}(x) \in \mathbb{R}^m, \qquad (5)$$

where $\mathbf{Y} := [\mathbf{1}_1 \; \mathbf{1}_2 \; \cdots \; \mathbf{1}_m] \in \{0, 1\}^{n \times m}$ is simply the matrix of one hot encoded labels $\{y_i\}_{i=1}^n$. The KEC is thus the multi-valued decision function $f(x)$ of (5). We now prove that the empirical decision probabilities (4) converge to the true decision probabilities. In fact, the inference distribution (5) is equivalent to the empirical conditional embedding.

Theorem 1 (Uniform Convergence of Empirical Decision Probability Function) Assuming that $k(x, \cdot)$ is in the image of $C_{XX}$, the empirical decision probability function $\hat{p}_c : \mathcal{X} \to \mathbb{R}$ (4) converges uniformly to the true decision probability $p_c : \mathcal{X} \to [0, 1]$ (3) at a stochastic rate of at least $O_p((n\lambda)^{-\frac{1}{2}} + \lambda^{\frac{1}{2}})$ for all $c \in \mathcal{Y} = \mathbb{N}_m$.
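Before moving on, a concrete illustration of the decision function (5): the sketch below is our own, assuming a Gaussian input kernel and a synthetic three class problem; the function names are hypothetical. Note that the raw estimates are only asymptotically valid probabilities and may require the clipping introduced later in (6).

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def kec_decision_probabilities(x_train, y_train, x_query, num_classes, lam=1e-2):
    """Empirical decision probabilities p_hat(x) = Y^T (K + n*lambda*I)^{-1} k(x), eq. (5).
    Labels are taken as 0, ..., m-1 here purely for indexing convenience."""
    n = x_train.shape[0]
    Y = np.eye(num_classes)[y_train]                   # one hot labels, shape (n, m)
    K = gaussian_kernel(x_train, x_train)              # (n, n)
    k_x = gaussian_kernel(x_train, x_query)            # (n, n_query)
    W = np.linalg.solve(K + n * lam * np.eye(n), k_x)  # (n, n_query)
    return (Y.T @ W).T                                 # (n_query, m)

# Usage on a tiny synthetic 3-class problem with well separated cluster means.
rng = np.random.default_rng(1)
x_train = rng.standard_normal((90, 2)) + np.repeat(np.arange(3)[:, None] * 3.0, 30, axis=0)
y_train = np.repeat(np.arange(3), 30)
probs = kec_decision_probabilities(x_train, y_train, x_train[:2], num_classes=3)
print(probs, probs.argmax(axis=1))                     # estimates and predicted labels
```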

4. Hyperparameter Learning with Rademacher Complexity Bounds

Kernel embedding classifiers (4) are equivalent to a conditional embedding with a discrete target space $\mathcal{Y}$. Hyperparameter learning for conditional embeddings is particularly difficult compared to joint embeddings, since the kernel $k_\theta$ with parameters $\theta \in \Theta$ is to be learned jointly with a regularisation parameter $\lambda \in \Lambda = \mathbb{R}^+$. This implies that the notion of model complexity is especially important. To this end, we propose a learning theoretic approach to balance model complexity and data fit. The Rademacher complexity (Bartlett and Mendelson, 2002) measures the expressiveness of a function class $\mathcal{F}$ by its ability to shatter, or fit, noise. It is a data-dependent measure, and is thus particularly well suited to learning tasks where generalisation ability is vital, since complexity penalties that are not data dependent cannot be universally effective (Kearns et al., 1997). We begin by defining a loss function as a measure of performance. For decision functions of the form $f : \mathcal{X} \to \mathcal{A} = \mathbb{R}^m$ whose entries are probability estimates, we employ the cross entropy loss

$$L_\epsilon(y, f(x)) := -\log\,[\mathbf{y}^T f(x)]_\epsilon^1 = -\log\,[f_y(x)]_\epsilon^1, \qquad (6)$$

to express classification risk, where we use the clipping notation $[\,\cdot\,]_\epsilon^1 := \min\{\max\{\,\cdot\,, \epsilon\}, 1\}$ for $\epsilon \in (0, 1)$. Under this loss, we prove a bound on the expected risk.
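Since the empirical decision probabilities in (5) are not guaranteed to lie in $[0, 1]$, the clipping in (6) matters in practice. Below is a minimal sketch of the clipped cross entropy loss, written as our own illustration with an assumed default clipping value.

```python
import numpy as np

def clipped_cross_entropy(y_onehot, p_hat, eps=1e-3):
    """Cross entropy loss of eq. (6): -log [y^T f(x)]_eps^1, with the estimate
    clipped into [eps, 1] so the logarithm stays finite and nonnegative."""
    p_true = np.clip((y_onehot * p_hat).sum(axis=-1), eps, 1.0)
    return -np.log(p_true)

# Example: two points, three classes; the second estimate has a slightly negative entry.
y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
p = np.array([[0.90, 0.08, 0.02], [-0.01, 0.30, 0.71]])
print(clipped_cross_entropy(y, p))
```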


Algorithm 1: KEC Hyperparameter Learning with Batch Stochastic Gradient Updates
1: Input: kernel family $k_\theta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, dataset $\{x_i, y_i\}_{i=1}^n$, initial kernel parameters $\theta_0$, initial regularisation parameter $\lambda_0$, learning rate $\eta$, tolerance $\epsilon$, batch size $n_b$
2: $\theta \leftarrow \theta_0$, $\lambda \leftarrow \lambda_0$
3: repeat
4:   Sample the next batch $I_b \subseteq \mathbb{N}_n$, $|I_b| = n_b$
5:   $\mathbf{Y} \leftarrow \{\delta(y_i, c) : i \in I_b, c \in \mathbb{N}_m\} \in \{0, 1\}^{n_b \times m}$
6:   $K_\theta \leftarrow \{k_\theta(x_i, x_j) : i \in I_b, j \in I_b\} \in \mathbb{R}^{n_b \times n_b}$
7:   $L_{\theta,\lambda} \leftarrow \mathrm{cholesky}(K_\theta + n_b \lambda I_{n_b})$
8:   $V_{\theta,\lambda} \leftarrow L_{\theta,\lambda}^T \backslash (L_{\theta,\lambda} \backslash \mathbf{Y}) \in \mathbb{R}^{n_b \times m}$
9:   $P_{\theta,\lambda} \leftarrow K_\theta V_{\theta,\lambda} \in \mathbb{R}^{n_b \times m}$
10:  $q(\theta, \lambda) \leftarrow \frac{1}{n_b} \sum_{i=1}^{n_b} L_\epsilon((\mathbf{Y})_i, (P_{\theta,\lambda})_i) + 4e \sqrt{\alpha(\theta)\, \mathrm{trace}(V_{\theta,\lambda}^T K_\theta V_{\theta,\lambda})}$
11:  $\theta \leftarrow \theta - \eta \frac{\partial q}{\partial \theta}(\theta, \lambda)$, $\lambda \leftarrow \lambda - \eta \frac{\partial q}{\partial \lambda}(\theta, \lambda)$ (or other gradient based updates)
12: until $\big\| \big[\frac{\partial q}{\partial \theta}(\theta, \lambda)^T \;\; \frac{\partial q}{\partial \lambda}(\theta, \lambda)^T\big]^T \big\|_\infty < \epsilon$ (or other stopping criteria)
13: Output: kernel parameters $\theta$, regularisation parameter $\lambda$
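A minimal sketch of algorithm 1 follows. It is our own illustration rather than the authors' implementation: the function names `batch_objective` and `learn_hyperparameters` are hypothetical, the kernel is assumed Gaussian (so $\sup_x k_\theta(x, x) = 1$), and forward differences stand in for the automatic differentiation one would normally use for the gradient updates.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def batch_objective(params, x_b, y_onehot, eps=1e-3):
    """q(theta, lambda) on one batch: clipped cross entropy plus 4e * r(theta, lambda)."""
    lengthscale, lam = np.exp(params)                       # optimise in log space for positivity
    nb, m = y_onehot.shape
    K = gaussian_kernel(x_b, x_b, lengthscale)
    L = np.linalg.cholesky(K + nb * lam * np.eye(nb))
    V = np.linalg.solve(L.T, np.linalg.solve(L, y_onehot))  # (K + nb*lam*I)^{-1} Y
    P = K @ V                                               # batch decision probabilities
    p_true = np.clip((y_onehot * P).sum(axis=1), eps, 1.0)
    data_fit = -np.log(p_true).mean()
    alpha = 1.0                                             # sup_x k(x, x) = 1 for the Gaussian kernel
    complexity = 4 * np.e * np.sqrt(alpha * np.trace(V.T @ K @ V))
    return data_fit + complexity

def learn_hyperparameters(x, y_onehot, nb=64, steps=200, eta=1e-2, h=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    params = np.array([0.0, np.log(1e-2)])                  # initial log lengthscale, log lambda
    for _ in range(steps):
        idx = rng.choice(len(x), size=min(nb, len(x)), replace=False)
        x_b, y_b = x[idx], y_onehot[idx]
        grad = np.zeros_like(params)
        for j in range(len(params)):                        # crude forward-difference gradient
            bumped = params.copy()
            bumped[j] += h
            grad[j] = (batch_objective(bumped, x_b, y_b) - batch_objective(params, x_b, y_b)) / h
        params -= eta * grad
    return np.exp(params)                                   # learned lengthscale and lambda
```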

Theorem 2 (Expected Risk Bound for KEC Hyperparameter Learning) For any $n \in \mathbb{N}^+$ and observations $\{x_i, y_i\}_{i=1}^n$ used to define $f_{\theta,\lambda}$ (5), with probability $1 - \beta$ over iid samples $\{X_i, Y_i\}_{i=1}^n$ of length $n$ from $P_{XY}$, every $\theta \in \Theta$, $\lambda \in \Lambda$, and $\epsilon \in (0, e^{-1})$ satisfies

$$\mathbb{E}[L_{e^{-1}}(Y, f_{\theta,\lambda}(X))] \le \frac{1}{n} \sum_{i=1}^n L_\epsilon(Y_i, f_{\theta,\lambda}(X_i)) + 4e\, r(\theta, \lambda) + \sqrt{\frac{8}{n} \log \frac{2}{\beta}}, \qquad (7)$$

where $r(\theta, \lambda) := \sqrt{\mathrm{trace}\big(\mathbf{Y}^T (K_\theta + n\lambda I)^{-1} K_\theta (K_\theta + n\lambda I)^{-1} \mathbf{Y}\big) \sup_{x \in \mathcal{X}} k_\theta(x, x)}$.

Since the training set itself is a sample of length $n$ drawn from $P_{XY}$, the inequality (7) is true with probability $1 - \beta$ when the random variables $X_i, Y_i$ are realised as the training observations $x_i, y_i$. We therefore employ this upper bound as the learning objective,

$$q(\theta, \lambda) := \frac{1}{n} \sum_{i=1}^n L_\epsilon(y_i, f_{\theta,\lambda}(x_i)) + 4e\, r(\theta, \lambda). \qquad (8)$$

We employ gradient based optimisers such as gradient descent or Adam (Kingma and Ba, 2016). Since theorem 2 holds for any $n \in \mathbb{N}^+$ and any set of data $\{x_i, y_i\}_{i=1}^n$ from $P_{XY}$, with the trade-off of relaxing bound tightness through $\sqrt{8 \log(2/\beta)/n}$, the bound (7) also holds with high probability for a subset of the training data. This enables scalable hyperparameter learning through batch stochastic gradient updates, each improving a different upper bound of the generalisation risk. We present this procedure in algorithm 1, reducing the time complexity from $O(n^3)$ to $O(n_b^3)$, where $n_b$ is the batch size. One particularly useful class of kernels are those constructed explicitly from neural networks $\varphi_\theta : \mathcal{X} \to \mathbb{R}^p$ by $k_\theta(x, x') = \langle \varphi_\theta(x), \varphi_\theta(x') \rangle$. We refer to KECs constructed this way as kernel embedding networks (KENs). Neural networks are typically defined by the weights and biases of their hidden layers, which collectively define the parameters $\theta$ of the constructed kernel, and are thus trainable under our learning algorithm. These kernels benefit in both expressiveness and scalability, since the $n \times n$ Cholesky decomposition can be replaced with a $p \times p$ decomposition via the Woodbury matrix inversion identity (Higham, 2002).
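To illustrate the scalability claim, the sketch below (our own, with hypothetical random features standing in for a trained network $\varphi_\theta$) computes the decision probabilities for a linear kernel on explicit features using only a $p \times p$ solve, via the push-through form of the Woodbury identity.

```python
import numpy as np

def ken_decision_probabilities(features_train, y_onehot, features_query, lam=1e-2):
    """Decision probabilities for a linear kernel on explicit features phi(x):
    Y^T (Phi Phi^T + n*lam*I_n)^{-1} Phi phi(x) = Y^T Phi (Phi^T Phi + n*lam*I_p)^{-1} phi(x),
    so only a p x p system is solved instead of an n x n one."""
    Phi = features_train                        # (n, p), e.g. the last hidden layer of a network
    n, p = Phi.shape
    A = Phi.T @ Phi + n * lam * np.eye(p)       # (p, p) instead of (n, n)
    W = np.linalg.solve(A, features_query.T)    # (p, n_query)
    return (y_onehot.T @ Phi @ W).T             # (n_query, m)

# Usage with random "network" features standing in for a trained phi_theta.
rng = np.random.default_rng(0)
phi_train, phi_query = rng.standard_normal((500, 32)), rng.standard_normal((3, 32))
y_onehot = np.eye(10)[rng.integers(0, 10, size=500)]
print(ken_decision_probabilities(phi_train, y_onehot, phi_query).shape)  # (3, 10)
```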

[Figure 1 plots log(r(θ, λ)) and the training and test cross entropy losses against iterations (0–500) for an initially overfitted and an initially underfitted model.]

Figure 1: Rademacher complexity based learning balances data fit and model complexity.

Table 1: Classification accuracy (%) of kernel embedding classifiers on UCI datasets

Dataset (n, d, m)      | GKEC       | GKEC-SGD   | KEN-1      | KEN-2      | Others
banknote (1372, 4, 2)  | 99.9 ± 0.2 | 98.8 ± 0.9 | 99.5 ± 1.0 | 99.4 ± 0.9 | 99.78 (a)
ecoli (336, 7, 8)      | 87.5 ± 4.4 | 84.5 ± 5.0 | 87.5 ± 3.2 | 86.3 ± 6.0 | 81.1 (b)
robot (5456, 24, 4)    | 96.7 ± 0.9 | 95.5 ± 0.9 | 82.3 ± 7.1 | 94.5 ± 0.8 | 97.59 (c)
segment (2310, 19, 7)  | 98.4 ± 0.8 | 96.1 ± 1.5 | 94.6 ± 1.6 | 96.7 ± 1.1 | 96.83 (d)
wine (178, 13, 3)      | 97.2 ± 3.7 | 93.3 ± 6.0 | 96.1 ± 5.0 | 97.2 ± 5.1 | 100 (e)
yeast (1484, 8, 10)    | 52.5 ± 2.1 | 60.3 ± 4.4 | 55.8 ± 5.0 | 59.6 ± 4.0 | 55.0 (b)

5. Experiments

Toy Example The first two of the four attributes of the iris dataset (Fisher, 1936) are known to yield class labels that are not separable: the same example $x \in \mathbb{R}^2$ may be assigned different output labels $y \in \mathbb{N}_3 := \{1, 2, 3\}$, since the classes are only separable in the two remaining attributes. In these difficult scenarios, the notion of model complexity is extremely important. Figure 1 demonstrates algorithm 1 with full gradient updates ($n_b = n$) to learn the corresponding conditional embedding hyperparameters. In particular, the initially overfitted model learns a simpler model at the expense of lower training performance, emphasising the benefits of complexity based regularisation, without which learning would only maximise training performance at the cost of further overfitting. Meanwhile, the initially underfitted model learns to trade some complexity away to improve its unflattering performance on the training set.

UCI Datasets We demonstrate the average performance of learning anisotropic Gaussian kernels and kernels constructed from neural networks on standard UCI datasets (Bache and Lichman, 2013), summarised in table 1.


[Figure 2, "Test Performance on MNIST by Learning Deep Convolutional Features", plots test accuracy (%) and cross entropy loss against epochs (0–800) for the kernel embedding network and the original convolutional network.]

Figure 2: Test accuracy and cross entropy loss by learning deep convolutional features.

The Gaussian kernel is learned with both full gradient updates (GKEC) and batch stochastic gradient updates (GKEC-SGD), the latter using a tenth ($n_b \approx n/10$) of the training set at each training iteration. For kernel embedding networks, we randomly select two simple fully connected architectures with 16-32-8 (KEN-1) and 96-32 (KEN-2) hidden units respectively, and learn the conditional embedding without dropout under ReLU activation. We compare our results to other approaches using neural networks (Kaya et al., 2016, a; Freire et al., 2009, c), probabilistic binary trees (Horton and Nakai, 1996, b), decision trees (Zhou et al., 2004, d), and regularised discriminant analysis (Aeberhard et al., 1992, e). Table 1 shows that our learning algorithm achieves similar performance without any special tuning or heuristics. The stochastic and full gradient approaches for Gaussian kernels perform similarly, supporting theorem 2 for $n = n_b$.

MNIST by learning convolutional features We build on an MNIST tutorial architecture from TensorFlow (Abadi et al., 2016).[1] The KEN employs a linear kernel on the last hidden layer to construct the conditional embedding, replacing the softmax layer and thus reducing the number of parameters. We then train both models with batches of $n_b = 6000$ images for 800 epochs, with learning rate $\eta = 0.01$. The convolutional features of the KEN are trained jointly with the regularisation parameter, initialised to $\lambda = 10$, under our learning objective (8), while the original CNN is trained under its usual cross entropy loss. Figure 2 shows that the KEN learns convolutional features at a much faster rate, achieving a test accuracy of 99.48%, compared to 99.26% for the original CNN. This demonstrates that our learning algorithm can perform end-to-end learning with convolutional features from scratch, simply by placing a conditional embedding on top of a neural network.
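A hedged sketch of what the objective (8) looks like when the last hidden layer replaces the softmax layer: this is our own plain numpy illustration, not the paper's TensorFlow code, and the batch maximum below is only a stand-in for $\sup_x k_\theta(x, x)$. In a differentiable framework, writing the same expression with that framework's ops lets gradients flow back into the convolutional weights and into $\lambda$.

```python
import numpy as np

def ken_batch_objective(phi_b, y_onehot_b, lam, eps=1e-3):
    """Objective (8) on one batch, with a linear kernel on the last hidden layer
    features phi_b of shape (n_b, p); the softmax layer is not used at all."""
    nb, m = y_onehot_b.shape
    K = phi_b @ phi_b.T                                     # linear kernel Gram matrix
    L = np.linalg.cholesky(K + nb * lam * np.eye(nb))       # lam > 0 keeps this positive definite
    V = np.linalg.solve(L.T, np.linalg.solve(L, y_onehot_b))
    P = K @ V                                               # batch decision probabilities
    p_true = np.clip((y_onehot_b * P).sum(axis=1), eps, 1.0)
    alpha = (phi_b ** 2).sum(axis=1).max()                  # batch estimate of sup_x k(x, x)
    return -np.log(p_true).mean() + 4 * np.e * np.sqrt(alpha * np.trace(V.T @ K @ V))

# Usage with random features standing in for the last hidden layer of a CNN.
rng = np.random.default_rng(0)
phi_b = rng.standard_normal((128, 64))
y_b = np.eye(10)[rng.integers(0, 10, size=128)]
print(ken_batch_objective(phi_b, y_b, lam=10.0))
```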

6. Conclusion

We propose a hyperparameter learning framework for conditional embeddings when the target is discrete. This naturally results in a nonparametric probabilistic multiclass classifier whose convergence properties can be guaranteed. Because nonparametric models have high capacity, we propose learning-theoretic bounds to regularise the conditional embedding, which also justify the use of stochastic gradient updates. The KEC is also inherently flexible in architecture, and in particular can perform end-to-end learning on a neural network, which we demonstrate on UCI datasets and MNIST digits, where it outperforms the original convolutional neural network on the latter.

1. https://www.tensorflow.org/get_started/mnist/pros


References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA, 2016.

S. Aeberhard, D. Coomans, and O. De Vel. Comparison of classifiers in high dimensional settings. Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep. 92-02, 1992.

Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

Ananda L. Freire, Guilherme A. Barreto, Marcus Veloso, and Antonio T. Varela. Short-term memory mechanisms in neural network learning of robot navigation tasks: A case study. In Robotics Symposium (LARS), 2009 6th Latin American, pages 1–6. IEEE, 2009.

Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.

Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

Paul Horton and Kenta Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. In ISMB, volume 4, pages 109–115, 1996.

Esra Kaya, Ali Yasar, and Ismail Saritas. Banknote classification using artificial neural network approach. International Journal of Intelligent Systems and Applications in Engineering, 4(1):16–19, 2016.

Michael Kearns, Yishay Mansour, Andrew Y. Ng, and Dana Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27(1):7–50, 1997.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. The International Conference on Learning Representations (ICLR), 2016.

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.

Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968. ACM, 2009.

Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

Yuesheng Xu and Haizhang Zhang. Refinement of reproducing kernels. Journal of Machine Learning Research, 10(Jan):107–140, 2009.

Zhi-Hua Zhou, Dan Wei, Gang Li, and Honghua Dai. On the size of training set and the benefit from ensemble. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 298–307. Springer, 2004.
