Proceedings of Machine Learning Research 1:1–7, 2017

ICML 2017 AutoML Workshop

Dealing with Integer-valued Variables in Bayesian Optimization with Gaussian Processes

Eduardo C. Garrido-Merchán ([email protected])
Daniel Hernández-Lobato ([email protected])
Universidad Autónoma de Madrid, Francisco Tomás y Valiente 11, 28049, Madrid, Spain

Abstract

Bayesian optimization (BO) methods are useful for optimizing functions that are expensive to evaluate, lack an analytical expression, and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. The acquisition function guides the optimization process and measures the expected utility of evaluating the objective at a new point. GPs assume continuous input variables. When this is not the case, for example when some of the input variables take integer values, extra approximations have to be introduced. A common approach is to round the suggested variable value to the closest integer before evaluating the objective. We show that this can lead to problems in the optimization process and describe a more principled approach to account for input variables that are integer-valued. We illustrate in both synthetic and real experiments the utility of our approach, which significantly improves the results of standard BO methods on problems involving integer-valued variables.

Keywords: Parameter tuning, Bayesian optimization, Gaussian processes, Integer-valued variables.

1. Background on Bayesian Optimization

Bayesian optimization (BO) methods (Shahriari et al., 2016) address the problem of optimizing a real-valued function f(x) over some bounded domain X. The objective function is assumed to lack an analytical expression (which prevents any gradient computation), to be very expensive to evaluate, and the evaluations are assumed to be noisy (i.e., rather than observing f(x) we observe y = f(x) + ε, with ε some additive noise). At each iteration t = 1, 2, 3, ... of the optimization process, BO methods fit a probabilistic model, typically a Gaussian process (GP) (Rasmussen and Williams, 2006), to the observations of the objective function $\{y_i\}_{i=1}^{t-1}$ collected so far. The uncertainty about the potential values of the objective function provided by the GP is then used to generate an acquisition function α(·), whose value at each input location indicates the expected utility of evaluating f(·) there. The next point x_t at which to evaluate f(·) is the one that maximizes α(·). After collecting this observation, the process is repeated. When enough data have been collected, the GP predictive mean value for f(·) can be optimized to find the solution of the problem.

The key to the success of BO is that evaluating the acquisition function α(·) is very cheap compared to evaluating f(·). This is so because the acquisition function only depends on the GP predictive distribution for f(·) at a candidate point x. Thus, α(·) can be maximized with very little cost.


BO methods hence spend a small amount of time thinking very carefully about where to evaluate the objective function next, with the aim of finding its optimum with the smallest number of evaluations. This is a very useful strategy when the objective function is very expensive to evaluate, and it can save a lot of computational time.

Let the observed data up to step t − 1 of the algorithm be $\mathcal{D}_{t-1} = \{(x_i, y_i)\}_{i=1}^{t-1}$. The GP predictive distribution for f(·) is given by a Gaussian distribution characterized by a mean µ(x) and a variance σ²(x). These values are:

$$\mu(x) = k_*^T (K + \sigma_0^2 I)^{-1} y\,, \qquad \sigma^2(x) = k(x, x) - k_*^T (K + \sigma_0^2 I)^{-1} k_*\,, \qquad (1)$$

where $y = (y_1, \ldots, y_{t-1})^T$ is a vector with the objective values observed so far; $k_*$ is a vector with the prior covariances between f(x) and each y_i; $\sigma_0^2$ is the variance of the additive Gaussian noise; $K$ is a matrix with the prior covariances among each f(x_i), for i = 1, ..., t − 1; and k(x, x) is the prior variance at the candidate location x. These quantities are obtained from a covariance function k(·, ·) which is pre-specified and receives as input the points x_i and x_j at which the covariance between f(x_i) and f(x_j) is evaluated. A typical covariance function employed in BO is the Matérn function (Snoek et al., 2012).

A popular acquisition function is expected improvement (EI) (Jones et al., 1998). EI is given by the expected value of the utility function u(y) = max(0, ν − y) under the GP predictive distribution for y, where $\nu = \min(\{y_i\}_{i=1}^{t-1})$ is the best value observed so far (assuming minimization). Thus, EI measures on average how much we will improve on the current best solution by performing an evaluation at each candidate point. The EI acquisition function is α(x) = σ(x)(γ(x)Φ(γ(x)) + φ(γ(x))), where γ(x) = (ν − µ(x))/σ(x), and Φ(·) and φ(·) are respectively the c.d.f. and p.d.f. of a standard Gaussian.
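To make equation (1) and the EI formula concrete, the following Python sketch (our own illustration, not tied to any particular BO package; the squared exponential kernel and all parameter values are assumptions) computes the GP predictive mean and variance at a set of candidate points and evaluates the EI acquisition function there.

```python
import numpy as np
from scipy.stats import norm

def sq_exp_kernel(A, B, length_scale=1.0, amplitude=1.0):
    # Squared exponential covariance between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_new, noise_var=1e-3):
    # Predictive mean and variance of Eq. (1) at the candidate points X_new.
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    k_star = sq_exp_kernel(X, X_new)               # prior cross-covariances
    mu = k_star.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, k_star)
    var = np.diag(sq_exp_kernel(X_new, X_new)) - (k_star * v).sum(axis=0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best_y):
    # EI = sigma * (gamma * Phi(gamma) + phi(gamma)), with gamma = (nu - mu) / sigma.
    sigma = np.sqrt(var)
    gamma = (best_y - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Toy usage: three noisy observations of a 1-dimensional objective.
X_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.array([0.3, -0.2, 0.4])
X_cand = np.linspace(0.0, 1.0, 101)[:, None]
mu, var = gp_posterior(X_obs, y_obs, X_cand)
ei = expected_improvement(mu, var, best_y=y_obs.min())
x_next = X_cand[np.argmax(ei)]                     # next point suggested by BO
```

In a full BO loop, x_next would be evaluated, the observation appended to the data set, and the GP refit before the next iteration.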

2. Dealing with Integer-valued Variables

The framework described assumes continuous input variables in f(·). This is so because in a GP the variables introduced in the covariance function are assumed to be continuous. A problem may arise when some of the inputs can only take values in a closed subset of a discrete set, such as the integers. If this is the case, the GP will ignore that constraint and will place some probability mass on invalid potential values for f(·). These incorrect modeling assumptions about the objective f(·) may have a negative impact on the optimization process. Furthermore, the optimization of α(·) will typically provide candidate points at which to evaluate the objective f(·) that are invalid, in the sense that integer-valued input variables will be assigned real values. In practice, some mechanism to transform real values into integer values must be implemented before the evaluation can take place. If this is not done with care, problems may appear in the optimization process.

Optimization problems involving continuous and discrete variables appear in the task of optimizing the hyper-parameters of machine learning systems (Snoek et al., 2012). For example, in a deep neural network we may be interested in adjusting the learning rate, the number of layers and the number of neurons per layer; the last two hyper-parameters can only take discrete values. Similarly, in an ensemble of decision trees generated by the gradient boosting algorithm (Friedman, 2001) we may try to adjust the learning rate and the maximum depth of the trees. This last hyper-parameter can only take discrete values.


A last example involves a nearest neighbor classifier. In this case we may be interested in finding the optimal number of neighbors and the optimal scaling factor per dimension to be used in the computation of the distance. The number of neighbors can only take discrete values.

[Figure 1: three rows of panels, labeled Naive ("Round variable before evaluation"), Basic ("Round variable inside wrapper") and Proposed ("Integer Transformation"). Each row shows the GP fit and the acquisition function before and after one additional evaluation of the objective.]

Figure 1: Different methods for dealing with integer-valued variables. At the top of each image we show a GP fit to the data (posterior mean and 1-std confidence interval, in purple) that models a 1-dimensional objective whose input can only take values in the set {0, 1, 2, 3, 4} (dashed line). To display the objective we have rounded the real values at which to do the evaluation to the closest integer. Below the GP fit, the acquisition function is shown; its maximum is the recommendation for the next evaluation. The two columns show results before and after evaluating a new point, respectively. The proposed approach leads to no uncertainty about the objective after two evaluations. Best seen in color.

A naive approach to consider that the objective can only be evaluated at integer values in some of the inputs is to (i) optimize α(·) assuming all variables take values on the real line, and (ii) replace the values of the integer-valued variables by the closest integer. This is the approach followed by Spearmint, a popular software package for BO (https://github.com/HIPS/Spearmint). However, as shown in the first row of Figure 1, this can lead to a mismatch between the points at which the acquisition takes high values and the points at which the actual evaluation is performed. Furthermore, it can produce situations in which the BO method always evaluates the objective at a point where it has already been evaluated (because subsequent evaluations will not change the acquisition function at all). For this reason, we discourage the use of this approach.

The previous problem can be solved by doing the rounding to the closest integer inside the wrapper that evaluates the objective. This basic approach is shown in the second row of Figure 1. In this case the points at which the acquisition takes high values and the points at which the objective is evaluated coincide. Thus, the BO method will tend to evaluate at different locations, as expected. The problem is, however, that the actual objective is constant in the intervals of values that are rounded to the same integer. This constant behavior is ignored by the GP, which can be sub-optimal.
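The practical difference between these two strategies can be illustrated with a small sketch. The code below is our own simplified illustration (not Spearmint's actual implementation, and all names are hypothetical): the naive wrapper evaluates at the rounded point and records the observation there, so the location the acquisition actually favored is never stored, while the basic wrapper rounds only internally, so the suggested and recorded locations coincide.

```python
import numpy as np

def objective(x_cont, n_int):
    # Hypothetical expensive objective; its second argument must be an integer.
    return (x_cont - 0.3) ** 2 + (n_int - 2) ** 2

def evaluate_naive(x_suggested):
    # Naive approach: round the suggestion before evaluating and record the
    # observation at the rounded location. The acquisition maximum (a real
    # value) and the evaluated point then disagree, and the BO loop may keep
    # proposing points that round to an already evaluated integer.
    x_cont, x_int = x_suggested
    x_recorded = (x_cont, float(round(x_int)))
    return x_recorded, objective(x_cont, int(round(x_int)))

def evaluate_basic(x_suggested):
    # Basic approach: round only inside the wrapper. The observation is
    # recorded at the suggested (real-valued) location, so the point the
    # acquisition favors and the point stored in the GP data set coincide,
    # although the GP still ignores that the objective is piecewise constant.
    x_cont, x_int = x_suggested
    return x_suggested, objective(x_cont, int(round(x_int)))

# Both wrappers return (location recorded by the BO method, observed value).
print(evaluate_naive((0.4, 1.7)))   # records (0.4, 2.0)
print(evaluate_basic((0.4, 1.7)))   # records (0.4, 1.7)
```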


3. Proposed Approach

We propose a method to alleviate the problems of the basic approach of Section 2. For this, we consider that the objective should be constant in the intervals that are rounded to the same integer. This property can be easily introduced in the GP by modifying k(·, ·). Covariance functions are often stationary and only depend on the distance between the input points. If the distance between two points is zero, the values of the function at both points will be the same (the correlation is equal to one). Based on this fact, we suggest transforming the input points to k(·, ·), obtaining an alternative covariance function k'(·, ·):

$$k'(x_i, x_j) = k(T(x_i), T(x_j))\,, \qquad (2)$$

where T(x) is a transformation in which all integer-valued variables of f(·) in x are rounded to the closest integer. The beneficial properties of k'(·, ·) when used for BO are illustrated in the third row of Figure 1. We can see that the GP model correctly identifies that the objective function is constant inside intervals of real values that are rounded to the same integer. The uncertainty is also the same in those intervals, and this is reflected in the acquisition function. Furthermore, after performing a single measurement in each interval, the uncertainty about f(·) goes to zero. This better modeling of the objective is expected to be reflected in a better performance of the optimization process.

Figure 2 illustrates the modeling properties of the proposed covariance function (2). It shows the mean and standard deviation of the posterior distribution given some observations, and compares the results with those of a standard GP that does not use the proposed transformation. In this case the data have been sampled from a GP using the covariance function in (2), with k(·, ·) the squared exponential covariance function (Rasmussen and Williams, 2006). One dimension takes continuous values and the other takes values in {0, 1, 2, 3, 4}. Note that the posterior distribution captures the constant behavior of the function in any interval of values that are rounded to the same integer, but only along the integer dimension (top). A standard GP (corresponding to the basic approach in Section 2) cannot capture this (bottom).
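Equation (2) is straightforward to implement by wrapping any standard covariance function. The sketch below is our own minimal illustration (the function and parameter names are hypothetical, and a squared exponential base kernel is assumed): the rounding transformation T is applied to the integer-valued dimensions of both inputs before the base kernel is evaluated.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    # Standard squared exponential covariance between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def T(X, integer_dims):
    # Round the integer-valued dimensions to the closest integer; leave the
    # continuous dimensions untouched.
    X = np.array(X, dtype=float, copy=True)
    X[:, integer_dims] = np.round(X[:, integer_dims])
    return X

def transformed_kernel(X1, X2, integer_dims, length_scale=1.0):
    # k'(x_i, x_j) = k(T(x_i), T(x_j)), as in Eq. (2).
    return sq_exp_kernel(T(X1, integer_dims), T(X2, integer_dims), length_scale)

# Two points whose integer-valued dimension rounds to the same value are
# perfectly correlated, so the GP models the objective as constant between them.
x_a = np.array([[0.5, 1.8]])   # second dimension is integer-valued
x_b = np.array([[0.5, 2.2]])
print(transformed_kernel(x_a, x_b, integer_dims=[1]))   # -> [[1.]]
```

Note that the only change relative to a standard GP is this input transformation; the hyper-parameters of the underlying covariance function can be estimated exactly as before.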

4. Experiments

We compare the performance of the proposed approach for BO with the basic approach described in Section 2. Each method has been implemented in the BO software Spearmint. We use a Matérn covariance function and estimate the GP hyper-parameters using slice sampling (Murray and Adams, 2010). The acquisition function employed is EI.

A first batch of experiments considers two synthetic objectives. The first objective depends on 2 variables: the first variable takes values in the interval [0, 1], and the second variable takes values in the set {0, 1, 2}. The second objective depends on 4 variables: the first 2 variables take values in the interval [0, 1], and the remaining 2 variables take values in the sets {0, 1, 2, 3} and {0, 1, 2}, respectively. In each case, we sample the objectives from a GP prior using (2) as the covariance function. We run each BO method (proposed and basic) for 50 and 100 iterations, respectively, for each objective, and report the logarithm of the distance to the minimum value of each objective as a function of the evaluations done. We consider 100 repetitions of the experiments.

[Figure 2 panels: posterior mean and posterior standard deviation surfaces for the proposed approach (top) and for a standard GP (bottom), plotted over the integer-valued and real-valued input dimensions.]

Figure 2: (top) Posterior mean and standard deviation of a GP model over a 2-dimensional space in which the first dimension can only take 5 different integer values and when the covariance function in (2) is used. Note that the second dimension can take any real value. (bottom) Same results for a GP model using a covariance function without the proposed transformation. Best seen in color.

We also consider these experiments when the objectives are contaminated with additive Gaussian noise with variance 0.01 and 0.001, respectively. The results obtained are displayed in Figure 3. We observe that the proposed approach gives better results than the basic approach. In particular, it finds points that are closer to the optimum with a smaller number of evaluations of the objective, both in the case of noisy and noiseless evaluations.

A last batch of experiments considers finding the learning rate and maximum tree depth of a gradient boosting ensemble (Friedman, 2001) of 100 trees that lead to the best performance on the digits dataset from the scikit-learn Python package (Pedregosa et al., 2011). We use 70% of the data for training and 30% for validation, and compute the performance in terms of the log-likelihood on the validation data. We consider optimizing the logarithm of the learning rate and five different values for the maximum depth of the trees, namely {1, 2, 3, 4, 5}. Again, we report the logarithm of the distance to the best observed performance on the validation set. We run each BO method (proposed and basic) for 100 iterations and consider 100 repetitions of the experiment. The results obtained are displayed in Figure 4. Again, the proposed approach gives better results than the basic approach. More precisely, it is able to find ensembles with better prediction properties on the validation set with a smaller number of evaluations. A sketch of the kind of objective wrapper used in this experiment is given below.
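The following Python sketch shows what such an objective wrapper might look like, under our own assumptions about the data split and the exact metric (the specific seed, split call and use of log-loss are illustrative, not taken from the paper). It uses scikit-learn's GradientBoostingClassifier on the digits dataset, takes the logarithm of the learning rate as a continuous input, and rounds the maximum depth inside the wrapper, as in the basic approach.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

def objective(log_learning_rate, max_depth):
    # max_depth arrives as a real-valued suggestion from the BO method and is
    # rounded inside the wrapper before fitting the ensemble of 100 trees.
    clf = GradientBoostingClassifier(n_estimators=100,
                                     learning_rate=float(np.exp(log_learning_rate)),
                                     max_depth=int(round(max_depth)))
    clf.fit(X_tr, y_tr)
    # Negative validation log-likelihood (to be minimized by the BO method).
    return log_loss(y_va, clf.predict_proba(X_va), labels=clf.classes_)

# Example evaluation at learning rate exp(-2) ~ 0.135 and maximum depth 3.
print(objective(-2.0, 2.7))
```

With the proposed approach, the only additional change is to use the transformed covariance function of equation (2) for the maximum-depth dimension inside the GP model.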


[Figure 3 panels: log difference to the minimum versus number of function evaluations for the Basic Approach and the Proposed Approach, in four settings: 2-dimensional objective with noiseless and with noisy observations, and 4-dimensional objective with noiseless and with noisy observations.]

Figure 3: Average results on the synthetic experiments with 2 and 4 dimensions.

[Figure 4 panel: log difference to the best observed validation performance versus number of function evaluations for the Basic Approach and the Proposed Approach on the digits dataset.]

Figure 4: Average results on the digits dataset using gradient boosting.

5. Conclusions

We have described a new approach to deal with integer-valued variables in BO methods that use GPs as the underlying model. This approach consists of (i) rounding real values to integer values and (ii) modifying the covariance function of the GP to account for the fact that the objective should be constant in the intervals of real values that are rounded to the same integer. The proposed approach has been evaluated on both synthetic and real problems and compared with a basic approach to handle integer-valued variables. These experiments show that the better modeling properties of the proposed approach lead to better results. More precisely, a BO method using such an approach finds solutions that are closer to the optimal ones in a smaller number of iterations.


Acknowledgments

The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P, TIN2016-76406-P, TIN2015-70308-REDT and TEC2016-81900-REDT (MINECO/FEDER EU), and from Comunidad de Madrid, Grant S2013/ICE-2845.

References

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems 23, pages 1732–1740, 2010.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006. ISBN 026218253X.

B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.

