Workstation Capacity Tuning using Reinforcement Learning [Extended Abstract]

Aharon Bar-Hillel (Intel Research Israel), Amir Di-Nur (Intel Inc.), Liat Ein-Dor (Intel Research Israel), Ran Gilad-Bachrach (Intel Research Israel), Yossi Ittach (Intel Research Israel)

ABSTRACT

Computer grids are complex, heterogeneous, and dynamic systems, whose behavior is governed by hundreds of manually-tuned parameters. As the complexity of these systems grows, automating the procedure of parameter tuning becomes indispensable. In this paper, we consider the problem of auto-tuning server capacity, i.e. the number of jobs a server runs in parallel. We present three different reinforcement learning algorithms, which generate a dynamic policy by changing the number of concurrently running jobs according to the job types and machine state. The algorithms outperform manually-tuned policies for the entire range of checked workloads, with average throughput improvement greater than 20%. On multi-core servers, the average throughput improvement is approximately 40%, which hints at the enormous improvement potential of such a tuning mechanism with the gradual transition to multi-core machines.

1. INTRODUCTION

The behavior of High Performance Computers (HPCs)¹ is determined by hundreds of hand-tuned parameters and policies. These parameters are adjusted to achieve some of the organization's business goals. In light of the consistent increase in computer capabilities and the persistent growth of the number of nodes in these environments [7], naïve, manual techniques can no longer provide an appropriate solution to the parameter tuning problem.

¹ Our main focus in this paper is on batch processing systems such as Condor [12] and PBS (http://www.openpbs.org), which are not grids in the full sense [8]. However, our findings are relevant to other environments as well, and we hence use the general term 'computer grids' throughout this paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC ’07 Reno, Nevada USA Copyright 2007 ACM X-XXXXX-XX-X/XX/XX ...$5.00.


Limitations of traditional optimization techniques point to the necessity of developing more sophisticated methods for learning optimal rules. Such methods must be efficient enough to cope with the ever-increasing complexity of the environment, and flexible enough to capture transient modifications (e.g., changes in job distribution, machine properties, etc.). Machine learning algorithms, which have shown impressive achievements in coping with high-complexity problems from various areas, are natural candidates for automating the parameter-tuning procedure. The new methods can enhance the ability of the system to achieve business goals by:

1. Revealing better policies, leading to improved system performance

2. Exposing high-level parameters (such as the tradeoff between job latency and throughput) and hiding low-level technical parameters

3. Enabling faster and cheaper adaptation to changes in the environment

In this paper, we focus on a specific parameter-tuning problem: automatically adjusting the number of concurrently running jobs on a grid workstation (server), which we term 'Workstation Capacity Tuning'. We argue for the importance of such tuning, and formulate the problem in a reinforcement learning framework. Specifically, we suggest that the problem should be solved in a distributed fashion, by running a tuning agent process on each workstation. This agent controls the number of running jobs by choosing between three possible actions: decrease the number of concurrently running jobs, keep the number as-is, or increase it. In every time interval, server information is collected (e.g., free available memory, load, etc.), and the agent selects one of the three actions based on the machine state. This results in a reward indicating the efficiency of the selected action. The agent's aim is to perform the optimal action, i.e., the one that maximizes the future expected reward given the current state. In this framework, we have designed and tested three algorithms, representing different reinforcement learning approaches:

Hidden Markov Model (HMM) + Linear Programming (LP) [2], online learning with TD-λ [18], and fitted Q-iteration with Parzen window regression [5].

Large grids have several goals, and their performance can be measured in several dimensions. Two of the most important performance measurements are throughput—the number of successful job executions per time unit—and latency—the average time it takes to complete job execution. Increasing throughput and reducing latency are often conflicting aims [17], since running more jobs in parallel increases throughput, but it also increases the time it takes to finish each job. The goal of our learning agent is to optimize both quantities, where the trade-off between them is determined by the system administrator.

The suggested algorithms were tested by applying them to a living grid, which is a small subset of a large operational grid (over 60,000 servers) developed at Intel. The grid, iDCP (Intel Distributed Computing Platform), is a batch processing system, serving approximately a million jobs per day submitted by thousands of users. We have recorded the resource consumption and durations of jobs running on this grid. In order to enable a controlled experiment, we have separated several machines from the main grid and "replayed" recorded jobs to this mini-grid while different capacity tuning policies are being used. We compare the learned policies to the current policy in iDCP, stating that the number of concurrently running jobs should be equal to the number of cores the workstation has. The majority of current servers have two CPUs, and so running two jobs in parallel is the most common policy.

The reference policy, i.e., the policy hand-tuned by the operators of iDCP, lacks the dynamic parameter tuning property which is essential for gaining optimal usage of the machine resources. Such a static policy can only reach suboptimal resource utilization, since it cannot respond to changes in job types or machine loads. As a consequence, two heavy jobs with high memory usage are often run in parallel, causing memory swaps that lead to memory exceptions and severe slow-downs. Similarly, running "multi-core aware" jobs together, each of which fully utilizes the computer resources when run alone, increases turnaround times in comparison with sequential runs, without any increase in throughput. The opposite situation occurs for light jobs with low CPU and memory requirements. Many such jobs can be packed together with almost no impact on turnaround and a significant improvement in throughput.

The considerable advantage of dynamic Workstation Capacity Tuning is demonstrated by the good performance of our three algorithms on the experimental grid. All three algorithms outperformed the standard policy of one job per core. An average throughput improvement of 20 percent was achieved for two-CPU machines. Since the TCO (Total Cost of Ownership) of a single server is several thousands of dollars, such an improvement implies possible annual savings of many millions of dollars for organizations with tens of thousands of servers. As the computer industry shifts to multi-core technologies, the need for dynamic capacity tuning becomes acute. Achieving high utilization from distributed computers becomes a big challenge when machine diversity (single-core, dual-core, etc.) and application diversity (single-threaded, multi-threaded) grow. We have conducted capacity tuning experiments on two multi-core machines. In these experiments, capacity tuning increased grid throughput by approximately 40 percent.
We believe the last result implies

that in the near future, a capacity tuning component will become indispensable for large computing environments.

The paper is organized as follows. In section 2, we describe the grid environment in which we work and the current handling of capacity tuning. In section 3, we cast capacity tuning as a reinforcement learning problem. In section 4, we describe the learning algorithms that we examined. Section 5 consists of the experimental setup description, followed by a summary of the experiment results. Section 6 is dedicated to a review of related studies. Finally, section 7 is devoted to a discussion and possible future research directions.

2. IDCP DESCRIPTION

iDCP is a distributed computing platform that acts as a middleware/arbitrator between demands for computing resources and the resources available. It manages machines and provides mainly batch execution capabilities, including queuing, scheduling, match-making, and resource arbitration for applications that require these machines. iDCP, which is deployed in 45 data centers world-wide, is a mature Grid implementation, evolved over the last 20 years. It is being used by 6000 active users (often 2000 active users simultaneously), and manages 60,000+ servers from a vast variety of platforms and operating systems. It executes approximately one million jobs per day; the jobs are very heterogeneous in all aspects (execution time, resource requirements, license requirements, parallel jobs, and more).

2.1 iDCP Architecture

iDCP is mostly written in Java (J2SE) and supports Java, Perl, and C++ APIs. It relies on a network file system, though it supports a file replication mechanism and management. It uses an in-house service oriented architecture (SOA) and in-house communication protocols between its components. The main components of the system are the following functional units (see Figure 1 for a structural diagram):

• Physical Cluster Master (PCM)—handles resource (workstation, license) allocation among the different users.

• Virtual Cluster Master (VCM)—distributes workloads between the different PCMs.

• Workstation Manager (WSM)—manages the single server and the jobs allocated to it.

• Flow Manager (FM)—enables a user to handle job submissions and get a personal view (including GUI) of job status and progress. The Flow Manager can interact with the PCM or the VCM.

• History Tracker—enables viewing of historical data about jobs, machines, pools, etc.

• Data Mining Agent—supports business-level decision-making.

Each of these components is actually a collection of services, living in a common service container. Many of these services are used in multiple components. Specifically, the PCM, VCM, and FM have a very similar architecture, and most of their services are shared.

Figure 1: iDCP schematic diagram

2.2 Execution Flow and Job Management

The user uses the Command Line Interface (CLI) or API (in some tools) to submit a job to the Flow Manager (FM). The FM queries the Virtual Cluster Masters (VCMs) on their state and sends the job to one of them according to pre-defined policies. The VCM is regularly updated about the Physical Cluster Masters' (PCMs) state and it sends the job to one of them, again according to some pre-defined policies. The PCM then tries to find a free matching workstation for the job and sends the job for execution. The Workstation Manager (WSM) executes the job, monitors its state and resource consumption, and updates the PCM. (These updates are sent from the PCM back to the VCM and to the FMs.) Once the job is finished, it is recorded in the History Tracker.

The WSM is configured with the number of jobs allowed to execute in parallel. It also supports a rule-based policy mechanism that allows it to suspend, resubmit, kill, or change the job nice value according to pre-defined configuration files. Examples of such rules are "suspend all jobs if the number of interactive users exceeds 3", or "resubmit all the jobs if the amount of used virtual memory is higher than 2GB". More generally, the WSM can "decide" to take these actions by evaluating configured boolean expressions referring to the machine state as measured by the WSM probes.

In order to allow automatic tuning, we have introduced a new capacity tuning service, which may run on the same machine as the WSM or on a remote machine. If this service is alive, the WSM contacts it periodically (currently this is done every 30 seconds), and reports the machine state. The algorithm used by the capacity tuning service then chooses the next action, and asynchronously sends the command to

the WSM. The capacity tuning service may support multiple WSMs, and its commands override the default rule-based WSM policy.
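To make the interaction concrete, the following is a minimal sketch of the report-decide-command cycle described above. The interface names (choose_action, send_command) and the Protocol types are illustrative assumptions only; they are not iDCP's actual APIs, which are not shown in the paper.

```python
from typing import Protocol

class WorkstationManager(Protocol):
    def send_command(self, action: str) -> None: ...

class CapacityPolicy(Protocol):
    def choose_action(self, state: dict) -> str: ...

REPORT_INTERVAL_SEC = 30  # the WSM currently reports its machine state every 30 seconds

def on_state_report(state: dict, policy: CapacityPolicy, wsm: WorkstationManager) -> None:
    """Handler invoked each time a WSM reports its machine state: the service's
    policy picks one of the three actions and the command is sent back to the
    WSM, overriding its default rule-based policy. One service instance may
    serve several WSMs."""
    action = policy.choose_action(state)   # INCREASE / KEEP / DECREASE
    wsm.send_command(action)               # sent back asynchronously in the real system
```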

3. CAPACITY TUNING AS A LEARNING PROBLEM

Our problem of interest is maintaining the optimal load on servers belonging to a distributed computing environment. The goal is to improve the service level of the system by enhancing the performance of each of its attached servers. In essence, this is a control problem. The mechanism we suggest dynamically controls the server capacity which the WSM announces to the PCM (see Section 2). By increasing or decreasing the announced capacity, the mechanism regulates the load on the servers. Before presenting the explicit formulation of the problem, we introduce two building blocks of the formulation: Markov Decision Processes and Reinforcement Learning.

3.1 Markov Decision Processes

Markov Decision Processes (MDP) model a controlled system as a state machine. At each time point t, the controlled system is in some intrinsic state s_t. Based on its view of the state, the controller performs an action a_t, which is translated into a new state s_{t+1} in the controlled system (this translation is stochastic in most cases). In doing so, the controller receives a reward which is a function of the state and the performed action, r_{t+1} = r(s_{t+1}, a_t). The reward measures how desirable the action-state pair is. The controller then has to make a decision in time step t+1, based on the observed state s_{t+1}, and so on. The MDP is completely characterized by a quadruplet ⟨S, A, P(s_{t+1} | s_t, a_t), P(r_t | s_t, a_{t-1})⟩, where S is the set of possible states and A is the set of allowed actions.

When designing a controller, the goal is to direct the controlled system such that the collected rewards are maximized. However, in most cases we are interested in the long-term rewards. The controller should be willing to suffer some low rewards in the short term if it allows it to reach more rewarding states later on in the process. While there are several ways to define the long-term reward, in this paper we use the discounted reward. In this method, the goal of the controller at stage t is to perform the action a_t which will maximize the expected state value defined as

V = Σ_{T=0}^{∞} γ^T r_{t+T+1}.

In this term, 0 ≤ γ < 1 is a decay factor. When γ = 0, the controller is instructed to be greedy, i.e., to optimize the immediate reward. However, as γ increases, the controller becomes less and less myopic. The control task is to design a policy π : S → A, stating the best action in each state. See [18] for a comprehensive introduction to Markov decision processes.
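As a small illustration of the discounted objective and of a policy derived from a value function, the sketch below computes the discounted return of a recorded reward sequence and a greedy policy from a tabular Q-function; the tabular setting is for illustration only, since the state space used later in this paper is continuous.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """V = sum over T >= 0 of gamma^T * r_{t+T+1}, for a recorded reward sequence."""
    return sum(gamma ** T * r for T, r in enumerate(rewards))

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a), for a tabular |S| x |A| Q-function."""
    return np.argmax(Q, axis=1)
```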

3.2 The Capacity Problem as a Markov Decision Process

In order to use the MDP formulation, we need to cast the capacity tuning problem we are interested in as a Markov Decision Process. Specifically, we need to define the state space S, the set of possible actions A, and the immediate reward function r(s, a).

3.2.1 The State Space

In our system we represent the state space using current measurements of the server being controlled. Therefore, in most cases, the state space is continuous.² Our state representation consists of the following parameters:

1. Free real memory—the amount of available physical memory (in logarithmic scale)

2. Used virtual memory—the amount of virtual memory (swap space) in use (in logarithmic scale)

3. Load—the average number of processes waiting for the CPU, as reported by the Unix load parameters

4. CPU idle time—the percentage of spare CPU resources during the last 30-second interval

5. CPU system time—the percentage of time the CPU spent processing system and I/O procedures (e.g., system calls) during the last 30-second interval

6. Average number of jobs—the average number of jobs concurrently running on the server during the last 30-second interval

7. The fit—the fit (see Definition 2) of the last time interval (this is the positive part of the reward, excluding the penalty term—see more in section 3.2.3)

Since the measurements have very different scales, we rescale each of them to have zero mean and unit variance, where these statistics are estimated using the training set sample (see section 5.1 for details of training sample acquisition).

² The LP algorithm uses the Partially Observed Markov Decision Process model, and assumes the underlying state space is discrete and finite. See more details in Section 4.1.
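A minimal sketch of how such a state vector could be assembled is given below. The probe field names are hypothetical; the logarithmic scaling of the memory measurements and the zero-mean, unit-variance rescaling follow the description above, with the normalization statistics estimated once from the training sample.

```python
import numpy as np

def state_vector(probe, train_mean, train_std):
    """Build the 7-dimensional state representation from a WSM probe report."""
    raw = np.array([
        np.log(probe["free_real_memory"] + 1.0),     # 1. free physical memory (log scale)
        np.log(probe["used_virtual_memory"] + 1.0),  # 2. used swap space (log scale)
        probe["load"],                               # 3. Unix load average
        probe["cpu_idle_pct"],                       # 4. CPU idle time over the last 30s
        probe["cpu_system_pct"],                     # 5. CPU system/I/O time over the last 30s
        probe["avg_running_jobs"],                   # 6. average number of running jobs
        probe["last_fit"],                           # 7. fit of the last interval (Definition 2)
    ])
    return (raw - train_mean) / train_std            # zero mean, unit variance per feature
```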

3.2.2 The Actions

For each state, there are three possible actions the controller can take. Generally speaking, the actions are: increase by one the allowed number of jobs, decrease by one the allowed number of jobs, or keep the allowed number of jobs unchanged. Notice that the actions set the number of running jobs, but not their identity. For example, assume that the allowed number of jobs is 2, and 2 jobs are indeed running. When one of them finishes, the PCM may send another job to replace it without any intervention of the capacity tuning controller. A 'keep' operation, therefore, does not keep the currently running jobs, as they may be replaced due to termination, but merely their number.

The action semantics are rather straightforward; nevertheless, special attention must be paid to transition cases. For example, when the controller is asked to increase the number of running jobs, it may take a while until a new job is launched. This delay might cause the controller to ask for an additional job, and therefore some caution is needed in defining the actions. For this purpose we use two quantities: the average number of running jobs and the allowed number of jobs. In a sense, the allowed number of jobs is the number desired by the controller, whereas the average number of running jobs measures the actual number. Using these two quantities, we define the actions as follows:

1. INCREASE—the allowed number of jobs is set to the smallest integer that is greater than the average number of running jobs.

2. DECREASE—the allowed number of jobs is set to the largest integer that is smaller than the average number of running jobs. If the number of currently running jobs exceeds this number, the job which performed the least amount of calculations so far is terminated and resubmitted to be re-executed.

3. KEEP—if the controller performs the KEEP action, then there are two cases. If the current allowed number of jobs is greater than the average number of running jobs, then the new number is set to the smallest integer that is greater than or equal to the average number of running jobs. However, if the current allowed number of jobs is smaller than the average number of running jobs, then the new number is set to the largest integer that is smaller than or equal to the average number of running jobs.

The somewhat cumbersome definition of the KEEP action is necessary to compensate for transition phases. For example, assume that there are 3 running jobs and the controller performs the INCREASE action. During the next time interval, a new job is launched on the server, and thus the next value of the average number of running jobs will be between 3 and 4. In that case, if the controller performs either KEEP or INCREASE, the allowed number of jobs will remain 4. However, consider the case in which 4 jobs are running in the first place, and the controller performs the DECREASE action. Again, in the next stage the average number of running jobs will be between 3 and 4. However, in this setting, if the controller performs KEEP or DECREASE, the allowed number of jobs will remain 3.

To prevent extreme cases, which may lead to workstation crashes, we created a safety net using the following rules:

1. The allowed number of jobs is smaller than 3 times the number of cores.

2. The allowed number of jobs is at least 1.

3. If the load exceeds 10 times the number of cores, terminate a job.

4. If the amount of free virtual memory drops below 1GB, terminate a job.

Table 1: Reward and fit characteristics. Assume jobs j1, j2, j3 can each run a full time unit, but j1 utilizes 80% of the CPU while the other two utilize 50% of the CPU each. A: the instantaneous rewards for running j1 or j2, for two α values; for larger α the reward advantage of j1 is more pronounced. B: the fit scores for two cases, running only j1 versus running j2 and j3 together, for two α values; running j1 alone is favorable if α = 2, but running j2 and j3 is favorable if α = 1.

A: r(j, t)     j1      j2
α = 1          0.8     0.5
α = 2          0.64    0.25

B: f(t)        j1 only    j2 and j3
α = 1          0.8        1
α = 2          0.64       0.5
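Putting the action definitions and the first two safety rules together, a minimal sketch of the controller's capacity update follows. Function and parameter names are ours; termination and resubmission of a surplus job, as well as the load and virtual-memory rules, are assumed to be handled by the WSM.

```python
import math

def apply_action(action, avg_running, allowed, n_cores):
    """Compute the new allowed number of jobs from the measured average number
    of running jobs, following the INCREASE / DECREASE / KEEP semantics above."""
    if action == "INCREASE":
        allowed = math.floor(avg_running) + 1        # smallest integer > average
    elif action == "DECREASE":
        allowed = math.ceil(avg_running) - 1         # largest integer < average
    else:  # KEEP
        if allowed > avg_running:
            allowed = math.ceil(avg_running)         # smallest integer >= average
        else:
            allowed = math.floor(avg_running)        # largest integer <= average
    # safety net: at least one job, at most 3 jobs per core
    return max(1, min(allowed, 3 * n_cores))
```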

3.2.3 The Reward

While the state space and the actions represent the controlled system and its steering mechanisms, the long-term reward is the optimization goal of the controller. Therefore, the reward should direct the controller to reach desired states while avoiding undesired states and actions. In our case, we would like to direct the controller to balance throughput and job slowdown. We introduce a single tradeoff parameter α ≥ 1 and define the following quantities:

Definition 1. The instantaneous reward for job j at time interval t is

r(j, t) = W_time(j, t) · (U_time(j, t) / W_time(j, t))^α,

where W_time(j, t) is the percentage of the time job j was allocated to the server during time interval t, and U_time(j, t) is the percentage of CPU resources allocated to job j during time interval t.

The reward is the product of two terms: the 'resource exploitation rate' U_time(j, t) / W_time(j, t) < 1, measuring the percentage of CPU time exploited by job j, and the running duration W_time(j, t), which is almost always 1 (unless a job has started or finished in the middle of the interval). Changing α, we can decrease or increase the emphasis on the resource exploitation rate. Specifically, when α = 1, the instantaneous reward is simply U_time(j, t)—the percentage of CPU resources consumed by the job. However, for large values of α, only jobs achieving a high resource exploitation rate get significant rewards. Notice that using high α reduces the reward for all jobs, but the effect is more dramatic for jobs with low U_time(j, t) / W_time(j, t). See Table 1A for an example of this behavior. The next definition evaluates the optimality of a server state.

Definition 2. The fit at time interval t is

f(t) = Σ_j r(j, t) = Σ_j W_time(j, t) · (U_time(j, t) / W_time(j, t))^α,

where the sum is over all the jobs assigned to the server during time interval t.

The fit is just the sum of the instantaneous rewards, and the role α has in it is demonstrated in Table 1B. When α = 1 the fit is the sum of U_time(j, t), amounting to the total percentage of CPU utilization. In this case, the emphasis is on throughput and the fit is higher when the CPUs are more crowded. As α increases, the fit prefers jobs getting a high portion of CPU time. This works in favor of reduced slowdown, since it prefers allocating jobs as much CPU time as they can use, and hence they complete earlier.

Finally, we define the reward itself. The reward is composed of two terms: a positive term and a penalty. The positive term is simply the fit as defined in Definition 2. The penalty term penalizes for jobs that were prematurely terminated due to a "Decrease" command issued by the controller. In this penalty term, all the instantaneous rewards gained in the past by the relevant jobs are deducted.

Definition 3. The reward at time t is

r(t) = f(t) − Σ_{j ∈ Term(t)} Σ_{t' ≤ t} r(j, t'),

where Term(t) is the set of all jobs which prematurely ended during time interval t.

The penalty term is relevant only for the "Decrease" operation, which ends the job with the smallest penalty. If this job was running for a long time, the penalty may be orders of magnitude higher than the momentary fit, and so such resubmissions are highly undesirable.
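A small sketch of Definitions 1-3 follows. The per-job bookkeeping (a list of past instantaneous rewards for each job) is an assumption made only for the illustration.

```python
def instantaneous_reward(u_time, w_time, alpha):
    """Definition 1: r(j, t) = W_time * (U_time / W_time) ** alpha."""
    return w_time * (u_time / w_time) ** alpha if w_time > 0 else 0.0

def interval_reward(running_jobs, terminated_jobs, reward_history, alpha):
    """Definition 3: the fit (Definition 2) minus, for every job terminated by a
    DECREASE in this interval, all the instantaneous rewards it gained so far.
    running_jobs maps job id -> (U_time, W_time); reward_history maps job id ->
    list of its past instantaneous rewards."""
    fit = sum(instantaneous_reward(u, w, alpha) for (u, w) in running_jobs.values())
    penalty = sum(sum(reward_history[j]) for j in terminated_jobs)
    return fit - penalty
```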

3.3 Reinforcement Learning

In Section 3.2, the capacity problem is presented as a Markov Decision Process. However, so far we did not discuss the dynamics of the system—how do the actions performed by the controller affect the states and the rewards? The controlled system is highly stochastic. Furthermore, the nature of the system depends on the workloads it serves. For example, when the controller asks to increase the capacity, i.e., accept an additional job, the exact nature of the job that will arrive has cardinal influence on the new state and on the reward. A possible way to summarize the reward-related aspect of the stochastic dynamics is the Q-function. This function, Q : S × A → R, maps each (s, a) pair to the long-term reward expected when action a is performed in state s. Once this function is well estimated, the optimal policy can be easily found by choosing π(s) = argmax_a Q(s, a). It is natural to use machine learning algorithms to "learn" the policy from observations, either by estimating the Q-function or by using other techniques. The area in machine learning which focuses on learning such policies is Reinforcement Learning (RL); see [18] for a comprehensive introduction. There are plenty of RL algorithms. In this work we experimented with three such algorithms, which differ in some key areas:

1. Online vs. Batch—the TD(λ) algorithm (see 4.2) is an online algorithm, which means that the underlying policy is constantly evolving. In contrast, the fitted Q iteration and the HMM-LP algorithms (see 4.3 and 4.1 respectively) are batch algorithms, in the sense that once trained, they use the same policy until trained again.

2. MDP vs. POMDP (Partially Observed Markov Decision Process)—the TD(λ) and fitted Q iteration algorithms work within the MDP framework, in which the current state is assumed to be explicitly known to the controller. Conversely, HMM-LP works in the POMDP setting, in which the true state is assumed to be hidden.

The goals of this study are to verify the suitability of RL algorithms to generate policies for the capacity tuning problem, and to discover the most suitable RL algorithm family.

4. THE PROPOSED ALGORITHMS

Reinforcement learning is an active research field that includes several approaches, whose relative merits are still being investigated. Hence we applied three different RL algorithms to the capacity tuning problem, each representing a different approach.

4.1 HMM-LP: a POMDP Approach

The HMM-LP algorithm works in the POMDP framework. It assumes that the underlying state space is finite, and that the states we see are noisy observations emitted by the pure underlying states. The hidden state space is assumed to be discrete and finite. Therefore, learning is done in two phases. In the first stage, the underlying states are discovered, and the dynamics of the system are estimated. In the second stage, an optimal policy is generated for the model learned in the first stage (see [2] for a similar approach).

The first stage, i.e., learning the state space and dynamics, is a problem of learning a Hidden Markov Model (HMM) [22]. While finding the optimal model is NP-hard, good models can be found using the Baum-Welch algorithm. Once the underlying model has been revealed, the optimal policy is found by solving a Linear Program (LP) [6]. The learning process is done offline. In real time, the controller holds a belief-state, which is a probability vector over the underlying (hidden) states. Given the measurements of the workstation, it revises its belief, and chooses the action that maximizes the expected long-term reward. This term is computed by averaging, according to the belief-state, the expected reward of each state-action pair (computed when the LP problem was solved).

In our experiments, we have used a state space of 10-50 states. For each experiment, the optimal value was computed by cross validation to avoid over-fitting. As a result, the amount of memory used in real time is very small, typically less than 10KB. Furthermore, the real-time calculation is very efficient as well; the main calculation done is multiplying an n × n matrix by a vector of size n, where n is the number of underlying states (i.e., 10 ≤ n ≤ 50).
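The real-time part of HMM-LP can be sketched generically as a belief update followed by a belief-averaged value maximization. The parameterization below (an action-conditioned transition matrix and an observation-likelihood vector) is a standard POMDP formulation and only an assumption; the paper does not detail the exact form of the learned HMM.

```python
import numpy as np

def belief_update(belief, action, obs_likelihood, T):
    """One belief-state revision: propagate through the transition matrix of the
    chosen action, weight by the likelihood of the new observation, renormalize.
    T[a] is an n x n matrix with T[a][i, j] = P(s'=j | s=i, a); obs_likelihood is
    the length-n vector P(observation | s')."""
    b = obs_likelihood * (T[action].T @ belief)
    return b / b.sum()

def select_action(belief, Q):
    """Pick the action maximizing the belief-averaged long-term reward, where
    Q[s, a] is the state-action value computed by the LP."""
    return int(np.argmax(belief @ Q))
```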

4.2 TD-λ: an Online Approach

Unlike HMM-LP and fitted Q iteration, TD-λ is an online algorithm [18], which performs job monitoring and policy learning in parallel. Every time a new workstation state is observed, a control decision is made, and the resultant reward r is used to update the Q-function. Let 0 < γ < 1 and 0 < λ < 1 be the discount rate (defined in 3.1) and a decay factor respectively, and denote by N_a the number of actions and by N_s the size of the state vector. In our implementation we assume that the Q-function is linear, i.e. the Q-function is represented by an N_s × N_a matrix Θ such that Q(s, a) = Θ_a · s, where Θ_a denotes the column of Θ corresponding to action a, and s is a state vector. The aim of TD-λ is to continuously improve this estimate, by updating Θ at every time step t according to

δ = [r_t + γ Q(s_t, a_t)] − Q(s_{t−1}, a_{t−1}),

which is indicative of the gap between the current estimate and the actual value of Q. The update is performed by the following formula

Θ_t = Θ_{t−1} + η δ E_t,

where η > 0 is the step size, and the eligibility matrix E_t is an N_s × N_a matrix which is iteratively updated as follows:

E_t^a = γλ E_{t−1}^a + s_{t−1} δ_{a,a_{t−1}}, ∀a,

where δ_{x,y} is the Kronecker delta. Note that E_0 is initialized as a zero matrix. Given a state s_t, the value of Q(s_t, a) is calculated for the three actions, and Boltzmann exploration with inverse temperature β is applied for selecting the next action [18]. We use this exploration to avoid convergence to suboptimal solutions. The user-defined parameters of this algorithm are γ, λ, β, and η, whose values are selected to optimize the results on the validation set.
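The update rules above can be put together as follows; this is a self-contained sketch of linear TD(λ) with eligibility traces and Boltzmann exploration, with Θ and E stored as N_s × N_a matrices. Class and method names, as well as the default hyperparameter values, are illustrative only.

```python
import numpy as np

class LinearTDLambda:
    """Sketch of the online TD-lambda controller with a linear Q-function."""
    def __init__(self, n_features, n_actions, gamma=0.9, lam=0.7, eta=0.01, beta=1.0):
        self.theta = np.zeros((n_features, n_actions))   # Theta: Q(s, a) = theta[:, a] . s
        self.elig = np.zeros((n_features, n_actions))    # eligibility matrix E, starts at zero
        self.gamma, self.lam, self.eta, self.beta = gamma, lam, eta, beta
        self.prev_state, self.prev_action = None, None

    def q_values(self, s):
        return self.theta.T @ s                           # Q(s, a) for every action a

    def select_action(self, s):
        """Boltzmann exploration with inverse temperature beta."""
        q = self.q_values(s)
        p = np.exp(self.beta * (q - q.max()))
        p /= p.sum()
        return int(np.random.choice(len(q), p=p))

    def step(self, s, r):
        """Observe the new state s and reward r, update Theta, return the next action."""
        a = self.select_action(s)
        if self.prev_state is not None:
            delta = (r + self.gamma * self.q_values(s)[a]
                     - self.q_values(self.prev_state)[self.prev_action])
            self.elig *= self.gamma * self.lam            # decay all traces
            self.elig[:, self.prev_action] += self.prev_state
            self.theta += self.eta * delta * self.elig
        self.prev_state, self.prev_action = s, a
        return a
```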

4.3 Fitted Q Iteration: Reduction to Regression Approach

Fitted Q iteration [4, 5] is a batch mode reinforcement learning algorithm which yields an approximation of the infinite-horizon Q-function by iteratively extending the optimization horizon. The reinforcement learning problem is reduced to a series of supervised regression problems, successively regressing the optimal Q-function for a K-turns future. The Q-function is estimated in the original continuous state space, and any supervised regression algorithm can be used for this task. Using the notation from section 3.1, the algorithm learns from a sample of 4-tuples {(s_t^i, a_t^i, r_{t+1}^i, s_{t+1}^i)}_{i=1}^n, obtained by running an exploration algorithm on the experiment pool (see 5.1 for more details). Starting from Q_0(s, a) = 0, we learn Q_{k+1}(s, a) by fitting a function mapping

(s_t^i, a_t^i) → r_{t+1}^i + γ max_a Q_k(s_{t+1}^i, a).

In [5] fitted Q iteration was successfully used with regression trees for Q-function estimation, and achieved considerable performance improvement over online Q-learning. In our experiments, we have used a simple Parzen window-based regressor. Given a set of points with known function values {(s_i, f(s_i))}_{i=1}^n, this regressor interpolates a function value for a new point s by

f(s) = Σ_{i=1}^n w_i f(s_i),   w_i = exp(−‖s − s_i‖² / σ²).

The learned policy is sensitive to the value of the std parameter σ, and we choose this value by learning several policies and testing them on the validation set.
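A compact sketch of fitted Q iteration with a Parzen-window regressor is shown below. Two details are our assumptions rather than the paper's: the kernel weights are normalized by their sum (the usual Nadaraya-Watson form), and a separate regressor is kept per action.

```python
import numpy as np

def parzen_regress(query, points, values, sigma):
    """Parzen-window regression with Gaussian weights w_i = exp(-||s - s_i||^2 / sigma^2)."""
    w = np.exp(-np.sum((points - query) ** 2, axis=1) / sigma ** 2)
    return float(np.dot(w, values) / (w.sum() + 1e-12))

def fitted_q_iteration(samples, n_actions, n_iters=20, gamma=0.9, sigma=1.0):
    """samples: (s, a, r, s_next) 4-tuples from the exploration runs. Each iteration
    regresses the targets r + gamma * max_a Q_k(s_next, a), starting from Q_0 = 0."""
    states = np.array([s for s, _, _, _ in samples])
    actions = np.array([a for _, a, _, _ in samples])
    rewards = np.array([r for _, _, r, _ in samples])
    next_states = np.array([sn for _, _, _, sn in samples])
    targets = rewards.copy()                              # Q_1 targets = immediate rewards
    for _ in range(n_iters - 1):
        q_next = np.array([
            max(parzen_regress(sn, states[actions == a], targets[actions == a], sigma)
                for a in range(n_actions))
            for sn in next_states])
        targets = rewards + gamma * q_next                # Q_{k+1} targets
    def q(s, a):
        mask = actions == a
        return parzen_regress(s, states[mask], targets[mask], sigma)
    return q
```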

5. EXPERIMENTS

We experimented with the suggested capacity tuning mechanisms by running simulated workloads on a small iDCP pool. We describe the experiment setup in section 5.1, and the results in section 5.2.

5.1 Experimental Setup

Machines and Pools. For the experiments reported in sections 5.2.1 and 5.2.2 we created a small iDCP pool with a PCM and three dual-CPU servers with 4GB physical memory. The multi-core experiments reported in section 5.2.3 were run on another pool, with a PCM and two dual-CPU dual-core machines, with 16GB memory. All the experiments were carried out by running simulations of previously recorded workloads on the experimental pools.

Recorded Load Ensembles. The algorithms were trained, tuned, and tested in a three-phase procedure which included exploration, validation, and testing. The three phases required three distinct job ensembles, and each of the three had to represent the entire job population. In order to produce these ensembles, we collected working parameters from all the jobs running in a large data center during a 24-hour period. The running jobs were sampled every 3 minutes and records of their CPU, memory, and system time requirements were produced. A total of 70000 jobs were recorded and divided into 70 workload units of 1000 jobs each. The jobs were ordered according to their arrival times, and since consecutive jobs are often similar, there are large differences between the 70 workload units. We applied K-means clustering to these units, and divided them into 4 clusters based on the average and variance of their resource consumption behavior. An ensemble was created by selecting one workload unit from each cluster, thus ensuring representation of all unit types. We created three different ensembles for the exploration, validation, and test stages.
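The ensemble construction can be sketched as follows, assuming each workload unit is summarized by a feature vector of mean and variance statistics of its jobs' resource consumption (the exact summary features are not specified beyond this); scikit-learn's KMeans stands in for whatever clustering implementation was actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_ensemble(unit_features, n_clusters=4, seed=0):
    """unit_features: a (70, d) array of per-unit summary statistics.
    Returns the indices of one workload unit drawn per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(unit_features)
    rng = np.random.default_rng(seed)
    return [int(rng.choice(np.where(labels == c)[0])) for c in range(n_clusters)]
```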

Workload Simulators. We used two synthetic job simulators to produce a fast-forward simulation of the recorded jobs. The simulators simulate a job by running a process which replicates its resource consumption. Each recorded job sample, representing three minutes of actual execution, is simulated for 10 seconds. The job simulator used in our main experiments is written in C. While simulating memory allocation is relatively easy, we do not have information regarding the exact memory access pattern of jobs and hence such simulation is more involved. Following [11], we use a Markov chain to produce the memory access pattern. In order to simulate locality of reference in memory access we use normal transition probabilities, i.e., the next address to be accessed is chosen from a Gaussian distribution centered around the last accessed address. This model ensures that the working set size (defined and characterized in [3]) will be a concave function of the running time interval, as predicted and empirically measured in [3, 1]. We use a second job simulator, written in Java, in order to simulate more demanding memory access patterns. While resource consumption is simulated in a similar fashion to the C version, the Java virtual machine environment includes dynamic processes such as the garbage collector, leading to more complex memory access patterns. This simulator explicitly accesses four random addresses in each allocated megabyte of memory (in a 10-second period), but the total amount of accessed memory induced by Java mechanisms is much higher.

Exploration and Validation. Exploration was carried out by applying an exploration policy to the four workload units of the exploration ensemble. This policy selected the actions with probabilities 0.8, 0.1, and 0.1 for KEEP, INCREASE, and DECREASE respectively. During exploration, the safety rules stated in section 3.2.2 were enforced, as well as one additional rule—when the used virtual memory exceeded 2GB, the INCREASE action was disabled. The data obtained from the exploration phase was processed by the algorithms to produce a collection of policies, each corresponding to a different set of values of the free parameters. To converge to a single optimal policy, the different policies were applied to the validation ensemble, and the set of parameters that yielded the best results was selected. For each algorithm, only the selected policy was applied to the test ensemble. Note that no job appeared in more than one ensemble, and no ensemble was used by more than one phase.
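The exploration policy itself is simple; a sketch is given below. The state field name used for the virtual-memory rule is hypothetical, and the safety rules of section 3.2.2 are assumed to be enforced elsewhere.

```python
import random

ACTIONS = ["KEEP", "INCREASE", "DECREASE"]

def exploration_action(state):
    """Sample KEEP/INCREASE/DECREASE with probabilities 0.8/0.1/0.1; INCREASE is
    disabled once used virtual memory exceeds 2GB."""
    action = random.choices(ACTIONS, weights=[0.8, 0.1, 0.1])[0]
    if action == "INCREASE" and state["used_virtual_memory_gb"] > 2.0:
        action = "KEEP"   # INCREASE disabled; fall back to keeping the capacity
    return action
```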

5.2 Results

5.2.1 C-based Workloads

Tables 2 through 5 summarize the results for the three algorithms in comparison to the iDCP policy results (denoted by "Reference"). The test ensemble consists of four workload units with 1000 jobs each, simulated using the C-based job simulator. We used four different criteria to evaluate the algorithms' performance, each representing a different aspect of the system requirements. The first criterion is duration, defined as the time required by the algorithm for successfully running all the jobs in the unit. This criterion is inversely related to throughput. The second criterion is slowdown, defined as the average ratio between the time it took the algorithm to run a job, and the time it takes to run that job alone on a dedicated machine. Another criterion is the average reward gained by the algorithm. The last criterion, number of resubmissions, is the number of times a job was prematurely terminated (due to a DECREASE action) and had to be re-executed.

Table 2 shows that all algorithms achieve improvements of 15-20% in throughput with respect to the standard iDCP policy. This improvement is gained by running more than one job per core, thus utilizing the machine's resources more efficiently. The algorithms learn to identify states with poor utilization of machine resources, and to respond by accepting jobs. Consequently, the average slowdown increases, as can be seen in Table 3. Table 4 shows the number of times each of the algorithms had to reduce the capacity of a server by terminating the execution of a job prematurely. The goal of the learning algorithms was to find the best tradeoff between slowdown and throughput as reflected by the reward function (see Definition 3 in section 3.2.3). In Table 5 the average reward is measured. It is clearly seen that in almost all cases, the learning algorithms were able to find a better tradeoff than the hand-tuned policy (the reference). Iterative Q-fitting uses the largest number of resubmissions, while TD-λ and HMM-LP are less lethal (Table 4). No jobs are resubmitted in the iDCP policy.

Comparison between the different algorithms reveals that each of them puts an emphasis on a different aspect of grid utilization. Iterative Q-fitting achieves higher reward scores than TD-λ and HMM-LP. Its advantage is attained by running fewer concurrent jobs, and frequent resubmit operations are required to achieve this goal. The two other algorithms focus on minimizing duration times, and are more careful in performing resubmissions.
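For concreteness, the four criteria can be computed from per-job records roughly as follows; the record field names are assumptions, and the dedicated-machine running time of each job is taken as given.

```python
def workload_metrics(jobs, rewards, n_resubmits):
    """Sketch of the duration, slowdown, average reward, and resubmission criteria.
    `jobs` is a list of per-job records; `rewards` is the sequence of interval
    rewards collected by the controller over the workload run."""
    duration = max(j["finish_time"] for j in jobs) - min(j["submit_time"] for j in jobs)
    slowdowns = [j["run_time"] / j["dedicated_run_time"] for j in jobs]
    return {
        "duration": duration,                              # inversely related to throughput
        "avg_slowdown": sum(slowdowns) / len(slowdowns),
        "avg_reward": sum(rewards) / len(rewards),
        "resubmissions": n_resubmits,
    }
```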

Table 2: Total duration. The time it took each of the algorithms to complete each of the tested workloads (less is better).

              HMM-LP        Iterative Q-fitting   TD-λ          Reference
Workload 1    2h:27m:00s    2h:32m:29s            2h:25m:20s    3h:02m:50s
Workload 2    22h:00m:03s   22h:29m:01s           21h:58m:38s   27h:06m:21s
Workload 3    6h:48m:55s    6h:53m:33s            6h:41m:22s    8h:07m:07s
Workload 4    13h:45m:38s   13h:26m:32s           13h:22m:32s   15h:31m:07s
Total         45h:01m:36s   45h:21m:35s           44h:27m:52s   53h:47m:25s

Table 3: Slowdown. The average ratio between the running time of a job using the denoted policy, and the running time of the same job using a dedicated server (less is better).

              HMM-LP   Iterative Q-fitting   TD-λ     Reference
Workload 1    1.727    1.388                 1.631    1.112
Workload 2    2.071    1.748                 2.417    1.167
Workload 3    1.677    1.409                 1.720    1.113
Workload 4    2.362    1.698                 2.373    1.188

Table 4: Number of resubmit events. The number of times a job was resubmitted to be re-executed (less is better).

              HMM-LP   Iterative Q-fitting   TD-λ   Reference
Workload 1    0        58                    0      0
Workload 2    23       70                    9      0
Workload 3    21       51                    17     0
Workload 4    39       97                    31     0

Table 5: Average reward. The average reward of each of the algorithms on each of the workloads (more is better).

              HMM-LP   Iterative Q-fitting   TD-λ     Reference
Workload 1    0.598    0.634                 0.607    0.578
Workload 2    0.640    0.668                 0.624    0.613
Workload 3    0.635    0.664                 0.626    0.613
Workload 4    0.627    0.6712                0.624    0.642

Table 6: Java workload durations. The time it took each of the algorithms to complete the different workloads.

              HMM-LP        Iterative Q-fitting   TD-λ          Reference
Workload 5    2h:19m:45s    2h:20m:13s            2h:05m:34s    2h:21m:22s
Workload 6    9h:07m:59s    8h:27m:53s            8h:32m:51s    9h:09m:09s
Workload 7    38h:27m:28s   35h:46m:05s           40h:54m:35s   152h:07m:34s
Workload 8    18h:43m:16s   17h:43m:08s           19h:53m:08s   28h:16m:46s
Total         68h:38m:28s   64h:17m:19s           71h:26m:08s   191h:54m:51s

Table 7: Multi-core durations. The time it took each of the algorithms to complete the C-based workloads on multi-core machines.

              HMM-LP        Iterative Q-fitting   TD-λ          Reference
Workload 1    0h:57m:53s    1h:13m:03s            0h:57m:29s    1h:11m:10s
Workload 2    6h:06m:54s    6h:09m:03s            6h:06m:04s    9h:49m:33s
Workload 3    2h:23m:06s    2h:40m:03s            2h:21m:38s    3h:15m:57s
Workload 4    3h:50m:48s    3h:51m:07s            3h:49m:38s    5h:15m:55s
Total         13h:18m:41s   13h:53m:16s           13h:14m:49s   19h:32m:35s

5.2.2 Java-Based Workloads

In a second experiment we used a Java-based simulator to generate the workloads. While the statistics of the jobs in terms of memory and CPU usage are similar to the ones in the C-based simulator, there is a significant difference in the ways C and Java address memory (mainly due to the garbage collection mechanism in Java). This difference makes a significant impact on the way jobs interact when running on the same machine. To enhance this effect, we doubled the memory consumption of some of the jobs. Again, we compared the learned policies to the iDCP standard policy using four workloads. The results are presented in Table 6.

Outstanding improvements in throughput were obtained on Workloads 7 and 8 (in some cases a four-fold speedup). The substantial superiority of our algorithms in comparison to the iDCP policy can be explained by swap effects. These workloads are enriched with jobs with high memory consumption. When two such jobs run together, as dictated by the iDCP policy, swap events occur, causing severe slowdown of running times. The best policy in such cases is to run a single job per server, regardless of the number of its cores. The three algorithms learn to adopt this policy when necessary, thus avoiding the swap traps which the static iDCP policy is compelled to enter. In addition, fitted Q iteration and TD-λ achieve solid improvements of 8-12% in throughput for the lighter Workloads 5 and 6 as well. This improvement is a result of running more than one job per core when jobs have relatively low CPU and memory usage. The improvement in all workload units reflects successful adjustment of the learned policies to different types of job distributions.

The highest improvement in duration times is gained by Iterative Q-fitting, which outperforms the other algorithms in three out of the four workload units. The policy learned by TD-λ is rather liberal, namely it accepts jobs more easily than the other policies, which is beneficial in light workload units but may be problematic in heavier ones. This explains the deterioration in TD-λ performance in the heavier job units in comparison to the light ones. Conversely, HMM-LP is rather conservative, so its performance on the heavy units is much better than on the light ones. Iterative Q-fitting has a more balanced policy, which allows it to perform well at all load scales.

5.2.3 Results for the Multi-core Experiment

The upgraded resources of multi-core machines in comparison to their single-core counterparts extend the state space of multi-core grids, and allow a larger manipulation region for the learning algorithms. As a result, a fertile ground is created for our algorithms to demonstrate their adaptive skills, which are expected to yield even larger performance improvements with respect to the results obtained for the single-core grid. Applying TD-λ, HMM-LP, and Iterative Q-fitting to servers with two dual-core CPUs and 16GB of RAM, all three algorithms attained impressive improvements in comparison to iDCP performance. As expected, the performance enhancements are even more significant than those obtained for the single-core machines. The algorithms achieved an average of ~40% improvement in throughput, as shown in Table 7. Again, minor superiority in shortening duration times was observed for HMM-LP and TD-λ in comparison with Iterative Q-fitting, while the latter has a slight advantage in slowdown.

Figure 2: The role of α. The graph shows how the slowdown (on the X-axis) and the workload duration (on the Y-axis) change as α varies between 1.0 and 4.0. Results were obtained using the Iterative Q-fitting algorithm.

5.2.4 The role of α

As defined in section 3.2.3, the WSM capacity tuning problem involves optimization of two conflicting quantities: throughput (i.e., workload completion duration) and average slowdown. In our formulation, the trade-off between these two goals is determined using the reward parameter α. Theoretically, one can consider the optimal tradeoff curve between workload duration and average slowdown. This monotonic curve is a Pareto-optimality graph, where every (duration, slowdown) point above the curve is achievable by some policy, and every point below the curve is not achievable. The following lemma states a basic property of the achievable region:

Lemma 1. The achievable region in the optimal duration-slowdown curve is convex.

Proof Sketch: Let (d_1, l_1) and (d_2, l_2) be two achievable duration-slowdown points with policies p_1 and p_2 respectively. A convex combination of the points, (βd_1 + (1 − β)d_2, βl_1 + (1 − β)l_2), is achievable by time-sharing between the policies p_1 and p_2 with proportions (β, 1 − β). Similarly, a convex combination of the two points is achievable for a large pool with N servers by running βN servers with policy p_1 and (1 − β)N servers with policy p_2.

In Figure 2, we plot the empirical duration-slowdown curve obtained for the Iterative Q-fitting algorithm. This plot was produced by training the Iterative Q-fitting algorithm with different values of α, and measuring the slowdown and duration of the trained policies over a single C-based workload. It is clearly seen that increasing α reduces the slowdown, but increases the total duration of processing the workload. α thus provides an elegant way for the grid administrator to control the grid performance according to preferred business goals.

While Figure 2 shows the trade-off between throughput and slowdown, there is no real symmetry between the two, since increased throughput is a natural business goal while reduced slowdown is not. Instead, job turnaround time, i.e., the time between job submission and job completion, is the natural goal in some applications. The turnaround time includes both the time a job waits for execution and the actual execution time, and only the latter is proportional to job slowdown. In contrast, the waiting time is inversely related to throughput. Turnaround time therefore depends on both throughput and slowdown, with the exact tradeoff between them depending on the system load. If the system is loaded and waiting time is dominant, increased throughput is the key for reduced turnaround. For systems with low load, job slowdown is more important.

6. RELATED WORK

Performance optimization in grid environments presents the challenge of a large-scale resource allocation problem, to be solved in a dynamic system with limitations on the information accuracy. Several types of learning and adaptation techniques have been suggested in this context. A useful categorization of such techniques is according to the location of the learning service in the grid architecture—is it a user-side service, a resource-side service, or a central control service?

In grids with a hierarchical structure, resource allocation is determined in central nodes, and learning services operating in these nodes can optimize it. In [13] a reinforcement learning mechanism is suggested, which learns how to balance task load between available machines in the context of a specific molecular dynamics application. Learning central resource allocation is studied in a wider context in [19, 20, 21]. In these papers a two-stage hierarchy is suggested. In the lower hierarchy, machines are grouped and managed by "application manager" services. Resources are distributed between the application managers by a central "resource arbiter". Reinforcement learning services in the application managers and the resource arbiter allow optimization of the resource allocation by the resource arbiter.

In a distributed grid, which lacks central control, resource allocation can be improved by introducing machine learning tools on the user side or the workstation (resource) side. In [9, 16] a user-side agent is suggested, learning which machines should be asked to execute the user's task. Specifically, it is shown in [16] that the joint activity of such agents, which do not transfer information between themselves, can lead to effective and adaptive load balancing. A different kind of user-side agent, proposed by [23], tries to predict the memory requirements of the user's jobs. In this way excessive requirements, which lead to poor utilization of the grid, are diminished.

Learning on the workstation side has been widely considered in the distributed computer systems community [10, 14, 15]. In these papers, each machine runs an adaptive decision-making service, which decides whether submitted tasks should be accepted to the machine queue, or sent to other machines. Load balancing is achieved through the joint operation of these agents, with limited inter-agent communication.

Most of the literature mentioned above (with the exception of [23]) employs machine learning techniques in order to solve the load balancing problem. The capacity tuning problem we consider in this paper is very different from traditional load balancing. In input-output terms, capacity tuning accepts information regarding the local machine state—CPU and memory utilization, load, system time, etc.—and makes decisions regarding the number of jobs the machine should run concurrently. In the load balancing problem the input is information regarding the length of the waiting job queues in other resources, and the decisions concern the number of jobs sent to the machine's queue. The two problems improve system efficiency at different stages in the process of machine dispatch. The load balancing problem is the problem of efficiently transferring jobs from users to machine queues. Capacity tuning controls the next stage, of moving from the machine queue to actual execution.

Viewed more abstractly, there is an important structural difference between load balancing and capacity tuning. In load balancing, there are multiple resources, and a constant number of jobs has to be split between them. This is an inherently global problem, in the sense that information regarding multiple workstations is required to make an informed decision. In capacity tuning, there is a single resource and the number of jobs to process is the variable to be tuned. This problem can be naturally addressed using local machine information alone.

7. DISCUSSION

Distributed computers operate in a dynamic environment, where agility is essential for the efficient functioning of the system. This is usually accomplished by tuning numerous parameters, which requires the operators of these systems to keep track of changes in the environment and to tune these parameters accordingly. In this paper, we present an alternative approach. We argue that machine learning based techniques can be used to monitor the environment and tune the system, allowing operators of distributed systems to focus on service-level tradeoffs and releasing them from the burden of tweaking low-level parameters.

Specifically, in this paper we present a capacity tuning mechanism that manages the number of simultaneously running jobs on a server. We suggest three different machine learning algorithms to control this parameter and show the superior performance of these auto-tuned servers over hand-tuned servers. It is interesting to note that the performance gap increases when multi-core machines are being used. This implies that the need for such a mechanism will increase as more and more cores are packed into a single CPU.

Comparing each of the three aforementioned algorithms to the standard hand-tuned policy reveals that all algorithms are capable of considerably improving performance over the reference. The performance differences between the algorithms suggest that batch-based learning approaches, such as the ones used in the iterative Q-fitting algorithm, achieve better results. However, online algorithms such as TD(λ) are easier to apply from the engineering aspect. Therefore, there is no clear "winner". We are currently working on a practical, large-scale implementation of the capacity tuning system for real working pools of iDCP, based on the iterative Q-fitting algorithm.

Our suggested solution to the capacity tuning problem is essentially distributed, and hence scalable. The system is composed of a controller process, running on each server and making the capacity tuning decisions, and a learner process which is activated periodically to update the policy for all servers. The communication load of the system is small, since the controllers are autonomous, and only the learner makes use of the network (in order to acquire data and dispatch policies).

In this paper we assume that the capacity controller is ignorant of any information regarding the jobs waiting in the queue, and its only input is the local machine state. This poses capacity tuning as a local problem, with minimal communication between workstations and other components of the system. While this formulation is relatively clean and simple, it is clearly sub-optimal from a wider perspective. Given reliable information regarding the expected resource consumption of the next job in the queue, the capacity tuning agent could make better choices of its actions. This is a promising research direction, but it requires a solution to the difficult problem of estimating the resource consumption of a job. Human estimates of expected resource consumption are notoriously unreliable [23], and the research regarding automatic prediction is in its early stages.

8. REFERENCES

[1] S. Albers, L. Favrholdt, and O. Giel. On paging with locality of reference. In Annual ACM Symposium on Theory of Computing, volume 34, pages 258-267, 2002.
[2] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In National Conference on Artificial Intelligence, pages 183-188, 1992.
[3] P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5):323-333, 1968.
[4] D. Ernst, P. Geurts, and L. Wehenkel. Iteratively extending time horizon reinforcement learning. In European Conference on Machine Learning (ECML), pages 96-107, 2003.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503-556, 2005.
[6] E. A. Feinberg and A. Shwartz, editors. Handbook of Markov Decision Processes: Methods and Applications. Kluwer, 2002.
[7] D. G. Feitelson. The supercomputer industry in light of the top500 data. Computing in Science and Engineering, 7(1), 2005.
[8] I. Foster. What is the grid? A three point checklist. In GRIDToday, 2002.
[9] A. Galstyan, K. Czajkowski, and K. Lerman. Resource allocation in the grid using reinforcement learning. In International Joint Conference on Autonomous Agents and Multiagent Systems, volume 3, pages 1314-1315, 2004.
[10] A. Glockner and J. Pasquale. Coadaptive behaviour in a simple distributed job scheduling system. IEEE Transactions on Systems, Man and Cybernetics, 23(3):902-907, 1993.
[11] A. R. Karlin, S. J. Phillips, and P. Raghavan. Markov paging. SIAM J. Comput., 30(3):906-922, 2000.
[12] M. Litzkow, M. Livny, and M. Mutka. Condor - a hunter of idle workstations. In Int. Conference on Distributed Computing Systems, June 1988.
[13] S. M. Majercik and M. L. Littman. Reinforcement learning for selfish load balancing in a distributed memory environment. In International Conference on Information Sciences, pages 262-265, 1997.
[14] R. Mirchandaney, D. Towsley, and J. A. Stankovic. Analysis of the effects of delays on load sharing. IEEE Trans. Comput., 38(11):1513-1525, 1989.
[15] S. Pulidas, D. Towsley, and J. Stankovic. Imbedding gradient estimators in load balancing algorithms. In Int. Conference on Distributed Computing Systems, pages 482-489, 1988.
[16] A. Schaerf, Y. Shoham, and M. Tennenholtz. Adaptive load balancing: A study in multi-agent learning. pages 475-500, 1995.
[17] A. Snavely and J. Kepner. 99% utilization—is 99% utilization of a supercomputer a good thing? In ACM/IEEE Conference on Supercomputing, page 37, New York, NY, USA, 2006. ACM Press.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[19] G. Tesauro. Online resource allocation using decompositional reinforcement learning. In AAAI, pages 886-891, 2005.
[20] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani. Improvement of systems management policies using hybrid reinforcement learning. In European Conference on Machine Learning (ECML), pages 783-791, 2006.
[21] W. E. Walsh, G. Tesauro, J. O. Kephart, and R. Das. Utility functions in autonomic systems. In Int. Conference on Autonomic Computing, pages 70-77, 2004.
[22] L. R. Welch. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53(4), December 2003.
[23] E. Yom-Tov and Y. Aridor. Improving resource matching through estimation of actual job requirements. In IEEE Int. Symposium on High Performance Distributed Computing (HPDC), pages 367-368, 2006.
