A Flexible Approach to Efficient Resource Sharing in Virtualized Environments

Hui Wang, Peter Varman
Rice University, USA
[email protected], [email protected]

Abstract

Cloud-based storage and computing are growing in popularity due to the economies of using a centralized infrastructure that can be leased at low cost. The increased use of virtual-machine-based server consolidation in such data centers introduces new challenges for resource management, capacity provisioning, and guaranteeing application performance. The bursty nature of workloads means that the peak capacity required to handle short-duration bursts may be an order of magnitude or more above the long-term average requirement. Provisioning capacity for the peak rate can result in significant over-provisioning and low utilization, leading to higher infrastructure and energy costs. In this paper we present an efficient method for multiplexing multiple concurrent bursty workloads on a shared storage server. Such a situation is common in Virtual Machine (VM) environments where the hypervisor needs to dynamically allocate IO bandwidth among multiple VMs sharing a SAN array. We consider the problem of dynamically scheduling the IO among competing VMs to meet QoS requirements. Our solution employs two strategies together: systematically decomposing bursts to provide each workload with a graduated Quality of Service (QoS), and flexibly scheduling the decomposed portions of all the workloads.

Categories and Subject Descriptors: D.4.2 [Operating Systems] [Storage Management]: Secondary storage

General Terms: Algorithms, Design, Management, Performance

Keywords: VM, QoS, Storage performance virtualization, Performance isolation, Resource allocation

1. Introduction

Server and storage consolidation using virtualization technologies is becoming increasingly common in data centers, due to the economies of shared infrastructure and the benefits of centralized management. This, in turn, is driving the development of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CF’11, May 3–5, 2011, Ischia, Italy. Copyright 2011 ACM 978-1-4503-0698-0/11/05 ...$10.00.

elastic QoS models that allow clients flexibility in choosing SLAs (Service Level Agreements) tailored to their workload characteristics and performance requirements, while allowing the service provider to optimize provisioning and scheduling decisions. This paper deals with the issue of providing flexible QoS performance guarantees to VM clients sharing a storage service, with significantly smaller capacity provisioning requirements than standard approaches. Typical performance SLAs provide clients with minimum throughput guarantees [8, 16] (IOPS) or response time bounds [11, 19] for rate-controlled clients. The data center needs to provision enough resources for each client application to satisfy its SLAs, and schedule client requests appropriately in order to meet response time and bandwidth requirements. Several challenges need to be overcome to share the server satisfactorily. First, storage workloads tend to be very bursty [17, 18, 7]: the instantaneous arrival rate during some time intervals is significantly higher than the long-term request arrival rate. A large amount of capacity needs to be provisioned in order to meet the SLA requirements when a burst occurs; at other times the capacity is significantly underutilized, resulting in low server efficiency, as has been confirmed by measurements in actual data centers [3]. Furthermore, run-time capacity management is only partially successful since the workload is unpredictable and can change drastically over small time intervals; for storage devices that incur appreciable latency in transitioning between power states and whose idle power consumption is significant, the operational and energy costs of over-provisioning can be very high. Secondly, excess traffic from VM clients that exceed their SLA stipulations needs to be handled carefully by the hypervisor.
Unlike the case of communication protocols, dropping requests when the system is oversubscribed is not a viable option for storage systems, since storage IO protocols do not generally support automatic retransmission mechanisms. In addition, strict throttling policies often result in underutilized resources and are unsuitable for open workloads. In this paper we describe a framework for efficiently and flexibly sharing a storage server among multiple concurrent clients. Our approach combines two orthogonal techniques to significantly reduce the capacity requirements of the server: (i) workload decomposition to provide graduated QoS guarantees to clients, and (ii) scheduling algorithms that flexibly multiplex server capacity between the fragments of the decomposed workloads to meet the SLAs. We begin by identifying three properties (P1 through P3) that such QoS schedulers should satisfy.

• P1: Inter-client Isolation (Do not harm others) – This is the fundamental requirement of performance isolation: VM clients must be insulated from the behavior of other concurrently executing VM clients sharing the resource. Specifically, the system should encapsulate a VM client so that any bad behavior on its part is not allowed to adversely affect other well-behaved VM clients. If a client exceeds its stipulated SLA and sends more requests in a time interval than allowed by its contractual agreement, it should not be allowed to garner additional service at the expense of other well-behaved VM clients.

• P2: Intra-client Isolation (Do not harm oneself) – This is simply the performance isolation requirement applied to a single client. It asserts that well-behaved and ill-behaved portions of a workload should also be isolated from each other, so that the effects of bad behavior are temporally localized. In other words, if the client exceeds its SLA during some time interval, it may be penalized during this period, but once its behavior becomes compliant it should start receiving its SLA guarantees again.

• P3: Capacity Requirements – The service provider should offer a rich set of SLA specifications with different cost and performance QoS guarantees to satisfy a diverse set of client needs. Scheduling algorithms should minimize the capacity required to meet the set of SLAs and support accurate estimation of capacity requirements.

The solution in this paper is compared against two well-known schedulers, Proportional Share (PS) and pClock, as summarized in Table 1. PS allocates clients weighted bandwidth at a fine granularity, so that the difference between the (normalized) allocations of any two backlogged clients is always bounded by 1. In contrast, pClock and our scheduler are designed to handle short-term bursts of requests by allowing temporary unfairness in the bandwidth allocations; these two schedulers therefore provide latency control for workloads independent of their throughput requirements, unlike PS, where response time and throughput are coupled. The issues related to workload isolation, bad-region isolation, and capacity requirements are discussed in Section 2.

2. Overview

Figure 1. Scheduling Framework for Sharing Server (n clients, each with an SLAi = (σi, ρi, δi), feed a scheduler in front of the shared server)

Figure 1 shows a generic scheduling framework for sharing a server among multiple VM clients. There are n VMs VMi, i = 1, 2, . . . , n, whose requests are routed to the hypervisor [12], which must schedule them to meet performance QoS goals. Each VM has an SLA that specifies the performance it will receive provided its input traffic satisfies stipulated restrictions on burst and arrival rates. One of the most popular constraints on traffic arrivals is the token bucket specification, which asserts that for any time interval of length T the number of requests sent by the client should be no more than σ + ρT, where (σ, ρ) are the token bucket parameters [20]. A usual implementation assumes a reservoir (the bucket) initially filled with σ tokens that is fed fresh tokens at a constant rate ρ; the number of tokens in the reservoir is capped at σ. Whenever a new request arrives it removes a token from the bucket if the bucket is not empty. As long as every request finds a token when it arrives, the traffic meets its constraint and is considered well behaved. A request that arrives when there is no token in the bucket is said to be a bad request, and the client is said to be ill behaved. Figure 2 shows an example of an upper bound (heavy line) on the arrival traffic (dashed line) induced by a (σ, ρ) token bucket. The client is well behaved in the interval [0, a), since all arrivals lie below the upper bound, but ill behaved between times a and b, where the arrivals exceed the upper bound constraint. In traditional QoS models [11, 19], the SLA guarantees the requests of client i a response time limit of δi provided its arrival traffic is within the stipulated (σi, ρi) token bucket arrival constraint.

Figure 2. Upper Bound on Arrivals imposed by the Token Bucket

Inter-Client Isolation: A simple approach to isolation is to police the traffic of each VM and simply drop the requests that exceed the upper bound. These discarded bad requests would need to be submitted again later. Such an approach may be suitable in environments like computer networks, where the protocols automatically provide mechanisms for retransmission and recovery from dropped packet transmissions. However, storage protocols are not designed to handle lost requests; dropping requests from oversubscribed client VMs will result in a cascading series of undesirable events, possibly culminating in the eventual failure of the application or guest OS. The pClock algorithm [11] provides a second approach to this problem, based on delaying the requests of the ill-behaved client.
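The token bucket regulation described above is simple to state operationally. Below is a minimal sketch in Python (the class and method names are our own illustration, not the paper's implementation):

```python
class TokenBucket:
    """(sigma, rho) token bucket: marks each arrival good or bad."""

    def __init__(self, sigma, rho):
        self.sigma = sigma      # maximum burst (bucket depth)
        self.rho = rho          # long-term rate (tokens per second)
        self.tokens = sigma     # bucket starts full
        self.last = 0.0         # time of the last update

    def arrive(self, t):
        # Refill at rate rho, capped at sigma.
        self.tokens = min(self.sigma, self.tokens + (t - self.last) * self.rho)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return "good"       # arrival conforms to the (sigma, rho) bound
        return "bad"            # arrival exceeds the upper bound

tb = TokenBucket(sigma=2, rho=1.0)
# Three back-to-back arrivals at t=0: the first two consume the burst
# allowance, the third finds an empty bucket and is marked bad.
print([tb.arrive(0.0) for _ in range(3)])  # → ['good', 'good', 'bad']
```

For any interval of length T this enforces exactly the σ + ρT bound of Figure 2: at most σ stored tokens plus ρT freshly generated ones can be consumed.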

Table 1. Comparison of Scheduling Algorithms

                                     Proportional Share    pClock                This Paper
  Proportional bandwidth allocation  Yes                   Yes                   Yes
  Fairness                           Yes (fine grained)    Yes (coarse grained)  Yes (coarse grained)
  Latency Control                    No                    Yes                   Yes
  Workload Isolation (P1)            Yes                   Yes                   Yes
  Bad-Region Isolation (P2)          No                    No                    Yes
  Capacity Saving (P3)               No                    No                    Yes

A bad request arriving at time t is treated as if it had actually arrived at a later time t′ > t, such that at t′ the arrival would meet the upper bound. Figure 3 shows the arrival traffic of a hypothetical workload. In Figure 3(a) there are three bad requests arriving at time p. These are treated as if they had actually arrived at the later times p1, p2, and p3 respectively, as shown in Figure 3(b). Instead of being scheduled to finish by their true deadline p + δ, they are treated as if their deadlines were delayed to p1 + δ, p2 + δ and p3 + δ respectively. The delay makes the requests appear to satisfy the arrival specification and thereby protects other clients from being affected. However, the assigned deadlines for these bad requests are later than their true deadlines, and hence they are not guaranteed to meet the response time SLA.

Intra-Client Isolation: A drawback of delaying bad requests as described above is the cascading effect this can have on subsequent requests. Continuing with the previous example, suppose that the subsequent requests of the client after time p arrive at times r, s, t and so on, at a rate ρ = 1 as stipulated by the SLA (see Figure 3(c)). Because the three bad requests at p were delayed, the requests following them will also need to be delayed to remain within the SLA constraint; hence, these are treated as arriving at times r′, s′ and t′, with correspondingly delayed deadlines r′ + δ, s′ + δ and t′ + δ respectively. Potentially all future requests of this client could miss their deadlines because of the small overburst that occurred in the past. Hence, while the technique of delaying bad requests can insulate other clients from the ill-behaved one, it violates our second principle P2: it does not isolate the good portions of a client's workload from the effects of its own badly behaved parts.
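The delayed virtual arrival times assigned to bad requests can be sketched as follows (an illustrative fragment under our own naming; the numbers mirror Figure 3, where ρ = 1):

```python
def delayed_arrivals(p, k, rho, max_start):
    """Assign virtual arrival times to k bad requests arriving at time p.

    Each bad request is shifted 1/rho beyond the largest start tag so far,
    so that the stream appears to conform to the (sigma, rho) bound.
    max_start is the largest start tag currently in the queue.
    """
    times = []
    for _ in range(k):
        max_start += 1.0 / rho
        times.append(max(max_start, p))
    return times

# Three bad requests at time p=5 with rho=1 and largest start tag 5:
# they are treated as arriving at 6, 7 and 8 (p1, p2, p3 in Figure 3).
print(delayed_arrivals(p=5.0, k=3, rho=1.0, max_start=5.0))  # → [6.0, 7.0, 8.0]
```

The cascading effect is visible here: any later good request must queue behind the shifted start tags, which is exactly what property P2 forbids.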
If we remove the three bad requests arriving at time p from the input stream, the requests arriving at r, s, t are good and will be finished by their true deadlines r + δ, s + δ and t + δ, as shown in Figure 3(d). In our proposed approach, the removed requests are not discarded, but are instead rescheduled with a later deadline as part of a secondary request stream. The method described here addresses P2 by decomposing the workload into good and overflowing parts and independently scheduling them with differing QoS requirements. We refer to this as providing graduated QoS guarantees. Capacity Requirements: The server capacity required to meet QoS guarantees depends upon the capacity requirements of the individual VM clients and the scheduling policy. A typical performance SLA guarantees a client's requests a maximum response time of δ provided the client is "well-behaved", i.e., it conforms to the upper bound implied by a specified token bucket. Each client i estimates its capacity µi based on its maximum burst size σi, response time bound δi and average rate ρi. Completing the largest burst by its deadline implies a lower bound on capacity of σi/δi, while the long-term average arrival rate ρi implies another lower bound. Hence µi ≥ max{ρi, σi/δi}.
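The per-client capacity bound can be written directly (a small sketch of our own; units are requests per millisecond so the arithmetic stays exact):

```python
def min_capacity(sigma, rho, delta):
    """Lower bound on a single client's required capacity:
    the burst sigma must finish within the deadline delta, and the
    long-term rate rho must be sustained. Units: requests per ms."""
    return max(rho, sigma / delta)

# A client with burst sigma=50, rate rho=0.5 requests/ms (500 IOPS) and
# deadline delta=50 ms needs max(0.5, 50/50) = 1.0 requests/ms, i.e. 1000 IOPS.
print(min_capacity(sigma=50, rho=0.5, delta=50))  # → 1.0
```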

As mentioned earlier, storage workloads are very bursty, and guaranteeing a small response time to all requests requires very high capacity. However, by relaxing the tight response time bound for a small fraction of the workload there is a very sharp reduction in the capacity needed [14, 15]. In our approach we exploit this characteristic to offer a graduated QoS, where the client can specify a distribution of response times for different fractions of the workload. A typical SLA in an n-level graduated QoS model is as follows: the portion of the workload conforming to a (σ1, ρ1) token bucket is guaranteed a response time δ1; the portion satisfying a more relaxed constraint (σ2, ρ2) is guaranteed a response time δ2, δ2 > δ1; and so on, for up to n levels. By profiling the client's workload, the token bucket parameters for a desired fraction of requests in each response time category can be determined [14]. As an example, an SLA based on graduated QoS may require 90% of a conforming workload to have a response time limit of 10 ms, 95% a limit of 50 ms, 99.5% a limit of 150 ms, and the remaining requests to be done within 500 ms. As our experiments indicate, the capacity required to meet the graduated QoS bounds is significantly smaller than that required for 100% of the workload to meet the strict response time. The run-time system decomposes the workload in accordance with the n-level QoS model. In addition to using decomposition to limit individual capacity requirements, capacity can be reduced by using scheduling to exploit the heterogeneity in the QoS requirements of concurrent clients. We discuss two scheduling policies, Fair Queuing (FQ) and Earliest Deadline First (EDF), below. An FQ scheduler divides the available capacity among the n clients in a fine-grained manner in proportion to their weights. The system capacity is C_FQ = Σ_i µi. Client i is assigned a weight wi = µi/C_FQ, so that it receives at least capacity µi at fine-grained intervals during operation. In contrast to FQ, the EDF scheduling policy minimizes the capacity needed to meet a set of deadlines by exploiting the differences in the response time requirements of different clients. The pClock scheduler [11] uses EDF scheduling and always selects the request with the smallest (earliest) deadline. A simple example illustrates the potential benefit. Consider two clients that each send a burst of 50 requests every 100 ms. The first client requires a response time of 50 ms for its requests and the second 100 ms. The capacity needed for the first client is 50 requests / 50 ms = 1000 IOPS, while that for the second is 50 requests / 100 ms = 500 IOPS. A fair scheduler would use a server of 1500 IOPS and multiplex the two workloads in a 2:1 ratio; in the first 50 ms it would complete the 50 requests of client 1 and 25 requests of client 2, and in the next 50 ms it would do the remaining 25 requests of client 2. Both clients meet the deadlines for all their requests.

An EDF scheduler would change the order of service so that it does all 50 of client 1's requests first (since they have the smaller deadline), followed by the 50 requests of client 2. This requires a capacity of only 1000 IOPS to finish all requests by their deadlines. Because of this potential for reduced capacity we use an EDF-based scheduler in this paper. However, simple direct use of EDF will not work, in the sense that the isolation properties P1 and P2 can be violated if it is not applied correctly. Intuitively, this is because clients are much more closely coupled under EDF scheduling, so the desired flexibility clashes with the need for strict regulation. For instance, suppose client 1 misbehaved and sent 100 requests instead of 50. Since all 100 requests would have shorter deadlines than client 2's requests, they would all be served first in an EDF schedule, completing after 100 ms; all the requests of client 2 would then have missed their deadlines. In contrast, a fair scheduler would not delay any of client 2's requests past their deadlines, delaying only the requests of the offending client, client 1. The capacity estimate for an EDF scheduler is given by the following constraints (see [11] for details); the formulas assume that the clients have been ordered so that δi ≤ δi+1:

∀i : C_EDF × δi ≥ Σ_{j≤i} σj + Σ_{j≤i} ρj (δi − δj),    and    C_EDF ≥ Σ_i ρi.

The right-hand side of the i-th constraint is the maximum number of requests that may need to complete in an interval of length δi, based on the token bucket parameters (σj, ρj) of the clients j with δj ≤ δi. The total capacity C_EDF must be at least large enough to finish these requests to avoid missing a deadline.
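A sketch of this capacity estimate (our own code; to keep the example self-contained we model each burst from the two-client example above as a pure burst, σ = 50 with ρ = 0, and use milliseconds for deadlines):

```python
def edf_capacity(clients):
    """Estimate C_EDF for clients given as (sigma, rho, delta) tuples.

    With clients sorted by deadline delta, the requests that can demand
    service within delta_i total sum_{j<=i} sigma_j + rho_j*(delta_i - delta_j);
    the capacity must also cover the aggregate long-term rate sum_j rho_j.
    Units: requests per ms.
    """
    clients = sorted(clients, key=lambda c: c[2])    # order by deadline
    cap = sum(rho for _, rho, _ in clients)          # long-term rate bound
    for i, (_, _, d_i) in enumerate(clients):
        demand = sum(sigma + rho * (d_i - d_j)
                     for sigma, rho, d_j in clients[:i + 1])
        cap = max(cap, demand / d_i)
    return cap

# Two bursts of 50 requests with deadlines 50 ms and 100 ms: EDF needs
# only 1.0 requests/ms (1000 IOPS), matching the example in the text.
print(edf_capacity([(50, 0, 50), (50, 0, 100)]))  # → 1.0
```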

Figure 3. Effect of Bad Requests on Later Requests. The client has requirements (σ = 10, ρ = 1, δ = 100 ms). (a) Three bad requests arrive at p. (b) The bad requests are delayed to conform to the SLA, arriving at p1, p2 and p3. (c) The bad requests at p cause several following requests (r, s, t) to also miss their deadlines. (d) The later requests meet their deadlines if the requests at p are removed.

3. Scheduler Framework

The overall architecture of our system is shown in Figure 4.

Figure 4. Scheduler Framework. The requests of each VM pass through a request classifier into per-level queues (Q1, Q2, Q3), which feed the request scheduler in the hypervisor in front of the storage server.

The storage scheduler in the hypervisor is responsible for multiplexing the server among the different VMs so that the response times of individual requests in the VMs can be guaranteed. The request classifier decomposes the workload of VM i into several queues Qi,j, each of which provides a different response time guarantee δi,j. The queues are exposed to the scheduler, which uses an Earliest Deadline First (EDF) policy to select a request to dispatch to the server, among the requests that satisfy eligibility and violation-free criteria as discussed later in this section.
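The violation-driven transfer between queue levels, described in more detail below, can be sketched as follows (an illustrative fragment; the heap-of-(deadline, id) representation is our own assumption, not the paper's data structure):

```python
import heapq

def demote_expired(queues, deadlines, now):
    """Move requests that can no longer meet their level's deadline to the
    next, more relaxed level. 'queues' is a list of heaps of
    (deadline, request_id) pairs, one heap per level; deadlines[j] is the
    response time bound delta_j of level j.
    """
    for j in range(len(queues) - 1):
        while queues[j] and queues[j][0][0] < now:
            _, rid = heapq.heappop(queues[j])
            # Re-tag with the next level's relaxed deadline.
            heapq.heappush(queues[j + 1], (now + deadlines[j + 1], rid))
    return queues

qs = [[(0.9, "r1"), (2.0, "r2")], [], []]
heapq.heapify(qs[0])
demote_expired(qs, deadlines=[0.05, 0.5, 1.5], now=1.0)
# r1's level-0 deadline (0.9) has passed, so it is demoted to level 1
# with the new deadline now + 0.5 = 1.5; r2 stays at level 0.
print(qs[0], qs[1])  # → [(2.0, 'r2')] [(1.5, 'r1')]
```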

Figure 5. Queues maintained for each VM. Arriving requests pass through a chain of classifiers (σ0, ρ0), (σ1, ρ1), . . . , (σn, ρn); at each level an expiration check against δj determines whether a request remains at that level or moves to the next.

Each VM maintains a set of queues, Qi,0, Qi,1, . . . , Qi,n. Figure 4 depicts the case of three queue levels, and the queue organization is shown in Figure 5. A request is marked as good or bad based on the token bucket, and is inserted into the first-level queue Qi,0. Although a request may be marked as bad, it may still be able to finish by its real deadline if there is sufficient unused capacity in the system at the time (for instance, if other VM clients are currently underutilized or the request stream has high locality). The scheduler therefore avoids premature classification of a request based on static worst-case estimates of traffic patterns and server capacity, and instead defers classification to the last possible instant. If it later finds that the request cannot finish in time, the violation is a real violation; in this case the performance of subsequent requests will be affected unless the request is moved out of this stream. The request is then transferred to the next-level queue, where it will try to meet a more relaxed deadline.

Table 2. Symbols

  Symbol       Meaning
  Qi,j         Queue at level j of VM i
  σi,j, ρi,j   Token bucket parameters of Qi,j
  s^r_{i,j}    Start tag of request r in Qi,j
  f^r_{i,j}    Finish tag of request r in Qi,j
  MinS_{i,j}   Minimum start tag of pending requests in Qi,j
  MaxS_{i,j}   Maximum start tag of pending requests in Qi,j + 1/ρi,j
  Φ            Set of eligible requests for the system scheduler

Request Arrival: When the r-th request from VM i arrives at time t it is placed in the first-level queue Qi,0 and timestamped with three tags: a start tag s^r_i, a finish tag f^r_i and its real-time deadline d^r_i, which are set as follows: d^r_i = t + δi,0 is the latest time by which the request should complete to satisfy the response time bound δi,0 of queue Qi,0; f^r_i = s^r_i + δi,0; finally, s^r_i depends on whether the request is good or bad. First the classifier invokes TokenUpdate(), which updates the number of tokens at the current time; it then invokes RequestTagging(), which tags the request as described above. If there is at least one token in the

1. RequestArrival(request r, VM i, time t):
   begin
     TokenUpdate(Qi,0, t);
     /* Tag and insert the request in queue Qi,0 of VM i */
     Insert r into Qi,0;
     RequestTagging(r, Qi,0, t);
   end

2. Scheduler:
   begin
     If (no requests in the system) return;
     Let t be the current time;
     /* Flush requests that have missed their deadline */
     If a request in Qi,j has missed its deadline
       Remove it from Qi,j and insert it into Qi,j+1;
     /* Find the set of eligible requests Φ */
     Let Φ = {(i, j) : Qi,j has a request r with s^r_{i,j} ≤ t};
     If (Φ is empty) Sync(t);
     Select the request r in Φ with the smallest finish tag;
     Dispatch r to the server;
   end

Algorithm 1: Algorithm Structure

TokenUpdate(Qi,j, t):
begin
  /* The biggest burst allowed is bounded by σi,j */
  Let Δ be the time elapsed since Qi,j was last updated;
  tokens(i, j) += Δ × ρi,j;
  If (tokens(i, j) > σi,j) tokens(i, j) = σi,j;
end

RequestTagging(r, Qi,j, t):
begin
  If (tokens(i, j) >= 1)
    s^r_{i,j} = t;
  else
    MaxS_{i,j} += 1/ρi,j;
    s^r_{i,j} = max{MaxS_{i,j}, t};
  endif
  f^r_{i,j} = s^r_{i,j} + δi,j;
  tokens(i, j) -= 1;
end

Sync(time t):
begin
  /* Find the smallest start tag over all non-empty queues */
  min_s = min{MinS_{i,j} : Qi,j has a pending request};
  shift = min_s − t;
  /* Shift all tags by shift */
  For all non-empty Qi,j
    decrease s^r_{i,j} and f^r_{i,j} by shift;
end

Algorithm 2: Algorithm Components

bucket (the request is good), then s^r_i is set to the current time t, so that the finish tag f^r_i matches the real-time deadline d^r_i. Otherwise the request is bad and is delayed: s^r_i is set to the earliest time at which the request can be considered good, which is 1/ρi,0 beyond the largest start tag currently in the queue.

Scheduler: The system scheduler selects the request to dispatch to the server using an Earliest Deadline First (EDF) policy among the requests that satisfy eligibility and violation-free criteria. When the server becomes available, the scheduler first checks the requests at the queue heads for expiration: it checks whether the deadline of the request at the head of a queue would be violated even if it were scheduled immediately, by comparing the deadline tag d^r_i with the current time; if so, the request is moved from its current queue Qi,j to the next-level queue Qi,j+1. While moving the request, its start, finish and deadline tags are changed to match the QoS parameters of the new queue. Secondly, the scheduler checks for requests that are eligible for scheduling at the current time based on their start tags. To ensure that response times are not compromised, the scheduler must give priority to requests whose start tags lag the current real time over requests whose start tags are in the future, even if the latter have smaller finish tags; a request is said to be eligible at time t if its start tag is no more than t. The scheduler selects the request whose finish tag is smallest among all eligible requests (the set Φ). If there are no eligible requests at time t, the tags of the requests are adjusted so that the smallest start tag coincides with the current time t, while keeping the relative spacing among the tags unchanged: all start and finish tags are decreased by a fixed amount equal to the difference between the currently smallest start tag and the current time. This ensures that at least one request is eligible, without disturbing the relative order of the requests. This step is necessary to synchronize the tags with those of future requests from a currently inactive client, and avoids possible starvation. A complete description is provided in Algorithms 1 and 2.

Summary: We summarize the salient features of our method. First, the use of graduated QoS allows the system to provide very good QoS guarantees at a fraction of the capacity required to provide 100% guarantees. The use of an EDF scheduler allows the capacity requirements to be reduced further by exploiting the heterogeneity in the client QoS requirements; the scheduler thereby addresses property P3. Next, the use of decomposition rather than simple delay allows our scheduler to uphold property P2 by isolating the good and bad portions of an individual workload. The use of the back-end shaper allows the scheduler to provide the best response time to a request based on current capacity availability, rather than on a static classification set by the token bucket regulator. However, the back-end shaper needs to ensure that its opportunistic use of current excess capacity to serve clients with immediate deadlines does not end up hurting other clients with later deadlines when the spare capacity shrinks or disappears. The use of good and bad classification of requests within a single queue allows us to guarantee isolation property P1, while the use of violation-based transfer between queues allows us to maintain property P2. In the next section we provide experimental validation of specific features of our method using several real block-level storage traces, as well as of the benefits possible in practice.
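The tag synchronization step (Sync) can be sketched as follows (an illustrative fragment under our own naming):

```python
def sync(tags, now):
    """Shift all (start, finish) tag pairs so the smallest start tag
    coincides with the current time, preserving relative spacing.
    'tags' maps request ids to mutable [start, finish] lists."""
    if not tags:
        return tags
    shift = min(s for s, _ in tags.values()) - now
    for pair in tags.values():
        pair[0] -= shift
        pair[1] -= shift
    return tags

# No request is eligible at t=1.0 (all start tags lie in the future);
# shifting every tag back by 2.0 makes the earliest request eligible now.
tags = {"r1": [3.0, 3.5], "r2": [4.0, 4.5]}
sync(tags, now=1.0)
print(tags)  # → {'r1': [1.0, 1.5], 'r2': [2.0, 2.5]}
```

Because all tags move by the same amount, the relative order of pending requests, and hence the EDF dispatch order, is unchanged.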

4. Performance Evaluation

In this section we describe the results of an empirical evaluation of our scheduling method (referred to as Graduated QoS) using both the storage system simulation tool DiskSim [1] and a process-driven system simulator, Yacsim [13]. In the experiments we used three types of real block-level storage application traces from the UMass Storage Repository [2]. The experiments focus on the properties P1 to P3 detailed in Section 1: (i) Can we isolate badly-behaved workloads from good ones so that they do not affect the performance of the latter? (ii) Can we localize regions of bad behavior of a single workload so as to avoid affecting its well-behaved regions? (iii) Can we provide high quality of service with low provisioned capacity?

4.1 Workload Isolation

Workload isolation is a basic requirement in shared storage systems. In this experiment we explore how well our method, Graduated QoS, can isolate badly-behaved workloads from well-behaved ones. Our experiment uses two block-level workloads, W1 and W2. W1 is a WebSearch workload with a long-term average arrival rate of 330 IOPS; W2 is a Financial Transaction workload with a long-term average arrival rate of 120 IOPS. The arrival patterns are shown in Figure 6(a). By profiling the workloads, the token bucket parameters for W1 and W2 were set to (20, 330 IOPS) and (13, 120 IOPS) respectively. We compared three schedulers: Graduated QoS, pClock and WF2Q [4]. A system capacity of 670 IOPS is provisioned for the two traces. With this capacity, all three methods can guarantee that 100% of the requests of both workloads meet the deadline of 50 ms. Figures 6(b) and (c) show the performance of W1 and W2 when both workloads are well behaved, while Figure 7 shows the behavior when W1 violates its SLA. In the experiment W1 increases its instantaneous arrival rate to around 700 IOPS between times 600 and 700 seconds, as shown in Figure 7. From Figure 6(c) and Figure 7(c) we can see that the well-behaved workload W2 is isolated from the bad behavior of W1: the performance of W2 does not change when W1 sends more requests. Figures 6(b) and 7(b) show the performance of W1 without and with the violation. The performance of W1 is degraded because it sends more requests during 600-700 s. A notable fact in Figures 6(b) and 7(b) is that all three methods show a performance degradation for W1, but the degradation differs across the three cases. Our Graduated QoS method can still guarantee that 96.5% of the requests meet their deadline, while pClock and WF2Q are noticeably degraded, to 78.3%. Theoretically, pClock, which uses an EDF scheduler, should perform better than WF2Q: the EDF scheduler is able to use the deadline slack of W2 to reduce the response time of W1 without affecting the performance of W2, whereas the WF2Q scheduler strictly allocates capacity in proportion to the weights, so the excess capacity is used to decrease the response time of the well-behaved flow even below its required value rather than to reduce the penalty faced by W1. In this experiment, however, the deadline is set to 50 ms for both W1 and W2, so pClock cannot use the spare capacity from W2, and the performance of pClock and WF2Q is almost the same. In Section 4.3 we compare pClock and WF2Q with different deadline limits for the multiplexed workloads, and the results show the advantage of pClock over the WF2Q scheduler.

Figure 6. W1 and W2 are well-behaved workloads. Performance of W1 and W2 with three scheduling methods: Graduated QoS, pClock, WF2Q. (a) Arrival pattern for W1 and W2; (b) response time distribution of W1; (c) response time distribution of W2.

Figure 7. W1 violates its SLA but W2 is well behaved. W1 sends more requests between 600 and 700 seconds. W1 is penalized differently by the three schedulers, while W2 is always isolated. (a) Arrival pattern for W1 and W2; (b) response time distribution of W1; (c) response time distribution of W2.

outperforms the other two methods because of its ability to isolate bad regions, as we explain in detail next.
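The EDF policy underlying pClock can be sketched minimally. The class and request names below are illustrative assumptions, not the paper's implementation:

```python
import heapq

# Minimal EDF (earliest-deadline-first) dispatcher sketch: among all
# pending requests, the one with the earliest deadline is served next,
# so a flow with slack deadlines can donate capacity to a tighter one.

class EDFScheduler:
    def __init__(self):
        self._heap = []   # min-heap of (deadline, seq, request)
        self._seq = 0     # tie-breaker so equal deadlines stay FIFO

    def submit(self, request, deadline):
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def dispatch(self):
        """Serve and return the pending request with the earliest deadline."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

sched = EDFScheduler()
sched.submit("w2-request", deadline=100.0)  # relaxed 100 ms deadline
sched.submit("w1-request", deadline=50.0)   # tight 50 ms deadline
assert sched.dispatch() == "w1-request"     # tighter deadline goes first
```

When both flows carry the same 50 ms deadline, as in this experiment, the heap order degenerates to arrival order and EDF has no slack to exploit, which is consistent with pClock and WF2Q performing almost identically here.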

4.2 Bad-Region Isolation

Next we explore how our method isolates the bad region of a workload without affecting the good regions, and maximizes the number of requests that meet their deadline. We use the same workloads described in Section 4.1: W1 has an average arrival rate of 330 IOPS and W2 an average rate of 120 IOPS, and the deadlines for both are 50 ms. The server capacity of 670 IOPS is sufficient to guarantee the response times of both workloads under all three scheduling methods. In the experiment W2 is always well behaved, while W1 violates its SLA by sending requests at a rate of about 700 IOPS during the 600s-700s interval (Figure 7(a)). This corresponds to exceeding the stipulated arrivals by about 6% over the whole trace. As Figure 7(b) shows, Graduated QoS allows a much greater fraction of W1 (about 96.5%) to meet its deadline, compared to 78.3% for pClock and WF2Q. The measured response times during and after the badly behaved region are shown in Figures 8(a) and (b) for the Graduated QoS and pClock schedulers respectively. With Graduated QoS most of the requests during this interval still meet their deadline, only a few have longer response times, and requests after this region (after t = 700s) are unaffected. In contrast, pClock delays all the requests of W1 not only during the 600s-700s interval, but all the way to about 790s after the burst. This is because when the violation happens, Graduated QoS isolates the badly behaved requests by moving them out of the request stream, allowing well-behaved requests after the violation to meet their guaranteed deadlines. The performance of W2 is the same with or without the violation by W1; we do not show it because W2 is isolated from W1. We also measured W1 under the WF2Q scheduler; its response time is similar to that of pClock, because both pClock and WF2Q delay the violating requests, which in turn delays the later requests.
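The isolation step described above can be sketched as a decomposition pass over the arrival stream: requests exceeding the stipulated arrival bound (modeled here with a token bucket) receive a relaxed deadline instead of delaying the well-behaved requests behind them. The rates and deadline values are illustrative assumptions, not the paper's code:

```python
# Hedged sketch of bad-region isolation: SLA-conforming requests keep the
# tight deadline; violators are tagged with a relaxed one, so later
# well-behaved requests are not pushed past their guarantees.

def decompose(arrivals, rate, burst, tight=0.05, relaxed=0.5):
    """Tag each arrival time with a deadline; violators get `relaxed` (s)."""
    tokens, last, tagged = burst, 0.0, []
    for t in arrivals:
        tokens = min(burst, tokens + (t - last) * rate)  # refill bucket
        last = t
        if tokens >= 1.0:                  # within SLA: tight deadline
            tokens -= 1.0
            tagged.append((t, t + tight))
        else:                              # violation: moved to bad region
            tagged.append((t, t + relaxed))
    return tagged

# A burst of 4 back-to-back requests against a bucket of depth 2:
tags = decompose([0.0, 0.0, 0.0, 0.0], rate=100.0, burst=2.0)
assert [d for _, d in tags] == [0.05, 0.05, 0.5, 0.5]
```

Only the two requests beyond the burst allowance are pushed to the relaxed deadline; the stream behind them is untouched, which is the effect visible in Figure 8(a).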

4.3 Reduced Capacity Provisioning

We now explore the relationship between capacity provisioning and performance. The capacity required by different schedulers to achieve a given QoS is determined empirically by profiling. We find that the Graduated QoS scheduler provides significantly better performance at reduced capacity than the other schedulers. Our method reduces capacity using both decomposition and the EDF policy. The former reduces capacity by decomposing the

Figure 8. W1 violates its SLA and sends more requests from 600s to 700s: (a) response time of W1 with Graduated QoS; (b) response time of W1 with pClock. Graduated QoS isolates the bad region and delays only some of the requests of W1 between 600s and 700s, whereas pClock delays all of W1's requests from 600s up to about 790s.

workload and providing different response times for its badly behaved portions; EDF exploits the spare capacity arising from different deadlines for different clients. We first evaluate the performance of Graduated QoS for a single workload. In this experiment, five workloads are evaluated separately: W1 and W2 are WebSearch workloads with a long-term average arrival rate of 330 IOPS; W3 and W4 are Financial workloads with a long-term average arrival rate of 100 IOPS; and W5 is an Exchange Server workload with a long-term average arrival rate of 910 IOPS. Our performance goal is to have at least 90% of the requests meet a deadline of δ1, at least 95% meet a deadline of δ2, and all requests that satisfy the SLA face a maximum latency of δ3. Since pClock and WF2Q use a single-level QoS model, their performance goal is set to a deadline of δ1 for 100% of the workload. We compare the capacity requirements of Graduated QoS and single-level QoS for δ1 equal to 5 ms, 10 ms, 20 ms and 50 ms; the required capacities are shown in Figure 9. In all cases, Graduated QoS saves capacity significantly while still providing comparable performance.

For multiple workloads, we first multiplexed workloads of the same type: two Exchange workloads with deadlines of 50 ms and 100 ms respectively. We vary the server capacity from 2000 IOPS to 6000 IOPS and monitor the number of requests meeting their deadlines. Figure 10(a) shows the performance of the three schedulers. Graduated QoS provides better performance guarantees than both pClock and WF2Q: with a capacity of 2000 IOPS, our method guarantees 80% of the workload while pClock and WF2Q guarantee only 40% and 22% respectively. To achieve a 90% guarantee, our method requires about 2500 IOPS while pClock and WF2Q require about 3500 IOPS and 4000 IOPS respectively. The difference between pClock and WF2Q shows the benefit of EDF scheduling, while the gap between our method and WF2Q can be attributed to both decomposition and the EDF policy.
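The three-level goal above (at least 90% of requests within δ1, at least 95% within δ2, and all SLA-conforming requests within δ3) can be written as a simple check. The function and the sample latencies below are illustrative, not the paper's evaluation harness:

```python
# Hedged sketch of a graduated QoS spec check over measured latencies.
# Latencies and deltas are in seconds; thresholds follow the text above.

def meets_graduated_qos(latencies, delta1, delta2, delta3):
    """Check measured latencies against a three-level graduated spec."""
    n = len(latencies)
    within = lambda d: sum(l <= d for l in latencies) / n  # fraction met
    return (within(delta1) >= 0.90 and
            within(delta2) >= 0.95 and
            within(delta3) == 1.0)

# 92% fast, 4% medium, 4% slow-but-bounded: satisfies (90%, 95%, 100%).
lat = [0.004] * 92 + [0.008] * 4 + [0.030] * 4
assert meets_graduated_qos(lat, delta1=0.005, delta2=0.010, delta3=0.050)
```

A single-level scheduler such as pClock or WF2Q must instead provision for `within(delta1) == 1.0`, which is why its capacity requirement in Figure 9 is consistently higher.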

In the second experiment, we multiplex workloads of different types: two Exchange workloads with deadlines of 50 ms, two WebSearch (WS) workloads with deadlines of 100 ms, and one Financial Transaction (FinTran) workload with a deadline of 200 ms are run concurrently. Figure 10(b) shows performance similar to Figure 10(a) for each scheduler.

5. Related Work

A large body of work in QoS deals with the issue of proportional sharing. These methods are widely used in multiplexing fixed-capacity resources like network bandwidth, processor cycles, and memory. A large number of fair schedulers have been proposed to provide proportional sharing, e.g. WFQ [19, 5], WF2Q [4], Self-Clocked Fair Queueing [6], Start-Time Fair Queueing [8], and Fair Queueing [9]. The general idea is to emulate the behavior of an idealized (continuous) GPS [9] scheduler in a discrete system, dividing the resource at a fine granularity in proportion to client weights. These works do not explicitly address latency control or resource provisioning: latency handling is largely restricted to controlling jitter, i.e. the waiting time of a request arriving at an empty queue. In particular, a client cannot independently specify a response-time requirement and a throughput requirement, and the algorithms are not designed to provide any guarantee on request deadlines; resource planning is also outside the scope of these frameworks. The problem of independently guaranteeing both response time and throughput was addressed in [19, 5, 11]. Cruz [19, 5] provided latency control using a service-curve-based approach, while pClock [11] uses an arrival-curve-based method and avoids the starvation problem of the former solutions by using real-time tagging and synchronization. pClock uses a token-bucket model to control the arrival rate and also provides a capacity-planning method based on estimating the worst-case number of requests that would need to be scheduled in an interval by an EDF scheduler. However, a problem with pClock is that it does not isolate the

bad requests from good ones, which allows an unbounded number of requests following a bad request to be delayed. Because of this property, it is not robust to workload fluctuations, which can have more than a local effect on the QoS guarantees. Our method addresses this problem by providing Graduated QoS as part of the SLA, so that bad requests are automatically moved out of the stream and not allowed to delay good portions of the workload.

[Figure 9 panels: capacity requirement (IOPS) for workloads W1-W5 under Graduated QoS and Single-Level QoS, for deadlines of (a) 5 ms, (b) 10 ms, (c) 20 ms, (d) 50 ms]

Figure 9. Reduced Capacity Requirements for Different Deadlines using Graduated QoS

An empirical study of storage workloads showing the benefits of exempting a fraction of the workload from response-time bounds appeared in [14], and was used in the design of a slack-based two-level scheduler for a single client workload in [15]; however, the issue of sharing a server among multiple decomposed client workloads was not addressed. Variable system capacity was considered in [12], but the effects of reduced capacity on response-time guarantees were not.
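pClock's token-bucket arrival control mentioned above admits a compact sketch. The parameter names (rho for rate, sigma for burst depth) follow the usual token-bucket convention rather than the paper's notation, and the class is an assumed illustration, not pClock's implementation:

```python
# Minimal token-bucket sketch: a flow conforming to long-term rate rho
# and burst allowance sigma always finds a token; arrivals beyond that
# bound are flagged as SLA violations.

class TokenBucket:
    def __init__(self, rho, sigma):
        self.rho = rho        # token refill rate (tokens/sec)
        self.sigma = sigma    # bucket depth (maximum burst)
        self.tokens = sigma   # bucket starts full
        self.last = 0.0       # time of the previous arrival

    def conforms(self, t):
        """Return True if a request arriving at time t is within the SLA."""
        self.tokens = min(self.sigma, self.tokens + (t - self.last) * self.rho)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

tb = TokenBucket(rho=100.0, sigma=5.0)        # 100 IOPS, burst of 5
burst = [tb.conforms(0.0) for _ in range(8)]  # 8 simultaneous arrivals
assert burst == [True] * 5 + [False] * 3      # only the first 5 conform
```

A scheduler that merely delays the three non-conforming requests (as pClock does) also delays everything queued behind them; decomposition instead retags them, which is the distinction drawn above.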

6. Conclusions

The increasing popularity of storage and server consolidation introduces new challenges for resource management, capacity provisioning, and guaranteeing application performance. In this paper we presented a novel method for multiplexing multiple concurrent bursty workloads on a shared storage server. Our solution employs two strategies together: systematically decomposing bursts to provide each workload with a graduated Quality of Service (QoS), and efficiently scheduling the decomposed portions of all the workloads. The results show that it achieves isolation of different workloads from each other, isolation of the bursty portions of a single workload from its well-behaved portions, and significant capacity reductions for small relaxations of the QoS.

7. Acknowledgements

Support from the National Science Foundation (NSF) under grants CNS 0615376 and CNS 0917157 is gratefully acknowledged.

8. References

[1] The DiskSim simulation environment (version 3.0). http://www.pdl.cmu.edu/DiskSim/.

[Figure 10 panels: percentage of requests meeting their deadlines vs. server capacity (IOPS) for Graduated QoS, pClock, and WF2Q: (a) Multiplexing Similar Traces; (b) Multiplexing Dissimilar Traces]

Figure 10. Reduced Capacity Requirements using Graduated QoS

[2] Storage Performance Council (UMass Trace Repository), 2007. http://traces.cs.umass.edu/index.php/Storage.

[3] L. A. Barroso and U. Hölzle. The case for energy-proportional computing. Computer, 40:33–37, December 2007.

[4] J. C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In INFOCOM (1), pages 120–128, 1996.

[5] R. L. Cruz. Quality of service guarantees in virtual circuit switched networks. IEEE Journal on Selected Areas in Communications, 13(6):1048–1056, 1995.

[6] S. Golestani. A self-clocked fair queueing scheme for broadband applications. In INFOCOM '94, pages 636–646, April 1994.

[7] M. E. Gómez and V. Santonja. On the impact of workload burstiness on disk performance. In Workload Characterization of Emerging Computer Applications, 2001.

[8] P. Goyal, H. M. Vin, and H. Cheng. Start-time fair queueing: a scheduling algorithm for integrated services packet switching networks. IEEE/ACM Transactions on Networking, 5(5):690–704, 1997.

[9] A. G. Greenberg and N. Madras. How fair is fair queuing. Journal of the ACM, 39(3):568–598, 1992.

[10] A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional allocation of resources in distributed storage access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST '09), February 2009.

[11] A. Gulati, A. Merchant, and P. Varman. pClock: An arrival curve based approach for QoS in shared storage systems. In ACM SIGMETRICS, 2007.

[12] A. Gulati, A. Merchant, and P. Varman. mClock: Handling throughput variability for hypervisor IO scheduling. In USENIX OSDI, 2010.

[13] J. R. Jump. YACSIM reference manual. http://www.owlnet.rice.edu/ elec428/yacsim/yacsim.man.ps.

[14] L. Lu, K. Doshi, and P. Varman. Workload decomposition for QoS in hosted storage services. In MW4SOC, 2008.

[15] L. Lu, K. Doshi, and P. Varman. Graduated QoS by decomposing bursts: Don't let the tail wag your server. In 29th IEEE International Conference on Distributed Computing Systems, 2009.

[16] C. Lumb, A. Merchant, and G. Alvarez. Façade: Virtual storage devices with performance guarantees. In File and Storage Technologies (FAST '03), pages 131–144, March 2003.

[17] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron. Everest: Scaling down peak loads through I/O off-loading. In Proceedings of OSDI, 2008.

[18] A. Riska and E. Riedel. Long-range dependence at the disk drive level. In Proceedings of QEST, 2006.

[19] H. Sariowan, R. L. Cruz, and G. C. Polyzos. Scheduling for quality of service guarantees via service curves. In Proceedings of the International Conference on Computer Communications and Networks, pages 512–520, 1995.

[20] J. Turner. New directions in communications. IEEE Communications Magazine, 24(10):8–15, 1986.
