Frugal Storage for Cloud File Systems
Krishna P. N. Puttaswamy, Thyaga Nandagopal∗, Murali Kodialam
Bell Labs, Alcatel-Lucent, Murray Hill, NJ
∗ This work was done while at Bell Labs, Alcatel-Lucent. The author is currently affiliated with the National Science Foundation.
EuroSys'12, April 10–13, 2012, Bern, Switzerland. Copyright © 2012 ACM 978-1-4503-1223-3/12/04 $10.00.

Abstract
Enterprises are moving their IT infrastructure to cloud service providers with the goal of saving costs and simplifying management overhead. One of the critical services for any enterprise is its file system, where users require real-time access to files. Cloud service providers offer several storage building blocks, such as Amazon EBS or Azure Cache, each with a very different pricing structure based on storage, access, and bandwidth costs. Moving an entire file system to the cloud using such services is not cost-optimal if we rely on only one of them. In this paper, we propose FCFS, a storage solution that drastically reduces the cost of operating a file system in the cloud. Our solution integrates multiple storage services and dynamically adapts the storage volume sizes of each service to provide a cost-efficient solution with provable performance bounds. Using real-world large-scale data sets spanning a variety of workloads from an enterprise data center, we show that FCFS can reduce file storage and access costs in current cloud services by a factor of two or more, while allowing users to utilize the benefits of the various cloud storage services.

Categories and Subject Descriptors D.4.2 [Storage Management]: Storage hierarchies

General Terms

Algorithms, Design, Experimentation

Keywords Cloud computing, Storage, Storage cost, Caching

1. Introduction
Data center based cloud services have become the choice of enterprises and businesses to host their data, including mission-critical services such as application data and file systems. Enterprises are moving their internal IT services to
the cloud, in order to reduce their IT capital expenses as well as their network management overhead. While enterprise data can be stored in several forms, it is typically a large collection of files in a file system. Cost is the primary driver behind the migration to the cloud. Storage services in the cloud allow users to expand or contract their storage outlay dynamically, at the granularity of several megabytes to gigabytes and over very short time scales (hours). However, with the array of pricing options across the variety of storage services, it is often unclear which type of storage model is the right choice for a specific type of service. In Table 1 we summarize the pricing options of popular storage services based on storage and I/O accesses. For example, Amazon S3 provides low-cost storage¹ but charges more for accesses, while Amazon EBS provides low-cost access to files at a higher storage cost. Services such as Amazon ElastiCache and AzureCache² provide low-latency but high-cost memory. Complicating this matter is the multi-tier pricing adopted within each storage model by the different providers. For example, the per-GB costs within S3 and ElastiCache differ by a factor of 3 between the low- and high-cost tiers, based on the size of the storage volume.

Table 1. Cloud Storage Pricing as of October 1, 2011 [2–4].

                                           S3        EBS       ElastiCache
  Storage pricing (per GB-month)           $0.08     $0.10     $40
  Request pricing (per 1 million I/O requests)
    PUT requests                           $10.00    $0.10     $0
    GET requests                           $1.00     $0.10     $0
  Data transfer pricing
    per GB incoming                        $0.00     $0.00     $0
    per GB outgoing                        $0.12     $0.05     $0

¹ Storage pricing shown for S3 assumes a petabyte-scale file system.
² The AzureCache [5] has a different pricing model: each GB of storage costs $110 per month, and buying this storage provides a certain number of PUT/GET requests and a pre-determined amount of bandwidth for free. Higher usage migrates the user to the next storage tier pricing.

The choice of which combination of storage options to use in order to minimize the operational costs depends on the memory and access costs of the different storage options as well as the working sets of the different data sets. Depending on the nature of the application workload at any given
time, different storage services might be the optimal choice for hosting data. An I/O-intensive workload might prefer EBS, while a workload with few I/O accesses might prefer S3. Within a single file system, the workload pattern might require these different characteristics at various points over a long period of time. It is clearly evident that a single storage system may not be a cost-optimal solution for an application at all times. Our goal in this paper is to minimize the total cost of storing and accessing data in the cloud by effectively utilizing and integrating the various choices of cloud storage services.

1.1 Related Work
The problem is related to cache eviction problems and hierarchical storage management (HSM) [15], where data moves automatically between high-cost and low-cost storage media. The wide body of research on cache eviction policies such as Least Recently Used (LRU) or Adaptive Replacement Cache (ARC) [10] assumes that there is a fixed cache size and focuses on how to move data blocks in and out of the cache in order to minimize the overall cache miss rate. Our problem generalizes this to the case where the cache sizes themselves can change in response to demand patterns, and focuses on minimizing the cost. HSM systems are used to reduce data storage costs for enterprises, with the low-cost storage typically being tape drives and optical disks, while the high-cost storage involves hard disks and flash drives. Thus, the high-cost storage media act as a cache for the slower low-cost media. The decision to move a file between these storage media is based on how long the file has been inactive, typically of the order of months. There are three-stage HSMs involving Fiber Channel SANs, SATA HDD arrays, and tape drives, or using a combination of flash drives, SATA HDDs, and tape. Conventional HSM models assume that a strict hierarchy of storage options is given and that the sizes of the high-cost (and high-performance) tiers are fixed. Therefore, given a specific workload, the goal is to minimize overall file access latency. This is the common model, whether it is used for a content-agnostic RAID cluster [16], a content-specific file system [6], or for relational databases [7]. The problem addressed in this paper differs from the general problems addressed in the HSM literature in two important aspects:

• Unlike a standard hierarchical storage system, we do not assume that the amount of memory at different levels of the hierarchy is fixed. We assume that the amount of memory at each level can be expanded and contracted based on current needs. Our model fits a cloud storage service quite well, since elasticity of storage is a key selling point of cloud storage services.
• In a traditional HSM, memory is a sunk cost. Therefore, cost optimization is not the focus of the HSM literature. Since memory can be added and removed on very short time scales in a cloud storage system, it is possible to tailor the mix of memory based on the current working set requirements of the file systems. Our objective is to determine the mix of memory that minimizes the operational cost.

1.2 Contributions
These differences lead to a richer and more complex problem that deals with how files can be stored in order to minimize the overall cost. In this paper, we consider the problem of constructing a cost-effective file storage solution using the array of cloud storage services available. Our contributions are as follows.
1. We present a dynamic storage framework for cost-efficient file system storage in the cloud.
2. We present two schemes to determine how files can be moved between different storage systems dynamically, and derive tight performance bounds for the cost incurred by these schemes while accounting for the storage, I/O access, and bandwidth costs. Both schemes automatically adapt to the current file system requirements, yet are independent of the access patterns, in order to determine the most cost-effective way of serving the demands.
3. Using simulations based on real-life disk traces representative of a medium-size data center, we demonstrate that our algorithms can reduce overall costs by a factor of two or more, and are within 85% of the optimal costs.
We motivate our dynamic storage framework in Section 2. We describe the Frugal Cloud File System in Section 3, and present our algorithms in Section 4. In Section 5 we describe our experimental setup and the real-life traces used in our evaluation. Experimental evaluation results are shown in Section 6. We conclude in Section 7 with a discussion of our observations.

2. Dynamic Storage Management
In the cloud, storage resources can be purchased and discarded on fine-grained time scales, of the order of hours. Moreover, there is no strict hierarchy of storage tiers as in a traditional HSM model. From Table 1, we can see that, depending on the access patterns of the resource, the high-cost tier itself can vary. For example, for an I/O-intensive workload, Amazon EBS will be cost-optimal and hence will be the highest tier, while for a sparse I/O workload, Amazon S3 might be cost-optimal. Thus, unlike in a conventional HSM model, the preference among storage tiers may not be global and will be workload dependent. Let us consider, for example, Amazon S3 and Amazon EBS. The access costs in EBS are far lower than in S3, which leads one to think that, from a cost perspective, we could cache the working set of files in EBS and store all data in S3. In the models currently provided by Amazon [2], the entire file system is hosted on EBS with periodic incremental snapshots maintained in S3.

Figure 1. Cost with variable I/O. (x-axis: I/O scaling factor; y-axis: cost in $; schemes: Only S3, S3 + EBS, Smart S3 + EBS.)


Figure 2. Cost with variable storage. (x-axis: storage size in TB; y-axis: cost in $; schemes: Only S3, S3 + EBS, Smart S3 + EBS.)

The other alternative is to host the entire file system on S3 and serve all accesses from there. We tested three schemes: (a) Only S3, where the file system is in S3 and data is accessed directly from there; (b) S3 + EBS, the way file systems are mounted today; and (c) Smart S3 + EBS, which uses S3 for keeping all the data and retrieves only working-set files into a fixed-size EBS volume (instead of the whole image), updates them, and flushes the files back to S3 when done. For the Smart S3 + EBS scheme (referred to henceforth as the Smart scheme), the EBS volume size is kept fixed, as it is in current caching models. If the working set size exceeds the size of the allocated EBS volume, then, as in a cache, we use LRU as the block replacement policy. Using the CIFS stats from Leung et al. [9], we measured the monetary cost of running such a file system under the three schemes. There were 352 million I/Os, of which 6% were writes (and hence accounted under the PUT costs in S3); the rest were accounted under the GET costs in S3. The CPU cost and the bandwidth transfer costs between S3/EBS and the compute instance hosting the file server are the same in all the schemes, and hence are not shown in the plots. The first set of tests was with respect to the number of I/Os. We used the number of I/Os in the above traces as a baseline and scaled it up by up to a factor of 10, using the baseline access distribution. We set the EBS volume size for the Smart scheme to 10% of the file system size, and measured the total costs for all three schemes. The result is shown in Figure 1. When the workload is completely I/O dependent, S3 + EBS and the Smart scheme perform very

well. But, Smart is the winner overall since it keeps only a limited set of files inside EBS. Next, we varied the amount of storage required by the file system from the baseline of 1TB to 10TB, with the same number of I/Os, thereby modifying the ratio of storage to I/O. The cumulative costs are shown in Figure 2. For high storage-to-I/O ratios, the Smart scheme works remarkably well, as expected. Interestingly, at some point, S3 beats S3+EBS (at around 5 TB), and is also eventually better than the Smart scheme at around 70 TB. This suggests that when storage-to-I/O ratio is really high, it is better to run from S3 directly. This behavior is entirely expected, and suggests that the storage-to-I/O ratio is an important metric to consider while optimizing costs. We finally varied the working set size that is allocated in EBS for handling files that are accessed frequently in our Smart scheme, while keeping the storage at 1TB and using the same I/O as in the traces. The results are shown in Figure 3. One can see that the size of the working set directly correlates to the overall cost, and it is in our best interests to keep the working set in EBS as low as possible. The above tests point out the benefits of a tiered storage framework in the cloud. However, different file systems have varying amounts of working sets that depend on the workload and the time-of-day [12]. Reserving a fixed size such as the 10% share of the file system size, as we did above for our example using EBS, is not truly cost-effective, because the working set might be much higher or lower than this fixed threshold during any given time window.


Figure 3. Cost vs. working set size. (x-axis: EBS volume size as % of total disk size; y-axis: cost in $; schemes: Only S3, S3 + EBS, Smart S3 + EBS.)

2.1 Cost Tradeoff with Working Set Size
In order to see how the optimal working set threshold might vary for different file systems, we ran various file system workloads from Narayanan et al. [11, 12], and compared the total cost as a function of the size of the EBS volume in the Smart scheme, using the same experimental setup as before. Figure 4 shows the variation of the aggregate normalized costs versus the EBS volume size expressed as a percentage of the total file system size. The y-axis is normalized by the least value of the cost during the run for that particular trace. From the figure, we can clearly see how the costs vary with different working set sizes.


Figure 4. Cost for file systems with different EBS volume sizes. (Traces: MDS0, MDS1, PROJ3, PROJ4, HM0, HM1, SRC12; x-axis: EBS volume size as % of total disk size; y-axis: normalized total cost.)

In Table 2, we specify the value of the EBS volume size that minimizes the overall cost for each trace. We also show the corresponding costs on a unit-storage basis, since each of the traces has a different file system size. Notice that the minimum-cost point differs across traces. The minimum cost is achieved when the EBS volume size is near 30% of the total file system size for some traces, while for others it is as small as 0.25%. Another key observation from this table is that the minimum costs shown here (on a per-TB basis) do not have a linear relationship with the size of the EBS volume. In other words, a quadrupling of the ideal EBS size across traces does not lead to a quadrupling of the operations cost. In fact, there is no clear relationship between the two, apart from a general increasing trend.

Table 2. Ideal EBS volume size and the corresponding cost.

  Trace Name   Ideal EBS Volume Size (in %)   Operations Cost ($/TB)
  MDS1         0.25                            42.94
  HM1          0.5                             43.63
  MDS0         1                               47.46
  PROJ3        1                               38.47
  PROJ4        10                              68.39
  HM0          18                             113.34
  SRC12        30                             223.75

Note that even this variation does not account for dynamic patterns of access within a single workload over time. Accounting for those patterns can lead to even more substantial savings in costs. This insight leads us to design a dynamic storage framework for file systems in the cloud.

2.2 Storage performance versus cost
In this paper, we mainly focus on optimizing the operational cost of a cloud-based file system. Typically, the cost of a storage service also reflects its performance, i.e., its latency (we do not consider storage throughput as a performance criterion at this time, though the discussion presented here applies equally to it). For instance, Amazon ElastiCache offers better latency than Amazon EBS, which in turn offers better latency than S3. At the same time, storage costs decrease as we move from

ElastiCache towards S3, with I/O costs moving in the opposite direction. We notice two general trends here: (a) storage costs are inversely proportional to I/O access costs, given a choice of multiple storage systems, and (b) a storage system with lower I/O access cost also provides very low access latency. Therefore, these trends suggest that, for any workload, regardless of the access patterns, we should automatically move data blocks to the appropriate storage layer that optimizes costs (as a function of access), and doing so will in some sense also optimize performance. A highly accessed data block should move to a layer with lower I/O access cost, while a data block that is never accessed should be stored in a layer with the lowest storage cost. This dependency between costs and performance will be broken only when this pricing model is violated, i.e., there exists one storage layer with lowest I/O cost but very high I/O latency. We have also not come across such a storage model in existing cloud storage services. If one were to be offered in the future, we could impose performance (latency) constraints on the cost optimization framework presented here to create a low cost storage system that meets certain performance criteria. In the subsequent sections, we outline our design and present algorithms that minimize costs using dynamic storage.

3. A Frugal Cloud File System
Our design is motivated by two factors: (a) in the cloud, storage resources can be purchased on fine time scales, and (b) the working set of a file system can change drastically over short time periods. Storage resources in the cloud can be purchased on a very granular basis, e.g., per GB per hour on Amazon S3 or EBS. These resources can be purchased as often as needed for a single compute instance, e.g., up to a maximum of 16 distinct volumes in EBS. This granular purchase feature is very useful, especially since modern file systems show remarkable variation in the size of the file system accessed over a fixed time window of the order of minutes or hours. Our study of the data used in Narayanan et al. [11, 12] also confirms this variation in nearly all traces. Hence, our goal is to design a cloud storage system that can span multiple cloud storage services, and adaptively grow and shrink in response to the file system workload with the aim of reducing cumulative storage and access costs.

3.1 Components
We present our Frugal Cloud File System, or FCFS, in Figure 5. FCFS consists of three components: (a) Cost Optimizer, (b) Workload Analyzer, and (c) Disk Volume Resize Engine. We explain the functions of each of these components below.

Figure 5. FCFS Structure. (Components shown: read/write requests arriving through the VFS; the Cost Optimizer, Workload Analyzer, and Disk Volume Resize Engine; a Storage Manager; and the cloud storage services S3, EBS, and Azure Cache.)

Cost Optimizer: Given storage systems with varying storage, I/O, and bandwidth costs for storing and accessing data, the Cost Optimizer computes the optimal location for each file, and the duration for which the file should stay in this location. The details of this module are presented in Section 4.1.

Workload Analyzer: The Workload Analyzer looks at the file system working set over time and moves files between the different storage volumes as dictated by the deadlines decided by the Cost Optimizer. It also determines whether some files should be moved ahead of their deadline, using some form of cache replacement policy, such as LRU or ARC [10]. The working of this module is explained in Sections 4.2 and 4.3.

Disk Volume Resize Engine: Using inputs from the Workload Analyzer and the Cost Optimizer, this module adjusts the sizes of the different storage volumes at different time instants in a way that minimizes the overall cost of operating the file system in the cloud. The design and implementation details of this component are described in Section 5.1.

3.2 Illustrative Example
A simple high-level example can illustrate the above functions. Consider a data block of size 4MB that is stored in Amazon S3. Using the data from Table 1, the cost of fetching this block from S3 (one GET request) is the same as the cost of storing it in ElastiCache for 4.5 hours. We show in the next section that this information is all that the Cost Optimizer needs to determine the duration for which the block is kept in ElastiCache. If the block is not accessed for this duration in ElastiCache, one could move it to S3 and relinquish the corresponding space in ElastiCache. Similarly, if the block stored in S3 is being accessed, then we have to decide

whether to bring this block into ElastiCache, either by replacing an existing block in the ElastiCache volume or by adding additional space to accommodate it. This is the job of the Workload Analyzer. If we decide to replace an existing block, then we use LRU to find the replacement block. The Disk Volume Resize Engine decides how often to increase or decrease the size of the ElastiCache volume and by how much. For example, say the increases in ElastiCache are performed once every 4 minutes. During any such 4-minute period, let the existing size of the ElastiCache volume be 2GB, and let the blocks displaced by the Workload Analyzer in the past 4 minutes add up to 1GB. The Resize Engine can then decide to increase the ElastiCache volume by 1GB for the next four minutes. Conversely, if there are data items that have been removed on account of exceeding their duration, or if there has been no block displacement for the past 4 minutes, then the Resize Engine can reduce the ElastiCache volume to keep only those items that are still within the cost-optimal time window. In the next section, we present algorithms that dictate the behavior of these three modules.
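To make the resize decision concrete, the sketch below mirrors the bookkeeping just described. It is a minimal illustration under our own naming (ResizeEngine, blruBytes, evictedBytes, granularity are hypothetical names), not the FCFS implementation itself.

    // Sketch of the Resize Engine decision: grow by what LRU displaced, shrink by what expired.
    #include <algorithm>
    #include <cstdint>

    struct ResizeEngine {
        int64_t cacheBytes;         // current Cache (e.g., ElastiCache) volume size in bytes
        int64_t blruBytes    = 0;   // bytes displaced by LRU before their eviction time (cache too small)
        int64_t evictedBytes = 0;   // bytes evicted because their cost-optimal time window expired
        int64_t granularity;        // minimum resize step G offered by the provider (e.g., 1 GB)

        // Called at each resize interval (e.g., every 4 minutes for growth, hourly for shrinking).
        void resize() {
            int64_t delta = blruBytes - evictedBytes;             // net demand since the last interval
            int64_t step  = (delta / granularity) * granularity;  // truncate to a multiple of G
            cacheBytes    = std::max<int64_t>(0, cacheBytes + step);
            blruBytes = evictedBytes = 0;                         // start a fresh accounting window
        }
    };

In the example above, 1GB of LRU displacement and no expirations would grow a 2GB ElastiCache volume to 3GB at the next 4-minute boundary.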

4. Cost Optimization Algorithms
For ease of understanding, we explain our proposed Frugal Cloud File System (FCFS) using a dual storage system model: one tier is a low-latency system such as Amazon ElastiCache or EBS, while the other is a (relatively) high-latency system such as Amazon S3. We call these two tiers Cache and Disk, respectively. The main expenses of running a file system are the access and storage costs. The cost of fetching data from the disk consists of two components: (a) a per-block I/O access cost that is independent of the size of the block, and (b) a bandwidth cost of transferring the block, which depends on the size of the block (in AWS, the bandwidth cost of transfers between S3 and EBS/ElastiCache instances in the same data center is zero). The storage cost is expressed in units of dollars per block per hour, the per-I/O access cost of moving data is in dollars per access per block, and the bandwidth cost is expressed in dollars per block. We can combine the bandwidth cost and the per-I/O cost into one fetch-cost parameter. Clearly, these parameters will change when the block size changes. Our goal is to optimize the overall costs involved in storing and accessing X bytes of data in the file system with these dual storage systems. Let the cost of storing data in and fetching data from the Disk (Cache) be md (mc) and fd (fc), respectively. This is illustrated in Figure 6.

Figure 6. Dual Storage Systems and the Data Read Paths. (The figure shows the Disk with costs md, fd, the Cache with costs mc, fc, and the three read paths labeled 1, 2, and 3.)

There are three questions that need to be answered to address this problem: (a) where should data reside by default, (b) when should data move from Disk to Cache and vice versa, and

(c) when should the size of the Cache be increased or decreased? In current cloud storage services, services that serve data from main-memory instances are more expensive than services that serve data from permanent stores (such as hard disks). For instance, the per-MB storage costs of Amazon ElastiCache [3] and Azure AppFabric Cache [5] are three and two orders of magnitude higher than those of the Amazon S3 and EBS storage services, respectively. One of the key reasons, among others, seems to be that serving data from main memory also requires a CPU instance (either rented directly by the customer, as in ElastiCache, or hidden behind the service cost, as in Azure Cache), whereas disk-based services (running on NAS and SAN) do not. It is simple to see that all data must by default reside in the Disk, which has the lower storage cost of the two tiers (i.e., md < mc). This is because data has to be stored in one of these two systems by default, and clearly the low-storage-cost location, i.e., the Disk, minimizes this default storage cost. From the Disk, data can be accessed in one of three ways, shown in Figure 6. First, it can be fetched straight from the Disk; in the second method, it can be fetched from the Cache if it exists there. The third way is the conventional caching model, where data is fetched from the Disk and stored in the Cache, from where it is served until it is removed from the Cache. Note that in this method, the data could also be removed from the Disk once it is cached, thus storing it in only one location. Given the very small size of the cache compared to the Disk, doing so has very low impact on cost. However, removing data from the Disk has substantial implications for the resiliency of the data in a cloud storage system, and therefore we do not consider this option further. We simply assume that a version of the data resides in the Disk even after it is moved to the Cache. If fd ≤ fc, then it makes no sense to keep any data in the cache, since total costs are lowered by always accessing data from the disk. However, this is not common in practice, since fc ≪ fd in cloud storage systems. Given that fd > fc, and that future arrivals are not known, one should consider keeping data in the cache for some amount of time whenever it is retrieved from the disk to serve an access.

Hence, we consider the third method more carefully. When a block is requested, it is read from the disk into the cache and is then read from the cache to the VM. This incurs a cost of fd + fc. At this point the block is in both the disk and the cache. The system now has the option of keeping this block in the cache for additional time. During this time, if there is a request for the block, then it can be read from the cache at a cost of fc. Note, however, that while the block is in the cache, the storage cost rate is mc + md. At any point in time, a block can be evicted from the cache. If this is done, then the block is only in the disk and the memory cost rate is md. Our objective is to devise an eviction policy for the cache that minimizes the overall operational cost. In developing the cache replacement algorithms we do not make any assumptions about future arrival patterns. Instead, we develop online algorithms whose cumulative cost is provably within a constant factor of the optimal cost. In determining the optimal cost, we assume that all access times to the blocks are known ahead of time. We show that even when compared to the scenario where all accesses are known ahead of time, our algorithm performs remarkably well. The algorithm as well as the analysis can be viewed as generalizations of the classical ski-rental problem [8].

4.1 Determining the Optimal Cost
We first determine the optimal cumulative storage and access cost when the access times for the block are known ahead of time. Henceforth, the term cost, when used alone, refers to this cumulative storage and access cost, unless explicitly specified otherwise. When the block is accessed, it is read from the disk into the cache. Assume that we know that the next access to the block is after ℓ time units. If after the current retrieval the block is stored in the cache for the next ℓ time units, then the cost will be (mc + md)ℓ + fc. If we do not store the current access in the cache and instead leave it in the disk, then the cost for the next access will be md ℓ + fc + fd. The optimal policy will depend on when the next access occurs. It is better to keep the block in the cache and retrieve it from the cache if

  (mc + md)ℓ + fc ≤ md ℓ + fc + fd  ⇒  ℓ ≤ fd / mc.

If ℓ > fd / mc, then it is more cost-effective to discard the block from the cache and retrieve it from the disk. We use

  T = fd / mc

to denote this crossover time. We denote by OPT(ℓ) the optimum cost if the next access occurs after ℓ time units. Therefore, from the above discussion,

  OPT(ℓ) = (mc + md)ℓ + fc    if ℓ ≤ T,
           md ℓ + fc + fd      if ℓ > T.
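As a concrete illustration of T and OPT(ℓ), the sketch below evaluates them with the Table 1 prices for a 4 KB block, taking S3 as the Disk and EBS as the Cache; the variable names are ours and this is not part of the FCFS code.

    // Sketch: crossover time T = fd/mc and OPT(l) from Section 4.1, evaluated with Table 1 prices.
    #include <cstdio>

    int main() {
        const double hoursPerMonth = 720.0;
        const double blocksPerGB   = 262144.0;            // 4 KB blocks per GB

        // Per-block-per-hour storage rates (md, mc) and per-block fetch costs (fd, fc).
        double md = 0.08 / hoursPerMonth / blocksPerGB;   // S3 storage, $0.08 per GB-month
        double mc = 0.10 / hoursPerMonth / blocksPerGB;   // EBS storage, $0.10 per GB-month
        double fd = 1e-6;                                 // S3 GET, $1 per million requests
        double fc = 1e-7;                                 // EBS I/O, $0.10 per million requests

        double T = fd / mc;                               // crossover time, in hours
        std::printf("T = %.0f hours\n", T);               // roughly the ~1800 hours cited in Section 6.1;
                                                          // using S3 PUTs (fd = 1e-5) gives ~18000 hours

        // Optimal cost if the next access is known to arrive after l hours.
        auto OPT = [&](double l) {
            return (l <= T) ? (mc + md) * l + fc
                            : md * l + fc + fd;
        };
        std::printf("OPT(1 hour) = $%.3g, OPT(1 year) = $%.3g\n", OPT(1.0), OPT(24.0 * 365.0));
        return 0;
    }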

4.2 Deterministic Online Scheme
In an online algorithm, we assume that we do not have any knowledge of when the next access will occur. We describe here a deterministic online scheme.

DET: When a block is accessed (either from the disk or the cache)
• It is stored in the cache for T more time units from the current time.
• It is evicted from the cache after T time units.

If the next access occurs before T time units, then the block is retrieved from the cache and is kept for a further T time units. If the access occurs after T time units, it is retrieved from the disk, at which time it is brought back into the cache, and the cycle starts over again. Let DET(ℓ) represent the cost of the deterministic online scheme when the next access occurs after ℓ time units. We want to show that the ratio of DET(ℓ) to OPT(ℓ) is bounded for all values of ℓ.

4.2.1 Cost Analysis of DET
We do the analysis in two parts:
(a) Next Access Before T: If ℓ ≤ T, then the cost of the online algorithm as well as the optimal algorithm is (mc + md)ℓ + fc, and the ratio of the costs is one:

  DET(ℓ) / OPT(ℓ) = 1,  ∀ℓ ≤ T.

(b) Next Access After T: If the next access occurs at time ℓ > T, then the cost for the online algorithm is (mc + md)T + md(ℓ − T) + fc + fd. The first term is the memory cost of keeping the block in the cache and the disk for T time units. The block is discarded from the cache at time T. For the remaining (ℓ − T) time units, the memory cost is md(ℓ − T). The last two terms are the retrieval cost of the block from the disk. The optimum cost is md ℓ + fc + fd. Therefore

  DET(ℓ) / OPT(ℓ) = [(mc + md)T + md(ℓ − T) + fc + fd] / (md ℓ + fc + fd)
                  = 1 + mc T / (md ℓ + fc + fd)
                  = 1 + fd / (md ℓ + fc + fd)
                  < 2,

where we used the fact that T = fd / mc in the penultimate step. It is possible to show that no deterministic algorithm can perform better than the ratio of 2, using an adversarial workload. Next we outline a probabilistic online algorithm that gives a better expected performance ratio.

4.3 Probabilistic Online Scheme
The deterministic scheme holds a block for T time units in the cache from the last access and then discards it. A natural question to ask is whether the expected cost can be reduced by probabilistically evicting blocks from the cache even before time T. Indeed this can be done if the eviction probabilities are chosen carefully.

PROB: When a block is accessed (either from the disk or the cache)
• We compute a block eviction time based on a probability density function p(t) that describes the probability of discarding a block from the cache at time t ∈ [0, T] from the last access time of the block.
• The block is evicted after this time has elapsed with no subsequent access to this block.

In this scheme, the block is definitely discarded from the cache by time T from its last access time, implying that ∫_0^T p(t) dt = 1. Let E[PROB(ℓ)] denote the expected cost of the probabilistic eviction scheme when the next access is after ℓ time units. Note that the expectation is due to the uncertainty in when the block will be discarded. We do not make any probabilistic assumptions about when the next access will occur. We now want to pick p(t) in order to ensure that the expected competitive ratio

  α = max_ℓ E[PROB(ℓ)] / OPT(ℓ)

is as small as possible.

4.3.1 Cost Analysis of PROB

Assume that we have an access at time ℓ, while the block is discarded from the cache at time t. The expected cost of the probabilistic online algorithm E[PROB(ℓ)] is

  ∫_0^ℓ [(md + mc)t + fc + fd + md(ℓ − t)] p(t) dt + ∫_ℓ^T [(md + mc)ℓ + fc] p(t) dt.

The first integral represents the expected cost if the block is discarded at some time t before the retrieval time ℓ. There is a disk-and-cache cost of (mc + md)t and a disk cost of md(ℓ − t) from the discard time t until the access time ℓ. In addition, there is the reading cost of fc + fd from the disk, since the block has been discarded from the cache before the access time ℓ. The second integral represents the cost when the access time ℓ is before the discard time t. In this case, there is a memory cost of (md + mc)ℓ and the read cost from the cache. Each of these costs is weighted by the probability of discarding the block from the cache at time t.

Our objective is to solve the following optimization problem:

  min α                                   (1)
  E[PROB(ℓ)] ≤ α OPT(ℓ),  ∀ℓ              (2)
  ∫_0^T p(t) dt = 1                        (3)

Differentiating Equation (2) with respect to ℓ and simplifying, we get

  md ∫_0^ℓ p(t) dt + fd p(ℓ) + (md + mc) ∫_ℓ^T p(t) dt ≤ α dOPT(ℓ)/dℓ.

Differentiating again with respect to ℓ, we get

  fd p′(ℓ) − mc p(ℓ) ≤ α d²OPT(ℓ)/dℓ².

Note that from the definition of OPT(ℓ), d²OPT(ℓ)/dℓ² = 0. Moreover, at the optimal point this constraint is tight, and hence the inequality can be replaced by an equality. Recall that T = fd/mc, so the above differential equation can be rewritten as

  p′(t) − (1/T) p(t) = 0.

We can now solve for p(t) to obtain p(t) = K e^(t/T). Using Equation (3), we can solve for K to get K = 1/(T(e − 1)). Therefore the optimal probability distribution is

  p(t) = e^(t/T) / (T(e − 1)).

Substituting this into Equation (2) and solving for α gives the optimum value

  α = 1 + (1/(e − 1)) · fd / (md T + fd + fc) ≤ 1 + 1/(e − 1) < 1.582.

Therefore, PROB has an expected competitive ratio of at most 1.582. Note that this ratio is much better than the competitive ratio of 2 obtained for DET. We now outline how the eviction time is generated in the PROB scheme.

4.3.2 Generating Block Eviction Time
Whenever a block enters the cache or is accessed while in the cache, an eviction time for the block is computed as follows:
• Compute T = fd/mc for the block.
• Generate U, a uniformly distributed random variable in the range [0, 1].
• Set the block eviction time to T log[(e − 1)U + 1] from the current time.

In cases where the cache is examined only periodically, the eviction time is rounded to the closest time at which the cache is examined. This rounding can affect the performance ratio if the rounding intervals are very long, but our results show that this effect is negligible.
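The expression above is inverse-transform sampling from p(t): its CDF is (e^(t/T) − 1)/(e − 1), and setting this equal to U and solving for t gives t = T log[(e − 1)U + 1]. A minimal sketch of such a sampler follows; the function name and the choice of random number generator are ours, not part of FCFS.

    // Sketch: draw a PROB eviction delay in [0, T] by inverse-transform sampling of p(t).
    #include <cmath>
    #include <random>

    double probEvictionDelay(double T, std::mt19937_64& rng) {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        double u = uniform(rng);                                   // U ~ Uniform[0, 1]
        return T * std::log((std::exp(1.0) - 1.0) * u + 1.0);      // T log[(e - 1)U + 1], always <= T
    }

In FCFS this value would be added to the current time to obtain the block's eviction time, and rounded to the next instant at which the cache is examined, as described above.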

5. Implementation and Trace Details
We implemented a dual storage system simulator to evaluate our algorithms. We implemented the simulator in C++ with about 6.5K lines of code. The simulator is highly configurable, which enabled us to experiment with different storage services under different pricing options. We simulated three types of storage services: S3, EBS, and CloudCache – a version of storage that has similar pricing and properties as ElastiCache [3] and Azure Cache [5]. We set the storage price of CloudCache to $100 per GB per month. We used the current pricing values for the rest of the services in our simulator and used the most common values for other parameters. For example, we set the block size in EBS to 4KB (which is the size used by most file systems). Below, we describe the important implementation decisions pertinent to FCFS.

5.1 Volume Resizing
The ideal cache size is the minimum cache needed to host the working set of data from the file system. Ideally, no blocks should be evicted from the cache because there was no space (via LRU) in the cache volume, but only due to cost constraints. As the working set changes, the cache size should also change accordingly. Let the resizing happen at periodic intervals, and let the size of the cache at the moment of a resizing be S GB. Between two resizing events, we keep track of how many blocks are replaced in S before their eviction time due to LRU. Let this add up to BLRU GB. This describes the level of inadequacy of the current cache size. In the same interval, let the blocks that have been evicted by FCFS add up to Bevict GB. This indicates the amount of cache that is no longer needed to hold blocks. Therefore, at the next resizing event, we set the new cache volume size to S + BLRU − Bevict. The analysis in the previous section assumed that the cache can be expanded and contracted at any time instant. In practice there are restrictions on how often and by how much the cache volume can be resized. We describe these in detail next.

Resizing Intervals. We set the volume resizing intervals as follows: every 4 minutes we attempt to increase the cache size, but a cache size decrease is attempted only at the end of every hour. The reasoning behind this is as follows. Amazon EBS, for instance, allows a maximum of 16 volumes to be attached to a VM at any point in time. But once a volume is allocated, it is paid for the entire next hour, and hence it would be a waste of resources to deallocate it before the end of an hour. Moreover, allocating or deallocating a volume involves moving blocks around, which can cause a lot of overhead if done frequently. As a result, we set the interval for incrementing the size to 4 minutes and the interval for decrementing to one hour. We noticed that having a longer period for decreasing the size avoids frequent fluctuations in the size, thus making the cache volume size more stable.

Resizing Granularity. There are also practical restrictions in the cloud on the granularity by which the storage volume size can be increased or decreased. For example, in Amazon, the minimum increment/decrement size for the cache is 1GB. In Pseudocode 1, we use G to represent the granularity of resizing the volume; for instance, G = 1GB in Amazon. If BLRU ≥ Bevict, then BLRU − Bevict represents the amount by which the cache size has to be increased. If BLRU < Bevict, then BLRU − Bevict represents the amount by which the cache size has to be decreased. Due to the granularity restrictions, we round the increase or decrease to the nearest multiple of G. This is shown in Line 26 of Pseudocode 1.

5.2 Separate Read and Write Volumes
We allocate separate read and write caches in EBS and CloudCache. If the cloud storage service charges differently for writes and reads, as is the case for Amazon S3, then the replacement thresholds for a file opened for a read or a write should intuitively be different. Based on the expression for the replacement threshold in the previous section, and from the differential pricing in Table 1 for PUTs and GETs in the S3 service, a file opened for a write has a replacement threshold that is 10 times longer than that of a file opened for a read.

5.3 Block Sizes
We set the block size of data in S3 to 4MB; S3's pricing does not require a specific block size – the prices are based on the number of operations, with a limit of 1GB per operation. But choosing a block size involves a tradeoff: large blocks reduce the I/O cost due to coalescing of writes but increase the storage cost in EBS/Cache, and vice versa. Some prior systems operating on S3 have found block sizes on the order of MBs to provide a good tradeoff [14], and even Amazon uses 4MB for snapshotting [1]. We deal with the discrepancy in block size between S3 and EBS/Cache as follows: whenever an S3 read is issued by EBS, we use a range read to read only the relevant 4KB blocks. But whenever a dirty block is evicted from EBS/Cache, we write back all the 4KB blocks that are dirty in the evicted block's 4MB S3 block. EBS and CloudCache block sizes, however, are set to 4KB, as mentioned before.

5.4 FCFS Pseudocode
The pseudocode for the FCFS algorithm is shown below in Pseudocode 1. This pseudocode is for block reads; the code for block writes is the same, with additional code to track dirty blocks, write back dirty blocks upon eviction, and periodically checkpoint dirty blocks.

Pseudocode 1 FCFS Read(Disk, Cache)
 1: function Initialize()
 2:     Cache size, S = 1 GB; BLRU = Bevict = 0; T ← fd / mc
 3:
 4: function Access Block(Block A)            {// Every block read}
 5:     t ← current time
 6:     if A ∈ Cache then
 7:         Serve Access Request(A)
 8:     else {A ∉ Cache}
 9:         fetch A from Disk
10:         if Cache is full then
11:             R ← FindReplacementBlock(LRU)
12:             evict R from Cache
13:             if evictionTime(R) > t then
14:                 BLRU = BLRU + block size
15:             end if
16:         end if
17:         Load A into Cache
18:     end if
19:     evictionTime(A) ← t + Compute Eviction Time()
20:
21: function Volume Resize(Cache)             {// Periodic}
22:     t ← current time
23:     I ← all blocks in Cache with evictionTime ≤ t
24:     evict all blocks in I
25:     Bevict = Bevict + |I| ∗ block size
26:     S = S + ⌊(BLRU − Bevict) / G⌋ ∗ G
27:     BLRU = Bevict = 0
28:
29: function Compute Eviction Time()
30:     if Eviction Method is Deterministic then
31:         return T
32:     else {Probabilistic eviction method}
33:         r ← Random[0 : 1]
34:         return T log[(e − 1)r + 1]
35:     end if

Pseudocode 1 presents four functions: Initialize is called at the start of the system; Access Block is called to answer every read request from the application; Volume Resize is called periodically to resize the Cache; and finally, Compute Eviction Time is called to decide the eviction time of a block upon access.

5.5 Traces Used for Evaluation
We used the traces from a prior file system study [12]. The authors graciously made the traces public [11], and we use them for our work. The original paper [12] has all the details of the traces; we only present some basic information here. The trace dataset has block-level read and write requests captured below the file system cache for 36 disk volumes from a medium-sized enterprise data center. These disks belong to a wide range of services, ranging from web servers, a source code control system, and user home directories to print servers. This dataset was captured for a period of 168 hours starting from February 22nd, 2007, 5PM GMT [12].

We use only four fields of each trace entry: the timestamp (when the request arrives), the type (read or write), the offset (where the read/write starts on the disk), and the size (the amount of data to be read/written). The trace does not include the size of each disk. As a result, we conservatively estimate the size of a disk to be the highest offset value found in the entire trace. Finally, even though this dataset has traces of 36 different file systems, due to memory limitations in our experimental setup, we could only run experiments on 33 of these file system traces.

5.6 Storage Strategies for Evaluation
We implemented four different storage strategies for evaluation. They are as follows.
• Average working set size strategy (AVG). In this strategy, we set the size of the cache storage layer to the average per-day working set of a file system, averaged over a period of one week.
• FCFS with deterministic kickoff time (DET). In this strategy, we set the time to remove a block from the cache layer to the ratio of the disk volume I/O cost to the cache volume storage cost.
• FCFS with probabilistic kickoff time (PROB). In this strategy, blocks are probabilistically kicked out at the end of every hour.
• Optimal strategy based on trace analysis (OPT). This strategy allocates the cache sizes optimally for each trace at every point in time. Here we pre-process the trace to learn the complete knowledge of the future arrivals of requests to each file system block, and use it to decide whether the block should be stored in the cache upon access or thrown out immediately after access.

Note that the AVG strategy is not implemented in the cloud storage systems of today, but it is an intuitive scheme that can be expected to do very well at reducing cumulative storage and access costs.

6. Trace-Driven Experimental Results
In this section, we present the results from our trace-driven experiments to understand the improvements from dynamic storage tiers. The main question we seek to answer is: what are the cost and storage savings due to our algorithms if we use different cloud storage services as our choices for the cache and disk volumes? We answer this question with experiments on four different combinations of storage services. 1) We look at the S3-EBS combination just the way it is used today on Amazon AWS. 2) We look at the S3-CloudCache combination, where we use CloudCache instead of EBS. In both these cases, the minimum granularity by which the cache storage can be resized is set to 1GB.

3) We explore the savings that we can get if the cloud storage providers were to offer a finer resizing granularity of 64MB, much along the lines of the offering in Windows Azure Caching service (granularity = 128MB). 4) Finally, we compare the cost of S3-EBS and S3-CloudCache combinations with a setting where only S3 is used to offer a file system service.

6.1 Savings in S3-EBS
A file system running on S3-EBS today [13] stores the entire file system image in both S3 and EBS. We used this setting (called FULL from here onwards) in our simulator, and then used DET (where only the relevant blocks are stored in EBS). We then compared FULL with DET by normalizing the DET costs by those of FULL. The ratio of the S3 I/O cost to the EBS storage cost for a 4KB block is 1800 hours for reads and 18000 hours for writes. Unfortunately, the traces we have are only 168 hours long. As a result, the DET, PROB, and OPT strategies never really get a chance to shrink the EBS volume size, which does not help us understand how these strategies perform. So we only compare FULL with DET in the S3-EBS setting, where the EBS volume size keeps increasing for the entire 168 hours. Figure 7 shows the savings in the total cost and Figure 8 shows the savings in the EBS storage size due to DET. As shown in Figure 7, the average savings in total cost is about 22%. The average storage savings, however, is 84% compared to S3-EBS. This high storage savings does not lead to equally high cost savings for two reasons: a) In many file system traces the contribution of EBS storage costs to the total cost is quite small, as the majority of the cost comes from the I/O costs to EBS and S3. This is the most common reason for the discrepancy in savings. b) In some rare cases, the increase in the I/O cost to S3 (due to the reduction in the EBS volume size) is more than the savings due to the decrease in the EBS storage cost. This is mainly due to sudden, high peaks in some traces that increase the required memory size faster than DET grows the EBS volume (1GB every 4 minutes). In this case, many requests will not be cached in EBS, forcing them to be fetched from S3 again later. SRC1 1 is an example of this case: it shows a negative cost savings of 5%, as the slow increase fails to cache data from two peaks in the incoming requests. By making the EBS increase more aggressive, however, this second problem can be avoided. A more interesting result from the above two graphs is the significant savings in the storage size, despite the fact that the EBS volume kept growing throughout the week (as the code to shrink it would only be triggered after 18K hours). This is due to the fact that the working set of a file system is generally much smaller than the total file system. The average working set in a week was about 16% across all of our traces.

Figure 7. A plot of the % savings in the total cost while using DET over today's S3-EBS-based file systems for 33 different file system traces. The average cost savings is about 22%. (y-axis: cost per $100 in S3-EBS-FULL; x-axis: trace name.)

Figure 8. A plot of the % savings in the EBS storage size while using DET over today's S3-EBS-based file systems for 33 different file system traces. The average storage savings is about 84%. (y-axis: storage per 100GB in S3-EBS-FULL; x-axis: trace name.)

Figure 9. A plot of the % savings in the total cost while using S3-CloudCache with 1GB CloudCache granularity. The results are normalized by the AVG strategy – for every $100 spent by AVG, the amount spent by the other three strategies (DET, PROB, OPT) is shown.

Figure 10. A plot of the % savings in the total cost while using S3-CloudCache with 64MB CloudCache granularity. The results are normalized by the AVG strategy – for every $100 spent by AVG, the amount spent by the other strategies (DET, PROB, OPT) is shown.

6.2 Savings in S3-CloudCache
CloudCache storage ($100 per GB per month) is 1000 times more expensive than EBS storage, while the other costs are the same as in the previous experiment. This increase in the storage cost, however, changes the time to cache for reads

to 1.8 hours and the time to cache for writes to 18 hours. This increase in the storage cost also means that almost all of the savings in the total file system cost comes from savings in the CloudCache storage. As a result, we only show the graphs for total cost savings (as they are almost identical to the storage savings graphs).

Due to the shorter eviction time scales relative to the one-week trace, we were able to run all four strategies outlined before. We present the results in Figure 9, where we show the results normalized by the values of the AVG strategy. The figure shows the details of the savings in S3-CloudCache for 33 different traces. On average, across these traces, there is a savings of about 20 to 25% in the total cost. Specifically, for every $100 spent by AVG, the amounts spent by DET, PROB, and OPT are $79, $76, and $69, respectively. The competitive ratios (CR) for PROB and DET are 1.36 and 1.49, respectively. The main reason the savings is so low is that in many of the traces, the performance of all four strategies is identical. We found that this is due to the fact that the minimum granularity of allocation or deallocation is set to 1GB in our implementation. This is akin to the setting in Amazon AWS, but the problem is that many traces have an average per-day working set size smaller than a GB (the working set is computed separately for reads and writes). All these traces get a minimum of 1GB. This is much more than the necessary size and does not shrink any further, which makes the cost of all the strategies identical. In almost all the traces, however, we see the trend we expect: AVG costs more than DET, which costs more than PROB, and OPT is the cheapest. In some rare cases, when the working set size of AVG is very close to 1GB, DET temporarily increases the size of the cache volume and hence incurs a slightly higher cost than AVG (as in Proj 0, Prxy 1, and Mds 0).

6.3 S3-CloudCache with Smaller Storage Granularity
As described in the previous section, the large granularity of allocation and deallocation in a storage service does not allow our algorithms to extract higher cost savings. So here we explore the impact of a finer granularity on our algorithms. Specifically, we set the granularity of allocation to 64MB, and re-ran the experiments from the previous section. Figure 10 shows the savings for the 33 different traces. Clearly, the number of traces where AVG performs as well as OPT has now gone down. The average savings in cost has also gone up significantly. For every $100 spent by AVG, the amounts spent by DET, PROB, and OPT are $54, $52, and $46, respectively. This is about 30% more than the savings with 1GB granularity. Along similar lines, the competitive ratios (CR) from the experiments for DET and PROB have gone down to 1.25 and 1.18, respectively. We then re-ran the experiments with 4MB granularity to understand the benefits of very fine granularity. In this case, for every $100 spent by AVG, $53, $46, and $43 were spent respectively by the DET, PROB, and OPT strategies, and the CRs for DET and PROB were 1.18 and 1.07, respectively. Clearly, there are diminishing returns in reducing the granularity below 64MB. In addition, the cost of our system with 64MB granularity is only 5 to 10% worse than the cost with the very fine granularity of 4MB.

Table 3 presents a summary of the key results: the savings we have observed in the different storage service combinations for the three strategies evaluated in this paper.

  Combination        Granularity   DET   PROB   OPT
  S3EBS-FULL         1GB           22%   -      -
  S3CloudCache-AVG   1GB           21%   24%    31%
  S3CloudCache-AVG   64MB          46%   48%    54%
  S3CloudCache-AVG   4MB           47%   54%    57%

Table 3. Summary of the average cost savings of the DET, PROB, and OPT schemes under different storage combinations. The 90th percentile of the cost savings across all traces is generally within 2% of the average values shown here.
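The savings percentages in Table 3 follow directly from the normalized spend figures reported above; as a quick check:

def savings_pct(spend_per_100_of_baseline):
    # Savings relative to the baseline (AVG or FULL), in percent.
    return 100 - spend_per_100_of_baseline

# S3-CloudCache at 1GB granularity: DET, PROB, OPT spend $79, $76, $69
# for every $100 spent by AVG.
print([savings_pct(s) for s in (79, 76, 69)])  # [21, 24, 31], as in Table 3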

6.4 Comparing Only-S3 with Other Combinations
Finally, we compare the cost of running the different file systems only on S3 with the cost of running them on the S3-EBS and S3-CloudCache (64MB) combinations using DET. Running directly on S3 saves the EBS and CloudCache storage costs, but the I/O costs increase due to the higher I/O cost of S3. We want to know whether running on S3 directly outperforms running a file system on higher-cost storage services such as CloudCache. In Figure 11 we plot the ratio of the cost of DET on S3-EBS to Only-S3, along with the ratio of DET on S3-CloudCache with 64MB granularity to Only-S3. The average ratio across the 33 traces is 0.69 for S3-EBS-DET and 0.9 for S3-CloudCache-DET. The ratio of S3-EBS to Only-S3 is less than 1 in 22 out of the 33 traces; in the other traces, the number of I/O operations is quite low, causing the ratio to go above 1. A more interesting result is that the average ratio of S3-CloudCache to Only-S3 is also less than 1, even though CloudCache is 1000 times more expensive than EBS. CloudCache gives the file system a huge performance (latency) boost at a generally very high storage cost, and yet with our proposed FCFS system we get all the benefits of hosting the file system on CloudCache at a cost that is even lower than that of the cheapest disk-based storage service based on S3.
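The comparison can be sketched with a simplified monthly cost model; the prices follow the pricing tables discussed earlier and the trace summary below is hypothetical, so this illustrates the accounting rather than the actual evaluation, which replays the full traces.

def monthly_cost(fs_gb, cache_gb, s3_puts, s3_gets, cache_ios,
                 s3_storage=0.08,             # $/GB-month for S3
                 cache_storage=0.10,          # $/GB-month for EBS
                 s3_put_price=10.0 / 1e6,     # $ per PUT request
                 s3_get_price=1.0 / 1e6,      # $ per GET request
                 cache_io_price=0.10 / 1e6):  # $ per EBS I/O request
    # All data lives in S3; only the hot working set is kept in the cache tier.
    return (fs_gb * s3_storage + cache_gb * cache_storage
            + s3_puts * s3_put_price + s3_gets * s3_get_price
            + cache_ios * cache_io_price)

# Hypothetical trace: a 500GB file system, a 10GB cached working set, and
# 50M I/Os per month of which 90% hit the cache.
only_s3 = monthly_cost(500, 0, 25e6, 25e6, 0)
s3_ebs_det = monthly_cost(500, 10, 2.5e6, 2.5e6, 45e6)
print(s3_ebs_det / only_s3)  # below 1 when the trace has enough I/O locality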

Figure 11. A plot of the ratios of the total cost of DET on S3-EBS to Only-S3, and of DET on S3-CloudCache (64MB granularity) to Only-S3, across the file system traces. (Y-axis: ratio of total cost to Only-S3; X-axis: the individual traces.)

6.5 Performance of FCFS
We now quantify the performance of FCFS by comparing it against a file system that keeps the entire image in the cache (like a file system run on the S3-EBS combination today). We measure performance in terms of the hit fraction, the fraction of the total number of requests that were served from the cache. In S3-EBS-FULL, all requests would be served from the cache, and hence all requests incur only the cache read/write delay. Figure 12 shows the hit ratio for the DET, PROB, and OPT schemes when the cache is resized in increments of 1GB. Ideally, we would like to see very high hit ratios, indicating overall low latency of access.

Figure 12. A plot of the fraction of the requests that were served from the cache for the various file system traces in our S3-CloudCache setup.

The general observation from the figure is that FCFS does indeed perform well and provides low latency for a majority of the file system traces in the S3-CloudCache setup. In fact, nearly half of the traces have a hit ratio over 0.9, the average hit fraction is 0.7, and fewer than one-third of the traces have a hit ratio below 0.5. Remarkably, both DET and PROB perform well and are within a small fraction of the OPT scheme (DET is usually better, as expected). For the traces with very low hit ratios across all three schemes, we observe that much of the data is accessed in exactly one narrow time interval and never accessed again, leading to cache misses for a vast majority of accesses in those traces. The OPT scheme will intuitively have a slightly lower hit ratio than the DET and PROB schemes, for the following reason: since OPT knows in advance whether a block will be accessed within the cost-optimal time threshold following its last access, it evicts blocks that will not be accessed in time right away. This leads to a smaller active cache size than under the DET or PROB schemes, so the decrease portion of our FCFS algorithm aggressively reduces the cache size of OPT in the initial stages. This process stabilizes with time, after which all three schemes see similar hit ratios from the cache, which is also why DET and PROB end up very close to OPT. As part of our future work, we are looking into ways of providing performance SLAs while also optimizing the cost of the cloud storage services.
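For completeness, the hit fraction reported in Figure 12 can be measured along the following lines; this sketch uses a DET-like simplification (a block stays cached for a fixed interval after its last access), and the interval and toy trace are placeholders, not values from our experiments.

def hit_fraction(trace, threshold):
    # trace: list of (timestamp, block_id) pairs in arrival order.
    # threshold: caching interval after the last access (e.g. the 1.8-hour
    # read threshold in the S3-CloudCache setup).
    last_access = {}
    hits = total = 0
    for t, block in trace:
        total += 1
        if block in last_access and t - last_access[block] <= threshold:
            hits += 1            # request served from the cache
        last_access[block] = t   # block is (re)cached on every access
    return hits / total if total else 0.0

# Toy trace: four accesses to one block, the last one after the threshold.
print(hit_fraction([(0.0, 'b1'), (1.0, 'b1'), (1.5, 'b1'), (10.0, 'b1')], 1.8))
# -> 0.5: the second and third accesses hit, the first and last miss.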

7. Conclusion
There is a wide variety of storage options in the cloud, and an enterprise is faced with the problem of deciding which combination of storage services to use in order to build a cost-effective cloud file system. Using actual pricing data, we have shown that for a typical enterprise, no single storage option provides a cost-effective file system solution. The appropriate combination of storage options depends on the working-set dynamics of the file system, and the correct combination to use can vary over time. In this paper we have:
• Presented the Frugal Cloud File System (FCFS), which provides a dynamic framework for cost-effective file storage in the cloud.
• Developed and analyzed two cache-resizing schemes that significantly reduce the cost of storing files in the cloud. Theoretical analysis shows that the resizing algorithms provide constant-factor performance guarantees, independent of the access patterns.
• Experimented with several real-world file system traces and shown that FCFS significantly reduces the cost of running the file system.
In summary, FCFS integrates multiple storage services and dynamically adapts the storage volume sizes of each service to provide a cost-efficient cloud file system with provable performance bounds. As part of our future work, we are looking into extending the benefits of FCFS from a two-layered storage hierarchy to a generic multi-tier storage framework that provides cost optimality within the required performance bounds.

Acknowledgments
We thank Eric Jul for his thoughtful comments on earlier versions of this paper.

