Improved Approximation Algorithms for Data Migration Samir Khuller · Yoo-Ah Kim · Azarakhsh Malekian

Received: 14 December 2009 / Accepted: 23 May 2011 / Published online: 6 July 2011 © Springer Science+Business Media, LLC 2011

Abstract Our work is motivated by the need to manage data items on a collection of storage devices to handle dynamically changing demand. As demand for data items changes, for performance reasons, the system needs to automatically respond to changes in demand for different data items. The problem of computing a migration plan among the storage devices is called the data migration problem. This problem was shown to be NP-hard, and an approximation algorithm achieving an approximation factor of 9.5 was presented for the half-duplex communication model in Khuller, Kim and Wan (Algorithms for data migration with cloning. SIAM J. Comput. 33(2):448–461, 2004). In this paper we develop an improved approximation algorithm that gives a bound of 6.5 + o(1) using new ideas. In addition, we develop better algorithms using external disks and get an approximation factor of 4.5 using external disks. We also consider the full duplex communication model and develop an improved bound of 4 + o(1) for this model, with no external disks. Keywords Data migration · Edge coloring · Approximation algorithms

A preliminary version of the paper was presented at the 2006 APPROX conference. Research of S. Khuller was supported by NSF CCF 0728839 and a Google Research Award. S. Khuller · A. Malekian Department of Computer Science, University of Maryland, College Park, MD 20742, USA e-mail: [email protected] S. Khuller e-mail: [email protected] Y.-A. Kim Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA e-mail: [email protected]


Algorithmica (2012) 63:347–362

1 Introduction To handle high demand, especially for multimedia data, a common approach is to replicate data items within the storage system. Typically, a large storage server consists of several disks connected using a dedicated network, called a Storage Area Network. Disks typically have constraints on storage as well as the number of clients that can access data items from a single disk simultaneously. These systems are getting increasing attention since TV channels are moving to systems where TV programs will be available for users to watch with full video functionality (pause, fast forward, rewind etc.). Such programs will require large amounts of storage, in addition to bandwidth capacity to handle high demand. Approximation algorithms have been developed [6, 11, 17, 18] to map known demand for data items to a specific data layout pattern to maximize utilization, where the utilization is the total number of clients that can be assigned to a disk that contains the data item they want. In the layout, we compute not only how many copies of each item we need, but also a layout pattern that specifies the precise subset of items on each disk. The problem is NP-hard, but there are polynomial-time approximation schemes [6, 11, 17, 18]. Given the relative demand for data items, the algorithm computes an almost optimal layout. Note that this problem is slightly different from the data placement problem considered in [3, 9, 16] since all the disks are in the same location, it does not matter which disk a client is assigned to; even in this special case, the problem is NP-hard [6]. Over time as the demand for data changes, the system needs to create new data layouts. The problem we are interested in is the problem of computing a data migration plan for the set of disks to convert an initial layout to a target layout. 
We assume that data objects have the same size (these could be data blocks, or files) and that it takes the same amount of time to migrate any data item from one disk to another disk. In this work we consider two models. In the first model (half-duplex) the crucial constraint is that each disk can participate in the transfer of only one item—either as a sender or as a receiver in each round. In other words, the communication pattern in each round forms a matching. Our goal is to find a migration schedule to minimize the time taken to complete the migration (makespan). To handle high demand for popular objects, new copies will have to be dynamically created and stored on different disks. All previous work on this problem deals with the half-duplex model. We also consider the full-duplex model, where each disk can act as a sender and a receiver in each round for a single item. Previously we did not consider this natural extension of the half-duplex model since we did not completely understand how to utilize its power to prove interesting approximation guarantees. The formal description of the data migration problem is as follows: data item i resides in a specified (source) subset Si of disks, and needs to be moved to a (destination) subset Di . In other words, each data item that initially belongs to a subset of disks, needs to be moved to another subset of disks. (We might need to create new copies of this data item and store it on an additional set of disks.) See Fig. 1 for an example. If each disk had exactly one data item, and needs to copy this data item to every other disk, then it is exactly the problem of gossiping. The data migration problem in this form was first studied by Khuller, Kim and Wan [14], and it was


Fig. 1 An initial and target layout (left). For example, disk 1 initially has items {2, 4, 5} and in the target layout has items {1, 3, 4}. The corresponding Si ’s and Di ’s are shown on the right. Data item i resides in a subset Si of disks initially, and needs to be moved to Di during migration. For example, item 1 is located in disk 2 and 3 in the initial layout and in the target layout, we need to create a copy of the item in disk 1

shown to be NP-hard. In addition, a polynomial-time 9.5-approximation algorithm was developed for the half-duplex communication model. A slightly different formulation was considered by Hall et al. [10] in which a particular transfer graph was specified. In their transfer graph, vertices represent the storage devices and edges represent data transfers. The transfer at a vertex is complete when all the edges (transfers) incident on it complete. While they are able to solve the problem very well, this approach is limited in the sense that it allows neither (a) cloning (creation of several new copies) nor (b) optimization over the space of transfer graphs. In [14] it was shown that a more general problem formulation is the one with source and destination subsets specified for each data item. However, the main focus in [10] is to do the transfers without violating space constraints. Another formulation has been considered recently where one can optimize over the space of possible target layouts [12]. The resulting problems are also NP-hard, and no significant progress on developing approximation algorithms was made on this problem. The authors of [12] presented a simple flow-based heuristic for the problem, which was demonstrated to be effective in finding good target layouts. Job migration has also been considered recently in the scheduling context [2], where a fixed number of jobs can be migrated to reduce the makespan by as much as possible. There is a lot of work on data migration for minimizing completion time for a fixed transfer graph as well (see [5, 15] for references). The transfer graph definition in their model is similar to [14]. In [15], Kim proved that the problem is NP-hard when edge lengths are the same and showed that Graham's list scheduling algorithm [8], when guided by an optimal solution to a linear programming relaxation, gives an approximation ratio of 3.
For the objective of minimizing the average vertex completion time, Gandhi and Mestre [5] gave a primal-dual 3-approximation for unit processing times and a 5.83-approximation for arbitrary processing times. For minimizing the average edge completion time, they present a $\sqrt{2}$-approximation for bipartite graphs.

1.1 Communication Model

Different communication models can be considered based on how the disks are connected. In this paper we consider two models. The first model is the same model as in the work by Hall et al. [1, 10, 13, 14] where the disks may communicate on any


matching; in other words, the underlying communication graph allows for communication between any pair of devices via a matching (a switched storage network with unbounded backplane bandwidth). Moreover, to model the limited switching capacity of the network connecting the disks, one could allow for choosing any matching of bounded size as the set of transfers that can be done in each round. We call this the bounded-size matching model. It was shown in [14] that an algorithm for the bounded matching model can be obtained by a simple simulation of the algorithm for the unbounded matching model with excellent performance guarantees. In addition we consider the full duplex model where each disk may act as a sender and a receiver for an item in each round. Note that we do not require the communication pattern to be a matching any more. For example, we may have cycles, with disk 1 sending an item to disk 2, disk 2 to disk 3 and disk 3 to disk 1. In earlier work we did not discuss this model as we were unable to utilize the power of this model to prove non-trivial approximation guarantees.

1.2 Our Results

Our approach is based on the approach initially developed by Khuller et al. [14]. Various new ideas enable a reduction of the approximation factor to 6.5 + o(1). The main technical difficulty is simply that of "putting it all together" and making the analysis work. In addition we show two more results. If we are allowed to use "external disks" (called bypass disks in [10]), we can improve the approximation guarantee further to $3 + \frac{1}{2}\max(3, \gamma)$. (We assume that each external disk can hold $\gamma$ items.) This can be achieved by using at most $\lceil \Delta/\gamma \rceil$ external disks, where $\Delta$ is the number of items that need to be migrated. This gives an approximation factor of 4.5 by setting $\gamma = 3$. Finally, we also consider the full-duplex model where each disk can be both the source of one transfer and the destination of another in each round.
In this model we show that an approximation guarantee of 4 + o(1) can be achieved. The algorithm developed in [14] has been implemented, and we performed an extensive set of experiments comparing its performance with the performance of other heuristics [7]. Even though the worst case approximation factor is 9.5, the algorithm performed very well in practice, giving approximation ratios within twice the lower bounds computed by the algorithm in most cases. In Sect. 2 we first discuss a brief overview of our 6.5 + o(1)-approximation algorithm along with some theorems and lower bounds that we will utilize for the analysis. The full details are described in Sect. 3. The algorithms for external disks and full duplex models are discussed in Sect. 4 and 5, respectively. Section 6 concludes the paper.

2 The Data Migration Algorithm We start this section by describing some theorems from the edge coloring and scheduling literature, as well as the lower bounds that we will use in the following sections for the analysis. In the second part, we present our data migration algorithm.


2.1 Preliminaries

Our algorithms make use of known results on edge coloring of multigraphs. Given a graph $G$ with maximum degree $\Delta_G$ and multiplicity $\mu$, the following results are known (see Bondy-Murty [4] for example). Let $\chi$ be the edge chromatic number of $G$. Note that when $G$ is bipartite, $\chi = \Delta_G$ and such an edge coloring can be obtained in polynomial time [4].

Theorem 1 (Vizing [21]) If $G$ has no self-loops then $\chi \le \Delta_G + \mu$.

Theorem 2 (Shannon [19]) If $G$ has no self-loops then $\chi \le \lfloor \frac{3}{2}\Delta_G \rfloor$.

Another result that we will use (related to scheduling) is the following theorem by Shmoys and Tardos [20]:

Theorem 3 We are given a collection of jobs $J$, each of which is to be assigned to exactly one machine among the set $M$; if job $j \in J$ is assigned to machine $i \in M$, then it requires $p_{ij}$ units of processing time, and incurs a cost of $c_{ij}$. Suppose that there exists a fractional solution (that is, a job can be assigned fractionally to machines) with makespan $P$ and total cost $C$. Then in polynomial time we can find a schedule with makespan $P + \max p_{ij}$ and total cost $C$.

We use two main lower bounds for our analysis. As in [14] let $\beta_j$ be $|\{i \mid j \in D_i\}|$, i.e., the number of different sets $D_i$ to which a disk $j$ belongs. We define $\beta$ as $\max_{j=1,\dots,N} \beta_j$. In other words, $\beta$ is an upper bound on the number of items a disk may need. Note that $\beta$ is a lower bound on the optimal number of rounds, since the disk $j$ that attains the maximum needs at least $\beta$ rounds to receive all the items $i$ such that $j \in D_i$, as it can receive at most one item in each round. Another lower bound that we will use in the analysis is $\alpha$, which is defined as follows: for each item $i$ decide a primary source $s_i \in S_i$ so that $\alpha = \max_{j=1,\dots,N} (|\{i \mid j = s_i\}| + \beta_j)$ is minimized. In other words, $\alpha$ is the maximum number of items for which a disk may be a primary source ($s_i$) or destination. As one can see, $\alpha$ is also a lower bound on the optimal number of rounds.
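The two lower bounds can be made concrete with a small sketch. It assumes items are given as dictionaries mapping an item name to its list of source disks ($S_i$) or destination disks ($D_i$), with disks numbered from 0; the function names `beta_values` and `min_alpha` are illustrative and not from the paper. Instead of a general max-flow routine, the sketch uses an augmenting-path search on the item/disk bipartite graph with disk capacities $\alpha - \beta_j$, which should be equivalent for this unit-capacity setting.

```python
from collections import defaultdict


def beta_values(dest_sets, num_disks):
    """Return [beta_j]: how many destination sets D_i contain disk j."""
    beta = [0] * num_disks
    for disks in dest_sets.values():
        for j in disks:
            beta[j] += 1
    return beta


def min_alpha(source_sets, dest_sets, num_disks):
    """Return (alpha, primary): the smallest alpha for which every item i gets
    a primary source in S_i while disk j sources at most alpha - beta_j items.
    Assumes every S_i is non-empty (otherwise no alpha is feasible)."""
    beta = beta_values(dest_sets, num_disks)
    alpha = max(beta)
    while True:
        cap = [alpha - b for b in beta]
        held = defaultdict(list)  # disk -> items currently sourced there

        def place(i, seen):
            # Augmenting-path search: put item i on some disk of S_i,
            # relocating already placed items if that frees a slot.
            for j in source_sets[i]:
                if j in seen:
                    continue
                seen.add(j)
                if len(held[j]) < cap[j]:
                    held[j].append(i)
                    return True
                for other in list(held[j]):
                    if place(other, seen):
                        held[j].remove(other)
                        held[j].append(i)
                        return True
            return False

        if all(place(i, set()) for i in source_sets):
            primary = {i: j for j, items in held.items() for i in items}
            return alpha, primary
        alpha += 1  # try the next candidate value, starting from beta
```

For instance, with three disks, sources `{"a": [0, 1], "b": [0], "c": [0]}` and destinations `{"a": [2], "b": [1, 2], "c": [1]}`, items b and c must both use disk 0 as their primary source, and placing item a on either disk 0 or disk 1 pushes the bound to 3.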
We will describe how to compute $\alpha$ optimally in polynomial time using network flow algorithms. Moreover, we may assume that $D_i \neq \emptyset$ and $D_i \cap S_i = \emptyset$. This is because we can define the destination set $D_i$ as the set of disks that need item $i$ and do not currently have it. Next, we present a high level description of our data migration algorithm.

2.2 Data Migration Algorithm: High Level Idea

The high level description of the algorithm is as follows:

Algorithm 1 (Data Migration Algorithm)

1. For each item $i$, find a disk (call it the primary source) $s_i \in S_i$ such that $\max_{j=1,\dots,N} (|\{i \mid j = s_i\}| + \beta_j)$ is minimized. Later we show how we can do this step in polynomial time.


2. For each item $i$, we define two different subgroups $G_i \subseteq D_i$ and $R_i \subseteq D_i$ with the following properties:
– The $G_i$'s are disjoint from each other. At first, we send item $i$ to these disks.
– The $R_i$ sets are not disjoint from each other, but each disk belongs to only a bounded number of different $R_i$ sets.
In our algorithm, we send data items from $G_i$ to $R_i$ and then from $R_i$ to the rest of the disks in $D_i$.

3. The $R_i$ sets are selected as follows:
(a) First partition $D_i$ into subgroups $D_{i,k}$, $k = 0, \dots, \lfloor |D_i|/q \rfloor$, of size at most $q$ ($q$ is a parameter that will be specified later). That is, we partition $D_i$ into $\lfloor |D_i|/q \rfloor$ subgroups of size $q$ and possibly one subgroup of size less than $q$ (if $|D_i|$ is not a multiple of $q$).
(b) Now select $R_i \subseteq D_i$ and assign each $D_{i,k}$ to a disk in $R_i$ such that for each disk in $R_i$ the total size of subgroups (the total number of disks) assigned to the disk is at most $\beta + q$. (We will see later that it is always possible to select $R_i$ with this property.) Let $r_i$ be the disk in $R_i$ to which the small subgroup (a subgroup with size strictly less than $q$) is assigned. Note that if $|D_i|$ is a multiple of $q$, there is no disk $r_i$. We define $\bar{R}_i$ to be $R_i \setminus \{r_i\}$. We need this classification for the analysis.

4. Compute $G_i \subseteq D_i$ such that $|G_i| = \lfloor |D_i|/\beta \rfloor$ and the sets $G_i$ are mutually disjoint.

5. For each item $i$ for which $G_i = \emptyset$ but $\bar{R}_i \neq \emptyset$, we select a disk $g_i$. Let $G'_i = G_i$ if $G_i$ is not empty and $G'_i = \{g_i\}$ otherwise. Note that a disk $g_i$ exists iff $q < |D_i| < \beta$.

6. Send data item $i$ from the primary source $s_i$ to $G'_i$.

7. Send item $i$ from $G'_i$ to $\bar{R}_i \setminus G'_i$ by setting up a transfer graph and using an edge coloring to schedule the transfer. Here, $\bar{R}_i$ is defined to be $R_i \setminus \{r_i\}$, where $r_i$ is the disk in $R_i$ to which the small subgroup (a subgroup with size strictly less than $q$) is assigned.

8. Send item $i$ from $s_i$ to $r_i$ if $r_i$ has not received item $i$.

9. Finally set up a transfer graph from $R_i$ to $D_i \setminus R_i$. We find an edge coloring of the transfer graph, and the number of colors used is an upper bound on the number of rounds required to ensure that each disk in $D_i$ gets item $i$. In Lemma 7 we derive an upper bound on the number of required colors.

In the algorithm given in [14], the migration is performed in two stages. In the first stage, each item $i$ is migrated to disjoint subsets of the $D_i$'s (called $G_i$ there), and in the second stage, the items are migrated from the $G_i$'s to the rest of the disks in $D_i$. Choosing disjoint sets makes broadcasting inside the subsets faster and easier, but it also limits the size of these subsets, and as a result the number of rounds required to complete the second stage increases. In our algorithm we add an extra intermediate stage. As in the previous method, the items are first migrated to disjoint sets $G_i$. In the intermediate stage, the data items are sent to a specific subset of disks in $D_i$ (we call them $R_i$), which are not necessarily disjoint from each other; however, the overlap is limited. Finally we migrate items from $R_i$ to the rest of the disks in $D_i$. A high level presentation of the algorithm can be found in Fig. 2. In the following sections, we describe the details of Algorithm 1.
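As a small illustration of Step 3(a), the following sketch splits one destination set into subgroups of size $q$ plus a possible remainder. The function name `partition_destinations` and the choice to sort the disks first are assumptions made for the example; the paper does not prescribe which disks go into which subgroup.

```python
def partition_destinations(dest, q):
    """Split the destination disks of one item into consecutive subgroups
    of size q; the last subgroup may be smaller (the 'small' subgroup)."""
    disks = sorted(dest)  # any fixed order would do; sorting is arbitrary
    return [disks[k:k + q] for k in range(0, len(disks), q)]


def group_size(dest, beta):
    """Size of G_i as required in Step 4: floor(|D_i| / beta)."""
    return len(dest) // beta
```

With seven destination disks and `q = 3` this yields two large subgroups and one small subgroup of a single disk, which is the subgroup that would be assigned to $r_i$.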


Fig. 2 An overall picture of the data migration algorithm

3 Details of Steps

In this section, we discuss some of the steps of the algorithm in more detail.

Description of Step 1: Selecting the Primary Source for Each Item

This is exactly the same as Lemma 3.1 described in [14].

Lemma 1 [14] We can find a source $s_i \in S_i$ for each item $i$ so that $\max_{j=1,\dots,N} (|\{i \mid j = s_i\}| + \beta_j)$ is minimized, using a flow network.

Proof We create a flow network with a source $s$ and a sink $t$ as shown in Fig. 3. We have two sets of nodes corresponding to disks and items respectively. Add directed edges from $s$ to the nodes for items and also directed edges from item $i$ to disk $j$ if $j \in S_i$. The capacity of all those edges is one. Finally we add an edge from the node corresponding to disk $j$ to $t$, with capacity $\alpha - \beta_j$. We want to find the minimum $\alpha$ so that the maximum flow of the network is $\Delta$, the number of items. We can do this by checking whether there is a flow of value $\Delta$, with $\alpha$ starting from $\beta$ and increasing by one until such a flow exists. If there is outgoing flow from item $i$ to disk $j$, then we set $j$ as $s_i$.

Description of Step 3: Selecting $R_i$ for Each Item $i$

Let $D_{i,k}$ ($k = 1, \dots, \lfloor |D_i|/q \rfloor$) be the $k$th subgroup of $D_i$. The size of $D_{i,k}$ is $q$ for $k = 1, \dots, \lfloor |D_i|/q \rfloor$, and $D_{i,k}$ with $k = \lfloor |D_i|/q \rfloor + 1$ contains the remaining $|D_i| - q \cdot \lfloor |D_i|/q \rfloor$ disks of $D_i$ (it may be empty). To show how we assign the subgroups $D_{i,k}$ to disks in $R_i$ we use Theorem 3. In our problem, we can


think of each subgroup $D_{i,k}$ as a job and each disk as a machine. If disk $j$ belongs to $D_i$, then we can assign job $D_{i,k}$ to disk $j$ with zero cost. The processing time is the size of $D_{i,k}$, which is at most $q$. If disk $j$ does not belong to $D_i$, then the cost to assign $D_{i,k}$ to $j$ is $\infty$ (disk $j$ cannot be in $R_i$). First we can show that:

Lemma 2 There exists a fractional assignment such that the max load of each disk is at most $\beta$.

Proof We can assign a $\frac{1}{|D_i|}$ fraction of subgroup $D_{i,k}$ to each disk $j \in D_i$. It is easy to check that every subgroup $D_{i,k}$ is completely assigned. The load on disk $j$ is given by

$$\sum_{i:\, j \in D_i} \sum_{k} \frac{1}{|D_i|}\,|D_{i,k}| \;=\; \sum_{i:\, j \in D_i} \frac{\sum_{k} |D_{i,k}|}{|D_i|} \;=\; \sum_{i:\, j \in D_i} 1 \;\le\; \beta.$$
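The identity in the proof of Lemma 2 can be checked numerically with a short sketch. It assumes destination sets are given as a dictionary from item to list of disks and uses exact fractions; the helper name `fractional_loads` is illustrative only.

```python
from fractions import Fraction


def fractional_loads(dest_sets, q, num_disks):
    """Load on each disk when a 1/|D_i| fraction of every subgroup of D_i
    is assigned to every disk of D_i (the assignment used in Lemma 2)."""
    load = [Fraction(0)] * num_disks
    for disks in dest_sets.values():
        disks = sorted(disks)
        subgroups = [disks[k:k + q] for k in range(0, len(disks), q)]
        for group in subgroups:
            for j in disks:
                # processing time |D_{i,k}| times the assigned fraction 1/|D_i|
                load[j] += Fraction(len(group), len(disks))
    return load
```

Since the subgroup sizes of an item sum to $|D_i|$, each item contributes exactly 1 to every disk in its destination set, so the load on disk $j$ equals $\beta_j$ regardless of $q$.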

Now we can show that:

Lemma 3 There is a way to choose the $R_i$ sets for each $i = 1, \dots, \Delta$ (where $\Delta$ is the number of items) and assign subgroups $D_{i,k}$ such that for each disk in $R_i$ the total size of subgroups $D_{i,k}$ assigned to the disk is at most $\beta + q$.

Proof By Theorem 3, we can convert the fractional solution obtained in Lemma 2 to an integral solution such that each subgroup is completely assigned to one disk, and the maximum load on a disk is at most $\beta + q$ (since the maximum size of $D_{i,k}$ is $q$).

Considering this assignment, we can directly conclude that:

Fact 1 For each disk $j$, at most $\beta/q + 1$ different large subgroups $D_{i,k}$ (of size exactly $q$) can be assigned to the disk $j$. In other words, a disk can belong to at most $\beta/q + 1$ different $\bar{R}_i$ sets.

The reason is that the total size of the subgroups assigned to a single disk is at most $\beta + q$ and the size of each large subgroup is $q$. We will use this fact later.

Description of Step 4: Select $G_i \subseteq D_i$

We can find disjoint sets $G_i \subseteq D_i$ using the same algorithm as in [14]. For completeness we include their method here:

Lemma 4 There is a way to choose disjoint sets $G_i$ for each $i = 1, \dots, \Delta$, such that $|G_i| = \lfloor |D_i|/\beta \rfloor$ and $G_i \subseteq D_i$.

Proof First note that the total size of the sets $G_i$ is at most $N$:

$$\sum_{i=1}^{\Delta} |G_i| \;\le\; \sum_{i=1}^{\Delta} \frac{|D_i|}{\beta} \;=\; \frac{1}{\beta} \sum_{i=1}^{\Delta} |D_i|.$$

Note that $\sum_{i=1}^{\Delta} |D_i|$ is at most $\beta N$ by definition of $\beta$. This proves the upper bound of $N$ on the total size of all sets $G_i$. We now show how to find the sets $G_i$.

Fig. 3 Computing $G_i$ sets

As shown in Fig. 3, we create a flow network with a source $s$ and sink $t$. In addition we have two sets of vertices $U$ and $W$. The first set $U$ has $\Delta$ nodes, each corresponding to an item. The set $W$ has $N$ nodes, each corresponding to a disk in the system. We add directed edges from $s$ to each node in $U$ such that the edge $(s, i)$ has capacity $\lfloor |D_i|/\beta \rfloor$. We also add directed edges with unit capacity from node $i \in U$ to $j \in W$ if $j \in D_i$. We add unit capacity edges from nodes in $W$ to $t$. We find a max-flow from $s$ to $t$ in this network. The min-cut in this network is obtained by simply selecting the outgoing edges from $s$. To see this, note that we can find a fractional flow of this value as follows: saturate all the outgoing edges from $s$. From each node $i$ there are $|D_i|$ edges to nodes in $W$. Suppose $\lambda_i = \lfloor |D_i|/\beta \rfloor$. Send $\frac{1}{\beta}$ units of flow along $\lambda_i \beta$ outgoing edges from $i$. Note that since $\lambda_i \beta \le |D_i|$ this can be done. Observe that the total incoming flow to a vertex in $W$ is at most 1 since there are at most $\beta$ incoming edges, each carrying at most $\frac{1}{\beta}$ units of flow. An integral max-flow in this network will correspond to $|G_i|$ units of flow going from $s$ to $i$ and from $i$ to a subset of vertices in $D_i$ before reaching $t$. The vertices to which $i$ has nonzero flow will form the set $G_i$.
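The construction in Lemma 4 can be sketched in code. Rather than building the flow network explicitly, the sketch below uses augmenting paths on the item/disk bipartite graph, where item $i$ needs $\lfloor |D_i|/\beta \rfloor$ disks and every disk serves at most one item; this should be equivalent to the integral max-flow for this network. The function name `choose_disjoint_groups` and the dictionary input format are assumptions made for the example.

```python
def choose_disjoint_groups(dest_sets, beta):
    """Pick mutually disjoint G_i within D_i with |G_i| = floor(|D_i| / beta).
    `beta` is assumed to be the maximum number of destination sets that
    contain a single disk, as in the paper."""
    owner = {}  # disk -> item whose group currently contains it

    def augment(i, seen):
        # Try to give item i one more disk, re-routing other items if needed.
        for j in dest_sets[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in owner or augment(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i, disks in dest_sets.items():
        for _ in range(len(disks) // beta):
            if not augment(i, set()):
                raise ValueError("no feasible choice; is beta too small?")

    groups = {i: set() for i in dest_sets}
    for j, i in owner.items():
        groups[i].add(j)
    return groups
```

For example, with destinations `{"a": [0, 1, 2, 3], "b": [2, 3], "c": [0, 1]}` and `beta = 2`, the routine returns a group of two disks for item a and one disk each for b and c, with no disk shared between groups.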

Description of Step 5: Selecting $G'_i$ for Each Item $i$

The above approach can help us find the $G_i$ sets. However, if $G_i = \emptyset$ but $\bar{R}_i \neq \emptyset$, we need to select another disk $g_i$ as well. Note that if $|G_i| = 0$ then $|D_i| < \beta$, and therefore $|\bar{R}_i| < \beta/q$. We define $G'_i$ to be $G_i$ if $G_i \neq \emptyset$ and $G'_i = \{g_i\}$ otherwise. Next, we show how to select $g_i$.

Lemma 5 For each item $i$ for which $G_i = \emptyset$ but $\bar{R}_i \neq \emptyset$, we can find $g_i$ so that for every disk $j$, $\sum_{i:\, j = g_i} |\bar{R}_i| \le 2\beta/q + 1$. In other words, each $g_i$ is responsible for at most $\beta/q$ disks, but a disk can be $g_i$ for multiple items, so a disk may be responsible for up to $2\beta/q + 1$ disks.

Proof We again use Theorem 3. Reduce the problem to the following scheduling problem: consider each disk as a machine. For each item such that $|G_i| = 0$, create a job of size $|\bar{R}_i|$. The cost of assigning job $i$ to machine $j$ is 1 if $j \in \bar{R}_i$, otherwise it is infinite. Note that there is a fractional assignment such that the load on each machine is at most $\beta/q + 1$. To see this, assign a $\frac{1}{|\bar{R}_i|}$ fraction of each job to each machine in its $\bar{R}_i$ set. The load due to this job on the machine is 1. Since a disk is in at most $\beta/q + 1$ different $\bar{R}_i$ sets (based on Fact 1), the fractional load on each machine is at most $\beta/q + 1$. By applying the Shmoys-Tardos [20] scheduling algorithm (Theorem 3), we can find an assignment of jobs (items) to machines (disks) such that the total cost is at most the number of items and the load on each machine (disk) is at most $2\beta/q + 1$. Note that the size of each job is at most $\beta/q$. The disk $g_i$ will be the disk (machine) that item $i$ is assigned to.

Description of Step 6: Sending Items from $S_i$ to $G'_i$

First we show how to send data items from $S_i$ to $G'_i$ and also give the number of rounds these transfers take. We claim that this can be done in $2\,\mathrm{OPT} + O(\beta/q)$ rounds. We develop a lower bound on the optimal solution by solving the following linear program $L(m)$ for a given $m$. $L(m)$:

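$$\sum_{j} \sum_{k=1}^{m} n_{ijk}\, x_{ijk} \;\ge\; |G'_i| \quad \text{for all } i \qquad (1)$$

$$\sum_{i} x_{ijk} \;\le\; 1 \quad \text{for all } j, k \qquad (2)$$

$$0 \;\le\; x_{ijk} \;\le\; 1 \qquad (3)$$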

where $n_{ijk} = \min(2^{m-k}, |G'_i|)$ if disk $j$ belongs to $S_i$ and $n_{ijk} = 0$ otherwise. Intuitively, $x_{ijk}$ indicates that at time $k$, disk $j$ sends item $i$ to some disk in $G'_i$. Let $M$ be the minimum $m$ such that $L(m)$ has a feasible solution. Note that $M$ is a lower bound on the optimal solution: given any feasible migration that takes $m$ rounds, set $x_{ijk}$ according to the schedule as defined above. One can easily verify that this gives a feasible solution for the linear program $L(m)$, and $M$ is by definition the smallest $m$ for which $L(m)$ has a feasible solution. Now, we show that:

Lemma 6 We can perform migrations from $S_i$ to $G'_i$ in $2M + O(\beta/q)$ rounds.

Proof Given a fractional solution $x^{*}$ to $L(M)$, we can obtain an integral solution $x^{**}$ such that for all $i$, $\sum_j \sum_k x^{**}_{ijk} \ge \lfloor \sum_j \sum_k x^{*}_{ijk} \rfloor$ (using Lemma 3.4 from [14]). For each item $i$, we arbitrarily select $\min(\sum_j \sum_k x^{**}_{ijk}, |G'_i|)$ disks from $G'_i$. Let $H_i$ denote this subset. We create the following transfer graph from $S_i$ to $H_i$: create an edge from a disk $j \in S_i$ to a disk in $H_i$ if $x^{**}_{ijk} = 1$. (Make sure every disk in $H_i$ has an incoming edge from a disk in $S_i$.) Note that the indegree of a disk in this transfer graph is at most $2 + \beta/q$, since a disk can belong to $H_i$ for at most $2 + \beta/q$ different $i$'s (a disk can be $g_i$ for at most $\beta/q + 1$ different items, because in the worst case a node is responsible for $\beta/q$ sets of size 1 and one set of size $\beta/q$, and it may also belong to one $G_i$). The outdegree is at most $M$ (because of Constraints (2) and the fact that $k \le M$), and the multiplicity of the transfer graph is at most $2\beta/q + 4$ (again because a disk can belong to $H_i$ for at most $2 + \beta/q$ different $i$'s and each can be a source or destination). Therefore, we can perform the migration from $S_i$ to $H_i$ in $M + O(\beta/q)$ rounds by Theorem 1. For $i$ with $|G'_i| = 1$, the transfer is complete. For the rest of the items, since sets


$G'_i$ ($= G_i$) are disjoint, we can double the number of copies in each round until the number of copies becomes $|G_i|$. After $M$ rounds, the number of copies we can make for item $i$ is at least

$$\begin{aligned}
2^{M} |H_i| &= 2^{M} \min\Big(\sum_j \sum_k x^{**}_{ijk},\; |G_i|\Big) \\
&\ge \min\Big(2^{M-1} \cdot 2 \sum_j \sum_k x^{**}_{ijk},\; |G_i|\Big) \\
&\ge \min\Big(2^{M-1} \Big(\sum_j \sum_k x^{**}_{ijk} + 1\Big),\; |G_i|\Big) \\
&\ge \min\Big(2^{M-1} \sum_j \sum_k x^{*}_{ijk},\; |G_i|\Big) \\
&\ge \min\Big(\sum_j \sum_k n_{ijk}\, x^{*}_{ijk},\; |G_i|\Big) \;\ge\; |G_i|.
\end{aligned}$$

The second inequality comes from the fact that $\sum_j \sum_k x^{**}_{ijk} \ge 1$. Therefore we can finish the whole transfer from $S_i$ to $G'_i$ in $2M + O(\beta/q)$ rounds by Theorem 1.

Description of Step 7: Sending Item $i$ from $G'_i$ to $\bar{R}_i$

We now focus on sending item $i$ from the disks in $G'_i$ to disks in $\bar{R}_i$. We construct a transfer graph to send items from $G'_i$ to $\bar{R}_i$ so that each disk in $\bar{R}_i \setminus G'_i$ receives item $i$ from one disk in $G'_i$. We create the transfer graph as follows: first, add directed edges from disks in $G_i$ to disks in $\bar{R}_i$. Recall that $|G_i| = \lfloor |D_i|/\beta \rfloor$ and $|\bar{R}_i| \le \lfloor |D_i|/q \rfloor$. Since the $G_i$ sets are disjoint, there is a transfer graph in which each disk in $G_i$ has $O(\beta/q)$ outgoing edges. For items with $G_i = \emptyset$, we put edges from $g_i$ to all disks in $\bar{R}_i$. The outdegree of each disk can be increased by at most $2\beta/q + 1$. The indegree of a disk in $\bar{R}_i$ is at most $\beta/q + 1$ and the multiplicity is at most $2\beta/q + 2$. Therefore, this step can be done in $O(\beta/q)$ rounds.

Description of Step 8: Sending Item $i$ from $s_i$ to $r_i$

Again we create a transfer graph in which there is an edge from $s_i$ to $r_i$ if $r_i$ has not received item $i$ in the previous steps. The indegree of a disk $j$ is at most $\beta_j$, since a disk $j$ is selected as $r_i$ only if $j \in D_i$, and the outdegree of disk $j$ is at most $\alpha - \beta_j$, so the total degree of every disk is at most $\alpha$. Using Theorem 2, this step can be done in $3\alpha/2$ rounds.

Description of Step 9: Sending Item $i$ from $R_i$ to $D_i \setminus (R_i \cup G_i)$

We now create a transfer graph from $R_i$ to $D_i \setminus (R_i \cup G_i)$ such that there is an edge from disk $a \in R_i$ to disk $b$ if the subgroup that $b$ belongs to is assigned to $a$ in Lemma 3. We find an edge coloring of the transfer graph. The following lemma gives an upper bound on the number of rounds required to ensure that each disk in $D_i$ gets item $i$.

Lemma 7 The number of colors we need to color the transfer graph is at most $3\beta + q$.


Proof First, we compute the maximum indegree and outdegree of each node. The outdegree of a node is at most $\beta + q$ due to the way we choose $R_i$ (see Lemma 3). The indegree of each node is at most $\beta$ since in the transfer graph we send items only to the disks in their corresponding destination sets. The multiplicity of the graph is also at most $\beta$ since we send item $i$ from disk $j$ to disk $k$ (or vice versa) only if both disk $j$ and $k$ belong to $D_i$. By Theorem 1, we see that the maximum number of colors needed is at most $3\beta + q$.

To wrap up, in the next theorem we show that the total number of rounds in this algorithm is bounded by $6.5 + o(1)$ times the optimal solution.

Theorem 4 The total number of rounds required for the data migration is at most $6.5 + o(1)$ times OPT.

Proof The total number of rounds we need is $2M + 3\alpha/2 + 3\beta + O(\beta/q) + q$. Since $M$, $\alpha$, and $\beta$ are lower bounds on the optimal solution, choosing $q = \Theta(\sqrt{\beta})$ gives the desired result.
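The arithmetic behind Theorem 4 can be spelled out as follows. This is a sketch that collects the per-step round counts stated for Steps 6 to 9 and treats the $O(\sqrt{\beta})$ term as lower order, which assumes that OPT (and hence possibly $\beta$) grows.

```latex
\begin{align*}
\text{rounds}
  &\le \underbrace{2M + O(\beta/q)}_{\text{Step 6}}
     + \underbrace{O(\beta/q)}_{\text{Step 7}}
     + \underbrace{\tfrac{3}{2}\alpha}_{\text{Step 8}}
     + \underbrace{3\beta + q}_{\text{Step 9}} \\
  &\le \left(2 + \tfrac{3}{2} + 3\right)\mathrm{OPT} + O(\beta/q) + q
     && \text{since } M, \alpha, \beta \le \mathrm{OPT} \\
  &= 6.5\,\mathrm{OPT} + O(\sqrt{\beta})
     && \text{for } q = \Theta(\sqrt{\beta}) \\
  &\le 6.5\,\mathrm{OPT} + O(\sqrt{\mathrm{OPT}})
   = (6.5 + o(1))\,\mathrm{OPT}.
\end{align*}
```

The choice $q = \Theta(\sqrt{\beta})$ balances the two competing terms $O(\beta/q)$ and $q$: a larger $q$ shrinks the intermediate stage but lengthens the final one.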

4 External Disks

Until now we assumed that we had $N$ disks, that the source and destination sets were chosen from this set of disks, and that only essential transfers are performed. In other words, if an item $i$ is sent to disk $j$, then it must be that $j \in D_i$ (disk $j$ was in the destination set for item $i$), hence the total number of transfers done is the least possible. In several situations, we may have access to idle disks with available storage that we can make use of as temporary devices to enable a faster completion of the transfers we are trying to schedule. In addition, we exploit the fact that by performing a small number of non-essential transfers (this was also used in [10, 13]), we can further reduce the total number of rounds required. We show that such techniques can considerably reduce the total number of rounds required for performing the transfers from the $S_i$ sets to the $D_i$ sets. We assume that each external disk has enough space to pack $\gamma$ items. If we are allowed to use $\lceil \Delta/\gamma \rceil$ external disks, where $\Delta$ is the number of items, the approximation ratio can be improved to $3 + \max(1.5, \frac{\gamma}{2})$. For example, choosing $\gamma = 3$ gives a bound of 4.5.

Define $\bar{\beta} = \frac{\sum_{i=1}^{\Delta} |D_i|}{N}$. We can see that $2\bar{\beta}$ is a lower bound on the optimal number of rounds since in each round at most $\frac{N}{2}$ data items can be transferred. The high level description of the algorithm is as follows:

1. Assign $\gamma$ items to each external disk. Send items to their assigned external disks.
2. For each item $i$, choose disjoint $G_i$ sets of size $\lfloor |D_i|/\bar{\beta} \rfloor$.
3. Send item $i$ to all disks in the $G_i$ set.
4. Send item $i$ from the $G_i$ set to all the disks in $D_i$. We will also make use of the copy of item $i$ on the external disk.

We now discuss these steps in detail.
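As a quick illustration of the quantities used in this section, the sketch below computes $\bar{\beta}$, the resulting lower bound on the number of rounds, and the group sizes $\lfloor |D_i|/\bar{\beta} \rfloor$. The function name `external_disk_parameters` is hypothetical, rounding $2\bar{\beta}$ up to an integer is an addition justified only by the fact that the number of rounds is integral, and the input is assumed to contain at least one destination disk.

```python
import math
from fractions import Fraction


def external_disk_parameters(dest_sets, num_disks):
    """Return (beta_bar, lower_bound, sizes) for the half-duplex model.
    beta_bar    = sum_i |D_i| / N
    lower_bound = ceil(2 * beta_bar), since a round moves at most N/2 items
    sizes[i]    = floor(|D_i| / beta_bar), the size of G_i in step 2"""
    total = sum(len(d) for d in dest_sets.values())
    beta_bar = Fraction(total, num_disks)
    lower_bound = math.ceil(2 * beta_bar)
    sizes = {i: math.floor(len(d) / beta_bar) for i, d in dest_sets.items()}
    return beta_bar, lower_bound, sizes
```

With four disks and destinations `{"a": [0, 1, 2, 3], "b": [2, 3], "c": [0, 1]}`, eight transfers are needed in total, so $\bar{\beta} = 2$ and any schedule takes at least four rounds.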


The first step can be done in at most $\max(\alpha, \gamma)$ rounds by sending the items from their primary sources to the external disks. For this step the primary disks are chosen by the same method that we used to compute $\alpha$, as discussed in Sect. 3. Once $\alpha$ is computed, we make a bipartite graph with two sets of vertices. The first set (call it the primary set) contains the primary sources and the second set (call it the external set) contains the external disks. We put edges between primary sources and the corresponding external disks for each item. Based on the way that we construct this graph, the outdegree of the disks in the primary set is at most $\alpha$ and the indegree of the vertices in the external set is at most $\gamma$. Since the graph is bipartite, we can find an edge coloring (and hence a transfer schedule) with $\max(\alpha, \gamma)$ colors in polynomial time.

We can easily choose disjoint sets $G_i$ as we are allowed to perform non-essential transfers (i.e., a disk $j$ can belong to $G_i$ even if $j$ is not in $D_i$). Hence we can use a simple greedy method to choose $G_i$. Broadcasting items inside $G_i$ can be done in $2M$ rounds as described in Sect. 2.

The next step is to send the item to all the remaining disks in the $D_i$ sets. We make a transfer graph as follows: assign to each disk in $G_i$ at most $\bar{\beta}$ disks in $D_i$ so that each disk in $D_i$ is assigned to at most one disk in the $G_i$ set. The number of unassigned disks from each $D_i$ set is at most $\bar{\beta}$. Assign all of the remaining disks from $D_i$ to the external disk containing that item. The outdegree of the internal disks is at most $\bar{\beta}$ since each disk belongs to at most one $G_i$ set. The indegree of each internal disk is at most $\beta$ since a disk will receive an item only if it is in its demand set. The multiplicity between two internal disks is at most 2 (since each disk can belong to at most one $G_i$ set). So the total degree of each internal disk is at most $\beta + \bar{\beta}$. Each external disk has at most $\gamma$ items and the number of remaining disks for each item is at most $\bar{\beta}$. So the outdegree of each external disk is at most $\gamma \bar{\beta} \le \frac{\gamma}{2}\,\mathrm{OPT}$.

In summary, the first step can be done in $\max(\alpha, \gamma)$ rounds. Step 3 can be completed in $2M$ rounds, and step 4 can be done in $\max(\beta + \bar{\beta}, \gamma \bar{\beta}) \le \frac{1}{2}\max(3, \gamma)\,\mathrm{OPT} + \max(2, \gamma)$ rounds (using $\bar{\beta} \le \frac{\mathrm{OPT}}{2}$). So the total number of rounds needed for the whole transfer is at most $\alpha + 2M + 3 + \frac{1}{2}\max(3, \gamma)\,\mathrm{OPT} + \max(2, \gamma) \le (3 + \frac{1}{2}\max(3, \gamma))\,\mathrm{OPT} + 2\gamma + O(1)$.

Theorem 5 There is a $(3 + \frac{1}{2}\max(3, \gamma))\,\mathrm{OPT} + 2\gamma + O(1)$ approximation algorithm for data migration when there exist $\lceil \Delta/\gamma \rceil$ external disks, where $\Delta$ is the number of items.

5 Full Duplex Model

In this section we consider the full duplex communication model. In this model, we assume that each disk can send and receive at most one item in each round. In the half-duplex model, we assumed that in each round a disk can either send or receive one item (but not both at the same time). In the full duplex model the communication pattern does not have to induce a matching since directed cycles are allowed (the direction indicates the data item transfer direction). We develop a $4 + o(1)$ approximation algorithm for this model. In this model, given a transfer graph $G$, we find an optimal migration schedule for $G$ as follows: construct a bipartite graph by putting one copy of each disk in each partition. We

Fig. 4 Computing α

call the copy of vertex $u$ in the first partition $u_A$, and the copy in the other partition $u_B$. We add an edge from $u_A$ to $v_B$ in the bipartite graph if and only if there is a directed edge in the transfer graph from $u$ to $v$. The bipartite graph can be colored optimally in polynomial time, and the number of colors is equal to the maximum degree of the bipartite graph. Note that $\beta$ and $M$ are still lower bounds on the optimal solution in the full-duplex model.

The algorithm is the same as in Sect. 2 except for the procedure to select the primary sources $s_i$.

– For each item $i$, decide a primary source $s_i$ so that $\alpha' = \max_{j=1,\dots,N} \max\left(|\{i \mid s_i = j\}|, \beta_j\right)$ is minimized.

Note that $\alpha'$ is also a lower bound for the optimal solution. We can find these primary sources as shown in Lemma 8, by adapting the method used in [14].

Lemma 8 By using network flow we can choose primary sources to minimize $\max_{j=1,\dots,N} \max\left(|\{i \mid s_i = j\}|, \beta_j\right)$.
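The selection in Lemma 8 can be illustrated with a small sketch. It is not the paper's implementation: instead of a general maximum-flow routine it searches for augmenting paths directly over item and disk pairs, which is sufficient here because every item carries exactly one unit of flow. The function names, the input format (`sources[i]` lists the disks holding item `i`), and passing $\beta$ as a single integer are assumptions made for illustration.

```python
def assign_with_cap(sources, n_disks, cap):
    """Give every item a primary source with at most `cap` items per disk.

    sources[i] lists the disks that hold item i (hypothetical input format).
    Returns the list of chosen disks, or None if no such assignment exists.
    """
    held = [[] for _ in range(n_disks)]   # items currently assigned to each disk
    primary = [None] * len(sources)

    def place(item, seen):
        # Depth-first search for an augmenting path starting at `item`.
        for d in sources[item]:
            if d in seen:
                continue
            seen.add(d)
            if len(held[d]) < cap:
                held[d].append(item)
                primary[item] = d
                return True
            for other in list(held[d]):
                if place(other, seen):    # `other` was moved to another disk
                    held[d].remove(other)
                    held[d].append(item)
                    primary[item] = d
                    return True
        return False

    for item in range(len(sources)):
        if not place(item, set()):
            return None
    return primary


def choose_primary_sources(sources, n_disks, beta):
    """Return (primary, cap) for the smallest cap >= beta that is feasible."""
    if not sources:
        return [], beta
    # `cap` plays the role of the capacity on the disk-to-sink edges.
    for cap in range(max(beta, 1), max(beta, len(sources)) + 1):
        primary = assign_with_cap(sources, n_disks, cap)
        if primary is not None:
            return primary, cap
    raise ValueError("some item has no source disk")
```

For example, with `sources = [[0], [0, 1], [0, 1], [1, 2]]`, three disks, and `beta = 1`, a cap of 1 is infeasible because three items can only be served by disks 0 and 1, so the search settles on a cap of 2.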

Proof Create two vertices $s$ and $t$ (see Fig. 4 for an example). Make two sets of nodes, one for the items and one for the disks. Add a unit-capacity edge from $s$ to each item node. Add a directed edge of infinite capacity from item $j$ to disk $i$ if $i \in S_j$. Add an edge of capacity $\alpha'$ from each disk node to $t$. Find the minimum $\alpha'$ (initially $\alpha' = \beta$) for which there is a feasible flow whose value equals the number of items; since all finite capacities are integers, such a flow can be taken to be integral. For each item $j$, choose as its primary source $s_j$ the disk to which it sends one unit of flow.

Theorem 6 There is a $4 + o(1)$ approximation algorithm for data migration in the full duplex model.

Proof Sending data items from $S_i$ to $G_i$ (Step 6) and from $G_i$ to $R_i$ (Step 7) still takes $2M + O(\beta/q)$ rounds and $O(\beta/q)$ rounds, respectively. For Step 8, if we construct a bipartite graph, then the maximum degree is at most $\max(\alpha', \beta)$, which is the number of rounds required for this step. For Step 9, the maximum degree of the bipartite

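graph is $\beta + q$. Therefore, the total number of rounds we need is $2M + \max(\alpha', \beta) + \beta + O(\beta/q) + q$. Since $M$, $\alpha'$, and $\beta$ are all lower bounds on the optimal solution, the first three terms sum to at most $4\,\mathrm{OPT}$. By choosing $q = \Theta(\sqrt{\beta})$, the remaining terms are $O(\sqrt{\beta})$, and we obtain a $4 + o(1)$-approximation algorithm.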
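The round counts in this section all rest on coloring a bipartite graph with as many colors as its maximum degree. The following is a minimal sketch of that step for a full-duplex transfer graph, using the standard alternating-path argument for bipartite edge coloring. The function name and the input format (a list of `(sender, receiver)` pairs over disks numbered from 0) are assumptions for illustration and are not taken from the paper.

```python
def full_duplex_schedule(n_disks, transfers):
    """Assign a round to each transfer (u, v) so that in every round each
    disk sends at most one item and receives at most one item.

    Sender copies u_A form one side of a bipartite graph and receiver copies
    v_B the other; rounds are edge colors. Uses max(outdegree, indegree) rounds.
    """
    out_deg = [0] * n_disks
    in_deg = [0] * n_disks
    for u, v in transfers:
        out_deg[u] += 1
        in_deg[v] += 1
    rounds = max(out_deg + in_deg, default=0)

    # send[u][c] / recv[v][c]: index of the transfer using round c at u_A / v_B.
    send = [[None] * rounds for _ in range(n_disks)]
    recv = [[None] * rounds for _ in range(n_disks)]
    color = [None] * len(transfers)

    for e, (u, v) in enumerate(transfers):
        a = send[u].index(None)   # a round still free at the sender
        b = recv[v].index(None)   # a round still free at the receiver
        if recv[v][a] is not None:
            # Walk the path from v_B whose edges alternate a, b, a, ...
            path, at_receiver, x, c = [], True, v, a
            while True:
                f = recv[x][c] if at_receiver else send[x][c]
                if f is None:
                    break
                path.append(f)
                x = transfers[f][0] if at_receiver else transfers[f][1]
                at_receiver = not at_receiver
                c = b if c == a else a
            # Swap a and b along the path; this frees round a at v_B.
            for f in path:
                fu, fv = transfers[f]
                send[fu][color[f]] = None
                recv[fv][color[f]] = None
            for f in path:
                fu, fv = transfers[f]
                color[f] = b if color[f] == a else a
                send[fu][color[f]] = f
                recv[fv][color[f]] = f
        color[e] = a
        send[u][a] = e
        recv[v][a] = e
    return color, rounds
```

Calling `full_duplex_schedule(3, [(0, 1), (1, 2), (2, 0), (0, 2), (1, 0)])` schedules the directed cycle on disks 0, 1, 2 in a single round, which a half-duplex matching could not do. The sketch processes transfers one at a time for clarity and makes no attempt at the best known running time.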

6 Conclusion

In this paper, we studied the data migration problem, in which the objective is to find a migration plan among the storage devices. We developed an improved approximation algorithm that gives a bound of $6.5 + o(1)$. The improvement mainly comes from using additional intermediate representative sets $R_i$, which are not necessarily disjoint but overlap only a limited number of times. We also utilized existing copies of items more efficiently for a tighter analysis. In addition, we developed better algorithms using external disks, obtaining an approximation factor of 4.5, and we considered the full duplex communication model, for which we developed an improved bound of $4 + o(1)$ with no external disks.

References

1. Anderson, E., Hall, J., Hartline, J., Hobbes, M., Karlin, A., Saia, J., Swaminathan, R., Wilkes, J.: An experimental study of data migration algorithms. In: Workshop on Algorithm Engineering, London, UK, 2001, pp. 145–158. Springer, Berlin (2001)
2. Aggarwal, G., Motwani, R., Zhu, A.: The load rebalancing problem. In: Symposium on Parallel Algorithms and Architectures, pp. 258–265 (2003)
3. Baev, I.D., Rajaraman, R.: Approximation algorithms for data placement in arbitrary networks. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms, pp. 661–670 (2001)
4. Bondy, J.A., Murty, U.S.R.: Graph Theory with Applications. American Elsevier, New York (1977)
5. Gandhi, R., Mestre, J.: Combinatorial algorithms for data migration to minimize average completion time. Algorithmica 54(1), 54–71 (2009)
6. Golubchik, L., Khanna, S., Khuller, S., Thurimella, R., Zhu, A.: Approximation algorithms for data placement on parallel disks. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms, Washington, D.C., USA, 2000, pp. 661–670. Society for Industrial and Applied Mathematics, Philadelphia (2000)
7. Golubchik, L., Khuller, S., Kim, Y., Shargorodskaya, S., Wan, Y.: Data migration on parallel disks: algorithms and evaluation. Algorithmica 45(1), 137–158 (2006)
8. Graham, R.L.: Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 416–429 (1969)
9. Guha, S., Munagala, K.: Improved algorithms for the data placement problem. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 2002, pp. 106–107. Society for Industrial and Applied Mathematics, Philadelphia (2002)
10. Hall, J., Hartline, J., Karlin, A., Saia, J., Wilkes, J.: On algorithms for efficient data migration. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms, pp. 620–629 (2001)
11. Kashyap, S., Khuller, S.: Algorithms for non-uniform size data placement on parallel disks. J. Algorithms 60(2), 144–167 (2006)
12. Kashyap, S., Khuller, S., Wan, Y.C., Golubchik, L.: Fast reconfiguration of data placement in parallel disks. In: 2006 ALENEX Conference, Jan. 2006
13. Khuller, S., Kim, Y., Wan, Y.C.: On generalized gossiping and broadcasting. In: European Symposia on Algorithms, Budapest, Hungary, 2003, pp. 373–384. Springer, Berlin (2003)
14. Khuller, S., Kim, Y.A., Wan, Y.C.: Algorithms for data migration with cloning. SIAM J. Comput. 33(2), 448–461 (2004)

15. Kim, Y.: Data migration to minimize the average completion time. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms, pp. 97–98 (2003)
16. Meyerson, A., Munagala, K., Plotkin, S.A.: Web caching using access statistics. In: Symposium on Discrete Algorithms, pp. 354–363 (2001)
17. Shachnai, H., Tamir, T.: Polynomial time approximation schemes for class-constrained packing problems. In: Workshop on Approximation Algorithms. LNCS, vol. 1913, pp. 238–249 (2000)
18. Shachnai, H., Tamir, T.: On two class-constrained versions of the multiple knapsack problem. Algorithmica 29, 442–467 (2001)
19. Shannon, C.E.: A theorem on colouring lines of a network. J. Math. Phys. 28, 148–151 (1949)
20. Shmoys, D.B., Tardos, E.: An approximation algorithm for the generalized assignment problem. Math. Program., Ser. A 62, 461–474 (1993)
21. Vizing, V.G.: On an estimate of the chromatic class of a p-graph. Diskretn. Anal. 3, 25–30 (1964) (Russian)