Scheduling divisible loads on partially reconfigurable hardware

K. N. Vikram and V. Vasudevan
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India - 600036
[email protected], [email protected]

Abstract

For a task mapped to the reconfigurable fabric (RF) of a partially reconfigurable hybrid processor architecture, significant speedup can be obtained if multiple processing units (PUs) are used to accelerate the task. In this paper, we present the results obtained from a quantitative analysis for a single data-parallel task mapped to the RF of a bus-based hybrid processor architecture. The architectural constraints in this case include run-time reconfiguration delay and a shared data bus to main memory.

1. Introduction

Reconfigurable hybrid processor architectures, consisting of a general-purpose processor (GPP) coupled to a reconfigurable fabric (RF), allow flexible implementation of any application. In order to obtain the maximum possible speedup, spatial parallelism and partial run-time reconfiguration (PRTR) are used. Typical applications mapped to reconfigurable hybrid processors include signal/image processing, multimedia and computer vision. Many of the tasks in these applications are data-parallel, i.e., their input data can be partitioned and processed independently by multiple, identical processing units (PUs) configured in the RF. Task schedulers use this fact to increase the speedup of various applications.

Partial run-time reconfiguration allows the configuration of one PU to overlap with computation on other PUs, which can be used to minimize the reconfiguration overhead. Since PUs used for data-parallel tasks operate independently, each PU can start functioning as soon as the RF area allocated to it is configured. In a bus-based hybrid processor architecture, all PUs use a shared data bus for accessing main memory. Reconfiguration delay and limited data bandwidth are the two architectural constraints present in such a system. In order to achieve minimum processing time under these constraints, a systematic technique is required for scheduling and allocating load to the PUs.

We have developed a framework for this analysis based on divisible load theory (DLT) [2] in our earlier work [3]. The specific case considered in that work was a situation where the results of the computation could be retained within the local memory of the RF itself. The analysis in [3] is not applicable if the RF needs to be reconfigured to perform another task, in which case the computed results have to be sent back to main memory. We address this problem here. The problem of scheduling with consideration of result collection has received rigorous treatment in [1], for processors in an arbitrary tree network. The bus network considered in this paper is a specific case of an arbitrary tree network. We therefore use the results in [1] as the foundation for performing the required analysis. PRTR introduces an additional dimension to the problem and gives some interesting results.

Table 1. Notation used in the analysis of our system, based on DLT.

Symbol   Description
n        Number of PUs used.
α_i      Fraction of total load assigned to PU p_i.
w        Ratio of computation time of a PU for a given load, to the computation time of a standard PU for the same load.
z        Ratio of time taken to transmit a given load on the bus, to the time taken to transmit the same load on a standard bus.
T_cp     Time taken to process entire load by the standard PU.
T_cm     Time taken to transmit entire load on a standard bus.
T_p      Total processing time, including result collection.
T_r      Time taken to configure / reconfigure a single PU.
σ        wT_cp/(zT_cm)
β        (σ + 2)/(σ + 1)

14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06) 0-7695-2661-6/06 $20.00 © 2006

2. Analysis and Results

The notation that we use for our computation and communication model is given in Table 1, based on DLT. Using this notation, the time taken to transfer load fraction α_i to PU p_i is α_i zT_cm, whereas the time taken by p_i for processing it is α_i wT_cp. If the result data size is the same as the input data size, the time taken to transfer the result from p_i back to memory is also α_i zT_cm. Reconfiguration of the PUs is assumed to occur continuously using a separate configuration bus, from p_1 to p_n, where n is the number of PUs used.

Minimum processing time can be obtained if the following optimality criteria are satisfied: (1) each PU should never be idle between its load transfer and result transfer phases, (2) the data bus and any PU should never simultaneously be idle, and (3) the result transfer sequence is the same as the load transfer sequence. We have rigorous proofs for each of them. A consequence of the optimality criteria is that during the result transfer phase there should be no gaps on the data bus.

Figure 1. Timing diagram for T_r > zT_cm/n. [Figure not reproduced: bus and PUs p_1 to p_n plotted against time t; legend: partial reconfiguration in process, PU unconfigured, load transfer, PU executing, result transfer.]

Figure 2. Results for 1-D DWT. [Figure not reproduced: normalized processing time φ against the number of PUs utilized, n, for β = 1.06 and ρ = 0, 0.5, 1.0, 2.0, 3.4, 7.0 and 18.0, with the minimum φ for each ρ marked.]

Performing a quantitative analysis on our system while enforcing the above criteria gives us the following results. If n PUs are used for task acceleration and the reconfiguration time T_r ≤ zT_cm/n, then the load fractions allocated to the PUs are equal, and no gaps occur during load transfer. The total processing time is then T_p = T_r + zT_cm + (zT_cm + wT_cp)/n. When T_r > zT_cm/n, there are gaps in load transfer, as shown in Fig. 1. The load fractions and optimum processing time are then given by Eq. (1).
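The equal-allocation case (T_r ≤ zT_cm/n) can be cross-checked numerically. The Python sketch below is illustrative only and is not part of the analysis itself: the function names and the event-by-event replay are assumptions, and the replay models nothing beyond a serialized data bus, sequential configuration of p_1 to p_n, and result size equal to input size.

```python
def equal_allocation_time(n, t_r, z_tcm, w_tcp):
    """Closed-form T_p for the case T_r <= z*T_cm / n (equal load fractions)."""
    if t_r > z_tcm / n:
        raise ValueError("equal allocation applies only when T_r <= z*T_cm / n")
    return t_r + z_tcm + (z_tcm + w_tcp) / n


def simulate_equal_allocation(n, t_r, z_tcm, w_tcp):
    """Replay the schedule on a single shared bus and return the finish time.

    Assumption: PU p_i is fully configured at time i * T_r and results are
    returned in the same order in which the loads were sent.
    """
    alpha = 1.0 / n          # equal load fraction per PU
    bus_free = 0.0           # time at which the data bus becomes free
    result_ready = []        # time at which each PU finishes computing
    for i in range(1, n + 1):
        load_start = max(bus_free, i * t_r)   # wait for bus and configuration
        load_end = load_start + alpha * z_tcm
        bus_free = load_end
        result_ready.append(load_end + alpha * w_tcp)
    for ready in result_ready:
        # a result is sent once the PU is done and the bus is free
        bus_free = max(bus_free, ready) + alpha * z_tcm
    return bus_free
```

For example, with n = 4, T_r = 1, zT_cm = 8 and wT_cp = 40 (arbitrary time units), both functions return 21.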

\[
\begin{aligned}
\alpha_i &= \beta^{i-1}\alpha_1 - \frac{T_r}{zT_{cm}}\left(\beta^{i-1} - 1\right), \qquad i = 1, \ldots, n, \\
T_p &= T_r\left(1 + \frac{1}{\beta - 1} - \frac{n}{\beta^{n} - 1}\right) + zT_{cm}\left(1 + \frac{1}{\beta^{n} - 1}\right),
\end{aligned}
\]

where

\[
\alpha_1 = \frac{T_r}{zT_{cm}} - \frac{\beta - 1}{\beta^{n} - 1}\left(\frac{nT_r}{zT_{cm}} - 1\right). \tag{1}
\]

However, if wT_cp is small, some PUs might finish computation even before the end of the load transfer phase. In such cases, an optimal schedule does not exist and we have developed heuristic strategies, the basic idea being that result transfer can occur in the gaps during load transfer.

We have shown that for a limited range of T_r, there exists a heuristic load allocation strategy that results in an optimal processing time. For other reconfiguration times also, we have derived closed-form expressions for the load fractions and processing time. We have verified the developed theory for computation of the 1-D DWT. Fig. 2 shows the variation of φ = T_p/(zT_cm) with the number of PUs, for different values of ρ = T_r/(zT_cm). The parameter β is based on the computation speed of a PU. The figure shows that an optimum number of PUs exists for a given ρ, beyond which more PUs do not contribute to speedup. It also gives the minimum possible processing time.

3. Conclusions

We have presented a theoretical framework for scheduling load for a data-parallel task mapped to the RF of a hybrid processor. The theory gives the maximum speedup that can be obtained, and is also a good approximation when the application load is not arbitrarily divisible. The theory is also useful for deriving the design considerations for optimal usage of the shared data bus.
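As a numerical illustration of Eq. (1) and of the trend reported for Fig. 2, the Python sketch below evaluates the load fractions and the normalized time φ = T_p/(zT_cm) in terms of ρ = T_r/(zT_cm) and β, and scans n for the smallest φ. It is a sketch under stated assumptions rather than the implementation used for the reported results: the function names are hypothetical, the equal-allocation expression is rewritten using 1 + σ = 1/(β − 1), and values of n that would give a negative load fraction are skipped. The heuristic schedules for small wT_cp are not modelled.

```python
def beta_from_sigma(sigma):
    """beta = (sigma + 2) / (sigma + 1), with sigma = w*T_cp / (z*T_cm)."""
    return (sigma + 2.0) / (sigma + 1.0)


def load_fractions(n, rho, beta):
    """Load fractions alpha_1..alpha_n from Eq. (1); intended for rho > 1/n."""
    alpha1 = rho - (beta - 1.0) / (beta**n - 1.0) * (n * rho - 1.0)
    return [beta**(i - 1) * alpha1 - rho * (beta**(i - 1) - 1.0)
            for i in range(1, n + 1)]


def normalized_time(n, rho, beta):
    """phi = T_p / (z*T_cm) for n PUs."""
    if rho <= 1.0 / n:
        # Equal allocation: T_p = T_r + z*T_cm + (z*T_cm + w*T_cp) / n,
        # divided by z*T_cm and using 1 + sigma = 1 / (beta - 1).
        return rho + 1.0 + 1.0 / ((beta - 1.0) * n)
    # Gaps in load transfer: T_p from Eq. (1), divided by z*T_cm.
    return (rho * (1.0 + 1.0 / (beta - 1.0) - n / (beta**n - 1.0))
            + 1.0 + 1.0 / (beta**n - 1.0))


def best_pu_count(rho, beta, n_max):
    """Return (n, phi) with the smallest phi for n in 1..n_max.

    Assumption of this sketch: an n whose allocation from Eq. (1) contains
    a negative load fraction is treated as infeasible and skipped.
    """
    best = None
    for n in range(1, n_max + 1):
        if rho > 1.0 / n and min(load_fractions(n, rho, beta)) < 0.0:
            continue
        phi = normalized_time(n, rho, beta)
        if best is None or phi < best[1]:
            best = (n, phi)
    return best
```

For instance, best_pu_count(0.5, 1.06, 18) scans up to 18 PUs for ρ = 0.5 with the β value quoted for Fig. 2.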

References

[1] G. D. Barlas. Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees. IEEE Trans. Parallel Distrib. Syst., 9(5):429–441, May 1998.
[2] V. Bharadwaj, D. Ghose, V. Mani, and T. G. Robertazzi. Scheduling divisible loads in parallel and distributed systems. IEEE CS Press, Sept. 1996.
[3] K. N. Vikram and V. Vasudevan. Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures. Accepted for publication in IEEE Trans. VLSI Syst.
