The Back-Propagation Learning Algorithm on the Meiko CS-2: Two Mapping Schemes

Antonio d'Acierno (1) and Salvatore Palma (1,2)

(1) IRSIP - CNR, Via P. Castellino 111, I-80131 Napoli
(2) IIASS, Via Pellegrino 19, I-84019 Vietri S/M (SA)

emails: [email protected]

Abstract. This paper deals with the parallel implementation of the back-propagation of errors learning algorithm. We propose two mapping schemes that yield two efficient parallel algorithms, implemented on the Meiko CS-2 MIMD parallel computer. The parallel algorithms, obtained from the sequential code by means of simple and well-localised modifications, are based on the use of a global operator whose straightforward hardware implementation could improve both the performance and the scalability of the proposed solutions.

1 Introduction

The behaviour of an Artificial Neural Network (ANN) is determined by parameters called weights, and a learning procedure is used to compute these parameters. Such a learning procedure tends to be very time-consuming, so it is natural to try to develop faster learning algorithms and/or to exploit the intrinsic parallelism of these systems in order to speed up the computation, since the massively parallel computers available today have the potential to provide both the computational power and the flexibility needed for high-speed simulations. However, although learning algorithms typically involve only local computations, the output of a unit usually depends on the outputs of many other units. Thus, without careful design, an implementation of such algorithms on a massively parallel (distributed memory) computer can easily spend the majority of its running time, say 75% or more, in communication rather than in actual computation. Hence, the mapping problem is not trivial and deserves attention, since the communication problem is crucial. The main motivation for this work was precisely to address this problem, with reference to a well-known learning algorithm.

In this paper we propose a fundamentally new approach to the parallel implementation of the Back-Propagation of errors learning Algorithm (BPA) [1], [2], employed for the training phase of feed-forward ANNs, which are used as pattern classifiers, feature detectors and for image compression. In these networks the topology is such that each neuron in a layer receives inputs from every node in the previous layer and sends its output only to neurons of the next layer. Hereafter we consider, for the sake of simplicity, networks with just one hidden layer; let Ni be the number of input neurons, Nh the number of hidden neurons, No the number of output neurons, W1[Nh][Ni] the connection matrix between

input and hidden layers, and W2[No][Nh] the connection matrix between hidden and output layers.

In the first phase of the BPA (forward phase) an input vector is presented to the network and values propagate forward through the network to compute the output vector. The output vector O is then compared with a target vector T (provided by a teacher), resulting in an error vector E = T - O. In the second phase (backward phase) the error vector values are propagated back through the network by defining, for each node, an error signal that depends on the derivative of the activation function. Specifically, the error signals for neurons belonging to layer i are determined from a weighted sum of the error signals of layer i+1, again using the connection weights (now backwards); the weighted sum is then multiplied by the derivative of the activation function. For the i-th output neuron the error term is the product of E(i) and the derivative of the activation function evaluated in O(i). Finally, the weight changes are evaluated: the weight change for the connection from neuron i to neuron j is proportional to the product of the error term of neuron j and the output of neuron i. In the "on-line" version of the algorithm the weight changes are applied as they are evaluated, while in the "batch" (or "by-epoch") version of the BPA the weight changes are accumulated to compute a total weight change after all (or some) training patterns have been presented.

In this paper we concentrate on the neural network partitioning problem, without considering training example parallelism at all; moreover, we do not consider excess parallelism, so we will use processor as a synonym of process and vice versa.
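For concreteness, the sketch below (not taken from the paper) implements one on-line BP step for a network with a single hidden layer in plain NumPy; the logistic activation, the learning rate eta and all identifiers are illustrative assumptions, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(W1, W2, x, t, eta=0.1):
    """One on-line back-propagation step for a 1-hidden-layer network.
    W1: (Nh, Ni) input-to-hidden weights, W2: (No, Nh) hidden-to-output weights."""
    # Forward phase: propagate the input through the two layers.
    h = sigmoid(W1 @ x)                           # hidden outputs, shape (Nh,)
    o = sigmoid(W2 @ h)                           # network outputs, shape (No,)

    # Backward phase: error terms (deltas) for output and hidden neurons.
    e = t - o                                     # error vector E = T - O
    delta_o = e * o * (1.0 - o)                   # output error terms (logistic derivative)
    delta_h = (W2.T @ delta_o) * h * (1.0 - h)    # hidden error terms

    # Weight update: change proportional to (error term of target) x (output of source).
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return o

# Tiny usage example with Ni = 4, Nh = 3, No = 2.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))
W2 = rng.normal(scale=0.1, size=(2, 3))
bp_step(W1, W2, x=rng.normal(size=4), t=np.array([0.0, 1.0]))
```

All the parallel schemes discussed below are partitionings of exactly these matrix-vector products and outer-product updates.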

2 Parallel BPA

A first glance at the standard equations used to describe the on-line BPA reveals that there are at least two degrees of parallelism in such an algorithm. First, there is the parallel processing performed by the many nodes of each layer (neuron parallelism [3]); this form of parallelism corresponds to viewing the calculations as matrix-vector products and mapping each row of the matrix onto a processor. Since the incoming activation values must be multiplied by a weight, there is parallelism also at the neuron level (synapse parallelism [3]); this form of parallelism corresponds to viewing the calculations again as matrix-vector products and mapping each column of the matrix onto a processor. Despite this, an effective parallel implementation of the BPA that splits the connection matrices among processors is not a simple task, since such an implementation must efficiently perform the product of a matrix (W2) by a vector as well as the product of the transpose of W2 by a vector. At present, the exploitation of neuron parallelism (vertical slicing), possibly combined with training example parallelism, is the most widely used method to obtain fast simulations of the learning phase of feed-forward networks. The approach proposed here is an attempt to improve the performance through the use of a mixture of neuron parallelism and synapse parallelism.
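To make the two forms of parallelism concrete, the sketch below partitions the same matrix-vector product by rows (neuron parallelism: each processor holds complete rows and produces complete output components) and by columns (synapse parallelism: each processor holds complete columns and produces a partial vector that must be combined). The P-way loops only emulate P processors on one machine; this is an illustrative sketch, not the Meiko code.

```python
import numpy as np

def matvec_by_rows(W, x, P):
    """Neuron parallelism: processor p owns a block of rows of W and computes
    the corresponding components of W @ x with no combination step."""
    row_blocks = np.array_split(W, P, axis=0)
    return np.concatenate([block @ x for block in row_blocks])

def matvec_by_cols(W, x, P):
    """Synapse parallelism: processor p owns a block of columns of W (and the
    matching slice of x); the P partial vectors must then be summed."""
    col_blocks = np.array_split(W, P, axis=1)
    x_blocks = np.array_split(x, P)
    partials = [block @ xb for block, xb in zip(col_blocks, x_blocks)]
    return np.sum(partials, axis=0)   # the combination step

W = np.arange(12.0).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(matvec_by_rows(W, x, 2), W @ x)
assert np.allclose(matvec_by_cols(W, x, 2), W @ x)
```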

With reference to the on-line BPA, let us suppose that (i) the memory of each processor is unlimited and (ii) the number Nh of hidden neurons equals the number P of processors to be used. Under these hypotheses, we can assume that each process knows all the training patterns and handles one hidden neuron. As a first step, each process evaluates the activation value of the handled hidden neuron; supposing that the i-th process knows the i-th row of W1, there is no communication and the load is perfectly balanced. To evaluate the output values of the output neurons, each process in our implementation (say process s) calculates the vector J(s), i.e. the product of the output of hidden neuron s and the s-th column of W2; suppose that such a column is handled by process s. Once the vectors J have been evaluated (and this evaluation is performed in a fully parallel and perfectly balanced way), they are summed through a read, sum and send algorithm; the processes are thus logically organised as a tree rooted at a master process. The master completes the evaluation of the net inputs to the output neurons (by subtracting biases, if any), applies the activation function and evaluates the error terms of the output neurons. Such error terms are then broadcast to the slave processes. Once the error terms of the output neurons have been received, the error of the handled hidden neuron must be evaluated. This step only requires that the s-th process knows the s-th column of W2; since this hypothesis is verified, there is no data to be communicated and the load is perfectly balanced. Finally, each weight must be updated; since each process knows everything it needs to evaluate the weight changes for the connections it handles, once again there is no communication and the load is perfectly balanced.

To summarise, the implementation proposed here uses a mixture of synapse parallelism and neuron parallelism, so that one all-to-one broadcast and one one-to-all broadcast are required for each training pattern. More precisely, it uses:

- neuron parallelism to evaluate the activation values of the hidden neurons;
- synapse parallelism to evaluate the activation values of the output neurons;
- neuron parallelism to evaluate the error terms of the hidden neurons.
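Under the assumptions above (P = Nh, with process s owning row s of W1 and column s of W2), this first mapping scheme can be emulated on a single machine as follows; the plain np.sum stands in for the read, sum and send combination (the gdsum of Section 3), the broadcast of the output error terms is implicit because all emulated processes share one address space, and every identifier is an illustrative assumption rather than the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step_scheme1(W1, W2, x, t, eta=0.1):
    """One on-line BP step under the first mapping scheme, emulated sequentially.
    Process s (s = 0..Nh-1) owns W1[s, :] and W2[:, s]."""
    Nh = W1.shape[0]

    # (1) Neuron parallelism: each process evaluates its own hidden activation.
    h = np.array([sigmoid(W1[s, :] @ x) for s in range(Nh)])

    # (2) Synapse parallelism: process s builds J(s) = h[s] * W2[:, s];
    #     the J vectors are then combined (read-sum-and-send tree / gdsum).
    J = [h[s] * W2[:, s] for s in range(Nh)]
    net_o = np.sum(J, axis=0)              # emulated all-to-one combination
    o = sigmoid(net_o)

    # Output error terms, then (conceptually) broadcast to every process.
    delta_o = (t - o) * o * (1.0 - o)

    # (3) Neuron parallelism again: process s evaluates its hidden error term
    #     using the column W2[:, s] it already owns -- no communication needed.
    delta_h = np.array([(W2[:, s] @ delta_o) * h[s] * (1.0 - h[s]) for s in range(Nh)])

    # Local weight updates: every process knows everything it needs.
    for s in range(Nh):
        W2[:, s] += eta * delta_o * h[s]
        W1[s, :] += eta * delta_h[s] * x
    return o
```

With the same weights, input and target, this produces exactly the same outputs and weight updates as the plain sequential step sketched in the Introduction; the only steps requiring communication are the combination of the J(s) vectors and the broadcast of the output error terms.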

Compared with vertical slicing, the method proposed here presents, in our opinion, some advantages. First, with vertical slicing the maximum number Pvsmax of processors that can be used equals the dimension of the smallest layer. Clearly, since Nh ≥ min(Ni, Nh, No), the proposed method potentially allows the efficient use of a higher number of processors. Second, even if the number P of available processors is less than Pvsmax, vertical slicing requires, to ensure perfect load balancing, that P exactly divides Ni, Nh and No; more precisely, the dimension of each layer must be modified (by adding dummy neurons) to satisfy this hypothesis, which of course introduces an overload. With our scheme, instead, the only condition to be satisfied is that P exactly divides Nh. Moreover, such a condition is strictly required only if synchronous machines are taken into account; if asynchronous machines are used, the topology of the network does not need modifications, so that no overload is introduced (clearly the best efficiency is obtained when P exactly divides Nh). Last, but not least, it is worth emphasising that the proposed method requires just an all-to-one associative broadcast and a one-to-all broadcast, while vertical slicing requires each process to communicate with every other process at least twice (even supposing that each process knows all the training patterns). Although the use of, say, systolic methodologies allows data exchanges to be performed very efficiently, it remains true that (i) systolic solutions do not allow the use of powerful communication facilities that could improve both the performance and the ease of porting sequential code to parallel code, and (ii) topologies different from the ring cannot be used (or, at least, cannot be used easily). Our mapping scheme, instead, can be mapped onto any topology and, because of the communication it requires, is able to fully exploit efficient communication facilities such as, for example, the scan functions of SIMD parallel machines (e.g. MasPar MP-1 and MP-2). Moreover, if some patterns can be processed in batch mode, it is possible to group the vectors to be transmitted, so that the overhead due to start-up operations can be reduced.
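As a small numerical illustration of this load-balancing argument (with made-up layer sizes), the snippet below counts the dummy neurons each approach would need; it is a sketch of the argument in the text, not a measurement.

```python
import math

def padded(n, P):
    """Smallest multiple of P that is >= n (i.e. the size after adding dummy neurons)."""
    return P * math.ceil(n / P)

def dummy_neurons_vertical_slicing(Ni, Nh, No, P):
    # Vertical slicing: every layer dimension must be divisible by P.
    return (padded(Ni, P) - Ni) + (padded(Nh, P) - Nh) + (padded(No, P) - No)

def dummy_neurons_proposed(Ni, Nh, No, P):
    # Proposed scheme: only Nh has to be divisible by P (and only on synchronous machines).
    return padded(Nh, P) - Nh

# Example: Ni = 250, Nh = 100, No = 30 on P = 8 processors.
print(dummy_neurons_vertical_slicing(250, 100, 30, 8))  # 6 + 4 + 2 = 12 dummy neurons
print(dummy_neurons_proposed(250, 100, 30, 8))          # 4 dummy neurons
```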

3 Implementation and Results

The proposed mapping scheme has been implemented on the Meiko CS-2, an MIMD parallel computer based on SuperSPARC nodes connected via an indirect logarithmic communication network. The parallel implementation is realised using, as the only parallel statement, the gdsum function, which realises a combination of vectors. This function works as follows: suppose that each of the P processes in a parallel application holds a vector X of dimension n, and let Xold_j be the vector held by process j before the gdsum(X, n) call; after the call, the i-th component of the vector Xnew in each process equals the sum over j = 1..P of Xold_j(i). Because of the structure of the gdsum, the work assigned to the master process has to be replicated and is executed by each process; this clearly requires that each process knows the input part as well as the output part of the whole training set. It is worth noting that, starting from a sequential code able to deal with a number of hidden neurons fixed at run-time, all we need to obtain the parallel code is to add a gdsum call before evaluating the error terms of the output neurons. A collection of P processes, each simulating a network with Ni input neurons, Nh/P hidden neurons and No output neurons, will then simulate the neural network we are dealing with. The generalisation to the situation where P does not divide Nh exactly is straightforward; clearly, care has to be paid to the parallel I/O subroutines.

The simple structure of the parallel software allows the performance to be modelled as follows:

    T_Par = T_Seq / P + Y * T_gdsum(n, P)                                    (1)

where T_gdsum(n, P) is the time needed to perform the combination of a vector of length n on P processors. On the Meiko CS-2 the gdsum is realised using a logarithmic number of steps, so that T_gdsum(n, P) = O(n * log P). In Eq. 1, clearly, we neglected the replication of the work assigned to the master. In the mapping we are discussing we have n = No and Y = 1 (i.e. just one gdsum on a vector of dimension No is performed per pattern). Figure 1 shows the performance obtained, as well as the difference between measured and theoretical times; the latter have been obtained experimentally, i.e. simply by measuring T_gdsum for each No and each P.
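For reference, Eq. 1 can be written down directly as a small function; the cost constant t_word below is a placeholder assumption (in the paper T_gdsum is measured experimentally rather than modelled with explicit constants).

```python
import math

def t_gdsum(n, P, t_word=1.0e-6):
    """Model of the gdsum cost on P processors: O(n * log P), with an assumed
    per-element, per-step cost t_word (placeholder, not a measured value)."""
    return n * math.log2(P) * t_word

def t_par(t_seq, P, Y, n, t_word=1.0e-6):
    """Eq. 1: T_Par = T_Seq / P + Y * T_gdsum(n, P)."""
    return t_seq / P + Y * t_gdsum(n, P, t_word)

def efficiency(t_seq, P, Y, n, t_word=1.0e-6):
    """Parallel efficiency: T_Seq / (P * T_Par)."""
    return t_seq / (P * t_par(t_seq, P, Y, n, t_word))

# First mapping scheme: one gdsum (Y = 1) on a vector of length No per pattern.
print(efficiency(t_seq=0.02, P=8, Y=1, n=256))
```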

[Figure 1 appears here: computing time in msecs versus the number of output neurons (64-256), for P = 2, 4 and 8; continuous lines show the Eq. 1 model.]

Fig. 1. First mapping scheme. The network has 256 input neurons and 128 hidden neurons; computing times have been measured on 1 iteration over 1 example. Continuous lines represent the times evaluated with the theoretical model.

Table 1 shows the efficiency obtained; we used No and P as parameters, since the efficiency trivially increases with both Ni and Nh (see Eq. 1). What we observe, apart from experimental errors due to random operating-system overhead, is that the efficiency decreases as both P and No increase; the result obtained in the worst case of our experiments (i.e. No = 256 and P = 8) is, on the other hand, considerably high (> 0.8), which makes the approach very interesting. Since both T_Seq and T_gdsum are linearly increasing functions of No, the efficiency is, from a theoretical point of view, asymptotically independent of No.

4 Generalisation

The main drawbacks of the proposed mapping scheme are the following: (i) it requires that each process knows the whole training set, and (ii) it cannot exploit all the available parallelism for networks whose input and output layers are significantly larger than the hidden layer (e.g. feed-forward ANNs designed for solving compression problems).

Table 1. First mapping scheme: the efficiency obtained when the network has 256 input neurons and 128 hidden neurons, with No and P as parameters.

            Output Neurons
  P      4     8     16    32    64    128   256
  2    0.99  1.02  1.01  1.01  1.00  1.00  1.02
  4    0.99  1.00  0.99  1.00  0.99  0.98  0.92
  6    0.96  0.99  0.93  0.93  0.95  0.91  0.87
  8    0.96  0.96  0.95  0.92  0.92  0.90  0.85

Both problems can be solved by inverting the scheme and using:

- synapse parallelism to evaluate the activation values of the hidden neurons;
- neuron parallelism to evaluate the activation values of the output neurons;
- synapse parallelism to evaluate the error terms of the hidden neurons.

In this case, each process has to simulate an ANN with Ni/P input neurons, Nh hidden neurons and No/P output neurons; of course, we now have Pmax = min(Ni, No). Using this second scheme, the porting of the sequential code to the parallel platform is obtained by adding one gdsum statement in the forward phase (before evaluating the output of the hidden neurons) and one gdsum statement in the backward phase, before completing the evaluation of the error terms of the hidden neurons (see the sketch after the list below). Figure 2 shows the performance obtained, as well as the difference between measured and expected times, evaluated using Eq. 1 with n = Nh and Y = 2. In this case the theoretical model is just an upper bound for the actual performance, since the training set is partitioned and therefore the number of cache conflicts decreases; in other words, the term T_Seq/P should be multiplied by a factor smaller than 1. Table 2 shows the efficiency obtained; it is worth noting the super-efficiency (due to the above-mentioned cache-handling effects) obtained in many cases.

To conclude, it is worth emphasising that both schemes can easily be generalised to networks with two or more hidden layers. For example, in the case of two hidden layers, the first mapping scheme can be extended using:

- neuron parallelism to evaluate the activation values of the first-layer hidden neurons;
- synapse parallelism to evaluate the activation values of the second-layer hidden neurons;
- neuron parallelism to evaluate the activation values of the output neurons;
- synapse parallelism to evaluate the error terms of the second-layer hidden neurons;
- neuron parallelism to evaluate the error terms of the first-layer hidden neurons.
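Returning to the single-hidden-layer case, the following sketch (again a sequential emulation with illustrative names, not the Meiko code) shows the second mapping scheme: process p owns Ni/P input components with the corresponding columns of W1 and No/P output neurons with the corresponding rows of W2, and each np.sum stands in for one of the two gdsum calls.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step_scheme2(W1, W2, x, t, P, eta=0.1):
    """One on-line BP step under the second mapping scheme, emulated sequentially.
    Process p owns Ni/P input components (with the matching columns of W1)
    and No/P output neurons (with the matching rows of W2)."""
    in_cols = np.array_split(np.arange(W1.shape[1]), P)   # input slice of each process
    out_rows = np.array_split(np.arange(W2.shape[0]), P)  # output slice of each process

    # Forward, step 1 (synapse parallelism + gdsum): partial hidden net inputs.
    partial_net_h = [W1[:, c] @ x[c] for c in in_cols]
    h = sigmoid(np.sum(partial_net_h, axis=0))            # emulated gdsum, then activation

    # Forward, step 2 (neuron parallelism): each process computes its own outputs.
    o = {p: sigmoid(W2[out_rows[p], :] @ h) for p in range(P)}
    delta_o = {p: (t[out_rows[p]] - o[p]) * o[p] * (1.0 - o[p]) for p in range(P)}

    # Backward (synapse parallelism + gdsum): partial hidden error terms.
    partial_err_h = [W2[out_rows[p], :].T @ delta_o[p] for p in range(P)]
    delta_h = np.sum(partial_err_h, axis=0) * h * (1.0 - h)  # emulated gdsum, then derivative

    # Local weight updates (no further communication).
    for p in range(P):
        W2[out_rows[p], :] += eta * np.outer(delta_o[p], h)
        W1[:, in_cols[p]] += eta * np.outer(delta_h, x[in_cols[p]])
    return np.concatenate([o[p] for p in range(P)])
```

Given the same weights, input and target, this returns the same outputs and produces the same weight updates as the sequential sketch in the Introduction; only the two emulated np.sum combinations would require communication on a real machine, which is why Y = 2 in Eq. 1 for this scheme.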

[Figure 2 appears here: computing time in msecs versus the number of hidden neurons (64-256), for P = 2, 4 and 8; continuous lines show the Eq. 1 model.]

Fig. 2. Second mapping scheme. The network has 512 input neurons and 512 output neurons; computing times have been measured on 1 iteration over 1 example. Continuous lines represent the times evaluated with the theoretical model.

Table 2. Second mapping scheme: the efficiency obtained when the network has 512 input neurons and 512 output neurons, with Nh and P as parameters.

            Hidden Neurons
  P      4     8     16    32    64    128   256
  2    1.01  1.00  1.12  1.04  1.13  1.14  1.12
  4    1.02  1.03  1.07  1.03  1.11  1.15  1.15
  6    0.94  1.01  1.04  0.98  1.10  1.14  1.16
  8    0.84  0.94  1.01  0.95  1.06  1.14  1.10


In this case we have Pmax = min(Nh1, No). The second mapping scheme is extended in a straightforward way, starting with synapse parallelism on the first hidden layer; in this case we have Pmax = min(Ni, Nh2).

5 Conclusions

In this paper we proposed two mapping schemes for the parallel implementation of the back-propagation learning algorithm; both schemes are based on the use

of a global operator. The parallel algorithms we realised and tested are obtained from the sequential codes with simple and well-localised modifications.

We obtained interesting results in terms of efficiency. For the first mapping scheme, the worst-case efficiency in our experiments, obtained with a compression network (i.e. Ni = No) and using the maximum number of processors available, is 0.85. With the second mapping scheme, even though the overhead due to the vector combination roughly doubles (i.e. Y = 2 in Eq. 1), even better results can be obtained, thanks both to the quality of the approach and to the fact that, by splitting the training set, the number of cache conflicts decreases.

The complexity of the gdsum on the Meiko CS-2 is O(N * log P); such a complexity can become O(N + log P) simply by using a tree of adders with pipelining. Moreover, the performance and the scalability can be increased by overlapping the combination of the i-th components of the vectors with the evaluation of the (i+1)-th partial components. What we need is to have no set-up times at all; this, of course, can be obtained by using a special-purpose communication and computation structure, whose definition is the subject of our work in progress.

A further improvement we are working on is related to the fact that each node of our machine is composed of two processors sharing the memory. Our algorithms are not able to use both processors efficiently, since the overhead of the gdsum function increases heavily because of conflicts on the communication processor. To overcome this problem, we are developing a new version of this function that exploits the shared-memory mechanisms.

Last, we can split the whole training set only if the number of hidden layers is odd; if this is not the case, we have the (unpleasant) situation that each process has to know each input pattern or each output pattern. The problem can clearly be solved by adding a hidden layer since, as a rule of thumb, what is performed by a network with, say, 2 hidden layers can also be performed by a network with 3 hidden layers; this solution, on the other hand, is not fully satisfactory, since the neural designer is forced to adapt the network to the parallel algorithm. We are therefore investigating the possibility of adding a hidden layer that performs a linear transformation, so that the resulting network is really equivalent to the original one; the overhead this introduces is under investigation.

References

1. Rumelhart, D.E., Hinton, G.E., and Williams, R.J.: Learning Representations by Back-propagating Errors. Nature 323 (1986) 533-536.
2. Rumelhart, D.E., and McClelland, J.L.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA (1986).
3. Nordstrom, T., and Svensson, B.: Using and Designing Massively Parallel Computers for Artificial Neural Networks. Journal of Parallel and Distributed Computing 14 (1992) 260-285.
