Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Vol. 1. Lipo Wang, Jagath C. Rajapakse, Kunihiko Fukushima, Soo-Young Lee, and Xin Yao (Editors)

EXPONENTIATED BACKPROPAGATION ALGORITHM FOR MULTILAYER FEEDFORWARD NEURAL NETWORKS

N Srinivasan(1,2), V Ravichandran, K L Chan(1), J R Vidhya(4), S Ramakrishnan, and S M Krishnan(1)

(1) Biomedical Engineering Research Centre, School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798
(2) Department of Electronics and Communication Engineering
(3) Department of Mathematics and Computer Applications
(4) Department of Information Technology
Sri Venkateswara College of Engineering, Sriperumbudur 602 105, India

ABSTRACT

The gradient descent backpropagation learning algorithm is based on minimizing the mean square error. An alternative to gradient descent is the exponentiated gradient descent algorithm, which minimizes the relative entropy. Exponentiated gradient descent applied to backpropagation is proposed for a multilayer feedforward neural network. The learning rules for changing the weights of the output layer as well as the hidden layer neurons in the network are developed. Simulations were performed to explore the convergence and learning of the backpropagation algorithm with exponentiated gradient descent. The accuracy obtained with exponentiated gradient descent backpropagation was comparable to that of gradient descent backpropagation, while convergence was faster. The results show that exponentiated gradient descent can be extended to a multilayer feedforward neural network and used in pattern classification applications.

1. INTRODUCTION

A multilayer feedforward network consists of layers of neurons in which, typically, all neurons in one layer are connected to all the neurons in the subsequent layer. Multilayer feedforward networks (FFN) have been applied successfully to various problems in learning and classification [1,2]. These neural networks typically use various learning algorithms for training. One of the most popular is the error backpropagation algorithm. Backpropagation is a supervised learning algorithm which trains a neural network using gradient descent, minimizing the mean square error between the network's output and the desired output. Once the network's error has decreased to the specified threshold level, the network is said to have converged and is considered to be trained.

Recently an exponentiated gradient descent (EGD) algorithm has been proposed which minimizes the relative entropy instead of the mean square error [3,4,5]. The EGD algorithm has been applied to applications such as acoustic echo cancellation [5]. It has been shown that when the target vector is sparse, the EGD algorithm typically converges faster than the GD algorithm. In this paper, the EGD algorithm is extended to a multilayer feedforward neural network, which can be used in various applications. The paper derives weight update rules for the output layer neurons as well as the hidden layer neurons. The EGD based backpropagation algorithm is applied to a pattern classification task and the results are presented.

2. GRADIENT DESCENT BACKPROPAGATION ALGORITHM

Let (x_1, y_1), (x_2, y_2), ..., (x_p, y_p) represent the p vector pairs used to train the network, where x_i \in R^N and y_i \in R^M. A backpropagation feedforward network with an input layer, an output layer and only one hidden layer is considered for analysis and simulation. An input vector x_p = (x_{p1}, ..., x_{pN}) is applied to the input layer of the network. The input units distribute the values to the hidden layer units. The net input to the jth hidden unit is

net_{pj}^h = \sum_i w_{ji}^h x_{pi} + \theta_j^h \qquad (1)

where the superscript 'h' refers to quantities on the hidden layer, w_{ji}^h is the weight on the connection from the ith input unit to the jth hidden unit, and \theta_j^h is the bias term.


The output of this node is given by

i_{pj} = f_j^h(net_{pj}^h) \qquad (2)

where f is the activation function. The net input and output for the kth output node are

net_{pk}^o = \sum_j w_{kj}^o i_{pj} + \theta_k^o \qquad (3)

o_{pk} = f_k^o(net_{pk}^o) \qquad (4)

where the superscript 'o' refers to quantities in the output layer, and w_{kj}^o is the weight on the connection between the jth hidden unit and the kth output unit.

The GD algorithm typically minimizes the function

U(w) = d(w_{t+1}, w_t) + \eta L(y_t, o_t) \qquad (5)

where d(w_{t+1}, w_t) = \frac{1}{2} \| w_{t+1} - w_t \|^2 is the squared Euclidean distance over all the components of the weight vectors, \eta is the learning rate, y_t is the desired output at time t, and o_t is the actual output of the algorithm at time t. Setting \partial U(w)/\partial w = 0 and using the squared Euclidean distance,

w_{t+1} - w_t + \eta L'_{y_t}(o_t) x_t = 0 \qquad (6)

which when rearranged results in

w_{t+1} = w_t - \eta L'_{y_t}(o_t) x_t. \qquad (7)

The direction in which to change the weights is determined by calculating the negative of the gradient of L_p with respect to the weights w_{t+1,i}. Then the values of the weights can be adjusted such that the total loss is reduced. Thus, the gradient descent algorithm updates the weight vector by subtracting from it the gradient L'_{y_t}(o_t) x_t multiplied by the scalar \eta.

The GD algorithm is applied to a backpropagation based multilayer feedforward neural network. The loss at a single neuron in the output layer is (y_{pk} - o_{pk}), where y_{pk} is the desired output value and o_{pk} is the actual output value of the kth unit for the pth input. The loss minimized by the GD algorithm is the sum of the squared losses over all the output units,

L_p = \frac{1}{2} \sum_k (y_{pk} - o_{pk})^2. \qquad (8)

The weight changes are proportional to the gradient of L_p in the GD backpropagation algorithm. The gradient of L_p with respect to the output layer weights is

\frac{\partial L_p}{\partial w_{kj}^o} = -(y_{pk} - o_{pk}) f_k^{o\prime}(net_{pk}^o) i_{pj}. \qquad (9)

If the output function f_k^o is linear, then

\frac{\partial L_p}{\partial w_{kj}^o} = -(y_{pk} - o_{pk}) i_{pj}. \qquad (10)

If the output function f_k^o is sigmoidal, then

\frac{\partial L_p}{\partial w_{kj}^o} = -(y_{pk} - o_{pk}) \, o_{pk} (1 - o_{pk}) \, i_{pj}. \qquad (11)

Similarly, the gradient of L_p with respect to the hidden layer weights is

\frac{\partial L_p}{\partial w_{ji}^h} = -f_j^{h\prime}(net_{pj}^h) \, x_{pi} \sum_k (y_{pk} - o_{pk}) f_k^{o\prime}(net_{pk}^o) \, w_{kj}^o.

The weights on the output layer nodes are updated using

w_{kj}^o(t+1) = w_{kj}^o(t) - \eta \frac{\partial L_p}{\partial w_{kj}^o}

and the weights on the hidden layer nodes are updated using

w_{ji}^h(t+1) = w_{ji}^h(t) - \eta \frac{\partial L_p}{\partial w_{ji}^h}.

The above formulas are used in updating the synaptic weights of a multilayer feedforward neural network with the GD backpropagation algorithm.
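To make the GD backpropagation step concrete, the following minimal NumPy sketch implements equations (1)-(11) and the two update rules above for a single-hidden-layer network with sigmoidal units in both layers. The function name, array shapes and the choice of NumPy are assumptions of this illustration, not part of the original paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gd_backprop_step(x, y, W_h, theta_h, W_o, theta_o, eta=0.1):
        # Forward pass: equations (1)-(2) for the hidden layer, (3)-(4) for the output layer
        net_h = W_h @ x + theta_h            # net^h_pj = sum_i w^h_ji x_pi + theta^h_j
        i_p   = sigmoid(net_h)               # i_pj = f^h_j(net^h_pj)
        net_o = W_o @ i_p + theta_o          # net^o_pk = sum_j w^o_kj i_pj + theta^o_k
        o_p   = sigmoid(net_o)               # o_pk = f^o_k(net^o_pk)

        # Loss L_p = 1/2 sum_k (y_pk - o_pk)^2, equation (8)
        loss = 0.5 * np.sum((y - o_p) ** 2)

        # Gradients: sigmoidal output case, equation (11), and the chain rule for the hidden layer
        delta_o  = -(y - o_p) * o_p * (1.0 - o_p)          # dL_p/dnet^o_pk
        grad_W_o = np.outer(delta_o, i_p)                   # dL_p/dw^o_kj
        delta_h  = (W_o.T @ delta_o) * i_p * (1.0 - i_p)    # dL_p/dnet^h_pj
        grad_W_h = np.outer(delta_h, x)                     # dL_p/dw^h_ji

        # GD weight updates; the biases are updated in the same way
        W_o -= eta * grad_W_o
        theta_o -= eta * delta_o
        W_h -= eta * grad_W_h
        theta_h -= eta * delta_h
        return loss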

3. EXPONENTIATED GRADIENT DESCENT BACKPROPAGATION ALGORITHM

The EGD algorithm based on backpropagation for a multilayer feedforward neural network is developed from the EG± algorithm [3,4]. Kivinen and Warmuth [3,4] developed the EG algorithm and its modifications (EG±) and derived worst-case bounds for these algorithms. The EG algorithms are shown to be comparable or better than the GD algorithms for sparse input vectors [3,4,5].

3.1. EG Algorithm

The EG algorithm results from using for d the relative entropy, also known as the Kullback-Leibler divergence,

d_{re}(w_{t+1}, w_t) = \sum_{i=1}^{N} w_{t+1,i} \ln \frac{w_{t+1,i}}{w_{t,i}}

as the distance measure. The EG algorithm assumes that all the components w_{t,i} and w_{t+1,i} are positive, and the constraints \sum_i w_{t,i} = \sum_i w_{t+1,i} = 1 are maintained.

Entropy measures have many applications in various fields [6,7], and relative entropy is used as the distance measure for the loss function in the EG algorithm and its variants [3,4]. The EG algorithms use the property \sum_{i=1}^{N} w_{t+1,i} = 1 in addition to a distance measure. The weight updates are derived by introducing a Lagrangian multiplier \gamma and solving the equations

\frac{\partial}{\partial w_{t+1,i}} \left[ d(w_{t+1}, w_t) + \eta L(y_t, o_t) \right] + \gamma = 0 \qquad (18)

for i = 1, ..., N, together with the additional equation

\sum_{i=1}^{N} w_{t+1,i} = 1. \qquad (19)

The use of relative entropy as the distance measure satisfies the constraint \sum_i w_{t,i} = \sum_i w_{t+1,i} = 1, and equation (18) becomes

\ln \frac{w_{t+1,i}}{w_{t,i}} + 1 + \eta L'_{y_t}(o_t) x_{t,i} + \gamma = 0 \qquad (20)

where

L'_{y_t}(o_t) = \left. \frac{\partial L(y_t, o)}{\partial o} \right|_{o = o_t}. \qquad (21)

Solving for w_{t+1,i} results in

w_{t+1,i} = w_{t,i} \, r_i \, e^{-\gamma - 1} \qquad (22)

where r_i = \exp(-\eta L'_{y_t}(o_t) x_i). The update rule obtained by applying equation (19) is

w_{t+1,i} = \frac{w_{t,i} r_i}{\sum_{j=1}^{N} w_{t,j} r_j}. \qquad (23)
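As a small worked illustration of (22)-(23) (the numbers here are illustrative and not from the paper): take N = 2, w_t = (0.6, 0.4), x_t = (1, 0), target y_t = 1, squared loss and \eta = 1. The prediction is o_t = 0.6, so L'_{y_t}(o_t) = -(y_t - o_t) = -0.4, giving r_1 = e^{0.4} \approx 1.49 and r_2 = e^0 = 1. Equation (23) then yields w_{t+1} \approx (0.69, 0.31): the weight on the active component grows multiplicatively while the weights continue to sum to 1.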

It can be noted that the update rule keeps the weights w_{t+1,i} positive if the weights w_{t,i} are positive. In the EG algorithm, the weights are initialized to a start vector s with \sum_i s_i = 1 and s_i \ge 0 for all i. Upon receiving the tth instance x_t, the predicted output is o_t = w_t \cdot x_t. Upon receiving the tth outcome y_t, the weights are updated according to the rule

w_{t+1,i} = \frac{w_{t,i} r_{t,i}}{\sum_{j=1}^{N} w_{t,j} r_{t,j}} \qquad (24)

where

r_{t,i} = \exp(-\eta L'_{y_t}(o_t) x_{t,i}). \qquad (25)

The EG algorithm has a loss function, a start vector, and a learning rate as its parameters. In the update of EG, each weight is multiplied by a factor r_{t,i} that is obtained by exponentiating the ith component of the gradient of L(y_t, o_t). After the multiplication, the weights are normalized so that they sum to 1. The weights clearly never change sign and remain positive. Hence the weight vector w_t of EG is always a probability vector, i.e., it satisfies \sum_i w_{t,i} = 1 and w_{t,i} \ge 0 for all i. Therefore, the prediction w_t \cdot x_t is a weighted average of the input variables x_{t,i}, and w_t gives the relative weights of the components in this weighted average. In contrast, the total weight \|w_t\|_1 can change in the GD algorithm.
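A minimal sketch of the EG update of equations (24)-(25) for a single linear neuron is given below; the squared loss and the NumPy-based names are assumptions of this illustration.

    import numpy as np

    def eg_update(w, x, y, eta=0.1):
        # One EG step for a linear prediction o_t = w_t . x_t, equations (24)-(25),
        # assuming the squared loss L(y, o) = 1/2 (y - o)^2.
        o = w @ x                                # prediction o_t
        dL_do = -(y - o)                         # L'_{y_t}(o_t) for squared loss
        r = np.exp(-eta * dL_do * x)             # r_{t,i}, equation (25)
        w_new = w * r                            # multiplicative update
        return w_new / w_new.sum()               # renormalize so the weights sum to 1, equation (24)

    # The weight vector remains a probability vector after every update:
    w = np.full(4, 0.25)
    x = np.array([1.0, 0.0, 1.0, 0.0])
    w = eg_update(w, x, y=1.0)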

Kivinen and Warmuth [4] proposed a variation of the EG algorithm, called the EG± algorithm, because the weight vector always being a probability vector restricts the ability of EG to learn more general linear relationships. The EG± algorithm can best be understood as a way of generalizing the EG algorithm to more general weight vectors by using a reduction. Given a trial input sequence S to a neural network, let S± be a modified trial input sequence obtained from S by replacing each instance x_t by x'_t = (U x_{t,1}, ..., U x_{t,N}, -U x_{t,1}, ..., -U x_{t,N}), which doubles the number of dimensions. For a start vector pair (s^+, s^-) of EG±, w_1^+ = U s^+ and w_1^- = U s^-. This transformation leads to an algorithm that in effect uses a weight vector (w_t^+ - w_t^-), which can contain negative components. By using the scaling factor U, the weight vector w_t^+ - w_t^- can range over all vectors w \in R^N for which \|w\|_1 \le U. Although \|w_t^+\|_1 + \|w_t^-\|_1 is always exactly U, vectors w_t^+ - w_t^- with \|w_t^+ - w_t^-\|_1 < U result simply from having both w_{t,i}^+ > 0 and w_{t,i}^- > 0 for some i. The parameters of EG± are a loss function L, a scaling factor U, a pair (s^+, s^-) of start vectors in [0,1]^N with \sum_{i=1}^{N} (s_i^+ + s_i^-) = 1, and a learning rate \eta. The output value of a particular neuron is computed by multiplying (w_t^+ - w_t^-) with the input values. The EG± algorithm discussed above is used to develop the weight update equations for the output layer neurons and the hidden layer neurons in a multilayer feedforward neural network.

3.2. Updating weights in the output layer

The weights on the output layer nodes are updated using

w_{kj}^o(t+1) = w_{kj}^{o+}(t+1) - w_{kj}^{o-}(t+1) \qquad (30)

where w_{kj}^{o+}(t+1) and w_{kj}^{o-}(t+1) are obtained from EG±-style multiplicative updates (equations (31)-(38)). If the output function is sigmoidal, the gradient term appearing in the exponent follows from (11). In equations (31) and (38), U is the scaling factor representing the total weight on the output layer nodes. In the weight updating formulas, it is to be noted that the component of the gradient appears in the exponent of the factor that multiplies the weight values.

3.3. Updating weights in the hidden layer

The weights on the hidden layer nodes are updated using

w_{ji}^h(t+1) = w_{ji}^{h+}(t+1) - w_{ji}^{h-}(t+1) \qquad (39)

where w_{ji}^{h+}(t+1) is given by the corresponding EG±-style update (equations (40)-(47)), and U is the scaling factor representing the total weight on the hidden layer nodes. Similarly,

w_{ji}^{h-}(t+1) = U \, \frac{r_{ji}^{h-}(t) \, w_{ji}^{h-}(t)}{\sum_i \left[ r_{ji}^{h+}(t) \, w_{ji}^{h+}(t) + r_{ji}^{h-}(t) \, w_{ji}^{h-}(t) \right]}. \qquad (48)

Thus the new weight update relationships have been derived for the EGD algorithm.
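Since the individual update equations (31)-(38) and (40)-(47) are not reproduced above, the following sketch only illustrates the general EG±-style pattern exemplified by equation (48): each effective weight is kept as the difference of a positive and a negative component, both components are multiplied by factors whose exponents contain the corresponding gradient component, and the components are renormalized so that the total weight of each unit equals the scaling factor U. The function and variable names, and the exact placement of the gradient in the exponent, are assumptions of this illustration rather than the paper's equations.

    import numpy as np

    def egd_layer_update(w_pos, w_neg, grad, eta=0.005, U=1.0):
        # w_pos, w_neg: 2-D arrays (units x incoming connections); the effective
        # weights are w_pos - w_neg, as in equations (30) and (39).
        # grad: dL_p/dw for the effective weights, e.g. from (11) or the hidden-layer gradient.
        r_pos = np.exp(-eta * grad)    # gradient component appears in the exponent
        r_neg = np.exp(+eta * grad)    # opposite sign for the negative component
        num_pos = r_pos * w_pos
        num_neg = r_neg * w_neg
        # renormalize over the incoming connections of each unit, as in the denominator of (48)
        denom = np.sum(num_pos + num_neg, axis=1, keepdims=True)
        return U * num_pos / denom, U * num_neg / denom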

4. SIMULATION RESULTS

Experiments were conducted to compare the performance of the gradient descent and EGD algorithms using sparse input vectors. The EGD backpropagation algorithm computes the exponent of the gradient term of the net loss in the network. A multilayer feedforward network with one hidden layer was used for testing the performance of the EGD network. The input layer consisted of a 7x5 matrix representing the sample input patterns. The input vectors were representations of the digits "0" to "9". Each neuron (representing a pixel) received a binary input (either on or off) depending on the digit given as input. The output layer consisted of 10 neurons to represent the 10 inputs (0 to 9). The terminating gradient threshold was set at 0.001. Training was performed using 20 vectors (2 per digit) and testing was conducted with 20 vectors which were noisy versions of the input vectors. The learning rate was varied and the performance of the GD and EGD backpropagation algorithms was evaluated. The EGD algorithm performed best for learning rates less than 0.01, and its performance decreased relative to the GD algorithm for higher learning rates. The backpropagation neural network with both GD and EGD performed well for the given set of input-output pairs for learning rates less than 0.01.


Figure 1 shows the number of iterations taken by the GD and EGD algorithms to reach the threshold when learning all the sample inputs. As seen from Fig. 1, EGD takes less time than GD to reach the threshold and therefore converges faster.

Figure 1: Performance of the EGD and GD backpropagation algorithms.
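The input encoding used in the simulations can be sketched as follows; the concrete pixel patterns for each digit and the way noise was added to the test vectors are not specified in the paper, so the patterns and the pixel-flipping noise below are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder 7x5 binary pixel patterns for the digits 0-9, flattened to 35 inputs;
    # the output targets use 1-of-10 coding over the 10 output neurons.
    digit_patterns = rng.integers(0, 2, size=(10, 7 * 5)).astype(float)
    targets = np.eye(10)

    def noisy_copy(pattern, n_flips=2):
        # Noisy test vector obtained by flipping a few pixels (assumed noise model).
        noisy = pattern.copy()
        idx = rng.choice(pattern.size, size=n_flips, replace=False)
        noisy[idx] = 1.0 - noisy[idx]
        return noisy

    test_vectors = np.array([noisy_copy(p) for p in digit_patterns])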

5. CONCLUSIONS

A backpropagation algorithm based on EGD, which minimizes the relative entropy, has been proposed, and weight update rules based on EGD have been developed for the neurons in a multilayer feedforward neural network. Simulations performed for a pattern classification problem show that the EGD algorithm performs well and, like GD based backpropagation, could be used in other pattern classification problems. The EGD algorithm is currently being explored for applications in biomedical problems. Further work can examine the generalization capabilities of the EGD backpropagation network under various conditions.

6. REFERENCES

[1] Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan Publishing Company, New Jersey, 1994.

[2] James A. Freeman and David M. Skapura. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1991.

[3] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), pp. 1-64, 1997.

[4] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Proceedings of the Annual ACM Symposium on the Theory of Computing, 1995.

[5] S. I. Hill and R. C. Williamson. Convergence of exponentiated gradient algorithms. IEEE Transactions on Signal Processing, vol. 49, pp. 1208-1215, 2001.

[6] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles with Applications. Academic Press, 1992.

[7] G. Jumarie. Relative Information. Springer-Verlag, 1990.
