The Influence of Gaussian, Uniform, and Cauchy Perturbation Functions in the Neural Network Evolution Paulito P. Palmes and Shiro Usui RIKEN Brain Science Institute 2-1 Hirosawa, Wako, Saitama 351-0198 JAPAN
[email protected] [email protected]
Abstract. The majority of algorithms in the field of evolutionary artificial neural networks (EvoANN) rely on the proper choice and implementation of the perturbation function to maintain their population's diversity from generation to generation. Maintaining diversity is an important factor in the evolution process since it helps the population of ANN (Artificial Neural Networks) escape local minima. To determine which of the perturbation functions are ideal for ANN evolution, this paper analyzes the influence of the three most commonly used functions, namely: Gaussian, Cauchy, and Uniform. Statistical comparisons are conducted to examine their influence on the generalization and training performance of EvoANN. Our simulations on the glass classification problem indicate that for mutation-with-crossover-based EvoANN, generalization performance among the three perturbation functions is not significantly different. On the other hand, mutation-based EvoANN with Gaussian mutation performs as well as the crossover-based system but performs worse when it uses either the Uniform or the Cauchy distribution function. These observations suggest that the crossover operation becomes a significant operation in systems that employ strong perturbation functions but has less significance in systems that use weak or conservative perturbation functions.
1 Introduction
There are two major approaches to evolving a non-gradient-based population of neural networks: the mutation-based approach, which follows EP (Evolutionary Programming) or ES (Evolution Strategies) concepts, and the crossover-based approach, which is based on the GA (Genetic Algorithm) implementation. While the former relies heavily on the mutation operation, the latter considers crossover to be the dominant operation of evolution. Common to both approaches is the choice of the perturbation function, which is responsible for introducing new characteristics and information into the population. Since the selection process favors individuals with better fitness for the next generation, it is important that later generations are not populated by individuals that are too similar, to avoid the possibility of being stuck in a local minimum. This issue is addressed through the proper choice and implementation of the perturbation function, the encoding scheme, the selection criteria, and the formulation of the fitness function. In this study, we are interested in the first issue. The SEPA (Structure Evolution and Parameter Adaptation) [4] evolutionary neural network model is chosen for the implementation to ensure that the main driving forces of evolution are the perturbation function and the crossover operation. The SEPA model does not use any gradient information and relies only on its mutation's perturbation function and crossover operation for ANN evolution.
2 Related Study
Several studies have examined the influence of different perturbation functions in the area of optimization. While Gaussian mutation is the predominant function in numerical optimization, the work of [7] indicated that local convergence was similar between Gaussian and spherical Cauchy mutations but slower with non-spherical Cauchy. Studies by [8] on evolutionary neural networks found that Cauchy mutation had better performance than Gaussian mutation on multimodal problems with many local minima; for problems with few local minima, both functions had similar performance. A study by [1] combined the Gaussian and Cauchy distributions by taking the mean of a random variable drawn from a Gaussian and a random variable drawn from a Cauchy. Preliminary results showed that the new function performed as well as or better than the plain Gaussian implementation. Common to these approaches is the system's reliance on the perturbation function to effect gradual changes to its parameters so that the system can find a better solution. In a typical implementation, the perturbation function undergoes adaptation together with the variables to be optimized. Equations (1) and (2) describe a typical implementation using Gaussian self-adaptation [1]:

    η′ = η + ηN(0, 1)    (1)

    x′ = x + η′N(0, 1)    (2)

where x is the vector of variables to be optimized; η is the vector of search step parameters (SSP), each undergoing self-adaptation; and N is the vector of Gaussian functions with mean 0 and standard deviation controlled by the respective SSPs. Typical implementations in evolutionary neural networks follow a similar formulation for the mutation of weights:

    w′ = w + N(0, αE(ϕ))    ∀w ∈ ϕ

where N(0, αE(ϕ)) is the Gaussian perturbation with mean 0 and standard deviation αE(ϕ); w is a weight; E(ϕ) is an error function of network ϕ (e.g., the mean-squared error); and α is a user-defined scaling constant.
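The self-adaptation scheme of Eqs. (1) and (2) can be sketched in a few lines of Python. The function names and the per-component loop are illustrative, not from the original implementation; the Cauchy and Uniform samplers are included because the paper compares all three perturbation functions, and `mutate_weight` mirrors the weight-mutation rule, with `error` standing in for E(ϕ):

```python
import math
import random

def gaussian():
    return random.gauss(0.0, 1.0)

def cauchy():
    # Standard Cauchy via inverse CDF: tan(pi*(u - 0.5)), u ~ U(0, 1)
    return math.tan(math.pi * (random.random() - 0.5))

def uniform():
    return random.uniform(-1.0, 1.0)

def self_adapt_step(x, eta, sample=gaussian):
    """One self-adaptation step: eta' = eta + eta*N(0,1); x' = x + eta'*N(0,1)."""
    eta_new = [e + e * sample() for e in eta]
    x_new = [xi + e * sample() for xi, e in zip(x, eta_new)]
    return x_new, eta_new

def mutate_weight(w, alpha, error, sample=gaussian):
    """Weight mutation w' = w + N(0, alpha*E): std dev scales with network error."""
    return w + alpha * error * sample()
```

Swapping `sample=cauchy` or `sample=uniform` into `self_adapt_step` reproduces the three variants compared in this paper; only the sampler changes, not the adaptation rule.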
Unlike in a typical function optimization problem, where the main goal is to optimize the objective function, the goal of neural network evolution is to find the most suitable architecture with the best generalization performance. Good network training performance under a certain perturbation function does not necessarily translate into good generalization performance, due to overfitting. It is important, therefore, to study the influence of the different perturbation functions on both the training and the generalization performance of ANN. Moreover, knowing which combinations of mutation and adaptation strategies suit a particular perturbation function and problem domain will be a big help in neural network implementation. These issues will be examined in the future. In this paper, our discussion is limited to the performance of EvoANN on the glass classification problem taken from the UCI repository [2].
3 Evolutionary ANN Model
Fig. 1. Evolutionary ANN: a) ANN; b) SEPA Representation of ANN. (The diagram shows a network and its SEPA encoding as weight matrices W1 and W2, with a deleted hidden node indicated.)
Neural network implementation can be viewed as an optimization problem where the goal is to search for the best network configuration with good performance in training, testing, and validation. This is achieved by training the network so that it adjusts its architecture and weights based on the constraints imposed by the problem. The SEPA model (Fig. 1) used in this study addresses this issue by making the weight and architecture searches a single process controlled by mutation and crossover. Changes caused by mutation and crossover induce corresponding changes to the weights and architecture of the ANN at the same time [3]. In this manner, the major driving force of evolution in SEPA is the implementation of the crossover and mutation operations, which makes the choice of the perturbation function and the implementation of adaptation, mutation, and crossover very important for the successful evolution of the network. Below is a summary of the SEPA approach:

1. At iteration t = 0, initialize a population P(t) = {net_1^t, ..., net_µ^t} of µ individuals randomly:

       net_i = {W1_i, W2_i, θw1_i, θw2_i, ρ(pr_i, m_i, σ_i)}

   where W1, W2 are the weight matrices; θw1, θw2 are the threshold vectors; ρ is the perturbation function; pr is the mutation probability; m is the strategy parameter; and σ is the step size parameter (SSP).
2. Compute the fitness of each individual based on the objective function Qfit [5]:

       Qfit = α·Qacc + β·Qnmse + γ·Qcomp

   where Qacc is the percentage error in classification; Qnmse is the percentage of normalized mean-squared error (NMSE); Qcomp is the complexity measure, i.e., the ratio between the active connections c and the total number of possible connections ctot; and α, β, and γ are constants used to control the strength of influence of their respective factors.
3. Using a rank selection policy, repeat until µ individuals are generated: rank-select two parents, net_k and net_l, and apply the crossover operation by exchanging weights between W1_k and W1_l and between W2_k and W2_l:

       ∀(r, c) ∈ W1_k ∧ W1_l: if rand() < Θ, swap(W1_k[r][c], W1_l[r][c])
       ∀(r, c) ∈ W2_k ∧ W2_l: if rand() < Θ, swap(W2_k[r][c], W2_l[r][c])

   where Θ is initialized to a random value between 0 and 0.5.
4. Mutate each individual net_i, i = 1, ..., µ, by perturbing W1_i and W2_i using:

       δ_i = ρ(σ_i);  m′_i = m_i + ρ(δ_i);  w′_i = w_i + ρ(m′_i)

   where σ is the SSP (step size parameter); δ is the mutation strength intensity; ρ is the perturbation function; m is the adapted strategy parameter; and w is a weight chosen randomly from either W1 or W2.
5. Compute the fitness of each offspring using Qfit.
6. Using an elitist replacement policy, retain the best two parents and replace the remaining parents by their offspring.
7. Stop if the stopping criterion is satisfied; otherwise, go to step 2.
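Steps 2–4 above can be sketched compactly in Python. This is a sketch under assumptions, not the authors' code: the data layout (nested lists for weight matrices, a dict per individual) and the convention that ρ is a zero-centered sampler scaled by its argument are illustrative choices:

```python
import random

def q_fit(q_acc, q_nmse, q_comp, alpha=1.0, beta=0.7, gamma=0.3):
    """Step 2: Qfit = alpha*Qacc + beta*Qnmse + gamma*Qcomp (lower is better)."""
    return alpha * q_acc + beta * q_nmse + gamma * q_comp

def uniform_crossover(w_k, w_l, theta):
    """Step 3: swap matching entries of two weight matrices with probability theta."""
    for r in range(len(w_k)):
        for c in range(len(w_k[r])):
            if random.random() < theta:
                w_k[r][c], w_l[r][c] = w_l[r][c], w_k[r][c]

def mutate(net, perturb):
    """Step 4: delta = rho(sigma); m' = m + rho(delta); w' = w + rho(m')."""
    delta = perturb(net["sigma"])
    net["m"] += perturb(delta)
    w = random.choice(net["weights"])      # pick W1 or W2 at random
    r = random.randrange(len(w))
    c = random.randrange(len(w[0]))
    w[r][c] += perturb(net["m"])

# `perturb` is any zero-centered sampler scaled by its argument,
# e.g. Gaussian: lambda s: random.gauss(0.0, abs(s))
```

With Θ = 1 every entry is swapped, which makes the in-place crossover easy to verify; in SEPA itself Θ is drawn from (0, 0.5), so on average at most half the weights are exchanged.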
4 Experiments and Results
Two major SEPA variants were used to aid the analysis, namely: the mutation-based (mSEPA) and the mutation-crossover-based (mcSEPA, or standard SEPA). Each major variant is further divided into three categories based on the type of perturbation function used: mSEPA-c (Cauchy-based), mSEPA-g (Gaussian-based), and mSEPA-u (Uniform-based); mcSEPA follows the same categorization, namely mcSEPA-c, mcSEPA-g, and mcSEPA-u. Table 1 summarizes the important parameters and variables used by the different variants. The glass problem was chosen because its noisy data makes generalization difficult, which is a good way to discriminate robust variants. The sampling procedure divided the data into 50% training, 25% validation, and 25% testing [6]. The objective was to forecast the glass type (6 types) based on the results of the chemical analysis (6 inputs) using 214 observations. Table 2 shows the generalization performance of the different SEPA variants. The posthoc test in Table 2 uses Tukey's HSD, wherein average error results that are not significantly different are indicated by the same label (∗ or †).
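The 50/25/25 partition of the 214 observations can be sketched as follows. The shuffling and integer rounding here are assumptions for illustration; Proben1 [6] actually prescribes its own fixed splits:

```python
import random

def split_50_25_25(data, seed=0):
    """Shuffle and partition into 50% training, 25% validation, 25% test."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)        # seeded for reproducibility
    n_train = len(data) // 2
    n_valid = (len(data) - n_train) // 2
    train = [data[i] for i in idx[:n_train]]
    valid = [data[i] for i in idx[n_train:n_train + n_valid]]
    test = [data[i] for i in idx[n_train + n_valid:]]
    return train, valid, test
```

For 214 observations this yields 107 training, 53 validation, and 54 test examples; the validation set drives the stopping criterion in Table 1, while the test set is touched only for the final generalization figures in Table 2.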
Table 1. Features Implemented in SEPA for the Simulation

Feature              Implemented                  Comment
selection type       rank                         rank-sum selection
mutation type        gaussian/cauchy/uniform      depends on the variant
mutation prob        0.01
SSP size             σ = 100                      Uniform range is U(-100, 100)
crossover type       uniform                      Θ randomly assigned between (0, 0.5)
replacement          elitist                      retains two best parents
population size      100
no. of trials        30
max. hidden units    10
max. generations     5000
stopping criterion   validation sampling          evaluated at every 10th generation
fitness constants    α = 1.0, β = 0.7, γ = 0.3
classification       winner-takes-all
Table 2. ANOVA of Generalization Error in the Glass Classification Problem: Gaussian vs Uniform vs Cauchy

Variant              Average Error   Std Dev
mSEPA-g              0.3912∗         0.0470
mcSEPA-u             0.4006∗         0.0380
mcSEPA-g             0.4031∗         0.0516
mcSEPA-c             0.4113∗†        0.0626
mSEPA-u              0.4194†         0.0448
mSEPA-c              0.4453†         0.0649
Linear-BP [6]        0.5528          0.0127
Pivot-BP [6]         0.5560          0.0283
NoShortCut-BP [6]    0.5557          0.0370

∗, † (Tukey's HSD posthoc test classification using α = 0.05 level of significance)
Table 2 indicates that for mutation-based SEPA (mSEPA), Gaussian perturbation is significantly superior to the Uniform and Cauchy functions. For mutation-crossover-based SEPA (mcSEPA), there is no significant difference among the three perturbation functions. Furthermore, the table also indicates that every SEPA variant has superior generalization to any of the Backpropagation variants tested by Prechelt [6]. Since these results are limited to the glass classification problem and BP can be implemented in many ways, the comparison of SEPA with the BP variants is not conclusive and requires further study. Moreover, Figure 2 and Table 2 suggest that even though Uniform perturbation has the best training performance in mSEPA, it has the worst generalization performance. For mcSEPA, the performance of the three perturbation functions is similar.
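The comparisons in Table 2 rest on a one-way ANOVA followed by Tukey's HSD. A stdlib-only sketch of the F statistic is shown below (the HSD step itself needs studentized-range critical values, so it is omitted); the function name and data layout are illustrative:

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within for a list of samples (one list per variant)."""
    k = len(groups)                               # number of groups
    n = sum(len(g) for g in groups)               # total observations
    grand = sum(sum(g) for g in groups) / n       # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) indicates that at least one variant's mean error differs, after which a posthoc test such as Tukey's HSD identifies which pairs differ, producing the ∗/† groupings shown in Table 2.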
5 Conclusion
This preliminary study suggests that for evolutionary neural networks that rely solely on the mutation operation, Gaussian perturbation provides superior generalization performance relative to the Uniform and Cauchy functions. On the other hand, the introduction of a crossover operation significantly improves the performance of the Cauchy and Uniform functions. This also suggests that, in order to manage the complexity introduced by more chaotic perturbation functions such as the Uniform and Cauchy perturbations, a proper crossover operation
Fig. 2. Training Performance of the Different SEPA Variants: a) mSEPA; b) mcSEPA. Both panels plot correct classification (0.3–0.9) against generations (0–1000) for the Cauchy (-c), Gaussian (-g), and Uniform (-u) variants.
must be introduced to leverage and exploit the wider search coverage provided by these functions. The simulation also indicates that superior training performance in mutation-based evolution does not necessarily imply good generalization performance; it may even worsen generalization due to overly localized searching.
References

1. K. Chellapilla and D. Fogel. Two new mutation operators for enhanced search and optimization in evolutionary programming. In B. Bosacchi, J. C. Bezdek, and D. B. Fogel, editors, Proc. of SPIE: Applications of Soft Computing, volume 3165, pages 260–269, 1997.
2. P. M. Murphy and D. W. Aha. UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA, 1994.
3. P. Palmes, T. Hayasaka, and S. Usui. Evolution and adaptation of neural networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume II, pages 397–404, Portland, Oregon, USA, 19–24 July 2003. IEEE Computer Society Press.
4. P. Palmes, T. Hayasaka, and S. Usui. SEPA: Structure evolution and parameter adaptation. In E. Cantu-Paz, editor, Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, page 223, Chicago, Illinois, USA, 11–17 July 2003. Morgan Kaufmann.
5. P. Palmes, T. Hayasaka, and S. Usui. Mutation-based genetic neural network. IEEE Transactions on Neural Networks, 2004. Article in press.
6. L. Prechelt. Proben1: a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany, September 1994.
7. G. Rudolph. Local convergence rates of simple evolutionary algorithms with Cauchy mutations. IEEE Transactions on Evolutionary Computation, 1(4):249–258, 1997.
8. X. Yao, Y. Liu, and G. Lin. Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 3(2):82–102, 1999.