Bayesian Model Averaging over Decision Trees for Assessing Newborn Brain Maturity from Electroencephalogram

Livia Jakaite, Vitaly Schetinin, Carsten Maple
Department of Computer Science and Technology, University of Bedfordshire, Luton, UK
{livija.jakaite,vitaly.schetinin,carsten.maple}@beds.ac.uk

Joachim Schult
Department Mathematik, Bereich Geschichte der Naturwissenschaften, Universität Hamburg, Hamburg, Germany
Abstract—We use Bayesian Model Averaging (BMA) over Decision Trees (DTs) for assessing newborn brain maturity from clinical EEG. We found that within this methodology an appreciable portion of the EEG features is rarely used in the DT models, because these features make a weak contribution to the assessment. At the same time, the portion of DT models using the weak EEG features is large. The negative impact of this is twofold. First, the use of weak features obstructs the interpretation of DTs. Second, weak attributes increase the dimensionality of the model parameter space that needs to be explored in detail. We assumed that discarding the DTs that use weak features would reduce this negative impact, and we proposed a new technique accordingly. The technique has been tested on benchmark problems, and the results show that the original set of attributes can be reduced without a distinguishable decrease in BMA performance. On the EEG data, we found that the original set of features can be reduced from 36 to 12. Rerunning the BMA on the set of the 12 EEG features slightly improved the performance.

Keywords—feature selection, brain maturity, Bayesian Model Averaging
I. INTRODUCTION
Abnormal newborn brain development, being a life-threatening factor, should be diagnosed as early as possible. Clinical experts can assess brain maturity by analyzing electroencephalograms (EEG) recorded from sleeping newborns [1] - [3]. Such analysis is based on the clinical evidence that the post-conceptional and EEG-estimated ages of healthy newborns typically match each other, and that the newborn's brain maturity is most likely abnormal if the ages mismatch [2], [4]. It can take hours of expert work to confidently interpret the EEG. Another problem is that there are no formal rules for mapping the EEG features to post-conceptional ages, and for this reason experts' conclusions cannot be fully objective [16].

The established methodologies of automated brain maturity assessment are based on learning models from EEGs recorded from sleeping newborns whose brain maturity was already assessed by clinicians. The regression models are made capable of mapping the brain maturity into an EEG-based index [5]. The classification models are made capable of distinguishing maturity levels: at least one for normal and another for abnormal brain maturity [4], [6]. The established methodologies are based on learning a single model from given data and thus cannot provide an assessment of the uncertainty in decisions.

Probabilistic reasoning based on the methodology of Bayesian Model Averaging (BMA) enables evaluation of the uncertainty in decision making [7] – [9]. The use of Decision Trees (DTs) within BMA provides the benefit of selecting the features which make the most significant contribution to outcomes. The resultant DT ensemble can be interpreted by clinicians, as shown in [10]. However, the success of implementing BMA over DTs is critically dependent on the diversity and proportion of the sampled DTs. The DT models should be diverse in parameters and structure. Ideally, the portion of models with the highest likelihood should be the largest, in order to ensure unbiased results of averaging. However, in the absence of prior information about the size and structure of DT models, the use of such models within the MCMC technique has been found imperfect. This is because the MCMC technique tends to accept a DT with a larger number of splitting nodes rather than to explore a DT of the current configuration in detail. Missing such details decreases the chance of collecting samples of DT models drawn from the areas of interest in which the posterior likelihoods are maximal. This leads to disproportion in the collected samples and biased results of BMA.

A few approaches have been proposed to solve the above problem. One of the most efficient is based on a restarting strategy for the MCMC technique, which repeatedly reruns the BMA over DTs during a short period [17]. This strategy was shown to be efficient when sampling large DTs. The idea behind another approach was to limit the growth of DT models during a given period [9].

(The research project is funded by the Leverhulme Trust, UK.)
There are also publications proposing parallel and population MCMC techniques which run multiple Markov Chains [18]. In general, these techniques propose collecting samples of models obtained in multiple runs or multiple chains, but they overlook the importance of using additional information which can be obtained in a post-processing phase. The use of this information enables reducing the model parameter space and, subsequently, achieving better conditions for proportional sampling.

One way of providing better conditions for proportional sampling is to use a priori information about feature importance. However, in many practical cases this information is absent, and the provision of proportional sampling of models becomes problematic [11], [12]. In our previous research [13], we attempted to overcome the above BMA problems and developed a new feature selection strategy for Bayesian averaging over DTs to assess the survival of trauma patients based on a set of screening tests. We supposed that some tests make a weak contribution to the outcome, and that therefore avoiding such tests can improve the assessment accuracy. For clinical practice, it is important to reduce the number of screening tests required for making reliable decisions. In our experiments we used the UK Trauma data to test the efficiency of the proposed strategy and found that it enables reducing the number of screening tests while keeping the performance and reliability of decision making high. In this research we aim to further explore the proposed technique by applying it to the problem of EEG assessment of newborn brain maturity.

The paper is structured as follows. Section II states the problem of Bayesian assessment of newborn EEG maturity. Section III describes the basis of BMA over DTs, and Section IV describes the proposed method. In Section V, the proposed method is tested on some UCI benchmark problems [14] and then on the EEG data. Section VI concludes the paper.

II. PROBLEM STATEMENT
The newborn brain maturity can be estimated by experts from sleep EEG in terms of post-conceptional age measured in weeks. We aim to develop a BMA technique to assess brain maturity from sleep EEGs recorded via the C3-C4 electrode system. These recordings were transformed into the 6 standard frequency bands: Subdelta (0–1.5 Hz), Delta (1.5–3.5 Hz), Theta (3.5–7.5 Hz), Alpha (7.5–13.5 Hz), Beta (13.5–19.5 Hz) and Beta2 (19.5–20 Hz). The absolute and relative spectral powers calculated for electrodes C3, C4, and their sum form a multidimensional representation of the EEG data. No other information about the data is available from which to make assumptions about EEG feature importance.

In the absence of information about EEG feature importance, the standard BMA techniques cannot ensure unbiased results. This mainly happens because the detailed exploration of a multidimensional model space becomes problematic within a reasonable time. Such exploration is needed to ensure that the majority of models are sampled from the areas of maximal posterior. Otherwise, instead of exploring all possible areas of maximal posterior, the models will be sampled disproportionally from only some of these areas of interest. The negative effect of disproportional sampling is that the results of BMA become biased.
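As an illustration, absolute and relative band powers of the kind described above can be computed from a raw EEG channel, for example with Welch's periodogram. This is only a sketch: the sampling rate, segment length, and the use of scipy here are our assumptions, not details given in the paper.

```python
import numpy as np
from scipy.signal import welch

# Standard frequency bands used in the paper (Hz)
BANDS = {"Subdelta": (0.0, 1.5), "Delta": (1.5, 3.5), "Theta": (3.5, 7.5),
         "Alpha": (7.5, 13.5), "Beta": (13.5, 19.5), "Beta2": (19.5, 20.0)}

def band_powers(x, fs):
    """Absolute and relative spectral powers of one EEG channel."""
    # Welch PSD estimate; 2-second segments are an arbitrary choice
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 2 * int(fs)))
    df = f[1] - f[0]
    absolute = {name: pxx[(f >= lo) & (f < hi)].sum() * df
                for name, (lo, hi) in BANDS.items()}
    total = sum(absolute.values())
    relative = {name: p / total for name, p in absolute.items()}
    return absolute, relative
```

For a 10-s segment dominated by a 10 Hz rhythm, for instance, the relative Alpha power comes out close to 1.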
Obviously, information about feature importance can reduce the model parameter space that needs to be explored in detail. In our case, however, we have no such information, and we are therefore forced to make the unrealistic assumption that all the EEG features make an equal contribution to the age assessment. Fortunately, DT models provide feature selection, and the use of Bayesian averaging over such models gives posterior information about feature importance. This means that if an attribute is rarely used in the ensemble, we can conclude that it makes a weak contribution and can be deleted. A trivial way of using this posterior information is to rerun the BMA on data from which the weak attributes have been deleted. This reduces the model parameter space and thus enables exploring it in more detail. The other way is to refine the DT ensemble by discarding the DTs which use weak attributes. We expect that such a strategy will reduce the original set of attributes without rerunning the BMA, while keeping its performance high.

III. IMPLEMENTATION OF BMA
For a DT with parameters θ, the predictive distribution is written as an integral over the parameters θ:

p(y | x, D) = ∫ p(y | x, θ, D) p(θ | D) dθ,    (1)

where y is the predicted class (1, …, C), x = (x1, …, xm) is the m-dimensional input vector, and D are the given training data. This integral can be calculated analytically only in simple cases; in practice, part of the integrand, the posterior density of θ conditioned on the data D, p(θ | D), usually cannot be evaluated. However, for samples θ(1), …, θ(N) drawn from the posterior distribution p(θ | D), we can write

p(y | x, D) ≈ Σ_{i=1}^{N} p(y | x, θ(i), D) p(θ(i) | D) ≈ (1/N) Σ_{i=1}^{N} p(y | x, θ(i), D).    (2)
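Equation (2) amounts to averaging the class-probability vectors produced by the sampled models. A minimal sketch, in which the model representation (a callable returning class probabilities) is our own hypothetical stand-in for a sampled DT:

```python
import numpy as np

def bma_predict(models, x):
    """Approximate p(y | x, D) by the mean of the class-probability
    vectors produced by N models sampled from the posterior (Eq. 2)."""
    return np.mean([m(x) for m in models], axis=0)

# Toy ensemble of three posterior samples over C = 2 classes
models = [lambda x: np.array([0.9, 0.1]),
          lambda x: np.array([0.7, 0.3]),
          lambda x: np.array([0.8, 0.2])]
p = bma_predict(models, x=None)   # averaged probabilities, approx. [0.8, 0.2]
```

The predicted class is then simply the argmax of the averaged probability vector.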
The above integral can be approximated by using the Markov Chain Monte Carlo (MCMC) technique as described in [7], [9]. To perform such an approximation, we need to run a Markov Chain until it has converged to a stationary distribution. Then we can collect N random samples from the posterior p(θ | D) to calculate the desired predictive posterior density. Using DTs for classification, we need to find the probability θtj with which an input x is assigned by terminal node t = 1, …, k to the jth class, where k is the number of terminal nodes in the DT. The DT parameters are defined by si^pos, si^var, si^rule, i = 1, …, k – 1, where si^pos, si^var, and si^rule define the position, predictor and rule of each splitting node, respectively. For these parameters, the priors can be specified as follows. First, we can define a maximal number of splitting nodes, smax = n – 1. Second, we draw any of the m attributes from a uniform discrete distribution U(1, …, m) and assign si^var ∈ {1, …, m}.
Finally, the candidate value for the splitting variable xj = si^var can be drawn from a discrete distribution U(xj(1), …, xj(L)), where L is the number of possible splitting rules for variable xj. Such priors allow us to explore DTs which split the data in as many ways as possible, keeping in mind that DTs with different numbers of splitting nodes should be explored in proportion to the numbers of possible DT configurations [7], [9].

To sample DTs of variable dimensionality, the MCMC technique exploits the Reversible Jump (RJ) extension. In [7], [9] it has been suggested to explore the posterior probability by using the following types of moves:

Birth. Randomly split the data points falling in one of the terminal nodes by a new splitting node with the variable and rule drawn from the corresponding priors.
Death. Randomly pick a splitting node with two terminal nodes and make it one terminal node with the united data points.
Change-split. Randomly pick a splitting node and assign it a new splitting variable and rule drawn from the corresponding priors.
Change-rule. Randomly pick a splitting node and assign it a new rule drawn from a given prior.

The first two moves, birth and death, are reversible and change the dimensionality of θ. The remaining moves provide jumps within the current dimensionality of θ. Note that the change-split move is included to make "large" jumps, which potentially increase the chance of sampling from a maximal posterior, whilst the change-rule move makes "local" jumps.

The RJ MCMC technique starts drawing samples from a DT consisting of one splitting node whose parameters are randomly assigned within the predefined priors. The Markov Chain is then run while the DT grows and its likelihood is unstable. This phase is called burn-in, and it should be preset sufficiently long to let the Markov Chain stabilize. Once the Markov Chain is sufficiently stable, sampling can begin. This phase is called post burn-in.
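The move scheduling above can be sketched as a Metropolis-Hastings step over the number of splitting nodes. This is a deliberately reduced illustration: the move probabilities and the likelihood function below are placeholders, and real birth/death moves would also modify the tree structure and account for the proposal asymmetry in the acceptance ratio.

```python
import math
import random

MOVES = ("birth", "death", "change-split", "change-rule")
PROBS = (0.15, 0.15, 0.10, 0.60)   # illustrative move probabilities

def mh_step(n_nodes, log_lik, rng):
    """One simplified Metropolis-Hastings step on the DT size."""
    move = rng.choices(MOVES, weights=PROBS)[0]
    proposal = n_nodes
    if move == "birth":
        proposal = n_nodes + 1            # add a splitting node
    elif move == "death" and n_nodes > 1:
        proposal = n_nodes - 1            # merge two terminal nodes
    # change-split / change-rule keep the dimensionality unchanged
    if math.log(rng.random()) < log_lik(proposal) - log_lik(n_nodes):
        return proposal                    # accept the proposed DT
    return n_nodes                         # reject, keep the current DT

# Burn-in toward a size favoured by the (placeholder) likelihood
rng = random.Random(1)
log_lik = lambda n: -abs(n - 5)            # mode at 5 splitting nodes
n, sizes = 1, []
for _ in range(2000):
    n = mh_step(n, log_lik, rng)
    sizes.append(n)
```

Run under this toy likelihood, the chain grows from a single splitting node and then oscillates around the mode, mimicking the burn-in behaviour described above.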
IV. THE PROPOSED METHOD
Our strategy aims at refining a DT ensemble obtained with BMA by discarding those DTs which use attributes found to make a weak contribution. According to this strategy, we first use the BMA technique described in the previous section to collect an ensemble of DTs. Then we estimate the posterior probabilities with which the attributes are used in this ensemble. These estimates give us posterior information on feature importance, which is used to sort the attributes by their impact. We then use a sequential forward search to find the DT models that use an attribute assumed to be weak, in order to eliminate these models from the ensemble. The search continues while the likelihoods of the refined and original ensembles remain comparable within a given p-value of a t-test. Obviously, the greater the number of attributes found weak, the larger the portion of DT models discarded from the ensemble. As a result, we expect to find the smallest set of attributes making the most important contribution while keeping the performance of the refined ensemble high. The efficiency of this technique is evaluated in experiments in terms of performance and accuracy of uncertainty assessment.

Under the given conditions, the use of a t-test can lead to a biased choice of weak attributes, and we will investigate this problem further. In this paper, we aim to explore the influence of the proposed discarding technique on the BMA performance, bearing in mind that a threshold value obtained with the standard t-test can be biased. The proposed technique can be summarized as a sequence of the following steps:

1. Set n-fold cross-validation (e.g., n = 10) and the p-value for the t-test.
2. Run BMA MCMC to collect an ensemble of DTs, DTi, for i = 1, …, n.
3. Calculate the ensemble likelihoods Li.
4. Estimate the posterior feature importance for the given attributes.
5. Sort the list of features, F, in the order of their importance.
6. Set the number of weak attributes, k = 1.
7. For folds i = 1, …, n, find the DT models using the feature in position k and delete them from ensemble DTi.
8. Calculate the likelihoods L1i for the refined ensembles, i = 1, …, n.
9. Run the t-test to compare likelihoods (L11, …, L1n) and (L1, …, Ln).
10. If the test rejects the null hypothesis, then set k = k – 1 and stop.
11. Otherwise, set k = k + 1 and go to step 7.
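Steps 6–11 above can be sketched as follows. The per-fold ensemble representation here, with each DT reduced to a pair of its used features and its log-likelihood, is a simplifying assumption of ours for illustration, not the paper's data structure.

```python
import numpy as np
from scipy.stats import ttest_ind

def find_weak_features(ensembles, feature_order, alpha=0.05):
    """Count how many of the weakest features can be dropped while the
    per-fold mean log-likelihoods of the refined and original ensembles
    stay comparable under a two-sample t-test.
    ensembles[i] = list of (used_features, log_likelihood) pairs, one list per fold.
    feature_order = features sorted from weakest to strongest."""
    base = [np.mean([ll for _, ll in fold]) for fold in ensembles]
    k = 0
    while k < len(feature_order):
        weak = set(feature_order[:k + 1])
        # discard every DT that uses any of the k+1 weakest features
        refined = [[(fs, ll) for fs, ll in fold if not weak & set(fs)]
                   for fold in ensembles]
        if any(not fold for fold in refined):
            break                        # nothing left to average over
        new = [np.mean([ll for _, ll in fold]) for fold in refined]
        if ttest_ind(new, base).pvalue < alpha:
            break                        # refined ensemble differs significantly
        k += 1
    return k                             # positions 1..k of the sorted list are weak
```

In a toy example, removing trees that use a genuinely weak feature leaves the fold likelihoods essentially unchanged, while removing trees that use a high-likelihood feature does not, so the search stops there.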
The algorithm returns the number of features found weak within the given p-value. Thus, the indexes of the weak features are in positions 1 to k of the list F.

V. EXPERIMENTS
A. Performance on Benchmark Problems

The comparative experiments were run on the Chess, Landsat, Waveform, and Credit problems from the UCI Machine Learning Repository [14]. In these experiments, we compared the performance and ensemble entropy obtained with the following three techniques. First, we ran the BMA on the data of the original dimensionality. Second, we estimated the feature importance and then ran the proposed technique to refine the DT ensembles; according to our method, we found the DT models using weak attributes and discarded them from the ensemble. In the third technique, used for comparison, we simply reran the BMA on the data sets of reduced dimensionality, having eliminated the weak attributes.

The performance of each technique is defined as the classification accuracy on the test data. Following [15], we express the classification uncertainty in terms of the ensemble entropy E as follows:

E = - Σ_{i=1}^{t} Σ_{j=1}^{c} P(j | xi) log2 P(j | xi),    (3)

where t is the number of test data samples xi, c is the number of classes, and P are the class posterior probabilities.
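The entropy above can be computed directly from the matrix of class posterior probabilities. A small sketch; the clipping constant is our guard against log2(0), not part of the paper:

```python
import numpy as np

def ensemble_entropy(P):
    """E = -sum_i sum_j P(j | x_i) * log2 P(j | x_i)
    over t test samples (rows) and c classes (columns)."""
    P = np.clip(np.asarray(P, dtype=float), 1e-12, 1.0)
    return float(-(P * np.log2(P)).sum())

# A fully confident sample contributes ~0 bits, a uniform one 1 bit:
# ensemble_entropy([[1.0, 0.0], [0.5, 0.5]]) is approximately 1.0
```

Summing over all test samples, rather than averaging, explains why the entropies reported in Table 1 grow with the number of test samples and classes.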
Table 1 shows the performance (P) and ensemble entropy (E) obtained in these experiments with each of techniques 1, 2 and 3. Here, n is the number of samples, m is the original number of attributes, and k is the number of weak attributes identified by the proposed technique. The performance and entropy were calculated within 10-fold cross-validation. We can see that the performances and entropies of the three techniques are comparable on these problems. On the Chess and Landsat problems, technique 3 slightly increased the performance and decreased the entropy, which supports our argument that dimensionality reduction improves the conditions of sampling.

TABLE 1. PERFORMANCE (P) AND ENTROPY (E) OF THE TECHNIQUES

Dataset (n, c, m, k)         Technique   P, %       E
Chess (3196, 2, 36, 12)      1           98.1±0.8   76.3±30.2
                             2           98.1±0.8   75.9±28.0
                             3           98.6±0.2   59.8±4.4
Landsat (6435, 7, 36, 9)     1           83.9±1.9   1218.0±41.1
                             2           83.8±1.7   1212.7±41.1
                             3           84.7±1.6   1050.9±27.9
Waveform (5300, 3, 40, 6)    1           76.2±0.7   1048.3±58.5
                             2           76.1±0.9   1048.6±62.2
                             3           76.5±1.4   993.7±83.2
Credit (690, 2, 15, 4)       1           87.0±7.4   31.4±6.7
                             2           86.4±7.1   31.5±7.0
                             3           87.1±6.6   31.7±7.1
EEG (200, 2, 36, 24)         1           86.5±7.6   21.4±4.5
                             2           86.5±7.6   21.0±4.3
                             3           87.5±8.7   20.9±5.1
Fig. 1 shows the details of the BMA on the Credit problem. During the burn-in and post burn-in phases, we collected 100,000 and 10,000 DTs, respectively. The minimal number of data samples allowed in DT nodes was set to 2. The proposal variance was set to 1.0, and the probabilities of the birth, death, change-split, and change-rule moves were set to 0.15, 0.15, 0.1 and 0.6, respectively. The acceptance rate was, on average, 0.6. From Fig. 1 we can observe that the likelihood value became stationary, oscillating around -150, while the DTs grew to 15 nodes on average. Fig. 2 shows the posterior probabilities of the 15 features of the Credit data obtained with the BMA.
Fig. 1. Log likelihood and DT size during the burn-in and post burn-in phases on the Credit problem.
Fig. 2. Posterior probabilities of features for the Credit problem.
Fig. 3 shows the likelihood, performance, ensemble size and p-value of the t-test calculated within the proposed technique. We can see that for k = 5 weak attributes, the p-value falls below 0.5 and the likelihood decreases more steeply. Thus we can set the maximal number of weak attributes to 4 and select the remaining 11 attributes as the most informative features. For comparison, we reran the BMA on the data represented by these 11 features. The results presented in Table 1 show that the performances of the original (1) and refined (2) ensembles are comparable with that obtained by rerunning (3). In the following section, the proposed technique is applied to the problem of assessing newborn brain maturity from sleep EEG.
For the EEG problem, the results presented in Table 1 show that the performances of the original (1) and refined (2) ensembles are comparable with that obtained by rerunning (3); the last result shows a slight improvement in both the performance and entropy of the ensemble.
Fig. 4. Log likelihood and DT size during the burn-in and post burn-in phases on the EEG brain maturity assessment problem.
Fig. 3. Finding a minimal attribute subset for the Credit problem: (a) likelihood, (b) performance, (c) ensemble size and (d) p value of t-test.
B. Performance on EEG Age Assessment

We used EEG recorded during sleep from 200 newborns in 2 age groups, 36 and 41 weeks after conception; each group contains 100 patients. The EEGs were segmented into 10-s intervals and represented by 36 attributes, including the absolute and relative spectral powers in the 6 standard bands. The EEG segments of each patient were averaged, so that the EEG problem is represented by (n = 200) samples in an (m = 36) attribute space.

We ran the BMA with the same settings as for the Credit problem. The acceptance rate was around 0.3 in both phases. The DT size became stationary after growing to 5 nodes, see Fig. 4. The average performance was 86.5%. Fig. 5 shows the posterior probabilities of the 36 features representing the EEG problem. These probabilities range between 0.004 and 0.024. Fig. 6 shows the likelihood, performance, ensemble size and p-value of the t-test calculated within the proposed technique. We can see that for k = 25 weak attributes, the p-value falls below 0.5. Similarly to the Credit problem, we identified 24 weak attributes and selected the remaining 12 as the most informative features. For comparison, we reran the BMA on the data represented by these 12 features.
Fig. 5. Posterior probabilities of features for the EEG brain maturity assessment problem.
VI. CONCLUSION AND DISCUSSION
We explored how posterior information can be used within the methodology of BMA over DTs to select a subset of EEG features making an important contribution to brain maturity assessment. We assumed that the posterior information about feature importance can be used to discard the DT models with features found to make a weak contribution, so that an ensemble can be refined of such models while the BMA performance is kept high. According to our observations, in the presence of weak EEG features, a number of models included in the resultant ensemble will use these features. The negative impact of such models is twofold.
Typically, reduction of data dimensionality by discarding the weak attributes is expected to improve BMA performance by reducing the model parameter space that needs to be explored. However, this technique requires repeatedly rerunning the BMA to find the minimal feature subset. The proposed refining method has been shown to provide comparable performance without rerunning the BMA.
Fig. 6. Finding a minimal attribute subset for the EEG brain maturity assessment problem: (a) likelihood, (b) performance, (c) ensemble size and (d) p value of t-test.
First, the use of weak EEG features obstructs model interpretation. Second, weak attributes increase the dimensionality of the model parameter space that needs to be explored in detail. The larger the number of weak attributes, the greater the negative impact on BMA performance. Consequently, we expected that reducing the model parameter space would improve the conditions for proportional sampling from the areas of interest with maximal posterior likelihood. Using this technique, we found that the set of EEG features can be reduced from 36 to 12, keeping the performance at 86.5±7.6%. When we reran the BMA on the data represented by these 12 EEG features, we observed a slight improvement in performance. This observation supports our argument that dimensionality reduction provides better conditions for proportional sampling from the areas of interest. This result was achieved even though we have not yet ensured the necessary conditions for the t-test.
REFERENCES

[1] B. Tharp, "Electrophysiological Brain Maturation in Premature Infants: an Historical Perspective", Clinical Neurophysiology, Vol. 7, No. 3, 1990, pp. 302-314.
[2] M. Scher, "Neurophysiological Assessment of Brain Function and Maturation: a Measure of Brain Adaptation in High Risk Infants", Pediatric Neurology, Vol. 16, No. 3, 1997, pp. 191-198.
[3] K. Holthausen, O. Breidbach, B. Scheidt, and J. Frenzel, "Brain Dysmaturity Index for Automatic Detection of High-Risk Infants", Pediatric Neurology, Vol. 22, No. 3, 2000, pp. 187-191.
[4] V. Schetinin and J. Schult, "A Neural-Network Technique to Learn Concepts from Electroencephalograms", Theory in Biosciences, Vol. 124, No. 1, 2005, pp. 41-53.
[5] M. Scher, D. Steppe, and D. Banks, "Prediction of Lower Developmental Performances of Healthy Neonates by Neonatal EEG-Sleep Measures", Pediatric Neurology, Vol. 14, No. 2, 1996, pp. 137-144.
[6] D. Crowell, L. Kapuniai, and R. Jones, "Autoregressive Spectral Estimates of Newborn Brain Maturational Level: Classification and Validation", Psychophysiology, Vol. 15, No. 3, 1978, pp. 204-208.
[7] H. Chipman, E. George, and R. McCulloch, "Bayesian CART Model Search", Journal of the American Statistical Association, Vol. 93, 1998, pp. 935-960.
[8] W. Buntine, "Learning Classification Trees", Statistics and Computing, Vol. 2, 1992, pp. 63-73.
[9] D. Denison, C. Holmes, B. Mallick, and A. Smith, Bayesian Methods for Nonlinear Classification and Regression, Wiley, 2002.
[10] V. Schetinin et al., "Confident Interpretation of Bayesian Decision Trees for Clinical Applications", IEEE Transactions on Information Technology in Biomedicine, Vol. 11, No. 3, 2007, pp. 312-319.
[11] P. Domingos, "Bayesian Averaging of Classifiers and the Overfitting Problem", Proceedings of the 17th International Conference on Machine Learning, San Francisco, CA, 2000, pp. 223-230.
[12] V. Schetinin and C. Maple, "A Bayesian Model Averaging Methodology for Detecting EEG Artifacts", Proceedings of the 15th International Conference on Digital Signal Processing, 2007, pp. 499-502.
[13] L. Jakaite and V. Schetinin, "Feature Selection for Bayesian Evaluation of Trauma Death Risk", Proceedings of the 14th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics, Vol. 20, 2008, pp. 123-126.
[14] A. Asuncion and D. J. Newman, UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, School of Information and Computer Science, 2007.
[15] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
[16] H. J. Niemarkt et al., "Analyzing EEG Maturation in Preterm Infants: The Value of a Quantitative Approach", Journal of Neonatal-Perinatal Medicine, Vol. 1, 2008, pp. 131-144.
[17] P. J. Green, "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination", Biometrika, Vol. 82, 1995, pp. 711-732.
[18] J. Corander, M. Ekdahl, and T. Koski, "Parallel Interacting MCMC for Learning of Topologies of Graphical Models", Data Mining and Knowledge Discovery, Vol. 17, No. 3, 2008, pp. 431-456.