2015 IEEE International Conference on Systems, Man, and Cybernetics
Efficient Labeling of EEG Signal Artifacts using Active Learning

Vernon Lawhern (1,2), David Slayback (1,3), Dongrui Wu (4), Senior Member, IEEE, and Brent J. Lance (1), Senior Member, IEEE

1. Translational Neuroscience Branch, Human Research and Engineering Directorate, US Army Research Laboratory, Aberdeen Proving Ground, MD, USA
2. Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, USA
3. Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
4. Machine Learning Laboratory, GE Global Research, Niskayuna, NY, USA
Abstract—Electroencephalography (EEG) has been widely used in a variety of contexts, including medical monitoring of subjects as well as performance monitoring in healthy individuals. Recent technological advances have now enabled researchers to quickly record and collect EEG on a wide scale. Although EEG is fairly easy to record, it is highly susceptible to noise sources called artifacts which can occur at amplitudes several times greater than the EEG signal of interest. Because of this, users must manually annotate the EEG signal to identify artifact regions in the data prior to any downstream processing. This can be time-consuming and impractical for large data collections. In this paper we present a method which uses Active Learning (AL) to improve the reliability of existing EEG artifact classifiers with minimal amounts of user interaction. Our results show that classification accuracy equivalent to classifiers trained on full data annotation can be obtained while labeling less than 25% of the data. This suggests significant time savings can be obtained when manually annotating artifacts in large EEG data collections.
Keywords—EEG; Artifacts; Active Learning; Support Vector Machine; Autoregressive Model

I. INTRODUCTION

Electroencephalography (EEG) is a method of measuring signals within the brain, specifically electrical potentials on the surface of the scalp. These signals have been analyzed in a variety of scenarios, including clinical applications such as epilepsy monitoring and non-clinical applications such as detecting fatigue and drowsiness in realistic driving environments. Applications of EEG for enhancing performance in healthy populations are also a growing research interest. EEG signals, while relatively easy to obtain compared to alternative neuroimaging methods, are highly susceptible to artifactual noise. Generally speaking, artifacts are classified into two categories, depending on their point of origin: internally-generated artifacts, such as saccades, eye movements and electromyogram (EMG) activity from the face and jaw; and externally-generated artifacts, such as electromagnetic interference (EMI), loose electrode contact and cable sway. These artifacts often occur at signal amplitudes several times greater than the EEG signal of interest, thereby decreasing the signal-to-noise ratio and making analysis of the EEG data difficult. Because of this, researchers often give specific instructions to subjects not to perform any unnecessary movements or actions during the experiment, so as to minimize the presence of artifacts in the data. Once the experiment is completed, researchers then remove artifacts from the EEG to obtain a "clean" neural signal that can be further analyzed. This artifact removal is traditionally done in one of several ways. The first is manual identification and removal, whereby the experimenter manually identifies artifact-contaminated EEG and removes the signal prior to analysis. However, this can severely reduce the amount of data available and may also reduce the degrees of freedom available for further statistical analysis. This approach is also time-consuming and impractical for long recordings lasting from several hours to several days or more. An alternative approach for artifact removal is Independent Component Analysis (ICA). ICA has been shown to be effective at removing several kinds of EEG signal artifacts, and there exist several automated methods for identifying artifacts from the derived independent components. However, it is recommended that large EMG artifacts be manually labeled and removed prior to ICA, as the presence of these artifacts may reduce the quality of the resulting decomposition. This initial identification is still done manually in many cases, representing a significant burden on users. Finally, automated methods can be used to label and extract artifacts without manual intervention. However, it is currently unclear how effective these methods are at removing artifacts without also removing neural data. As a result, many experimenters continue to perform manual artifact removal. Thus, a more efficient and automated method for the initial identification of EEG signal artifacts can save researchers significant effort in processing EEG data.

978-1-4799-8697-2/15 $31.00 © 2015 IEEE. DOI 10.1109/SMC.2015.558
We have previously shown that artifact detection and identification in EEG can be performed efficiently by using autoregressive (AR) features together with a support vector machine (SVM) classifier, with accuracies in some subjects approaching 95% for multi-class classification. This procedure was recently implemented in a continuous fashion through a sliding-window classifier. With that approach, we reported accuracies of approximately 75-80% in detecting among 4 different artifact classes (eye blinks, saccades/eye movements, EMG and the absence of any artifact), with the majority of errors due to false positives (identifying EEG segments as artifact when none is present) or false negatives (incorrectly identifying EEG segments as containing no artifact). We hypothesized that this drop in performance is due to the fact that the training data used for learning the classifier was well-stereotyped, whereas the test data contained more naturalistic artifact occurrences. This represents a mismatch in the feature spaces between the training and testing data and suggests that improved performance can be obtained by appropriately re-training the original classifier with some new data that represents the new temporal dynamics.

Figure 1. Example EEG segments of the four different classes of interest, taken from a Biosemi EEG system with 64 EEG electrodes and 4 EOG electrodes. From left to right: Muscle Activity, Eye Blink, Eye Movements and None (no artifact).
One potential method for efficiently re-training classifiers with new data is Active Learning (AL). AL is a semi-supervised machine learning technique whereby a machine learning algorithm can actively query an oracle, assumed to be an absolute source of knowledge, for the labels of informative data points. The algorithm is then re-trained with this new knowledge to (ideally) improve the classification model. Here, the learner can be any desired classification model (for example, a support vector machine or logistic regression), with the oracle being the user who is annotating the data. This paper proposes using AL to update existing EEG artifact classifiers with new, informative data so as to obtain improved performance with limited user interaction. One key component of AL is the technique used for identifying informative data for labeling. In this work we propose using Query-by-Committee (QBC) to select informative data. The QBC model uses a committee of machine learning algorithms that makes decisions via majority voting, where the data points identified for oracle querying are those for which there is no clear majority on the predicted label. We construct our committee through K-fold cross-validation, where the model learned from each fold is given a vote. Once candidate EEG segments are identified, they are sent to the user for manual labeling. We also propose a Decision-Confidence QBC classifier, which weighs each committee member by the confidence in its decision.
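The committee-disagreement criterion can be made concrete with a small sketch. The snippet below is a toy illustration of the ranking rule only; the class names and vote lists are invented for the example, and the study's actual committee members are RBF-SVMs, which are not shown here:

```python
from collections import Counter

def rank_epochs_by_disagreement(votes, confidences=None):
    """Order unlabeled epochs for oracle labeling under Query-by-Committee.

    votes[e] holds the K committee labels for epoch e; epochs whose
    majority vote is smallest (most disagreement) are queried first.
    If per-member confidences are supplied (Decision-Confidence QBC),
    ties at the same level of disagreement are broken by the lowest
    summed confidence.
    """
    def key(e):
        majority_size = Counter(votes[e]).most_common(1)[0][1]
        summed_conf = sum(confidences[e]) if confidences else 0.0
        return (majority_size, summed_conf)  # small majority, low confidence first
    return sorted(range(len(votes)), key=key)

# Toy committee of K = 5 members voting on 3 epochs:
votes = [
    ["blink", "blink", "blink", "blink", "none"],   # 4-1 split: clear
    ["emg",   "none",  "emg",   "blink", "none"],   # 2-2-1 split: ambiguous
    ["none",  "none",  "none",  "blink", "blink"],  # 3-2 split: borderline
]
print(rank_epochs_by_disagreement(votes))  # -> [1, 2, 0]
```

Under the Decision-Confidence variant, an epoch on which all members agree but with low summed confidence is ranked ahead of a unanimous, high-confidence epoch at the same level of disagreement.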
While AL has enjoyed success in other research settings, such as spam filtering and image classification, its applications in the EEG domain have not been extensively explored. To our knowledge, our recent study on single-trial visual-evoked potential (VEP) classification using AL is the first published work in this area. This work focuses on applying AL techniques to multi-class artifact detection and identification in EEG. There are two differences between our previous study and the current study that are worth noting. First, in the current study the user can visually identify the EEG signal artifacts present in the data directly, whereas in the previous study the oracle identified the external stimulus rather than the EEG itself. Second, this task is an application of AL to multi-class identification, whereas the previous study focused on binary classification.

II. MATERIALS AND METHODS

A. Experimental Setup

EEG was recorded from 7 subjects who participated in a series of experiments conducted at the Army Research Laboratory. All subjects provided consent prior to participating, and methods were approved as required by U.S. Army human use regulations. Details about the experiments can be found in our previous papers; a brief summary is given below. The seven subjects performed two tasks in one continuous recording session: an 'artifact battery' condition and a visual-evoked potential (VEP) condition. In the artifact battery condition, subjects were told to perform 8 types of artifacts at pre-defined time points, creating a ground-truth database of EEG artifacts. These EEG artifacts were: eye blinks, eye movements (left saccade and right saccade), and muscle movements that generate EMG activity (jaw clench, jaw movement, eyebrow movement and neck twists), plus the baseline condition "None" containing no artifacts. For the purpose of this study we grouped the 8 original artifact conditions into four classes: None, Eye Blink, Eye Movement (combining vertical and horizontal eye movements) and Muscle (combining jaw clenches, jaw movements, eyebrow movements and neck twists). Figure 1 shows an example segment of EEG data from each of these four classes. In the VEP condition, subjects performed a visual discrimination task where images were presented at a 0.5 Hz rate (one image every two seconds). Images contained either an enemy insurgent (occurring 10% of the time) or a friendly soldier (occurring 90% of the time). For this experiment, images containing enemy insurgents are classified as targets, with friendly soldiers classified as non-targets. Subjects were instructed to press a button with their dominant hand as rapidly and as accurately as possible when a target was shown. The two experimental conditions combined lasted between 20-25 minutes.
B. EEG Data Collection and Processing

The EEG data was recorded at 512 Hz using a 64-channel Biosemi ActiveTwo system (Biosemi, Amsterdam, Netherlands) and referenced to linked mastoids. Four external channels were used to record eye movements by electrooculography (EOG). EOG activity was recorded to verify the instances of eye blinks and saccades in the EEG. The data were down-sampled to 256 Hz and high-pass filtered at 1 Hz. We used EEGLAB for processing and filtering the data. Additional computational processing was performed in MATLAB (The Mathworks, Natick, MA). EEG data from the artifact battery condition was epoched at [0, 500] ms around artifact onset time, while EEG data from the VEP condition was epoched in non-overlapping 500 ms windows. In the artifact condition, 160 epochs were recorded from each subject, evenly distributed among the classes None, Eye Blink, Eye Movement, and Muscle. Each subject has approximately 1200 EEG windows for the VEP condition. An EEG signal processing expert with more than 10 years of experience manually labeled all the data for all subjects. The goal of our analysis is to label artifact instances in the VEP dataset (assumed unlabeled) using a classifier trained on the artifact dataset (assumed labeled).

C. Artifact Classification

We use the autoregressive-based support vector machine classifier previously described for our analysis. In brief, an autoregressive model of order p for a single channel of EEG is given as

Y(t) = Σ_{i=1}^{p} A_i Y(t − i) + ε_t

where Y(t) is the value of the EEG at time t and ε_t ~ N(0, σ²) denotes a zero-mean noise process with variance σ². The parameters of the AR(p) model are the coefficients A_i, i = 1, ..., p, and the variance σ². The method fits an AR(2) model to each EEG channel separately and concatenates the AR coefficients (omitting the variance estimate) to form a single column vector which describes the epoch. For 68 channels of data (64 EEG + 4 EOG) this results in a vector containing 136 features. Classification is done using a radial basis function support vector machine (RBF-SVM), which is trained using LibSVM. We then use Platt scaling to obtain a probability distribution over the classes for individual epochs. From this, we derive the confidence for each epoch as:

Confidence = (P(1) − P(2)) / P(1)

where P(1) and P(2) denote the largest and second-largest probabilities among all the classes for a given epoch. The confidence ranges over [0, 1], where values close to 1 indicate high confidence in the prediction, while values close to 0 indicate low confidence.

D. Active Learning Implementation

We consider two forms of AL for our analysis: a standard Query-by-Committee (QBC) approach, with K committee members learned through K-fold cross-validation, and a Decision-Confidence Weighted QBC, where the prediction of each committee member is weighted by its confidence. Pseudocode for our AL implementation is given here.

Assume an initial classifier C, labeled data L, unlabeled data U, a validation dataset V generated by removing 10% of the data from U at random, and M, the maximum number of epochs to query in each AL iteration. Classification accuracies are evaluated with respect to V to control for the possibility of overfitting. AL proceeds as follows:

1. Use K-fold cross-validation to train K models C_1, ..., C_K on L.
2. For each C_i, i = 1, ..., K, predict the labels for all epochs in U.
   a. If using Decision-Confidence QBC, calculate the confidence of each C_i for all epochs in U.
3. Sort all epochs in U by the amount of disagreement among the committee. For example, if K = 5, then an epoch with 2 committee members in agreement is more severe than an epoch with 3 committee members in agreement.
   a. If using Decision-Confidence QBC, take the sum of the confidence values over the K classifiers for each epoch. The summed confidence values range over [0, K].
4. Select up to M epochs with the greatest amount of committee disagreement for oracle labeling. Remove the M epochs from U and add them to L.
   a. If using Decision-Confidence QBC, within each level of disagreement the epochs with the lowest summed confidence are selected first.
5. Re-train C_1, ..., C_K on L using the new data labeled in Step 4.
6. Using the learned model, calculate the classification accuracy on the validation set V.
7. Repeat Steps 2-6 until the desired level of convergence is obtained.

Here, L is the labeled artifact data while U is the unlabeled VEP data. We set K = 5 for 5-fold cross-validation and M = 5 for labeling 5 EEG epochs per AL iteration. We set the maximum number of AL iterations to 50, although in some situations AL may converge in fewer than 50 iterations. We define convergence as the point when all committee members are in agreement and the confidence of each committee member is at least 0.6 for every epoch in U. To assess overall algorithm performance we repeated this procedure 100 times, where within each run we sampled, with replacement, the same number of epochs from the labeled set L prior to each AL iteration. One key distinction between QBC and Decision-Confidence QBC is that, while all committee members may agree on a prediction for a given data point (thereby converging under QBC), the overall confidence in the prediction may be low (close to 0 for every committee member), so that the point does not converge under Decision-Confidence QBC. Thus, having the oracle label this data point may improve overall performance even though the committee members are all in agreement.
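One iteration of the loop above can be sketched end-to-end. The snippet below is a minimal, self-contained illustration: a nearest-centroid classifier stands in for the RBF-SVM committee members, and the two-dimensional toy features and class labels are invented for the example (the study uses 136-dimensional AR feature vectors):

```python
import random
from collections import Counter

def centroid_model(X, y):
    """Train a nearest-centroid classifier (stand-in for one SVM fold)."""
    centroids = {}
    for c in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == c]
        centroids[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def al_iteration(L_X, L_y, U_X, K=5, M=5, seed=0):
    """Steps 1-4 of one AL iteration: train K folds on L, let the
    committee vote on U, and return the indices of the M epochs in U
    with the least agreement (the candidates for oracle labeling)."""
    idx = list(range(len(L_X)))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::K]) for i in range(K)]
    committee = []
    for f in folds:                       # each member trains on K-1 folds
        train = [i for i in idx if i not in f]
        committee.append(centroid_model([L_X[i] for i in train],
                                        [L_y[i] for i in train]))
    votes = [[member(x) for member in committee] for x in U_X]
    agreement = [Counter(v).most_common(1)[0][1] for v in votes]
    return sorted(range(len(U_X)), key=lambda e: agreement[e])[:M]

# Toy labeled set L (two well-separated clusters) and unlabeled set U:
L_X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
L_y = ["none"] * 5 + ["blink"] * 5
U_X = [(0.2, 0.3), (10.8, 10.9), (5.4, 5.6), (0.9, 0.1), (9.9, 10.2)]
query = al_iteration(L_X, L_y, U_X, K=5, M=2)
print(query)  # indices of the 2 epochs the oracle would label next
```

In the full procedure the M queried epochs would then be moved from U to L, the committee re-trained, and accuracy on the validation set V recorded (Steps 5-7).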
We also calculated two additional measures for statistical comparison. First, in the Baseline condition, at each iteration we randomly sampled M = 5 epochs from U for oracle labeling, independent of any metric. This measure controls for the effect of increased performance due only to increased sample size. Second, we calculated the full 10-fold cross-validation accuracy, assuming full knowledge of all labels in U. This represents an upper bound on the classification performance.

III. RESULTS
Fig. 2 shows the results of our analysis for the 7 subjects in the study. From these results we observe that:

1. Active Learning, either by QBC or Decision-Confidence QBC, can significantly improve overall classification over the Baseline measure. In some subjects (Subjects 1, 3, 4, and 7), performance approaching that of full 10-fold cross-validation can be achieved with less than 25% of the data labeled. This suggests that significant time savings can be achieved when labeling EEG data for artifacts in practical applications.

2. For a few subjects (Subjects 1, 3, 4 and 6) we see that the performance of the Baseline measure actually decreases as more data are labeled. In the case of Subject 6, the performance decreases and then increases. This suggests that labeling uninformative data points may actually hurt overall performance.

3. With the exception of Subject 5, the classifier performance significantly increases after the first iteration. This increase in performance suggests that there is a mismatch between the artifact data and the VEP data, such that by labeling some of the VEP data and including it in the artifact data, the classifier can adjust to the new feature space. However, in later iterations AL continues to improve overall classification while the Baseline measure does not significantly improve, and in some cases significantly decreases.

4. The performances of the two AL methods (QBC and Decision-Confidence QBC) are not statistically different from each other, as can be seen in the significant overlap of the confidence intervals shown in Figure 2. We do see some slight improvements with Decision-Confidence QBC in a subset of subjects, although the overall difference is minor.

IV. DISCUSSION
In this work we propose an AL-based solution for labeling artifact regions in EEG recordings. Our results show that users, on average, need to look at approximately 20% of their EEG data to learn an accurate artifact classifier that can be used for processing the rest of the dataset. Regions tagged with this approach can then be used for any further processing as needed. For example, ICA decompositions tend to model eye artifacts much better than muscle activity, so the experimenter may want to keep eye blink and saccade regions but omit muscle activity regions. The artifact regions may also be used as features themselves for other classifiers. For example, blink rate and blink duration have been used as features for detecting fatigue and drowsiness.
Figure 2. Average classification accuracy on the validation dataset V, averaged over the 100 bootstrap iterations (vertical axis), against the percentage of target data labeled (horizontal axis) for all subjects. Dashed lines denote 2 standard errors of the mean. Curves stopping in the middle of the plot indicate that at least 50% of the bootstrap iterations have converged.
One aspect of this work that we did not explicitly consider is the amount of time the user needs to label a set amount of EEG data. We empirically observed that the expert who labeled the data took approximately 2-3 times the length of the EEG recording being labeled. For the data length in this study (~20 minutes), the expert took approximately 40 minutes to an hour to label the entire dataset. We also empirically observed that the SVM classifier used in this study can be trained efficiently through multi-core processing, taking approximately 10-15 seconds per iteration. For 50 iterations of AL (the maximum number in this study) this translates to about 10-12 minutes of computational time. These numbers represent a time savings of about 60% for labeling artifacts in EEG, although more research is needed to determine whether this translates to larger data collections. Sparse feature representations may also improve computational efficiency. We expect that computational times will further improve as computers with faster processors and an increased number of processing cores become more widely available. We believe that improved AL methods may also significantly reduce computational times. Our results are based on a specific configuration of our AL QBC model (5-fold CV for the QBC committee, 5 samples selected per AL iteration). We did not explicitly test other parameter configurations to determine whether gains in classification accuracy could be obtained earlier in the AL iterative process. For example, labeling 10 or 20 epochs per iteration may produce faster gains than labeling 5 epochs as was done here; however, this places an increased burden on the user, who must manually label these epochs. Decreasing the number of committee members may also reduce computational times, with the potential risk of reduced AL performance. We believe that an appropriate trade-off between user labeling effort, classification performance and model configuration will be data specific.
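The reported savings can be checked with a back-of-envelope calculation using the figures above, under the assumption (not directly measured in the study) that user labeling time scales linearly with the fraction of data reviewed:

```python
# Figures from the text: a ~20-minute recording takes an expert roughly
# 2-3x its duration (40-60 minutes) to label in full; with AL, the user
# labels ~20% of the data, plus ~10-15 s of training per AL iteration
# for up to 50 iterations.
full_label_min = 2.5 * 20             # midpoint of the 2-3x estimate
al_user_min = 0.20 * full_label_min   # user reviews ~20% of the data
al_compute_min = 50 * 12.5 / 60       # 50 iterations at ~12.5 s each
savings = 1 - (al_user_min + al_compute_min) / full_label_min
print(round(savings, 2))              # -> 0.59, near the ~60% reported
```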
ACKNOWLEDGMENTS

We would like to thank W. David Hairston, Scott Kerick and Theodric Feng for help with data collection and the original experimental design. We would also like to thank Scott Kerick for help with data annotation and labeling. This project was supported in part by the Office of the Secretary of Defense Autonomy Research Pilot Initiative program MIPR DWAM31168 and by the U.S. Army Research Laboratory, under Cooperative Agreement Number W911NF-10-2-0022 and through the Oak Ridge Institute for Science and Engineering under program MIPR 4LDATBP033. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

REFERENCES
[1] S. J. M. Smith, "EEG in the diagnosis, classification, and management of patients with epilepsy," J. Neurol. Neurosurg. Psychiatry, vol. 76, no. suppl 2, pp. ii2–ii7, Jun. 2005.
[2] C.-T. Lin, C.-J. Chang, B.-S. Lin, S.-H. Hung, C.-F. Chao, and I.-J. Wang, "A Real-Time Wireless Brain–Computer Interface System for Drowsiness Detection," IEEE Trans. Biomed. Circuits Syst., vol. 4, no. 4, pp. 214–222, Aug. 2010.
[3] B. J. Lance, S. E. Kerick, A. J. Ries, K. S. Oie, and K. McDowell, "Brain–Computer Interface Technologies in the Coming Decades," Proc. IEEE, vol. 100, no. Special Centennial Issue, pp. 1585–1599, May 2012.
[4] J. van Erp, F. Lotte, and M. Tangermann, "Brain-Computer Interfaces: Beyond Medical Applications," Computer, vol. 45, no. 4, pp. 26–34, Apr. 2012.
[5] T. R. H. Cutmore and D. A. James, "Identifying and reducing noise in psychophysiological recordings," Int. J. Psychophysiol., vol. 32, no. 2, pp. 129–150, May 1999.
[6] T.-W. Lee, M. Girolami, and T. J. Sejnowski, "Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources," Neural Comput., vol. 11, no. 2, pp. 417–441, Feb. 1999.
[7] T.-P. Jung, S. Makeig, C. Humphries, T.-W. Lee, M. J. McKeown, V. Iragui, and T. J. Sejnowski, "Removing electroencephalographic artifacts by blind source separation," Psychophysiology, vol. 37, no. 2, pp. 163–178, Mar. 2000.
[8] A. Delorme, T. Sejnowski, and S. Makeig, "Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis," NeuroImage, vol. 34, no. 4, pp. 1443–1449, Feb. 2007.
[9] H. Nolan, R. Whelan, and R. B. Reilly, "FASTER: Fully Automated Statistical Thresholding for EEG artifact Rejection," J. Neurosci. Methods, vol. 192, no. 1, pp. 152–162, Sep. 2010.
[10] F. Campos Viola, J. Thorne, B. Edmonds, T. Schneider, T. Eichele, and S. Debener, "Semi-automatic identification of independent components representing EEG artifact," Clin. Neurophysiol., vol. 120, no. 5, pp. 868–877, May 2009.
[11] A. Mognon, J. Jovicich, L. Bruzzone, and M. Buiatti, "ADJUST: An automatic EEG artifact detector based on the joint use of spatial and temporal features," Psychophysiology, vol. 48, no. 2, pp. 229–240, Feb. 2011.
[12] B. W. McMenamin, A. J. Shackman, J. S. Maxwell, D. R. W. Bachhuber, A. M. Koppenhaver, L. L. Greischar, and R. J. Davidson, "Validation of ICA-based myogenic artifact correction for scalp and source-localized EEG," NeuroImage, vol. 49, no. 3, pp. 2416–2432, Feb. 2010.
[13] V. Lawhern, W. D. Hairston, K. McDowell, M. Westerfield, and K. Robbins, "Detection and classification of subject-generated artifacts in EEG signals using autoregressive models," J. Neurosci. Methods, vol. 208, no. 2, pp. 181–189, Jul. 2012.
[14] V. Lawhern, W. D. Hairston, and K. Robbins, "DETECT: A MATLAB Toolbox for Event Detection and Identification in Time Series, with Applications to Artifact Detection in EEG Signals," PLoS ONE, vol. 8, no. 4, p. e62944, Apr. 2013.
[15] B. Settles, "Active learning literature survey," 2010.
[16] A. Krogh and J. Vedelsby, "Neural network ensembles, cross validation, and active learning," Adv. Neural Inf. Process. Syst., pp. 231–238, 1995.
[17] D. Sculley, "Online Active Learning Methods for Fast Label-Efficient Spam Filtering," in CEAS, 2007, pp. 1–4.
[18] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, "Scalable Active Learning for Multiclass Image Classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2259–2273, Nov. 2012.
[19] D. Wu, B. Lance, and V. Lawhern, "Transfer learning and active transfer learning for reducing calibration data in single-trial classification of visually-evoked potentials," in 2014 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2014, pp. 2801–2807.
[20] US Department of the Army, "Use of volunteers as subjects of research." Government Printing Office, 1990.
[21] US Department of Defense, Office of the Secretary of Defense, "Code of federal regulations, protection of human subjects." Government Printing Office, 1999.
[22] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," J. Neurosci. Methods, vol. 134, no. 1, pp. 9–21, Mar. 2004.
[23] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, May 2011.
[24] J. C. Platt, "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods," in Advances in Large Margin Classifiers, 1999, pp. 61–74.
[25] Y. S. Kim, H. J. Baek, J. S. Kim, H. B. Lee, J. M. Choi, and K. S. Park, "Helmet-based physiological signal monitoring system," Eur. J. Appl. Physiol., vol. 105, no. 3, pp. 365–372, Nov. 2008.
[26] V. Lawhern, W. D. Hairston, and K. Robbins, "Optimal Feature Selection for Artifact Classification in EEG Time Series," in Foundations of Augmented Cognition, D. D. Schmorrow and C. M. Fidopiastis, Eds. Springer Berlin Heidelberg, 2013, pp. 326–334.