Frequently Asked Questions about ICA applied to EEG and MEG Data
Scott Makeig
Institute for Neural Computation
, University of California San DiegoStill More Frequently Asked Questions...
Some publications on ICA applied to EEG and fMRI data
0. What is ICA? Independent Component Analysis (ICA) is a signal processing domain that is currently receiving increasing attention. In its simplest form, the problem is to separate N statistically independent inputs which have been mixed linearly in N output channels, without further knowledge about their distributions or dynamics. This is also called a problem of blind separation. In 1995, Bell & Sejnowski published an elegant neural network "Infomax" algorithm that can approximate the solution to this ICA problem in many cases. Later that year, Makeig, Bell, Jung and Sejnowski showed a first application of ICA to decomposition of spontaneous multi-channel electroencephalographic (EEG) and averaged event-related potential (ERP) data sets at the NIPS meeting in Denver (Makeig et al., 1996).
A toolbox of Matlab routines developed for conveniently applying the Infomax ICA algorithm to EEG or MEG (magnetic EEG) data and a set of measurement and visualization functions for analyzing results of ICA decomposition of averaged or single-trial ERPs and ERFs (event-related fields), is available here. For simple Matlab ICA kernal code, mathematical details, and alternate approaches, see the Salk/CNL ICA web pages. To browse or download a web tutorial on the EEGLAB toolbox for Matlab, Click Here. For information about changes in the latest version of the toolbox, see ica.readme.html
The FAQs below are divided into theoretical and practical questions.
Theoretical Questions
1. Is the assumption that EEG is a linear mixture of underlying brain "sources" correct? In the EEG frequency range, the mixing of brain fields at the scalp electrodes is basically linear. The skull attenuates brain potentials strongly, and "smears" (low-pass filters) them spatially, but this does not affect the linear sumation (and thus mixing) of potentials from the brain (and non-brain artifacts) at the scalp electrodes.
2. If the activations of the brain "sources" are partially overlapping, what does the ICA algorithm do? See the answer to the next question...
3. What are brain "sources" of event-related brain responses, anyway? This is an important question, with seemingly contradictory answers!
First, remember:
____ ICA does not completely solve the inverse problem.
________ ICA does not completely solve the inverse problem.
____________ ICA does not completely solve the inverse problem.
____________________ (In general, nothing but 3-D observation can!)From the viewpoint of ICA, "brain sources" are NOT necessarily "fMRI hot spots" NOR "single equivalent dipoles," but rather "concurrent electromagnetic activity that is spatially fixed and temporally independent of activity arising in other spatially fixed sources and summed with it in the input data." Networks producing such concurrent activity are defined NOT by compact spatial distributions in the brain, but by the covarying field measurements they produce at the scalp sensors. In general, "sources" of ICA components may be (one or more) distributed brain networks rather than "physically compact active brain regions." These networks may be functionally linked, forming (possibly transient) larger network, or may simply be activated concurrently in the input data, by chance as it were.
When (as often happens when decomposing spontaneous or event-related single-trial EEG data) an ICA component map turns out to be compatible with a possible compact generator region in cortex, this is not because ICA had this as an objective, but because a coherent signal source, independent of other sources, projected to the electrodes or SQUID sensors in just this pattern - a welcome result not inherent in the method itself but reflecting the nature of the EEG source signals.
In short, ICA produces an independent component analysis decomposition of the input data rather than an inverse source model decomposition!
Understanding the difference between these two concepts may require almost a paradigm shift in the way one thinks about the brain. Is the brain a collection of physically discrete neural networks which pass information to each other by (occasional) neural impulses, as we send letters or email to each other? Or is the brain a dynamically shifting collection of interpenetrating, distributed, and possibly transient neural networks that constantly shift biases in neural communicate?
The first viewpoint is that of classical anatomy and physiology. The second is that of a slowly emerging dynamic systems perspective on neuroscience. ICA is a tool for discovering independent sources of spatial dependencies in multichannel data. It asks, and answers, a different problem than so-called "source localization." The two questions are complementary, hence the answers they produce may be complementary parts of "the whole story" of "how brains work."
In truth, brain networks are most probably never wholly autonomous. They are neither physically wholly isolated from one another, nor do they act wholly independently! Attempts to solve the inverse problem (source localization) may assume the first (at some level). ICA assumes the second. Might decomposition algorithms be derived which mediate between these two extremes? Perhaps...
4. So, if the activations of some "physically separate brain network" are partially overlapping, what does the ICA algorithm do? If relatively intense, correlated activity in two component occurring at one and the same time period is not strongly diluted by independent activity (including baseline noise activity) at other times in their activations, ICA may consider the period of correlated activity to reflect the emergence of a transitory cooperative source integrating the synchronously active networks. (Note: ICA deals with higher-order spatial dependencies as well as correlations). But what, you ask, if these networks "just happen" to be co-active? How is ICA to know about that?? It tells (as best it can) what independent components make up the data (if the data allows such a decomposition). The interpretation of the components is up to the user. (For some 2-d plots showing the near-independence and near-maximum entropy of ICA components, see an image of pairs of ERP-data activations (after logistic compression) plotted against one another (67k)).
5. How do I know if an ICA component originates in one or more than one physical region in the brain? ICA does not provide information beyond giving the scalp topographies of the independent components. One may attempt to perform source localization on these maps. In general, fewer active brain networks may dominate each component map than in the raw scalp data. Thus, ICA may be useful as a preprocessing step prior to attempting source localization (using whatever relevant additional assumptions).
6. What is the difference between ICA and PCA? (A very "FAQ"). PCA (principal component analysis), usually implemented using singular value decomposition (SVD), finds an orthogonal basis for its given data. The first eigenvector (and its weight, called the first principal component) gives the direction (and from this, the scalp topography) of maximum variance (between channels) in the data (and the size of the data projection onto this direction). If the data is a mixture of linear components (as ICA assumes) this direction will, in general, sum as many of the independent components as possible.
Raw PCA components have orthogonal activations and scalp distributions, while the goal of ICA is to find temporally independent components which may have non-orthogonal (even very similar) scalp distributions. For example, ICA applied to group mean data from a color experiment produced two separate series of components of responses to red and blue checkerboards, respectively. These, naturally, had very similar (but not identical) scalp maps. Thus, PCA and ICA have very different goals, and naturally give quite different results in nearly all cases.
PCA has usually been proposed for EEG analysis essentially as a preprocessing stage prior to "rotation" of the spatial filters using Varimax, Promax, Quartimax, or some other criterion, much like "sphering" is often used to preprocess ICA input data in the Bell & Sejnowski (1995) algorithm. The Varimax criterion (maximize the variance of the basis vectors) is indirect at best, and retains the undesirable feature that the scalp maps it finds must be spatially orthogonal, whereas nearby but functionally differing brain areas may produce highly correlated scalp patterns. (Promax is a postprocessing stage applied on top of Varimax that relaxes Varimax's orthogonality constraint on the maps slightly).
PCA-based EEG decomposition can be implemented in two ways -- so called "temporal" and "spatial" PCA approaches. "Temporal" PCA analysis, such as described by Chapman & McCreary (1994), forces PCA components to have the same time courses of activation in each experimental condition. (Mocks has introduced generalizations that somewhat relax this constraint). "Spatial PCA" and ICA avoid this shortcoming, allowing ICA components to have different time courses in each condition. ICA takes into account higher-order dependencies (and independencies) in the component activations, whereas PCA is based on the second-order covariance matrix. Spatial PCA assumes orthogonality of the component maps, whereas ICA maps are not constrained to be orthogonal ("The brain is not orthogonal").
In sum, ICA as implemented in runica() can be viewed as a new oblique rotation method for post-processing of spatial PCA'd (or sphered, or raw) data. Introducing ICA in this way may be useful when approaching potential users (or reviewers) better versed in PCA decomposition.
7. What are the ICA algorithm's distributional assumptions about the values of the component activations, and how do they affect the output? This is a complicated question. Bell & Sejnowski (1995) proved that their algorithm will solve the independent components problem exactly given several assumptions, among them that the c.d.f. (cumulative density function) of each of the components matches the nonlinearity used in the algorithm. We use the logistic function, but have no proof that brain networks have a logistic c.d.f. In practice, the algorithm performs well when the components have higher tails (more peak values) than the normal distribution (e.g., they are "on" infrequently; technically, their amplitude distributions are super-Gaussian). Amari et al. address the question of why ICA algorithms are 'super-efficient' in a recent paper (see Bell's web page referenced above). Note the relation between the "sparsity" assumption underlying logistic ICA and the "maximum compactness" assumption used in Varimax and Promax. In runica(), another 'extended-ICA' method of identifying components with sub-Gaussian component activations is available (see question 9b below).
8. Isn't the EEG known to be Gaussian? The algorithm will not separate a mixture of Gaussians. However, central limit arguments suggest that the mixture of non-Gaussian components may appear Gaussian, particularly if there are a large number of them (e.g., brain networks with ~1/f temporal and spatial distributions). In practice, infomax ICA can return good results on data having small deviations from normality.
9. What are the differences between the runica() algorithm in the Makeig et al. Matlab package and the original Bell & Sejnowski(1995) algorithm? The main addition is the added W'*W ("WTW") 'natural gradient' term which avoids matrix inversions during ICA training, greatly reducing the computational load. Tony Bell on incorporating the natural gradient into infomax ICA ..."
Note: Instead of sphering the data prior to training, one can simply initialize the weight matrix to the sphering matrix. In runica() this is the default when 'sphering' is turned 'off'.
We also have reintroduced online bias adjustments into runica() as originally given by Bell & Sejnowski (1995).
For skewed (non-symmetrical) activation distributions (very common in ERP analyses), bias adjustment gives better unmixing results (though in simulations using 20 (mostly small) sources and 6 channels, no advantage for using the bias term was found).
runica() also incorporates an optional momentum term into the weight update loop. Though initial simulations did not find an advantage for using this term, it might be of use in some cases.
9b. What is extended-ICA and when should it be used? The logistic nonlinearity used in runica() is capable of separating component activations that are super-Gaussian (kurtosis>0). These are sparse, with infrequent periods of intense activation. Sources whose activations are (for example) more 'on' than 'off' may have sub-Gaussian distributions of activation values. An algorithm for treating these has recently been developed using infomax (Te-Won Lee, Mark Girolami & Terry Sejnowski). Te-Won explains more about extended ICA ...
In runica() there are two ways of calling 'extended-ICA' using the 'extended' keyword. If the following parameter is a positive integer (N), a kurtosis estimate is run on the activation waveforms every N training blocks. If the resulting vector of signs does not change for a number of blocks, the rate of doing (computationally expensive) kurtosis estimates is reduced. The number of components separated using the sub-Gaussian nonlinearity is given by the number of -1's in the signs vector returned by runica() and is indicated at each training step in verbose mode. The number of actual sub-Gaussian component activations can also be found by counting the number of negative-kurtosis component activations returned by >> kurt (activations')
A faster but less flexible method for running extended-ICA is to use a negative integer parameter (-N). In this case, the number of putative sub-Gaussian components is fixed at N, and online kurtosis estimates are not used. If the actual number of separable sub-Gaussian components is not N, results may be difficult to interpret.
A prime impetus for using extended-ICA on psychophysiological data is to remove line current noise from EEG recordings. The 60- or 50-Hz component is sub-Gaussian, and can only be efficiently separated using extended-ICA. (Jung et al,. 1998). Some eye movement components have also been found to be sub-Gaussian.
Practical Questions
10. Can the ICA algorithm be applied to magnetic (MEG) as well as electric (EEG) data? Yes. The relation between brain fields and fields on the scalp is also essentially linear, so long as the positions of the sensors and underlying brain fields are fixed.
11. How many data channels and time points do I need to run the ICA algorithm? The runica() version of the infomax ICA algorithm uses the natural gradient, so requires very few matrix inversions and so runs very quickly on modern workstations. For large datasets (for example, single trials), the binary version of runica(), coded by Sigurd Enghoff, available here, is up to twelve times faster than the Matlab version and uses only a fourth the memory!
In our experience, the number of time points needed is at least several times the square of the number of scalp sensors (data channels). We have decomposed data sets with 2 to 144 channels successfully. The maximum number of time points is also unlimited. However, see the question following -- If you ask ICA to decompose alot of data at once, your number of data channels (and scalp sensors) should exceed the number of expected (appreciable) components that are active in the data! If the component activations returned by runica() are each large at one time point only, then the weight matrix has not trained sufficiently, and you need to retrain using more input data.
Remember that ICA can only separate components that vary independently in its input data. Thus, two components with overlapping but equivalent activations in each input data condition (e.g., each input ERP) should not be separated cleanly. To do so, the experimenter would need to add conditions in which their amplitudes and/or time courses varied independently. Normally, good experiments are exactly those in which the phenomena of interest are made to vary independently, making ICA most useful for analyzing data from scientifically sound experiments.
12. What does the ICA algorithm output? The runica() function outputs two matrices (weights, sphere) which are the heart of the matter. The "activations" of the ICA components (i.e. their time courses of activation) are the rows of the matrix
activations = weights*sphere*(input_data - row_means);Note that these activations are not scaled to the input data units, and that their polarities are also not meaningful. Why not? Because the projection (reconstruction) of each single ICA component's contribution to the original scalp data is produced by first zeroing out all but the chosen component (row) in the activation matrix, then multiplying by the inverse weight matrix:
projection = inv(weights*sphere) * single_activation;The rows of this matrix are the time courses of the chosen component at the respective scalp sensors, in the original input data units and with the same polarity. The original channel means are not included -- to include them,
activations = weights*sphere*input_data; % including row means! projection = inv(weights*sphere) * single_activation;The "activations" are the time course of activation (strengths) for each component. A component "projection" is the actual time course of potential variations at each of the electrodes produced by the component. The projections, as computed above, sum to the data including its row means. However, the offsets of the various component projections may not have physiological significance.
Note: there is no need to save the weight and sphere matrices separately. If desired, weights can be set equal to weights*sphere, and eye(N) can then be used by functions asking for a separate sphering matrix (where N is the number of channels in the data).
13. Does the ICA algorithm always give the same answer when applied to the same data? In repeated trainings, the Matlab random number generator will cause runica() to deliver the data to the training algorithm in different random orders. This has little effect on the outcome; components with large projections are unchanged, though small components may vary. This is an important reason why components are ordered in decreasing order of variance accounted for by their projections onto the scalp. In runica(), component polarity is set so that all the activation waveforms are RMS positive, producing component maps that reflect the actual polarities of their dominant projection onto the scalp (e.g., peak-positive scalp components in red, negative in blue in compmap()).
However, training the algorithm with different amounts of data may produce differences in the results. The ICA algorithm "pays attention" to the relatively large activity in the data. Adding relatively large responses to the training data and retraining may therefore cause the algorithm to "pay less attention" to the original training data, resulting in fewer components active chiefly in the now relatively smaller response conditions.
14. Can I use the ICA output (weight*sphere) matrix from an evoked response decomposition to filter individual EEG traces? Yes, but remember that the (weight*sphere) matrix is essentially a set of spatial filters which in this case were not trained to filter out the possibly numerous EEG components that were not present (or were strongly attenuated) in the original averaged training data. For example, if sufficient averaging succeeded in removing non-phase locked alpha activity from an averaged ERP, all the filters resulting from an ICA decomposition of this ERP might well allow the alpha activity in the single-trial EEG to pass as well as (possibly many) other EEG components.
15. If the number of brain "sources" is larger than the number of sensors, what will the ICA algorithm do? Our head-model simulations show that the presence of relatively large numbers of small independent components degrades the accuracy of ICA in finding large large independent components "gracefully" (e.g., the more or larger the "noise sources" in the simulation, the fewer large sources are accurately extracted). Of course, if the number of large independent components making up the simulated data is larger than the number of sensors, then the ICA results may be poor. This constrains the amount of input data to be given the algorithm -- throwing in additional (and unrelated) data (e.g., "the kitchen sink") may not leave enough component degrees-of-freedom to separate out phenomena of most interest. However, the algorithm has recently been used to usefully decompose as many as 75 31-channel ERPs from related conditions.
16. Should the number of outputs equal the number of inputs? runica() allows you to specify the number of output channels, which may be smaller than the number of input channels. However, in our experience this has not been useful, as the outputs are then forced to be (typically noisy) mixtures of underlying sources. A better strategy, in our experience, is to set the number of output channels to be equal to the number of input channels (this is the default) and then to focus on components of interest, in particular ignoring those ICA components whose projections to the scalp are small, or which represent phenomena of less experimental interest (e.g., eye movements, line noise, etc.).
17. What kinds of data channels can be input? So long as the assumed independent components will be mixed linearly in the data, any kinds of data channels may be input together. For example, EEG channels with common mastoid reference plus bipolar EOG channels. Mixing MEG and EEG channels should be possible as well.
18. Should I make the input data mean-zero? Yes, but runica() does this, and saves the channel means removed in the output argument datamean. ERP and ERF data normally are baseline-zeroed when they are calculated. Baseline-zeroing each epoch in multi-epoch data may be accomplished by:
data = rmbase(data,frames,basevector); In most ERP data, the sources generating the given response are assumed to be inactive during the baseline period. When the channel means of the input data differ significantly from the baseline means, making the data mean-zero prior to ICA training introduces an 'active' DC (square-wave) component without physiological significance into the ICA decomposition. This problem may may be minimized by appending response epochs without baseline offsets to the input data (example: responses to standard stimuli appended to responses to target stimuli containing large monophasic late waves). (Note: a previous idea, symmetrically doubling the input data around the baseline mean zero value, proved flawed, since when multiple skewed source activations are present, this introduces a spurious correlation between them). New ideas are welcome!
19. Can I decompose data from different sessions collected with different sensor montages? This violates the assumption of spatial stationarity of sources, so is possible only when the sensor placements are very near each other. For example, grand averages of subject groups from different experiments all wearing the same electrode cap can be decomposed simultaneously, (one such decomposition gave a single set early visual components for both experiments). ICA decomposition is "montage-free" in the sense that separate decompositions of the same components using different sensor montages will find the same (strongest) components.
20. Can I decompose data averaged across subjects? Yes. Of course, any averaging procedure is valid only if some physical uniformity assumptions are met.
21. Does the ICA algorithm work better with clean or noisy data? Data with more large signal sources than data channels (whether thought of as 'noise' or 'non-noise' sources) cannot all be separated by the ICA algorithm, although ICA may still identify the largest of them accurately. In general, ICA is not a tool designed for pulling small signals out of (true) noise, like techniques that rely on detailed models of the signals to be extracted. However, very small signals (e.g., a small muscle twitch) can be separated out of much larger EEG activity when the total number of large mixed sources is smaller than the number of sensors, and their time course of activation is quite unlike other data sources.
22. Where can I learn more about ICA applied to evoked responses? Some early papers on ICA applied to psychophysiological data are listed on an ICA bibliography page.
23. Reference for the ICA toolbox:
Main EEGLAB reference article:
A Delorme & S Makeig. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics, Journal of Neuroscience Methods 134:9-21 (2004) (download preprint, .pdf, 0.7 MB).Contains details of EEGLAB ICA and time/frequency methods. Please cite this paper to reference EEGLAB in publications.
EEGLAB methods overview:
S Makeig, S Debener, J Onton, & A Delorme. Mining event-related brain dynamics, Trends in Cognitive Science, 8(5):204-210 (2004) (download preprint, .pdf, 0.5 MB). The benefits and pitfalls of combining ICA, time/frequency analysis, and ERP-image visualization.
Reference for the software download:
Comments and suggestions on this tutorial
Delorme, A, Makeig, S, et al. "EEGLAB: Matlab Toolbox for Electrophysiological Research".
WWW Site, Swartz Center for Computational Neuroscience, Institute for Neural Computation,
University of California San Diego, http://www.sccn.ucsd.edu/eeglab
[World Wide Web Publication], 2002-
Reference for this ICA FAQ page:
Makeig, S et al. "Frequently Asked Questions about ICA applied to EEG and MEG data".
WWW Site, Swartz Center for Computational Neuroscience, Institute for Neural Computation,
University of California San Diego, www.sccn.ucsd.edu/eeglab
www.sccn.ucsd.edu/~scott/icafaq.html [World Wide Web Publication], 2002-
are welcome. Email scott@salk.edu