The FAQs below have come to me by email from researchers considering applying ICA to their ERP averages. More FAQs About Applying ICA to EEG/MEG Data ...
Below are my best attempt at answers ... -Scott Makeig
- What are "epochs" and "single trials"?
- How much data is enough data for ICA?
- Are four channels enough for ICA?
- Should I concatenate data conditions?
- Should I worry about discontinuities?
- How should the activations be scaled?
- If I use real input, why do I get complex output?
- How is ICA used with fMRI data?
- What is temporal ICA?
- How can I measure the success of ICA?
How Much Data is Enough Data?
> Others here and myself have been exploring the use of ICA for > some of our research. [Is it true that] one must have at least > the square of the number of sensors to meet the necessary assumptions > of ICA? Because this analysis allows the separation of overlapping > components of the recorded signal, it is ideally suited for addressing > some of the questions we're looking at, but we would like to ensure > that we're "doing it right". > > Since we have 300 points per channel, and 129 channels, we don't > have sufficient number of samples to extract 129 components. > However, in the MatLab program, there is the ability to limit the > number of components via an initial PCA procedure. We have > been using this to limit our number of components, usually to > somewhere between 6 and 20, though it appears 17 should be our > maximum. > > My question is, if one has (as we do) 300 data points per electrode > and 129 electrodes, does limiting the number of components to a > maximum of 17 via the PCA satisfy the criterion for number of > samples and allow us to perform a valid ICA? Or, is it more > advisable to limit the number of electrodes by taking a subset over > the head as well?There is no fixed limit to the number of points needed for a "good" ICA solution - and in fact no fixed way to judge whether an ICA solution is "good" or not. However, given 129 channels and, say, 129 points, infomax will usually find the obvious degenerate solution, one component fitting each time point. This solution will give you *no* additional information or insight into your data, and therefore be useless - or possibly worse, if you try to over-interpret it!
In yhour case there are 129^2 unmixed matrix weights for ICA to learn. My best guess is that the number of data points required may be some multiple of 129^2 (this could differ in some cases, depending on the component mixture -- Also, not all data may be modeled as the sum of independent components!). The PCA option is provided as a principled though imperfect way to make the training tractable for large numbers of channels.
However, I think you should rethink the appropriateness of your experimental design for ICA. Infomax looks for independence -- this means things that happen independently of each other. This means, roughly, things that happen at different times or in different ways at different times.
Your 300-point EEG data with its low-pass spectral character actually contain only 5-10? independent time windows. ICA could extract this many independent components, however, only if each component was active in only one independent time window. If this were the case, you might do better by just measuring their separate manifestations (by their peaks or other features).
Note that in our 1999 J Neurosci and J Royal Society papers on visual P3 and N1 decompositions (Makeig et al., 1999ab), we decomposed twenty-five 256-point 31-channel ERP condition grand averages simultaneously. Across this array of conditions, several EEG phenomena were expressed at different times/conditions - allowing ICA to separate them. Also, the number of input data points (25*256 = 6400) was over six times the number of weights to be learned (31^2 = 961).
Are there other conditional averages you could concatenate into your training data, such as non-target responses in each of N stimulus conditions, or etc?
Note: We are now exploring intensively another, most interesting, exciting and challenging approach of decomposing the whole collection of *single trials* from a subject. We have begun to publish on the results of this method, which requires and suggests a somewhat different framework for assumptions and interpretation than the usual framework for thinking about averaged ERPs. See our tutorial pages on single-trial analysis as well as our latest abstracts.
> Just to be sure I follow you here: So this means that even with the > PCA limiting the outputted number of components to say 6, if we > include all 129 channels of data, we still should have about 129^2 > data points.No, after selecting the largest 6 principal components, the ICA weight matrix is 6^2, so a few hundred points would probably be sufficient -- if the data fit the ICA model. However, there may not be 6 temporally independent components in the few hundred points (as I argue above). Also, PCA is a blunt tool for compressing data, potentially allowing phenomena of interest to be removed from the data or futher mixed.
Another researcher wrote ...
> The data was composed of 26 channels, 230 epochs of 500 samples. > Therefore I passed a matrix of 26 x 160000 to the function.You are training 26^2=676 weights with 15k points - 22 points per weight. This is quite sufficient to return components with compact source maps and distinctive dynamics, in our experience.
Should I Concatenate Data Conditions?
> Would it be more appropriate, therefore, to concatenate our > conditions to get sufficient numbers of data points, and then > perform an ICA?A: Yes, this is what I'm suggesting.
Should I worry about discontinuities?
> I have not attempted to match the ends of the epoch to each other. > I just assumed that the discontinuities will be mapped to the largest > component.But ICA does not see discontinuities - In fact, it shuffles all the time points randomly before each training step! This is, for example, quite unlike FFTs, which operate on time series and does not 'see' between-channel relationships. ICA (operating on EEG data) sees only an unordered pile of maps!
> Would it be possible to "pad" a data set and concatenate this "mock" data, > similiar to padding to a multiple of two for an FFT?No. Zero-padding here makes no sense, nor does noise padding. Only adding extra conditions in which (mainly) the same set of sources act (somewhat) differently with respect to each other will help.
If I use real input, why do I get complex output?
> I am using the toolbox for studying late-potentials (>300 ms) in mismatch > negativity ERP data. I am trying to remove the components that are not > significant in the frontal, parietal and occipital areas. > The data is 26-channel, 21 EEG + EKG + Eyes + EMG > > My problem is that "runica" sometime produces complex (in the math sense) > results and takes 11 hours to compute. > The norm of the imaginary components is 1/200 of the real components. > Before running runica() I have: > (1) removed the baseline from each epoch using the prestimulus data > (2) discarded epochs with amplitudes of over 50 uV in the CZ channel > (3) converted the data to average reference using averef() > (4) removed the baseline again using rmbase() > > Am I omitting some crucial preprocessing ?Failure to converge and complex results usually means your data is not of complete rank. Run >> rank(data) to test this. Applying averef() reduces the data dimension by 1, so running runica() (or better, binica()) with the 'PCA' option set appropriately (e.g., to nchans-1 or lower) is necessary.
In general, I do not believe using averef() is a good idea (unless you have really high-density, whole-head recordings). If you want, you can use averef() at the end of the analysis instead of the beginning.
Are four channels enough for ICA?
> We are currently trying to analyse ERP data collected from an > experiment involving rather complicated paradigm which includes mismatch, > emotional / non-emotional stimuli, as well as infrequent stimuli > (oddball-paradigm-like). However, we have only 4 channels. As you have > advised on your web page, we should have more channels than the expected > number of components. > > From the results of your experiments (J Neurosci 19:2665-2680 (1999)), > you found 4 psychologically meaningful LP components. Although we have > a different paradigm (hence possibly different number of expected components), > would it still be advisable to perform ICA on the data to obtain meaningful > results? > > My supervisor is quite keen to perform ICA on the data. However I am personally > against it, as I expect more components (including 'noise' components) than > we have channels. Please advise.I agree with you that ICA cannot be expected to give optimum results applied to a collection of ERPs with only four channels. However, as an exploratory measure, ICA is easy to perform and visualize (see the new toolbox tutorial). The question is how much belief to put in the functional independence of the resulting components. This should not in any case be blind faith, but faith won through finding convergent behavioral and other physiological evidence.
If you are serious about squeezing maximum information out of 4-channel data, I would advise using short-time moving-window ICA applied to the raw data. This will give a huge collection of components, which must be clustered, etc. (not easy). You could also investigate blind deconvolution applied to the single trials, though this is full of circular confounds, etc... It might be easier to collect the data again with higher density, somewhere. Good luck!
What is temporal ICA?
> As you are aware, there are two implementations of PCA -- temporal > and spatial. From reading your webpage, I get the impression that spatial > PCA and ICA are superior form of temporal PCA. But from some of my readings > I got the idea that spatial PCA (and possibly ICA) and temporal PCA answers > different questions altogether. Can you knindly clarify this? Also, is there > a 'temporal PCA'-equivalent form of ICA? Thanks for your time.These terms are confusing. What you call "spatial ICA" looks at collections of maps (space) to find temporally independent basis maps. In what you call "temporal ICA," the algorithm looks at collections of time courses to find spatially independent basis maps!
What you call temporal ICA is used for fMRI analysis (see McKeown et al., Human Brain Mapping '98, where it makes sense. For EEG it does not, since ~no EEG channel is independent of any other.
As for spatial/temporal PCA, I do believe the point I made in the JNS '99 paper. As Promax (not PCA) tries to minimize the support of each component, spatial Promax will attempt to find superficial sources (projecting strongly to only a few electrodes). Therefore spatial Promax sources have features like 'nipples.' As Chapman and McCrary pointed out, temporal Promax makes more sense, since in this case the component have minimum temporal support - i.e. are 'on' for a minimal period of time.
Be wary of claims that researchers are using "PCA." Note what rotation method they are in fact using. In fact, Varimax, an intermediate step in the Promax routine, does not require PCA pre-processing (Moecks), and so PCA itself plays little or no essential role in results of Varimax or Promax!
PCA itself tries to gather as much activity as possible into the first (and then into second, etc) components, constrained by the quite unreasonable assumption that the scalp maps are orthogonal. PCA may be used for dimension reduction (e.g. prior to ICA), though by removing the many smallest principal components one runs the risk of removing small details of interest.
The same cautions about component interpretation apply to Promax (or other -max) results as apply to (Infomax) ICA. Again, faith in the meaning of components found by any liniear decomposition should not in any case be blind faith, but faith won through discovering and confirming the reliability of convergent behavioral and other physiological evidence.
ICA on fMRI data?
> I am currently implementing the ICA approach for the analysis of > functional MRI data. However, I have some problems with the > methodology of ICA. If I have understood the runica() code correctly, > it is not designed to process large number of channels with a few > number of measurements. But a typical MRI experiment provides > data, which consists of a huge number of channels and only a small > number of images. In your papers published in PNAS and HBM98 you > applied ICA to functional MRI data. Therefore I assume, that you > performed a preselection of the voxels, but I didn't understand > how you did this.For fMRI, the independence is considered over voxels because of brain modularity. i.e., Simplistically, "Different places do different things." Looking for temporal independence of many brain areas over a relatively few time points makes little sense.
, Human Brain Mapping 6(3):160-88 1998.Therefore, the data matrix for fmri is of size (times,voxels). For more details, please read our paper McKeown et al., "Analysis of fMRI by blind separation into independent spatial components"
> Did you use something like a delayed-boxcar function to select > the pixels prior to ICA or did you use all pixels of the brain?We used all brain voxels. Even out-of-brain voxels might help if they contain artifacts (like 'ghosting') that also affect some in-brain voxels.
> If you used all pixels from inside the brain (like it > seems from Fig1. in the PNAS paper) the ICA has to process > around 20000 pixels. Moreover, in a pharmaMRI experiment > one has to include all voxels, because it is not clear a > priory , where something happens and which effect will be > induced by the drug. But processing of all voxels violates > the requirement #Channels << #Frames. Secondly, the sphering > matrix becomes very large and on our workstation > (even with a lot of memory) it is impossible to the perform > the matrix calculations without crashing MATLAB.The sphering dimension is the number of time points (or less, if 'pca' reduction is introduced in runica()).
> Will you be adding functions to visualize independent > fMRI components?Yes, we plan to do this soon, with easy data paths to/from SPM and AFNI.
What are epochs and single trials?
> I have some questions about your ICA EEG toolbox. As far as > single-trial ERPs is concerned,the "single-trial" here means "one > mental task" or not? Moreover,what is the difference between > single-trial ERPs and averaged ERPs? About the sentence "a collection > of single-trial event-related data epochs",if the "epochs" here > means the many experiments data (i.e collections of multichannel > data on the scalp with many times) on the same mental task? I really > get confused about some concepts like "single-trial", "averaged > ERPs" and "epochs",so i want to get your answer eagerly!In our usage, epoch = one continuous time series, usually of a few seconds
single trial = single stimulus or stimulus combination presented to the subject and most often requiring some response.
single-trial epoch = one continuous data time series recorded during a single trialAverages are usually made of single-trial epochs from a class of "identical" (similar) single trials. However, averaging conceals much of the brain dynamics, highlighting only a relatively small portion which may or may not have most functional significance.
How should the activations be scaled?
> To play devil's advocate, *all properties* of the observations are > distributed between the estimated attenuation matrix and the estimated > sources. After all, if X is the matrix of observations (ie the matrix > whose t'th column is the vector of observations at time t) and > A=inv(W) is the estimated attenuation matrix, and S is the matrix of > estimated sources, then X = AS.I don't understand your "*all*" here. The value of the decomposition for EEG/MEG is that it separates the "apparently moving" scalp scalp distribution into a set of spatially fixed maps A=inv(W) and temporally (maximally) independent timecourses S ("activations").
SVD/PCA, on the other hand, are said to separate the decomposition into: X = ADS, where D, a diagonal matrix, gives the (eigenvalue) strengths of the components (e.g. in RMS uV or fT). This formalism could just as well be used for ICA decompositions, and probably should be to avoid confusions.
For anything you want to know about the maps, consult A; about the timecourses, consult S; about the strengths, consult D; about the polarities consult the component projections. If the separability assumptions are incorrect (e.g., if the sources actually move or are dependent), consult some other method! But don't "trust" your *interpretation* of components uncritically (as below).
> We would like to be able to say something about the scale of > the numbers in S, ie their units, without reference to A or X. > That seems reasonable, but it cannot be done without thinking about > the physics of the situation. The suggestion of projecting the > activations back into sensor space and using sensor scale is elegant, > and makes use of the physics of the situation without relying > on a forward model. But it does require us to trust the matrix A > which we might not be prepared to do. After all, we may only trust > *some* of the rows of W, but A is contaminated by all of W, including > rows of W we may not trust. It also conceptually replicates an > estimated source across an array of sensors, when we might wish to > know something about the properties of the source itself as apart from > which sensors it seems connected to.The ICA solution X = SA makes the timecourses A maximally independent. One's yardstick for and degree of "trust" in an interpretation of any of these rests in other considerations (biophysics, psychology etc). The columns of A are maps (spatial weights) of components that make the component time courses maximally independent, i.e. maximally isolated from the influences of the other components - whether or not we "trust" any of them in any particular sense other than independence.
> In the case of EEG or MEG, my lab handles scaling as follows. We know > that the numbers in X all have the same units (volts, fT/cm, > whatever). We know that the sensors all have independent Gaussian > sensor noise of some magnitude, this magnitude being the same for all > our sensors. We distribute information about scale between the > matrices A and S so that the numbers in S have the same units as did > the numbers in X. We do this by making sure that the level of noise > in S induced by sensor noise in X is the same as the level of sensor > noise in X itself. By the additivity of variances, this can be done > by simply scaling each row of W to be of unit vector length.As you say, the maps are attentuation vectors; the source signals (S) are actually much stronger than the data X. Without knowing the absolute attenuations (requiring knowledge of source distances, etc), one cannot derive correct source units.
Back-projecting a component will reconstitute whatever part of the sensor noise that has been assigned to that component, unless you use an algorithm that estimates the noise level separately. If not, you should scale to this partial noise, not the whole noise. If all your sensors have exactly the same noise level(?), then what you are doing is equivalent to normalizing the columns of A, no?
> A third alternative, mentioned by Kevin Knuth, is to scale the columns > of A to maximize consistency with some parameterized forward numeric > model of the mixing process, or even constrain them to be consistent > with such a forward numeric model during the separation process. This > sounds very sensible in theory, but entails a number of practical > difficulties. The primary difficulty is that there is often more than > one forward model available, and often none of the forward models are > highly trusted. This is certainly the case with EEG/MEG, where many > practitioners trust their forward numeric models about as far as they > can throw their MRI machines. Even a trusted forward model may be > problematic, as one usually finds local minima when fitting the model > to a single columns of A, which would give rise to multiple proposed > scalings of the corresponding source depending upon which of the fits > one believes.But what practical good does having an estimate of source strength give you if the biophysical extent, shape and nature of the source is undetermined? Kevin proposes using the Bayesian framework to constrain ICA by a priori probabilities of source locations, shapes, maximum strengths, etc, as well as relative probabilities of independence or other time course characteristics, I suppose. I doubt whether adequate estimates of these are available, but reasonable guesses might prove useful.
> Another difficulty with this approach is numeric: constraints on A are > inconvenient because ICA algorithms usually estimate W directly, so > the forward model amounts to constraints on parts of the inverse of > the unstructured dense ill-conditioned matrix being estimated.Here, results of ICA are only as good as the fit of the data to the ICA assumptions, primarily independence and spatial fixity. The conditioning of the matrix might be an issue in very high dimensions and/or where the component scalp maps overlap highly (i.e., resolve close sources).
How can I measure the success of ICA?
> Is there any notion of gold standard in these problems? > An agreed-upon metric of performance? Or is just being > able to get reasonable-looking results in a reasonable > amount of compute time the goal?For biological data, and more generally for factor analysis and clustering, the issue of ICA model "success" is not cut and dried, since modeling the data as the exact sum of a determinate number of sources, factors or clusters is not in fact possible. The same situation holds for auditory data, by the way, where the number of sound sources in the real-world environment (e.g., a moving car) may not be truly determinate.
In each domain, ICA results may be judged on their functional significance (e.g., Do they differentiate signals from functionally distinct brain brain or sound sources?). But Zeki uses ICA to find transitory fMRI coactivations in sets of brain areas while the subject watches a (James Bond) movie . There, his challenge is to determine if the observed correlations may have any functional significance at all, rather than to establish the SNR gain of ICA outputs relative to the raw data, as is possible in simulation experiments.
A basic question in science is whether scientists should pay more attention to measuring or to observing. Quite often observing inspires modeling whose testing requires measuring! Each stage of this process has its own goals and challenges, and it own metric for success.
- Scott Makeig
More ICA FAQsComments and suggestions on this tutorial
are welcome. Email smakeig@ucsd.edu