EEGLAB Tutorial
VI. EEGLAB Studysets and Independent Component Clustering
Overview
This tutorial describes how to use a new (EEGLAB v5.0-) structure, the STUDY, to manage and process data recorded from multiple subjects, sessions, and/or conditions of an experimental study. EEGLAB uses studysets for performing statistical comparisons, for automated serial (and in future parallel) computation, and for clustering of independent signal components across subjects and sessions. The tutorial also details the use of a new (v5.0-) set of EEGLAB component clustering functions that allow exploratory definition and visualization of clusters of equivalent or similar ICA data components across any number of subjects, subject groups, experimental conditions, and/or sessions. Clustering functions may be used to assess the consistency of ICA (or other linear) decompositions across subjects and conditions, and to evaluate the separate contributions of identified clusters of these data components to the recorded EEG dynamics.
EEGLAB STUDY structures and studysets: EEGLAB v5.0 introduced a new basic concept and data structure, the STUDY. Each STUDY, saved on disk as a "studyset" (.std) file, is a structure containing a set of epoched EEG datasets from one or more subjects, in one or more groups, recorded in one or more sessions, in one or more task conditions -- plus additional (e.g., component clustering) information. In future EEGLAB versions, STUDY structures and studysets will become primary EEGLAB data processing objects. As planned, operations now carried out from the EEGLAB menu or the Matlab command line on datasets will be equally applicable to studysets comprising any number of datasets.
First use of STUDY structures: The first use for EEGLAB studysets is to cluster similar independent components from multiple sessions and to evaluate the results of clustering. As ICA component clustering is a powerful tool for electrophysiological data analysis, and a necessary tool for applying ICA to experimental studies involving planned comparisons between conditions, sessions, and/or subject groups, the STUDY concept has been applied first to independent component clustering. A small studyset of data from five subjects, released with EEGLAB v5.0b for tutorial use and available here, has been used to create the example screens below. We recommend that, after following the tutorial using this small example studyset, users next explore component clustering by forming EEGLAB studies for one or more of their existing experimental studies, testing the component clustering functions more fully on data they know well by repeating the steps outlined below.
Upgrades to several standard EEGLAB plotting functions also allow them to be applied simultaneously to whole studysets (either sequentially or in parallel) rather than to single datasets, for example allowing users to plot grand average channel data measures (ERPs, channel spectra, etc.) across multiple subjects, sessions, and/or conditions from the EEGLAB menu.
The dataset information contained in a STUDY structure allows straightforward statistical comparisons of component activities and/or source models for a variety of experimental designs. Currently, only a few two-condition comparisons are directly supported. We are now writing Matlab functions that will process information in more general STUDY structures and the EEG data structures they contain, potentially applying several types of statistical comparison (ANOVA, permutation-based, etc.) to many types of data measures.
Current STUDY design limitations: Currently (in EEGLAB v6), studyset conditions are organized as a one-dimensional vector rather than as an arbitrary condition matrix implementing a multifactorial experimental design. This limits the statistical models that can be tested directly from the cluster gui. In future versions, studysets will be allowed to have an N-dimensional matrix of conditions, with statistics for marginal effects computed for each dimension or factor.
Matlab toolboxes required for component clustering: Currently, two clustering methods are available: 'kmeans' and 'neural network' clustering. 'Kmeans' clustering requires the Matlab Statistical Toolbox, while 'neural network' clustering uses a function from the Matlab Neural Network Toolbox. To learn whether these toolboxes are installed, type ">> help" on the Matlab command line and see if the line "toolbox/stats - Statistical toolbox" and/or the line "nnet/nnet - Neural Network toolbox" are present. In future, we plan to explore the possibility of providing alternate algorithms that do not require these toolboxes, as well as options to cluster components using other methods.
This tutorial assumes that readers are already familiar with the material covered in the main EEGLAB tutorial and also (for the later part of this chapter) in the EEGLAB script writing tutorial.
VI. Outline
1. Component Clustering
1.1. Why cluster?
1.2. Before clustering
1.3. Clustering outline
2. The study ICA clustering interface
2.1. Creating a new STUDY
2.2. Loading an existing studyset
2.3. Preparing to cluster (pre-clustering)
2.4. Finding clusters
2.5. Viewing clusters
2.6. Editing clusters
2.7. Hierarchic sub-clustering
2.8. Editing STUDY datasets
3. STUDY data visualization tools
4. Study statistics and visualization options
4.1. Parametric and non-parametric statistics
4.2. Options for computing statistics on and plotting results for scalp channel ERPs
4.3. Computing statistics for studies with multiple groups and conditions
5. EEGLAB study data structures
5.1. The STUDY structure
5.2. The STUDY.datasetinfo sub-structure
5.3. The STUDY.cluster sub-structure
5.4. The STUDY.changrp sub-structure
6. Command line STUDY functions
6.1. Creating a STUDY
6.2. Component clustering and pre-clustering
6.3. Visualizing component clusters
6.4. Computing and plotting channel measures
6.5. Plotting statistics and retrieving statistical results
6.6. Modeling condition ERP differences using std_envtopo()
VI.1. Component Clustering
VI.1.1. Why cluster?
Is my Cz your Cz? To compare electrophysiological results across subjects, the usual practice of most researchers has been to identify scalp channels (for instance, considering recorded channel "Cz" in every subject's data to be spatially equivalent). Actually, this is an idealization, since the spatial relationship of any physical electrode site (for instance, Cz, the vertex in the International 10-20 System electrode labeling convention) to the underlying cortical areas that generate the activities summed by the (Cz) channel may be rather different in different subjects, depending on the physical locations, extents, and particularly the orientations of the cortical source areas, both in relation to the 'active' electrode site (e.g., Cz) and/or to its recorded reference channel (for example, the nose, right mastoid, or other site).
That is, data recorded from equivalent channel locations (Cz) in different subjects may sum activity of different mixtures of underlying cortical EEG sources, no matter how accurately the equivalent electrode locations were measured on the scalp. This fact is commonly ignored in EEG research.
Is my IC your IC? Following ICA (or other linear) decomposition, however, there is no natural and easy way to identify a component from one subject with one (or more) component(s) from another subject. A pair of independent components (ICs) from two subjects might resemble and/or differ from each other in many ways and to different degrees -- by differences in their scalp maps, power spectra, ERPs, ERSPs, ITCs, etc. Thus, there are many possible (distance) measures of similarity, and many different ways of combining activity measures into a global distance measure to estimate component pair similarity.
Thus, the problem of identifying equivalent components across subjects is non-trivial. An attempt at doing this for 31-channel data was published in 2002 and 2004 in papers whose preparation required elaborate custom scripting (by Westerfield, Makeig, and Delorme). A 2005 paper by Onton et al. reported on dynamics of a frontal midline component cluster identified in 71-channel data. EEGLAB now contains functions and supporting structures for flexibly and efficiently performing and evaluating component clustering across subjects and conditions. With its supporting data structures and stand-alone 'std_' prefix analysis functions, EEGLAB makes it possible to summarize results of ICA-based analysis across more than one condition from a large number of subjects. This should make it possible to apply linear decomposition and ICA routinely to a wide variety of hypothesis tests on datasets from several to many subjects.
The EEGLAB clustering and cluster-based functions will doubtless continue to grow in number and power in future versions, since they allow the realistic use of ICA decomposition in hypothesis-driven research on large or small subject populations.
NOTE: Independent component clustering (like much other data clustering) has no single 'correct' solution. Interpreting results of component clustering, therefore, warrants caution. Claims to discovery of physiological facts from component clustering should be accompanied by thoughtful caveat and, preferably, by results of statistical testing against suitable null hypotheses.
VI.1.2. Before clustering
You should organize your data before running the clustering functions. We suggest creating one directory or folder per subject, then storing the EEG dataset (".set") files for that subject in this folder. The pre-clustering functions will then automatically add pre-clustering measure files to the same subject directories.
We also advise modifying the default EEGLAB memory options. Selecting menu item File > Memory options will pop up the following window:
![]()
Set the first option so that no more than one dataset is stored in memory at a time. Dataset data arrays are then read from disk whenever EEGLAB requires access to the data, but without cluttering memory. This will allow Matlab to load and hold a large number of dataset structures forming a large STUDY. Also, unset the third option so ICA component activations do not get recomputed automatically. This saves time as datasets are re-saved automatically during pre-clustering and clustering.
VI.1.3. Clustering outline
There are several steps in the independent component clustering process:
1. Identify a set of epoched EEG datasets containing ICA weights to form the STUDY to be clustered.
2. Specify the subject code and group, task condition, and session for each dataset.
3. Identify the components in each dataset to cluster.
4. Specify and compute ("pre-clustering") measures to use in clustering.
5. Perform component clustering using these measures.
6. View the scalp maps, dipole models, and activity measures of the component clusters.
7. Perform signal processing and statistical estimation on the clusters.
VI.2. The STUDY creation and ICA clustering interfaces
This part of the clustering tutorial will demonstrate how to use EEGLAB to interactively preprocess, cluster, and then visualize the dynamics of ICA (or other linear) signal components across one or many subjects by operating on the tutorial study (which includes a few sample datasets and studysets). You may download these data here (~50Mb). Because of memory and bandwidth considerations, the sample datasets do not include the original data matrices (e.g., EEG.data). For these datasets, the measures needed for clustering and cluster visualization have already been computed during pre-clustering, as described below, and are stored with the datasets. This means that selecting individual datasets for visualization in the Datasets EEGLAB menu will give an error message. To avoid this limitation, you may also download the full studyset here (~450Mb).
Description of experiment for tutorial data: These data were acquired by Peter Ullsberger and colleagues from five subjects performing an auditory task in which they were asked to distinguish between synonymous and non-synonymous word pairs (the second word presented 1 second after the first). Data epochs were extracted from 2 sec before the second word onset to 2 sec after the second word onset. After decomposing each subject's data by ICA, two EEG datasets were extracted, one (Synonyms) comprising trials in which the second word was synonymous with the first one, and one (Non-synonyms) in which the second word was not a synonym of the first. Thus the study includes 10 datasets, two condition datasets for each of five subjects. Since both datasets of each subject were recorded in a single session, the decompositions and resulting independent component maps are the same across both conditions for each subject.
Note: With only a few subjects and a few clusters (a necessary limitation, to be able to easily distribute the example), it may not be possible to find six consistent component clusters with uniform and easily identifiable natures. We have obtained much more satisfactory results clustering data from 15 to 30 or more subjects.
After following this tutorial using the sample data, we recommend you create a study for a larger group of datasets, if available, whose properties you know well. Then try clustering the components of this study in several ways. Carefully study the consistency and properties of the generated component clusters to determine which method of clustering produces clusters adequate for your research purposes.
Note: We recommend performing one ICA decomposition on all the data collected in each data collection session, even when several task conditions are involved. In our experience, ICA can return a more stable decomposition when trained on more data. Having components with common spatial maps also makes it easier to compare component behaviors across conditions. To use the same ICA decomposition for several conditions, simply run ICA on the continuous or epoched data before extracting separate datasets corresponding to specific task conditions of interest. Then extract specific condition datasets; they will automatically inherit the same ICA decomposition.
After downloading the sample clustering data, unzip it in a folder of your choice (preferably in the "sample_data" sub-folder of the EEGLAB release you are currently using; under Linux use the "unzip" command). This will create a sub-folder "5subjects" containing several studysets. Then open a Matlab session and run >> eeglab.
VI.2.1. Creating a new STUDY structure and studyset
Exploratory Step: To create a studyset, select menu item File > Create study > Browse for datasets. Alternatively, you may load into EEGLAB all the datasets you want to include in the study and select menu item File > Create study > Using all loaded datasets. A blank version of the interface shown below will appear. In this window, first enter a name for the studyset ('N400STUDY') and a short description of the study ('Auditory task: Synonyms Vs. Non-synonyms, N400'). Here, we do not add notes about the study, but we recommend that you do so for your own studies. The accumulated notes will always be loaded with the study, easing later analysis and re-analysis. Click on the Browse button in the first blank location and select a dataset name. The interface window should then look like the following:
![]()
Warning: Under Matlab 7.x and Linux, do not copy and paste information into the edit boxes. Though it appears that this works, Matlab 7 does not correctly register such inputs. Enter all information manually (except the 'browsed' dataset names). This problem does not seem to arise under Windows.
Note that the fields "Subject" and "Condition" (below) have been filled automatically. This is because the datasets already contained this information. For instance, if you were to load this dataset into EEGLAB and select menu item Edit > Dataset info, you would be able to edit the "Subject" and "Condition" information for this dataset (as shown below). You may also edit it within the study itself, since dataset information and study dataset information may differ to ensure maximum flexibility, for instance when you want one dataset to belong to several studies but play a different role in each. Note: Use the checkbox "Update dataset information..." to maintain consistency between dataset and studyset fields. However, if you check this option, datasets may be modified on disk by clustering functions.
![]()
Enter all datasets for all subjects in the STUDY, so that the STUDY creation gui looks like this:
![]()
After you have finished adding datasets to the study, press "OK" in the pop_study() gui to import all the dataset information. We strongly recommend that you also save the STUDY as a studyset by filling in the bottom edit box in the gui, or by selecting the EEGLAB menu item File > Save study as after closing the pop_study() window.
Important note: Continuous data collected in one task or experiment session are often separated into epochs defining different task conditions (for example, separate sets of epochs time locked to targets and non-targets respectively). Datasets from different conditions collected in the same session are assumed by the clustering functions to have the same ICA component weights (i.e., the same ICA decomposition is assumed to have been applied to the data from all session conditions at once). If this was not the case, then datasets from the different conditions must be assigned to different sessions.
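The session rule above can be checked mechanically before building a STUDY. The following is a minimal sketch in Python (used here only for illustration, not EEGLAB code); the dictionary keys and the 'ica_id' label identifying a decomposition are hypothetical names, not EEGLAB fields.

```python
from collections import defaultdict

def inconsistent_sessions(datasets):
    """Return (subject, session) pairs whose datasets disagree on ICA weights.

    Each dataset is a dict with hypothetical keys 'subject', 'session', and
    'ica_id' (any hashable label identifying the ICA decomposition).
    """
    seen = defaultdict(set)
    for d in datasets:
        seen[(d["subject"], d["session"])].add(d["ica_id"])
    # A session is inconsistent if it mixes more than one decomposition.
    return sorted(key for key, ids in seen.items() if len(ids) > 1)

# Made-up example: S02 has two conditions with different decompositions
# assigned to the same session, which violates the assumption.
datasets = [
    {"subject": "S01", "session": 1, "condition": "target", "ica_id": "A"},
    {"subject": "S01", "session": 1, "condition": "nontarget", "ica_id": "A"},
    {"subject": "S02", "session": 1, "condition": "target", "ica_id": "B"},
    {"subject": "S02", "session": 1, "condition": "nontarget", "ica_id": "C"},
]
print(inconsistent_sessions(datasets))  # [('S02', 1)]
```

Any pair reported by such a check would need its datasets assigned to different sessions.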
VI.2.2. Loading an existing studyset
KEY STEP 1: Either use the studyset created in the previous section or load another studyset. To load a studyset, select menu item File > Load existing study. Select the file "N400.study" in the folder "5subjects". After loading or creating a study, the main EEGLAB interface should look like this:
![]()
An EEGLAB STUDY (or study) contains descriptions of and links to data contained in one to many epoched datasets, for example a set of datasets from a group of subjects in one or more conditions of the same task, or performing different tasks in the same or different sessions. Other designs are possible, for instance a number of different task conditions from the same subject collected during one or more sessions. The term "STUDY" indicates that these datasets should originate from a single experimental STUDY and have comparable structure and significance. When in use, studyset datasets are partially or totally loaded into EEGLAB. They thus may also be accessed and modified individually, when desired, through the main EEGLAB graphic interface or using EEGLAB command line functions or custom dataset processing scripts.
In the EEGLAB gui (above), "Epoch consistency" indicates whether or not the data epochs in all the datasets have the same lengths and limits. "Channels per frame" indicates the number of channels in each of the datasets (Note: It is possible to process datasets with different numbers of channels). "Channel location" indicates whether or not channel locations are present for all datasets. Note that channel locations may be edited for all datasets at the same time (simply call menu item Edit > Channel locations). The "Clusters" entry indicates the number of component clusters associated with this STUDY. There is always at least one cluster associated with a STUDY. This contains all the pre-selected ICA components from all datasets. The "Status" entry indicates the current status of the STUDY. In the case above, this line indicates that the STUDY is ready for pre-clustering. Below, we detail what the STUDY terms "subjects", "conditions", "sessions", and "groups" mean.
To list the datasets in the STUDY, use menu item Study > Edit study info. The following window will pop up:
![]()
The top of the window contains information about the STUDY, namely its running name, the extended task name for the STUDY, and some notes. The next section contains information about the 10 datasets that are part of the STUDY. For each dataset, we have specified a subject code and condition name. We chose to leave the session and group labels empty, as they were irrelevant for this STUDY. (For this STUDY, there was only one subject group and data for both experimental conditions were collected in a single session, so the same ICA decomposition was used for both conditions). Uniform default values will be used by EEGLAB for those fields. The Components column contains the components for each dataset that will be clustered. Note that if you change the component selection (by pressing the relevant push button), all datasets with the same subject name and the same session number will also be updated (as these datasets are assumed to have the same ICA components).
Each dataset's EEG structure may also contain subject, group, session, and condition fields. These do not necessarily have to be the same as those present in the STUDY. For example, the same dataset may represent one condition in one STUDY and a different condition in another STUDY.
- In general, however, we prefer the dataset information to be consistent with the studyset information -- thus we check the first checkbox above.
- The second checkbox removes all current cluster information. When cluster information is present, it is not possible to add or remove datasets and to edit certain fields (because this would not be consistent with the already computed clusters). Re-clustering the altered STUDY does not take much time, once the pre-clustering information for the new datasets (if any) has been computed and stored.
- The third checkbox allows the STUDY to be saved (or re-saved) as a studyset (for example, 'MyName.std').
KEY STEP 2: Here, we begin by pre-selecting components for clustering. Simply press the "Select by r.v." (r.v. = residual variance) push button in the gui above. The entry box below will appear. This allows you to set a threshold for residual variance of the dipole model associated with each component. Note: Using this option requires that dipole model information is present in each dataset. Use EEGLAB plug-in DIPFIT2 options and save the resulting dipole models into each dataset before calling the study guis. Otherwise, options related to dipole localization will not be available.
![]()
This interface restricts clustering to components whose equivalent dipole models have a residual variance below a specified threshold (0% to 100%), where residual variance measures how much of the component scalp map is left unexplained by the scalp projection of the best-fitting equivalent dipole model. The default r.v. value is 15%, meaning that only components with dipole model residual variance of less than 15% will be selected for clustering. This is useful because of the modeled association of components having near 'dipolar' (or sometimes dual-dipolar) scalp maps with physiologically plausible components, those that may represent the activation in one (or two coupled) brain area(s). Pressing "OK" will cause the component column to be updated in the main study-editing window. Then press "OK" to save your changes.
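The selection rule amounts to a simple threshold test. Here is a minimal sketch in Python (for illustration only, not the EEGLAB implementation); the function name and the residual variance values are made up, and residual variance is expressed as a fraction (0.15 = 15%).

```python
def select_by_rv(residual_variances, threshold=0.15):
    """Return 1-based indices of components whose dipole r.v. is below threshold."""
    return [i for i, rv in enumerate(residual_variances, start=1) if rv < threshold]

# Made-up residual variances for five components of one dataset.
rv = [0.03, 0.42, 0.149, 0.15, 0.08]
print(select_by_rv(rv))  # [1, 3, 5]
```

Note that a component sitting exactly at the threshold (0.15 here) is excluded, matching the "less than 15%" wording.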
VI.2.3. Preparing to cluster (Pre-clustering)
The next step before clustering is to prepare the STUDY for clustering. This requires, first, identifying the components from each dataset to be entered into the clustering (as explained briefly above), then computing component activity measures for each study dataset (described below). For this purpose, for each dataset component the pre-clustering function pop_preclust() first computes desired condition-mean measures used to determine the cluster 'distance' of components from each other. The condition means used to construct this overall cluster 'distance' measure may be selected from a palette of standard EEGLAB measures: ERP, power spectrum, ERSP, and/or ITC, as well as the component scalp maps (interpolated to a standard scalp grid) and their equivalent dipole model locations (if any).
Note: Dipole locations are the one type of pre-clustering information not computed by pop_preclust(). As explained previously, to use dipole locations in clustering and/or in reviewing cluster results, dipole model information must be computed separately and saved in each dataset using the DIPFIT2 EEGLAB plug-in.
The aim of the pre-clustering interface is to build a global distance matrix specifying 'distances' between components for use by the clustering algorithm. This component 'distance' is typically abstract, estimating how 'far' the components' maps, dipole models, and/or activity measures are from one another in the space of the joint, PCA-reduced measures selected. This will become clearer as we detail the use of the graphic interface below.
KEY STEP 3: Computing component measures. Invoke the pre-clustering graphic interface by using menu item Study > Build pre-clustering array.
![]()
The top section of the pop_preclust() gui above allows selecting clusters from which to produce a refined clustering. There is not yet any choice here -- we must select the parent cluster that contains all selected components from all datasets (i.e., the components selected at the end of the previous section).
The checkboxes on the left in the second section of the pop_preclust() interface above allow selection of the component activity measures to include in the cluster location measure constructed to perform clustering. The goal of the pre-clustering function is to compute an N-dimensional cluster position vector for each component. These 'cluster position' vectors will be used to measure the 'distance' of components from each other in the N-dimensional cluster space. The value of N is arbitrary but, for numeric reasons pertaining to the clustering algorithms, should be kept relatively low (e.g., <10). In the cluster position vectors, for example, the first three values might represent the 3-D (x,y,z) spatial locations of the equivalent dipole for each component. The next, say, 10 values might represent the largest 10 principal components of the first condition ERP, the next 10, for the second condition ERP, and so on.
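To make the idea of a cluster position vector concrete, here is a minimal sketch in Python (for illustration only). It assumes a plain Euclidean distance and uses two principal components per ERP instead of 10 to keep the example short; the function names and all numbers are hypothetical.

```python
import math

def cluster_position(dipole_xyz, *measure_pcs):
    """Concatenate a dipole location with PCA-reduced measures into one vector."""
    vec = list(dipole_xyz)
    for pcs in measure_pcs:
        vec.extend(pcs)
    return vec

def distance(a, b):
    """Euclidean distance between two cluster position vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two components with identical ERP scores but different dipole locations.
comp_a = cluster_position([10, -20, 30], [1.0, 0.5], [0.2, 0.1])
comp_b = cluster_position([13, -16, 30], [1.0, 0.5], [0.2, 0.1])
print(len(comp_a))               # 7 = 3 (dipole) + 2 + 2 (ERP scores)
print(distance(comp_a, comp_b))  # 5.0
```

In this toy case the whole distance comes from the dipole coordinates, which illustrates why the relative scaling of the measures matters.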
If you are computing (time/frequency) spectral perturbation images, you cannot use all their (near-3000) time-frequency values, which are redundant, in any case. Here also, you should use the "Dim." column inputs to reduce the number of dimensions (for instance, to 10). Note: pop_preclust() reduces the dimension of the cluster position measures (incorporating information from ERP, ERSP, or other measures) by restricting the cluster position vectors to an N-dimensional principal subspace by principal component analysis (PCA).
You may wish to "normalize" these principal dimensions for the location and activity measures you select so that their metrics are comparable across measures. Do this by checking the checkbox under the "norm" column. This 'normalization' process involves dividing the measure data of all principal components by the standard deviation of the first PCA component for this measure. You may also specify a relative weight (versus other measures). For instance, if you use two measures (A and B) and you want A to have twice the "weight" of B, you would normalize both measures and enter a weight of 2 for A and 1 for B. If you estimate that measure A has more relevant information than measure B, you could also enter a greater number of PCA dimensions for A than for B. Below, for illustration, we elect to cluster on all six allowed activity measures.
TIP: All the measures described below, once computed, can be used for clustering and/or for cluster visualization (see the following section of the tutorial, Edit/Visualize Component Cluster Information). If you do not wish to use a measure in clustering but still want to be able to visualize it, select it and enter 0 for the PCA dimension. This measure will then be available for cluster visualization although it will not have been used in the clustering process itself. This allows an easy way of performing exploratory clustering on different measure subsets.
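The normalization and weighting described above can be sketched as follows, in Python for illustration only. The input is assumed to be already PCA-reduced scores (one row per component, one column per principal dimension); the use of the population standard deviation is an assumption, since the text does not say which estimator is used, and the function name is hypothetical.

```python
import statistics

def normalize_and_weight(scores, weight=1.0):
    """Divide all principal dimensions by the s.d. of the first one, then weight.

    scores: list of rows, one per component; column 0 is the first PCA dimension.
    """
    first_pc = [row[0] for row in scores]
    sd = statistics.pstdev(first_pc)  # assumption: population standard deviation
    return [[weight * v / sd for v in row] for row in scores]

# Made-up PCA scores for four components and two principal dimensions.
scores = [[2.0, 1.0], [-2.0, 0.5], [2.0, -1.0], [-2.0, -0.5]]
print(normalize_and_weight(scores))
# [[1.0, 0.5], [-1.0, 0.25], [1.0, -0.5], [-1.0, -0.25]]
print(normalize_and_weight(scores, weight=2.0)[0])  # [2.0, 1.0]
```

After normalization the first dimension of every measure has unit spread, so the weights alone determine how much each measure contributes to the distance.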
Spectra: The first checkbox in the middle right of the pre-clustering window (above) allows you to include the log mean power spectrum for each condition in the pre-clustering measures. Clicking on the checkbox allows you to enter power spectral parameters. In this case, a frequency range [lo hi] (in Hz) is required. Note that for clustering purposes (but not for display), for each subject individually, the mean spectral value (averaged across all selected frequencies) is subtracted from all selected components, and the mean spectral value at each frequency (averaged across all selected components) is subtracted from all components. The reason is that some subjects have larger EEG power than others. If we did not subtract the (log) means, clusters might contain components from only one subject, or from one type of subject (e.g., women, who often have thinner skulls and therefore larger EEG than men).
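One reading of this two-step mean removal is sketched below in Python, for illustration only. It assumes that the first step removes each component's own mean over the selected frequencies and the second removes, at each frequency, the mean over that subject's selected components; the actual EEGLAB computation may differ in detail, and the function name and values are made up.

```python
def center_log_spectra(spectra):
    """Center one subject's log spectra (rows = components, columns = frequencies)."""
    # Step 1: remove each component's mean over the selected frequencies.
    step1 = [[v - sum(row) / len(row) for v in row] for row in spectra]
    # Step 2: remove, at each frequency, the mean over the subject's components.
    n = len(step1)
    col_means = [sum(col) / n for col in zip(*step1)]
    return [[v - m for v, m in zip(row, col_means)] for row in step1]

# Two components of one subject; the second has much larger overall power.
spectra = [[10.0, 8.0, 6.0],
           [20.0, 16.0, 18.0]]
print(center_log_spectra(spectra))
# [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
```

After centering, only differences in spectral shape remain, so a subject with globally larger power no longer forms a cluster of its own.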
ERPs: The second checkbox computes mean ERPs for each condition. Here, an ERP latency window [lo hi] (in ms) is required.
Dipole locations: The third checkbox allows you to include component equivalent dipole locations in the pre-clustering process. Dipole locations (shown as [x y z]) automatically have three dimensions (Note: It is not yet possible to cluster on dipole orientations). As mentioned above, the equivalent dipole model for each component and dataset must already have been pre-computed. If one component is modeled using two symmetrical dipoles, pop_preclust() will use the average location of the two dipoles for clustering purposes (Note: this choice is not optimum).
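The averaging of a symmetric dipole pair can be sketched as follows, in Python for illustration only; the function name and coordinates are hypothetical.

```python
def clustering_location(dipoles):
    """Return the mean (x, y, z) location of one or two equivalent dipoles."""
    n = len(dipoles)
    return tuple(sum(axis) / n for axis in zip(*dipoles))

# A made-up bilaterally symmetric pair, 40 units left and right of the midline.
pair = [(-40.0, -20.0, 50.0), (40.0, -20.0, 50.0)]
print(clustering_location(pair))  # (0.0, -20.0, 50.0)
```

The averaged location lies on the midline, where neither dipole actually sits, which is one way to see why this choice is not ideal.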
Scalp maps: The fourth checkbox allows you to include scalp map information in the component 'cluster location'. You may choose to use raw component map values, their laplacians, or their spatial gradients. (Note: We have obtained fair results for main components using laplacian scalp maps, though there are still better reasons to use equivalent dipole locations instead of scalp maps.) You may also select whether or not to use only the absolute map values, their advantage being that they do not depend on (arbitrary) component map polarity. As explained in the ICA component section, ICA component scalp maps themselves have no absolute scalp map polarity.
ERSPs and/or ITCs: The last two checkboxes allow including event-related spectral perturbation information in the form of event-related spectral power changes (ERSPs), and event-related phase consistencies (ITCs) for each condition. To compute the ERSP and/or ITC measures, several time/frequency parameters are required. To choose these values, you may enter the relevant timefreq() keywords and arguments in the text box. You may for instance enter 'alpha', 0.01 for significance masking. See the timefreq() help message for information about time/frequency parameters to select.
Final number of dimensions: An additional checkbox at the bottom allows further reduction of the number of dimensions in the component distance measure used for clustering. Clustering algorithms may not work well with measures having more than 10 to 20 dimensions. For example, if you selected all the options above and retained all their dimensions, the accumulated distance measures would have a total of 53 dimensions. This number may be reduced (e.g., to a default 10) using the PCA decomposition invoked by this option. Note that, since this will distort the cluster location space (projecting it down to its smaller dimensional 'shadow'), it is preferable to use this option carefully. For instance, if you decide to use reduced-dimension scalp maps and dipole locations that together have 13 dimensions (13 = the requested 10 dimensions for the scalp maps plus 3 for the dipole locations), you might experiment with using fewer dimensions for the scalp maps (e.g., 7 instead of 10), in place of the final dimension reduction option (13 to 10).
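The dimension counts quoted above can be tallied in a few lines of Python (for illustration only). The per-measure values assume that each of the five PCA-reduced measures keeps 10 dimensions, which is consistent with the stated total of 53; the dictionary keys are hypothetical labels.

```python
# Assumed dimensions per measure: 10 for each PCA-reduced measure, 3 for dipoles.
dims = {"spectra": 10, "erp": 10, "dipoles": 3, "scalp_maps": 10, "ersp": 10, "itc": 10}
print(sum(dims.values()))  # 53

# Scalp maps plus dipole locations only.
subset = dims["scalp_maps"] + dims["dipoles"]
print(subset)  # 13

# Lowering the scalp map dimensions to 7 reaches 10 without a final PCA step.
print(7 + dims["dipoles"])  # 10
```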
Finally, the pop_preclust() gui allows you to choose to save the updated STUDY to disk.
In the pop_preclust() gui, select all methods and leave all default parameters (including the dipole residual variance filter at the top of the window), then press OK. As explained below, for this tutorial STUDY, measure values are already stored on disk with each dataset, so they need not be recomputed, even if the requested clustering limits (time, frequency, etc.) for these measures are reduced.
Re-using component measures computed during pre-clustering: Computing the spectral, ERP, ERSP, and ITC measures for clustering may, in particular, be time consuming -- requiring up to a few days if there are many components, conditions, and datasets! The time required will naturally depend on the number and size of the datasets and on the speed of the processor. Future EEGLAB releases will implement parallel computing of these measures for cases in which multiple processors are available. Measures previously computed for a given dataset and stored by std_preclust() will not be recomputed, even if you narrow the time and/or frequency ranges considered. Instead, the computed measure information will be loaded from the respective Matlab files in which it was saved by previous calls to pop_preclust().
Measure data files are saved in the same directory/folder as the dataset, and have the same dataset name -- but different filename extensions. For example, component ERSP information for the dataset syn02-S253-clean.set is stored in a file named syn02-S253-clean.icaersp. As mentioned above, for convenience it is recommended that each subject's data be stored in a different directory/folder. If all the possible clustering measures have been computed for this dataset, the following Matlab files should be in the /S02/ dataset directory:
- syn02-S253-clean.icaerp (ERPs)
- syn02-S253-clean.icaspec (power spectra)
- syn02-S253-clean.icatopo (scalp maps)
- syn02-S253-clean.icaersp (ERSPs)
- syn02-S253-clean.icaitc (ITCs)
The parameters used to compute each measure are also stored in the file, for example the frequency range of the component spectra. Measure files are standard Matlab files that may be read and processed using standard Matlab commands. The variable names they contain should be self-explanatory.
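The naming rule described above can be sketched in a few lines. This is an illustrative Python sketch, not EEGLAB code: the helper name measure_files() is hypothetical, and the only assumption is the one stated in the text, namely that each measure file shares the dataset's folder and base name and differs only in its extension.

```python
from pathlib import PurePosixPath

# Extensions listed in the text above, keyed by the measure they store.
MEASURE_EXTENSIONS = {
    "ERPs": ".icaerp",
    "power spectra": ".icaspec",
    "scalp maps": ".icatopo",
    "ERSPs": ".icaersp",
    "ITCs": ".icaitc",
}

def measure_files(dataset_path):
    """Return {measure: expected measure file path} for one dataset file."""
    dataset = PurePosixPath(dataset_path)
    # Same folder, same base name, different extension.
    return {measure: str(dataset.with_suffix(ext))
            for measure, ext in MEASURE_EXTENSIONS.items()}

files = measure_files("/S02/syn02-S253-clean.set")
for measure, path in files.items():
    print(f"{measure:14s} {path}")
```

A script could use such a list to check which measures are already on disk before launching a long pre-clustering run.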
Note: For ERSP-based clustering, if a parameter setting you have selected differs from the value of the same parameter used to compute and store the same measure previously, a query window will pop up asking you to choose between recomputing the measure using the new parameters or keeping the existing measure values. Again, narrowing the ERSP latency and frequency ranges considered in clustering will not lead to recomputing the ERSP across all datasets.
VI.2.4. Finding clusters
KEY STEP 4: Computing and visualizing clusters. Calling the cluster function pop_clust() using menu item Study > Cluster components will open the following window.
![]()
Currently, two clustering algorithms are available: 'kmeans' and 'neural network' clustering. As explained earlier, 'kmeans' requires the Matlab Statistics Toolbox, while 'neural network' clustering uses a function from the Matlab Neural Network Toolbox.
Both algorithms require entering a desired number of clusters (first edit box). An option for the kmeans() algorithm can relegate 'outlier' components to a separate cluster. Outlier components are defined as components further than a specified number of standard deviations (3, by default) from any of the cluster centroids. To turn on this option, click the upper checkbox on the left. Identified outlier components will be put into a designated 'Outliers' cluster (Cluster 2). Click on the lower left checkbox to save the clustered studyset to disk. If you do not provide a new filename in the adjacent text box, the existing studyset will be overwritten.
In the pop_clust() gui, enter "10" for the number of clusters and check the "Separate outliers ..." checkbox to detect and separate outliers. Then press 'OK' to compute clusters (clustering is usually quick). The cluster editing interface detailed in the next section will automatically pop up. Alternatively, for the sample data, load the provided studyset 'N400clustedit.study' in which pre-clustering information has already been stored.
VI.2.5. Viewing component clusters
Calling the cluster editing function pop_clustedit() using menu item Study > Edit/plot clusters will open the following window. Note: The previous menu item will also open this window automatically after clustering has finished.
![]()
Of the 305 components in the sample N400STUDY studyset, dipole model residual variances for 154 components were above 15%. These components were omitted from clustering. The remaining 151 components were clustered on the basis of their dipole locations, power spectra, ERPs, and ERSP measures into 10 component clusters.
Visualizing clusters: Selecting one of the clusters from the list shown in the upper left box displays a list of the cluster components in the text box on the upper right. Here, S02 IC33 means "independent component 33 for subject S02," etc. The "All 10 cluster centroids" option in the (left) text box list will cause the function to display results for all but the 'ParentCluster' and 'Outlier' clusters. Selecting one of the plotting options below (left) will then show all 10 cluster centroids in a single figure. For example, pressing the "Plot scalp maps" option will produce the figure below:
![]()
In computing the mean cluster scalp maps (or scalp map centroids), the polarity of each of the cluster's component maps was first adjusted so as to correlate positively with the cluster mean (recall that component maps have no absolute polarity). Then the map variances were equated. Finally, the normalized means were computed.
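The three steps just described (polarity alignment, variance equalization, averaging) can be illustrated with a small sketch. This is plain Python for illustration only, not the EEGLAB implementation. Two choices here are our assumptions: each map also has its mean removed before scaling, and the polarity is seeded from the first map and then refined against the running mean, because the cluster mean itself depends on the polarities chosen. Scaling is done before alignment, which is equivalent since positive scaling does not change the sign of a correlation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(scalp_map):
    """Remove the mean and scale to unit variance so all maps weigh equally.
    (A constant map would give zero variance; not handled in this sketch.)"""
    mean = sum(scalp_map) / len(scalp_map)
    centered = [x - mean for x in scalp_map]
    sd = math.sqrt(dot(centered, centered) / len(centered))
    return [x / sd for x in centered]

def cluster_centroid(maps, passes=2):
    """Return (centroid map, polarity-aligned maps)."""
    maps = [normalize(m) for m in maps]
    reference = maps[0]          # assumption: seed polarity with the first map
    aligned = maps
    for _ in range(passes):
        # Flip any map that correlates negatively with the current reference.
        aligned = [m if dot(m, reference) >= 0 else [-x for x in m]
                   for m in maps]
        # The mean of the aligned, normalized maps becomes the new reference.
        reference = [sum(values) / len(aligned) for values in zip(*aligned)]
    return reference, aligned

# Hypothetical 4-channel maps; the second is the first with inverted polarity.
maps = [
    [1.0, 2.0, 3.0, 4.0],
    [-2.0, -4.0, -6.0, -8.0],
    [1.5, 2.0, 3.5, 4.0],
]
centroid, aligned = cluster_centroid(maps)
print(centroid)
```

Without the polarity step, the first two maps in this example would cancel each other in the average even though they describe the same spatial pattern.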
To see individual component scalp maps for components in the cluster, select the cluster of interest in the left column (for example, Cluster 8 as in the figure above). Then press the 'Plot scalp maps' option in the left column. The following figure will appear. (Note: Your "Cluster 8" scalp map may differ after you have recomputed the clusters for the sample STUDY.)
![]()
To see the relationship between one of the cluster centroid maps and the maps of individual components in the cluster, select the cluster of interest (for instance Cluster 8), and press the 'Plot scalp maps' option in the right pop_clustedit() column.
Note: Channels missing from any of the datasets do not affect clustering or visualization of cluster scalp maps. Component scalp maps are interpolated by the toporeplot() function, avoiding the need to restrict STUDY datasets to a common 'always clean' channel subset or to perform 'missing channel' interpolation on individual datasets.
![]()
You may also plot scalp maps for individual components in the cluster by selecting components in the right column and then pressing 'Plot scalp maps' (not shown).
A good way to visualize all the average cluster measures at once is to first select a cluster of interest from the cluster list on the left (e.g., Cluster 8), and then press the 'Plot cluster properties' push button. The left figure below presents the Cluster-8 mean scalp map (same for both conditions), average ERP and spectrum (for these, the two conditions are plotted in different colors), average ERSP and ITC (grand means for both conditions; the individual conditions may be plotted using the 'Plot cluster properties' push button). The 3-D plot on the bottom left presents the locations of the centroid dipole (red) and individual component equivalent dipoles (blue) for this cluster.
![]()
Quickly recognizing the nature of component clusters from their activity features requires experience. Here Cluster 8 accounts for some right occipital alpha activity -- note the strong 10-Hz peak in the activity spectra. The cluster ERPs show a very slow (1-Hz) pattern peaking at the appearance of first words of the word pairs (near time -1 s). The apparent latency shift in this slow wave activity between the two conditions may or may not be significant. A positive (though still quite low, 0.06) ITC follows the appearance of the first word in each word pair (see Experimental Design), indicating that quite weakly phase-consistent theta-band EEG activity follows first word onsets. Finally, blocking of spectral power from 7 Hz to at least 25 Hz appears just after onset of the second words of word pairs (at time 0) in the grand mean ERSP plot (blue region on top right).
To review all "Cluster 8" component dipole locations, press the 'Plot dipoles' button in the left column. This will open the plot viewer showing all the cluster component dipoles (in blue), plus the cluster mean dipole location (in red). You may scroll through the dipoles one by one, rotating the plot in 3-D or selecting among the three cardinal views (lower left buttons), etc. Information about them will be presented in the left center side bar (see the image below).
![]()
As for the scalp maps, the pop_clustedit() gui will separately plot the cluster ERPs, spectra, ERSPs or ITCs. Let us review once more the different plotting options for the data spectrum. Pressing the 'Plot spectra' button in the left column will open a figure showing the two mean condition spectra below.
![]()
Pressing the 'Plot spectra' button in the right column with "All components" selected in the left column will open a figure displaying for each condition all cluster component spectra plus (in bold) their mean.
![]()
Finally, to plot the condition spectra for individual cluster components, select one component from the 'Select component(s) to plot' list on the right and press 'Plot spectra' in the right column. For example, selecting component 37 from subject S02 (S02 IC37) will pop up the figure below. Here, the single component spectra are shown in light blue, together with the mean cluster spectrum (in black).
![]()
VI.2.6. Editing clusters
The results of clustering (by either the 'k-means' or 'Neural network' methods) can also be updated manually in the cluster viewing and editing window described in the previous section (called from menu item Study > Edit/plot clusters). These editing options allow flexibility for adjusting the clustering. Components can be reassigned to different clusters, clusters can be merged, new clusters can be created, and 'outlier' components can be rejected from a cluster. Note that if you make changes via the pop_clustedit() gui and then wish to cancel them, pressing the Cancel button will discard the STUDY changes.
![]()
Renaming a cluster: The 'Rename selected cluster' option allows you to rename any cluster using a (mnemonic) name of your choice. Pressing this option button opens a pop-up window asking for the new name of the selected cluster. For instance, if you think a cluster contains components accounting for eye blinks you may rename it "Blinks".
Automatically rejecting outlier components: Another editing option is to reject 'outlier' components from a cluster after clustering. An 'outlier' can be defined as a component whose cluster location is more than a given number of standard deviations from the location of the cluster centroid. Note that standard deviation and cluster mean values are defined from the N-dimensional clustering space data created during the pre-clustering process.
For example, if the size of the pre-clustering cluster location matrix is 10 by 200 (for 200 components), then N = 10. The default threshold for outlier rejection is 3 standard deviations. To reject 'outlier' components from a cluster, first select the cluster of interest from the list on the left and then press the 'Auto-reject outlier components' option button. A window will open, asking you to set the outlier threshold. 'Outlier' components selected via either the 'Auto-reject outlier components' option or the 'Remove selected outlier component(s)' option (below) are moved from the current cluster '[Name]' to a cluster named 'Outlier [Name]'.
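The thresholding idea can be sketched as follows. This is plain Python for illustration, not EEGLAB code, and it rests on an assumption the text does not spell out: here the 'standard deviation' is taken to be the root-mean-square distance of the cluster's components from the cluster centroid in the N-dimensional clustering space. EEGLAB's own rule may differ in detail. The function name and the example points are hypothetical.

```python
import math

def find_outliers(points, threshold=3.0):
    """Return indices of points lying more than `threshold` spreads
    from the centroid. Each point is one component's N-dim location."""
    n_dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(n_dims)]
    distances = [math.dist(p, centroid) for p in points]
    # Assumed spread measure: RMS distance of all members from the centroid.
    spread = math.sqrt(sum(d * d for d in distances) / len(distances))
    return [i for i, d in enumerate(distances) if d > threshold * spread]

# Hypothetical 2-D cluster: 20 components on a unit circle plus one far away.
points = [(math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
          for k in range(20)]
points.append((30.0, 0.0))

outliers = find_outliers(points)
print(outliers)
```

Note that a distant component also pulls the centroid and inflates the spread, so with few components in a cluster even an obvious outlier may not exceed a 3-standard-deviation threshold. This is one reason the manual removal option described next can be useful.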
Removing selected outlier components manually: Another way to remove 'outlier' components from a cluster is to do so manually. This option allows you to de-select seeming 'outlier' components irrespective of their distance from the cluster mean. To manually reject components, first select the cluster of interest from the list on the left, then select the desired 'outlier' component(s) from the component list on the right, then press the 'Remove selected outlier comps.' button. A confirmation window will pop up.
Creating a new cluster: To create a new empty cluster, press the 'Create new cluster' option; this opens a pop-up window asking for a name for the new cluster. If no name is given, the default name is 'Cls #', where '#' is the next available cluster number. For changes to take place, press the 'OK' button in the pop-up window. The new empty cluster will appear as one of the clusters in the list on the left of the editing/viewing cluster window.
Reassigning components to clusters: To move components between any two clusters, first select the origin cluster from the list on the left, then select the components of interest from the component list on the right, and press the 'Reassign selected component(s)' option button. Select the target cluster from the list of available clusters.
Saving changes: As with other pop_ functions, you can save the updated STUDY set to disk, either overwriting the current version - by leaving the default file name in the text box - or by entering a new file name.
VI.2.7. Hierarchic sub-clustering
The clustering tools also allow you to perform hierarchic sub-clustering. For example, imagine clustering all the components from all the datasets in the current STUDY (i.e., the Parent Cluster) into two clusters. One cluster turns out to contain only artifactual non-EEG components (which you thus rename the 'Artifact' cluster) while the other contains non-artifactual EEG components (thus renamed 'Non-artifact').
NOTE: This is only a quite schematic example for tutorial purposes: It may normally not be possible to separate all non-brain artifact components from cortical non-artifact components by clustering all components into only two clusters -- there are too many kinds of scalp maps and artifact activities associated with the various classes of artifacts!
![]()
At this point, we might try further clustering only the 'Artifact' cluster components, e.g., attempting to separate eye movement processes from heart and muscle activity processes, etc. A possible (but probably again not realistic) further decomposition is shown schematically above.
In this case, the parent of the identified eye movement artifact cluster is the 'Artifact' cluster; the child cluster of 'eye' artifacts itself has no child clusters. On the other hand, clustering the 'Non-artifact' cluster produces three child clusters which, presumably after careful inspection, we decide to rename 'Alpha', 'Mu' and 'Other'.
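The parent/child relationships in this schematic example can be written down as a small tree. The sketch below is plain Python for illustration only; it does not reproduce the layout of the actual STUDY.cluster structure, and the child names 'Eye', 'Heart' and 'Muscle' are hypothetical labels for the artifact sub-clusters mentioned above.

```python
# Each cluster records its parent and its child clusters (names only).
clusters = {
    "ParentCluster": {"parent": None, "children": ["Artifact", "Non-artifact"]},
    "Artifact": {"parent": "ParentCluster", "children": ["Eye", "Heart", "Muscle"]},
    "Non-artifact": {"parent": "ParentCluster", "children": ["Alpha", "Mu", "Other"]},
    "Eye": {"parent": "Artifact", "children": []},
    "Heart": {"parent": "Artifact", "children": []},
    "Muscle": {"parent": "Artifact", "children": []},
    "Alpha": {"parent": "Non-artifact", "children": []},
    "Mu": {"parent": "Non-artifact", "children": []},
    "Other": {"parent": "Non-artifact", "children": []},
}

def leaves(name):
    """Return the clusters under `name` that have no child clusters."""
    children = clusters[name]["children"]
    if not children:
        return [name]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

def ancestors(name):
    """Return the chain of parents from `name` up to the root."""
    chain = []
    parent = clusters[name]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = clusters[parent]["parent"]
    return chain

print(leaves("ParentCluster"))
print(ancestors("Eye"))
```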
To refine a clustering in this way, simply follow the same sequence of events as described above. Call the pre-clustering tools using menu item Study > Build preclustering array. You may now select a cluster to refine. In the example above, we notice that there seem to be components with different spectra in Cluster 8, so let us try to refine the clustering based on differences among the Cluster-8 component spectra.
![]()
Select the Study > Cluster components menu item. Leave all the defaults (e.g., 2 clusters) and press OK.
![]()
The visualization window will then pop up with two new clusters under Cluster 8.
![]()
Below, the component spectra for Cluster 13 are plotted. Note that Cluster-13 components have relatively uniform spectra.
![]()
You may also plot the scalp map of Cluster 13. All the cluster components account for occipital activity and have similarly located equivalent dipole sources.
![]()
VI.2.8. Editing STUDY datasets
As mentioned above, selecting an individual dataset from the Datasets menu allows editing individual datasets in a STUDY. Note, however, that creating new datasets or removing datasets will also remove the STUDY from memory since the study must remain consistent with datasets loaded in memory (here, however, EEGLAB will prompt you to save the study before it is deleted).
EEGLAB (v5.0b) also allows limited parallel processing of datasets of a STUDY in addition to computing clustering measures during pre-clustering. You may, for instance, filter all datasets of a STUDY. To do so, simply select menu item Tools > Filter data. You may also perform ICA decomposition of all datasets by running ICA individually on each of the datasets.
You may also select specific datasets using the Datasets > Select multiple datasets menu item and run ICA on all the datasets concatenated together (the ICA graphic interface contains a self-explanatory checkbox to perform that operation). This is useful, for example, when you have two datasets for two conditions from a subject that were collected in the same session, and want to perform ICA decomposition on their combined data. Using this option, you do not have to concatenate the datasets yourself; EEGLAB will do it for you.
VI.3. STUDY data visualization tools
EEGLAB (v6-) now allows visualization of data properties (ERP, power spectrum, ERSP and ITC) using an interface similar to the one used for ICA components.
Description of experiment for this part of the tutorial: These data were acquired by Arnaud Delorme and colleagues from fourteen subjects. Subjects were presented with pictures that either contained or did not contain an animal image. Subjects responded with a button press whenever the picture presented contained an animal. These data are available for download here (380 Mb). A complete description of the task, the raw data (4Gb), and some Matlab files to process it, are all available here.
We have used these data in the following two sections since the released cluster tutorial data used in previous sections are too sparse to allow computing statistical significance. However, for initial training, you might better use that much smaller example STUDY.
Before plotting the channel measures, you must precompute them using the Study > Precompute measures menu item as shown below.
![]()
It is recommended that for clustering on channel data you first interpolate missing channels. Automated interpolation in EEGLAB is based on channel names. If datasets have different channel locations (for instance if the locations of the channels were scanned), you must interpolate missing channels for each dataset from the command line using eeg_interp(). Select all the measures above, or just one if you want to experiment. The channel ERPs have been included in the tutorial dataset; if you select ERPs, they will not be recomputed -- unless you also check the box "Recompute even if present on disk".
After precomputing the channel measures, you may now plot them, using menu item Study > Plot channel measures.
![]()
Here we only illustrate the behavior of pop_chanplot() for plotting ERPs. Spectral and time/frequency (ERSP/ITC) measure data for scalp channels may be plotted in a similar manner, as shown in the previous section on component clustering. To plot data for a single scalp channel ERP, press the Plot ERPs pushbutton on the left column. A plot like the one below should appear:
![]()
You may plot all subjects' ERPs by pressing the Plot ERPs pushbutton in the right column, obtaining a figure similar to the one below.
![]()
Finally, you may also plot all scalp channels simultaneously. To do this, simply click the push button Sel. all to select all data channels. Then again press the Plot ERPs button in the right column.
![]()
Clicking on individual ERPs will make a window plotting the selected channel ERP pop up. Many other plotting options are available in the central column of the pop_chanplot() gui. These will be described in the next section.
VI.4. Study statistics and visualization options
Computing statistics is essential to observation of group, session, and/or condition measure differences. EEGLAB allows users to use either parametric or non-parametric statistics to compute and estimate the reliability of these differences across conditions and/or groups. The same methods for statistical comparison apply both to component clusters and to groups of data channels (see following). Here, we will briefly review, for each measure (ERP, power spectrum, ERSP, ITC), how to compute differences across the two conditions in a STUDY. At the end, we will show examples of more complex analyses involving 3 groups of subjects and 2 conditions. We will also briefly describe the characteristics of the function that performs the statistical computations, and discuss how to retrieve the p-values for further processing or publication.
VI.4.1. Parametric and non-parametric statistics
EEGLAB allows performing classical parametric tests (paired t-test, unpaired t-test, ANOVA) on ERPs, power spectra, ERSPs, and ITCs. Below, we will use channel ERPs as an example, though in general we recommend that independent component ERPs and other measures be used instead. This is because no data features of interest are actually generated in the scalp, but rather in the brain itself, and in favorable circumstances independent component filtering allows isolation of the separate brain source activities, rather than their mixtures recorded at the scalp electrodes.
For example, on 15 subjects' ERPs in two conditions, EEGLAB functions can perform a simple two-tailed paired t-test at each trial latency on the average ERPs from each subject. If there are different numbers of subjects in each condition, EEGLAB will use an unpaired t-test. If there are more than two STUDY conditions, EEGLAB will use ANOVA instead of a t-test. For spectra, the p-values are computed at every frequency; for ERSPs and ITCs, p-values are computed at every time/frequency point. See the sub-section on component cluster measures below to understand how to perform statistical testing on component measures.
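The selection rule described in this paragraph can be summarized in a short sketch. This is plain Python for illustration, not the EEGLAB function that performs the computation; the function name is hypothetical, and we assume that equal subject counts in two conditions mean the same subjects contributed to both.

```python
def choose_parametric_test(subjects_per_condition):
    """Pick a parametric test from the number of subjects in each condition."""
    n_conditions = len(subjects_per_condition)
    if n_conditions < 2:
        raise ValueError("need at least two conditions to compare")
    if n_conditions > 2:
        return "ANOVA"
    if subjects_per_condition[0] != subjects_per_condition[1]:
        return "unpaired t-test"
    # Assumption: equal counts mean the same subjects appear in both conditions.
    return "paired t-test"

print(choose_parametric_test([15, 15]))
print(choose_parametric_test([15, 12]))
print(choose_parametric_test([15, 15, 15]))
```

Whichever test is chosen, it is applied independently at every latency (ERPs), frequency (spectra), or time/frequency point (ERSPs and ITCs).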
EEGLAB functions can also compute non-parametric statistics. The null hypothesis is that there is no difference among the conditions. In this case, the average difference between the ERPs for two conditions should lie within the distribution of average differences between 'surrogate' grand mean condition ERPs, that is, averages of ERPs from the two conditions whose condition assignments have been shuffled randomly. An example follows:
Given 15 subjects and two conditions, let a1, a2, a3, ... a15 be the scalp channel ERP values (in microvolts) at 300 ms for all 15 subjects in the first condition, and b1, b2, b3, ... b15 the ERP values for the second condition. The grand average ERP condition difference is
d = mean( (a1-b1), (a2-b2), ..., (a15-b15) ).
Now, if we repeatedly shuffle these values between the two conditions (under the null hypothesis that there are no significant differences between them), and then average the shuffled values,
d1 = mean( (b1-a1), (a2-b2), ..., (b15-a15) )
d2 = mean( (a1-b1), (b2-a2), ..., (a15-b15) )
d3 = mean( (b1-a1), (b2-a2), ..., (a15-b15) )
...
we then obtain a distribution of surrogate condition-mean ERP values dx constructed using the null hypothesis (see their smoothed histogram below). If we observe that the initial d value lies in the very tail of this surrogate value distribution, then the supposed null hypothesis (no difference between conditions) may be rejected as highly unlikely, and the observed condition difference may be said to be statistically valid or significant.
![]()
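The shuffling procedure above can be sketched in a few lines. This is plain Python for illustration, not the EEGLAB implementation, and the ERP values are made up. Swapping a subject's two condition values simply flips the sign of that subject's difference, so instead of drawing random shuffles this sketch enumerates every possible sign pattern (2^15 = 32768 for 15 subjects), which makes the result exact and reproducible. With more subjects, random sampling of the shuffles would be needed instead.

```python
from itertools import product

def paired_permutation_test(a, b):
    """Return (observed mean difference, two-tailed p-value).

    Enumerates all ways of swapping condition labels within subjects.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        surrogate = sum(s * d for s, d in zip(signs, diffs)) / n
        # Count surrogates at least as far from zero as the observed value
        # (small tolerance guards against floating-point rounding).
        if abs(surrogate) >= abs(observed) - 1e-12:
            extreme += 1
        total += 1
    return observed, extreme / total

# Hypothetical ERP values (microvolts) at 300 ms for 15 subjects.
a = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 1.9, 3.8, 2.2, 2.7, 3.0, 2.4, 3.3, 1.7, 2.8]
b = [1.9, 2.8, 2.0, 2.1, 2.6, 2.7, 1.5, 3.0, 2.0, 2.2, 2.4, 2.5, 2.6, 1.6, 2.1]

d, p_value = paired_permutation_test(a, b)
print(f"d = {d:.3f} microvolts, p = {p_value:.5f}")
```

The p-value is the fraction of surrogate values at least as extreme as the observed d, i.e., how far into the tail of the surrogate distribution the observed difference lies.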
Note that the surrogate value distribution above can take any shape and does not need to be gaussian. In practice, we do not compute the mean condition ERP difference, but its t-value (the mean difference divided by the standard deviation of the difference and multiplied by the square root of the number of observations less one). The result is equivalent to using the mean difference. The advantage is that when we have more conditions, we can use the comparable ANOVA measure. Computing the probability density distribution of the t-test or ANOVA is only a "trick" to be able to obtain a difference measure across all subjects and conditions. It has nothing to do with relying on a parametric t-test or ANOVA model, which assume underlying gaussian value distributions. Note that only measures that are well-behaved (e.g., are not highly non-linear) should be used in this kind of non-parametric testing.
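The t-value formula given in words above can be checked numerically. The sketch below is plain Python for illustration, with hypothetical function names. It assumes that "standard deviation" in the formula means the population standard deviation (normalized by n); under that reading the formula matches the usual textbook paired t-value, which uses the sample standard deviation (normalized by n-1) divided by the square root of n.

```python
import math
import statistics

def paired_t(a, b):
    """t-value as worded above: mean / population SD * sqrt(n - 1)."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / statistics.pstdev(d) * math.sqrt(len(d) - 1)

def paired_t_textbook(a, b):
    """Textbook form: mean / (sample SD / sqrt(n))."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Hypothetical values for four subjects in two conditions.
a = [2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 1.0, 1.0]
print(paired_t(a, b), paired_t_textbook(a, b))
```

In a permutation test this t-value is simply recomputed for each shuffled surrogate, and the observed t-value is compared with the resulting surrogate distribution.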
We suggest consulting a relevant statistics book for more details: An introduction to statistics written by one of us (AD) is available here. A good textbook on non-parametric statistics is Rand Wilcox, "Introduction to robust estimation and hypothesis testing", Elsevier, 2005.
Below we illustrate the use of these options on scalp channel ERPs, though they apply to all measures and are valid for both scalp channels and independent component (or other) source clusters.
VI.4.2. Options for computing statistics on and plotting results for scalp channel ERPs
Call again menu item Study > Plot channel measures. In the middle of the two Plot ERPs buttons, click on the Params pushbutton. The following graphic interface pops up:
![]()
Check the Plot conditions on the panel checkbox to plot both conditions on the same figure. Click on the Compute condition statistics checkbox. Note that, since there are no groups in this study, the Compute group statistics option is not available. Enter 0.01 for the threshold p value as shown above. Press OK, then select channel "Fz" in the left column and press the Plot ERPs pushbutton in the same column. The following plot appears. The black rectangles under the ERPs indicate regions of significance (p < 0.01).
![]()
To show the actual p-values, call back the ERP parameter interface and remove the entry in the Threshold edit box. You may also uncheck Plot conditions on the panel to plot conditions on different panels.
![]()
Press OK. Now click again Plot ERPs; the following figure pops up.
![]()
All of the plots above reported parametric statistics. While parametric statistics might be adequate for exploring your data, it is better to use permutation-based statistics (see above) to plot final results. Call back the graphic interface and select Permutation as the type of statistics to use, as shown below.
![]()
Below, we will use non-parametric statistics for all data channels. Click on the Sel. all pushbutton in the channel selection interface, and then push the Plot ERPs button. The shaded areas behind the ERPs indicate the regions of significance.
![]()
A command line function, std_envtopo(), can also visualize cluster contributions to the grand mean ERP in one or two conditions, and to their difference. See details below.
Finally, for data channels (but not for component clusters) an additional option is available to plot ERP scalp maps at specific latencies. Using the graphic interface once again, enter "200" as the epoch latency (in ms) to plot.
![]()
Select all channels (or a subset of channels) and press the Plot ERPs button. The following figure appears.
![]()
There are options in the gui above we have not discussed -- they should be self-explanatory. For instance, the Time range to plot edit box allows plotting a shorter time range than the full epoch. The Plot limits edit box allows setting fixed lower and upper limits to the plotted potentials. Finally, the Display filter edit box allows entering a low-pass cutoff frequency (for instance 20 Hz) used to smooth the plotted ERP. This filter is only applied for the ERP display and does not affect computation of the statistics. This option is useful when plotting noisy ERPs for single subjects.
The graphic interfaces for both power spectral and ERSP/ITC measures are similar to that used for ERPs and need not be described in detail. You may refer to the relevant function help messages for more detail. Below, we describe plotting results for a more complex STUDY containing both subject groups and experimental conditions.
VI.4.3. Computing statistics for studies with multiple groups and conditions
This functionality is under development and, although we believe it is working properly, we are planning to update the graphic interface to make it more understandable. The problem with the current interface is that it shows the ANOVA interaction term along with marginal statistics for groups and conditions (in a future release we plan to let users toggle between marginal statistics and statistical main effects). Here we just show one plot obtained from a study of three patient groups of 16 subjects each and two experimental conditions (KAN, representing responses to the appearance of Kanizsa triangles, and NONKAN, representing responses to the appearance of inverted Kanizsa triangles; courtesy of Rael Cahn). Selecting only condition statistics and plotting conditions on the same panel returns the figure below.
![]()
VI.5. EEGLAB study data structures
VI.5.1. The STUDY structure
This section gives details of EEGLAB structures necessary for custom Matlab script, function, and plug-in writing.
The STUDY structure contains information for each of its datasets, plus additional information to allow processing of all datasets simultaneously. After clustering the independent components identified for clustering in each of the datasets, each of the identified components in each dataset is assigned to one component cluster in addition to Cluster 1 which contains all components identified for clustering. The STUDY structure also contains the details of the component clusters.
Below is a prototypical STUDY structure. In this tutorial, the examples shown were collected from analysis of a sample studyset comprising ten datasets, two conditions from each of five subjects. After loading a studyset (see previous sections, or as described below) using the function pop_loadstudy(), typing STUDY on the Matlab command line will produce results like this:
>> STUDY
STUDY =
           name: 'N400STUDY'
       filename: 'OctN400ClustEditHier.study'
       filepath: '/eeglab/data/N400/'
    datasetinfo: [1x10 struct]
        session: []
        subject: {1x5 cell}
          group: {'old' 'young'}
      condition: {'non-synonyms' 'synonyms'}
         setind: [2x5 double]
        cluster: [1x40 struct]
          notes: ' '
           task: 'Auditory task: Synonyms Vs. Non-synonyms, N400'
        history: [1x4154 char]
            etc: [1x1 struct]
The field STUDY.datasetinfo is an array of structures whose length is the number of datasets in the STUDY. Each structure stores information about one of the datasets, including its subject, condition, session, and group labels. It also includes a pointer to the dataset file itself (as explained below in more detail).
STUDY.datasetinfo sub-fields 'subject', 'session', 'group' and 'condition' label the subject, subject group, session, and condition that is associated with each dataset in the study. This information must be provided by the user when the STUDY structure is created. Otherwise, default values are assumed.
The STUDY.setind field holds the indices of the datasetinfo cells, ordered in a 2-D matrix in the format [conditions by (subjects x sessions)].
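To make the layout of such an index matrix concrete, here is a small sketch in plain Python (for illustration only; it is not how EEGLAB builds the field). The condition names are taken from the STUDY printout above; the subject codes, the dataset ordering, and the single-session simplification are our assumptions. Indices are 1-based to match Matlab.

```python
def build_setind(datasetinfo, conditions, subjects):
    """Return a [conditions x subjects] matrix of 1-based dataset indices."""
    setind = [[None] * len(subjects) for _ in conditions]
    for index, info in enumerate(datasetinfo, start=1):  # 1-based, as in Matlab
        row = conditions.index(info["condition"])
        col = subjects.index(info["subject"])
        setind[row][col] = index
    return setind

conditions = ["non-synonyms", "synonyms"]
subjects = ["S01", "S02", "S03", "S04", "S05"]  # hypothetical subject codes

# Hypothetical ordering: both conditions of one subject, then the next subject.
datasetinfo = [{"subject": s, "condition": c}
               for s in subjects for c in conditions]

setind = build_setind(datasetinfo, conditions, subjects)
for row in setind:
    print(row)
# prints [1, 3, 5, 7, 9] then [2, 4, 6, 8, 10]
```

With ten datasets, two conditions and five subjects, the result is a 2-by-5 matrix, which matches the size shown for setind in the STUDY printout above.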
The STUDY.cluster field is an array of cluster structures, initialized when the STUDY is created and updated after clustering is performed (as explained below in more detail).
The STUDY.history field is equivalent to the 'history' field of the EEG structure. It stores all the commandline calls to the functions from the gui. For basic script writing using command history information, see the EEGLAB script writing tutorial.
The STUDY.etc field contains internal information that helps manage the use of the STUDY structure by the clustering functions. In particular, pre-clustering data are stored there before clustering is performed.
The STUDY.datasetinfo field is used for holding information on the datasets that are part of the study. Below is an example datasetinfo structure, one that holds information about the first dataset in the STUDY:
>> STUDY.datasetinfo(1)
This information was posted when the STUDY was created by the user.
The datasetinfo.filepath and datasetinfo.filename fields give the location of the dataset on disk.
The datasetinfo.subject field attaches a subject code to the dataset. Note: Multiple datasets from the same subject belonging to a STUDY must be distinguished as being in different experimental conditions and/or as representing different experimental sessions.
The datasetinfo.group field attaches a subject group label to the dataset.
The datasetinfo.condition and datasetinfo.session fields hold dataset condition and session labels. If the condition field is empty, all datasets are assumed to represent the same condition. If the session field is empty, all datasets in the same condition are assumed to have been recorded in different sessions.
The datasetinfo.index field holds the dataset index in the ALLEEG vector of currently loaded dataset structures. (Note: Currently, datasetinfo.index = 1 must correspond to ALLEEG(1) (typically, the first dataset loaded into EEGLAB), datasetinfo.index = 2 to ALLEEG(2), etc. This constraint will be removed in the future.)
The datasetinfo.comps field holds the indices of the components of the dataset that have been designated for clustering. When it is empty, all of the dataset's components are to be clustered.
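As a rough illustration of how these fields fit together, the sketch below walks over a mock datasetinfo array and resolves the "empty comps means all components" rule. The second file name and the number of components per dataset (nComps) are hypothetical values chosen for the example.

```matlab
% Mock datasetinfo with the fields described above (values are illustrative).
datasetinfo = struct( ...
    'filepath',  {'/eeglab/data/N400/S01/', '/eeglab/data/N400/S01/'}, ...
    'filename',  {'syn01-S253-clean.set', 'syn01-S254-clean.set'}, ...
    'subject',   {'S01', 'S01'}, ...
    'condition', {'synonyms', 'non-synonyms'}, ...
    'index',     {1, 2}, ...
    'comps',     {[], [1 3 5]});

nComps = 31;                               % hypothetical number of ICA components
nSelected = zeros(1, numel(datasetinfo));  % components designated for clustering

for k = 1:numel(datasetinfo)
    d = datasetinfo(k);
    if isempty(d.comps)
        comps = 1:nComps;   % empty comps: every component is to be clustered
    else
        comps = d.comps;    % otherwise only the listed components
    end
    nSelected(k) = numel(comps);
    fprintf('%s (%s): ALLEEG(%d), %d components, file %s\n', ...
        d.subject, d.condition, d.index, nSelected(k), ...
        fullfile(d.filepath, d.filename));
end
```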
The STUDY.cluster sub-structure stores information about the clustering methods applied to the STUDY and the results of clustering. Components identified for clustering in each STUDY dataset are each assigned to one of the several resulting component clusters. Ideally, different clusters have spatially and/or functionally distinct origins and dynamics in the recorded data. For instance, one component cluster may account for eye blinks, another for eye movements, a third for central posterior alpha band activities, etc.
Each of the clusters is stored in a separate element of the STUDY.cluster array, namely STUDY.cluster(2), STUDY.cluster(3), etc. The first cluster, STUDY.cluster(1), is composed of all components from all datasets that were identified for clustering. It was created when the STUDY was created and is not a result of clustering; it is the 'ParentCluster'. This cluster does not contain those components whose equivalent dipole models exhibit a high percent residual variance from the component's scalp map. These components have been excluded from clustering. Typing STUDY.cluster at the Matlab command line returns the list of cluster sub-fields.
All this information (including the clustering results) may be edited manually from the command line, or by using the interactive function pop_clustedit(). Use of this function is explained above (see Editing clusters).
The cluster.name sub-field of each cluster is initialized according to the cluster number, i.e., its index in the cluster array (for example: 'cls 2', 'cls 3', etc.). These cluster names may be changed to any (e.g., more meaningful) names by the user via the command line or via the pop_clustedit() interface.
The cluster.comps and cluster.sets fields describe which components belong to the current cluster: cluster.comps holds the component indices and cluster.sets the indices of their respective datasets. Note that several datasets may use the same component weights and scalp maps, for instance two datasets containing data from different experimental conditions for the same subject and collected in the same session, therefore using the same ICA decomposition.
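The pairing between cluster.comps and cluster.sets can be read off with a short loop. The sketch below reuses the comps and sets values of the 'artifacts' example cluster in this section and assumes the simple case in which sets is a single row (one dataset per component); the variable name clust is arbitrary.

```matlab
% comps/sets values taken from the 'artifacts' example cluster in this section.
clust.comps = [6 10 15 23 1 5 20 4 8 11 17 25 3 4 12];
clust.sets  = [1 1 1 1 2 2 2 3 3 3 3 3 4 4 4];

% Group the component indices by the dataset they come from.
datasets    = unique(clust.sets);
compsPerSet = cell(1, numel(datasets));
for k = 1:numel(datasets)
    compsPerSet{k} = clust.comps(clust.sets == datasets(k));
    fprintf('dataset %d: components %s\n', datasets(k), mat2str(compsPerSet{k}));
end
```

The same component index may appear more than once (component 4 occurs for datasets 3 and 4) because indices are only meaningful within their own dataset.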
The cluster.preclust sub-field is a sub-structure holding pre-clustering information for the components contained in the cluster. This sub-structure includes the pre-clustering method(s), their respective parameters, and the resulting pre-clustering PCA data matrices (for example, mean component ERPs, ERSPs, and/or ITCs in each condition). Additional information about the preclust sub-structure is given in the following section, in which its use in further (hierarchic) sub-clustering is explained.
The cluster.centroid field holds the cluster measure centroids for each measure used to cluster the components (e.g., the mean or centroid of the cluster component ERSPs, ERPs, ITCs, power spectra, etc. for each STUDY condition), plus measures not employed in clustering but available for plotting in the interactive cluster visualization and editing function, pop_clustedit().
The cluster.algorithm sub-field holds the clustering algorithm chosen (for example 'kmeans') and the input parameters that were used in clustering the pre-clustering data.
The cluster.parent and cluster.child sub-fields are used in hierarchical clustering (see hierarchic clustering). The cluster.child sub-field contains indices of any clusters that were created by clustering on components from this cluster (possibly, together with additional cluster components). The cluster.parent field contains the index of the parent cluster.
The cluster.erpdata field contains the grand average ERP data for the component cluster. For instance, for two conditions (as shown in the next example), this cell array will contain two elements of size [750x7] (for 750 time points and 7 components in the cluster). This cell array may be given as input to the statcond() function to compute statistics. Note that when both groups and conditions are present, the cell array may be of size { 2x3 } for 2 conditions and 3 groups, each element containing the ERPs for all the components of the cluster. The cluster.erptimes field contains the time point latencies in ms.
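To make the layout of cluster.erpdata concrete, the sketch below builds a mock two-condition cell array with [750x7] elements and averages over components. The waveform, the latency vector, and the constant offset between conditions are all made up for illustration; with EEGLAB available, a cell array of this shape is what would be handed to statcond(), which this sketch does not call.

```matlab
nTimes = 750;   % time points, as in the example sizes above
nComps = 7;     % components in the cluster

erptimes = linspace(-1000, 1996, nTimes)';   % hypothetical latencies in ms
base     = sin(2 * pi * erptimes / 1000);    % mock ERP waveform (750x1)
gain     = linspace(0.5, 1.5, nComps);       % mock per-component scaling (1x7)

% One cell per condition, each [time points x components].
erpdata = { base * gain; base * gain + 1 };  % second condition shifted by 1

% Cluster-mean ERP in each condition (average over components).
condMean = cellfun(@(x) mean(x, 2), erpdata, 'UniformOutput', false);

% Difference wave between the two conditions.
diffWave = condMean{2} - condMean{1};
fprintf('mean condition difference: %.3f\n', mean(diffWave));
```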
The cluster.specdata field contains spectrum data stored in a form similar to the one described above, and the cluster.specfreqs array contains the frequencies at which the spectrum was computed.
The cluster.erspdata field contains the ERSP for each individual component of the cluster. For instance, if this cell array contains arrays of size [50x60x7], this means that the ERSP has 50 frequencies and 60 time points and that 7 components are present for each condition. cluster.erspfreqs and cluster.ersptimes contain the frequency values and time latencies for the ERSP. The cluster.itcdata and other ITC arrays are structured in the same way.
The cluster.topo field contains the average topography of a component cluster. Its size is 67x67 and the coordinates of the pixels are given by cluster.topox and cluster.topoy (both of size [1x67]). This contains the interpolated activity on the scalp, so that different subjects having scanned electrode positions may be visualized on the same topographic plot. The cluster.topoall cell array contains one element for each component and each condition. The cluster.topopol field is an array of -1 and 1 values containing the polarity for each component. Component polarities are not fixed (inverting both a component's activity and its scalp map does not modify the result of the ICA computation). The topographic polarity is also taken into account when displaying component ERPs.
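The polarity remark can be checked numerically: flipping the sign of a component's scalp map together with its activation leaves the back-projected data unchanged. The matrices below are small made-up examples (3 channels, 2 components, 4 time points), and the variable names are chosen for this sketch only.

```matlab
A = [1 2; 3 4; 5 6];            % scalp maps as columns (channels x components)
S = [1 0 -1 2; 0.5 1 1.5 -2];   % activations (components x time points)
X = A * S;                      % back-projected channel data

topopol = [1 -1];               % polarity vector: flip the second component

A2 = A * diag(topopol);         % flip the scalp map of component 2
S2 = diag(topopol) * S;         % flip the activation of component 2
X2 = A2 * S2;                   % back-projection after the flip

fprintf('max difference after flipping: %g\n', max(abs(X(:) - X2(:))));
```

Because the two sign changes cancel in the product, a polarity vector such as topopol can be used to align maps and ERPs across components without altering what the decomposition says about the data.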
Finally, the cluster.dipole structure contains the average localization of the component cluster. This structure is the same as a single element of the model structure for a given dipole (see DIPFIT2 tutorial).
Continuing with the hierarchic design introduced briefly above (in Hierarchic Clustering), suppose that Cluster 2 ('artifacts') comprises 15 components from four of the datasets. The cluster structure will have the following values:
>> STUDY.cluster(2)
These field values indicate that this cluster has no parent cluster other than the ParentCluster (as always, Cluster 1), but has three child clusters (Clusters 4, 5, and 6). It was created by the 'Kmeans' algorithm, and the requested number of clusters was 2.
Preclustering details are stored in the STUDY.cluster(2).preclust sub-structure (not shown above but detailed below). For instance, in this case, the cluster.preclust sub-structure may contain the PCA-reduced mean activity spectra (in each of the two conditions) for all 15 components in the cluster.
The cluster.preclust sub-structure contains several fields, for example:
>> STUDY.cluster(2).preclust
The preclustparams field holds an array of cell arrays. Each cell array contains a string that indicates which component measure was used in the clustering (e.g., component spectra ('spec'), component ERSPs ('ersp'), etc.), as well as parameters relevant to the measure. In this example there is only one cell array, since only one measure (the power spectrum) was used in the clustering. For example:
>> STUDY.cluster(1).preclust.preclustparams
The data measures used in the clustering were the component spectra in a given frequency range ('freqrange' [3 25]); the spectra were reduced to 10 principal dimensions ('npca' [10]), normalized ('norm' [1]), and each given a weight of 1 ('weight' [1]). When more than one measure is used for clustering, preclustparams will contain several cell arrays.
The preclust.preclustdata field contains the data given to the clustering algorithm ('Kmeans'). Its size is the number of ICA components (15) by the number of retained principal components of the spectra (10) shown above. To prevent redundancy, only the measure values of the 15 components in the cluster were left in the data. The measure data of the other components were retained in the other clusters.
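The shape of preclustdata follows from the pre-clustering parameters. The sketch below reduces mock spectra for 15 components to 10 principal dimensions with an SVD-based PCA, then normalizes and weights the result. The spectra are synthetic, and the normalization shown (dividing by the standard deviation of the first principal dimension) is an assumption made for this sketch; EEGLAB's own pre-clustering code may center, normalize, and weight differently.

```matlab
nComps = 15;     % components in the cluster
freqs  = 3:25;   % frequency bins within 'freqrange' [3 25]
npca   = 10;     % 'npca' [10]
weight = 1;      % 'weight' [1]

% Synthetic, deterministic spectra (components x frequencies).
spectra = cos((1:nComps)' * freqs * 0.37);

% Remove the mean across components, then project onto principal directions.
centered   = spectra - repmat(mean(spectra, 1), nComps, 1);
[U, Sv, V] = svd(centered, 'econ');
scores     = U(:, 1:npca) * Sv(1:npca, 1:npca);   % 15x10 PCA scores

% Assumed normalization: unit standard deviation along the first dimension.
scores = scores / std(scores(:, 1));

preclustdata = weight * scores;
disp(size(preclustdata))
```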
The preclust.preclustcomps field is a cell array of size (nsubjects x nsessions) in which each cell holds the components clustered (i.e., all the components of the parent cluster).
The STUDY.changrp sub-structure is the equivalent of the STUDY.cluster structure for data channels. There are usually as many elements in STUDY.changrp as there are data channels. Each element of STUDY.changrp corresponds to one data channel and groups the information for this channel across all subjects. For instance, after precomputing channel measures, typing STUDY.changrp may return
The changrp.name field contains the name of the channel (e.g., 'FP1'). The changrp.channels field contains the channels in this group. This is because a group may contain several channels (for instance, for computing measures like the ERP across a group of channels, or for computing the RMS across all data channels; note that these features are not yet completely supported in the GUI). The changrp.chaninds array contains the index of the channel in each dataset. If no channels are missing in any of the datasets, this array contains a constant for all datasets (for instance [1 1 1 1 1 1 1 1 1 ...]).
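A minimal sketch of how a chaninds-style array could be derived from per-dataset channel labels is given below. The channel label lists are mock data, and the value 0 used for a dataset that lacks the channel is a placeholder chosen for this sketch, not necessarily the convention EEGLAB uses.

```matlab
% Mock channel labels for three datasets; the third one lacks 'FP1'.
chanlabels = { {'FP1', 'FP2', 'Cz'}, {'FP1', 'FP2', 'Cz'}, {'FP2', 'Cz'} };

changrp.name     = 'FP1';
changrp.channels = {'FP1'};
changrp.chaninds = zeros(1, numel(chanlabels));   % 0 = placeholder for "missing"

for k = 1:numel(chanlabels)
    idx = find(strcmpi(chanlabels{k}, changrp.name), 1);
    if ~isempty(idx)
        changrp.chaninds(k) = idx;   % position of the channel in dataset k
    end
end
disp(changrp.chaninds)
```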
As for the component cluster structure, the changrp.erpdata field contains the grand average ERP data for the given channel. For instance, for two conditions, this cell array will contain two elements of size [750x9] (for 750 time points and 9 subjects). As with the cluster structure, this cell array may also be given as input to the statcond() function to compute statistics.
When both groups and conditions are present, the cell array expands in the same way as its component cluster counterpart: it will be of size { 2x3 } for 2 conditions and 3 groups, each element of the cell array containing the ERP for a given channel for all subjects. The changrp.erptimes field contains the time point latencies in ms. For the spectrum, ERSP and ITC arrays, you may refer to the cluster sub-structure since the organization is identical (except that the last dimensions of elements in the cell arrays 'specdata', 'erspdata', and 'itcdata' contain subjects and not components).
Building a STUDY from the graphic interface (as described in previous sections) calls eponymous Matlab functions that may also be called directly by users. Below we briefly describe these functions. See their Matlab help messages for more information. Functions whose names begin with 'std_' take STUDY and/or EEG structures as arguments and perform signal processing and/or plotting directly on cluster activities.
If a STUDY contains many datasets, you might prefer to write a small script to build the STUDY instead of using the pop_study() GUI. This is also helpful when you need to build many studysets or to repeatedly add files to an existing studyset. Below is a Matlab script calling the GUI-equivalent command line function std_editset() from the "5subjects" folder:
[STUDY ALLEEG] = std_editset( STUDY, [], 'commands', { ...
Above, each line of the command loads a dataset. The last line preselects components whose equivalent dipole models have less than 15% residual variance from the component scalp map. See >> help std_editset for more information.
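When many datasets follow a regular naming scheme, the 'commands' list can be generated in a loop rather than typed out. The sketch below rebuilds the ten 'load' commands and the final 'dipselect' command of the tutorial script from the subject codes and condition labels; the helper variables (subjects, conditions, codes, commands) are names chosen for this sketch. It only constructs the cell array, and the std_editset() call is left as a comment because it requires EEGLAB and the tutorial data.

```matlab
subjects   = {'S02', 'S05', 'S07', 'S08', 'S10'};
conditions = {'synonyms', 'non-synonyms'};
codes      = {'S253', 'S254'};   % file-name code used for each condition

commands = {};
index    = 0;
for c = 1:numel(conditions)
    for s = 1:numel(subjects)
        index = index + 1;
        num   = subjects{s}(2:end);   % e.g. '02' from 'S02'
        file  = sprintf('%s/syn%s-%s-clean.set', subjects{s}, num, codes{c});
        commands{end+1} = { 'index', index, 'load', file, ...
                            'subject', subjects{s}, 'condition', conditions{c} };
    end
end

% Preselect components with less than 15% residual variance.
commands{end+1} = { 'dipselect', 0.15 };

% With EEGLAB on the path and the data in the current folder:
% [STUDY ALLEEG] = std_editset( STUDY, [], 'commands', commands );
fprintf('%d commands built\n', numel(commands));
```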
Once you have created a new studyset (or loaded it from disk), both the STUDY structure and its corresponding ALLEEG vector of resident EEG structures will be variables in the Matlab workspace. Typing >> STUDY on the Matlab command line will list field values:
>> STUDY
STUDY =
To select components of a specified cluster for sub-clustering from the command line, the call to pop_preclust() should have the following format:
>> [ALLEEG, STUDY] = pop_preclust(ALLEEG, STUDY, cluster_id);
where 'cluster_id' is the number of the cluster you wish to sub-cluster (start with Cluster 1 if no other clusters are yet present). Components rejected because of high residual variance (see the help message of the std_editset() function above) will not be considered for clustering.
VI.5.2. The STUDY.datasetinfo sub-structure
ans =
filepath: '/eeglab/data/N400/S01/'
filename: 'syn01-S253-clean.set'
subject: 'S01'
group: 'young'
condition: 'synonyms'
session: []
index: 1
comps: []
VI.5.3. The STUDY.cluster sub-structure
>> STUDY.cluster
ans =
1x23 struct array with fields:
name
parent
child
comps
sets
algorithm [cell]
preclust [struct]
erpdata [cell]
erptimes [array]
specdata [cell]
specfreqs [array]
erspdata [cell]
ersptimes [array]
erspfreqs [array]
itcdata [cell]
itctimes [array]
itcfreqs [array]
topo [2-D array]
topox [array]
topoy [array]
topoall [cell]
topopol [array]
dipole [struct]
ans =
name: 'artifacts'
parent: {'ParentCluster 1'}
child: {'muscle 4' 'eye 5' 'heart 6'}
comps: [6 10 15 23 1 5 20 4 8 11 17 25 3 4 12]
sets: [1 1 1 1 2 2 2 3 3 3 3 3 4 4 4]
algorithm: {'Kmeans' [2]}
preclust: [1x1 struct]
erpdata: { [750x7 double]; [750x7 double] }
erptimes: [750x1 double]
ans =
preclustdata: [15x10 double]
preclustparams: {{1x9 cell}}
preclustcomps: {1x4 cell}
ans =
'spec' 'npca' [10] 'norm' [1] 'weight' [1] 'freqrange' [3 25]
VI.5.4. The STUDY.changrp sub-structure
>> STUDY.changrp
ans =
1x14 struct array with fields:
name
channels
chaninds
erpdata [cell]
erptimes [array]
specdata [cell]
specfreqs [array]
erspdata [cell]
ersptimes [array]
erspfreqs [array]
itcdata [cell]
itctimes [array]
itcfreqs [array]
VI.6. Command line STUDY functions
VI.6.1. Creating a STUDY
{ 'index' 1 'load' 'S02/syn02-S253-clean.set' 'subject' 'S02' 'condition' 'synonyms' }, ...
{ 'index' 2 'load' 'S05/syn05-S253-clean.set' 'subject' 'S05' 'condition' 'synonyms' }, ...
{ 'index' 3 'load' 'S07/syn07-S253-clean.set' 'subject' 'S07' 'condition' 'synonyms' }, ...
{ 'index' 4 'load' 'S08/syn08-S253-clean.set' 'subject' 'S08' 'condition' 'synonyms' }, ...
{ 'index' 5 'load' 'S10/syn10-S253-clean.set' 'subject' 'S10' 'condition' 'synonyms' }, ...
{ 'index' 6 'load' 'S02/syn02-S254-clean.set' 'subject' 'S02' 'condition' 'non-synonyms' }, ...
{ 'index' 7 'load' 'S05/syn05-S254-clean.set' 'subject' 'S05' 'condition' 'non-synonyms' }, ...
{ 'index' 8 'load' 'S07/syn07-S254-clean.set' 'subject' 'S07' 'condition' 'non-synonyms' }, ...
{ 'index' 9 'load' 'S08/syn08-S254-clean.set' 'subject' 'S08' 'condition' 'non-synonyms' }, ...
{ 'index' 10 'load' 'S10/syn10-S254-clean.set' 'subject' 'S10' 'condition' 'non-synonyms' }, ...
{ 'dipselect' 0.15 } });
name: 'N400STUDY'
filename: 'N400empty.study'
filepath: './'
datasetinfo: [1x10 struct]
group: []
session: []
subject: {'S02' 'S05' 'S07' 'S08' 'S10'}
condition: {'non-synonyms' 'synonyms'}
setind: [2x5 double]
cluster: [1x1 struct]
notes: ''
task: 'Auditory task: Synonyms Vs. Non-synonyms, N400'
history: ''
etc: ''
VI.6.2. Component clustering and pre-clustering