
Randomise

randomise is FSL's tool for nonparametric permutation inference on neuroimaging data.

If you use randomise in your research please cite this article:

Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. Permutation inference for the general linear model. NeuroImage, 2014;92:381-397. (Open Access)

Overview

Permutation methods (also known as randomisation methods) are used for inference (thresholding) on statistic maps when the null distribution is not known. The null distribution is unknown either because the noise in the data does not follow a simple distribution, or because non-standard statistics are used to summarize the data. randomise allows modelling and inference using a standard GLM design setup, as used for example in FEAT. It can output voxelwise, cluster-based and TFCE-based tests, and also offers variance smoothing as an option.

Test Statistics in Randomise

randomise produces a test statistic image (e.g., ADvsNC_tstat1, if your chosen output rootname is ADvsNC) and sets of P-value images (stored as 1-P for more convenient visualization, as bigger is then "better"). The table below shows the filename suffixes for each of the different test statistics available.

Voxel-wise uncorrected P-values are generally only useful when a single voxel is selected a priori (i.e., you don't need to worry about multiple comparisons across voxels). The significance of suprathreshold clusters (defined by the cluster-forming threshold) can be assessed either by cluster size or cluster mass. Size is just cluster extent measured in voxels. Mass is the sum of all statistic values within the cluster. Cluster mass has been reported to be more sensitive than cluster size (Bullmore et al, 1999; Hayasaka & Nichols, 2003).
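As a toy illustration of the difference between the two cluster statistics, here is a short sketch (not FSL code) that computes extent and mass for each contiguous suprathreshold run in a 1D statistic array:

```python
# Illustrative only: cluster extent vs. cluster mass for suprathreshold
# runs in a 1D statistic array (real data are 3D, but the idea is the same).

def clusters_1d(stats, thresh):
    """Return (extent, mass) for each contiguous suprathreshold run."""
    out, size, mass = [], 0, 0.0
    for v in list(stats) + [float("-inf")]:  # sentinel flushes the last run
        if v > thresh:
            size += 1       # extent: count of suprathreshold voxels
            mass += v       # mass: sum of statistic values in the cluster
        elif size:
            out.append((size, mass))
            size, mass = 0, 0.0
    return out

tmap = [0.1, 2.5, 3.0, 2.6, 0.2, 4.5, 0.3]
# two clusters: extent 3 (mass ≈ 8.1) and extent 1 (mass 4.5)
print(clusters_1d(tmap, 2.3))
```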

Accounting for Repeated Measures

Permutation tests do not easily accommodate correlated datasets (e.g., temporally smooth timeseries), as such dependence violates null-hypothesis exchangeability. However, the case of "repeated measurements", or more than one measurement per subject in a multisubject analysis, can sometimes be accommodated.

randomise allows the definition of exchangeability blocks, as specified by the group_labels option. If specified, the program will only permute observations within a block, i.e., only observations with the same group label will be exchanged. See the repeated-measures example in the Guide below for more detail.

Confound Regressors

Unlike with the previous version of randomise, you no longer need to treat confound regressors in a special way (e.g. putting them in a separate design matrix). You can now include them in the main design matrix, and randomise will work out from your contrasts how to deal with them. For each contrast, an "effective regressor" is formed using the original full design matrix and the contrast, as well as a new set of "effective confound regressors", which are then pre-removed from the data before the permutation testing begins. One side-effect of the new, more powerful, approach is that the full set of permutations is run for each contrast separately, increasing the time that randomise takes to run.

More information on the theory behind randomise can be found in the Theory section below.

References

The primary reference for randomise, which describes the algorithm for creating permutation tests with the GLM, is:

Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. Permutation inference for the general linear model. NeuroImage, 2014;92:381-397. (Open Access)

For a gentle introduction to permutation inference, see:

Nichols TE, Holmes AP. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Hum Brain Mapp. 2002 Jan;15(1):1-25.

For more details, see:

Anderson MJ, Robinson J. Permutation Tests for Linear Models. Aust N Z J Stat. 2001;43(1):75-88.

Bullmore ET, Suckling J, Overmeyer S, Rabe-Hesketh S, Taylor E, Brammer MJ. Global, voxel, and cluster tests, by theory and permutation, for a difference between two groups of structural MR images of the brain. IEEE Trans Med Imaging. 1999;18(1):32-42.

Freedman D, Lane D. A Nonstochastic Interpretation of Reported Significance Levels. J Bus Econ Stat. 1983;1(4):292.

Hayasaka S, Nichols TE. Validating cluster size inference: random field and permutation methods. Neuroimage. 2003;20(4):2343-2356.

Holmes AP, Blair RC, Watson JD, Ford I. Nonparametric analysis of statistic images from functional mapping experiments. J Cereb Blood Flow Metab. 1996 Jan;16(1):7-22.

Kennedy PE. Randomization Tests in Econometrics. J Bus Econ Stat. 1995;13(1):85-94.

Salimi-Khorshidi G, Smith SM, Nichols TE. Adjusting the effect of nonstationarity in cluster-based and TFCE inference. Neuroimage. 2011;54(3):2006-2019.

Smith SM, Nichols TE. Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage. 2009;44(1):83-98.

Using randomise

A typical simple call to randomise uses the following syntax:

randomise -i <4D_input_data> -o <output_rootname> -d design.mat -t design.con -m <mask_image> -n 500 -D -T

design.mat and design.con are text files containing the design matrix and list of contrasts required; they follow the same format as generated by FEAT (see below for examples). The -n 500 option tells randomise to generate 500 permutations of the data when building up the null distribution to test against. The -D option tells randomise to demean the data before continuing - this is necessary if you are not modelling the mean in the design matrix. The -T option tells randomise that the test statistic that you wish to use is TFCE (threshold-free cluster enhancement - see below for more on this).

When using the demeaning option, -D, randomise will also demean the EVs in the design matrix, providing a warning if they initially had non-zero mean (and as long as this doesn't cause the matrix/contrasts to become rank deficient then this warning can be ignored).

There are two programs that make it easy to create the design matrix, contrast and exchangeability-block files (design.mat / design.con / design.grp). The first is the Glm GUI, which allows the specification of designs in the same way as in FEAT, and the second is design_ttest2, a simple script for generating design files for the two-group unpaired t-test case.

randomise has the following thresholding/output options:

  • Voxel-based thresholding, both uncorrected and corrected for multiple comparisons by using the null distribution of the max (across the image) voxelwise test statistic. Uncorrected outputs are: <output>_vox_p_tstat / <output>_vox_p_fstat. Corrected outputs are: <output>_vox_corrp_tstat / <output>_vox_corrp_fstat. To use this option, use -x.
  • TFCE (Threshold-Free Cluster Enhancement) is a method for finding "clusters" in your data without having to define clusters in a binary way. Cluster-like structures are enhanced but the image remains fundamentally voxelwise; you can use the -tfce option in fslmaths to test this on an existing stats image. See the TFCE research page for more information. The "E", "H" and neighbourhood-connectivity parameters have been optimised and should be left unchanged. These optimisations differ with the "dimensionality" of your data: for normal, 3D data (such as in an FSL-VBM analysis) you should just use the -T option, while for TBSS analyses (which are in effect carried out on the mostly "2D" white matter skeleton) you should use the --T2 option.
  • Cluster-based thresholding corrected for multiple comparisons by using the null distribution of the max (across the image) cluster size (so passé!): <output>_clustere_corrp_tstat / <output>_clustere_corrp_fstat. To use this option, use -c <thresh> for t contrasts and -F <thresh> for F contrasts, where the threshold is used to form supra-threshold clusters of voxels.
  • Cluster-based thresholding corrected for multiple comparisons by using the null distribution of the max (across the image) cluster mass: <output>_clusterm_corrp_tstat / <output>_clusterm_corrp_fstat. To use this option, use -C <thresh> for t contrasts and -S <thresh> for F contrasts.
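As a toy illustration of what TFCE computes (not FSL's implementation, which operates on 3D images with optimised connectivity), here is a minimal 1D sketch using the published 3D defaults E=0.5, H=2 (Smith & Nichols, 2009); the step size dh is an assumption for the sketch:

```python
def tfce_1d(stats, E=0.5, H=2.0, dh=0.1):
    """Toy threshold-free cluster enhancement for a 1D array.
    At each threshold h, every suprathreshold voxel gains e**E * h**H * dh,
    where e is the extent of its contiguous suprathreshold run."""
    n = len(stats)
    out = [0.0] * n
    h = dh
    while h < max(stats):
        i = 0
        while i < n:
            if stats[i] > h:
                j = i
                while j < n and stats[j] > h:  # find the run's end
                    j += 1
                gain = ((j - i) ** E) * (h ** H) * dh
                for k in range(i, j):
                    out[k] += gain
                i = j
            else:
                i += 1
        h += dh
    return out

# A wide cluster is enhanced more than an isolated spike of equal height:
print(tfce_1d([0.0, 3.0, 3.0, 3.0, 0.0, 3.0, 0.0]))
```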

These filename extensions are summarized in the table below.

                      Voxel-wise          TFCE                 Cluster-wise (extent)    Cluster-wise (mass)
Raw test statistic    _tstat              _tfce_tstat          n/a                      n/a
                      _fstat              _tfce_fstat
1 - Uncorrected P     _vox_p_tstat        _tfce_p_tstat        n/a                      n/a
                      _vox_p_fstat        _tfce_p_fstat
1 - FWE-corrected P   _vox_corrp_tstat    _tfce_corrp_tstat    _clustere_corrp_tstat    _clusterm_corrp_tstat
                      _vox_corrp_fstat    _tfce_corrp_fstat    _clustere_corrp_fstat    _clusterm_corrp_fstat

FWE-corrected means that the family-wise error rate is controlled. If only FWE-corrected P-values less than 0.05 are accepted, the chance of one or more false positives occurring over all space is no more than 5%. Equivalently, one has 95% confidence of no false positives in the image.

Note that these output images are 1-P images, where a value of 1 is therefore most significant (arranged this way to make display and thresholding convenient). Thus, to "threshold at p<0.01", threshold the output images at 0.99, and so on.

If your design is simply all 1s (for example, a single group of subjects) then randomise needs to work in a different way. Normally it generates random samples by randomly permuting the rows of the design; however in this case it does so by randomly inverting the sign of the 1s. In this case, then, instead of specifying design and contrast matrices on the command line, use the -1 option.
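The sign-flipping logic behind the -1 option can be sketched in a few lines of Python (a toy illustration of the idea, not randomise's implementation):

```python
import itertools

def one_sample_signflip_p(diffs):
    """Exhaustive sign-flip test for mean > 0. With n observations there
    are 2**n sign flips; the P-value is the fraction of flipped means at
    least as large as the observed mean."""
    n = len(diffs)
    obs = sum(diffs) / n
    hits = 0
    for signs in itertools.product((1, -1), repeat=n):
        if sum(s * d for s, d in zip(signs, diffs)) / n >= obs:
            hits += 1
    return hits / 2 ** n

# All-positive differences: only the identity flip matches, so p = 1/2**6
print(one_sample_signflip_p([1.2, 0.8, 1.5, 0.9, 1.1, 1.3]))  # 0.015625
```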

You can potentially improve the estimation of the variance that feeds into the final "t" statistic image by using the variance smoothing option -v <std> where you need to specify the spatial extent of the smoothing in mm.

Viewing and Reporting Results

When viewing the 1-p results in FSLeyes the min/max display range should be set to 0.95/1.0 so that values less than 0.95 (equivalent to p>0.05) are not shown. If these are corrected values (i.e. corrp) then the visible areas correspond to the statistically significant regions.

To report cluster results (size, maxima, locations, labels) you can use the cluster and atlasquery tools.

Parallelising Randomise

If you have an SGE-capable system then a randomise job can be split into parallel sub-jobs with randomise_parallel, which takes the same input options as the standard randomise binary and then calculates and batches an optimal number of randomise sub-tasks. The parallelisation has two stages - first the randomise sub-jobs are run, and then the sub-job results are combined into the final output.

Examples

One-Sample T-test

To perform a nonparametric 1-sample t-test (e.g., on COPEs created by FEAT FMRI analysis), create a 4D image of all of the images. There should be no repeated measures, i.e., there should only be one image per subject. Because this is a single group simple design you don't need a design matrix or contrasts. Just use:

randomise -i OneSamp4D -o OneSampT -1 -T

Note that you do not need the -D option (as the mean is in the model), and that omitting the -n option means the default of 5000 permutations will be performed.

If you have fewer than 20 subjects (approx. 20 DF), then you will usually see an increase in power by using variance smoothing, as in

randomise -i OneSamp4D -o OneSampT -1 -v 5 -T

which does variance smoothing with a sigma of 5mm.

Note also that randomise will automatically select one-sample mode for appropriate design/contrast combinations.

Two-Sample Unpaired T-test

To perform a nonparametric 2-sample t-test, create a 4D image of all of the images, with the subjects in the right order! Create appropriate design.mat and design.con files. Once you have your design files, run:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T

Two-Sample Unpaired T-test with nuisance variables

To perform a nonparametric 2-sample t-test in the presence of nuisance variables, create a 4D image of all of the images. Create appropriate design.mat and design.con files, where your design matrix has additional nuisance variables that are (appropriately) ignored by your contrast. Once you have your design files the call is as before:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T

Two-Sample Paired T-test (Paired Two-Group Difference)

We will follow the setup in the FEAT manual where we have a group of 8 subjects scanned under two different conditions, A and B. Condition A will be entered as the first 8 inputs, and condition B as the second 8 inputs, with the subjects in the same order in each case (ordering is, naturally, very important).

The design matrix (design.mat) looks like

 1 1 0 0 0 0 0 0 0
 1 0 1 0 0 0 0 0 0
 1 0 0 1 0 0 0 0 0
 1 0 0 0 1 0 0 0 0
 1 0 0 0 0 1 0 0 0
 1 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 1 0
 1 0 0 0 0 0 0 0 1
-1 1 0 0 0 0 0 0 0
-1 0 1 0 0 0 0 0 0
-1 0 0 1 0 0 0 0 0
-1 0 0 0 1 0 0 0 0
-1 0 0 0 0 1 0 0 0
-1 0 0 0 0 0 1 0 0
-1 0 0 0 0 0 0 1 0
-1 0 0 0 0 0 0 0 1

where the first column models the group difference, and the individual subject means are modelled by columns 2 to 9.

The t-contrast (design.con) for the group difference, A-B, is simply

1 0 0 0 0 0 0 0 0

and for B-A, just replace 1 with -1 in this contrast.

The exchangeability-block information (design.grp) can be entered via the "Group" column in the Glm GUI (see the example GLM setup here) or, alternatively, it can be created independently as a simple text file. The values in this file specify which data can be exchanged: only conditions A and B within the same subject may be swapped (since the null hypothesis is that there is no difference between A and B, but each subject could have a different mean value, so we cannot exchange between subjects). There are therefore eight different blocks and the file looks like

1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8

Once you have set up these design files you can run:

randomise -i PairedT4D -o PairedT -d design.mat -t design.con -e design.grp -m mask -T

Note that an entirely different way to do a paired analysis is to calculate the difference values within-subject first (e.g., using fslmaths) and then do a simple one-group t-test (across subjects) on these within-subject differences.

Repeated measures ANOVA

Following the ANOVA: 1-factor 4-levels (Repeated Measures) example from the FEAT manual, assume we have 2 subjects with 1 factor at 4 levels. We therefore have eight input images and we want to test if there is any difference over the 4 levels of the factor. The design matrix looks like

1 0 1 0 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 0
0 1 1 0 0
0 1 0 1 0
0 1 0 0 1
0 1 0 0 0

where the first two columns model the subject means and columns 3 to 5 model the categorical effect (note the different arrangement of rows relative to the FEAT example). Three t-contrasts for the categorical effect

0 0 1 0 0
0 0 0 1 0
0 0 0 0 1

are selected together into a single F-contrast

1 1 1

Modify the exchangeability-block information in design.grp to match

1
1
1
1
2
2
2
2

This will ensure that permutations only occur within subject, respecting the repeated-measures structure of the data. The number of permutations can be computed for each block and then multiplied together to give the total. Using the one-way ANOVA formula with 4 levels of one observation each, there are (1+1+1+1)!/(1!·1!·1!·1!) = 24 possible permutations per subject, and hence 24 × 24 = 576 permutations in total. The call is then similar to the above examples:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -f design.fts -m mask -e design.grp -T
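The within-block counting argument above can be checked with a short Python sketch (illustrative only; the labels mirror the design.grp file and factor levels above):

```python
from collections import Counter
from math import factorial

def n_perms_within_blocks(block_labels, levels):
    """Permutations available when shuffling only within exchangeability
    blocks: a multinomial coefficient over the factor levels per block,
    multiplied across blocks."""
    total = 1
    for block in set(block_labels):
        sizes = Counter(l for b, l in zip(block_labels, levels) if b == block)
        perms = factorial(sum(sizes.values()))
        for s in sizes.values():
            perms //= factorial(s)   # divide out redundant within-level swaps
        total *= perms
    return total

# 2 subjects (blocks), 1 factor at 4 levels, one observation per cell:
print(n_perms_within_blocks([1] * 4 + [2] * 4, [1, 2, 3, 4] * 2))  # 576
```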

Getting Cluster and Peak Information from Randomise Output

Assuming that the best image to get the clusters and peak information from is the raw tstat image, begin by masking this with the significant voxels from corrp:

fslmaths grot_tfce_corrp_tstat1 -thr 0.95 -bin -mul grot_tstat1 grot_thresh_tstat1

Now run cluster to extract the clusters and local maxima in several different outputs:

cluster --in=grot_thresh_tstat1 --thresh=0.0001 --oindex=grot_cluster_index --olmax=grot_lmax.txt --osize=grot_cluster_size

If the data is already in standard space (MNI152) and co-ordinate reporting is wanted in MNI (mm) values, add --mm to the end of the command.

If clusters need to be split more finely, the --thresh value can be raised to an appropriate level. Because the input image was already masked by the corrp image, only significant voxels will ever be reported, regardless of the --thresh value used in the cluster command.

Two-pass mode

Currently this mode operates on cluster (-c) and TFCE statistics only.

Associated Tools

Missing Data and/or Lesion Masking

For many studies there can be missing data (due to limited FOV) or there can be pathologies or artefacts present (we will call them "lesions" here, but they could be anything) and it is desirable to exclude the voxels that do not contain useful data from the analysis. In order to make this task easier in randomise there is a script called setup_masks. This takes a set of user-defined masks (normally drawn by hand) and creates a suitable set of files that can be used in randomise, including modified design matrices and contrasts to include the mask-based EVs. The masks should be in the form that can be used directly in randomise; that is, in the same space and containing a value of 1 in the voxels to be excluded and 0 in all other voxels (e.g., a lesion would be filled with ones in this case).

Usage of the script is:

setup_masks <input matrix> <input contrast> <output basename> <mask1> <mask2> ...

The list of mask images (<mask1> <mask2> ...) must be from subjects in the same order as used in the design matrix.

New versions of both the design matrix and contrasts are created (starting with the specified output basename). In addition, a set of images is created, suitable as voxelwise regressors for randomise, and a list file tailored for randomise.

The final output of the script is an example call to randomise (printed to the screen) that shows exactly how to use the output files in a call to randomise. Other options (as would have been used without the voxelwise regressors) can then be added at the end of the randomise command to easily form a complete randomise call.

Note that the example call generated by setup_masks uses a negative number in the --vxl option (e.g. --vxl=-3). This signals to randomise that the input is a single 4D volume that needs to be "expanded" across the design matrix to provide subject-specific confound regressors.

Background theory

A standard nonparametric test is exact, in that the false positive rate is exactly equal to the specified α level. Using randomise with a GLM that corresponds to one of the following simple statistical models will result in exact inference:

  • One sample t-test on difference measures
  • Two sample t-test
  • One-way ANOVA
  • Simple correlation

Use of almost any other GLM will result in approximately exact inference. In particular, when the model includes both the effect tested (e.g., difference in FA between two groups) and nuisance variables (e.g., age), exact tests are not generally available. Permutation tests rely on an assumption of exchangeability; with the models above, the null hypothesis implies complete exchangeability of the observations. When there are nuisance effects, however, the null hypothesis no longer assures the exchangeability of the data (e.g., even when the null hypothesis of no FA difference is true, age effects mean that permuting the data alters its structure).

Permutation tests for the General Linear Model

For an arbitrary GLM randomise uses the method of Freedman & Lane (1983). Based on the contrast (or set of contrasts defining an F test), the design matrix is automatically partitioned into tested effects and nuisance (confound) effects. The data are first fit to the nuisance effects alone and nuisance-only residuals are formed. These residuals are permuted, and the estimated nuisance signal is then added back on, creating an (approximate) realization of data under the null hypothesis. This realization is fit to the full model and the desired test statistic is computed as usual. This process is repeated to build the distribution of the test statistic under the null hypothesis specified by the contrast(s). For the simple models above, this method is equivalent to the standard exact tests; otherwise, it accounts for nuisance variation present under the null. Note that randomise v2.0 and earlier used a method due to Kennedy (1995). While both the Freedman-Lane and Kennedy methods are accurate for large n, for small n the Kennedy method can falsely inflate significance. For a review of these issues and further possible methods, see Anderson & Robinson (2001) and Winkler et al. (2014) [references above].

For simple models, like the 2-sample t-test, regression with a single covariate, and the 1-sample t-test on differences, a permutation test has exact control of the desired false positive rate under very simple assumptions. For more complex cases, like an arbitrary contrast on a linear regression model with 2 or more covariates, exactness can't be guaranteed, but exhaustive simulations under various punishing small-sample cases have shown Freedman-Lane to be highly accurate in practice (Anderson & Robinson, 2001; Winkler et al., 2014).
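The Freedman-Lane procedure described above can be sketched numerically (illustrative Python with NumPy, not randomise's implementation; glm_t and freedman_lane_p are hypothetical helper names):

```python
import numpy as np

def glm_t(y, X, c):
    """Ordinary-least-squares t statistic for contrast c on design X."""
    pinv = np.linalg.pinv(X)
    beta = pinv @ y
    resid = y - X @ beta
    dof = len(y) - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof
    return (c @ beta) / np.sqrt(sigma2 * (c @ pinv @ pinv.T @ c))

def freedman_lane_p(y, Z, W, n_perm=999, seed=0):
    """Fit the nuisance-only model W, permute its residuals, add the
    nuisance fit back on, and recompute the full-model statistic."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([Z, W])
    c = np.zeros(X.shape[1])
    c[0] = 1.0                        # contrast on the tested column
    t_obs = glm_t(y, X, c)
    gamma = np.linalg.pinv(W) @ y     # nuisance-only fit
    e = y - W @ gamma                 # nuisance-only residuals
    hits = 1                          # the identity permutation counts
    for _ in range(n_perm):
        y_star = W @ gamma + rng.permutation(e)
        if glm_t(y_star, X, c) >= t_obs:
            hits += 1
    return hits / (n_perm + 1)

# Toy data: a real effect of interest plus an age-like nuisance covariate
rng = np.random.default_rng(42)
n = 40
age = np.linspace(20, 60, n)
W = np.column_stack([np.ones(n), age])   # mean + nuisance
Z = rng.standard_normal((n, 1))          # regressor of interest
y = 0.8 * Z[:, 0] + 0.02 * age + 0.5 * rng.standard_normal(n)
p = freedman_lane_p(y, Z, W)
print(p)  # small: the Z effect survives removal of the age confound
```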

Conditional Monte Carlo Permutation Tests

A proper "exact" test arises from evaluating every possible permutation. Often this is not feasible; e.g., a simple correlation with 12 scans has 12! = 479,001,600 (nearly half a billion) possible permutations. Instead, a random sample of possible permutations can be used, creating a Conditional Monte Carlo (CMC) permutation test. On average, the CMC test is exact and will give similar results to carrying out all possible permutations.

If the number of possible permutations is large, one can show that a true, exhaustive P-value of p will produce P-values between p ± 2√(p(1-p)/n) about 95% of the time, where n is the number of CMC permutations. The table below shows confidence limits for p=0.05 for various n. At least 5,000 permutations are required to reduce the uncertainty appreciably, though 10,000 permutations are required to reduce the margin-of-error to below 10% of the nominal alpha.

n         Confidence limits for p=0.05
100       0.0500 ± 0.0436
500       0.0500 ± 0.0195
1,000     0.0500 ± 0.0138
5,000     0.0500 ± 0.0062
10,000    0.0500 ± 0.0044
50,000    0.0500 ± 0.0019

In randomise the number of permutations to use is specified with the -n option. If this number is greater than or equal to the number of possible permutations, an exhaustive test is run. If it is less than the number of possible permutations a Conditional Monte Carlo permutation test is performed. The default is 5000, though if time permits, 10000 is recommended.
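The margin-of-error formula (and the table above) can be reproduced directly:

```python
from math import sqrt

def cmc_margin(p, n):
    """95% margin of error for a P-value estimated from n random
    permutations: 2 * sqrt(p * (1 - p) / n)."""
    return 2 * sqrt(p * (1 - p) / n)

# Reproduce the confidence-limits table for p = 0.05:
for n in (100, 500, 1000, 5000, 10000, 50000):
    print(f"n = {n:>6}: 0.0500 ± {cmc_margin(0.05, n):.4f}")
```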

Counting Permutations

Exchangeability under the null hypothesis justifies the permutation of the data. For n scans, there are n! (n factorial, n×(n-1)×(n-2)×...×2×1) possible ways of shuffling the data. For some designs, though, many of these shuffles are redundant. For example, in a two-sample t-test, permuting two scans within the same group will not change the value of the test statistic. The numbers of possible permutations for different designs are given below.

Model                                      Sample size(s)          Number of permutations
One sample t-test on difference measures   \(n\)                   \(2^n\)
Two sample t-test                          \(n_1\), \(n_2\)        \(\frac{(n_1+n_2)!}{n_1!\,n_2!}\)
One-way ANOVA                              \(n_1\), ..., \(n_k\)   \(\frac{(\sum_{i=1}^{k}n_i)!}{\prod_{i=1}^{k}n_i!}\)
Simple correlation                         \(n\)                   \(n!\)
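These counts can be checked with a few lines of Python (illustrative; the function names are ours):

```python
from math import comb, factorial

def one_sample(n):
    return 2 ** n                      # sign flips, not permutations

def two_sample(n1, n2):
    return comb(n1 + n2, n1)           # distinct group relabellings

def one_way_anova(sizes):
    total = factorial(sum(sizes))      # multinomial coefficient
    for s in sizes:
        total //= factorial(s)
    return total

def simple_correlation(n):
    return factorial(n)

print(two_sample(8, 8))                # 12870
print(one_way_anova([1, 1, 1, 1]))     # 24, as in the repeated-measures example
print(simple_correlation(12))          # 479001600, "nearly half a billion"
```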

Note that the one-sample t-test is an exception: data are not permuted, but rather their signs are randomly flipped. For all designs except the one-sample t-test, randomise uses a generic algorithm which counts the number of unique possible permutations for each contrast. If X is the design matrix and c is the contrast of interest, then Xc is the sub-design matrix of the effect of interest. The number of unique rows in Xc is counted and a one-way ANOVA calculation is used.