FSL and BLAS
Many FSL tools use BLAS and LAPACK routines internally; FSL uses the OpenBLAS implementation of BLAS/LAPACK. FSL does not call OpenBLAS directly, but rather uses the Armadillo linear algebra library, which in turn calls OpenBLAS.
Many OpenBLAS operations can be parallelised on multi-core systems. However, in all FSL C++ tools, parallelisation of OpenBLAS calls is disabled by default. This is because the types of BLAS/LAPACK operations commonly used by FSL tools do not benefit from parallelisation, and parallelising them often results in poorer performance. The reasons for this are outlined in more detail below.
If you want, you can enable OpenBLAS parallelisation by setting the FSL_SKIP_GLOBAL environment variable, e.g.:
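```sh
# Allow OpenBLAS to use multiple threads in subsequent FSL commands:
export FSL_SKIP_GLOBAL=1

# Optionally cap the number of threads OpenBLAS may use (honoured
# only when FSL_SKIP_GLOBAL is set - see below):
export OPENBLAS_NUM_THREADS=4
```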
While FSL uses OpenBLAS by default, the FSL C++ tools are compiled such that the BLAS implementation can be changed without having to recompile the tools - refer to the conda-forge knowledge base for more details.
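For example, the conda-forge knowledge base describes a libblas metapackage mechanism for switching BLAS implementations within a conda environment. As a rough sketch (the exact package spec here is an assumption; check the knowledge base for the authoritative syntax):

```sh
# Replace OpenBLAS with e.g. Intel MKL as the BLAS/LAPACK backend
# in the active conda environment:
conda install "libblas=*=*mkl"
```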
Performance of FSL tools with BLAS parallelisation
This discussion pertains to FSL 6.0.6 and newer, where we are using a version of OpenBLAS which has been compiled to use OpenMP for parallelisation. Only a small selection of FSL tools have been benchmarked (melodic, topup, fnirt), so the results presented here may or may not be applicable to other FSL tools.
Info
In FSL 6.0.7.8, we switched to a pthreads-based OpenBLAS variant on Linux, which does not suffer from the performance degradations outlined here. However, FSL still uses an OpenMP-based OpenBLAS variant on macOS.
In FSL 6.0.6 and newer, BLAS parallelisation is programmatically disabled in all tools, via a call to openblas_set_num_threads(1) (see fslStartup.cc and threading.h/threading.cpp, both in the fsl/utils project).
In general, enabling parallelisation for BLAS routines gives only minor performance benefits, so when working at small scale it is probably not worth the hassle. This is because the default settings used by libgomp (the GNU OpenMP implementation) often result in poorer performance and an excess of wasted CPU cycles. The GOMP_SPINCOUNT setting controls how long worker threads will busy-spin whilst waiting for a new job to be scheduled; its default value appears to be far too high for the types of calls that are typical in the FSL tools tested.
This can be avoided by explicitly setting GOMP_SPINCOUNT=0 (or, equivalently, OMP_WAIT_POLICY=PASSIVE). However, because this is not the default behaviour, and because the performance benefits gained by parallelisation are so modest, we have decided to explicitly disable OpenBLAS parallelisation by default for the time being.
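For example (GOMP_SPINCOUNT and OMP_WAIT_POLICY are standard GNU OpenMP settings; setting either one is sufficient):

```sh
# Enable BLAS parallelisation, but stop idle OpenMP worker threads
# from busy-spinning whilst they wait for work:
export FSL_SKIP_GLOBAL=1
export GOMP_SPINCOUNT=0
# ...or, equivalently:
export OMP_WAIT_POLICY=PASSIVE
```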
However, the initial call to openblas_set_num_threads(1) can be skipped by setting the FSL environment variable FSL_SKIP_GLOBAL=1. This has the effect that the standard OpenBLAS environment variables (e.g. OPENBLAS_NUM_THREADS) will be honoured.
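For example, to run a single tool with OpenBLAS threading enabled and capped at four threads (the melodic input and output names here are illustrative placeholders):

```sh
# Per-command form: enables OpenBLAS threading for this invocation only.
# my_func_data and my_melodic.ica are hypothetical names.
FSL_SKIP_GLOBAL=1 OPENBLAS_NUM_THREADS=4 melodic -i my_func_data -o my_melodic.ica
```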
Performance benchmarks
Some benchmark timings for melodic, topup and fnirt are listed below. Each command is taken from the FSL course, and was run on the FSL course data sets.
These timings were performed with FSL 6.0.7.3, which uses OpenBLAS 0.3.21 (the OpenMP build available on conda-forge). Each tool was executed on two platforms (macOS and Linux), and under two scenarios:
- With BLAS parallelisation disabled: the default behaviour for FSL tools, when FSL_SKIP_GLOBAL is unset or set to 0.
- With BLAS parallelisation enabled, using all available cores: accomplished by setting FSL_SKIP_GLOBAL=1 (see the sketch after this list).
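For instance, the fnirt benchmark below could be run under both scenarios along the following lines (a sketch using the shell's time builtin; the exact timing harness used to produce the numbers in the tables is not shown here):

```sh
cd ~/fsl_course_data/registration/

# Scenario 1: default behaviour - BLAS pinned to a single thread
time fnirt --in=STRUCT \
    --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
    --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
    --fout=struct2std_warp

# Scenario 2: BLAS parallelisation enabled across all available cores
export FSL_SKIP_GLOBAL=1
time fnirt --in=STRUCT \
    --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
    --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
    --fout=struct2std_warp
```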
Specifications for the two platforms used for benchmarking were as follows:
|  | macOS | Linux |
|---|---|---|
| Machine | Intel MacBook Pro (late 2013 edition) | Dell XPS 13 Plus 9320 |
| OS | macOS 11.7.4 (Big Sur) | Ubuntu 22.04 |
| CPU | Intel i7-4960HQ, 4 cores/8 threads | 13th Gen Intel i7-1360P, 12 cores/16 threads |
| RAM | 16GB | 32GB |
fnirt
```sh
cd ~/fsl_course_data/registration/

fnirt --in=STRUCT \
      --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
      --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
      --fout=struct2std_warp
```
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 256.82 | 248.29 | 99% |
| With parallelisation | 263.95 | 665.76 | 267% |
| **Linux** | | | |
| No parallelisation | 170.94 | 168.45 | 99% |
| With parallelisation | 157.94 | 155.68 | 99% |
melodic
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 119.64 | 114.85 | 99% |
| With parallelisation | 116.67 | 400.44 | 366% |
| **Linux** | | | |
| No parallelisation | 469.05 | 459.36 | 99% |
| With parallelisation | 480.55 | 637.12 | 157% |
topup
```sh
cd ~/fsl_course_data/fdt1/subj1_preproc/

fslroi dwidata nodif 0 1
fslmerge -t AP_PA_b0 nodif nodif_PA

echo "0 -1 0 0.0759" > acqparams.txt
echo "0 1 0 0.0759" >> acqparams.txt

topup --imain=AP_PA_b0 \
      --datain=acqparams.txt \
      --config=b02b0.cnf \
      --out=topup_AP_PA_b0 \
      --iout=topup_AP_PA_b0_iout \
      --fout=topup_AP_PA_b0_fout
```
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 309.30 | 305.78 | 99% |
| With parallelisation | 327.21 | 586.12 | 183% |
| **Linux** | | | |
| No parallelisation | 176.52 | 176.31 | 99% |
| With parallelisation | 177.04 | 176.60 | 99% |