FSL and BLAS
Many FSL tools use BLAS and LAPACK routines internally; FSL uses the OpenBLAS implementation of BLAS/LAPACK. FSL does not call OpenBLAS directly, but rather uses the Armadillo linear algebra library, which in turn calls OpenBLAS.
Many OpenBLAS operations can be parallelised on multi-core systems. However, in all FSL C++ tools, parallelisation of OpenBLAS calls is disabled by default. This is because the types of BLAS/LAPACK operations commonly used by FSL tools do not benefit from parallelisation, and parallelising them often results in poorer performance. The reasons for this are outlined in more detail below.
If you want, you can enable OpenBLAS parallelisation by setting the FSL_SKIP_GLOBAL environment variable, e.g.:
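```sh
# Allow OpenBLAS to use multiple threads in subsequent FSL commands:
export FSL_SKIP_GLOBAL=1

# Optionally cap the number of threads OpenBLAS may use (honoured
# only when FSL_SKIP_GLOBAL is set - see below):
export OPENBLAS_NUM_THREADS=4
```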
While FSL uses OpenBLAS by default, the FSL C++ tools are compiled such that the BLAS implementation can be changed without having to recompile the tools - refer to the conda-forge knowledge base for more details.
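For example, the conda-forge knowledge base describes a libblas metapackage mechanism for switching BLAS implementations within a conda environment. As a rough sketch (the exact package spec here is an assumption; check the knowledge base for the authoritative syntax):

```sh
# Replace OpenBLAS with e.g. Intel MKL as the BLAS/LAPACK backend
# in the active conda environment:
conda install "libblas=*=*mkl"
```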
Performance of FSL tools with BLAS parallelisation
This discussion pertains to FSL 6.0.6 and newer, where we are using a version of OpenBLAS which has been compiled to use OpenMP for parallelisation. Only a small selection of FSL tools have been benchmarked (melodic, topup, fnirt), so the results presented here may or may not be applicable to other FSL tools.
Info
In FSL 6.0.7.8, we switched to a pthreads-based OpenBLAS variant on Linux, which does not suffer from the performance degradations outlined here. However, FSL still uses an OpenMP-based OpenBLAS variant on macOS.
In FSL 6.0.6 and newer, BLAS parallelisation is programmatically disabled in all tools, via a call to openblas_set_num_threads(1) (see fslStartup.cc and threading.h/threading.cpp, both in the fsl/utils project).
In general, enabling parallelisation for BLAS routines gives only minor performance benefits, so when working at small scale it is probably not worth the hassle. This is because the default settings used by libgomp (the GNU OpenMP implementation) often result in poorer performance and an excess of wasted CPU cycles. The GOMP_SPINCOUNT setting controls how long worker threads will busy-spin whilst waiting for a new job to be scheduled; its default value appears to be far too high for the types of calls that are typical in the FSL tools tested.
This can be avoided by explicitly setting GOMP_SPINCOUNT=0 (or, equivalently, OMP_WAIT_POLICY=PASSIVE). However, because this is not the default behaviour, and because the performance benefits gained by parallelisation are so modest, we have decided to explicitly disable OpenBLAS parallelisation by default for the time being.
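For example (GOMP_SPINCOUNT and OMP_WAIT_POLICY are standard GNU OpenMP settings; setting either one is sufficient):

```sh
# Enable BLAS parallelisation, but stop idle OpenMP worker threads
# from busy-spinning whilst they wait for work:
export FSL_SKIP_GLOBAL=1
export GOMP_SPINCOUNT=0
# ...or, equivalently:
export OMP_WAIT_POLICY=PASSIVE
```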
However, the initial call to openblas_set_num_threads(1) can be skipped by setting the FSL environment variable FSL_SKIP_GLOBAL=1. This has the effect that the standard OpenBLAS environment variables (e.g. OPENBLAS_NUM_THREADS) will be honoured.
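For example, to run a single tool with OpenBLAS threading enabled and capped at four threads (the melodic input and output names here are illustrative placeholders):

```sh
# Per-command form: enables OpenBLAS threading for this invocation only.
# my_func_data and my_melodic.ica are hypothetical names.
FSL_SKIP_GLOBAL=1 OPENBLAS_NUM_THREADS=4 melodic -i my_func_data -o my_melodic.ica
```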
Performance benchmarks
Some benchmark timings for melodic, topup and fnirt are listed below. Each command is taken from the FSL course, and was run on the FSL course data sets.
These timings were performed with FSL 6.0.7.3, which uses OpenBLAS 0.3.21 (the OpenMP build available on conda-forge). Each tool was executed on two platforms (macOS and Linux), and under two scenarios:
- With BLAS parallelisation disabled: the default behaviour for FSL tools, when FSL_SKIP_GLOBAL is unset or set to 0.
- With BLAS parallelisation enabled, using all available cores: accomplished by setting FSL_SKIP_GLOBAL=1 (see the sketch after this list).
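For instance, the fnirt benchmark below could be run under both scenarios along the following lines (a sketch using the shell's time builtin; the exact timing harness used to produce the numbers in the tables is not shown here):

```sh
cd ~/fsl_course_data/registration/

# Scenario 1: default behaviour - BLAS pinned to a single thread
time fnirt --in=STRUCT \
    --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
    --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
    --fout=struct2std_warp

# Scenario 2: BLAS parallelisation enabled across all available cores
export FSL_SKIP_GLOBAL=1
time fnirt --in=STRUCT \
    --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
    --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
    --fout=struct2std_warp
```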
Specifications for the two platforms used for benchmarking were as follows:
|  | macOS | Linux |
|---|---|---|
| Machine | Intel MacBook Pro (late 2013 edition) | Dell XPS 13 Plus 9320 |
| OS | macOS 11.7.4 (Big Sur) | Ubuntu 22.04 |
| CPU | Intel i7-4960HQ, 4 cores/8 threads | 13th Gen Intel i7-1360P, 12 cores/16 threads |
| RAM | 16GB | 32GB |
fnirt
```sh
cd ~/fsl_course_data/registration/

fnirt --in=STRUCT \
      --ref=$FSLDIR/data/standard/MNI152_T1_2mm \
      --config=$FSLDIR/etc/flirtsch/T1_2_MNI152_2mm.cnf \
      --fout=struct2std_warp
```
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 256.82 | 248.29 | 99% |
| With parallelisation | 263.95 | 665.76 | 267% |
| **Linux** | | | |
| No parallelisation | 170.94 | 168.45 | 99% |
| With parallelisation | 157.94 | 155.68 | 99% |
melodic
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 119.64 | 114.85 | 99% |
| With parallelisation | 116.67 | 400.44 | 366% |
| **Linux** | | | |
| No parallelisation | 469.05 | 459.36 | 99% |
| With parallelisation | 480.55 | 637.12 | 157% |
topup
```sh
cd ~/fsl_course_data/fdt1/subj1_preproc/

fslroi dwidata nodif 0 1
fslmerge -t AP_PA_b0 nodif nodif_PA

echo "0 -1 0 0.0759" > acqparams.txt
echo "0 1 0 0.0759" >> acqparams.txt

topup --imain=AP_PA_b0 \
      --datain=acqparams.txt \
      --config=b02b0.cnf \
      --out=topup_AP_PA_b0 \
      --iout=topup_AP_PA_b0_iout \
      --fout=topup_AP_PA_b0_fout
```
|  | Total time (s) | CPU time (s) | CPU load |
|---|---|---|---|
| **macOS** | | | |
| No parallelisation | 309.30 | 305.78 | 99% |
| With parallelisation | 327.21 | 586.12 | 183% |
| **Linux** | | | |
| No parallelisation | 176.52 | 176.31 | 99% |
| With parallelisation | 177.04 | 176.60 | 99% |