# Fsl-pipe tutorial

Building a pipeline in fsl-pipe consists of three steps:

1. Define the directory structure containing the pipeline output in a FileTree.
2. Write python functions representing individual steps in the pipeline.
3. Run the CLI interface.

## Exploring the command line interface (CLI)

### Basic pipeline runs

To start with a trivial example, copy the following to a text file called "pipe.py":

```python
from file_tree import FileTree
from fsl_pipe import pipe, In, Out

tree = FileTree.from_string("""
A.txt
B.txt
C.txt
""")

@pipe
def gen_A(A: Out):
    with open(A, 'w') as f:
        f.write('A\n')

@pipe
def gen_B(B: Out):
    with open(B, 'w') as f:
        f.write('B\n')

@pipe
def gen_C(A: In, B: In, C: Out):
    with open(C, 'w') as writer:
        with open(A, 'r') as reader:
            writer.writelines(reader.readlines())
        with open(B, 'r') as reader:
            writer.writelines(reader.readlines())

if __name__ == "__main__":
    pipe.cli(tree)
```

This is a fully functioning pipeline with 3 jobs:

1. Generating the file A.txt containing the text "A".
2. Generating the file B.txt containing the text "B".
3. Generating the file C.txt with the concatenation of A and B.

Let's check the command line interface by running

```bash
python pipe.py --help
```

This will generate the following output:

```
usage: Runs the pipeline [-h] [-m {local,submit,dask}] [-o] [-d] [-j JOB_HOLD] [--datalad] [templates ...]

positional arguments:
  templates             Set to one or more template keys or file patterns (e.g., "*.txt").
                        Only these templates/files will be produced. If not provided all
                        templates will be produced (A, B, C).

optional arguments:
  -h, --help            show this help message and exit
  -m {local,submit,dask}, --pipeline_method {local,submit,dask}
                        method used to run the jobs (default: local)
  -o, --overwrite       If set overwrite any requested files.
  -d, --overwrite_dependencies
                        If set also overwrites dependencies of requested files.
  -j JOB_HOLD, --job-hold JOB_HOLD
                        Place a hold on the whole pipeline until job has completed.
  --skip-missing        If set skip running any jobs depending on missing data. This
                        replaces the default behaviour of raising an error if any required
                        input data is missing.
  -q, --quiet           Suppresses the report on what will be run (might speed up starting
                        up the pipeline substantially).
  -b, --batch           Batch jobs based on submit parameters and pipeline labels before
                        running them.
  -g, --gui             Start a GUI to select which output files should be produced by the
                        pipeline.
```

The main argument here is `templates`, which allows you to select which of the potential pipeline output files you want to produce. By default, the full pipeline will be run. Let's try this out by producing just "B.txt":

```bash
python pipe.py B

# Alternative using file pattern
# python pipe.py B.txt
```

Only a single job in the pipeline is run, which is sufficient to produce "B.txt". The other files ("A.txt" and "C.txt") are not required and hence their jobs are not run. Note that either the template key ("B") or the full filename ("B.txt") can be given. The full filename can also contain shell-style wildcards (e.g., "*" or "?") to match multiple files.

Let's now request file "C.txt":

```bash
python pipe.py C
```

In this case two jobs are run. To generate "C.txt", the pipeline first needs to create "A.txt". Although "B.txt" is also required, that file already existed and hence is not regenerated. In other words, fsl-pipe is lazy: it only runs what is strictly necessary to produce the requested output files.
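This laziness is easy to see by repeating the same request (a sketch of the expected behaviour; the exact report printed by fsl-pipe may differ between versions):

```bash
python pipe.py C   # runs gen_A and gen_C ("B.txt" already exists)
python pipe.py C   # does nothing: all requested files already exist
```

To force jobs to run again, use the overwrite flags described below.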
So far, we have defined our targets based on the template keys in the file-tree. However, we can also define the output files directly based on the filename, for example:

```bash
python pipe.py C.txt
```

Unix-style filename patterns can be used to define multiple target filenames. For example:

- `python pipe.py *.txt`: produce all text files in the top-level directory.
- `python pipe.py **/*.pdf`: produce any PDFs, no matter where they are in the output file-tree.
- `python pipe.py **/summary.dat`: produce any file named "summary.dat" irrespective of where it is defined in the file-tree.
- `python pipe.py subject-A/**`: produce any files in the `subject-A` directory.

Alternatively, the output files can be selected using a graphical user interface (GUI). This GUI can be started using the `-g/--gui` flag. The GUI is terminal-based, which means that it will work when connecting to a computing cluster over SSH. "fsl-pipe-gui" needs to be installed for the GUI to run (`conda/pip install fsl-pipe-gui`). More details on the GUI can be found at [the fsl-pipe-gui gitlab repository](https://git.fmrib.ox.ac.uk/fsl/fsl-pipe-gui).

When defining output files in this way, the pipeline will run *only* those jobs that are necessary to produce the user-defined outputs. These jobs might produce intermediate or additional files that do not match the output pattern. So, even though you did not explicitly request these files, `fsl-pipe` will still produce them.

### Overwriting existing files

If we want to regenerate any files, we can use the `-o/--overwrite` and/or `-d/--overwrite_dependencies` flags. The former will just overwrite the files that we explicitly request (in addition to generating any missing dependencies):

```bash
python pipe.py C -o
```

This will just run the single job to generate "C.txt". The `-d/--overwrite_dependencies` flag will also overwrite any dependencies of the requested files:

```bash
python pipe.py C -do
```

This will rerun the full pipeline, overwriting all of the files.

Setting the `-d/--overwrite_dependencies` flag does not imply the `-o/--overwrite` flag. If the overwrite flag is not provided, the dependencies are only rerun if "C.txt" does not exist. So, the following does nothing:

```bash
python pipe.py C -d
```

However, after removing "C.txt" the same command will rerun the full pipeline:

```bash
rm C.txt
python pipe.py C -d
```

### Understanding the stdout

fsl-pipe writes some helpful information to the terminal while running. First, we can see a file-tree representation looking like:

```
.
├── A.txt [1/0/0]
├── B.txt [1/0/0]
└── C.txt [0/1/0]
```

This shows the *relevant* part of the file-tree for this specific pipeline run. The numbers indicate the number of files matching the template that are involved in this pipeline. The first number (in red) shows the number of files that will be overwritten. The second number (in yellow) shows the number of new files that will be produced. The third number (in blue) shows the number of files that will be used as input.

After this file-tree representation, fsl-pipe will report which (if any) jobs were run or submitted.
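As a concrete illustration, the counts above are consistent with the final commands of the previous section (a sketch; the red/yellow/blue colouring is not reproduced here and the exact report may differ between fsl-pipe versions):

```bash
rm C.txt
python pipe.py C -d
# A.txt [1/0/0] and B.txt [1/0/0]: one existing file per template will be
# overwritten, because -d also overwrites the dependencies of "C.txt"
# C.txt [0/1/0]: one new file will be produced, because we just removed it
```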
## Iterating over subjects or other placeholders

Let's adjust the pipeline above to iterate over multiple subjects:

```python
from file_tree import FileTree
from fsl_pipe import pipe, In, Out, Var

tree = FileTree.from_string("""
subject = 01, 02

A.txt
sub-{subject}
    B.txt
    C.txt
""")

@pipe
def gen_A(A: Out):
    with open(A, 'w') as f:
        f.write('A\n')

@pipe
def gen_B(B: Out, subject: Var):
    with open(B, 'w') as f:
        f.write(f'B\n{subject.value}\n')

@pipe
def gen_C(A: In, B: In, C: Out):
    with open(C, 'w') as writer:
        with open(A, 'r') as reader:
            writer.writelines(reader.readlines())
        with open(B, 'r') as reader:
            writer.writelines(reader.readlines())

if __name__ == "__main__":
    pipe.cli(tree)
```

Copy this text to "pipe.py". You might also want to remove the output from the old pipeline (`rm *.txt`).

There are two changes here to the pipeline. The main one is with the file-tree:

```python
tree = FileTree.from_string("""
subject = 01, 02

A.txt
sub-{subject}
    B.txt
    C.txt
""")
```

Here we have defined two subjects ("01" and "02"). For each subject we will get a subject directory (matching the subject ID with the prefix "sub-"), which will contain "B.txt" and "C.txt". The file "A.txt" is not subject-specific, which we can see because it is outside of the subject directory.

The second change is with the function generating "B.txt", which now takes the subject ID as an input and writes it into "B.txt":

```python
@pipe
def gen_B(B: Out, subject: Var):
    with open(B, 'w') as f:
        f.write(f'B\n{subject.value}\n')
```

Let's run this pipeline again:

```bash
python pipe.py C
```

There is now also a table in the terminal output listing the subject IDs:

```
Placeholders with multiple options
┏━━━━━━━━━┳━━━━━━━━┓
┃ name    ┃ value  ┃
┡━━━━━━━━━╇━━━━━━━━┩
│ subject │ 01, 02 │
└─────────┴────────┘
.
├── A.txt [0/1/0]
└── sub-{subject}
    ├── B.txt [0/2/0]
    └── C.txt [0/2/0]
```

In total, five jobs are run now: one to create "A.txt" and two each to generate "B.txt" and "C.txt" for both subjects. Feel free to check that the files "B.txt" and "C.txt" contain the correct subject IDs.

When iterating over multiple placeholder values, the command line interface is expanded to allow one to set these values. In this case, we now have an additional flag to set the subject IDs:

```
--subject SUBJECT [SUBJECT ...]
                      Use to set the possible values of subject to the selected values (default: 01,02)
```

We can use this, for example, to rerun only subject "01":

```bash
python pipe.py C -do --subject=01
```

An alternative way to produce the files for just a single subject is to explicitly match the files that need to be produced. For example, the following will reproduce all the files matching "sub-02/*.txt" (namely, "sub-02/B.txt" and "sub-02/C.txt"):

```bash
python pipe.py "sub-02/*.txt" -o
```

The quotes around "sub-02/*.txt" are required here to prevent the shell from expanding the filenames.

Note that when defining the function to generate "C.txt", there is no distinction made between the input that depends on the subject ("sub-{subject}/B.txt") and the one that does not ("A.txt"):

```python
@pipe
def gen_C(A: In, B: In, C: Out):
    ...
```

fsl-pipe figures out which of the input or output files depend on the subject placeholder based on the file-tree. This is particularly useful when considering some input configuration file, which could be either made global or subject-specific just by changing the file-tree without altering the pipeline itself.
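For example, a configuration file can be switched between global and subject-specific purely by editing the file-tree (a hypothetical sketch; the `config.json` template is not part of this tutorial's pipeline, and a function declaring `config: In` would work unchanged with either tree):

```python
from file_tree import FileTree

# Global configuration: a single config.json shared by all subjects
tree_global = FileTree.from_string("""
subject = 01, 02

config.json
sub-{subject}
    B.txt
    C.txt
""")

# Subject-specific configuration: only the file-tree changes; the
# pipeline functions themselves do not need to be altered
tree_subject_specific = FileTree.from_string("""
subject = 01, 02

sub-{subject}
    config.json
    B.txt
    C.txt
""")
```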
## Concatenating/merging across subjects or other placeholders

By default, when any input or output template depends on some placeholder in the file-tree (e.g., subject ID), the pipeline will run that job for each subject independently. However, in some cases we might not want this default behaviour, for example if we are merging or comparing data across subjects. In that case we can use "no_iter":

```python
from file_tree import FileTree
from fsl_pipe import pipe, In, Out, Var

tree = FileTree.from_string("""
subject = 01, 02

A.txt
sub-{subject}
    B.txt
    C.txt
merged.txt
""")

@pipe
def gen_A(A: Out):
    with open(A, 'w') as f:
        f.write('A\n')

@pipe
def gen_B(B: Out, subject: Var):
    with open(B, 'w') as f:
        f.write(f'B\n{subject.value}\n')

@pipe
def gen_C(A: In, B: In, C: Out):
    with open(C, 'w') as writer:
        with open(A, 'r') as reader:
            writer.writelines(reader.readlines())
        with open(B, 'r') as reader:
            writer.writelines(reader.readlines())

@pipe(no_iter=['subject'])
def merge(C: In, merged: Out):
    with open(merged, 'w') as writer:
        for fn in C.data:
            with open(fn, 'r') as reader:
                writer.writelines(reader.readlines())

if __name__ == "__main__":
    pipe.cli(tree)
```

Again there are two changes. First, we added a new file "merged.txt" to the file-tree. This file will contain the concatenated text from all the subject-specific "C.txt" files, which is defined by adding a new job to the pipeline:

```python
@pipe(no_iter=['subject'])
def merge(C: In, merged: Out):
    with open(merged, 'w') as writer:
        for fn in C.data:
            with open(fn, 'r') as reader:
                writer.writelines(reader.readlines())
```

Here the "no_iter" flag tells fsl-pipe that this function should not iterate over the subject ID, but should instead pass on all of the subject-specific "C.txt" filenames, so that we can iterate over them within the function.

## Writing your own pipeline

### Importing fsl-pipe

A pipeline script will typically start by importing fsl-pipe:

```python
from fsl_pipe import pipe, In, Out, Ref, Var
```

Here `pipe` is an empty pipeline already initialized for you. Using this pipeline is fine for scripts, however in a python library we strongly recommend using

```python
from fsl_pipe import Pipeline, In, Out, Ref, Var

pipe = Pipeline()
```

This ensures you work on your own pipeline without interfering with any user-defined pipelines or pipelines in other python libraries.

### Adding functions to the pipeline

The next step is to wrap individual python functions:

```python
@pipe
def myfunc(input_file: In, other_input: In, output_file: Out,
           reference_file: Ref, placeholder_value: Var, some_config_var=3):
    ...
```

Here we use the python [typing syntax](https://docs.python.org/3/library/typing.html) to define what the different keyword arguments are:

- Any input files (i.e., files that need to exist before this function can run) are marked with `In`.
- Any output files (i.e., files that are expected to exist after the function runs) are marked with `Out`.
- Reference files are marked with `Ref`. These will be extracted from the file-tree like input or output files, but their existence is not checked at any point and they do not form part of the pipeline build.
- Placeholder values (e.g., the subject ID used above) are marked with `Var`.

fsl-pipe will determine which jobs to run based on the inputs (`In`) and outputs (`Out`) produced by all the functions added to the pipeline.
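For illustration, here is a minimal filled-in version of such a function (a sketch; the body and these particular argument names are assumptions, not part of a real pipeline):

```python
@pipe
def myfunc(input_file: In, other_input: In, output_file: Out,
           reference_file: Ref, placeholder_value: Var, some_config_var=3):
    # input_file and other_input must exist before this job runs (In);
    # output_file is expected to exist after it finishes (Out)
    with open(input_file, 'r') as f1, open(other_input, 'r') as f2:
        combined = f1.read() + f2.read()
    with open(output_file, 'w') as f:
        # reference_file is resolved from the file-tree, but its
        # existence is never checked (Ref), so only its path is used
        f.write(f'reference used: {reference_file}\n')
        # like subject.value earlier, placeholder_value.value holds the
        # current value of the placeholder (Var)
        f.write(f'placeholder: {placeholder_value.value}\n')
        f.write(f'config: {some_config_var}\n')
        f.write(combined)
```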
Alternatively, these keyword argument definitions can be set directly when adding the functions to the pipeline. This will override any definitions set as type hints. This can be useful when wrapping a pre-existing function. For example, if we want to create a job that copies "A" to "B":

```python
from shutil import copyfile

pipe(copyfile, kwargs={'src': In('A'), 'dst': Out('B')})
```

Here we use one new feature, namely that we can explicitly set the template key (for `In`, `Out`, or `Ref`) or placeholder (for `Var`) by using syntax like `In('A')`.
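Putting this together, a complete script wrapping `copyfile` might look as follows (a minimal sketch assuming the two-file tree shown; note that no job produces "A.txt", so it must already exist before the pipeline is run):

```python
from shutil import copyfile

from file_tree import FileTree
from fsl_pipe import Pipeline, In, Out

tree = FileTree.from_string("""
A.txt
B.txt
""")

pipe = Pipeline()

# copyfile's own keyword arguments ('src' and 'dst') do not match any
# template key, so we map them explicitly: 'src' reads template "A",
# 'dst' writes template "B"
pipe(copyfile, kwargs={'src': In('A'), 'dst': Out('B')})

if __name__ == "__main__":
    pipe.cli(tree)
```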