FUNPACK command-line interface

FUNPACK is controlled entirely through its command-line interface. You can control which data fields and subjects to import, which cleaning/processing rules to apply, and how the output should be formatted.

Configuration files

FUNPACK can also be controlled through configuration files. A FUNPACK configuration file simply contains a list of command-line options, written without the leading --. For example, the options in the following command line:

fmrib_unpack                     \
  --overwrite                    \
  --write_log                    \
  --icd10_map_file icd_codes.tsv \
  --category 10                  \
  --category 11                  \
  output.tsv input1.tsv input2.tsv

Could be stored in a configuration file config.txt:

overwrite
write_log
icd10_map_file icd_codes.tsv
category       10
category       11

And then executed as follows:

fmrib_unpack -cfg config.txt output.tsv input1.tsv input2.tsv
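
The correspondence between a configuration file and its command-line equivalent can be sketched in a few lines of Python. This is only an illustration of the format described above, not FUNPACK's own parser: the function name config_to_argv is hypothetical, and the handling of # comments is an assumption based on the comment style used in the variables.cfg example on this page.

```python
import shlex

def config_to_argv(text):
    """Turn configuration file text into an equivalent argument list.

    Illustrative only - this is not how FUNPACK itself parses files.
    """
    argv = []
    for line in text.splitlines():
        # shlex drops trailing "#" comments and splits on whitespace
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue  # blank or comment-only line
        # The first token is the option name, minus its leading "--"
        argv.append('--' + tokens[0])
        argv.extend(tokens[1:])
    return argv

config = """
overwrite
write_log
icd10_map_file icd_codes.tsv
category       10
category       11
"""

print(config_to_argv(config))
```

The printed list contains the same options, in the same order, as the original fmrib_unpack command line, with the output and input file names left to be given separately.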

A FUNPACK configuration file can include other configuration files, which makes it possible to build a configuration profile that is organised across several files. For example, you may wish to store variable selectors in one file, variables.cfg:

variable 31    # sex
variable 33    # dob
variable 41202 # ICD10 main diagnoses

Then your main configuration file can include variables.cfg via the config_file argument, along with any other options:

overwrite
write_log
config_file variables.cfg
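
As a rough sketch of what including one file from another amounts to, the Python below flattens a main configuration file and its includes into a single list of options. It is not FUNPACK's implementation: the function name expand_config and the file name main.cfg are hypothetical, and resolving a relative include against the directory of the including file is an assumption. The sketch also does not guard against files that include each other in a cycle.

```python
import os
import shlex
import tempfile

def expand_config(path):
    """Return the options in a configuration file, following includes."""
    options = []
    with open(path) as f:
        for line in f:
            # Drop "#" comments and split the line on whitespace
            tokens = shlex.split(line, comments=True)
            if not tokens:
                continue
            if tokens[0] == 'config_file':
                # Assumption: relative paths are resolved against the
                # directory containing the including file
                included = os.path.join(os.path.dirname(path), tokens[1])
                options.extend(expand_config(included))
            else:
                options.append(tokens)
    return options

# Recreate the two example files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'variables.cfg'), 'w') as f:
        f.write('variable 31    # sex\n'
                'variable 33    # dob\n'
                'variable 41202 # ICD10 main diagnoses\n')
    with open(os.path.join(tmp, 'main.cfg'), 'w') as f:
        f.write('overwrite\n'
                'write_log\n'
                'config_file variables.cfg\n')
    flattened = expand_config(os.path.join(tmp, 'main.cfg'))

for option in flattened:
    print(' '.join(option))
```

The flattened result lists overwrite and write_log followed by the three variable options, which is the set of options the main file is equivalent to once its include has been followed.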

The FMRIB configuration profile (described here) uses this feature to organise the options and rules that it applies.

Command-line reference

All of the command-line options that FUNPACK accepts are documented below:

usage: funpack [-h] [-e ENCODING] [-l FILE LOADER] [-i FILE INDEX]
               [-ma {0,rows,subjects,1,cols,columns,variables}]
               [-ms {naive,union,intersection,inner,outer}] [-rm] [-rd]
               [-cfg CONFIG_FILE] [-vf VARIABLE_FILE] [-df DATACODING_FILE]
               [-tf TYPE_FILE] [-pf PROCESSING_FILE] [-cf CATEGORY_FILE]
               [-s SUBJECT] [-v VARIABLE] [-co COLUMN] [-c CATEGORY]
               [-vi VISIT] [-ex EXCLUDE] [-ev EXCLUDE_VARIABLE]
               [-ec EXCLUDE_CATEGORY] [-aa] [-iv] [-tt] [-fm] [-sn] [-scv]
               [-scf] [-sr] [-nv VID NAVALUES] [-re VID RAWLEVELS NEWLEVELS]
               [-cv VID EXPRS VALUES] [-cl VID EXPR] [-tc TYPE EXPR]
               [-gc EXPR] [-sp] [-ppr VARS PROCS] [-apr VARS PROCS] [-io]
               [-f FORMAT] [-edf DATE_FORMAT] [-etf TIME_FORMAT]
               [-evf VID FORMATTER] [-esn] [-nr NUM_ROWS] [-ts TSV_SEP] [-en]
               [-tm TSV_MISSING_VALUES] [-hk HDF5_KEY] [-hs {pandas,funpack}]
               [-wl] [-wnn] [-wu] [-wim] [-wde] [-ws] [-lf LOG_FILE]
               [-nnf NON_NUMERICS_FILE] [-uf UNKNOWN_VARS_FILE]
               [-imf ICD10_MAP_FILE] [-def DESCRIPTION_FILE]
               [-sf SUMMARY_FILE] [-V] [-dn] [-d] [-nb] [-nd] [-nj NUM_JOBS]
               [-p FILE] [-ow] [-n] [-q] [-eh ECHO]
               outfile infile [infile ...]

options:
  -h, --help            show this help message and exit

Input/output files:
  outfile               Location to store output data.
  infile                File(s) containing input data.
  -e ENCODING, --encoding ENCODING
                        Character encoding. A single encoding can be
                        specified, or this option can be used multiple times,
                        one for each input file.
  -l FILE LOADER, --loader FILE LOADER
                        Use custom loader for file. Can be used multiple
                        times.
  -i FILE INDEX, --index FILE INDEX
                        Position of index column for file (starting from 0).
                        Can be used multiple times. Defaults to 0. Specify
                        multi-column indexes as comma-separated lists.
  -ma {0,rows,subjects,1,cols,columns,variables}, --merge_axis {0,rows,subjects,1,cols,columns,variables}
                        Axis to concatenate multiple input files along
                        (default: "variables"). Options are
                        "subjects"/"rows"/"0" or "variables"/"columns"/"1".
  -ms {naive,union,intersection,inner,outer}, --merge_strategy {naive,union,intersection,inner,outer}
                        Strategy for merging multiple input files (default:
                        "intersection"). Options are "naive",
                        "intersection"/"inner", or "union"/"outer".
  -rm, --remove_duplicates
                        Remove duplicate columns, only retaining the first.
  -rd, --rename_duplicates
                        Rename any duplicate columns so that all columns have
                        a unique name.
  -cfg CONFIG_FILE, --config_file CONFIG_FILE
                        File containing default command line arguments. Can be
                        used multiple times.
  -vf VARIABLE_FILE, --variable_file VARIABLE_FILE
                        File(s) containing rules for handling variables
  -df DATACODING_FILE, --datacoding_file DATACODING_FILE
                        File(s) containing rules for handling data codings.
  -tf TYPE_FILE, --type_file TYPE_FILE
                        File containing rules for handling types.
  -pf PROCESSING_FILE, --processing_file PROCESSING_FILE
                        File containing variable processing rules.
  -cf CATEGORY_FILE, --category_file CATEGORY_FILE
                        File containing variable categories.

Import options:
  -s SUBJECT, --subject SUBJECT
                        Subject ID, range, comma-separated list, or file
                        containing a list of subject IDs, or variable
                        expression, denoting subjects to include. Can be used
                        multiple times.
  -v VARIABLE, --variable VARIABLE
                        Variable ID, range, comma-separated list, or file
                        containing a list of variable IDs, to import. Can be
                        used multiple times.
  -co COLUMN, --column COLUMN
                        Name of column to import, or file containing a list of
                        column names. Can also be a glob-style wildcard
                        pattern - columns with a name matching the pattern
                        will be imported. Can be used multiple times.
  -c CATEGORY, --category CATEGORY
                        Category ID or label to import. Can be used multiple
                        times.
  -vi VISIT, --visit VISIT
                        Import this visit. Can be used multiple times.
                        Allowable values are 'first', 'last', or a visit
                        number.
  -ex EXCLUDE, --exclude EXCLUDE
                        Subject ID, range, comma-separated list, or file
                        containing a list of subject IDs specifying subjects
                        to exclude. Can be used multiple times.
  -ev EXCLUDE_VARIABLE, --exclude_variable EXCLUDE_VARIABLE
                        Variable ID, range, comma-separated list, or file
                        containing a list of variable IDs, to exclude. Takes
                        precedence over the --variable and --category options.
                        Can be used multiple times.
  -ec EXCLUDE_CATEGORY, --exclude_category EXCLUDE_CATEGORY
                        Category ID or label to exclude. Takes precedence over
                        the --variable and --category options. Can be used
                        multiple times.
  -aa, --add_aux_vars   Automatically import auxillary variables which are
                        specified in processing rules if not already selected.
                        Note that this only affects auxillary variables - see
                        the "take" option to the binariseCategorical
                        processing function for an example of an auxillary
                        variable.
  -iv, --index_visits   If set, the data is re-arranged so that visits form
                        part of the row indices, rather than being stored in
                        separate columns for each variable. Only applied to
                        variables which are labelled with instancing 2
                        (Biobank assessment centre visit).
  -tt, --trust_types    Assume that columns in the input data with a known
                        numeric type do not contain any bad/unparseable
                        values. Using this flag will improve import
                        performance, but will cause funpack to crash if the
                        input file(s) do contain bad values.
  -fm, --fail_if_missing
                        Causes an error to be raised if a requested variable
                        is not present in the input file(s).

Cleaning options:
  -sn, --skip_insertna  Skip NA insertion.
  -scv, --skip_childvalues
                        Skip child value replacement.
  -scf, --skip_clean_funcs
                        Skip cleaning functions defined in variable table.
  -sr, --skip_recoding  Skip raw->new level recoding.
  -nv VID NAVALUES, --na_values VID NAVALUES
                        Replace these values with NA (overrides any NA values
                        specified in variable table). Can be used multiple
                        times. Values must be specified as a comma-separated
                        list.
  -re VID RAWLEVELS NEWLEVELS, --recoding VID RAWLEVELS NEWLEVELS
                        Recode categorical variables (overrides any raw/new
                        level recodings specified in variable table). Can be
                        used multiple times. Raw and new level values must be
                        specified as comma-separated lists.
  -cv VID EXPRS VALUES, --child_values VID EXPRS VALUES
                        Replace NA with the given values where the given
                        expressions evaluate to true (overrides any
                        parent/child values specified in variable table). Can
                        be used multiple times. Parent value expressions and
                        corresponding child values must be specified as comma-
                        separated lists.
  -cl VID EXPR, --clean VID EXPR
                        Apply cleaning function(s) to variable (overrides any
                        cleaning specified in variable table).
  -tc TYPE EXPR, --type_clean TYPE EXPR
                        Apply clean function(s) to all variables of the
                        specified type (overrides any cleaning specified in
                        type table).
  -gc EXPR, --global_clean EXPR
                        Apply cleaning function(s) to every variable (after
                        variable-specific cleaning specified in variable table
                        or via --clean).

Processing options:
  -sp, --skip_processing
                        Skip processing functions defined in processing table
                        (but not those specified via --prepend_process and
                        --append_process).
  -ppr VARS PROCS, --prepend_process VARS PROCS
                        Apply processing function(s) before processing defined
                        in processing table.
  -apr VARS PROCS, --append_process VARS PROCS
                        Apply processing function(s) after processing defined
                        in processing table.

Export options:
  Non-numeric columns are exported to the main output file by default,
  but you can control this behaviour using one of the following options:
  
   - The --suppress_non_numerics option tells FUNPACK to only save
     numeric columns to the main output file.
   - The --write_non_numerics option (described in the "Auxillary output
     file options" section) tells FUNPACK to save non-numeric columns to
     a separate output file.
  
  Note that the --suppress_non_numerics option is independent from the
  --write_non_numerics option - if you want to save non-numeric columns
  to a separate file, instead of to the main file, you must use both
  options together.

  -io, --ids_only       Do not output any data - instead, output a plain-text
                        file which just contains the subject IDs, one per
                        line. When this option is used, all cleaning and
                        processing routines will be skipped, and no other
                        output files (with the exception of a log file) will
                        be produced.
  -f FORMAT, --format FORMAT
                        Output file format (default: inferred from the output
                        file suffix - one of "tsv", "csv", or "h5").
  -edf DATE_FORMAT, --date_format DATE_FORMAT
                        Formatter to use for date variables (default:
                        "default").
  -etf TIME_FORMAT, --time_format TIME_FORMAT
                        Formatter to use for time variables (default:
                        "default").
  -evf VID FORMATTER, --var_format VID FORMATTER
                        Apply custom formatter to the specified variable.
  -esn, --suppress_non_numerics
                        Do not save non-numeric columns to the main output
                        file.
  -nr NUM_ROWS, --num_rows NUM_ROWS
                        Number of rows to write at a time. Ignored if
                        --num_jobs is set to 1.

TSV/CSV export options:
  -ts TSV_SEP, --tsv_sep TSV_SEP
                        Column separator string to use in output file
                        (default: "," for csv, "\t" for tsv).
  -en, --escape_newlines
                        Escape any newline characters in all non-numeric
                        columns, replacing them with a literal "\n" (with the
                        same logic applied to any other escape characters).
  -tm TSV_MISSING_VALUES, --tsv_missing_values TSV_MISSING_VALUES
                        String to use for missing values in output file
                        (default: empty string).

HDF5 export options:
  -hk HDF5_KEY, --hdf5_key HDF5_KEY
                        Key/name to use for the HDF5 group (default:
                        "funpack").
  -hs {pandas,funpack}, --hdf5_style {pandas,funpack}
                        HDF5 style to save as (default: "pandas").

Auxillary output file options:
  If the --write_log option is used, a default name, based on the main
  output file name, will be given to the log file. Alternatively, the
  --log_file option can be used with a specific name to use for the log
  file. This logic also applies to the other auxillary output files.
  
  The --unknown_vars_file allows a file to be saved containing
  information about columns which were in the input data, but were either
  not in the variable table, or were uncategorised and did not have any
  cleaning/processing rules specified. It contains four columns - the
  column name, the originating input file, the reason the column is being
  included (either unknown or uncategorised), and whether the column was
  exported.

  -wl, --write_log      Save log messages to file.
  -wnn, --write_non_numerics
                        Save non-numeric columns to file.
  -wu, --write_unknown_vars
                        Save list of unknown/uncategorised variables/columns
                        to file.
  -wim, --write_icd10_map
                        Save converted ICD10 code mappings to file
  -wde, --write_description
                        Save descriptions of each column to file
  -ws, --write_summary  Save summary of cleaning applied to each column to
                        file
  -lf LOG_FILE, --log_file LOG_FILE
                        Save log messages to file.
  -nnf NON_NUMERICS_FILE, --non_numerics_file NON_NUMERICS_FILE
                        Save non-numeric columns to file.
  -uf UNKNOWN_VARS_FILE, --unknown_vars_file UNKNOWN_VARS_FILE
                        Save list of unknown/uncategorised variables/columns
                        to file.
  -imf ICD10_MAP_FILE, --icd10_map_file ICD10_MAP_FILE
                        Save converted ICD10 code mappings to file
  -def DESCRIPTION_FILE, --description_file DESCRIPTION_FILE
                        Save descriptions of each column to file
  -sf SUMMARY_FILE, --summary_file SUMMARY_FILE
                        Save summary of cleaning applied to each column to
                        file

Miscellaneous options:
  -V, --version         Print version and exit.
  -dn, --drop_na_rows   Drop rows which do not contain data for any column.
                        This will take place on both import and export.
  -d, --dry_run         Print a summary of what would happen and exit.
  -nb, --no_builtins    Do not use the built in variable or data coding
                        tables.
  -nd, --no_download    Do not download UKB schema files - use built-in copies
                        instead.
  -nj NUM_JOBS, --num_jobs NUM_JOBS
                        Maximum number of jobs to run in parallel. Set to 1
                        (the default) to disable parallelisation. Set to -1 to
                        use all available cores (2 on this platform).
  -p FILE, --plugin_file FILE
                        Path to file containing custom funpack plugins. Can be
                        used multiple times.
  -ow, --overwrite      Overwrite output file if it already exists
  -n, --noisy           Print debug statements. Can be used up to three times.
  -q, --quiet           Be quiet.
  -eh ECHO, --echo ECHO
                        Echo the given value when parsed. This option is
                        intended for use within FUNPACK configuration files,
                        for example to print out the configuration file
                        version.

Available cleaning routines:
  - remove               remove()
                         Remove (do not load) all columns associated with vid.
  - keepVisits           keepVisits(visit[, visit, ...])
                         Only load columns for vid which correspond to the specified visits.
  - fillVisits           fillVisits([method=(mode|mean)])
                         Fill missing visits from available visits.
  - fillMissing          fillMissing(fill_value)
                         Fill missing values with a constant.
  - flattenHierarchical  flattenHierarchical([level][, name][, numeric][, convertNumeric])
                         Replace leaf values with parent values in hierarchical variables.
  - codeToNumeric        codeToNumeric()
                         Convert hierarchical coding labels to numeric equivalents (node IDs).
  - parseSpirometryData  parseSpirometryData()
                         Parse spirometry (lung volume measurement) data.
  - makeNa               makeNa(expression)
                         Replace values which pass the expression with NA.

Available processing routines:
  - removeIfSparse          removeIfSparse([minpres][, minstd][, mincat][, maxcat][, abspres][, abscat][, naval])
                            Removes columns deemed to be sparse.
  - removeIfRedundant       removeIfRedundant(corrthres, [nathres])
                            Removes columns deemed to be redundant.
  - binariseCategorical     binariseCategorical([acrossVisits][, acrossInstances][, minpres][, nameFormat][, replace][, take][, fillval][, replaceTake])
                            Replace a categorical column with one binary column per category.
  - expandCompound          expandCompound([nameFormat][, replace])
                            Expand a compound column into a set of columns, one for each value.
  - createDiagnosisColumns  createDiagnosisColumns(primvid, secvid)
                            Create binary columns for (e.g.) ICD10 codes, denoting them as either primary or secondary.
  - removeField             removeField()
                            Remove all columns associated with one or more datafields/variables.
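
The --merge_strategy option listed under "Input/output files" pairs "intersection" with "inner" and "union" with "outer". Assuming these names carry their usual join meaning when files are merged along the "variables" axis, the effect on the set of subjects can be sketched as follows. The function merge_subjects and the subject IDs are hypothetical, the "naive" strategy is deliberately not covered, and FUNPACK's actual behaviour should be checked against its own documentation.

```python
def merge_subjects(files, strategy):
    """Return the sorted subject IDs kept when merging files column-wise.

    Each entry of `files` is the list of subject IDs found in one file.
    """
    id_sets = [set(ids) for ids in files]
    if strategy in ('intersection', 'inner'):
        # Keep only subjects that are present in every input file
        kept = set.intersection(*id_sets)
    elif strategy in ('union', 'outer'):
        # Keep every subject that is present in at least one input file
        kept = set.union(*id_sets)
    else:
        raise ValueError(f'strategy not covered by this sketch: {strategy}')
    return sorted(kept)

input1 = [1, 2, 3, 4]   # subject IDs in a first, made-up input file
input2 = [3, 4, 5]      # subject IDs in a second, made-up input file

print(merge_subjects([input1, input2], 'intersection'))
print(merge_subjects([input1, input2], 'union'))
```

Under this reading, a subject kept by the "union" strategy but missing from one of the files would have no values for that file's columns.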