FUNPACK command-line interface
FUNPACK is controlled entirely through its command-line interface. You can control which data fields and subjects to import, which cleaning/processing rules to apply, and how the output should be formatted.
Configuration files
FUNPACK can also be controlled through configuration files. A FUNPACK
configuration file simply contains a list of command-line options, without the
leading --. For example, the options in the following command line:
fmrib_unpack \
--overwrite \
--write_log \
--icd10_map_file icd_codes.tsv \
--category 10 \
--category 11 \
output.tsv input1.tsv input2.tsv
could be stored in a configuration file config.txt:
overwrite
write_log
icd10_map_file icd_codes.tsv
category 10
category 11
And then executed as follows:
fmrib_unpack -cfg config.txt output.tsv input1.tsv input2.tsv
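The relationship between a configuration file and the command line can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not FUNPACK's own parser: the function name config_to_args is hypothetical, and treating everything after a # as a comment is an assumption.

```python
import shlex


def config_to_args(config_text):
    """Turn configuration-file text into the equivalent command-line arguments."""
    args = []
    for line in config_text.splitlines():
        # comments=True discards anything after a '#' (assumed comment syntax)
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue  # blank or comment-only line
        # The first token is the option name; restore its leading '--'
        args.append('--' + tokens[0])
        args.extend(tokens[1:])
    return args


config = """\
overwrite
write_log
icd10_map_file icd_codes.tsv
category 10
category 11
"""

args = config_to_args(config)
command = ['fmrib_unpack'] + args + ['output.tsv', 'input1.tsv', 'input2.tsv']

# Print the command without running it
print(shlex.join(command))
```

The printed command should match the long-form fmrib_unpack call shown at the start of this section, with every option spelled out.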
A FUNPACK configuration file can include other configuration files, making
it possible to build a configuration profile which is organised across
several configuration files. For example, you may wish to store variable
selectors in one file, variables.cfg:
variable 31 # sex
variable 33 # dob
variable 41202 # ICD10 main diagnoses
Then your main configuration file can include variables.cfg via the config_file argument, along with any other options:
overwrite
write_log
config_file variables.cfg
The FMRIB configuration profile (described elsewhere in the FUNPACK documentation) uses this feature to organise the options and rules that it applies.
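To see what a set of nested configuration files amounts to, it can help to flatten them into a single list of options. The following Python sketch does this for the variables.cfg example above. It is not FUNPACK's implementation: the function name expand_config is hypothetical, and it assumes that # starts a comment and that an included path is resolved relative to the file which includes it.

```python
import shlex
import tempfile
from pathlib import Path


def expand_config(path):
    """Return the options in a configuration file, with includes inlined."""
    path = Path(path)
    options = []
    for line in path.read_text().splitlines():
        # Assumption: '#' starts a comment, as in the variables.cfg example
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue
        if tokens[0] == 'config_file':
            # Assumption: included paths are relative to the including file
            options.extend(expand_config(path.parent / tokens[1]))
        else:
            options.append(tokens)
    return options


# Recreate the two example files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'variables.cfg').write_text(
        'variable 31 # sex\n'
        'variable 33 # dob\n'
        'variable 41202 # ICD10 main diagnoses\n')
    (tmp / 'main.cfg').write_text(
        'overwrite\n'
        'write_log\n'
        'config_file variables.cfg\n')
    options = expand_config(tmp / 'main.cfg')

for option in options:
    print(' '.join(option))
```

The sketch does not guard against configuration files which include each other in a loop, so a real tool would need to track which files it has already visited.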
Command-line reference
All of the command-line options that FUNPACK accepts are documented below:
usage: funpack [-h] [-e ENCODING] [-l FILE LOADER] [-i FILE INDEX]
[-ma {0,rows,subjects,1,cols,columns,variables}]
[-ms {naive,union,intersection,inner,outer}] [-rm] [-rd]
[-cfg CONFIG_FILE] [-vf VARIABLE_FILE] [-df DATACODING_FILE]
[-tf TYPE_FILE] [-pf PROCESSING_FILE] [-cf CATEGORY_FILE]
[-s SUBJECT] [-v VARIABLE] [-co COLUMN] [-c CATEGORY]
[-vi VISIT] [-ex EXCLUDE] [-ev EXCLUDE_VARIABLE]
[-ec EXCLUDE_CATEGORY] [-aa] [-iv] [-tt] [-fm] [-sn] [-scv]
[-scf] [-sr] [-nv VID NAVALUES] [-re VID RAWLEVELS NEWLEVELS]
[-cv VID EXPRS VALUES] [-cl VID EXPR] [-tc TYPE EXPR]
[-gc EXPR] [-sp] [-ppr VARS PROCS] [-apr VARS PROCS] [-io]
[-f FORMAT] [-edf DATE_FORMAT] [-etf TIME_FORMAT]
[-evf VID FORMATTER] [-esn] [-nr NUM_ROWS] [-ts TSV_SEP] [-en]
[-tm TSV_MISSING_VALUES] [-hk HDF5_KEY] [-hs {pandas,funpack}]
[-wl] [-wnn] [-wu] [-wim] [-wde] [-ws] [-lf LOG_FILE]
[-nnf NON_NUMERICS_FILE] [-uf UNKNOWN_VARS_FILE]
[-imf ICD10_MAP_FILE] [-def DESCRIPTION_FILE]
[-sf SUMMARY_FILE] [-V] [-dn] [-d] [-nb] [-nd] [-nj NUM_JOBS]
[-p FILE] [-ow] [-n] [-q] [-eh ECHO]
outfile infile [infile ...]
options:
-h, --help show this help message and exit
Input/output files:
outfile Location to store output data.
infile File(s) containing input data.
-e ENCODING, --encoding ENCODING
Character encoding. A single encoding can be
specified, or this option can be used multiple times,
one for each input file.
-l FILE LOADER, --loader FILE LOADER
Use custom loader for file. Can be used multiple
times.
-i FILE INDEX, --index FILE INDEX
Position of index column for file (starting from 0).
Can be used multiple times. Defaults to 0. Specify
multi-column indexes as comma-separated lists.
-ma {0,rows,subjects,1,cols,columns,variables}, --merge_axis {0,rows,subjects,1,cols,columns,variables}
Axis to concatenate multiple input files along
(default: "variables"). Options are
"subjects"/"rows"/"0" or "variables"/"columns"/"1".
-ms {naive,union,intersection,inner,outer}, --merge_strategy {naive,union,intersection,inner,outer}
Strategy for merging multiple input files (default:
"intersection"). Options are "naive",
"intersection"/"inner", or "union"/"outer".
-rm, --remove_duplicates
Remove duplicate columns, only retaining the first.
-rd, --rename_duplicates
Rename any duplicate columns so that all columns have
a unique name.
-cfg CONFIG_FILE, --config_file CONFIG_FILE
File containing default command line arguments. Can be
used multiple times.
-vf VARIABLE_FILE, --variable_file VARIABLE_FILE
File(s) containing rules for handling variables
-df DATACODING_FILE, --datacoding_file DATACODING_FILE
File(s) containing rules for handling data codings.
-tf TYPE_FILE, --type_file TYPE_FILE
File containing rules for handling types.
-pf PROCESSING_FILE, --processing_file PROCESSING_FILE
File containing variable processing rules.
-cf CATEGORY_FILE, --category_file CATEGORY_FILE
File containing variable categories.
Import options:
-s SUBJECT, --subject SUBJECT
Subject ID, range, comma-separated list, or file
containing a list of subject IDs, or variable
expression, denoting subjects to include. Can be used
multiple times.
-v VARIABLE, --variable VARIABLE
Variable ID, range, comma-separated list, or file
containing a list of variable IDs, to import. Can be
used multiple times.
-co COLUMN, --column COLUMN
Name of column to import, or file containing a list of
column names. Can also be a glob-style wildcard
pattern - columns with a name matching the pattern
will be imported. Can be used multiple times.
-c CATEGORY, --category CATEGORY
Category ID or label to import. Can be used multiple
times.
-vi VISIT, --visit VISIT
Import this visit. Can be used multiple times.
Allowable values are 'first', 'last', or a visit
number.
-ex EXCLUDE, --exclude EXCLUDE
Subject ID, range, comma-separated list, or file
containing a list of subject IDs specifying subjects
to exclude. Can be used multiple times.
-ev EXCLUDE_VARIABLE, --exclude_variable EXCLUDE_VARIABLE
Variable ID, range, comma-separated list, or file
containing a list of variable IDs, to exclude. Takes
precedence over the --variable and --category options.
Can be used multiple times.
-ec EXCLUDE_CATEGORY, --exclude_category EXCLUDE_CATEGORY
Category ID or label to exclude. Takes precedence over
the --variable and --category options. Can be used
multiple times.
-aa, --add_aux_vars Automatically import auxillary variables which are
specified in processing rules if not already selected.
Note that this only affects auxillary variables - see
the "take" option to the binariseCategorical
processing function for an example of an auxillary
variable.
-iv, --index_visits If set, the data is re-arranged so that visits form
part of the row indices, rather than being stored in
separate columns for each variable. Only applied to
variables which are labelled with instancing 2
(Biobank assessment centre visit).
-tt, --trust_types Assume that columns in the input data with a known
numeric type do not contain any bad/unparseable
values. Using this flag will improve import
performance, but will cause funpack to crash if the
input file(s) do contain bad values.
-fm, --fail_if_missing
Causes an error to be raised if a requested variable
is not present in the input file(s).
Cleaning options:
-sn, --skip_insertna Skip NA insertion.
-scv, --skip_childvalues
Skip child value replacement.
-scf, --skip_clean_funcs
Skip cleaning functions defined in variable table.
-sr, --skip_recoding Skip raw->new level recoding.
-nv VID NAVALUES, --na_values VID NAVALUES
Replace these values with NA (overrides any NA values
specified in variable table). Can be used multiple
times. Values must be specified as a comma-separated
list.
-re VID RAWLEVELS NEWLEVELS, --recoding VID RAWLEVELS NEWLEVELS
Recode categorical variables (overrides any raw/new
level recodings specified in variable table). Can be
used multiple times. Raw and new level values must be
specified as comma-separated lists.
-cv VID EXPRS VALUES, --child_values VID EXPRS VALUES
Replace NA with the given values where the given
expressions evaluate to true (overrides any
parent/child values specified in variable table). Can
be used multiple times. Parent value expressions and
corresponding child values must be specified as comma-
separated lists.
-cl VID EXPR, --clean VID EXPR
Apply cleaning function(s) to variable (overrides any
cleaning specified in variable table).
-tc TYPE EXPR, --type_clean TYPE EXPR
Apply clean function(s) to all variables of the
specified type (overrides any cleaning specified in
type table).
-gc EXPR, --global_clean EXPR
Apply cleaning function(s) to every variable (after
variable-specific cleaning specified in variable table
or via --clean).
Processing options:
-sp, --skip_processing
Skip processing functions defined in processing table
(but not those specified via --prepend_process and
--append_process).
-ppr VARS PROCS, --prepend_process VARS PROCS
Apply processing function(s) before processing defined
in processing table.
-apr VARS PROCS, --append_process VARS PROCS
Apply processing function(s) after processing defined
in processing table.
Export options:
Non-numeric columns are exported to the main output file by default,
but you can control this behaviour using one of the following options:
- The --suppress_non_numerics option tells FUNPACK to only save
numeric columns to the main output file.
- The --write_non_numerics option (described in the "Auxillary output
file options" section) tells FUNPACK to save non-numeric columns to
a separate output file.
Note that the --suppress_non_numerics option is independent from the
--write_non_numerics option - if you want to save non-numeric columns
to a separate file, instead of to the main file, you must use both
options together.
-io, --ids_only Do not output any data - instead, output a plain-text
file which just contains the subject IDs, one per
line. When this option is used, all cleaning and
processing routines will be skipped, and no other
output files (with the exception of a log file) will
be produced.
-f FORMAT, --format FORMAT
Output file format (default: inferred from the output
file suffix - one of "tsv", "csv", or "h5").
-edf DATE_FORMAT, --date_format DATE_FORMAT
Formatter to use for date variables (default:
"default").
-etf TIME_FORMAT, --time_format TIME_FORMAT
Formatter to use for time variables (default:
"default").
-evf VID FORMATTER, --var_format VID FORMATTER
Apply custom formatter to the specified variable.
-esn, --suppress_non_numerics
Do not save non-numeric columns to the main output
file.
-nr NUM_ROWS, --num_rows NUM_ROWS
Number of rows to write at a time. Ignored if
--num_jobs is set to 1.
TSV/CSV export options:
-ts TSV_SEP, --tsv_sep TSV_SEP
Column separator string to use in output file
(default: "," for csv, "\t" for tsv).
-en, --escape_newlines
Escape any newline characters in all non-numeric
columns, replacing them with a literal "\n" (with the
same logic applied to any other escape characters).
-tm TSV_MISSING_VALUES, --tsv_missing_values TSV_MISSING_VALUES
String to use for missing values in output file
(default: empty string).
HDF5 export options:
-hk HDF5_KEY, --hdf5_key HDF5_KEY
Key/name to use for the HDF5 group (default:
"funpack").
-hs {pandas,funpack}, --hdf5_style {pandas,funpack}
HDF5 style to save as (default: "pandas").
Auxillary output file options:
If the --write_log option is used, a default name, based on the main
output file name, will be given to the log file. Alternatively, the
--log_file option can be used with a specific name to use for the log
file. This logic also applies to the other auxillary output files.
The --unknown_vars_file allows a file to be saved containing
information about columns which were in the input data, but were either
not in the variable table, or were uncategorised and did not have any
cleaning/processing rules specified. It contains four columns - the
column name, the originating input file, the reason the column is being
included (either unknown or uncategorised), and whether the column was
exported.
-wl, --write_log Save log messages to file.
-wnn, --write_non_numerics
Save non-numeric columns to file.
-wu, --write_unknown_vars
Save list of unknown/uncategorised variables/columns
to file.
-wim, --write_icd10_map
Save converted ICD10 code mappings to file
-wde, --write_description
Save descriptions of each column to file
-ws, --write_summary Save summary of cleaning applied to each column to
file
-lf LOG_FILE, --log_file LOG_FILE
Save log messages to file.
-nnf NON_NUMERICS_FILE, --non_numerics_file NON_NUMERICS_FILE
Save non-numeric columns to file.
-uf UNKNOWN_VARS_FILE, --unknown_vars_file UNKNOWN_VARS_FILE
Save list of unknown/uncategorised variables/columns
to file.
-imf ICD10_MAP_FILE, --icd10_map_file ICD10_MAP_FILE
Save converted ICD10 code mappings to file
-def DESCRIPTION_FILE, --description_file DESCRIPTION_FILE
Save descriptions of each column to file
-sf SUMMARY_FILE, --summary_file SUMMARY_FILE
Save summary of cleaning applied to each column to
file
Miscellaneous options:
-V, --version Print version and exit.
-dn, --drop_na_rows Drop rows which do not contain data for any column.
will take place on both import and export.
-d, --dry_run Print a summary of what would happen and exit.
-nb, --no_builtins Do not use the built in variable or data coding
tables.
-nd, --no_download Do not download UKB schema files - use built-in copies
instead.
-nj NUM_JOBS, --num_jobs NUM_JOBS
Maximum number of jobs to run in parallel. Set to 1
(the default) to disable parallelisation. Set to -1 to
use all available cores (2 on this platform).
-p FILE, --plugin_file FILE
Path to file containing custom funpack plugins. Can be
used multiple times.
-ow, --overwrite Overwrite output file if it already exists
-n, --noisy Print debug statements. Can be used up to three times.
-q, --quiet Be quiet.
-eh ECHO, --echo ECHO
Echo the given value when parsed. This option is
intended for use within FUNPACK configuration files,
for example to print out the configuration file
version.
Available cleaning routines:
- remove remove()
Remove (do not load) all columns associated with vid.
- keepVisits keepVisits(visit[, visit, ...])
Only load columns for vid which correspond to the specified visits.
- fillVisits fillVisits([method=(mode|mean)])
Fill missing visits from available visits.
- fillMissing fillMissing(fill_value)
Fill missing values with a constant.
- flattenHierarchical flattenHierarchical([level][, name][, numeric][, convertNumeric])
Replace leaf values with parent values in hierarchical variables.
- codeToNumeric codeToNumeric()
Convert hierarchical coding labels to numeric equivalents (node IDs).
- parseSpirometryData parseSpirometryData()
Parse spirometry (lung volume measurement) data.
- makeNa makeNa(expression)
Replace values which pass the expression with NA.
Available processing routines:
- removeIfSparse removeIfSparse([minpres][, minstd][, mincat][, maxcat][, abspres][, abscat][, naval])
Removes columns deemed to be sparse.
- removeIfRedundant removeIfRedundant(corrthres, [nathres])
Removes columns deemed to be redundant.
- binariseCategorical binariseCategorical([acrossVisits][, acrossInstances][, minpres][, nameFormat][, replace][, take][, fillval][, replaceTake])
Replace a categorical column with one binary column per category.
- expandCompound expandCompound([nameFormat][, replace])
Expand a compound column into a set of columns, one for each value.
- createDiagnosisColumns createDiagnosisColumns(primvid, secvid)
Create binary columns for (e.g.) ICD10 codes, denoting them as either primary or secondary.
- removeField removeField()
Remove all columns associated with one or more datafields/variables.
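Several options in the help text above take more than one value (for example --na_values VID NAVALUES and --recoding VID RAWLEVELS NEWLEVELS), with the values themselves given as comma-separated lists. The Python sketch below shows one way such a command might be assembled, and also combines --suppress_non_numerics with --write_non_numerics as described under "Export options". The helper names are hypothetical, and variable ID 1 and its values are made up purely for illustration. The command is only printed, not executed.

```python
import shlex


def comma_list(values):
    """Join values into the comma-separated form the help text asks for."""
    return ','.join(str(v) for v in values)


def build_command(outfile, infiles):
    # Variable ID 1 and the values below are hypothetical
    vid = 1
    args = ['fmrib_unpack', '--overwrite']
    # --na_values VID NAVALUES: replace the values 555 and 999 with NA
    args += ['--na_values', str(vid), comma_list([555, 999])]
    # --recoding VID RAWLEVELS NEWLEVELS: recode 1 to 0, and 2 to 1
    args += ['--recoding', str(vid), comma_list([1, 2]), comma_list([0, 1])]
    # Keep non-numeric columns out of the main output file,
    # and save them to a separate file instead
    args += ['--suppress_non_numerics', '--write_non_numerics']
    return args + [outfile] + list(infiles)


command = build_command('output.tsv', ['input1.tsv', 'input2.tsv'])

# Print a shell-quoted version of the command without running it
print(shlex.join(command))
```

Building the command as a list keeps each value as a separate argument, so the comma-separated lists reach FUNPACK intact regardless of shell quoting rules.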