Cleaning and processing functions
This page contains a list of all cleaning and processing functions that are available within FUNPACK. Recall that running FUNPACK involves four stages:
Import: The requested rows (subjects) and columns (variables/datafields) are loaded into memory.
Cleaning: A number of steps are applied to each column, including NA value replacement, categorical recoding, child value replacement, and variable-specific cleaning functions.
Processing: Columns may be added or removed by a number of processing functions, such as removing columns due to sparsity, or generating new binary columns from an existing categorical column.
Export: The data is saved to a new file.
This page covers the variable-specific cleaning functions and processing functions that are built into FUNPACK.
Cleaning functions and processing functions differ in the following ways:
Cleaning functions are applied to a single variable / data field at a time, whereas processing functions may be applied to multiple variables / data fields at once.
Cleaning functions cannot add or remove columns, whereas processing functions are able to remove existing columns and add new columns.
Cleaning and processing functions may be specified on the command line, or in separate variable or processing tables (via the --variable_file and --processing_file command-line options). In all cases:
You may specify a single function, or a comma-separated list of functions to be applied in order.
When no arguments are specified, the function parentheses are optional.
Function arguments may be passed as positional or keyword (named) arguments.
For example, this command would run cleaning functions func1, func2, func3, and func4 on the columns of data field 12345:
fmrib_unpack \
-cl 12345 "func1,func2(),func3(99),func4('arg1', arg2=1234)" \
output.tsv input.csv
Cleaning functions
Built-in cleaning functions are defined in the funpack.cleaning_functions module. These functions may be used with the -cl / --clean command-line option. For example, to apply fillVisits and flattenHierarchical to the columns of data field 20002:
fmrib_unpack -cl 20002 "fillVisits,flattenHierarchical" output.tsv input.csv
- funpack.cleaning_functions.fillVisits([method=(mode|mean)])[source]
Fill missing visits from available visits.
For a variable with multiple visits, fills missing values from the visits that do contain data. The method argument can be set to one of:
'mode' (the default) - fill missing visits with the most frequently occurring value in the available visits
'mean' - fill missing visits with the mean of the values in the available visits
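The row-wise fill described above can be sketched with pandas. This is a minimal illustration of the idea, not FUNPACK's implementation; the column names and sample values are made up:

```python
import numpy as np
import pandas as pd

def fill_visits(df, method='mode'):
    """Fill missing visit columns in each row from the visits that
    do contain data (illustrative sketch, not FUNPACK's code)."""
    def fill_row(row):
        present = row.dropna()
        if present.empty:
            return row
        if method == 'mode':
            fill = present.mode().iloc[0]
        else:  # 'mean'
            fill = present.mean()
        return row.fillna(fill)
    return df.apply(fill_row, axis=1)

# One column per visit of a single (hypothetical) variable
visits = pd.DataFrame({'1-0.0': [1.0,    np.nan, 4.0],
                       '1-1.0': [1.0,    2.0,    np.nan],
                       '1-2.0': [np.nan, 2.0,    np.nan]})
filled = fill_visits(visits, method='mode')
```

Each row is filled independently, so a subject with no data at any visit is left untouched.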
- funpack.cleaning_functions.fillMissing(fill_value)[source]
Fill missing values with a constant.
Fills all missing values in columns for the given variable with fill_value.
- funpack.cleaning_functions.makeNa(expression)[source]
Replace values which pass the expression with NA.
All values in the columns of the variable for which expression evaluates to True are replaced with nan.
The expression must be of the form '<operator> <value>', where <operator> is one of ==, !=, <, <=, >, >=, or contains (the latter only applicable to non-numeric variables). For example, makeNa('> 25') would cause all values greater than 25 to be replaced with nan.
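The expression logic can be sketched as follows. This is an illustrative re-implementation of the idea, not FUNPACK's code; the sample data is hypothetical:

```python
import operator
import numpy as np
import pandas as pd

# Map each supported '<operator>' token to a comparison
_OPS = {'==': operator.eq, '!=': operator.ne,
        '<':  operator.lt, '<=': operator.le,
        '>':  operator.gt, '>=': operator.ge,
        'contains': lambda s, v: s.str.contains(v)}

def make_na(col, expression):
    """Replace values matching '<operator> <value>' with nan
    (illustrative sketch, not FUNPACK's code)."""
    op, _, value = expression.partition(' ')
    func = _OPS[op]
    if op == 'contains':          # string comparison
        mask = func(col, value)
    else:                         # numeric comparison
        mask = func(col, float(value))
    return col.mask(mask, np.nan)

bmi     = pd.Series([18.0, 22.5, 30.1, 26.0])
cleaned = make_na(bmi, '> 25')    # 30.1 and 26.0 become nan
```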
- funpack.cleaning_functions.codeToNumeric([name][, scheme])[source]
Convert hierarchical coding labels to numeric equivalents (node IDs).
Given a hierarchical variable which contains alpha-numeric codings, this function will replace the codings with numeric equivalents.
For example, UKB data field 41270 (ICD10 diagnoses) uses data coding 19, which uses ICD10 alpha-numeric codes to denote medical diagnoses. For data field 41270, codeToNumeric would replace coding D730 with the corresponding node ID 22180 when using the 'node' conversion scheme, or 39730 when using the 'alpha' scheme.
The name option is unnecessary in most cases, but may be used to specify which encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
The scheme option determines the conversion method - it can be set to one of:
'node' (default): Convert alpha-numeric codes to the corresponding numeric node IDs, as specified in the UK Biobank showcase.
'alpha': Convert alpha-numeric codes to integer numbers, using an immutable scheme that will not change over time - this scheme uses the icd10.toNumeric() function.
Warning
Node IDs (which are used by the 'node' conversion scheme) are not present for non-leaf nodes in all hierarchical UK Biobank data fields (in which case this function will return nan). Also note that node IDs may change across UK Biobank showcase releases.
Warning
The 'alpha' conversion scheme is intended for use with ICD10 alpha-numeric codes (and similar coding schemes). This scheme may result in unrepresentable integer values if used with longer strings.
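Conceptually, the 'node' scheme is a lookup from coding label to node ID. The sketch below illustrates this with a one-entry table built from the D730 example above; it is not FUNPACK's code, and a real table would come from the UK Biobank showcase coding files:

```python
import numpy as np
import pandas as pd

# Single mapping taken from the D730 example in the text; a full
# table would be loaded from the UKB showcase coding files.
node_ids = {'D730': 22180}

def code_to_numeric(col, mapping):
    """Map alpha-numeric codings to numeric node IDs (illustrative
    sketch, not FUNPACK's code). Codings without a node ID - e.g.
    non-leaf nodes in some data fields - become nan."""
    return col.map(mapping).astype(float)

codes   = pd.Series(['D730', 'D730', 'X999'])  # 'X999' is made up
numeric = code_to_numeric(codes, node_ids)
```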
- funpack.cleaning_functions.flattenHierarchical([level][, name][, numeric][, convertNumeric])[source]
Replace leaf values with parent values in hierarchical variables.
For hierarchical variables such as the ICD10 disease categorisations, this function replaces leaf values with a parent value.
The level argument allows the depth of the parent value to be selected - 0 (the default) replaces a value with its top-level (i.e. most distant) parent value, 1 the second-level parent value, etc. If, for a particular value, the level argument is greater than the number of ancestors of the value, the value is unchanged.
By default, flattenHierarchical expects to be given coding values. For example, for data field 41202 (ICD10 diagnoses), coding A009 (Cholera, unspecified) would be replaced with Chapter I (Certain infectious and parasitic diseases).
The numeric argument tells the function to expect numeric node IDs (as output by codeToNumeric()), rather than coding labels. This allows flattenHierarchical to be used in conjunction with codeToNumeric, e.g.:
-cl 41202 "codeToNumeric,flattenHierarchical(numeric=True)"
Continuing the above example, the codeToNumeric function would replace A009 with its node ID 2890, and flattenHierarchical(numeric=True) would replace 2890 with 10 (the node ID for Chapter I).
The convertNumeric argument is a shortcut for the above scenario - using it implies numeric=True, and will cause the codeToNumeric routine to be called automatically. The following example is equivalent to the above:
-cl 41202 "flattenHierarchical(convertNumeric=True)"
The name option is unnecessary in most cases, but may be used to specify an encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
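The level-based flattening can be sketched as a walk up a child-to-parent map. This is an illustration of the idea, not FUNPACK's code, and the tiny two-step hierarchy below is a simplification of the A009 example:

```python
# Hypothetical child -> parent map (a fragment of the ICD10 tree;
# the intermediate 'A00' step is a simplification)
parents = {'A009': 'A00', 'A00': 'Chapter I'}

def flatten(value, parents, level=0):
    """Replace a value with its level-th ancestor (illustrative
    sketch, not FUNPACK's code). level=0 selects the most distant
    (top-level) ancestor."""
    # Build the chain from the value up to the root ...
    chain = [value]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    chain.reverse()  # chain[0] is now the top-level ancestor
    # ... and leave the value unchanged if it has fewer ancestors
    # than the requested level.
    if level >= len(chain) - 1:
        return value
    return chain[level]

flattened = flatten('A009', parents, level=0)  # top-level parent
```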
- funpack.cleaning_functions.parseSpirometryData()[source]
Parse spirometry (lung volume measurement) data.
Parses values which contain spirometry (lung volume measurement) test data.
Columns for UK Biobank data fields 3066 and 10697 contain comma-separated time series data. This function converts the text values into numpy arrays, and potentially re-orders instances of the variable within each visit - subjects potentially underwent a series of tests, with each test stored as a separate instance within one visit. However, the instance order does not necessarily correspond to the test order.
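The text-to-array conversion can be sketched as below. This is an illustration only, not FUNPACK's code; it assumes each value is a plain comma-separated list of numbers (a simplification of the real 3066/10697 format, which also carries metadata), and the sample string is made up:

```python
import numpy as np

def parse_series(value):
    """Convert one comma-separated spirometry text value into a
    numpy array (illustrative sketch, not FUNPACK's code)."""
    return np.array([float(v) for v in value.split(',')])

blow = parse_series('0.0,0.12,0.38,0.71,0.95')  # hypothetical sample
```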
Processing functions
Built-in processing functions are defined in the funpack.processing_functions module. These functions may be used with the -ppr / --prepend_process and -apr / --append_process command-line flags. You have the option to prepend or append processing functions in case you are using a pre-defined processing table or configuration profile, but wish to perform some additional steps for specific data fields.
For example, to apply removeIfSparse and binariseCategorical to the columns of data fields 20001 and 20002:
fmrib_unpack \
-apr 20001,20002 \
"removeIfSparse(50),binariseCategorical(acrossVisits=True)" \
output.tsv input.csv
- funpack.processing_functions.removeIfSparse([minpres][, minstd][, mincat][, maxcat][, abspres][, abscat][, naval])[source]
Removes columns deemed to be sparse.
Removes columns for all specified variables if they fail a sparsity test. The test is based on the following criteria:
The number/proportion of non-NA values must be greater than or equal to minpres.
The standard deviation of the data must be greater than minstd.
For integer and categorical types, the number/proportion of the largest category must be greater than mincat.
For integer and categorical types, the number/proportion of the largest category must be less than maxcat.
If any of these criteria are not met, the data is considered to be sparse. Each criterion can be disabled by passing in None for the relevant parameter.
The minstd test is only performed on numeric columns, and the mincat/maxcat tests are only performed on integer/categorical columns.
If abspres=True (the default), minpres is interpreted as an absolute count. If abspres=False, minpres is interpreted as a proportion. Similarly, if abscat=True (the default), mincat and maxcat are interpreted as absolute counts. Otherwise, mincat and maxcat are interpreted as proportions.
The naval argument can be used to customise the value to consider as “missing” - it defaults to np.nan.
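The criteria above can be sketched as a single predicate. This is an illustration of the test logic, not FUNPACK's code; for simplicity it assumes absolute counts (abspres=True, abscat=True), applies the category tests to every column type, and uses nan as the missing value:

```python
import numpy as np
import pandas as pd

def is_sparse(col, minpres=None, minstd=None, mincat=None, maxcat=None):
    """Return True if the column fails any enabled sparsity
    criterion (illustrative sketch, not FUNPACK's code)."""
    present = col.dropna()
    # non-NA count must be >= minpres
    if minpres is not None and len(present) < minpres:
        return True
    # standard deviation must be > minstd
    if minstd is not None and present.std() <= minstd:
        return True
    # largest category must be > mincat and < maxcat
    counts  = present.value_counts()
    largest = counts.iloc[0] if len(counts) else 0
    if mincat is not None and largest <= mincat:
        return True
    if maxcat is not None and largest >= maxcat:
        return True
    return False

col    = pd.Series([1.0, 1.0, 1.0, np.nan])
sparse = is_sparse(col, minpres=4)   # only 3 non-NA values present
```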
- funpack.processing_functions.removeIfRedundant(corrthres[, nathres][, skipUnknowns][, precision])[source]
Removes columns deemed to be redundant.
Removes columns from the specified group of variables if they are found to be redundant with any other columns in the group.
Redundancy is determined by calculating the correlation between all pairs of columns - columns with an absolute correlation greater than corrthres are identified as redundant.
The test can optionally take the patterns of missing values into account - if nathres is provided, the missingness correlation is also calculated between all column pairs. Columns must have absolute correlation greater than corrthres and absolute missingness correlation greater than nathres to be identified as redundant.
The skipUnknowns option defaults to False. If it is set to True, columns which are deemed to be redundant with respect to an unknown or uncategorised column are not dropped.
The precision option can be set to either 'double' (the default) or 'single' - this controls whether 32 bit (single) or 64 bit (double) precision floating point is used for the correlation calculation. Double precision is recommended, as the correlation calculation algorithm can be unstable for data with large values (>10e5).
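The pairwise correlation test can be sketched with pandas. This is an illustration only, not FUNPACK's code: it flags the later column of each over-threshold pair, and omits the optional nathres missingness test. The sample columns are made up:

```python
import pandas as pd

def redundant_columns(df, corrthres):
    """Return columns whose absolute correlation with an earlier
    column exceeds corrthres (illustrative sketch, not FUNPACK's
    code; the nathres test is omitted)."""
    corr      = df.corr().abs()
    redundant = set()
    cols      = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corrthres:
                redundant.add(b)
    return redundant

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with x
                   'z': [1.0, 0.0, 1.0, 0.0]})
to_drop = redundant_columns(df, corrthres=0.9)  # flags 'y' only
```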
- funpack.processing_functions.binariseCategorical([acrossVisits][, acrossInstances][, minpres][, nameFormat][, replace][, take][, fillval][, replaceTake])[source]
Replace a categorical column with one binary column per category.
Binarises categorical variables - replaces their columns with one new column for each unique value, containing 1 for subjects with that value, and 0 otherwise. This procedure is applied independently to all variables that are specified.
The acrossVisits option controls whether the binarisation is applied across visits for each variable. It defaults to False, meaning that the binarisation is applied separately to the columns within each visit. If set to True, the binarisation will be applied to the columns for all visits. Similarly, the acrossInstances option controls whether the binarisation is applied across instances. This defaults to True, which is usually desirable - for example, data field 41202 contains multiple ICD10 diagnoses, separated across different instances.
If the minpres option is specified, it is used as a threshold - categorical values with less than this many occurrences will not be added as columns.
The nameFormat argument controls how the new data columns should be named - it must be a format string using named replacement fields 'vid', 'visit', 'instance', and 'value'. The 'visit' and 'instance' fields may or may not be necessary, depending on the value of the acrossVisits and acrossInstances arguments.
The default value for the nameFormat string is as follows:

acrossVisits  acrossInstances  nameFormat
False         False            '{vid}-{visit}.{instance}_{value}'
False         True             '{vid}-{visit}.{value}'
True          False            '{vid}-{value}.{instance}'
True          True             '{vid}-{value}'

The replace option controls whether the original un-binarised columns should be removed - it defaults to True, which will cause them to be removed.
By default, the new binary columns (one for each unique value in the input columns) will contain a 1 indicating that the value is present, or a 0 indicating its absence. As an alternative to this, the take option can be used to specify another variable from which to take values when populating the output columns. take may be set to a variable ID, or sequence of variable IDs (one for each of the input variables) to take values from. If provided, the generated columns will have values from the column(s) of this variable, instead of containing binary 0/1 values.
A take variable must have columns that match the columns of the corresponding variable (by both visits and instances).
If take is being used, the fillval option can be used to specify the value to use for False / 0 rows. It defaults to np.nan.
The replaceTake option is similar to replace - it controls whether the columns associated with the take variables are removed (True - the default), or retained.
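The core binarisation step can be sketched as follows. This is an illustration only, not FUNPACK's code; it handles a single column with a simplified '{vid}-{value}' naming scheme (i.e. the acrossVisits=True, acrossInstances=True case), and the sample values are made up:

```python
import pandas as pd

def binarise(col, vid):
    """Replace a categorical column with one 0/1 column per unique
    value (illustrative sketch, not FUNPACK's code)."""
    out = pd.DataFrame(index=col.index)
    for value in sorted(col.dropna().unique()):
        out[f'{vid}-{value}'] = (col == value).astype(int)
    return out

diagnoses = pd.Series(['A', 'B', 'A', 'C'])  # hypothetical codes
binary    = binarise(diagnoses, vid=41202)
```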
- funpack.processing_functions.expandCompound([nameFormat][, replace])[source]
Expand a compound column into a set of columns, one for each value.
Expands compound variables into a set of columns, one for each value. Rows with different numbers of values are padded with np.nan. This procedure is applied independently to each column of each specified variable.
The nameFormat option can be used to control how the new columns should be named - it must be a format string using named replacement fields 'vid', 'visit', 'instance', and 'index'. The default value for nameFormat is '{vid}-{visit}.{instance}_{index}'.
The replace option controls whether the original columns are removed (True - the default), or retained.
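The expansion and nan-padding can be sketched as below. This is an illustration only, not FUNPACK's code; it assumes the compound values are comma-separated numbers, and the vid/visit/instance values are hypothetical:

```python
import numpy as np
import pandas as pd

def expand(col, name_format='{vid}-{visit}.{instance}_{index}',
           vid=0, visit=0, instance=0):
    """Expand a compound (comma-separated) column into one column
    per value position, padding short rows with nan (illustrative
    sketch, not FUNPACK's code)."""
    values = [[float(v) for v in str(x).split(',')] for x in col]
    width  = max(len(v) for v in values)
    padded = [v + [np.nan] * (width - len(v)) for v in values]
    names  = [name_format.format(vid=vid, visit=visit,
                                 instance=instance, index=i)
              for i in range(width)]
    return pd.DataFrame(padded, columns=names, index=col.index)

compound = pd.Series(['1,2,3', '4,5'])  # second row is one value short
expanded = expand(compound, vid=123)
```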
- funpack.processing_functions.createDiagnosisColumns(primvid, secvid[, replace][, binarised])[source]
Create binary columns for (e.g.) ICD10 codes, denoting them as either primary or secondary.
This function is intended for use with data fields containing ICD9, ICD10, OPSC3, and OPSC4 diagnosis codes. The UK Biobank publishes these diagnosis/operative procedure codes twice:
The codes are published in a “main” data field containing all codes.
The codes are published again in two other data fields, containing separate “primary” and “secondary” codes.

Code   Main data field  Primary data field  Secondary data field
ICD10  41270            41202               41204
ICD9   41271            41203               41205
OPSC4  41272            41200               41210
OPSC3  41273            41256               41258

For example, this function may be applied to the ICD10 diagnosis codes like so:
41270 createDiagnosisColumns(41202, 41204)
When applied to one of the main data fields (e.g. 41270 - ICD10 diagnoses), this function will create two new columns for every unique ICD10 diagnosis code:
the first column contains a 1 if the code corresponds to a primary diagnosis (i.e. is also in 41202).
the second column contains a 1 if the code corresponds to a secondary diagnosis (i.e. is also in 41204).
The replace option defaults to True - this causes the primary and secondary code columns to be removed from the data set.
The binarised option defaults to False, which causes this function to expect the input columns to be in their raw format, as described in the UKB showcase (e.g. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270). The binarised option can be set to True, which will allow this function to be applied to data fields which have been passed through the binariseCategorical() function.
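The primary/secondary column generation can be sketched as a membership test. This is an illustration only, not FUNPACK's code: each cell here holds a set of a subject's codes (a simplification of the real multi-column layout), the column naming is made up, and the sample codes are hypothetical:

```python
import pandas as pd

def diagnosis_columns(main, primary, secondary):
    """For each code in the main column, emit binary columns marking
    it as a primary and/or secondary diagnosis (illustrative sketch,
    not FUNPACK's code)."""
    out       = pd.DataFrame(index=main.index)
    all_codes = sorted(set().union(*main))
    for code in all_codes:
        out[f'{code}_primary']   = [int(code in p) for p in primary]
        out[f'{code}_secondary'] = [int(code in s) for s in secondary]
    return out

# Two subjects; sets of (hypothetical) ICD10 codes per subject
main      = pd.Series([{'A009', 'E110'}, {'A009'}])
primary   = pd.Series([{'A009'},         set()])
secondary = pd.Series([{'E110'},         {'A009'}])
cols = diagnosis_columns(main, primary, secondary)
```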
- funpack.processing_functions.removeField()[source]
Remove all columns associated with one or more data fields / variables.
This function can be used to simply remove columns from the data set. This can be useful if a variable is required for some processing step, but is not needed in the final output file.