funpack.cleaning_functions
This module contains definitions of cleaning functions - functions which may be specified in the Clean column of the variable table, and which are applied during the data import stage.
The following cleaning functions are available:
remove() | Remove (do not load) all columns associated with vid.
keepVisits() | Only load columns for vid which correspond to the specified visits.
fillVisits() | Fill missing visits from available visits.
fillMissing() | Fill missing values with a constant.
flattenHierarchical() | Replace leaf values with parent values in hierarchical variables.
codeToNumeric() | Convert hierarchical coding labels to numeric equivalents (node IDs).
parseSpirometryData() | Parse spirometry (lung volume measurement) data.
makeNa() | Replace values which pass the expression with NA.
All cleaning functions (with two exceptions - remove() and keepVisits(), explained below) will be passed the following as their first positional arguments, followed by any arguments specified in the variable table:
- The DataTable object, containing references to the data, variable, and processing tables.
- The integer ID of the variable to process.
These functions are expected to perform in-place pre-processing on the data.
The remove() and keepVisits() functions are called before the data is loaded, as they are used to determine which columns should be loaded in. They therefore have a different function signature to the other cleaning functions, which are called after the data has been loaded. These functions are passed the following positional arguments, followed by arguments specified in the variable table:
- The variable table, containing metadata about all known variables.
- The integer ID of the variable to be checked.
- A list of Column objects denoting all columns associated with this variable that are present in the data.
These functions are expected to return a modified list containing the columns that should be loaded.
- funpack.cleaning_functions.codeToNumeric()
Convert hierarchical coding labels to numeric equivalents (node IDs).
Given a hierarchical variable which contains alpha-numeric codings, this function will replace the codings with numeric equivalents.
For example, UKB data field 41270 (ICD10 diagnoses) uses data coding 19, which uses ICD10 alpha-numeric codes to denote medical diagnoses.
For data field 41270, codeToNumeric would replace coding D730 with the corresponding node ID 22180 when using the 'node' conversion scheme, or 39730 when using the 'alpha' scheme.
The name option is unnecessary in most cases, but may be used to specify which encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
The scheme option determines the conversion method - it can be set to one of:
- 'node' (default): Convert alpha-numeric codes to the corresponding numeric node IDs, as specified in the UK Biobank showcase.
- 'alpha': Convert alpha-numeric codes to integer numbers, using an immutable scheme that will not change over time - this scheme uses the icd10.toNumeric() function.
Warning
Note that node IDs (which are used by the 'node' conversion scheme) are not present for non-leaf nodes in all hierarchical UK Biobank data fields (in which case this function will return nan). Also note that node IDs may change across UK Biobank showcase releases.
Warning
The 'alpha' conversion scheme is intended for use with ICD10 alpha-numeric codes (and similar coding schemes). This scheme may result in unrepresentable integer values if used with longer strings.
- funpack.cleaning_functions.fillMissing(fill_value)
Fill missing values with a constant.
Fills all missing values in columns for the given variable with fill_value.
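The behaviour described above can be sketched with plain pandas. This is only an illustration, not funpack's implementation: the helper name fill_missing_sketch, the use of a bare DataFrame in place of the DataTable object, and the example column names are all assumptions.

```python
import numpy as np
import pandas as pd


def fill_missing_sketch(df, columns, fill_value):
    """Hypothetical helper: fill missing values in the given columns, in place."""
    for col in columns:
        # Series.fillna replaces NaN entries with the constant
        df[col] = df[col].fillna(fill_value)


# Two columns standing in for the columns of one variable
df = pd.DataFrame({'1-0.0': [1.0, np.nan, 3.0],
                   '1-1.0': [np.nan, 5.0, np.nan]})
fill_missing_sketch(df, ['1-0.0', '1-1.0'], -1)
```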
- funpack.cleaning_functions.fillVisits([method=(mode|mean)])
Fill missing visits from available visits.
For a variable with multiple visits, fills missing values from the visits that do contain data. The method argument can be set to one of:
- 'mode' (the default) - fill missing visits with the most frequently occurring value in the available visits
- 'mean' - fill missing visits with the mean of the values in the available visits
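A rough sketch of the two methods, using pandas row-wise statistics. The helper name fill_visits_sketch, the column names, and the way ties between modes are broken (smallest value wins) are assumptions made for this example; funpack itself may behave differently in those details.

```python
import numpy as np
import pandas as pd


def fill_visits_sketch(df, columns, method='mode'):
    """Hypothetical helper: fill gaps in each visit column from the other visits."""
    visits = df[columns]
    if method == 'mode':
        # mode(axis=1) yields one output column per tied mode, sorted
        # ascending; taking column 0 breaks ties towards the smallest
        # value (an assumption of this sketch).
        fill = visits.mode(axis=1)[0]
    elif method == 'mean':
        fill = visits.mean(axis=1)
    else:
        raise ValueError(f'unknown method: {method}')
    for col in columns:
        # fillna with a Series aligns on the row index
        df[col] = df[col].fillna(fill)
    return df


def example_frame():
    # One row per subject, one column per visit of the same variable
    return pd.DataFrame({'v0': [1.0, 2.0, np.nan],
                         'v1': [np.nan, 4.0, 7.0],
                         'v2': [1.0, np.nan, 7.0]})


by_mode = fill_visits_sketch(example_frame(), ['v0', 'v1', 'v2'], 'mode')
by_mean = fill_visits_sketch(example_frame(), ['v0', 'v1', 'v2'], 'mean')
```

For the second subject, whose available visits are 2 and 4, the 'mean' method fills the gap with 3, while the 'mode' method in this sketch falls back to 2 because both values occur once.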
- funpack.cleaning_functions.flattenHierarchical([level][, name][, numeric][, convertNumeric])
Replace leaf values with parent values in hierarchical variables.
For hierarchical variables such as the ICD10 disease categorisations, this function replaces leaf values with a parent value.
The level argument allows the depth of the parent value to be selected - 0 (the default) replaces a value with the top-level (i.e. its most distant) parent value, 1 the second-level parent value, etc.
If, for a particular value, the level argument is greater than the number of ancestors of the value, the value is unchanged.
By default, flattenHierarchical expects to be given coding values. For example, for data field 41202 (ICD10 diagnoses), coding A009 (Cholera, unspecified) would be replaced with Chapter I (Certain infectious and parasitic diseases).
The numeric argument tells the function to expect numeric node IDs (as output by codeToNumeric()), rather than coding labels. This allows flattenHierarchical to be used in conjunction with codeToNumeric, e.g.:
    -cl 41202 "codeToNumeric,flattenHierarchical(numeric=True)"
Continuing the above example, the codeToNumeric function would replace A009 with its node ID 2890, and flattenHierarchical(numeric=True) would replace 2890 with 10 (the node ID for Chapter I).
The convertNumeric argument is a shortcut for the above scenario - using this implies numeric=True, and will cause the codeToNumeric routine to be called automatically. The following example is equivalent to the above:
    -cl 41202 "flattenHierarchical(convertNumeric=True)"
The name option is unnecessary in most cases, but may be used to specify an encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
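The effect of the level argument can be illustrated with a toy hierarchy. The parent map below is a hand-written stand-in for the real coding hierarchy, and the helper name flatten_sketch is hypothetical. The sketch leaves a value unchanged whenever level does not index one of its ancestors, which is how it interprets the rule stated above.

```python
# Toy child -> parent map; an illustrative stand-in, not the real coding data
PARENTS = {
    'A009': 'A00',
    'A00': 'A00-A09',
    'A00-A09': 'Chapter I',
    'Chapter I': None,
}


def flatten_sketch(value, level=0):
    """Hypothetical helper: replace value with its ancestor at the given depth."""
    ancestors = []
    node = PARENTS.get(value)
    while node is not None:
        ancestors.append(node)
        node = PARENTS.get(node)
    # ancestors runs from nearest to most distant; reverse it so that
    # index 0 is the top-level parent
    ancestors.reverse()
    if level >= len(ancestors):
        return value  # not enough ancestors: leave the value unchanged
    return ancestors[level]


top = flatten_sketch('A009')              # top-level parent
second = flatten_sketch('A009', level=1)  # second-level parent
same = flatten_sketch('A009', level=5)    # deeper than the hierarchy
```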
- funpack.cleaning_functions.keepVisits(visit[, visit, ...])
Only load columns for vid which correspond to the specified visits.
This test is only applied to variables which have an instancing equal to 2:
https://biobank.ctsu.ox.ac.uk/crystal/instance.cgi?id=2
Variables with any other instancing are not associated with a specific visit, so it makes no sense to apply this test to them. See the loadtables.addNewVariable() function for more information on variable instancing.
- Parameters:
vartable – Variable table.
vid – Integer variable ID.
columns – List of all Column objects that are associated with vid.
tokeep – Visit IDs (integers), or 'first', or 'last', indicating that the first or last visit should be loaded.
- Returns:
List of columns that should be loaded.
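A minimal sketch of a pre-load function with this signature is shown below. The Column namedtuple is a simplified stand-in for funpack's Column class (its fields here are assumptions), the helper name keep_visits_sketch is hypothetical, and the instancing check against the variable table is deliberately omitted.

```python
from collections import namedtuple

# Simplified stand-in for funpack's Column objects (fields are assumptions)
Column = namedtuple('Column', ['name', 'vid', 'visit', 'instance'])


def keep_visits_sketch(vartable, vid, columns, *tokeep):
    """Hypothetical helper: return only the columns for the requested visits."""
    # vartable is accepted to mirror the documented signature, but the
    # instancing check that would use it is not part of this sketch.
    available = sorted({c.visit for c in columns})
    if not available:
        return columns
    wanted = set()
    for visit in tokeep:
        if visit == 'first':
            wanted.add(available[0])
        elif visit == 'last':
            wanted.add(available[-1])
        else:
            wanted.add(int(visit))
    return [c for c in columns if c.visit in wanted]


cols = [Column('1-0.0', 1, 0, 0),
        Column('1-1.0', 1, 1, 0),
        Column('1-2.0', 1, 2, 0)]
last_only = keep_visits_sketch(None, 1, cols, 'last')
first_and_last = keep_visits_sketch(None, 1, cols, 0, 'last')
```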
- funpack.cleaning_functions.makeNa(expression)
Replace values which pass the expression with NA.
All values in the columns of the variable for which expression evaluates to True are replaced with nan.
The expression must be of the form '<operator> <value>', where <operator> is one of ==, !=, <, <=, >, >=, or contains (only applicable to non-numeric variables). For example, makeNa('> 25') would cause all values greater than 25 to be replaced with nan.
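One way such an expression could be parsed and applied to a single pandas Series is sketched below. The helper name make_na_sketch and the parsing rules (split on the first whitespace, cast the value to float for numeric columns, plain substring match for contains) are assumptions of this example. Unlike the real cleaning functions, which work in place, the sketch returns a new Series.

```python
import operator

import pandas as pd

OPERATORS = {
    '==': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
}


def make_na_sketch(series, expression):
    """Hypothetical helper: replace values matching '<operator> <value>' with NaN."""
    op, value = expression.split(maxsplit=1)
    if op == 'contains':
        # Substring test for non-numeric data; missing entries never match
        mask = series.str.contains(value, regex=False, na=False)
    else:
        if pd.api.types.is_numeric_dtype(series):
            value = float(value)
        mask = OPERATORS[op](series, value)
    # Series.mask sets entries to NaN wherever the mask is True
    return series.mask(mask)


numeric = make_na_sketch(pd.Series([10, 30, 25, 40]), '> 25')
text = make_na_sketch(pd.Series(['abc', 'xyz', 'cab']), 'contains ab')
```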
- funpack.cleaning_functions.parseSpirometryData()
Parse spirometry (lung volume measurement) data.
Parses values which contain spirometry (lung volume measurement) test data.
Columns for UK Biobank data fields 3066 and 10697 contain comma-separated time series data. This function converts the text values into numpy arrays, and potentially re-orders instances of the variable within each visit - subjects potentially underwent a series of tests, with each test stored as a separate instance within one visit. However, the instance order does not necessarily correspond to the test order.
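The text-to-array conversion step might look roughly like the sketch below. It assumes a value made up purely of comma-separated numbers, which may not match the exact layout of fields 3066 and 10697, and it does not attempt the instance re-ordering. The helper name parse_series_sketch is hypothetical.

```python
import numpy as np


def parse_series_sketch(text):
    """Hypothetical helper: convert 'a,b,c' text into a float numpy array."""
    if not isinstance(text, str):
        # Missing values (e.g. NaN) are passed through untouched
        return text
    tokens = [tok.strip() for tok in text.split(',')]
    # Skip empty tokens, e.g. those produced by a trailing comma
    return np.array([float(tok) for tok in tokens if tok != ''])


parsed = parse_series_sketch('0,12,30.5,')
```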