funpack.cleaning_functions
This module contains definitions of cleaning functions - functions
which may be specified in the Clean column of the variable table,
and which are applied during the data import stage.
The following cleaning functions are available:
remove: Remove (do not load) all columns associated with vid.
keepVisits: Only load columns for vid which correspond to the specified visits.
fillVisits: Fill missing visits from available visits.
fillMissing: Fill missing values with a constant.
flattenHierarchical: Replace leaf values with parent values in hierarchical variables.
codeToNumeric: Convert hierarchical coding labels to numeric equivalents (node IDs).
parseSpirometryData: Parse spirometry (lung volume measurement) data.
makeNa: Replace values which pass the expression with NA.
All cleaning functions (with two exceptions - remove() and
keepVisits(), explained below) will be passed the following as their
first positional arguments, followed by any arguments specified in the
variable table:
The DataTable object, containing references to the data, variable, and processing tables
The integer ID of the variable to process
These functions are expected to perform in-place pre-processing on the data.
The remove() and keepVisits() functions are called before the data
is loaded, as they are used to determine which columns should be loaded
in. They therefore have a different function signature to the other cleaning
functions, which are called after the data has been loaded. These functions
are passed the following positional arguments, followed by arguments specified
in the variable table:
The variable table, containing metadata about all known variables
The integer ID of the variable to be checked
A list of Column objects denoting all columns associated with this variable that are present in the data
These functions are expected to return a modified list containing the columns that should be loaded.
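The two calling conventions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not funpack's own code: Column is a stand-in namedtuple with assumed fields, the data table is replaced by a plain dict, and post_load_cleaner and pre_load_cleaner are hypothetical names.

```python
from collections import namedtuple

# Stand-in for funpack's Column object (assumed fields, illustration only).
Column = namedtuple('Column', ['name', 'vid', 'visit', 'instance'])


def post_load_cleaner(dtable, vid, factor):
    """Post-load convention: (dtable, vid, *args), works in place.

    Here ``dtable`` is a plain dict {vid: list of values}, standing in
    for the real DataTable object.
    """
    values = dtable[vid]
    for i, val in enumerate(values):
        values[i] = val * factor      # modify in place, return nothing


def pre_load_cleaner(vartable, vid, columns, max_visit):
    """Pre-load convention: (vartable, vid, columns, *args), returns a list.

    ``vartable`` is unused in this toy example.
    """
    return [c for c in columns if c.visit <= max_visit]


data = {123: [1.0, 2.0, 3.0]}
post_load_cleaner(data, 123, 10)      # data[123] is now [10.0, 20.0, 30.0]

cols = [Column('123-0.0', 123, 0, 0), Column('123-1.0', 123, 1, 0)]
kept = pre_load_cleaner(None, 123, cols, 0)
```

The post-load function mutates the data and returns nothing, whereas the pre-load function never sees the data and only returns the columns that should be loaded.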
- funpack.cleaning_functions.codeToNumeric()
Convert hierarchical coding labels to numeric equivalents (node IDs).
Given a hierarchical variable which contains alpha-numeric codings, this function will replace the codings with numeric equivalents.
For example, UKB data field 41270 (ICD10 diagnoses) uses data coding 19, which uses ICD10 alpha-numeric codes to denote medical diagnoses.
For data field 41270, codeToNumeric would replace coding D730 with the corresponding node ID 22180 when using the 'node' conversion scheme, or 39730 when using the 'alpha' scheme.
The name option is unnecessary in most cases, but may be used to specify which encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
The scheme option determines the conversion method - it can be set to one of:
'node' (default): Convert alpha-numeric codes to the corresponding numeric node IDs, as specified in the UK Biobank showcase.
'alpha': Convert alpha-numeric codes to integer numbers, using an immutable scheme that will not change over time - this scheme uses the icd10.toNumeric() function.
Warning
Note that node IDs (which are used by the 'node' conversion scheme) are not present for non-leaf nodes in all hierarchical UK Biobank data fields (in which case this function will return nan). Also note that node IDs may change across UK Biobank showcase releases.
Warning
The 'alpha' conversion scheme is intended for use with ICD10 alpha-numeric codes (and similar coding schemes). This scheme may result in unrepresentable integer values if used with longer strings.
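The behaviour of the 'node' scheme, including the nan result for codes without a node ID, can be pictured as a simple lookup. This is a rough sketch only: the NODE_IDS table holds just the two example values quoted in this documentation, and to_node_id is a hypothetical helper rather than part of funpack.

```python
import math

# Toy excerpt of a coding -> node ID lookup. The two entries are the
# examples quoted in this documentation, not a complete table.
NODE_IDS = {'D730': 22180, 'A009': 2890}


def to_node_id(code, node_ids=NODE_IDS):
    """Return the node ID for ``code``, or nan if it has none."""
    return node_ids.get(code, math.nan)


converted = [to_node_id(c) for c in ['D730', 'A009', 'UNKNOWN']]
```

Because node IDs may change between showcase releases, a table like this would need to be regenerated for each release.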
- funpack.cleaning_functions.fillMissing(fill_value)
Fill missing values with a constant.
Fills all missing values in columns for the given variable with fill_value.
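As a rough picture of what this does to a single column, the following sketch replaces missing entries in a plain Python list. It assumes that missing values appear as nan or None; fill_missing is a hypothetical stand-in, not the funpack implementation, which operates on the loaded data table.

```python
import math


def fill_missing(values, fill_value):
    """Replace nan/None entries of ``values`` in place with ``fill_value``."""
    for i, val in enumerate(values):
        if val is None or (isinstance(val, float) and math.isnan(val)):
            values[i] = fill_value


column = [1.0, math.nan, 3.0, None]
fill_missing(column, -1)              # column is now [1.0, -1, 3.0, -1]
```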
- funpack.cleaning_functions.fillVisits([method=(mode|mean)])
Fill missing visits from available visits.
For a variable with multiple visits, fills missing values from the visits that do contain data. The method argument can be set to one of:
'mode' (the default) - fill missing visits with the most frequently occurring value in the available visits
'mean' - fill missing visits with the mean of the values in the available visits
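The two methods can be illustrated for a single subject as follows. This is a sketch under assumptions: it treats one subject's visits as a plain list with nan marking a missing visit, and fill_visits is a hypothetical helper. How funpack resolves ties between equally frequent values is not stated here, so the example avoids them.

```python
import math
import statistics


def fill_visits(visits, method='mode'):
    """Fill missing (nan) visits for one subject from the available visits.

    ``visits`` holds one value per visit. A new list is returned.
    """
    available = [v for v in visits if not math.isnan(v)]
    if not available:
        return list(visits)           # nothing to fill from
    if method == 'mode':
        fill = statistics.mode(available)
    elif method == 'mean':
        fill = statistics.mean(available)
    else:
        raise ValueError(f'unknown method: {method}')
    return [fill if math.isnan(v) else v for v in visits]


by_mode = fill_visits([2.0, math.nan, 2.0, 5.0])          # [2.0, 2.0, 2.0, 5.0]
by_mean = fill_visits([2.0, math.nan, 2.0, 5.0], 'mean')  # [2.0, 3.0, 2.0, 5.0]
```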
- funpack.cleaning_functions.flattenHierarchical([level][, name][, numeric][, convertNumeric])
Replace leaf values with parent values in hierarchical variables.
For hierarchical variables such as the ICD10 disease categorisations, this function replaces leaf values with a parent value.
The level argument allows the depth of the parent value to be selected - 0 (the default) replaces a value with the top-level (i.e. its most distant) parent value, 1 the second-level parent value, etc.
If, for a particular value, the level argument is greater than the number of ancestors of the value, the value is unchanged.
By default, flattenHierarchical expects to be given coding values. For example, for data field 41202 (ICD10 diagnoses), coding A009 (Cholera, unspecified) would be replaced with Chapter I (Certain infectious and parasitic diseases).
The numeric argument tells the function to expect numeric node IDs (as output by codeToNumeric()), rather than coding labels. This allows flattenHierarchical to be used in conjunction with codeToNumeric, e.g.:
    -cl 41202 "codeToNumeric,flattenHierarchical(numeric=True)"
Continuing the above example, the codeToNumeric function would replace A009 with its node ID 2890, and flattenHierarchical(numeric=True) would replace 2890 with 10 (the node ID for Chapter I).
The convertNumeric argument is a shortcut for the above scenario - using this implies numeric=True, and will cause the codeToNumeric routine to be called automatically. The following example is equivalent to the above:
    -cl 41202 "flattenHierarchical(convertNumeric=True)"
The name option is unnecessary in most cases, but may be used to specify an encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.
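The effect of the level argument can be sketched with a toy hierarchy. The PARENTS map below is illustrative only (its intermediate nodes are not taken from a real coding file), flatten is a hypothetical helper, and the sketch assumes that a level with no ancestor at that depth leaves the value unchanged.

```python
# Toy child -> parent map, loosely modelled on the ICD10 example above.
# The intermediate nodes are illustrative, not taken from a real coding file.
PARENTS = {
    'A009': 'A00',
    'A00': 'Block A00-A09',
    'Block A00-A09': 'Chapter I',
}


def flatten(value, level=0, parents=PARENTS):
    """Replace ``value`` with its ancestor at depth ``level``.

    Level 0 is the top-level (most distant) ancestor. If there is no
    ancestor at the requested depth, ``value`` is returned unchanged.
    """
    ancestors = []
    node = value
    while node in parents:            # walk up: closest ancestor first
        node = parents[node]
        ancestors.append(node)
    ancestors.reverse()               # now ordered top-level first
    if level >= len(ancestors):
        return value
    return ancestors[level]


top = flatten('A009')                 # 'Chapter I'
second = flatten('A009', level=1)     # 'Block A00-A09'
too_deep = flatten('A009', level=5)   # 'A009' (unchanged)
```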
- funpack.cleaning_functions.keepVisits(visit[, visit, ...])
Only load columns for vid which correspond to the specified visits.
This test is only applied to variables which have an instancing equal to 2:
https://biobank.ctsu.ox.ac.uk/crystal/instance.cgi?id=2
Variables with any other instancing are not associated with a specific visit, so it makes no sense to apply this test to them. See the loadtables.addNewVariable() function for more information on variable instancing.
- Parameters:
vartable – Variable table
vid – Integer variable ID.
columns – List of all Column objects that are associated with vid.
tokeep – Visit IDs (integers), or 'first', or 'last', indicating that the first or last visit should be loaded.
- Returns:
List of columns that should be loaded.
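The visit selection can be sketched as below. This is an illustration under assumptions, not funpack's code: Column is a stand-in namedtuple with assumed fields, keep_visits is a hypothetical helper that omits the vartable and vid arguments and the instancing check, and 'first' and 'last' are taken to mean the lowest and highest visit present among the given columns.

```python
from collections import namedtuple

# Stand-in for funpack's Column object (assumed fields, illustration only).
Column = namedtuple('Column', ['name', 'vid', 'visit', 'instance'])


def keep_visits(columns, *tokeep):
    """Return the columns whose visit matches one of ``tokeep``.

    Each entry of ``tokeep`` is an integer visit ID, 'first' or 'last'.
    """
    if not columns:
        return []
    visits = sorted({c.visit for c in columns})
    wanted = set()
    for v in tokeep:
        if v == 'first':
            wanted.add(visits[0])
        elif v == 'last':
            wanted.add(visits[-1])
        else:
            wanted.add(int(v))
    return [c for c in columns if c.visit in wanted]


cols = [Column(f'50-{visit}.0', 50, visit, 0) for visit in (0, 1, 2)]
first_and_last = keep_visits(cols, 'first', 2)   # visits 0 and 2
```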
- funpack.cleaning_functions.makeNa(expression)
Replace values which pass the expression with NA.
All values in the columns of the variable for which expression evaluates to True are replaced with nan.
The expression must be of the form '<operator> <value>', where <operator> is one of ==, !=, <, <=, >, >=, or contains (only applicable to non-numeric variables). For example, makeNa('> 25') would cause all values greater than 25 to be replaced with nan.
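One way to picture how an '<operator> <value>' expression is applied is the sketch below. It is not funpack's parser: parse_expression and make_na are hypothetical helpers that work on plain lists, and the handling of values (converted to float where possible, otherwise kept as text) is an assumption.

```python
import math
import operator

OPERATORS = {
    '==': operator.eq, '!=': operator.ne,
    '<': operator.lt, '<=': operator.le,
    '>': operator.gt, '>=': operator.ge,
    'contains': lambda value, sub: sub in value,
}


def parse_expression(expression):
    """Split '<operator> <value>' into a (function, value) pair."""
    op, value = expression.split(None, 1)
    if op != 'contains':
        try:
            value = float(value)      # numeric comparison where possible
        except ValueError:
            pass                      # otherwise compare as text
    return OPERATORS[op], value


def make_na(values, expression):
    """Return a copy of ``values`` with matching entries replaced by nan."""
    func, operand = parse_expression(expression)
    return [math.nan if func(v, operand) else v for v in values]


numeric = make_na([10, 30, 25], '> 25')          # [10, nan, 25]
text = make_na(['abc', 'xyz'], 'contains b')     # [nan, 'xyz']
```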
- funpack.cleaning_functions.parseSpirometryData()
Parse spirometry (lung volume measurement) data.
Parses values which contain spirometry (lung volume measurement) test data.
Columns for UK Biobank data fields 3066 and 10697 contain comma-separated time series data. This function converts the text values into numpy arrays, and potentially re-orders instances of the variable within each visit - subjects potentially underwent a series of tests, with each test stored as a separate instance within one visit. However, the instance order does not necessarily correspond to the test order.