funpack.cleaning_functions

This module contains definitions of cleaning functions - functions which may be specified in the Clean column of the variable table, and which are applied during the data import stage.

The following cleaning functions are available:

remove

Remove (do not load) all columns associated with vid.

keepVisits

Only load columns for vid which correspond to the specified visits.

fillVisits

Fill missing visits from available visits.

fillMissing

Fill missing values with a constant.

flattenHierarchical

Replace leaf values with parent values in hierarchical variables.

codeToNumeric

Convert hierarchical coding labels to numeric equivalents (node IDs).

parseSpirometryData

Parse spirometry (lung volume measurement) data.

makeNa

Replace values which pass the expression with NA.

All cleaning functions (with two exceptions - remove() and keepVisits(), explained below) will be passed the following as their first positional arguments, followed by any arguments specified in the variable table:

  • The DataTable object, containing references to the data, the variable table, and the processing table.

  • The integer ID of the variable to process.

These functions are expected to perform in-place pre-processing on the data.

The remove() and keepVisits() functions are called before the data is loaded, as they are used to determine which columns should be loaded in. They therefore have a different function signature to the other cleaning functions, which are called after the data has been loaded. These functions are passed the following positional arguments, followed by arguments specified in the variable table:

  • The variable table, containing metadata about all known variables

  • The integer ID of the variable to be checked

  • A list of Column objects denoting all columns associated with this variable that are present in the data

These functions are expected to return a modified list containing the columns that should be loaded.

funpack.cleaning_functions.codeToNumeric()[source]

Convert hierarchical coding labels to numeric equivalents (node IDs).

Given a hierarchical variable which contains alpha-numeric codings, this function will replace the codings with numeric equivalents.

For example, UKB data field 41270 (ICD10 diagnoses) uses data coding 19, which uses ICD10 alpha-numeric codes to denote medical diagnoses.

For data field 41270, codeToNumeric would replace coding D730 with the corresponding node ID 22180 when using the 'node' conversion scheme, or 39730 when using the 'alpha' scheme.

The name option is unnecessary in most cases, but may be used to specify which encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.

The scheme option determines the conversion method - it can be set to one of:

  • 'node' (default): Convert alpha-numeric codes to the corresponding numeric node IDs, as specified in the UK Biobank showcase.

  • 'alpha': Convert alpha-numeric codes to integer numbers, using an immutable scheme that will not change over time - this scheme uses the icd10.toNumeric() function.

Warning

Note that node IDs (which are used by the 'node' conversion scheme) are not present for non-leaf nodes in all hierarchical UK Biobank data fields (in which case this function will return nan). Also note that node IDs may change across UK Biobank showcase releases.

Warning

The 'alpha' conversion scheme is intended for use with ICD10 alpha-numeric codes (and similar coding schemes). This scheme may result in unrepresentable integer values if used with longer strings.

funpack.cleaning_functions.fillMissing(fill_value)[source]

Fill missing values with a constant.

Fills all missing values in columns for the given variable with fill_value.
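The effect can be illustrated with a minimal sketch, assuming the columns of one variable are held in a pandas DataFrame. The helper fill_missing and the column names are hypothetical and are not funpack's implementation:

```python
import numpy as np
import pandas as pd

def fill_missing(df, columns, fill_value):
    """Replace missing values in the given columns with fill_value, in place."""
    for col in columns:
        df[col] = df[col].fillna(fill_value)

# Hypothetical columns for one variable across two visits
df = pd.DataFrame({'123-0.0': [1.0, np.nan, 3.0],
                   '123-1.0': [np.nan, 5.0, np.nan]})
fill_missing(df, ['123-0.0', '123-1.0'], 0)
```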

funpack.cleaning_functions.fillVisits([method=(mode|mean)])[source]

Fill missing visits from available visits.

For a variable with multiple visits, fills missing values from the visits that do contain data. The method argument can be set to one of:

  • 'mode' (the default) - fill missing visits with the most frequently occurring value in the available visits

  • 'mean' - fill missing visits with the mean of the values in the available visits
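The two methods can be sketched as follows, assuming each visit of a variable is one column of a pandas DataFrame and each row is one subject. The function fill_visits, the column names, and the tie-breaking rule (the smallest of several equally frequent values is used) are assumptions made for illustration, not funpack's implementation:

```python
import numpy as np
import pandas as pd

def _row_mode(row):
    """Most frequent non-missing value in a row (smallest value on ties)."""
    modes = row.dropna().mode()
    return modes.iloc[0] if len(modes) else np.nan

def fill_visits(df, visit_columns, method='mode'):
    """Fill missing entries in each row from that row's available visits."""
    visits = df[visit_columns]
    if method == 'mean':
        fill = visits.mean(axis=1)             # row-wise mean, NaN ignored
    elif method == 'mode':
        fill = visits.apply(_row_mode, axis=1)
    else:
        raise ValueError(f'unknown method: {method}')
    for col in visit_columns:
        # fillna aligns the per-row fill values on the index
        df[col] = df[col].fillna(fill)
    return df

# Hypothetical data: three visits, three subjects
df = pd.DataFrame({'v0': [1.0, 2.0, np.nan],
                   'v1': [1.0, np.nan, 4.0],
                   'v2': [np.nan, 2.0, 6.0]})
by_mode = fill_visits(df.copy(), ['v0', 'v1', 'v2'], method='mode')
by_mean = fill_visits(df.copy(), ['v0', 'v1', 'v2'], method='mean')
```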

funpack.cleaning_functions.flattenHierarchical([level][, name][, numeric][, convertNumeric])[source]

Replace leaf values with parent values in hierarchical variables.

For hierarchical variables such as the ICD10 disease categorisations, this function replaces leaf values with a parent value.

The level argument allows the depth of the parent value to be selected - 0 (the default) replaces a value with the top-level (i.e. its most distant) parent value, 1 the second-level parent value, etc.

If, for a particular value, the level argument is greater than the number of ancestors of the value, the value is unchanged.

By default, flattenHierarchical expects to be given coding values. For example, for data field 41202 (ICD10 diagnoses), coding A009 (Cholera, unspecified) would be replaced with Chapter I (Certain infectious and parasitic diseases).
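The level semantics described above can be sketched with a toy hierarchy. The PARENTS map, the intermediate labels, and the helper names are hypothetical (funpack reads the real hierarchy from its own data files), and the sketch assumes that a level equal to or greater than the number of ancestors leaves the value unchanged:

```python
# Toy child -> parent map for a small fragment of ICD10 (illustrative only)
PARENTS = {
    'A009': 'A00',
    'A00': 'A00-A09',
    'A00-A09': 'Chapter I',
}

def ancestors(code):
    """Return the ancestors of code, ordered from top-level to closest."""
    chain = []
    while code in PARENTS:
        code = PARENTS[code]
        chain.append(code)
    return chain[::-1]

def flatten(code, level=0):
    """Replace code with its ancestor at the given depth (0 = top-level)."""
    chain = ancestors(code)
    # Assumption: no ancestor at this depth, so the value is left unchanged
    if level >= len(chain):
        return code
    return chain[level]

top = flatten('A009')               # top-level parent
second = flatten('A009', level=1)   # second-level parent
same = flatten('A009', level=5)     # deeper than the hierarchy
```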

The numeric argument tells the function to expect numeric node IDs (as output by codeToNumeric()), rather than coding labels. This allows flattenHierarchical to be used in conjunction with codeToNumeric, e.g.:

-cl 41202 "codeToNumeric,flattenHierarchical(numeric=True)"

Continuing the above example, the codeToNumeric function would replace A009 with its node ID 2890, and flattenHierarchical(numeric=True) would replace 2890 with 10 (the node ID for Chapter I).

The convertNumeric argument is a shortcut for the above scenario - using this implies numeric=True, and will cause the codeToNumeric routine to be called automatically. The following example is equivalent to the above:

-cl 41202 "flattenHierarchical(convertNumeric=True)"

The name option is unnecessary in most cases, but may be used to specify an encoding scheme to use. Currently, 'icd9', 'icd10', 'opsc3' and 'opsc4' are accepted.

funpack.cleaning_functions.keepVisits(visit[, visit, ...])[source]

Only load columns for vid which correspond to the specified visits.

This test is only applied to variables which have an instancing equal to 2:

https://biobank.ctsu.ox.ac.uk/crystal/instance.cgi?id=2

Variables with any other instancing are not associated with a specific visit, so it makes no sense to apply this test to them. See the loadtables.addNewVariable() function for more information on variable instancing.

Parameters:
  • vartable – Variable table

  • vid – Integer variable ID.

  • columns – List of all Column objects that are associated with vid.

  • tokeep – Visit IDs (integers), or 'first', or 'last', indicating that the first or last visit should be loaded.

Returns:

List of columns that should be loaded.
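A minimal sketch of this selection logic is given below. The Column namedtuple is a hypothetical stand-in that models only the attributes needed here, the instancing check is omitted, and keep_visits is not funpack's implementation:

```python
from collections import namedtuple

# Hypothetical stand-in for funpack's Column objects
Column = namedtuple('Column', ['name', 'vid', 'visit'])

def keep_visits(vartable, vid, columns, *tokeep):
    """Return the columns of vid whose visit is listed in tokeep."""
    visits = sorted({c.visit for c in columns})
    if not visits:
        return []
    wanted = set()
    for v in tokeep:
        if v == 'first':
            wanted.add(visits[0])     # earliest visit present in the data
        elif v == 'last':
            wanted.add(visits[-1])    # latest visit present in the data
        else:
            wanted.add(int(v))
    return [c for c in columns if c.visit in wanted]

columns = [Column('123-0.0', 123, 0),
           Column('123-1.0', 123, 1),
           Column('123-2.0', 123, 2)]
# The variable table is not needed by this sketch, so None is passed
kept = keep_visits(None, 123, columns, 'first', 2)
```

Note that 'first' and 'last' are resolved against the visits actually present in the column list, which is an assumption of this sketch.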

funpack.cleaning_functions.makeNa(expression)[source]

Replace values which pass the expression with NA.

All values in the columns of the variable for which the expression evaluates to True are replaced with nan.

The expression must be of the form '<operator> <value>', where <operator> is one of ==, !=, <, <=, >, >=, or contains (only applicable to non-numeric variables). For example, makeNa('> 25') would cause all values greater than 25 to be replaced with nan.
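The expression format can be sketched as follows, assuming the values are held in a pandas Series. The helper make_na is hypothetical and simplified: comparison operators are applied to numeric values only, and contains is applied as a plain substring match. It is not funpack's implementation:

```python
import operator
import pandas as pd

OPERATORS = {
    '==': operator.eq, '!=': operator.ne,
    '<':  operator.lt, '<=': operator.le,
    '>':  operator.gt, '>=': operator.ge,
}

def make_na(series, expression):
    """Return a copy of series with values matching expression set to NaN."""
    # Split '<operator> <value>' on the first run of whitespace
    op, value = expression.split(None, 1)
    if op == 'contains':
        # Substring match, intended for non-numeric (string) data
        mask = series.str.contains(value, regex=False, na=False)
    elif op in OPERATORS:
        mask = OPERATORS[op](series, float(value))
    else:
        raise ValueError(f'unknown operator: {op}')
    # Series.mask replaces entries where the mask is True with NaN
    return series.mask(mask)

numbers = make_na(pd.Series([10, 25, 30, 40]), '> 25')
labels = make_na(pd.Series(['abc', 'xyz']), 'contains b')
```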

funpack.cleaning_functions.parseSpirometryData()[source]

Parse spirometry (lung volume measurement) data.

Parses values which contain spirometry (lung volume measurement) test data.

Columns for UK Biobank data fields 3066 and 10697 contain comma-separated time series data. This function converts the text values into numpy arrays. It may also re-order instances of the variable within each visit: a subject may have undergone a series of tests, with each test stored as a separate instance within one visit, but the instance order does not necessarily correspond to the test order.
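The text-to-array conversion step can be sketched as below. The helper parse_series is hypothetical, treats every comma-separated token as a sample value (an assumption about the layout of the text), and does not attempt the re-ordering of instances:

```python
import numpy as np

def parse_series(text):
    """Convert a comma-separated string into a 1D float array.

    Returns None for missing or empty input.
    """
    if text is None or text.strip() == '':
        return None
    # Skip empty tokens, e.g. those produced by a trailing comma
    return np.array([float(tok) for tok in text.split(',') if tok.strip()])

# Hypothetical raw values for three subjects
raw = ['0,12,31,48', None, '5,9,']
parsed = [parse_series(value) for value in raw]
```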

funpack.cleaning_functions.remove()[source]

Remove (do not load) all columns associated with vid.

Parameters:
  • vartable – Variable table

  • vid – Integer variable ID.

  • columns – List of all Column objects associated with vid.

Returns:

An empty list.