funpack.processing_functions
This module contains definitions of processing functions - functions which may be specified in the processing table.
A processing function may perform any sort of processing on one or more
variables. A processing function may add, remove, or manipulate the columns of
the DataTable.
All processing functions must accept the following as their first two positional arguments:
The DataTable object, containing references to the data, variable, and processing table.
A list of integer IDs of the variables to process.
Furthermore, all processing functions must return one of the following:
None, indicating that no columns are to be added or removed.
A list (must be a list) of Column objects describing the columns that should be removed from the data.
A tuple (must be a tuple) of length 2, containing:
    A list of pandas.Series that should be added to the data.
    A list of variable IDs to use for each new Series. This list must have the same length as the list of new Series, but if they are not associated with any specific variable, None may be used.
A tuple of length 3, containing:
    List of columns to be removed
    List of Series to be added
    List of variable IDs for each new Series.
A tuple of length 4, containing the above, and:
    List of dicts associated with each of the new Series. These will be passed as keyword arguments to the Column objects that represent each of the new Series.
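The return conventions above can be sketched as a small dispatcher that unpacks each permitted form into four lists. This is an illustrative sketch only: normalise_result is a hypothetical helper, not part of funpack, and plain Python objects stand in for the Column and pandas.Series objects.

```python
def normalise_result(result):
    """Unpack a processing function's return value into four lists:
    (columns to remove, series to add, variable IDs, column kwargs).

    Hypothetical helper for illustration only, not part of funpack.
    """
    if result is None:                 # nothing to add or remove
        return [], [], [], []
    if isinstance(result, list):       # a list: columns to remove
        return list(result), [], [], []
    if not isinstance(result, tuple):
        raise TypeError('expected None, a list, or a tuple')

    if len(result) == 2:               # (series, vids)
        remove, kwargs = [], None
        add, vids = result
    elif len(result) == 3:             # (remove, series, vids)
        remove, add, vids = result
        kwargs = None
    elif len(result) == 4:             # (remove, series, vids, kwargs)
        remove, add, vids, kwargs = result
    else:
        raise ValueError('tuple must have length 2, 3, or 4')

    if len(add) != len(vids):
        raise ValueError('need one variable ID (or None) per new series')
    if kwargs is None:
        kwargs = [{} for _ in add]     # no extra Column keyword arguments
    return list(remove), list(add), list(vids), list(kwargs)
```

Because a list and a tuple mean different things here, a function that wants to remove columns must return a list, and one that wants to add columns must return a tuple.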
The following processing functions are defined:
removeIfSparse            Removes columns deemed to be sparse.
removeIfRedundant         Removes columns deemed to be redundant.
binariseCategorical       Replace a categorical column with one binary column per category.
expandCompound            Expand a compound column into a set of columns, one for each value.
createDiagnosisColumns    Create binary columns for (e.g.) ICD10 codes, denoting them as either primary or secondary.
removeField               Remove all columns associated with one or more datafields/variables.
- funpack.processing_functions.binariseCategorical([acrossVisits][, acrossInstances][, minpres][, nameFormat][, replace][, take][, fillval][, replaceTake])[source]
Replace a categorical column with one binary column per category.
Binarises categorical variables - replaces their columns with one new column for each unique value, containing 1 for subjects with that value, and 0 otherwise. This procedure is applied independently to all variables that are specified.
The acrossVisits option controls whether the binarisation is applied across visits for each variable. It defaults to False, meaning that the binarisation is applied separately to the columns within each visit. If set to True, the binarisation will be applied to the columns for all visits. Similarly, the acrossInstances option controls whether the binarisation is applied across instances. This defaults to True, which is usually desirable - for example, data field 41202 contains multiple ICD10 diagnoses, separated across different instances.
If the minpres option is specified, it is used as a threshold - categorical values with fewer than this many occurrences will not be added as columns.
The nameFormat argument controls how the new data columns should be named - it must be a format string using named replacement fields 'vid', 'visit', 'instance', and 'value'. The 'visit' and 'instance' fields may or may not be necessary, depending on the value of the acrossVisits and acrossInstances arguments.
The default value for the nameFormat string is as follows:
acrossVisits    acrossInstances    nameFormat
False           False              '{vid}-{visit}.{instance}_{value}'
False           True               '{vid}-{visit}.{value}'
True            False              '{vid}-{value}.{instance}'
True            True               '{vid}-{value}'
The replace option controls whether the original un-binarised columns should be removed - it defaults to True, which will cause them to be removed.
By default, the new binary columns (one for each unique value in the input columns) will contain a 1 indicating that the value is present, or a 0 indicating its absence. As an alternative to this, the take option can be used to specify another variable from which to take values when populating the output columns. take may be set to a variable ID, or sequence of variable IDs (one for each of the input variables) to take values from. If provided, the generated columns will have values from the column(s) of this variable, instead of containing binary 0/1 values.
A take variable must have columns that match the columns of the corresponding variable (by both visits and instances).
If take is being used, the fillval option can be used to specify the value to use for False/0 rows. It defaults to np.nan.
The replaceTake option is similar to replace - it controls whether the columns associated with the take variables are removed (True - the default), or retained.
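As a rough illustration of the binarisation and naming rules, the sketch below handles the default case (acrossVisits=False, acrossInstances=True) using plain Python lists rather than pandas. The binarise helper and its data layout are assumptions made for this example, not funpack's implementation, and minpres is assumed here to count the rows in which a value occurs.

```python
# Default name formats from the table above, keyed by
# (acrossVisits, acrossInstances).
DEFAULT_NAME_FORMATS = {
    (False, False): '{vid}-{visit}.{instance}_{value}',
    (False, True):  '{vid}-{visit}.{value}',
    (True,  False): '{vid}-{value}.{instance}',
    (True,  True):  '{vid}-{value}',
}


def binarise(columns, vid, visit, minpres=None):
    """Binarise the columns of one variable within one visit, across
    all of its instances.

    columns: equal-length lists, one per instance; None marks missing.
    Returns a dict mapping each new column name to a list of 0/1 values.
    Hypothetical helper, not funpack's implementation.
    """
    fmt = DEFAULT_NAME_FORMATS[(False, True)]
    nrows = len(columns[0])
    values = sorted({v for col in columns for v in col if v is not None})
    result = {}
    for value in values:
        # 1 where any instance in this row holds the value, 0 otherwise
        binary = [int(any(col[i] == value for col in columns))
                  for i in range(nrows)]
        if minpres is not None and sum(binary) < minpres:
            continue  # too few occurrences, no column for this value
        result[fmt.format(vid=vid, visit=visit, value=value)] = binary
    return result


# Two instances of variable 41202 at visit 0, three subjects.
cols = [['A', 'B', None],
        ['B', None, None]]
print(binarise(cols, 41202, 0))
# {'41202-0.A': [1, 0, 0], '41202-0.B': [1, 1, 0]}
```

Passing minpres=2 to the same call would drop the '41202-0.A' column, since the value 'A' occurs in only one row.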
- funpack.processing_functions.createDiagnosisColumns(primvid, secvid)[source]
Create binary columns for (e.g.) ICD10 codes, denoting them as either primary or secondary.
This function is intended for use with data fields containing ICD9, ICD10, OPCS3, and OPCS4 diagnosis codes. The UK Biobank publishes these diagnosis/operative procedure codes twice:
The codes are published in a “main” data field containing all codes
The codes are published again in two other data fields, containing separate “primary” and “secondary” codes.
Code     Main data field    Primary data field    Secondary data field
ICD10    41270              41202                 41204
ICD9     41271              41203                 41205
OPCS4    41272              41200                 41210
OPCS3    41273              41256                 41258
For example, this function may be applied to the ICD10 diagnosis codes like so:
41270 createDiagnosisColumns(41202, 41204)
When applied to one of the main data fields (e.g. 41270 - ICD10 diagnoses), this function will create two new columns for every unique ICD10 diagnosis code:
the first column contains a 1 if the code corresponds to a primary diagnosis (i.e. is also in 41202).
the second column contains a 1 if the code corresponds to a secondary diagnosis (i.e. is also in 41204).
The replace option defaults to True - this causes the primary and secondary code columns to be removed from the data set.
The binarised option defaults to False, which causes this function to expect the input columns to be in their raw format, as described in the UKB showcase (e.g. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270). The binarised option can be set to True, which will allow this function to be applied to data fields which have been passed through the binariseCategorical() function.
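The primary/secondary split can be pictured with a small pure-Python sketch. It assumes each subject's codes are available as a set per data field; diagnosis_columns is a hypothetical helper, and the real function's column layout and naming are not covered here.

```python
def diagnosis_columns(main, primary, secondary):
    """For every unique code in the main field, build a pair of 0/1
    lists (one entry per subject) flagging whether the subject has
    that code as a primary or as a secondary diagnosis.

    main, primary, secondary: one set of codes per subject.
    Hypothetical helper, not funpack's implementation.
    """
    codes = sorted({code for subject in main for code in subject})
    result = {}
    for code in codes:
        prim = [int(code in m and code in p)
                for m, p in zip(main, primary)]
        sec = [int(code in m and code in s)
               for m, s in zip(main, secondary)]
        result[code] = (prim, sec)
    return result


# Three subjects; e.g. fields 41270 (main), 41202 and 41204.
main      = [{'I10', 'E11'}, {'I10'}, set()]
primary   = [{'I10'},        set(),   set()]
secondary = [{'E11'},        {'I10'}, set()]
print(diagnosis_columns(main, primary, secondary))
# {'E11': ([0, 0, 0], [1, 0, 0]), 'I10': ([1, 0, 0], [0, 1, 0])}
```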
- funpack.processing_functions.expandCompound([nameFormat][, replace])[source]
Expand a compound column into a set of columns, one for each value.
Expands compound variables into a set of columns, one for each value. Rows with different numbers of values are padded with np.nan.
This procedure is applied independently to each column of each specified variable.
The nameFormat option can be used to control how the new columns should be named - it must be a format string using named replacement fields 'vid', 'visit', 'instance', and 'index'. The default value for nameFormat is '{vid}-{visit}.{instance}_{index}'.
The replace option controls whether the original columns are removed (True - the default), or retained.
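A minimal sketch of the expansion and padding, using plain Python lists in place of pandas columns. The expand_compound helper is hypothetical, and numbering the 'index' field from 0 is an assumption of this example.

```python
def expand_compound(column, vid, visit, instance,
                    name_format='{vid}-{visit}.{instance}_{index}'):
    """Expand one compound column into one column per value.

    column: list of rows, each row a list of values.
    Returns a dict mapping new column names to lists of values, with
    short rows padded with NaN.
    Hypothetical helper, not funpack's implementation.
    """
    nan = float('nan')
    width = max((len(row) for row in column), default=0)
    expanded = {}
    for index in range(width):      # assumption: indices start at 0
        name = name_format.format(vid=vid, visit=visit,
                                  instance=instance, index=index)
        expanded[name] = [row[index] if index < len(row) else nan
                          for row in column]
    return expanded


# Two subjects, with three values and one value respectively.
print(expand_compound([[1.0, 2.0, 3.0], [4.0]], 20000, 0, 0))
# {'20000-0.0_0': [1.0, 4.0], '20000-0.0_1': [2.0, nan],
#  '20000-0.0_2': [3.0, nan]}
```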
- funpack.processing_functions.removeField()[source]
Remove all columns associated with one or more datafields/variables.
This function can be used to simply remove columns from the data set. This can be useful if a variable is required for some processing step, but is not needed in the final output file.
- funpack.processing_functions.removeIfRedundant(corrthres[, nathres])[source]
Removes columns deemed to be redundant.
Removes columns from the specified group of variables if they are found to be redundant with any other columns in the group.
Redundancy is determined by calculating the correlation between all pairs of columns - columns with an absolute correlation greater than corrthres are identified as redundant.
The test can optionally take the patterns of missing values into account - if nathres is provided, the missingness correlation is also calculated between all column pairs. Columns must have absolute correlation greater than corrthres and absolute missingness correlation greater than nathres to be identified as redundant.
The skipUnknowns option defaults to False. If it is set to True, columns which are deemed to be redundant with respect to an unknown or uncategorised column are not dropped.
The precision option can be set to either 'double' (the default) or 'single' - this controls whether 32 bit (single) or 64 bit (double) precision floating point is used for the correlation calculation. Double precision is recommended, as the correlation calculation algorithm can be unstable for data with large values (>10e5).
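The pairwise test can be sketched in pure Python as below. This is only an illustration under several assumptions: redundant_pairs and pearson are hypothetical helpers, missing values are represented by None, the correlation is computed over rows where both columns are present, and an undefined missingness correlation is treated as "not redundant". The sketch reports redundant pairs and does not decide which column of a pair would be dropped.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float('nan')         # undefined for constant data
    return sxy / math.sqrt(sxx * syy)


def redundant_pairs(columns, corrthres, nathres=None):
    """Return the pairs of column names judged redundant.

    columns: dict mapping names to lists of floats (None = missing).
    Hypothetical helper, not funpack's implementation.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        xa, xb = columns[a], columns[b]
        # Correlate only the rows where both columns have a value.
        both = [(p, q) for p, q in zip(xa, xb)
                if p is not None and q is not None]
        if len(both) < 2:
            continue
        if not abs(pearson(*zip(*both))) > corrthres:
            continue
        if nathres is not None:
            # Correlation between the two missingness patterns.
            nacorr = pearson([p is None for p in xa],
                             [q is None for q in xb])
            if not abs(nacorr) > nathres:
                continue
        pairs.append((a, b))
    return pairs


cols = {'a': [1.0, 2.0, 3.0, 4.0],
        'b': [2.0, 4.0, 6.0, 8.0],
        'c': [1.0, -1.0, 1.0, -1.0]}
print(redundant_pairs(cols, corrthres=0.9))   # [('a', 'b')]
```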
- funpack.processing_functions.removeIfSparse([minpres][, minstd][, mincat][, maxcat][, abspres][, abscat][, naval])[source]
Removes columns deemed to be sparse.
Removes columns for all specified variables if they fail a sparsity test. The test is based on the following criteria:
The number/proportion of non-NA values must be greater than or equal to minpres.
The standard deviation of the data must be greater than minstd.
For integer and categorical types, the number/proportion of the largest category must be greater than mincat.
For integer and categorical types, the number/proportion of the largest category must be less than maxcat.
If any of these criteria are not met, the data is considered to be sparse. Each criterion can be disabled by passing in None for the relevant parameter.
The minstd test is only performed on numeric columns, and the mincat/maxcat tests are only performed on integer/categorical columns.
If abspres=True (the default), minpres is interpreted as an absolute count. If abspres=False, minpres is interpreted as a proportion.
Similarly, if abscat=True (the default), mincat and maxcat are interpreted as absolute counts. Otherwise mincat and maxcat are interpreted as proportions.
The naval argument can be used to customise the value to consider as “missing” - it defaults to np.nan.
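The four criteria can be sketched as a single pure-Python test. Several details are assumptions of this example rather than documented behaviour: is_sparse is a hypothetical helper, None stands in for the missing value, the numeric and categorical flags replace funpack's own type lookup, the population standard deviation is used, and category proportions are taken over the non-missing values.

```python
import statistics
from collections import Counter


def is_sparse(values, numeric=True, categorical=False,
              minpres=None, minstd=None, mincat=None, maxcat=None,
              abspres=True, abscat=True):
    """Return True if a column fails any enabled sparsity criterion.

    values: list of values, with None marking missing entries.
    Hypothetical helper, not funpack's implementation.
    """
    present = [v for v in values if v is not None]

    # Number/proportion of non-missing values must be >= minpres.
    if minpres is not None:
        pres = len(present)
        if not abspres:
            pres = pres / max(len(values), 1)
        if pres < minpres:
            return True

    # Standard deviation must be > minstd (numeric columns only).
    if numeric and minstd is not None and present:
        if statistics.pstdev(present) <= minstd:
            return True

    # Size of the largest category must be > mincat and < maxcat
    # (integer/categorical columns only).
    if categorical and present and (mincat is not None
                                    or maxcat is not None):
        largest = Counter(present).most_common(1)[0][1]
        if not abscat:
            largest = largest / len(present)
        if mincat is not None and largest <= mincat:
            return True
        if maxcat is not None and largest >= maxcat:
            return True

    return False


vals = [1, 1, 1, 1, None, None, 2]
print(is_sparse(vals, minpres=6))   # True: only 5 values are present
print(is_sparse(vals, minpres=5))   # False
```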