funpack.importing.reindex
This module contains functions used by the core.importData()
function for re-arranging the input data as it is loaded.
- funpack.importing.reindex.expandCompressedColumn(series: Series, col: Column, expanded: list[Column], delimiter: str = '|') → DataFrame
Expands values in a “compressed” column to multiple columns.
This function is used to take multiple values stored in a single “compressed” column, and copy them to multiple columns, so that each column contains one value.
Some UKB data files store multi-valued data fields in a single column, with values separated by pipe (|) characters. For example, datafield 6138 is conventionally stored across multiple columns, e.g.:

    eid,6138-0.0,6138-0.1,6138-0.2
    1000,1,,
    1001,3,6,
    1002,1,2,3
    1003,6,,
    1004,3,4,5

But may also be stored in a single column, e.g.:

    eid,6138-0.0
    1000,1
    1001,3|6
    1002,1|2|3
    1003,6
    1004,3|4|5

This function is used in conjunction with identifyCompressedColumns() to convert data from the latter format to the former.
- Parameters:
series – pandas.Series containing the compressed column.
col – datatable.Column object describing the compressed column.
expanded – List of datatable.Column objects created by identifyCompressedColumns(), describing the expanded columns.
delimiter – String used to separate values in the compressed data.
- Returns:
pandas.DataFrame containing the expanded columns.
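The expansion described above can be illustrated with plain pandas. This is only a minimal sketch of the idea, not funpack's implementation: the helper name expand_pipe_column is hypothetical, it takes plain column names rather than Column objects, and it leaves every value as a string.

```python
import pandas as pd


def expand_pipe_column(series, new_names, delimiter='|'):
    """Split a delimited column into one column per value.

    Hypothetical helper for illustration only. It accepts plain column
    names, whereas funpack works with Column objects.
    """
    rows = []
    for value in series:
        # Missing cells produce a row of empty values
        parts = [] if pd.isna(value) else str(value).split(delimiter)
        # Pad short rows (and truncate long ones) to the target width
        parts = (parts + [None] * len(new_names))[:len(new_names)]
        rows.append(parts)
    return pd.DataFrame(rows, index=series.index, columns=new_names)


# The single-column form of datafield 6138 from the example above
compressed = pd.Series(
    ['1', '3|6', '1|2|3', '6', '3|4|5'],
    index=pd.Index([1000, 1001, 1002, 1003, 1004], name='eid'),
    name='6138-0.0')

expanded = expand_pipe_column(
    compressed, ['6138-0.0', '6138-0.1', '6138-0.2'])
print(expanded)
```

Rows with fewer values than target columns are padded with missing values, which matches the empty cells in the multi-column CSV form.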
- funpack.importing.reindex.genReindexedColumns(cols, vartable, onlyRealVisits=True)
Figures out how to re-arrange the columns of a data set so that it is indexed by row ID and visit. This function is called by loadFile() when the indexVisits option is used.
The onlyRealVisits argument controls whether the re-indexing is only applied to variables which are labelled with instancing 2 (Biobank assessment centre visit), or is solely based on the instance component of the column name (see the parseColumnName() function for details).
- Parameters:
cols – List of Column objects describing the existing data.
vartable – pandas.DataFrame containing the variable table.
onlyRealVisits – If True (the default), only columns associated with variables that follow instancing 2 are re-indexed. If False, all columns are re-indexed according to the instance component of their column names.
- Returns:
None if the file does not contain any data with multiple visits, or a tuple containing:
  - List of Column objects describing the re-indexed data set.
  - Dict of {old Column : new Column} mappings, describing how the old data maps to the new data.
  - A set containing all visit codes in the data.
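To make the shape of the three return values concrete, here is a rough sketch that works on column names alone. It assumes names of the form field-instance.array (as in 6138-0.1), with the instance component taken as the visit code. The function plan_reindex and the field.array naming of the new columns are hypothetical, and the sketch ignores the variable table and the onlyRealVisits check entirely.

```python
import re

# Assumed column name layout: <field>-<instance>.<array>, e.g. "6138-0.1"
NAME_PATTERN = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def plan_reindex(names):
    """Work out visit-free column names for a list of column names.

    Hypothetical helper for illustration only. Returns None when the
    names span fewer than two visits, otherwise a tuple of
    (new names, {old name: new name}, set of visit codes).
    """
    old_to_new = {}
    visits = set()
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # e.g. the index column, which has no visit
        field, visit, array = match.groups()
        # Columns for different visits of a field collapse to one name
        old_to_new[name] = f'{field}.{array}'
        visits.add(int(visit))
    if len(visits) < 2:
        return None
    # De-duplicate the new names while preserving their order
    new_names = list(dict.fromkeys(old_to_new.values()))
    return new_names, old_to_new, visits


plan = plan_reindex(['eid', '6138-0.0', '6138-1.0', '6138-1.1', '31-0.0'])
single_visit = plan_reindex(['eid', '31-0.0'])
```

Several old columns map to the same new column, one per visit, which is why the mapping is keyed on the old columns.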
- funpack.importing.reindex.identifyCompressedColumns(columns: list[Column], vartable: DataFrame) → dict[Column, list[Column]]
Identifies “compressed” columns - multi-valued datafields which have been compressed into a single column.
This function generates a set of Column objects for a “compressed” column containing multiple values that are to be expanded out to one column per value. The expandCompressedColumn() function may then be used to generate the data for the new columns.
Any columns for any datafields which only have one column, and which should have multiple values according to the UKB array_max schema field (ArrayMax in the variable table) are identified as compressed columns.
- Parameters:
columns – List of columns in the input file.
vartable – Variable table containing UKB schema information on all data fields.
- Returns:
A dict of {Column : [Column]} mappings containing compressed columns as keys, and new columns to which they should be expanded as values.
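The identification rule can be sketched as follows. This is an illustration under stated assumptions rather than funpack's code: find_compressed is a hypothetical name, columns are represented by plain names of the form field-instance.array, the variable table is assumed to be indexed by field ID, and ArrayMax is assumed to give the number of values a field may hold.

```python
import re
from collections import defaultdict

import pandas as pd

NAME_PATTERN = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def find_compressed(names, vartable):
    """Map each compressed column name to its expanded column names.

    Hypothetical helper for illustration only. A field counts as
    compressed when it has exactly one column in the file but its
    ArrayMax entry (assumed to be a value count) is greater than one.
    """
    by_field = defaultdict(list)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is not None:
            by_field[int(match.group(1))].append(name)

    compressed = {}
    for field, cols in by_field.items():
        if len(cols) != 1 or field not in vartable.index:
            continue
        nvalues = int(vartable.loc[field, 'ArrayMax'])
        if nvalues > 1:
            # Keep the "<field>-<instance>" prefix, vary the array index
            prefix = cols[0].rsplit('.', 1)[0]
            compressed[cols[0]] = [f'{prefix}.{i}' for i in range(nvalues)]
    return compressed


# Toy variable table: field 6138 holds up to 5 values, field 31 holds 1
vartable = pd.DataFrame(
    {'ArrayMax': [5, 1]}, index=pd.Index([6138, 31], name='ID'))

found = find_compressed(['eid', '6138-0.0', '31-0.0'], vartable)
not_found = find_compressed(['eid', '6138-0.0', '6138-0.1'], vartable)
```

A field that already has more than one column in the file is left alone, since it is stored in the conventional multi-column form.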
- funpack.importing.reindex.reindexByVisit(df, oldcols, newcols, oldnewmap, visits)
Re-arranges the data so that visits form part of the row indices, rather than being stored in separate columns for each variable.
The genReindexedColumns() function is used to create the newcols, oldnewmap, and visits arguments.
The given dataframe is assumed to be indexed by a single-level pandas.Index. This is replaced with a pandas.MultiIndex, where the first level corresponds to the original index, and the second level to the visit number.
- Parameters:
oldnewmap – Dict of {old Column : new Column} mappings, describing how the old data maps to the new data.
- Returns:
The adjusted pandas.DataFrame.
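The effect of moving visits from the columns into the row index can be sketched with plain pandas. This is a simplified illustration, not funpack's implementation: reindex_by_visit is a hypothetical helper with a different signature, and its mapping stores a (new name, visit) pair per old column name instead of Column objects.

```python
import pandas as pd


def reindex_by_visit(df, mapping):
    """Move visits from the column names into the row index.

    Hypothetical helper for illustration only. `mapping` is a dict of
    {old column name: (new column name, visit)}.
    """
    pieces = []
    for visit in sorted({v for _, v in mapping.values()}):
        # Select the columns recorded at this visit and rename them
        cols = {old: new for old, (new, v) in mapping.items() if v == visit}
        piece = df[list(cols)].rename(columns=cols)
        # Original index becomes level one, the visit becomes level two
        piece.index = pd.MultiIndex.from_arrays(
            [df.index, [visit] * len(df)],
            names=[df.index.name, 'visit'])
        pieces.append(piece)
    # Columns with no data for a given visit are filled with NaN
    return pd.concat(pieces).sort_index()


wide = pd.DataFrame(
    {'31-0.0': [1, 0],
     '50-0.0': [170.0, 165.0],
     '50-1.0': [171.0, 164.0]},
    index=pd.Index([1000, 1001], name='eid'))

mapping = {'31-0.0': ('31.0', 0),
           '50-0.0': ('50.0', 0),
           '50-1.0': ('50.0', 1)}

stacked = reindex_by_visit(wide, mapping)
print(stacked)
```

Each original row appears once per visit in the result, so a variable that was only recorded at one visit has missing values in the rows for the other visits.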