funpack.importing.reindex

This module contains functions used by the core.importData() function, for re-arranging the input data as it is loaded.

funpack.importing.reindex.expandCompressedColumn(series: Series, col: Column, expanded: list[Column], delimiter: str = '|') → DataFrame

Expands values in a “compressed” column to multiple columns.

This function is used to take multiple values stored in a single “compressed” column, and copy them to multiple columns, so that each column contains one value.

Some UKB data files store multi-valued data fields in a single column, with values separated by pipe (|) characters. For example, datafield 6138 is conventionally stored across multiple columns, e.g.:

eid,6138-0.0,6138-0.1,6138-0.2
1000,1,,
1001,3,6,
1002,1,2,3
1003,6,,
1004,3,4,5

But may also be stored in a single column, e.g.:

eid,6138-0.0
1000,1
1001,3|6
1002,1|2|3
1003,6
1004,3|4|5

This function is used in conjunction with identifyCompressedColumns() to convert data from the latter format to the former.

Parameters:
  • series – pandas.Series containing the compressed column.

  • col – datatable.Column object describing the compressed column.

  • expanded – List of datatable.Column objects created by identifyCompressedColumns(), describing the expanded columns.

  • delimiter – String used to separate values in the compressed data.

Returns:

pandas.DataFrame containing the expanded columns.
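
The conversion described above can be sketched with plain pandas. This is a minimal illustration and not funpack's implementation: the helper name expand_pipe_column is hypothetical, it works on column names rather than Column objects, and it assumes the compressed column is named "<variable>-<visit>.0" and contains no missing values.

```python
import pandas as pd


def expand_pipe_column(series, delimiter="|"):
    """Split a delimited column into one column per value (hypothetical helper)."""
    # Split every value on the delimiter, e.g. "3|6" -> ["3", "6"].
    rows = [str(value).split(delimiter) for value in series]
    # Pad shorter rows with None so that every row has the same width.
    width = max(len(row) for row in rows)
    rows = [row + [None] * (width - len(row)) for row in rows]
    # Assumption: the compressed column is named "<variable>-<visit>.0",
    # so the expanded columns are numbered .0, .1, .2, ...
    prefix = series.name.rsplit(".", 1)[0]
    names = [f"{prefix}.{i}" for i in range(width)]
    return pd.DataFrame(rows, index=series.index, columns=names)


compressed = pd.Series(
    ["1", "3|6", "1|2|3", "6", "3|4|5"],
    index=pd.Index([1000, 1001, 1002, 1003, 1004], name="eid"),
    name="6138-0.0",
)
expanded = expand_pipe_column(compressed)
```

In this sketch the number of output columns is taken from the widest row, whereas expandCompressedColumn() receives the target columns explicitly through its expanded argument.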

funpack.importing.reindex.genReindexedColumns(cols, vartable, onlyRealVisits=True)

Figures out how to re-arrange the columns of a data set so that it is indexed by row ID and visit. This function is called by loadFile() when the indexVisits option is used.

The onlyRealVisits argument controls whether the re-indexing is only applied to variables which are labelled with instancing 2 (Biobank assessment centre visit), or is solely based on the instance component of the column name (see the parseColumnName() function for details).

Parameters:
  • cols – List of Column objects describing the existing data.

  • vartable – pandas.DataFrame containing the variable table.

  • onlyRealVisits – If True (the default), only re-index columns associated with variables that follow instancing 2. If False, re-index all columns according to the instance component of their column names.

Returns:

None if the file does not contain any data with multiple visits, or a tuple containing:

  • List of Column objects describing the re-indexed data set.

  • Dict of {old Column : new Column} mappings, describing how the old data maps to the new data.

  • A set containing all visit codes in the data.
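
The shape of this return value can be illustrated with a simplified, string-based sketch. It is not funpack's implementation: plan_reindex is a hypothetical name, it ignores the variable table and the onlyRealVisits option, it assumes column names of the form "<variable>-<visit>.<instance>", and the "<variable>.<instance>" naming of the new columns is an assumption made for illustration only.

```python
import re

# Assumed column name layout: "<variable>-<visit>.<instance>", e.g. "6138-0.1".
COLUMN_PATTERN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def plan_reindex(names):
    """Work out how columns would map to visit-indexed columns (hypothetical)."""
    mapping = {}    # old column name -> (new column name, visit)
    visits = set()  # all visit codes seen in the column names
    for name in names:
        match = COLUMN_PATTERN.match(name)
        if match is None:
            continue  # e.g. the "eid" index column, which is left alone
        variable, visit, instance = match.groups()
        visits.add(int(visit))
        # Assumption: the new column drops the visit from its name.
        mapping[name] = (f"{variable}.{instance}", int(visit))
    if len(visits) < 2:
        return None  # no data with multiple visits, nothing to re-index
    # Unique new column names, in order of first appearance.
    new_names = list(dict.fromkeys(new for new, _ in mapping.values()))
    return new_names, mapping, visits


plan = plan_reindex(["eid", "6138-0.0", "6138-1.0", "6138-0.1"])
single_visit = plan_reindex(["eid", "31-0.0"])
```

Here plan holds the three items described above, with plain strings standing in for Column objects, and single_visit is None because only one visit is present.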

funpack.importing.reindex.identifyCompressedColumns(columns: list[Column], vartable: DataFrame) → dict[Column, list[Column]]

Identifies “compressed” columns - multi-valued datafields which have been compressed into a single column.

This function generates a set of Column objects for a “compressed” column containing multiple values that are to be expanded out to one column per value. The expandCompressedColumn() function may then be used to generate the data for the new columns.

A datafield is identified as compressed if it has only one column in the input file, but should have multiple values according to the UKB array_max schema field (ArrayMax in the variable table).

Parameters:
  • columns – List of columns in the input file.

  • vartable – Variable table containing UKB schema information on all data fields.

Returns:

A dict of {Column : [Column]} mappings containing compressed columns as keys, and new columns to which they should be expanded as values.
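
The identification rule can be sketched as follows. This is a hedged illustration rather than funpack's code: find_compressed is a hypothetical name, column names stand in for Column objects, the variable table is assumed to be indexed by datafield ID, and ArrayMax is treated as the number of expected values, which may not match the exact semantics of the UKB schema.

```python
import re
from collections import defaultdict

import pandas as pd


def find_compressed(names, vartable):
    """Map each compressed column name to its expanded names (hypothetical)."""
    # Group the column names by datafield ID.
    by_variable = defaultdict(list)
    for name in names:
        match = re.match(r"^(\d+)-\d+\.\d+$", name)
        if match:
            by_variable[int(match.group(1))].append(name)

    compressed = {}
    for variable, cols in by_variable.items():
        # Only datafields with exactly one column can be compressed.
        if len(cols) != 1 or variable not in vartable.index:
            continue
        # Assumption: ArrayMax is the number of values the datafield may hold.
        array_max = int(vartable.loc[variable, "ArrayMax"])
        if array_max > 1:
            prefix = cols[0].rsplit(".", 1)[0]
            compressed[cols[0]] = [f"{prefix}.{i}" for i in range(array_max)]
    return compressed


vartable = pd.DataFrame({"ArrayMax": [3, 1]}, index=[6138, 31])
found = find_compressed(["eid", "6138-0.0", "31-0.0"], vartable)
```

In this example only datafield 6138 is flagged, because datafield 31 is single-valued according to the toy variable table.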

funpack.importing.reindex.reindexByVisit(df, oldcols, newcols, oldnewmap, visits)

Re-arranges the data so that visits form part of the row indices, rather than being stored in separate columns for each variable.

The genReindexedColumns() function is used to create the newcols, oldnewmap, and visits arguments.

The given dataframe is assumed to be indexed by a single-level pandas.Index. This is replaced with a pandas.MultiIndex, where the first level corresponds to the original index, and the second level to the visit number.

Parameters:
  • df – pandas.DataFrame containing the data.

  • oldcols – List of Column objects describing the existing data.

  • newcols – List of Column objects describing each column in the new adjusted data.

  • oldnewmap – Dict of {old Column : new Column} mappings, describing how the old data maps to the new data.

Returns:

Adjusted pandas.DataFrame.
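
The effect of moving visits from the columns into the row index can be sketched with plain pandas. This is an illustration under stated assumptions, not funpack's implementation: column names stand in for Column objects, and the oldnewmap layout used here (old name mapped to a new name and a visit) is hypothetical.

```python
import pandas as pd

# Toy data: datafield 6138 was recorded at visits 0 and 1, datafield 31 at visit 0.
df = pd.DataFrame(
    {"6138-0.0": [1, 3], "6138-1.0": [2, 4], "31-0.0": [0, 1]},
    index=pd.Index([1000, 1001], name="eid"),
)

# Hypothetical mapping: old column name -> (new column name, visit).
oldnewmap = {
    "6138-0.0": ("6138.0", 0),
    "6138-1.0": ("6138.0", 1),
    "31-0.0": ("31.0", 0),
}
visits = [0, 1]

# Build one frame per visit, containing only that visit's columns under
# their new names.
frames = []
for visit in visits:
    selected = {old: new for old, (new, v) in oldnewmap.items() if v == visit}
    frames.append(df[list(selected)].rename(columns=selected))

# Stack the per-visit frames; the keys become an extra index level.
reindexed = pd.concat(frames, keys=visits, names=["visit", "eid"])
# Put the original index first and the visit second.
reindexed = reindexed.swaplevel().sort_index()
```

The result has one row per (eid, visit) pair. Columns that were not recorded at a given visit, such as 31.0 at visit 1 here, are filled with missing values.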