funpack.datatable

This module provides the DataTable class, a container class which holds a reference to the loaded data.

funpack.datatable.AUTO_VARIABLE_ID = 5000000

Starting variable ID to use for unknown data. Automatically generated variable IDs should not conflict with actual UKB variable IDs.
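
To illustrate why the starting value is set so high, the following sketch hands out generated IDs from a counter that begins at AUTO_VARIABLE_ID and skips any ID that is already in use. The next_auto_vid helper and the known_vids set are hypothetical and are not part of funpack; this is not a description of how funpack itself allocates IDs.

```python
import itertools

# Same value as funpack.datatable.AUTO_VARIABLE_ID, copied from the text above.
AUTO_VARIABLE_ID = 5000000

# Hypothetical allocator: a counter that starts at AUTO_VARIABLE_ID.
_auto_vids = itertools.count(AUTO_VARIABLE_ID)


def next_auto_vid(known_vids):
    """Return the next generated ID that is not already in known_vids."""
    for vid in _auto_vids:
        if vid not in known_vids:
            return vid


# Example IDs only; 5000000 is included to show a clash being skipped.
known_vids = {1, 2, 5000000}
first = next_auto_vid(known_vids)
second = next_auto_vid(known_vids)
```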

class funpack.datatable.Column(datafile, name, index, vid=None, visit=0, instance=0, metadata=None, basevid=None, fillval=None, origname=None, **kwargs)

Bases: object

The Column is a simple container class containing metadata about a single column in a data file.

See the parseColumnName() function for important information about column naming conventions in the UK Biobank.

A fundamental assumption made throughout much of the funpack code-base is that columns with a variable ID (vid) equal to 0 are index columns (e.g. subject ID).
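
The vid == 0 convention can be sketched as a simple partition of columns into index and data columns. The Col namedtuple is a hypothetical stand-in carrying only the two fields needed here, and the column names are example values.

```python
from collections import namedtuple

# Hypothetical lightweight stand-in for Column; not the funpack class.
Col = namedtuple('Col', ['name', 'vid'])

cols = [Col('eid', 0), Col('1-0.0', 1), Col('2-0.0', 2)]

# Columns with a variable ID of 0 are treated as index columns.
index_cols = [c for c in cols if c.vid == 0]
data_cols = [c for c in cols if c.vid != 0]
```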

A Column object has the following attributes:

  • datafile: Input file that the column originates from (None for generated columns).

  • name: Column name.

  • origname: Column name as it appears in the input/output file(s)

  • index: Location of the column in the input/output file(s)

  • vid: Variable ID

  • visit: Visit number

  • instance: Instance number/code

  • metadata: Metadata which may be used to generate descriptive information about the column (see funpack.main.generateDescription()).

  • basevid: ID of the variable that the data in this column was derived/generated from (and has the same data type as), if not the original vid. For example, see the take argument to processing_functions.binariseCategorical().

  • fillval: Fill value to use for missing values, instead of the global default (e.g. as specified by --tsv_missing_values).

Any other keyword arguments given when a Column is created will be added as attributes - this feature may be used by cleaning/processing steps to pass information onto subsequent steps.
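
The keyword-argument behaviour can be sketched with a small stand-in class. ColumnSketch is hypothetical and only mirrors the constructor signature shown above; the file name, column name and the scaled keyword are made-up example values, and the real Column class may do more than this.

```python
class ColumnSketch:
    """Hypothetical stand-in for funpack.datatable.Column (not the real class)."""

    def __init__(self, datafile, name, index, vid=None, visit=0, instance=0,
                 metadata=None, basevid=None, fillval=None, origname=None,
                 **kwargs):
        self.datafile = datafile
        self.name = name
        self.index = index
        self.vid = vid
        self.visit = visit
        self.instance = instance
        self.metadata = metadata
        self.basevid = basevid
        self.fillval = fillval
        self.origname = origname

        # Any extra keyword arguments become attributes, so that one
        # processing step can leave information for a later step.
        for key, value in kwargs.items():
            setattr(self, key, value)


# 'scaled' is an arbitrary example of an extra keyword argument.
col = ColumnSketch('data.tsv', '1-0.0', 1, vid=1, scaled=True)
```

A later step could then check getattr(col, 'scaled', False) before deciding how to treat the column.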

class funpack.datatable.DataTable(*args, **kwargs)

Bases: Singleton

The DataTable is a simple container class.

It keeps references to the variable and processing tables, and the data itself. The importData() function creates and returns a DataTable.

A DataTable has the following attributes and helper methods:

pool

Return a multiprocessing.Pool object for performing tasks in parallel on the data.

manager

Returns a multiprocessing.Manager for sharing data between processes.

parallel

Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.

vartable

Returns the pandas.DataFrame containing the variable table.

proctable

Returns the pandas.DataFrame containing the processing table.

cattable

Returns the pandas.DataFrame containing the category table.

columns

Return the data columns corresponding to the specified variable, visit and instance.

allColumns

Returns a list of all columns present in the data, including index columns.

present

Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.

visits

Returns the visit IDs for the given variable.

instances

Returns the instance IDs for the given variable.

variables

Returns a list of all integer variable IDs present in the data.

Data should be accessed/modified via these methods:

index

Returns the subject indices.

__getitem__

Get the specified slice from the data.

__setitem__

Set the specified slice in the data.

Columns can be added/removed, and rows removed, via these methods:

maskSubjects

Remove subjects where mask is False.

addColumns

Adds one or more new columns to the data set.

removeColumns

Remove the columns described by cols.

Columns can be “flagged” with metadata labels via the addFlag() method. All of the flags on a column can be retrieved via the getFlags() method.

The subtable() method can be used to generate a replica DataTable with a specific subset of columns. It is intended for parallelisation, so that child processes are given a view of only the columns that are relevant to them, and as little copying between processes as possible takes place. The subtable() and merge() methods are intended to be used like so:

  1. Create subtables which only contain data for specific columns:

    cols      = ['1-0.0', '2-0.0', '3-0.0']
    subtables = [dtable.subtable([c]) for c in cols]
    
  2. Use multiprocessing to perform parallel processing on each column:

    def mytask(dtable, col):
        dtable[:, col] += 5
        return dtable
    
    with dtable.pool() as pool:
        subtables = pool.starmap(mytask, zip(subtables, cols))
    
  3. Merge the results back into the main table:

    dtable.merge(subtables)
    

Modifications must occur through the DataTable.__setitem__() interface, so it can keep track of which columns have been modified. Addition or removal of columns or rows on subtables is not supported.
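
The reason for routing writes through __setitem__ can be sketched with a toy table that records which columns were written and merges back only those columns. TrackedTable is hypothetical, uses plain lists instead of a pandas.DataFrame, and is not the DataTable implementation.

```python
class TrackedTable:
    """Toy table (hypothetical, not funpack's DataTable) that records writes."""

    def __init__(self, data):
        # Copy each column so that subtables do not share lists with the parent.
        self.data = {name: list(vals) for name, vals in data.items()}
        self.modified = set()

    def __getitem__(self, col):
        return self.data[col]

    def __setitem__(self, col, values):
        self.data[col] = list(values)
        self.modified.add(col)          # remember that this column was written

    def subtable(self, cols):
        return TrackedTable({c: self.data[c] for c in cols})

    def merge(self, subtables):
        for sub in subtables:
            for col in sub.modified:    # copy back only the tracked columns
                self.data[col] = sub.data[col]


main = TrackedTable({'a': [1, 2], 'b': [3, 4]})
sub = main.subtable(['a', 'b'])
sub['a'] = [v + 5 for v in sub['a']]    # goes through __setitem__: tracked
sub.data['b'][0] = 99                   # bypasses __setitem__: not tracked
main.merge([sub])
```

In this sketch the change to column 'b' is lost on merge because it was never recorded, which is the failure mode the rule above is meant to prevent.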

addColumns(series, vids=None, kwargs=None)

Adds one or more new columns to the data set.

Note

It is assumed that each pandas.Series object shares the same row indices as this DataTable.

Parameters:
  • series – Sequence of pandas.Series objects containing the new column data to add.

  • vids – Sequence of variable IDs, one for each new column. If None (the default), variable IDs are automatically assigned.

  • kwargs – Sequence of dicts, one for each series, containing arguments to pass to the Column constructor (e.g. visit, metadata, etc).

addFlag(col, flag)

Adds a flag for the specified column.

Parameters:
  • col – Column object

  • flag – Flag to add

property allColumns

Returns a list of all columns present in the data, including index columns.

property cattable

Returns the pandas.DataFrame containing the category table.

columns(variable, visit=None, instance=None)

Return the data columns corresponding to the specified variable, visit and instance.

Parameters:
  • variable – Integer variable ID

  • visit – Visit number. If None, column names for all visits are returned.

  • instance – Instance number. If None, column names for all instances are returned.

Returns:

A list of Column objects.
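
The way None acts as a wildcard for visit and instance can be sketched as a filter over a list of columns. The Col namedtuple and the select_columns function are hypothetical stand-ins, not funpack code, and the column names are example values.

```python
from collections import namedtuple

# Hypothetical stand-in for Column with only the fields used by the filter.
Col = namedtuple('Col', ['name', 'vid', 'visit', 'instance'])


def select_columns(cols, variable, visit=None, instance=None):
    """Return columns for variable; None matches every visit or instance."""
    return [c for c in cols
            if c.vid == variable
            and (visit is None or c.visit == visit)
            and (instance is None or c.instance == instance)]


cols = [Col('1-0.0', 1, 0, 0),
        Col('1-0.1', 1, 0, 1),
        Col('1-1.0', 1, 1, 0),
        Col('2-0.0', 2, 0, 0)]

all_for_1 = select_columns(cols, 1)
visit_0 = select_columns(cols, 1, visit=0)
exact = select_columns(cols, 1, visit=0, instance=1)
```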

property dataColumns

Returns a list of all non-index columns present in the data.

getFlags(col)

Return any flags associated with the specified column.

Parameters:

col – Column object

Returns:

A set containing any flags associated with col
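
One way to picture the flag store is a mapping from each column to a set of labels. The add_flag and get_flags functions below are hypothetical stand-ins keyed by column name; the real methods take Column objects, and how they store flags internally is not described here.

```python
from collections import defaultdict

# Hypothetical flag store: column name -> set of flags.
flags = defaultdict(set)


def add_flag(col, flag):
    """Attach flag to col; adding the same flag twice has no further effect."""
    flags[col].add(flag)


def get_flags(col):
    """Return a copy of the flags on col (empty set if there are none)."""
    # .get avoids creating an empty entry for columns that were never flagged.
    return set(flags.get(col, ()))


add_flag('1-0.0', 'cleaned')
add_flag('1-0.0', 'cleaned')
add_flag('1-0.0', 'derived')
```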

property index

Returns the subject indices.

property indexColumns

Returns a list of all index columns present in the data.

instances(variable)

Returns the instance IDs for the given variable.

property isSubtable

Returns True if this DataTable was created by a subtable() call.

property manager

Returns a multiprocessing.Manager for sharing data between processes.

maskSubjects(mask)

Remove subjects where mask is False.

merge(subtables)

Merge the data from the given subtables into this DataTable.

It is assumed that the subtables each contain a sub-set of the columns in this DataTable.

Note

The merge() method cannot be used to merge in a subtable that contains a subset of rows.

Parameters:

subtables – A single DataTable, or a sequence of DataTable instances, returned by subtable().

property njobs

Returns the number of jobs that will be used by the pool for parallelising tasks.

property parallel

Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.

pool()

Return a multiprocessing.Pool object for performing tasks in parallel on the data. If the njobs argument passed to __init__() was 1, or if this is a DataTable instance that has been unpickled (i.e. instances which are running in a sub-process), a multiprocessing.dummy.Pool instance is created and returned.
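
The single-job fallback can be sketched with the standard library alone. make_pool is a hypothetical helper that covers only the njobs == 1 case; it does not model the check for unpickled instances. multiprocessing.dummy.Pool has the same interface as multiprocessing.Pool but is backed by threads, so calling code does not need to know which one it received.

```python
import multiprocessing as mp
import multiprocessing.dummy as mpd


def make_pool(njobs):
    """Hypothetical helper: thread-backed pool for one job, else processes."""
    if njobs == 1:
        return mpd.Pool(1)      # same API as mp.Pool, but runs in threads
    return mp.Pool(njobs)


# With njobs == 1 no child processes are started.
with make_pool(1) as pool:
    result = pool.map(abs, [-1, 2, -3])
```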

present(variable, visit=None, instance=None)

Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.

property proctable

Returns the pandas.DataFrame containing the processing table.

removeColumns(cols)

Remove the columns described by cols.

Parameters:

cols – Sequence of Column objects to remove.

property shape

Returns the (nrows, ncolumns) shape of the data.

subtable(columns=None, rows=None)

Return a new DataTable which only contains the data for the specified columns and/or rows.

This method can be used to create a replica DataTable where the underlying pandas.DataFrame only contains the specified columns. It is intended to be used when parallelising tasks, so that child processes are given a view of only the relevant columns.

Note

The merge() method cannot be used to merge in a subtable that contains a subset of rows.

Parameters:
  • columns – List of Column objects.

  • rows – Sequence of row indices.

Returns:

A new DataTable.

property variables

Returns a list of all integer variable IDs present in the data.

The list includes the index variable (which has an id of 0).

property vartable

Returns the pandas.DataFrame containing the variable table.

visits(variable)

Returns the visit IDs for the given variable.

funpack.datatable.MODIFIED_COLUMN = 'nREdzFwKABYYmwCZIHfA'

Flag used internally by the DataTable when merging subtables.