`funpack.datatable`

This module provides the DataTable class, a container class which holds a reference to the loaded data.

funpack.datatable.AUTO_VARIABLE_ID = 5000000: Starting variable ID to use for unknown data. Automatically generated variable IDs should not conflict with actual UKB variable IDs.

class funpack.datatable.Column(datafile, name, index, vid=None, visit=0, instance=0, metadata=None, basevid=None, fillval=None, origname=None, **kwargs)

Bases: object

The Column is a simple container class containing metadata about a single column in a data file.

See the parseColumnName() function for important information about column naming conventions in the UK BioBank.

A fundamental assumption made throughout much of the funpack code-base is that columns with a variable ID (vid) equal to 0 are index columns (e.g. subject ID).
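
As an illustration of this convention, the sketch below partitions a list of column records into index and data columns by testing vid against 0. The dict-based records and the split_columns helper are hypothetical and are not part of funpack.

```python
def split_columns(cols):
    """Partition column records into (index, data) lists by variable ID."""
    # Convention described above: a vid of 0 marks an index column.
    index_cols = [c for c in cols if c['vid'] == 0]
    data_cols = [c for c in cols if c['vid'] != 0]
    return index_cols, data_cols


# Made-up column records for illustration.
cols = [{'name': 'eid', 'vid': 0},
        {'name': '1-0.0', 'vid': 1},
        {'name': '2-0.0', 'vid': 2}]

index_cols, data_cols = split_columns(cols)
```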

A Column object has the following attributes:

datafile: Input file that the column originates from (None for generated columns).

name: Column name.

origname: Column name as it appears in the input/output file(s)

index: Location of the column in the input/output file(s)

vid: Variable ID

visit: Visit number

instance: Instance number/code

metadata: Metadata which may be used to generate descriptive information about the column (see funpack.main.generateDescription()).

basevid: ID of the variable that the data in this column was derived/generated from (and has the same data type as), if not the original vid. For example, see the take argument to processing_functions.binariseCategorical().

fillval: Fill value to use for missing values, instead of the global default (e.g. as specified by --tsv_missing_values).

Any other keyword arguments given when a Column is created will be added as attributes - this feature may be used by cleaning/processing steps to pass information onto subsequent steps.
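
The keyword-argument behaviour can be illustrated with a small stand-in class. SketchColumn below is hypothetical and is not funpack's Column; it only mirrors the documented constructor signature and the rule that extra keyword arguments become attributes. The units attribute is likewise made up for the example.

```python
class SketchColumn:
    """Hypothetical stand-in for funpack.datatable.Column (illustration only)."""

    def __init__(self, datafile, name, index, vid=None, visit=0, instance=0,
                 metadata=None, basevid=None, fillval=None, origname=None,
                 **kwargs):
        self.datafile = datafile
        self.name = name
        self.index = index
        self.vid = vid
        self.visit = visit
        self.instance = instance
        self.metadata = metadata
        self.basevid = basevid
        self.fillval = fillval
        self.origname = origname
        # Any extra keyword arguments become plain attributes.
        for key, value in kwargs.items():
            setattr(self, key, value)


# 'units' is an invented extra attribute, as might be set by one
# processing step and read by a later one.
col = SketchColumn('data.tsv', '1-0.0', 1, vid=1, units='cm')
```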

class funpack.datatable.DataTable(*args, **kwargs)

Bases: Singleton

The DataTable is a simple container class.

It keeps references to the variable and processing tables, and the data itself. The importData() function creates and returns a DataTable.

A DataTable has the following attributes and helper methods:

`pool`	Return a `multiprocessing.Pool` object for performing tasks in parallel on the data.
`manager`	Returns a `multiprocessing.Manager` for sharing data between processes.
`parallel`	Returns `True` if this invocation of `funpack` has the ability to run tasks in parallel, `False` otherwise.
`vartable`	Returns the `pandas.DataFrame` containing the variable table.
`proctable`	Returns the `pandas.DataFrame` containing the processing table.
`cattable`	Returns the `pandas.DataFrame` containing the category table.
`columns`	Return the data columns corresponding to the specified `variable`, `visit` and `instance`.
`allColumns`	Returns a list of all columns present in the data, including index columns.
`present`	Returns `True` if the specified variable (and optionally visit/instance) is present in the data, `False` otherwise.
`visits`	Returns the visit IDs for the given `variable`.
`instances`	Returns the instance IDs for the given `variable`.
`variables`	Returns a list of all integer variable IDs present in the data.

Data should be accessed/modified via these methods:

`index`	Returns the subject indices.
`__getitem__`	Get the specified slice from the data.
`__setitem__`	Set the specified slice in the data.

Columns can be added/removed, and rows removed, via these methods:

`maskSubjects`	Remove subjects where `mask is False`.
`addColumns`	Adds one or more new columns to the data set.
`removeColumns`	Remove the columns described by `cols`.

Columns can be “flagged” with metadata labels via the addFlag() method. All of the flags on a column can be retrieved via the getFlags() method.

The subtable() method can be used to generate a replica DataTable with a specific subset of columns. It is intended for parallelisation, so that child processes are given a view of only the columns that are relevant to them, and as little copying between processes as possible takes place. The subtable() and merge() methods are intended to be used like so:

# Create subtables which only contain data for specific columns
cols      = ['1-0.0', '2-0.0', '3-0.0']
subtables = [dtable.subtable([c]) for c in cols]

# Use multiprocessing to perform parallel processing on each column
def mytask(dtable, col):
    dtable[:, col] += 5
    return dtable

with dtable.pool() as pool:
    subtables = pool.starmap(mytask, zip(subtables, cols))

# Merge the results back into the main table
dtable.merge(subtables)

Modifications must occur through the DataTable.__setitem__() interface, so it can keep track of which columns have been modified. Addition or removal of columns or rows on subtables is not supported.
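
To show why writes need to pass through __setitem__, here is a minimal, standard-library-only sketch of the split, process and merge pattern. SketchTable and add_five are hypothetical and do not reproduce funpack's implementation; the sketch assumes that merging copies back only those columns a subtable has recorded as modified, and it uses the thread-based multiprocessing.dummy.Pool so that nothing needs to be pickled.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the Pool API


class SketchTable:
    """Hypothetical miniature of the subtable/merge pattern (not funpack code)."""

    def __init__(self, data):
        self.data = {name: list(vals) for name, vals in data.items()}
        self.modified = set()

    def __getitem__(self, name):
        return self.data[name]

    def __setitem__(self, name, values):
        # Writes go through here so the table knows which columns changed.
        self.data[name] = list(values)
        self.modified.add(name)

    def subtable(self, names):
        # A replica that holds only the requested columns.
        return SketchTable({n: self.data[n] for n in names})

    def merge(self, subtables):
        # Copy back only the columns each subtable reports as modified.
        for sub in subtables:
            for name in sub.modified:
                self[name] = sub[name]


def add_five(table, name):
    table[name] = [v + 5 for v in table[name]]
    return table


main = SketchTable({'1-0.0': [1, 2], '2-0.0': [10, 20], '3-0.0': [7, 8]})
cols = ['1-0.0', '2-0.0']
subs = [main.subtable([c]) for c in cols]

with Pool(2) as pool:
    subs = pool.starmap(add_five, zip(subs, cols))

main.merge(subs)
```

In this sketch a column changed by mutating the list returned from __getitem__ in place would never be added to modified, so merge would silently skip it.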

addColumns(series, vids=None, kwargs=None)

Adds one or more new columns to the data set.

Note

It is assumed that each pandas.Series object shares the same row indices as this DataTable.

Parameters:

series – Sequence of pandas.Series objects containing the new column data to add.
vids – Sequence of variables each new column is associated with. If None (the default), variable IDs are automatically assigned.
kwargs – Sequence of dicts, one for each series, containing arguments to pass to the Column constructor (e.g. visit, metadata, etc).
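
The automatic assignment of variable IDs could work along the following lines. This is a sketch under the assumption that IDs are handed out sequentially starting from AUTO_VARIABLE_ID; the resolve_vids helper is hypothetical, and the scheme funpack actually uses may differ.

```python
import itertools

AUTO_VARIABLE_ID = 5000000  # starting value documented for funpack.datatable

# Assumption: automatic IDs are allocated sequentially from the start value.
_auto_ids = itertools.count(AUTO_VARIABLE_ID)


def resolve_vids(nseries, vids=None):
    """Return one variable ID per new series (hypothetical helper)."""
    if vids is None:
        return [next(_auto_ids) for _ in range(nseries)]
    if len(vids) != nseries:
        raise ValueError('need one variable ID per series')
    return list(vids)


auto = resolve_vids(2)                   # IDs drawn from the automatic range
given = resolve_vids(2, vids=[31, 50])   # caller-supplied IDs are kept as-is
```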

addFlag(col, flag)

Adds a flag for the specified column.

Parameters:

col – Column object
flag – Flag to add

property allColumns: Returns a list of all columns present in the data, including index columns.

property cattable: Returns the pandas.DataFrame containing the category table.

columns(variable, visit=None, instance=None)

Return the data columns corresponding to the specified variable, visit and instance.

Parameters:

variable – Integer variable ID
visit – Visit number. If None, column names for all visits are returned.
instance – Instance number. If None, column names for all instances are returned.

Returns:

A list of Column objects.
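
The wildcard behaviour of visit and instance can be sketched as a simple filter. The Col record and the select_columns function are hypothetical; they illustrate the documented semantics rather than the actual lookup used by DataTable.

```python
from collections import namedtuple

# Hypothetical lightweight column record; funpack's Column carries more fields.
Col = namedtuple('Col', ['name', 'vid', 'visit', 'instance'])


def select_columns(cols, variable, visit=None, instance=None):
    """Filter cols by variable ID; None is a wildcard for visit/instance."""
    return [c for c in cols
            if c.vid == variable
            and (visit is None or c.visit == visit)
            and (instance is None or c.instance == instance)]


cols = [Col('eid', 0, 0, 0),
        Col('1-0.0', 1, 0, 0),
        Col('1-1.0', 1, 1, 0),
        Col('2-0.0', 2, 0, 0)]

all_visits = select_columns(cols, 1)             # every visit of variable 1
second_visit = select_columns(cols, 1, visit=1)  # only visit 1 of variable 1
```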

property dataColumns: Returns a list of all non-index columns present in the data.

getFlags(col)

Return any flags associated with the specified column.

Parameters: col – Column object
Returns: A set containing any flags associated with col
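
One simple way to model per-column flags is a mapping from column to a set of labels. The FlagStore class and the flag names below are hypothetical, and string keys stand in for Column objects; how DataTable stores flags internally is not described here.

```python
from collections import defaultdict


class FlagStore:
    """Hypothetical per-column flag registry (illustration, not funpack code)."""

    def __init__(self):
        self._flags = defaultdict(set)

    def add_flag(self, col, flag):
        self._flags[col].add(flag)

    def get_flags(self, col):
        # Return a copy so callers cannot mutate the stored set; .get avoids
        # creating an entry when an unflagged column is queried.
        return set(self._flags.get(col, ()))


store = FlagStore()
store.add_flag('1-0.0', 'cleaned')
store.add_flag('1-0.0', 'cleaned')    # adding the same flag twice is harmless
store.add_flag('1-0.0', 'binarised')
```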

property index: Returns the subject indices.

property indexColumns: Returns a list of all index columns present in the data.

instances(variable): Returns the instance IDs for the given variable.

property isSubtable: Returns True if this DataTable was created by a subtable() call.

property manager: Returns a multiprocessing.Manager for sharing data between processes.

maskSubjects(mask): Remove subjects where mask is False.
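
The masking semantics (rows are kept where the mask is True and dropped where it is False) can be sketched with plain lists. The mask_rows helper and the subject IDs are hypothetical; DataTable itself operates on its underlying pandas.DataFrame.

```python
def mask_rows(rows, mask):
    """Keep rows whose mask entry is truthy; drop the rest."""
    if len(rows) != len(mask):
        raise ValueError('mask must have one entry per row')
    return [row for row, keep in zip(rows, mask) if keep]


subjects = [1001, 1002, 1003, 1004]   # made-up subject IDs
mask = [True, False, True, False]
kept = mask_rows(subjects, mask)
```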

merge(subtables)

Merge the data from the given subtables into this DataTable.

It is assumed that the subtables each contain a sub-set of the columns in this DataTable.

Note

The merge() method cannot be used to merge in a subtable that contains a subset of rows.

Parameters: subtables – A single DataTable, or a sequence of DataTable instances, returned by subtable().

property njobs: Returns the number of jobs that will be used by the pool for parallelising tasks.

property parallel: Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.

pool(): Return a multiprocessing.Pool object for performing tasks in parallel on the data. If the njobs argument passed to __init__() was 1, or if this is a DataTable instance that has been unpickled (i.e. instances which are running in a sub-process), a multiprocessing.dummy.Pool instance is created and returned.
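
The choice between the two pool types might look roughly like the sketch below. The make_pool helper and its in_subprocess argument are hypothetical; only multiprocessing.Pool and multiprocessing.dummy.Pool are real standard-library classes. The example deliberately exercises only the serial branch, since creating a process pool at module level needs an `if __name__ == '__main__':` guard on some platforms.

```python
import multiprocessing as mp
import multiprocessing.dummy as mpd


def make_pool(njobs, in_subprocess=False):
    """Pick a process pool or a thread-backed dummy pool (hypothetical helper)."""
    if njobs <= 1 or in_subprocess:
        # multiprocessing.dummy.Pool offers the Pool API but runs in threads,
        # so callers can use the same code path in both cases.
        return mpd.Pool(1)
    return mp.Pool(njobs)


def square(x):
    return x * x


with make_pool(1) as pool:
    serial = pool.map(square, [1, 2, 3])
```

Returning an object with the same interface in both cases means that calling code never needs to branch on whether parallelism is available.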

present(variable, visit=None, instance=None): Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.

property proctable: Returns the pandas.DataFrame containing the processing table.

removeColumns(cols)

Remove the columns described by cols.

Parameters: cols – Sequence of Column objects to remove.

property shape: Returns the (nrows, ncolumns) shape of the data.

subtable(columns=None, rows=None)

Return a new DataTable which only contains the data for the specified columns and/or rows.

This method can be used to create a replica DataTable where the underlying pandas.DataFrame only contains the specified columns. It is intended to be used when parallelising tasks, so that child processes are given a view of only the relevant columns.

Note

The merge() method cannot be used to merge in a subtable that contains a subset of rows.

Parameters:

columns – List of Column objects.
rows – Sequence of row indices.

Returns:

A new DataTable.

property variables

Returns a list of all integer variable IDs present in the data.

The list includes the index variable (which has an id of 0).

property vartable: Returns the pandas.DataFrame containing the variable table.

visits(variable): Returns the visit IDs for the given variable.

funpack.datatable.MODIFIED_COLUMN = 'nREdzFwKABYYmwCZIHfA': Flag used internally by the DataTable when merging subtables.
