funpack.datatable
This module provides the DataTable class, a container class which holds a reference to the loaded data.
- funpack.datatable.AUTO_VARIABLE_ID = 5000000
Starting variable ID to use for unknown data. Automatically generated variable IDs should not conflict with actual UKB variable IDs.
- class funpack.datatable.Column(datafile, name, index, vid=None, visit=0, instance=0, metadata=None, basevid=None, fillval=None, origname=None, **kwargs)
Bases: object
The Column is a simple container class containing metadata about a single column in a data file.
See the parseColumnName() function for important information about column naming conventions in the UK BioBank.
A fundamental assumption made throughout much of the funpack code-base is that columns with a variable ID (vid) equal to 0 are index columns (e.g. subject ID).
A Column object has the following attributes:
datafile: Input file that the column originates from (None for generated columns).
name: Column name.
origname: Column name as it appears in the input/output file(s).
index: Location of the column in the input/output file(s).
vid: Variable ID.
visit: Visit number.
instance: Instance number/code.
metadata: Metadata which may be used to generate descriptive information about the column (see funpack.main.generateDescription()).
basevid: ID of the variable that the data in this column was derived/generated from (and has the same data type as), if not the original vid. For example, see the take argument to processing_functions.binariseCategorical().
fillval: Fill value to use for missing values, instead of the global default (e.g. as specified by --tsv_missing_values).
Any other keyword arguments given when a Column is created will be added as attributes. This feature may be used by cleaning/processing steps to pass information on to subsequent steps.
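The pass-through of extra keyword arguments described above could look roughly like the following. This is a minimal sketch and not the funpack implementation: the SketchColumn class and the wasBinarised keyword are hypothetical names chosen for illustration, and the fallback of basevid and origname to vid and name is an assumption. Only the constructor parameter names are taken from the signature above.

```python
class SketchColumn:
    """Hypothetical stand-in for Column, for illustration only."""

    def __init__(self, datafile, name, index, vid=None, visit=0,
                 instance=0, metadata=None, basevid=None, fillval=None,
                 origname=None, **kwargs):
        self.datafile = datafile
        self.name = name
        self.index = index
        self.vid = vid
        self.visit = visit
        self.instance = instance
        self.metadata = metadata
        self.fillval = fillval
        # Assumption: when not given, basevid and origname fall back
        # to vid and name respectively
        self.basevid = vid if basevid is None else basevid
        self.origname = name if origname is None else origname
        # Any extra keyword arguments become plain attributes
        for key, value in kwargs.items():
            setattr(self, key, value)


# 'wasBinarised' is a made-up keyword, stored as an attribute
col = SketchColumn('data.tsv', '1-0.0', 1, vid=1, wasBinarised=True)
print(col.vid, col.basevid, col.origname, col.wasBinarised)
```

A later processing step could then test for the extra attribute with getattr(col, 'wasBinarised', False) without needing to know which step set it.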
- class funpack.datatable.DataTable(*args, **kwargs)
Bases: Singleton
The DataTable is a simple container class.
It keeps references to the variable and processing tables, and the data itself. The importData() function creates and returns a DataTable.
A DataTable has the following attributes and helper methods:
pool: Return a multiprocessing.Pool object for performing tasks in parallel on the data.
manager: Returns a multiprocessing.Manager for sharing data between processes.
parallel: Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.
vartable: Returns the pandas.DataFrame containing the variable table.
proctable: Returns the pandas.DataFrame containing the processing table.
cattable: Returns the pandas.DataFrame containing the category table.
columns: Return the data columns corresponding to the specified variable, visit and instance.
allColumns: Returns a list of all columns present in the data, including index columns.
present: Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.
Returns the visit IDs for the given variable.
Returns the instance IDs for the given variable.
variables: Returns a list of all integer variable IDs present in the data.
Data should be accessed/modified via these methods:
index: Returns the subject indices.
__getitem__: Get the specified slice from the data.
__setitem__: Set the specified slice in the data.
Columns can be added/removed, and rows removed, via these methods:
Remove subjects where mask is False.
addColumns: Adds one or more new columns to the data set.
removeColumns: Remove the columns described by cols.
Columns can be “flagged” with metadata labels via the addFlag() method. All of the flags on a column can be retrieved via the getFlags() method.
The subtable() method can be used to generate a replica DataTable with a specific subset of columns. It is intended for parallelisation, so that child processes are given a view of only the columns that are relevant to them, and as little copying between processes as possible takes place. The subtable() and merge() methods are intended to be used like so:
1. Create subtables which only contain data for specific columns:

    cols = ['1-0.0', '2-0.0', '3-0.0']
    # one subtable per column
    subtables = [dtable.subtable([c]) for c in cols]

2. Use multiprocessing to perform parallel processing on each column:

    def mytask(dtable, col):
        # modify the column through the DataTable interface
        dtable[:, col] += 5
        return dtable

    with dtable.pool() as pool:
        subtables = pool.starmap(mytask, zip(subtables, cols))

3. Merge the results back into the main table:

    dtable.merge(subtables)
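For readers without a funpack installation, the same split/process/merge pattern can be tried with plain Python containers. This is a rough sketch only: the dict-of-lists table and the subtable, add_five and merge helpers are hypothetical stand-ins that are not part of funpack, and multiprocessing.dummy.Pool (a thread pool with the Pool API) is used so that the example runs without pickling anything.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the Pool API


def subtable(table, cols):
    """Return a new table holding copies of only the requested columns."""
    return {c: list(table[c]) for c in cols}


def add_five(sub, col):
    """Task run on one subtable: add 5 to every value in the column."""
    sub[col] = [v + 5 for v in sub[col]]
    return sub


def merge(table, subtables):
    """Copy the columns of each subtable back into the main table."""
    for sub in subtables:
        table.update(sub)


# A toy "table": column name -> list of values (hypothetical data)
table = {'1-0.0': [1, 2, 3], '2-0.0': [10, 20, 30], '3-0.0': [0, 0, 0]}
cols = ['1-0.0', '2-0.0']

# 1. one subtable per column
subtables = [subtable(table, [c]) for c in cols]

# 2. process each subtable in the pool
with Pool(2) as pool:
    subtables = pool.starmap(add_five, zip(subtables, cols))

# 3. merge the results back into the main table
merge(table, subtables)
print(table)
```

Each task only ever sees the single column it was given, which is the property the subtable() approach is described as aiming for.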
Modifications must occur through the DataTable.__setitem__() interface, so it can keep track of which columns have been modified. Addition or removal of columns or rows on subtables is not supported.
- addColumns(series, vids=None, kwargs=None)
Adds one or more new columns to the data set.
Note
It is assumed that each pandas.Series object shares the same row indices as this DataTable.
- Parameters:
series – Sequence of pandas.Series objects containing the new column data to add.
vids – Sequence of variables each new column is associated with. If None (the default), variable IDs are automatically assigned.
kwargs – Sequence of dicts, one for each series, containing arguments to pass to the Column constructor (e.g. visit, metadata, etc).
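One way automatically assigned variable IDs might stay clear of real UKB IDs is sketched below. This is an illustration under stated assumptions rather than funpack's actual logic: the assign_vids helper and its arguments are hypothetical, and only the AUTO_VARIABLE_ID starting value of 5000000 comes from this module's documentation.

```python
AUTO_VARIABLE_ID = 5000000  # starting value documented for this module


def assign_vids(existing_vids, count, vids=None):
    """Return ``count`` variable IDs, generating any that are not given.

    Hypothetical helper: generated IDs start at AUTO_VARIABLE_ID, or just
    above the largest ID already in use, whichever is larger.
    """
    if vids is None:
        vids = [None] * count
    # First free ID at or above AUTO_VARIABLE_ID
    start = max([AUTO_VARIABLE_ID - 1] + list(existing_vids)) + 1
    result = []
    for vid in vids:
        if vid is None:
            vid = start
            start += 1
        result.append(vid)
    return result


# No IDs given: both are generated
first = assign_vids([0, 31, 21003], 2)
# One ID given explicitly, one generated above an existing automatic ID
second = assign_vids([0, 31, 5000000], 2, vids=[None, 42])
print(first, second)
```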
- addFlag(col, flag)
Adds a flag for the specified column.
- Parameters:
col – Column object
flag – Flag to add
- property allColumns
Returns a list of all columns present in the data, including index columns.
- property cattable
Returns the pandas.DataFrame containing the category table.
- columns(variable, visit=None, instance=None)
Return the data columns corresponding to the specified variable, visit and instance.
- Parameters:
variable – Integer variable ID
visit – Visit number. If None, columns for all visits are returned.
instance – Instance number. If None, columns for all instances are returned.
- Returns:
A list of Column objects.
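The "None matches everything" behaviour of the visit and instance arguments can be illustrated with a small stand-alone filter. This is a sketch under assumptions, not funpack code: the Col namedtuple and the free functions columns and present are hypothetical stand-ins for the Column class and the DataTable methods of the same names.

```python
from collections import namedtuple

# Hypothetical stand-in for Column with just the fields used here
Col = namedtuple('Col', ['name', 'vid', 'visit', 'instance'])


def columns(all_columns, variable, visit=None, instance=None):
    """Return columns for ``variable``; None matches every visit/instance."""
    return [c for c in all_columns
            if c.vid == variable
            and (visit is None or c.visit == visit)
            and (instance is None or c.instance == instance)]


def present(all_columns, variable, visit=None, instance=None):
    """Return True if at least one matching column exists."""
    return len(columns(all_columns, variable, visit, instance)) > 0


cols = [Col('eid', 0, 0, 0),      # vid 0 marks an index column
        Col('1-0.0', 1, 0, 0),
        Col('1-1.0', 1, 1, 0),
        Col('2-0.0', 2, 0, 0)]

all_visits = [c.name for c in columns(cols, 1)]
one_visit = [c.name for c in columns(cols, 1, visit=1)]
print(all_visits, one_visit, present(cols, 3))
```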
- property dataColumns
Returns a list of all non-index columns present in the data.
- getFlags(col)
Return any flags associated with the specified column.
- Parameters:
col – Column object
- Returns:
A set containing any flags associated with col
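Since getFlags() is documented as returning a set, flag storage can be pictured as a mapping from each column to a set of labels. The following is a minimal sketch of that idea only: the FlagStore class is hypothetical, the flag names are made up, and strings are used as keys in place of Column objects.

```python
from collections import defaultdict


class FlagStore:
    """Hypothetical store mapping each column to a set of flags."""

    def __init__(self):
        self._flags = defaultdict(set)

    def addFlag(self, col, flag):
        self._flags[col].add(flag)

    def getFlags(self, col):
        # Return a copy so callers cannot modify the stored set;
        # .get avoids creating an empty entry for unknown columns
        return set(self._flags.get(col, ()))


store = FlagStore()
store.addFlag('1-0.0', 'cleaned')
store.addFlag('1-0.0', 'cleaned')    # adding twice has no further effect
store.addFlag('1-0.0', 'binarised')
print(store.getFlags('1-0.0'))
print(store.getFlags('2-0.0'))       # unflagged column: empty set
```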
- property index
Returns the subject indices.
- property indexColumns
Returns a list of all index columns present in the data.
- property isSubtable
Returns True if this DataTable was created by a subtable() call.
- property manager
Returns a multiprocessing.Manager for sharing data between processes.
- merge(subtables)
Merge the data from the given subtables into this DataTable.
It is assumed that the subtables each contain a sub-set of the columns in this DataTable.
Note
The merge() method cannot be used to merge in a subtable that contains a subset of rows.
- Parameters:
subtables – A single DataTable, or a sequence of DataTable instances, returned by subtable().
- property parallel
Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.
- pool()
Return a multiprocessing.Pool object for performing tasks in parallel on the data. If the njobs argument passed to __init__() was 1, or if this is a DataTable instance that has been unpickled (i.e. instances which are running in a sub-process), a multiprocessing.dummy.Pool instance is created and returned.
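The choice between a process pool and a thread-based pool described above might be sketched as follows. The make_pool helper and its in_subprocess flag are hypothetical and only approximate the documented rule; the sketch does not show how funpack detects an unpickled instance.

```python
import multiprocessing as mp
import multiprocessing.dummy as mpdummy


def make_pool(njobs, in_subprocess=False):
    """Hypothetical helper: choose a process pool or a thread-based pool."""
    if njobs == 1 or in_subprocess:
        # multiprocessing.dummy.Pool has the Pool API but runs tasks
        # in threads of the current process
        return mpdummy.Pool(1)
    return mp.Pool(njobs)


# With a single job, a thread-based pool is returned
with make_pool(1) as pool:
    kind = type(pool).__name__
    result = pool.map(abs, [-1, -2, 3])

print(kind, result)
```

Because both pool types expose the same methods, calling code such as pool.starmap(...) does not need to know which one it was given.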
- present(variable, visit=None, instance=None)
Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.
- property proctable
Returns the pandas.DataFrame containing the processing table.
- removeColumns(cols)
Remove the columns described by cols.
- Parameters:
cols – Sequence of Column objects to remove.
- property shape
Returns the (nrows, ncolumns) shape of the data.
- subtable(columns=None, rows=None)
Return a new DataTable which only contains the data for the specified columns and/or rows.
This method can be used to create a replica DataTable where the underlying pandas.DataFrame only contains the specified columns. It is intended to be used when parallelising tasks, so that child processes are given a view of only the relevant columns.
Note
The merge() method cannot be used to merge in a subtable that contains a subset of rows.
- property variables
Returns a list of all integer variable IDs present in the data.
The list includes the index variable (which has an ID of 0).
- property vartable
Returns the pandas.DataFrame containing the variable table.