funpack.datatable
This module provides the DataTable class, a container
class which holds a reference to the loaded data.
- funpack.datatable.AUTO_VARIABLE_ID = 5000000
Starting variable ID to use for unknown data. Automatically generated variable IDs should not conflict with actual UKB variable IDs.
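As a rough illustration of how a starting value like this might be used, the sketch below hands out IDs for columns that have no recognised UKB variable ID. It assumes IDs are allocated sequentially from the starting value, which the text above does not state; `next_auto_vid` is a hypothetical helper, not part of funpack.

```python
import itertools

# Value documented above, redefined so the sketch is self-contained.
AUTO_VARIABLE_ID = 5000000

# Hypothetical allocator: yields 5000000, 5000001, ... on successive calls.
# Sequential allocation is an assumption made for this sketch.
_auto_vids = itertools.count(AUTO_VARIABLE_ID)


def next_auto_vid():
    """Return the next unused automatically generated variable ID."""
    return next(_auto_vids)


first = next_auto_vid()
second = next_auto_vid()
```

Starting well above the range of real variable IDs keeps generated IDs from colliding with IDs read from the input data.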
- class funpack.datatable.Column(datafile, name, index, vid=None, visit=0, instance=0, metadata=None, basevid=None, fillval=None, origname=None, **kwargs)
Bases: object
The Column is a simple container class containing metadata about a single column in a data file.
See the parseColumnName() function for important information about column naming conventions in the UK BioBank.
A fundamental assumption made throughout much of the funpack code-base is that columns with a variable ID (vid) equal to 0 are index columns (e.g. subject ID).
A Column object has the following attributes:
datafile: Input file that the column originates from (None for generated columns).
name: Column name.
origname: Column name as it appears in the input/output file(s).
index: Location of the column in the input/output file(s).
vid: Variable ID.
visit: Visit number.
instance: Instance number/code.
metadata: Metadata which may be used to generate descriptive information about the column (see funpack.main.generateDescription()).
basevid: ID of the variable that the data in this column was derived/generated from (and has the same data type as), if not the original vid. For example, see the take argument to processing_functions.binariseCategorical().
fillval: Fill value to use for missing values, instead of the global default (e.g. as specified by --tsv_missing_values).
Any other keyword arguments given when a Column is created will be added as attributes. This feature may be used by cleaning/processing steps to pass information on to subsequent steps.
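The attribute behaviour described above can be sketched with a small stand-in class. This is an illustrative approximation, not funpack's implementation: `SketchColumn`, the `is_index` helper, the file name `data.tsv` and the `cleaned` keyword are all hypothetical, and only a few of the documented arguments are modelled.

```python
class SketchColumn:
    """Hypothetical stand-in for Column; not the real funpack class."""

    def __init__(self, datafile, name, index, vid=None, visit=0,
                 instance=0, **kwargs):
        self.datafile = datafile
        self.name = name
        self.index = index
        self.vid = vid
        self.visit = visit
        self.instance = instance
        # Any extra keyword arguments become attributes, mirroring the
        # behaviour documented for Column.
        for key, value in kwargs.items():
            setattr(self, key, value)


def is_index(col):
    """Columns with a variable ID of 0 are treated as index columns."""
    return col.vid == 0


# 'data.tsv' and the 'cleaned' keyword are made up for this example.
eid = SketchColumn('data.tsv', 'eid', 0, vid=0)
age = SketchColumn('data.tsv', '1-0.0', 1, vid=1, cleaned=True)
```

Here `age.cleaned` is available to any later step that receives the column, which is the kind of hand-off the extra keyword arguments are meant to support.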
- class funpack.datatable.DataTable(*args, **kwargs)
Bases: Singleton
The DataTable is a simple container class.
It keeps references to the variable and processing tables, and the data itself. The importData() function creates and returns a DataTable.
A DataTable has the following attributes and helper methods:
pool: Return a multiprocessing.Pool object for performing tasks in parallel on the data.
manager: Returns a multiprocessing.Manager for sharing data between processes.
parallel: Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.
vartable: Returns the pandas.DataFrame containing the variable table.
proctable: Returns the pandas.DataFrame containing the processing table.
cattable: Returns the pandas.DataFrame containing the category table.
columns: Return the data columns corresponding to the specified variable, visit and instance.
allColumns: Returns a list of all columns present in the data, including index columns.
present: Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.
Returns the visit IDs for the given variable.
Returns the instance IDs for the given variable.
variables: Returns a list of all integer variable IDs present in the data.
Data should be accessed/modified via these methods:
index: Returns the subject indices.
__getitem__: Get the specified slice from the data.
__setitem__: Set the specified slice in the data.
Columns can be added/removed, and rows removed, via these methods:
Remove subjects where mask is False.
addColumns: Adds one or more new columns to the data set.
removeColumns: Remove the columns described by cols.
Columns can be “flagged” with metadata labels via the addFlag() method. All of the flags on a column can be retrieved via the getFlags() method.
The subtable() method can be used to generate a replica DataTable with a specific subset of columns. It is intended for parallelisation, so that child processes are given a view of only the columns that are relevant to them, and as little copying between processes as possible takes place. The subtable() and merge() methods are intended to be used like so:
Create subtables which only contain data for specific columns:
    cols = ['1-0.0', '2-0.0', '3-0.0']
    subtables = [dtable.subtable([c]) for c in cols]
Use multiprocessing to perform parallel processing on each column:
    def mytask(dtable, col):
        # Assign through __setitem__ so the change is tracked
        dtable[:, col] += 5
        return dtable

    with dtable.pool() as pool:
        subtables = pool.starmap(mytask, zip(subtables, cols))
Merge the results back into the main table:
    dtable.merge(subtables)
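The split, process and merge pattern can be tried out without funpack using a toy table. The sketch below is a stand-in built on several assumptions: `ToyTable` is hypothetical, stores columns as plain lists, indexes by column name only, and runs tasks in a thread-based `multiprocessing.dummy.Pool` rather than in real sub-processes. It records which columns were assigned to via `__setitem__` so that `merge` knows what to copy back.

```python
from multiprocessing.dummy import Pool  # thread-based pool from the stdlib


class ToyTable:
    """Hypothetical stand-in for DataTable, holding columns as lists."""

    def __init__(self, data):
        self.data = {name: list(vals) for name, vals in data.items()}
        self.modified = set()

    def __getitem__(self, col):
        return self.data[col]

    def __setitem__(self, col, values):
        # All writes go through here so modified columns can be tracked.
        self.data[col] = list(values)
        self.modified.add(col)

    def subtable(self, columns):
        # A new table holding copies of only the requested columns.
        return ToyTable({c: self.data[c] for c in columns})

    def merge(self, subtables):
        # Copy back only those columns that a subtable reports as modified.
        for sub in subtables:
            for col in sub.modified:
                self[col] = sub[col]


def mytask(table, col):
    table[col] = [v + 5 for v in table[col]]
    return table


dtable = ToyTable({'1-0.0': [1, 2], '2-0.0': [10, 20], '3-0.0': [0, 0]})
cols = ['1-0.0', '2-0.0']
subtables = [dtable.subtable([c]) for c in cols]

with Pool(2) as pool:
    subtables = pool.starmap(mytask, zip(subtables, cols))

dtable.merge(subtables)
```

Because `merge` relies on the `modified` set, a task that changed `table.data` directly would have its work silently dropped in this toy version.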
Modifications must occur through the DataTable.__setitem__() interface, so it can keep track of which columns have been modified. Addition or removal of columns or rows on subtables is not supported.
- addColumns(series, vids=None, kwargs=None)
Adds one or more new columns to the data set.
Note
It is assumed that each pandas.Series object shares the same row indices as this DataTable.
- Parameters:
series – Sequence of pandas.Series objects containing the new column data to add.
vids – Sequence of variables each new column is associated with. If None (the default), variable IDs are automatically assigned.
kwargs – Sequence of dicts, one for each series, containing arguments to pass to the Column constructor (e.g. visit, metadata, etc).
- addFlag(col, flag)
Adds a flag for the specified column.
- Parameters:
col – Column object
flag – Flag to add
- property allColumns
Returns a list of all columns present in the data, including index columns.
- property cattable
Returns the pandas.DataFrame containing the category table.
- columns(variable, visit=None, instance=None)
Return the data columns corresponding to the specified variable, visit and instance.
- Parameters:
variable – Integer variable ID
visit – Visit number. If None, columns for all visits are returned.
instance – Instance number. If None, columns for all instances are returned.
- Returns:
A list of Column objects.
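The None handling for visit and instance can be illustrated with a small filter. This is a sketch of the documented semantics only, assuming columns are held in a flat list and named in the variable-visit.instance style used in the examples above; `Col` and `select_columns` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical, simplified column record.
Col = namedtuple('Col', ['name', 'vid', 'visit', 'instance'])


def select_columns(all_cols, variable, visit=None, instance=None):
    """Return columns for `variable`; None means 'match any'."""
    return [c for c in all_cols
            if c.vid == variable
            and (visit is None or c.visit == visit)
            and (instance is None or c.instance == instance)]


all_cols = [
    Col('1-0.0', 1, 0, 0),
    Col('1-1.0', 1, 1, 0),
    Col('1-1.1', 1, 1, 1),
    Col('2-0.0', 2, 0, 0),
]

every_visit = select_columns(all_cols, 1)
visit_one = select_columns(all_cols, 1, visit=1)
exact = select_columns(all_cols, 1, visit=1, instance=1)
```

A presence check in the style of present() could be built on the same filter by testing whether the returned list is non-empty.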
- property dataColumns
Returns a list of all non-index columns present in the data.
- getFlags(col)
Return any flags associated with the specified column.
- Parameters:
col – Column object
- Returns:
A set containing any flags associated with col.
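One simple way to model this kind of per-column flagging is a mapping from column to a set of labels. The sketch below is illustrative only: `FlagStore` and the flag names are hypothetical, columns are keyed by name for brevity (the documented methods take Column objects), and funpack's internal storage may differ.

```python
from collections import defaultdict


class FlagStore:
    """Hypothetical flag registry keyed by column."""

    def __init__(self):
        self._flags = defaultdict(set)

    def addFlag(self, col, flag):
        self._flags[col].add(flag)

    def getFlags(self, col):
        # Return a copy so callers cannot mutate internal state;
        # .get avoids creating entries for unflagged columns.
        return set(self._flags.get(col, ()))


store = FlagStore()
# Flag names here are invented for the example.
store.addFlag('1-0.0', 'cleaned')
store.addFlag('1-0.0', 'binarised')
```

Using a set means that adding the same flag twice has no further effect, and an unflagged column yields an empty set rather than an error.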
- property index
Returns the subject indices.
- property indexColumns
Returns a list of all index columns present in the data.
- property isSubtable
Returns True if this DataTable was created by a subtable() call.
- property manager
Returns a multiprocessing.Manager for sharing data between processes.
- merge(subtables)
Merge the data from the given subtables into this DataTable.
It is assumed that the subtables each contain a sub-set of the columns in this DataTable.
Note
The merge() method cannot be used to merge in a subtable that contains a subset of rows.
- Parameters:
subtables – A single DataTable, or a sequence of DataTable instances, returned by subtable().
- property parallel
Returns True if this invocation of funpack has the ability to run tasks in parallel, False otherwise.
- pool()
Return a multiprocessing.Pool object for performing tasks in parallel on the data. If the njobs argument passed to __init__() was 1, or if this is a DataTable instance that has been unpickled (i.e. instances which are running in a sub-process), a multiprocessing.dummy.Pool instance is created and returned.
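The fallback described above can be sketched as follows. This is an assumption-based illustration rather than funpack's code: `make_pool` is a hypothetical helper, and the `in_subprocess` argument stands in for however the real class detects that it has been unpickled.

```python
import multiprocessing as mp
import multiprocessing.dummy as mpdummy


def make_pool(njobs, in_subprocess=False):
    """Return a process pool, or a thread-backed dummy pool as fallback."""
    if njobs == 1 or in_subprocess:
        # multiprocessing.dummy.Pool has the same API but uses threads,
        # so no further processes are spawned.
        return mpdummy.Pool(1)
    return mp.Pool(njobs)


with make_pool(1) as pool:
    squares = pool.map(lambda x: x * x, [1, 2, 3])
```

Because both pool types share an interface, calling code can use map or starmap without caring which one it was given.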
- present(variable, visit=None, instance=None)
Returns True if the specified variable (and optionally visit/instance) is present in the data, False otherwise.
- property proctable
Returns the pandas.DataFrame containing the processing table.
- removeColumns(cols)
Remove the columns described by cols.
- Parameters:
cols – Sequence of Column objects to remove.
- property shape
Returns the (nrows, ncolumns) shape of the data.
- subtable(columns=None, rows=None)
Return a new DataTable which only contains the data for the specified columns and/or rows.
This method can be used to create a replica DataTable where the underlying pandas.DataFrame only contains the specified columns. It is intended to be used when parallelising tasks, so that child processes are given a view of only the relevant columns.
Note
The merge() method cannot be used to merge in a subtable that contains a subset of rows.
- property variables
Returns a list of all integer variable IDs present in the data.
The list includes the index variable (which has an id of 0).
- property vartable
Returns the pandas.DataFrame containing the variable table.