funpack.loadtables
This module provides functions and logic to load the variable, data coding, type, processing, and category tables used by funpack.
The variable table is a pandas.DataFrame
which contains metadata about all
UK Biobank variables in the input data files, along with cleaning rules. The
data coding and type tables contain the same information about all UK Biobank
data codings and types - these are merged into the variable table after
being loaded. All of these tables are loaded by the loadVariableTable()
function.
The processing table contains an ordered list of processing functions to be applied to the input data.
The category table contains collections of variable groupings; it is used to allow the user to select these groups by name, rather than having to use variable IDs.
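loadTables(): Loads the data tables used to run funpack.
loadVariableTable(): Given variable table and datacoding table file names, builds and returns the variable table.
addNewVariable(): Add a new row to the variable table.
loadProcessingTable(): Loads the processing table from the given file.
loadCategoryTable(): Loads the category table from the given file.
categoryVariables(): Returns a list of variable IDs from cattable which correspond to the strings in categories.
columnTypes(): Retrieves the type of each column in columns as listed in vartable.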
- funpack.loadtables.CATTABLE_COLUMNS = ['ID', 'Category', 'Variables']
The columns that must be in a category table file.
- funpack.loadtables.CATTABLE_CONVERTERS = {'Variables': <function convert_category_variables>}
Custom converter functions to use for some columns in the category table.
- funpack.loadtables.CATTABLE_DTYPES = {'ID': <class 'numpy.int32'>}
Types to use for some columns in the category table.
- funpack.loadtables.DCTABLE_COLUMNS = ['ID', 'NAValues', 'RawLevels', 'NewLevels']
The columns that must be in a datacoding table file.
- funpack.loadtables.DCTABLE_DTYPES = {'ID': <class 'numpy.uint32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'RawLevels': <class 'object'>}
Types to use for some columns in the data coding table.
- funpack.loadtables.IMPLICIT_CATEGORIES = {'uncategorised': -2, 'unknown': -1}
This dict contains some categories which are automatically/implicitly added to the category table by the loadTables() function (via a call to addImplicitCategories()).
- funpack.loadtables.PROCTABLE_COLUMNS = ['Variable', 'Process']
The columns that must be in a processing table file.
- funpack.loadtables.PROCTABLE_CONVERTERS = {'Process': functools.partial(<function convert_Process>, 'processor')}
Custom converter functions to use for some columns in the processing table.
- funpack.loadtables.TYPETABLE_COLUMNS = ['Type', 'Clean']
The columns that must be in a type table file.
- funpack.loadtables.TYPETABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'Type': <function convert_type>}
Custom converter functions to use for some columns in the type table.
- funpack.loadtables.TYPETABLE_DTYPES = {'Clean': <class 'object'>, 'Type': <class 'object'>}
Types to use for some columns in the types table.
- funpack.loadtables.VARTABLE_COLUMNS = ['ID', 'Type', 'InternalType', 'Description', 'DataCoding', 'Instancing', 'NAValues', 'RawLevels', 'NewLevels', 'ParentValues', 'ChildValues', 'Clean']
The columns that must be in a variable table file.
- funpack.loadtables.VARTABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'InternalType': <function convert_dtype>, 'ParentValues': <function convert_ParentValues>, 'Type': <function convert_type>}
Custom converter functions to use for some columns in the variable table.
- funpack.loadtables.VARTABLE_DTYPES = {'ChildValues': <class 'object'>, 'Clean': <class 'object'>, 'DataCoding': <class 'numpy.float32'>, 'Description': <class 'object'>, 'ID': <class 'numpy.uint32'>, 'Instancing': <class 'numpy.float32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'ParentValues': <class 'object'>, 'RawLevels': <class 'object'>, 'Type': <class 'object'>}
Types to use for some columns in the variable table.
- funpack.loadtables.addImplicitCategories(cattable: DataFrame, unknown: Sequence[Column], uncat: Sequence[Column])
Adds some implicit/automatic categories to the category table.
The following implicit categories are added:
unknown: Variables which are not present in the variable table - this comprises non-UKB variables, or new UKB variables which are not described by the internal FUNPACK variable information in funpack/schema/.
uncategorised: Variables which are not present in any other category.
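The effect of adding the two implicit categories can be sketched as follows. This is a minimal illustration, not the funpack implementation: the category table is modelled as a list of dicts rather than a pandas.DataFrame, the inputs are plain variable IDs rather than Column objects, and add_implicit_categories is a hypothetical name. Only the IMPLICIT_CATEGORIES values are taken from this page.

```python
# Values copied from funpack.loadtables.IMPLICIT_CATEGORIES as listed above.
IMPLICIT_CATEGORIES = {'unknown': -1, 'uncategorised': -2}


def add_implicit_categories(cattable, unknown_vids, uncat_vids):
    """Return a copy of cattable with the two implicit categories appended.

    cattable is a list of {'ID', 'Category', 'Variables'} dicts (a
    hypothetical stand-in for the real category table). The two lists of
    variable IDs are assumed to have been identified already.
    """
    implicit = [
        ('unknown', unknown_vids),
        ('uncategorised', uncat_vids),
    ]
    result = [dict(row) for row in cattable]
    for name, vids in implicit:
        result.append({
            'ID': IMPLICIT_CATEGORIES[name],
            'Category': name,
            'Variables': sorted(set(vids)),
        })
    return result


# Arbitrary illustrative IDs, not real UK Biobank fields.
cattable = [{'ID': 1, 'Category': 'example', 'Variables': [10, 11]}]
extended = add_implicit_categories(cattable, [900], [21, 20])
```

Negative IDs keep the implicit categories from colliding with user-defined category IDs, which is presumably why -1 and -2 are used.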
- funpack.loadtables.addNewVariable(vartable: DataFrame, vid: int, name: str, dtype: dtype | None = None, instancing: int | None = None)
Add a new row to the variable table.
The instancing argument defines the meaning of the Column.visit field for columns associated with this variable. The default value is 2, meaning that this variable may be associated with columns corresponding to measurements acquired at different UK Biobank assessments and imaging visits. See https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=9 and https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=10 for more details.
Note: If an entry for the specified vid already exists in vartable, the name, dtype and instancing arguments are ignored and the existing information in vartable will take precedence.
- Parameters:
vartable – The variable table
vid – Integer variable ID
name – Variable name - used as the description
dtype – numpy data type. If None, the variable type is set to util.CTYPES.unknown.
instancing – Instancing code for the new variable - defaults to 2
- funpack.loadtables.categoryVariables(cattable: DataFrame, categories: Sequence[int | str]) -> List[int]
Returns a list of variable IDs from cattable which correspond to the strings in categories.
- Parameters:
cattable – The category table.
categories – Sequence of integer category IDs or label sub-strings specifying the categories to return.
- Returns:
A list of variable IDs.
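Selecting categories by integer ID or by label sub-string, as described above, might look like the following sketch. The list-of-dicts table and the category_variables name are hypothetical stand-ins, and the case-insensitive matching and de-duplication are assumptions rather than documented behaviour.

```python
def category_variables(cattable, categories):
    """Collect variable IDs for the requested categories.

    cattable:   list of {'ID', 'Category', 'Variables'} dicts (hypothetical
                stand-in for the category table).
    categories: integer category IDs, or sub-strings of category labels.
    """
    vids = []
    for cat in categories:
        for row in cattable:
            if isinstance(cat, int):
                matched = row['ID'] == cat
            else:
                # Assumption: case-insensitive sub-string match on the label.
                matched = cat.lower() in row['Category'].lower()
            if matched:
                for vid in row['Variables']:
                    if vid not in vids:   # keep first-seen order, no repeats
                        vids.append(vid)
    return vids


# Arbitrary illustrative categories and IDs.
cattable = [
    {'ID': 1, 'Category': 'age', 'Variables': [10, 11]},
    {'ID': 2, 'Category': 'blood pressure', 'Variables': [20, 21]},
    {'ID': 3, 'Category': 'blood assays', 'Variables': [30, 21]},
]
selected = category_variables(cattable, [1, 'blood'])
```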
- funpack.loadtables.columnTypes(vartable: DataFrame, columns: Sequence[Column]) -> Tuple[List[CTYPES], Dict[str, dtype]]
Retrieves the type of each column in columns as listed in vartable. Also identifies a suitable internal data type to use for each column where possible.
- Parameters:
vartable – The variable table.
columns – List of Column objects.
- Returns:
A tuple containing:
A list containing the type for each column in columns - an identifier from the util.CTYPES enum. Columns corresponding to a variable which is not in the variable table are given a type of None.
A dict of { column_name : dtype } mappings containing a suitable internal data type to use for some columns.
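The shape of the returned tuple can be illustrated with a small sketch. Everything here is a simplification: Column is a hypothetical two-field namedtuple, the variable table is a plain dict, and strings stand in for util.CTYPES identifiers and numpy dtypes.

```python
from collections import namedtuple

# Hypothetical, reduced stand-in for funpack's Column class.
Column = namedtuple('Column', ['name', 'vid'])


def column_types(vartable, columns):
    """Return (types, dtypes) for the given columns.

    vartable maps vid -> {'Type': ..., 'InternalType': ...}. A column whose
    variable is missing from the table gets a type of None, and a dtype
    entry is only produced when an internal type is known.
    """
    types = []
    dtypes = {}
    for col in columns:
        row = vartable.get(col.vid)
        if row is None:
            types.append(None)
            continue
        types.append(row['Type'])
        if row.get('InternalType') is not None:
            dtypes[col.name] = row['InternalType']
    return types, dtypes


# Arbitrary illustrative variables and column names.
vartable = {
    1: {'Type': 'categorical', 'InternalType': None},
    2: {'Type': 'continuous', 'InternalType': 'float32'},
}
columns = [Column('1-0.0', 1), Column('2-0.0', 2), Column('extra', 3)]
types, dtypes = column_types(vartable, columns)
```

The first element has one entry per column, while the dict only covers columns for which an internal type could be determined.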
- funpack.loadtables.convert_ParentValues(val: str) -> List[VariableExpression] | Literal[nan]
Convert a string containing a sequence of comma-separated ParentValue expressions into a sequence of VariableExpression objects.
- funpack.loadtables.convert_Process(ptype: str, val: str) -> Dict[str, Process]
Convert a string containing a sequence of comma-separated Process or Clean expressions into an OrderedDict of Process objects (with the process names used as dictionary keys).
- funpack.loadtables.convert_Process_Variable(val: str, cattable: DataFrame | None = None) -> Tuple[str, List[int]]
Convert a string containing a process variable specification.
A process variable specification comprises one or more comma-separated:
integer variable IDs,
MATLAB-style start:step:stop ranges denoting a range of variable IDs,
category IDs, denoted as 'cat<ID>', e.g. 'cat25'.
A specification may be preceded by one of:
'all', indicating that the process is to be applied to all variables simultaneously (this is the default).
'independent,', followed by one or more comma-separated variable IDs, indicating that the process is to be applied to the specified variables independently.
'all_independent', indicating that the process is to be applied to all variables independently.
'all_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables simultaneously, except for the specified variables.
'all_independent_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables independently, except for the specified variables.
- Returns:
A tuple containing:
The process variable type - one of 'all', 'all_independent', 'all_except', 'all_independent_except', 'independent', or 'vids'.
A list of variable IDs (empty if the process variable type is 'all' or 'all_independent').
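A rough sketch of how such a specification could be parsed is shown below. It is not the funpack parser: parse_process_variable and expand_token are hypothetical names, categories are passed as a plain dict of category ID to variable IDs instead of a category table, ranges are assumed to follow the MATLAB order start:step:stop with an inclusive stop and a positive step, and a specification without a recognised prefix is assumed to yield the 'vids' type.

```python
PREFIXES = ('all_independent_except', 'all_independent', 'all_except',
            'independent', 'all')


def expand_token(token, categories):
    """Expand one token: an integer ID, a range, or a 'cat<ID>' reference."""
    if token.startswith('cat'):
        # Raises KeyError if the category is unknown or no categories given.
        return list(categories[int(token[3:])])
    parts = [int(p) for p in token.split(':')]
    if len(parts) == 1:
        return parts
    start, stop = parts[0], parts[-1]
    step = parts[1] if len(parts) == 3 else 1
    return list(range(start, stop + 1, step))   # stop is inclusive


def parse_process_variable(spec, categories=None):
    """Return (process variable type, list of variable IDs)."""
    tokens = [t.strip() for t in spec.split(',') if t.strip()]
    ptype = 'vids'
    if tokens and tokens[0] in PREFIXES:
        ptype = tokens.pop(0)
    vids = []
    for token in tokens:
        vids.extend(expand_token(token, categories or {}))
    return ptype, vids


everything = parse_process_variable('all')
independent = parse_process_variable('independent,1,5:7')
excluded = parse_process_variable('all_except,10:10:30')
by_category = parse_process_variable('3,cat25', {25: [40, 41]})
```

Here a 'cat<ID>' token without a matching category raises an error, in line with the requirement that category references need a category table.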
- funpack.loadtables.convert_category_variables(val: str) -> List[int]
Convert a string containing a sequence of comma-separated variable IDs or ranges into a list of variable IDs. Variables may be specified as integer IDs, or via a MATLAB-style start:step:stop range. See util.parseMatlabRange().
- funpack.loadtables.convert_comma_sep_numbers(val: str) -> ndarray | Literal[nan]
Convert a string containing comma-separated numbers into a numpy array.
- funpack.loadtables.convert_comma_sep_text(val: str) -> List[str] | Literal[nan]
Convert a string containing comma-separated text into a list.
- funpack.loadtables.convert_dtype(val: str) -> dtype | Literal[nan]
Convert a string containing a numpy.dtype (e.g. 'float32') into a dtype object.
- funpack.loadtables.convert_type(val: str) -> CTYPES
Convert a string containing a UK Biobank type into a numerical identifier for that type - see funpack.util.CTYPES.
- funpack.loadtables.identifyUncategorisedVariables(fileinfo: FileInfo, cattable: DataFrame) -> List[Column]
Called by loadTables(). Identifies all variables which are in the data file(s), but which are uncategorised (not present in any categories in the category table).
Such variables might have been overlooked, so the user may need to be warned about them.
- funpack.loadtables.loadCategoryTable(catfile: str | None = None) -> DataFrame
Loads the category table from the given file.
- Parameters:
catfile – Path to the category file.
- funpack.loadtables.loadProcessingTable(procfile: str | None = None, skipProcessing: bool = False, prependProcess: Sequence[Tuple[List[int], str]] | None = None, appendProcess: Sequence[Tuple[List[int], str]] | None = None, cattable: DataFrame | None = None, **kwargs) -> DataFrame
Loads the processing table from the given file.
- Parameters:
procfile – Path to the processing table file.
skipProcessing – If True, the processing table is not loaded from procfile. The prependProcess and appendProcess arguments are still applied.
prependProcess – Sequence of (varids, procstr) mappings specifying processes to prepend to the beginning of the processing table.
appendProcess – Sequence of (varids, procstr) mappings specifying processes to append to the end of the processing table.
cattable – pandas.DataFrame containing variable categories. If not provided, any processing rules which refer to categories cannot be processed, and will result in an error.
All other keyword arguments are ignored.
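The interaction between skipProcessing, prependProcess and appendProcess described above amounts to simple list assembly. The sketch below uses a hypothetical build_processing_table function operating on already-parsed rules, and the process names are placeholders rather than real funpack processing functions.

```python
def build_processing_table(file_rows, skip_processing=False,
                           prepend=None, append=None):
    """Assemble an ordered list of (variables, process) rules.

    file_rows stands in for the rules read from the processing table file.
    When skip_processing is True those rules are dropped, but the prepended
    and appended rules are still applied.
    """
    rows = [] if skip_processing else list(file_rows)
    return list(prepend or []) + rows + list(append or [])


# 'procA', 'procB' and 'procC' are placeholder process names.
file_rows = [('all', 'procB')]
full = build_processing_table(file_rows,
                              prepend=[([1, 2], 'procA')],
                              append=[([3], 'procC')])
skipped = build_processing_table(file_rows,
                                 skip_processing=True,
                                 prepend=[([1, 2], 'procA')])
```

Because the processing table is an ordered list, the position of a rule determines when it runs relative to the others.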
- funpack.loadtables.loadTableBases() -> Tuple[DataFrame, DataFrame]
Loads the UK Biobank variable and data coding schema files.
This function is called by loadVariableTable(). It loads the UK Biobank variable and data coding schema files (available from the UK Biobank data showcase web site), and returns the information contained within as two pandas.DataFrame objects. These dataframes are then used as bases for the funpack variable table.
Information in the base tables is loaded from the following files:
field.txt: A list of all UKB variables
encoding.txt: A list of all UKB data codings
type.txt: A list of vid : type mappings for certain variables, where type is the name of a numpy data type (e.g. float32).
- Returns:
A tuple containing:
a pandas.DataFrame to be used as the base for the variable table
a pandas.DataFrame to be used as the base for the datacoding table
- funpack.loadtables.loadTables(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, procfile: str | None = None, catfile: str | None = None, **kw) -> Tuple[DataFrame, DataFrame, DataFrame, List[Column], List[Column]]
Loads the data tables used to run funpack.
- Parameters:
fileinfo – FileInfo object describing the input data files.
varfiles – Path to one or more partial variable table files
dcfiles – Path to one or more partial data coding table files
typefile – Path to the type table file
procfile – Path to the processing table file
catfile – Path to the category table file
All other arguments are passed through to the loadVariableTable() and loadProcessingTable() functions.
- funpack.loadtables.loadVariableTable(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, noBuiltins: bool = False, naValues: Dict[int, str] | None = None, childValues: Dict[int, Tuple[str, str]] | None = None, recoding: Dict[int, Tuple[str, str]] | None = None, clean: Dict[int, str] | None = None, typeClean: Dict[CTYPES, str] | None = None, globalClean: str | None = None, dropAbsent: bool = True, **kwargs) -> Tuple[DataFrame, Sequence[Column], Sequence[Column]]
Given variable table and datacoding table file names, builds and returns the variable table.
- Parameters:
fileinfo – FileInfo object describing the input data files.
varfiles – Path(s) to one or more variable files
dcfiles – Path(s) to one or more data coding files
typefile – Path to the type file
noBuiltins – If True, the built-in variable and datacoding base tables are not loaded.
naValues – Dictionary of {vid : values} mappings, specifying values which should be replaced with NA. The values are expected to be strings of comma-separated values.
childValues – Dictionary of {vid : (exprs, values)} mappings, specifying parent value expressions, and corresponding child values. The expressions and values are expected to be strings of comma-separated values of the same length.
recoding – Dictionary of {vid : (rawlevel, newlevel)} mappings. The raw and new levels are expected to be strings of comma-separated values of the same length.
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean. The expressions are expected to be strings.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file. The expressions are expected to be strings.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after any variable-specific cleaning specified in the variable table, or via clean or typeClean. The expression is expected to be a string.
dropAbsent – If True (the default), remove all variables from the variable table which are not present in the data file(s).
All other keyword arguments are ignored.
- Returns:
A tuple containing:
A pandas.DataFrame containing the variable table.
A sequence of Column objects representing variables which were present in the data files but not in the variable table, and which were therefore added to the variable table.
A sequence of Column objects representing variables which were present in the data files and in the variable table, but which did not have any cleaning rules specified.
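The recoding and childValues arguments both pair two comma-separated strings that must have the same number of entries. The sketch below shows one way a caller could check a recoding dictionary before passing it in. split_levels is a hypothetical helper, and raising ValueError is an assumption made for this example, not documented funpack behaviour.

```python
def split_levels(raw, new):
    """Split a (rawlevel, newlevel) pair of comma-separated strings.

    Returns a {raw level: new level} dict of strings, and raises ValueError
    if the two lists differ in length.
    """
    raw_levels = [v.strip() for v in raw.split(',')]
    new_levels = [v.strip() for v in new.split(',')]
    if len(raw_levels) != len(new_levels):
        raise ValueError(f'{len(raw_levels)} raw levels but '
                         f'{len(new_levels)} new levels')
    return dict(zip(raw_levels, new_levels))


# {vid : (rawlevel, newlevel)} in the format described for recoding;
# the variable ID is arbitrary.
recoding = {1234: ('1,2,3', '10,20,30')}
mappings = {vid: split_levels(raw, new)
            for vid, (raw, new) in recoding.items()}
```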
- funpack.loadtables.mergeCleanFunctions(vartable: DataFrame, tytable: DataFrame, clean: Dict[int, str], typeClean: Dict[str, str], globalClean: str)
Merges custom clean functions into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table.
tytable – The type table
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after any variable-specific cleaning specified in the variable table, or via clean or typeClean.
- funpack.loadtables.mergeDataCodingTable(vartable: DataFrame, dctable: DataFrame)
Merges information from the data coding table into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table.
dctable – The data coding table.
- funpack.loadtables.mergeIntoVariableTable(vartable: DataFrame, cols: Sequence[str], mapping: str | Dict[int, Any])
Merge data from mapping into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table
cols – Names of columns in the variable table
mapping – Dict of {vid : values} mappings containing the data to copy in.
- funpack.loadtables.mergeTableFiles(base: DataFrame, fnames: List[str], what: str, dtypes: Dict[str, Type], converters: Dict[str, Callable], columns: List[str]) -> DataFrame
Load and merge one or more table files.
This function is called by loadVariableTable() to load the variable, data coding, and type table files.
The variable and datacoding tables can be loaded from multiple files, with each file containing part of the full table. All provided files are merged into one table. The final table for a given set of files is the outer join on the index column (assumed to be the first column in each file), where non-NA values in overlapping columns from later files will overwrite the values in earlier files.
- Parameters:
base – Table containing base information - used for the variable and datacoding tables (see loadTableBases()).
fnames – List of files to load. If None, the base table is returned (or an empty table if a base is not given).
what – Name of the table files being loaded - used solely for log messages
dtypes – Dict containing {column : datatype} mappings
converters – Dict containing {column : convertfunc} mappings
columns – Expected column names
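The merge rule described above (outer join on the index, with non-NA values from later files taking precedence) can be sketched without pandas. In this illustration each table is a dict of index to row dict, None stands for NA, and merge_tables is a hypothetical name. The real function operates on pandas.DataFrame objects loaded from files.

```python
def merge_tables(base, tables):
    """Outer-join row dicts on their index; later non-NA values win.

    Each table is a dict of index -> {column: value}; None marks NA.
    Unlike a DataFrame, columns missing from a row are simply absent here
    rather than filled with NA.
    """
    merged = {idx: dict(row) for idx, row in (base or {}).items()}
    for table in tables or []:
        for idx, row in table.items():
            target = merged.setdefault(idx, {})
            for column, value in row.items():
                # An NA value never replaces an existing entry.
                if value is not None or column not in target:
                    target[column] = value
    return merged


base = {1: {'Description': 'a', 'NAValues': None}}
first = {1: {'NAValues': '-1'}, 2: {'Description': 'b'}}
second = {1: {'NAValues': None, 'Description': 'A'}}
merged = merge_tables(base, [first, second])
```

Row 1 keeps the NAValues entry from the first file, because the NA in the second file does not overwrite it, while its Description is replaced by the later non-NA value.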
- funpack.loadtables.sanitiseVariableTable(vartable: DataFrame, cols: Sequence[Column], dropAbsent: bool) -> List[Column]
Ensures that the variable table contains an entry for every variable in the input data.
Called by loadVariableTable().
- Parameters:
vartable – pandas.DataFrame containing the variable table.
cols – Sequence of Column objects representing the columns in the input data.
dropAbsent – If True, entries in the table for variables which are not in cols will be removed.
- Returns:
A list of unknown Column objects, i.e. representing variables which were not in the variable table.
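The two steps this function performs (adding entries for unknown variables and, optionally, dropping entries for absent ones) can be sketched as follows. The dict-based table, the sanitise name, the use of plain variable IDs instead of Column objects, and the placeholder description are all assumptions made for illustration.

```python
def sanitise(vartable, data_vids, drop_absent=True):
    """Make vartable cover every vid in data_vids; return the unknown vids.

    vartable is a hypothetical dict of vid -> row dicts, modified in place.
    """
    unknown = [vid for vid in data_vids if vid not in vartable]
    for vid in unknown:
        # Placeholder row; the real table has many more columns.
        vartable[vid] = {'Description': f'unknown variable {vid}'}
    if drop_absent:
        present = set(data_vids)
        for vid in list(vartable):
            if vid not in present:
                del vartable[vid]
    return unknown


# Arbitrary illustrative IDs: 1 is absent from the data, 3 is unknown.
vartable = {1: {'Description': 'one'}, 2: {'Description': 'two'}}
unknown = sanitise(vartable, [2, 3])
```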