`funpack.loadtables`

This module provides functions and logic to load the variable, data coding, type, processing, and category tables used by funpack.

The variable table is a pandas.DataFrame which contains metadata about all UK Biobank variables in the input data files, along with cleaning rules. The data coding and type tables contain the same information about all UK Biobank data codings and types - these are merged into the variable table after being loaded. All of these tables are loaded by the loadVariableTable() function.

The processing table contains an ordered list of processing functions to be applied to the input data.

The category table contains collections of variable groupings; it is used to allow the user to select these groups by name, rather than having to use variable IDs.

`loadTables`	Loads the data tables used to run `funpack`.
`loadVariableTable`	Given variable table and datacoding table file names, builds and returns the variable table.
`addNewVariable`	Add a new row to the variable table.
`loadProcessingTable`	Loads the processing table from the given file.
`loadCategoryTable`	Loads the category table from the given file.
`categoryVariables`	Returns a list of variable IDs from `cattable` which correspond to the strings in `categories`.
`columnTypes`	Retrieves the type of each column in `columns` as listed in `vartable`.

funpack.loadtables.CATTABLE_COLUMNS = ['ID', 'Category', 'Variables']: The columns that must be in a category table file.

funpack.loadtables.CATTABLE_CONVERTERS = {'Variables': <function convert_category_variables>}: Custom converter functions to use for some columns in the category table.

funpack.loadtables.CATTABLE_DTYPES = {'ID': <class 'numpy.int32'>}: Types to use for some columns in the category table.

funpack.loadtables.DCTABLE_COLUMNS = ['ID', 'NAValues', 'RawLevels', 'NewLevels']: The columns that must be in a datacoding table file.

funpack.loadtables.DCTABLE_DTYPES = {'ID': <class 'numpy.uint32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'RawLevels': <class 'object'>}: Types to use for some columns in the data coding table.

funpack.loadtables.IMPLICIT_CATEGORIES = {'uncategorised': -2, 'unknown': -1}: This dict contains some categories which are automatically/implicitly added to the category table by the loadTables() function (via a call to addImplicitCategories()).

funpack.loadtables.PROCTABLE_COLUMNS = ['Variable', 'Process']: The columns that must be in a processing table file.

funpack.loadtables.PROCTABLE_CONVERTERS = {'Process': functools.partial(<function convert_Process>, 'processor')}: Custom converter functions to use for some columns in the processing table.

funpack.loadtables.TYPETABLE_COLUMNS = ['Type', 'Clean']: The columns that must be in a type table file.

funpack.loadtables.TYPETABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'Type': <function convert_type>}: Custom converter functions to use for some columns in the type table.

funpack.loadtables.TYPETABLE_DTYPES = {'Clean': <class 'object'>, 'Type': <class 'object'>}: Types to use for some columns in the types table.

funpack.loadtables.VARTABLE_COLUMNS = ['ID', 'Type', 'InternalType', 'Description', 'DataCoding', 'Instancing', 'NAValues', 'RawLevels', 'NewLevels', 'ParentValues', 'ChildValues', 'Clean']: The columns that must be in a variable table file.

funpack.loadtables.VARTABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'InternalType': <function convert_dtype>, 'ParentValues': <function convert_ParentValues>, 'Type': <function convert_type>}: Custom converter functions to use for some columns in the variable table.

funpack.loadtables.VARTABLE_DTYPES = {'ChildValues': <class 'object'>, 'Clean': <class 'object'>, 'DataCoding': <class 'numpy.float32'>, 'Description': <class 'object'>, 'ID': <class 'numpy.uint32'>, 'Instancing': <class 'numpy.float32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'ParentValues': <class 'object'>, 'RawLevels': <class 'object'>, 'Type': <class 'object'>}: Types to use for some columns in the variable table.

funpack.loadtables.addImplicitCategories(cattable: DataFrame, unknown: Sequence[Column], uncat: Sequence[Column])[source]

Adds some implicit/automatic categories to the category table.

The following implicit categories are added:

unknown: Variables which are not present in the variable table - this comprises non-UKB variables, or new UKB variables which are not described by the internal FUNPACK variable information in funpack/schema/.

uncategorised: Variables which are not present in any other category.

Parameters:

cattable – The category table.
unknown – Sequence of Column objects representing variables to add to an “unknown” category.
uncat – Sequence of Column objects representing variables to add to an “uncategorised” category.

funpack.loadtables.addNewVariable(vartable: DataFrame, vid: int, name: str, dtype: dtype | None = None, instancing: int | None = None)[source]

Add a new row to the variable table.

The instancing argument defines the meaning of the Column.visit field for columns associated with this variable. The default value is 2, meaning that this variable may be associated with columns corresponding to measurements acquired at different UK Biobank assessments and imaging visits. See https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=9 and https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=10 for more details.

Note

If an entry for the specified vid already exists in vartable, the name, dtype and instancing arguments are ignored and the existing information in vartable will take precedence.

Parameters:

vartable – The variable table
vid – Integer variable ID
name – Variable name - used as the description
dtype – numpy data type. If None, the variable type is set to util.CTYPES.unknown.
instancing – Instancing code for the new variable - defaults to 2

funpack.loadtables.categoryVariables(cattable: DataFrame, categories: Sequence[int | str]) → List[int][source]

Returns a list of variable IDs from cattable which correspond to the strings in categories.

Parameters:

cattable – The category table.
categories – Sequence of integer category IDs or label sub-strings specifying the categories to return.

Returns:

A list of variable IDs.
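
The selection rule described above (integer category IDs or label sub-strings) can be illustrated with a minimal plain-Python sketch. It is not funpack's implementation: the category table is modelled as a list of dicts rather than a pandas.DataFrame, `select_category_variables` is a hypothetical name, the variable IDs are made up, and the case-insensitive matching and de-duplication are assumptions.

```python
from typing import Any, Dict, List, Sequence, Union


def select_category_variables(
        cattable: List[Dict[str, Any]],
        categories: Sequence[Union[int, str]]) -> List[int]:
    """Collect the variable IDs of every category matched by ``categories``.

    Hypothetical stand-in for the behaviour described above. Integers are
    compared with the 'ID' column, strings are matched as sub-strings of the
    'Category' column (case-insensitively, which is an assumption).
    """
    vids: List[int] = []
    for wanted in categories:
        for row in cattable:
            if isinstance(wanted, int):
                matched = row['ID'] == wanted
            else:
                matched = wanted.lower() in row['Category'].lower()
            if not matched:
                continue
            # Keep first-seen order and skip duplicates (an assumption).
            for vid in row['Variables']:
                if vid not in vids:
                    vids.append(vid)
    return vids


# Illustrative category table using the documented column names.
cattable = [
    {'ID': 1, 'Category': 'age',       'Variables': [1, 2]},
    {'ID': 2, 'Category': 'brain mri', 'Variables': [3, 1]},
]

print(select_category_variables(cattable, [1, 'MRI']))
```

Selecting by label sub-string is what lets a user write a short name instead of listing variable IDs.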

funpack.loadtables.columnTypes(vartable: DataFrame, columns: Sequence[Column]) → Tuple[List[CTYPES], Dict[str, dtype]][source]

Retrieves the type of each column in columns as listed in vartable. Also identifies a suitable internal data type to use for each column where possible.

Parameters:

vartable – The variable table.
columns – List of Column objects.

Returns:

A tuple containing:

A list containing the type for each column in columns - an identifier from the util.CTYPES enum. Columns corresponding to a variable which is not in the variable table are given a type of None.
A dict of { column_name : dtype } mappings containing a suitable internal data type to use for some columns.

funpack.loadtables.convert_ParentValues(val: str) → List[VariableExpression] | Literal[nan][source]: Convert a string containing a sequence of comma-separated ParentValue expressions into a sequence of VariableExpression objects.

funpack.loadtables.convert_Process(ptype: str, val: str) → Dict[str, Process][source]: Convert a string containing a sequence of comma-separated Process or Clean expressions into an OrderedDict of Process objects (with the process names used as dictionary keys).

funpack.loadtables.convert_Process_Variable(val: str, cattable: DataFrame | None = None) → Tuple[str, List[int]][source]

Convert a string containing a process variable specification.

A process variable specification comprises one or more comma-separated:

integer variable IDs,

MATLAB-style start:step:stop ranges denoting a range of variable IDs,

Category IDs, denoted as 'cat<ID>', e.g. 'cat25'.

A specification may be preceded by one of:

'all', indicating that the process is to be applied to all variables simultaneously (this is the default)

'independent,', followed by one or more comma-separated variable IDs, indicating that the process is to be applied to the specified variables independently.

'all_independent', indicating that the process is to be applied to all variables independently

'all_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables simultaneously, except for the specified variables.

'all_independent_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables independently, except for the specified variables.

Returns:

A tuple containing:

The process variable type - one of 'all', 'all_independent', 'all_except', 'all_independent_except', 'independent', or 'vids'
A list of variable IDs (empty if the process variable type is 'all' or 'all_independent').
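
To make the specification format concrete, here is a rough sketch of a parser that returns the same kind of (type, variable IDs) tuple. It is an illustration under assumptions, not the funpack code: `parse_process_variable` and `expand_token` are hypothetical names, categories are passed as a plain `{category ID : [variable IDs]}` dict instead of a category table, and ranges are assumed to follow the MATLAB start:step:stop convention with an inclusive stop.

```python
from typing import Dict, List, Optional, Tuple

# Prefix keywords listed in the documentation above.
PREFIXES = {'all', 'independent', 'all_independent',
            'all_except', 'all_independent_except'}


def expand_token(token: str) -> List[int]:
    """Expand '7', 'start:stop' or 'start:step:stop' (stop inclusive)."""
    parts = [int(p) for p in token.split(':')]
    if len(parts) == 1:
        return parts
    if len(parts) > 3:
        raise ValueError(f'Invalid range: {token!r}')
    step = parts[1] if len(parts) == 3 else 1
    return list(range(parts[0], parts[-1] + 1, step))


def parse_process_variable(
        spec: str,
        categories: Optional[Dict[int, List[int]]] = None
) -> Tuple[str, List[int]]:
    """Split a specification into its type and a flat list of variable IDs."""
    tokens = [t.strip() for t in spec.split(',') if t.strip()]

    # Without a leading keyword, the type is 'vids'.
    ptype = 'vids'
    if tokens and tokens[0] in PREFIXES:
        ptype = tokens.pop(0)

    vids: List[int] = []
    for token in tokens:
        if token.startswith('cat'):
            # 'cat25' refers to category 25, which needs a category lookup.
            if categories is None:
                raise ValueError(f'No categories available for {token!r}')
            vids.extend(categories[int(token[3:])])
        else:
            vids.extend(expand_token(token))
    return ptype, vids


print(parse_process_variable('all_except,4:6'))
print(parse_process_variable('1,cat25', {25: [30, 31]}))
```

Raising an error when a category is referenced without a category lookup mirrors the note on the cattable argument of loadProcessingTable(), which states that such rules result in an error.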

funpack.loadtables.convert_category_variables(val: str) → List[int][source]: Convert a string containing a sequence of comma-separated variable IDs or ranges into a list of variable IDs. Variables may be specified as integer IDs, or via a MATLAB-style start:step:stop range. See util.parseMatlabRange().
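
As an illustration of this kind of conversion, the sketch below expands a comma-separated string of IDs and ranges into a flat list. It is a hypothetical stand-in written for this page (`parse_variable_ids` is not a funpack function), and it assumes that ranges include the stop value and use a positive step, as MATLAB ranges do.

```python
from typing import List


def parse_variable_ids(spec: str) -> List[int]:
    """Expand a string such as '1,10:2:16,40:42' into integer IDs.

    Hypothetical helper, not funpack's implementation. Ranges are read as
    start:stop or start:step:stop, with the stop value included.
    """
    ids: List[int] = []
    for token in spec.split(','):
        token = token.strip()
        if not token:
            continue
        parts = [int(p) for p in token.split(':')]
        if len(parts) == 1:
            ids.append(parts[0])
        elif len(parts) == 2:
            start, stop = parts
            ids.extend(range(start, stop + 1))
        elif len(parts) == 3:
            start, step, stop = parts
            ids.extend(range(start, stop + 1, step))
        else:
            raise ValueError(f'Invalid range: {token!r}')
    return ids


print(parse_variable_ids('1,10:2:16,40:42'))
# prints [1, 10, 12, 14, 16, 40, 41, 42]
```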

funpack.loadtables.convert_comma_sep_numbers(val: str) → ndarray | Literal[nan][source]: Convert a string containing comma-separated numbers into a numpy array.

funpack.loadtables.convert_comma_sep_text(val: str) → List[str] | Literal[nan][source]: Convert a string containing comma-separated text into a list.

funpack.loadtables.convert_dtype(val: str) → dtype | Literal[nan][source]: Convert a string containing a numpy.dtype (e.g. 'float32') into a dtype object.

funpack.loadtables.convert_type(val: str) → CTYPES[source]: Convert a string containing a UK Biobank type into a numerical identifier for that type - see funpack.util.CTYPES.

funpack.loadtables.identifyUncategorisedVariables(fileinfo: FileInfo, cattable: DataFrame) → List[Column][source]

Called by loadTables(). Identifies all variables which are in the data file(s), but which are uncategorised (not present in any categories in the category table).

Such variables might have been overlooked, so the user may need to be warned about them.

Parameters:

fileinfo – FileInfo object.
cattable – Category table

Returns:

A list of Column objects associated with variables that are uncategorised.
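
Conceptually this is a set difference between the variables found in the data and the variables listed in any category. The sketch below shows that idea with plain integers in place of Column objects and a list of dicts in place of the category table. Both simplifications are assumptions, as is the name `find_uncategorised`.

```python
from typing import Any, Dict, List, Sequence


def find_uncategorised(data_vids: Sequence[int],
                       cattable: List[Dict[str, Any]]) -> List[int]:
    """Return the variable IDs in ``data_vids`` that no category lists.

    Hypothetical illustration: the documented function works on FileInfo
    and Column objects, whereas this sketch works on bare integer IDs.
    """
    categorised = set()
    for row in cattable:
        categorised.update(row['Variables'])
    # Preserve the order in which variables appear in the data.
    return [vid for vid in data_vids if vid not in categorised]


cattable = [
    {'ID': 1, 'Category': 'age', 'Variables': [1, 2]},
    {'ID': 2, 'Category': 'sex', 'Variables': [3]},
]

print(find_uncategorised([1, 3, 4, 5], cattable))
# prints [4, 5]
```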

funpack.loadtables.loadCategoryTable(catfile: str | None = None) → DataFrame[source]

Loads the category table from the given file.

Parameters:

catfile – Path to the category file.

funpack.loadtables.loadProcessingTable(procfile: str | None = None, skipProcessing: bool = False, prependProcess: Sequence[Tuple[List[int], str]] | None = None, appendProcess: Sequence[Tuple[List[int], str]] | None = None, cattable: DataFrame | None = None, **kwargs) → DataFrame[source]

Loads the processing table from the given file.

Parameters:

procfile – Path to the processing table file.
skipProcessing – If True, the processing table is not loaded from procfile. The prependProcess and appendProcess arguments are still applied.
prependProcess – Sequence of (varids, procstr) mappings specifying processes to prepend to the beginning of the processing table.
appendProcess – Sequence of (varids, procstr) mappings specifying processes to append to the end of the processing table.
cattable – pandas.DataFrame containing variable categories. If not provided, any processing rules which refer to categories cannot be processed, and will result in an error.

All other keyword arguments are ignored.

funpack.loadtables.loadTableBases() → Tuple[DataFrame, DataFrame][source]

Loads the UK Biobank variable and data coding schema files.

This function is called by loadVariableTable(). It loads the UK Biobank variable and data coding schema files (available from the UK Biobank data showcase web site), and returns the information contained within as two pandas.DataFrame objects. These dataframes are then used as bases for the funpack variable table.

Information in the base tables is loaded from the following files:

field.txt: A list of all UKB variables

encoding.txt: A list of all UKB data codings

type.txt: A list of vid : type mappings for certain variables, where type is the name of a numpy data type (e.g. float32).

Returns:

A tuple containing:

a pandas.DataFrame to be used as the base for the variable table
a pandas.DataFrame to be used as the base for the datacoding table

funpack.loadtables.loadTables(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, procfile: str | None = None, catfile: str | None = None, **kw) → Tuple[DataFrame, DataFrame, DataFrame, List[Column], List[Column]][source]

Loads the data tables used to run funpack.

Parameters:

fileinfo – FileInfo object describing the input data files.
varfiles – Path to one or more partial variable table files
dcfiles – Path to one or more partial data coding table files
typefile – Path to the type table file
procfile – Path to the processing table file
catfile – Path to the category table file

All other arguments are passed through to the loadVariableTable() and loadProcessingTable() functions.

Returns:

A tuple containing:

The variable table
The processing table
The category table
List of Column objects representing columns which were in the data file(s), but not in the variable table.
List of Column objects representing columns which are uncategorised.

funpack.loadtables.loadVariableTable(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, noBuiltins: bool = False, naValues: Dict[int, str] | None = None, childValues: Dict[int, Tuple[str, str]] | None = None, recoding: Dict[int, Tuple[str, str]] | None = None, clean: Dict[int, str] | None = None, typeClean: Dict[CTYPES, str] | None = None, globalClean: str | None = None, dropAbsent: bool = True, **kwargs) → Tuple[DataFrame, Sequence[Column], Sequence[Column]][source]

Given variable table and datacoding table file names, builds and returns the variable table.

Parameters:

fileinfo – FileInfo object describing the input data files.
varfiles – Path(s) to one or more variable files
dcfiles – Path(s) to one or more data coding files
typefile – Path to the type file
noBuiltins – If True, the built-in variable and datacoding base tables are not loaded.
naValues – Dictionary of {vid : values} mappings, specifying values which should be replaced with NA. The values are expected to be strings of comma-separated values.
childValues – Dictionary of {vid : (exprs, values)} mappings, specifying parent value expressions, and corresponding child values. The expressions and values are expected to be strings of comma-separated values of the same length.
recoding – Dictionary of {vid : (rawlevel, newlevel)} mappings. The raw and new levels are expected to be strings of comma-separated values of the same length.
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean. The expressions are expected to be strings.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file. The expressions are expected to be strings.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after variable-specific cleaning in the variable table, or specified via clean or typeClean. The expressions are expected to be strings.
dropAbsent – If True (the default), remove all variables from the variable table which are not present in the data file(s).

All other keyword arguments are ignored.

Returns:

A tuple containing:

A pandas.DataFrame containing the variable table
A sequence of Column objects representing variables which were present in the data files but absent from the variable table, and which were therefore added to the variable table.
A sequence of Column objects representing variables which were present in the data files and in the variable table, but which did not have any cleaning rules specified.

funpack.loadtables.mergeCleanFunctions(vartable: DataFrame, tytable: DataFrame, clean: Dict[int, str], typeClean: Dict[str, str], globalClean: str)[source]

Merges custom clean functions into the variable table.

Called by loadVariableTable().

Parameters:

vartable – The variable table.
tytable – The type table
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after variable-specific cleaning in the variable table, or specified via clean or typeClean.
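
The precedence rules stated for clean, typeClean and globalClean can be summarised in a small sketch. It is only an interpretation of the parameter descriptions above: `resolve_cleaning` is a hypothetical name, the tables are reduced to plain dicts, the cleaning expressions are placeholders, and the choice that cleaning from the variable file takes precedence over type-level cleaning is an assumption that the text does not state.

```python
from typing import Dict, List, Optional


def resolve_cleaning(vid: int,
                     vtype: str,
                     var_file_clean: Dict[int, str],
                     type_file_clean: Dict[str, str],
                     clean: Optional[Dict[int, str]] = None,
                     type_clean: Optional[Dict[str, str]] = None,
                     global_clean: Optional[str] = None) -> List[str]:
    """Return the cleaning expressions for one variable, in the order applied.

    Hypothetical sketch of the documented precedence, not funpack's code.
    """
    clean = clean or {}
    type_clean = type_clean or {}

    # typeClean overrides any cleaning given in the type file.
    type_level = {**type_file_clean, **type_clean}

    if vid in clean:
        # clean overrides both the variable file and typeClean.
        chosen = clean[vid]
    elif vid in var_file_clean:
        # ASSUMPTION: variable-file cleaning wins over type-level cleaning.
        chosen = var_file_clean[vid]
    else:
        chosen = type_level.get(vtype)

    exprs = [chosen] if chosen is not None else []
    # globalClean is performed after any variable-specific cleaning.
    if global_clean is not None:
        exprs.append(global_clean)
    return exprs


var_file = {10: 'fromVarFile()'}            # placeholder expressions
type_file = {'Integer': 'fromTypeFile()'}

print(resolve_cleaning(10, 'Integer', var_file, type_file,
                       clean={10: 'fromClean()'},
                       global_clean='fromGlobal()'))
# prints ['fromClean()', 'fromGlobal()']
```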

funpack.loadtables.mergeDataCodingTable(vartable: DataFrame, dctable: DataFrame)[source]

Merges information from the data coding table into the variable table.

Called by loadVariableTable().

Parameters:

vartable – The variable table.
dctable – The data coding table.

funpack.loadtables.mergeIntoVariableTable(vartable: DataFrame, cols: Sequence[str], mapping: str | Dict[int, Any])[source]

Merge data from mapping into the variable table.

Called by loadVariableTable().

Parameters:

vartable – The variable table
cols – Names of columns in the variable table
mapping – Dict of {vid : values} mappings containing the data to copy in.

funpack.loadtables.mergeTableFiles(base: DataFrame, fnames: List[str], what: str, dtypes: Dict[str, Type], converters: Dict[str, Callable], columns: List[str]) → DataFrame[source]

Load and merge one or more table files.

This function is called by loadVariableTable() to load the variable, data coding, and type table files.

The variable and datacoding tables can be loaded from multiple files, with each file containing part of the full table. All provided files are merged into one table. The final table for a given set of files is the outer join on the index column (assumed to be the first column in each file), where non-na values in overlapping columns from later files will overwrite the values in earlier files.

Parameters:

base – Table containing base information - used for the variable and datacoding tables (see loadTableBases()).
fnames – List of files to load. If None, the base table is returned (or an empty table if a base is not given).
what – Name of the table files being loaded - used solely for log messages
dtypes – Dict containing {column : datatype} mappings
converters – Dict containing {column : convertfunc} mappings
columns – Expected column names
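
The merge rule described above (an outer join on the index, with non-NA values from later files overwriting earlier ones) can be illustrated without pandas. In this sketch each table is a `{row ID : {column : value}}` dict and `None` stands in for an NA value. The `merge_tables` name, this data layout and the example values are assumptions made purely for illustration.

```python
from typing import Any, Dict, List

# {row ID : {column : value}}, with None marking a missing (NA) value.
Table = Dict[int, Dict[str, Any]]


def merge_tables(tables: List[Table]) -> Table:
    """Outer-join ``tables`` on row ID; later non-NA values win.

    Hypothetical plain-Python illustration of the documented merge rule.
    """
    # The merged table has the union of all columns, in first-seen order.
    columns: List[str] = []
    for table in tables:
        for row in table.values():
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged: Table = {}
    for table in tables:
        for rid, row in table.items():
            # New rows start with every column set to NA.
            target = merged.setdefault(rid, dict.fromkeys(columns))
            for col, value in row.items():
                if value is not None:
                    target[col] = value
    return merged


base = {
    1: {'Description': 'first',  'NAValues': None},
    2: {'Description': 'second', 'NAValues': '-1'},
}
extra = {
    2: {'NAValues': '-1,-3', 'Clean': None},
    3: {'Description': 'third', 'Clean': 'someCleaner()'},  # placeholder
}

merged = merge_tables([base, extra])
print(merged[2])
```

Row 2 keeps its description from the first table and takes the non-NA NAValues entry from the second, while the NA Clean entry in the second table overwrites nothing.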

funpack.loadtables.sanitiseVariableTable(vartable: DataFrame, cols: Sequence[Column], dropAbsent: bool) → List[Column][source]

Ensures that the variable table contains an entry for every variable in the input data.

Called by loadVariableTable().

Parameters:

vartable – pandas.DataFrame containing the variable table.
cols – Sequence of Column objects representing the columns in the input data.
dropAbsent – If True, entries in the table for variables which are not in cols will be removed.

Returns:

A list of unknown Column objects, i.e. representing variables which were not in the variable table.
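
A rough sketch of this behaviour follows, with the variable table reduced to a dict keyed by variable ID and the columns reduced to bare variable IDs. The `sanitise` name and the contents of the placeholder row are assumptions, and the real function operates on a pandas.DataFrame and Column objects.

```python
from typing import Any, Dict, List, Sequence


def sanitise(vartable: Dict[int, Dict[str, Any]],
             data_vids: Sequence[int],
             drop_absent: bool) -> List[int]:
    """Make sure ``vartable`` has a row for every variable in the data.

    Hypothetical illustration. Modifies ``vartable`` in place and returns
    the IDs of the variables that had to be added.
    """
    unknown: List[int] = []
    for vid in data_vids:
        if vid not in vartable:
            # Placeholder row; the real contents are not specified here.
            vartable[vid] = {'Description': f'Unknown variable {vid}'}
            unknown.append(vid)

    if drop_absent:
        present = set(data_vids)
        for vid in list(vartable):
            if vid not in present:
                del vartable[vid]
    return unknown


vartable = {1: {'Description': 'one'}, 2: {'Description': 'two'}}

# Variable 3 appears in two columns of the data but is added only once.
unknown = sanitise(vartable, [2, 3, 3], drop_absent=True)
print(unknown, sorted(vartable))
# prints [3] [2, 3]
```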

funpack.loadtables.variableCategories(cattable: DataFrame, vids: Sequence[int]) → Dict[int, List[str]][source]

Identify the categories for each variable in vids.

Parameters:

cattable – The category table
vids – Sequence of variable/datafield IDs

Returns:

A dict of {vid : [category]} mappings
