funpack.loadtables
This module provides functions and logic to load the variable, data coding, type, processing, and category tables used by funpack.
The variable table is a pandas.DataFrame which contains metadata about all
UK Biobank variables in the input data files, along with cleaning rules. The
data coding and type tables contain the same information about all UK Biobank
data codings and types - these are merged into the variable table after
being loaded. All of these tables are loaded by the loadVariableTable()
function.
The processing table contains an ordered list of processing functions to be applied to the input data.
The category table contains collections of variable groupings; it is used to allow the user to select these groups by name, rather than having to use variable IDs.
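loadTables: Loads the data tables used to run funpack.
loadVariableTable: Given variable table and datacoding table file names, builds and returns the variable table.
addNewVariable: Add a new row to the variable table.
loadProcessingTable: Loads the processing table from the given file.
loadCategoryTable: Loads the category table from the given file.
categoryVariables: Returns a list of variable IDs from cattable which correspond to the strings in categories.
columnTypes: Retrieves the type of each column in columns as listed in vartable.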
- funpack.loadtables.CATTABLE_COLUMNS = ['ID', 'Category', 'Variables']
The columns that must be in a category table file.
- funpack.loadtables.CATTABLE_CONVERTERS = {'Variables': <function convert_category_variables>}
Custom converter functions to use for some columns in the category table.
- funpack.loadtables.CATTABLE_DTYPES = {'ID': <class 'numpy.int32'>}
Types to use for some columns in the category table.
- funpack.loadtables.DCTABLE_COLUMNS = ['ID', 'NAValues', 'RawLevels', 'NewLevels']
The columns that must be in a datacoding table file.
- funpack.loadtables.DCTABLE_DTYPES = {'ID': <class 'numpy.uint32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'RawLevels': <class 'object'>}
Types to use for some columns in the data coding table.
- funpack.loadtables.IMPLICIT_CATEGORIES = {'uncategorised': -2, 'unknown': -1}
This dict contains some categories which are automatically/implicitly added to the category table by the loadTables() function (via a call to addImplicitCategories()).
- funpack.loadtables.PROCTABLE_COLUMNS = ['Variable', 'Process']
The columns that must be in a processing table file.
- funpack.loadtables.PROCTABLE_CONVERTERS = {'Process': functools.partial(<function convert_Process>, 'processor')}
Custom converter functions to use for some columns in the processing table.
- funpack.loadtables.TYPETABLE_COLUMNS = ['Type', 'Clean']
The columns that must be in a type table file.
- funpack.loadtables.TYPETABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'Type': <function convert_type>}
Custom converter functions to use for some columns in the type table.
- funpack.loadtables.TYPETABLE_DTYPES = {'Clean': <class 'object'>, 'Type': <class 'object'>}
Types to use for some columns in the types table.
- funpack.loadtables.VARTABLE_COLUMNS = ['ID', 'Type', 'InternalType', 'Description', 'DataCoding', 'Instancing', 'NAValues', 'RawLevels', 'NewLevels', 'ParentValues', 'ChildValues', 'Clean']
The columns that must be in a variable table file.
- funpack.loadtables.VARTABLE_CONVERTERS = {'Clean': functools.partial(<function convert_Process>, 'cleaner'), 'InternalType': <function convert_dtype>, 'ParentValues': <function convert_ParentValues>, 'Type': <function convert_type>}
Custom converter functions to use for some columns in the variable table.
- funpack.loadtables.VARTABLE_DTYPES = {'ChildValues': <class 'object'>, 'Clean': <class 'object'>, 'DataCoding': <class 'numpy.float32'>, 'Description': <class 'object'>, 'ID': <class 'numpy.uint32'>, 'Instancing': <class 'numpy.float32'>, 'NAValues': <class 'object'>, 'NewLevels': <class 'object'>, 'ParentValues': <class 'object'>, 'RawLevels': <class 'object'>, 'Type': <class 'object'>}
Types to use for some columns in the variable table.
- funpack.loadtables.addImplicitCategories(cattable: DataFrame, unknown: Sequence[Column], uncat: Sequence[Column])[source]
Adds some implicit/automatic categories to the category table.
The following implicit categories are added:
unknown: Variables which are not present in the variable table - this comprises non-UKB variables, or new UKB variables which are not described by the internal FUNPACK variable information in funpack/schema/.
uncategorised: Variables which are not present in any other category.
- funpack.loadtables.addNewVariable(vartable: DataFrame, vid: int, name: str, dtype: dtype | None = None, instancing: int | None = None)[source]
Add a new row to the variable table.
The instancing argument defines the meaning of the Column.visit field for columns associated with this variable. The default value is 2, meaning that this variable may be associated with columns corresponding to measurements acquired at different UK BioBank assessments and imaging visits. See https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=9 and https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi?id=10 for more details.
Note: If an entry for the specified vid already exists in vartable, the name, dtype and instancing arguments are ignored and the existing information in vartable will take precedence.
- Parameters:
vartable – The variable table
vid – Integer variable ID
name – Variable name - used as the description
dtype – numpy data type. If None, the variable type is set to util.CTYPES.unknown.
instancing – Instancing code for the new variable - defaults to 2
- funpack.loadtables.categoryVariables(cattable: DataFrame, categories: Sequence[int | str]) List[int][source]
Returns a list of variable IDs from cattable which correspond to the strings in categories.
- Parameters:
cattable – The category table.
categories – Sequence of integer category IDs or label sub-strings specifying the categories to return.
- Returns:
A list of variable IDs.
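The selection rule described above (integer category IDs or label sub-strings) can be sketched in plain Python. This is only an illustration and does not call funpack: the toy CATEGORIES table, the select_variables name, the case-insensitive sub-string match and the removal of duplicate IDs are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Toy category table: ID -> (label, variable IDs). Illustrative values only.
CATEGORIES: Dict[int, Tuple[str, List[int]]] = {
    1: ('age and sex', [31, 34]),
    2: ('brain imaging', [25000, 25001]),
    3: ('cognitive tests', [20016]),
}


def select_variables(categories: Sequence[Union[int, str]]) -> List[int]:
    """Return the variable IDs of every category matched by ID or label."""
    vids: List[int] = []
    for cat in categories:
        for cid, (label, variables) in CATEGORIES.items():
            if isinstance(cat, int):
                match = cat == cid                    # match on category ID
            else:
                match = cat.lower() in label.lower()  # assumed: ignore case
            if match:
                # Assumed: each variable ID is reported once.
                vids.extend(v for v in variables if v not in vids)
    return vids


print(select_variables([1, 'imaging']))
```

Mixing IDs and labels in one call mirrors the Sequence[int | str] type of the categories argument.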
- funpack.loadtables.columnTypes(vartable: DataFrame, columns: Sequence[Column]) Tuple[List[CTYPES], Dict[str, dtype]][source]
Retrieves the type of each column in columns as listed in vartable. Also identifies a suitable internal data type to use for each column where possible.
- Parameters:
vartable – The variable table.
columns – List of Column objects.
- Returns:
A tuple containing:
A list containing the type for each column in columns - an identifier from the util.CTYPES enum. Columns corresponding to a variable which is not in the variable table are given a type of None.
A dict of { column_name : dtype } mappings containing a suitable internal data type to use for some columns.
- funpack.loadtables.convert_ParentValues(val: str) List[VariableExpression] | Literal[nan][source]
Convert a string containing a sequence of comma-separated ParentValue expressions into a sequence of VariableExpression objects.
- funpack.loadtables.convert_Process(ptype: str, val: str) Dict[str, Process][source]
Convert a string containing a sequence of comma-separated Process or Clean expressions into an OrderedDict of Process objects (with the process names used as dictionary keys).
- funpack.loadtables.convert_Process_Variable(val: str, cattable: DataFrame | None = None) Tuple[str, List[int]][source]
Convert a string containing a process variable specification.
A process variable specification comprises one or more comma-separated:
integer variable IDs,
MATLAB-style start:step:stop ranges denoting a range of variable IDs,
Category IDs, denoted as 'cat<ID>', e.g. 'cat25'.
A specification may be preceded by one of:
'all', indicating that the process is to be applied to all variables simultaneously (this is the default)
'independent,', followed by one or more comma-separated variable IDs, indicating that the process is to be applied to the specified variables independently.
'all_independent', indicating that the process is to be applied to all variables independently
'all_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables simultaneously, except for the specified variables.
'all_independent_except,', followed by one or more comma-separated MATLAB-style ranges, indicating that the process is to be applied to all variables independently, except for the specified variables.
- Returns:
A tuple containing:
The process variable type - one of 'all', 'all_independent', 'all_except', 'all_independent_except', 'independent', or 'vids'
A list of variable IDs (empty if the process variable type is 'all' or 'all_independent').
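To make the (type, variable IDs) return value concrete, here is a minimal sketch of how such a specification could be split. It is not funpack's implementation: split_process_variable and expand are hypothetical names, ranges are assumed to follow MATLAB's start:step:stop order with an inclusive stop, and 'cat<ID>' tokens are deliberately not handled because resolving them needs a category table.

```python
from typing import List, Tuple

# Keywords that may lead a specification, as listed above.
PREFIXES = ('all', 'independent', 'all_independent',
            'all_except', 'all_independent_except')


def expand(token: str) -> List[int]:
    """Expand a single ID, start:stop or start:step:stop range."""
    parts = [int(p) for p in token.split(':')]
    if len(parts) == 1:
        return parts
    if len(parts) == 2:
        return list(range(parts[0], parts[1] + 1))
    start, step, stop = parts
    return list(range(start, stop + 1, step))


def split_process_variable(spec: str) -> Tuple[str, List[int]]:
    """Split a specification into its type and its list of variable IDs."""
    tokens = [t.strip() for t in spec.split(',') if t.strip()]
    ptype = 'vids'                       # plain list of IDs, no keyword
    if tokens and tokens[0] in PREFIXES:
        ptype = tokens.pop(0)
    vids = [vid for token in tokens for vid in expand(token)]
    return ptype, vids


print(split_process_variable('all_except,5,10:2:14'))
print(split_process_variable('independent,1:3'))
print(split_process_variable('all'))
```

Keeping the keyword separate from the ID list lets the caller decide later whether the listed variables are included or excluded.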
- funpack.loadtables.convert_category_variables(val: str) List[int][source]
Convert a string containing a sequence of comma-separated variable IDs or ranges into a list of variable IDs. Variables may be specified as integer IDs, or via a MATLAB-style start:step:stop range. See util.parseMatlabRange().
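As a rough illustration of this conversion, the sketch below expands a comma-separated string of IDs and ranges into a flat list. It does not use util.parseMatlabRange(); parse_variable_ids is a hypothetical helper, and it assumes a positive step and an inclusive stop value, as in MATLAB.

```python
from typing import List


def parse_variable_ids(spec: str) -> List[int]:
    """Expand comma-separated IDs and start[:step]:stop ranges into ints."""
    ids: List[int] = []
    for token in spec.split(','):
        token = token.strip()
        if not token:
            continue                      # tolerate stray commas
        parts = [int(p) for p in token.split(':')]
        if len(parts) == 1:               # single ID
            ids.append(parts[0])
        elif len(parts) == 2:             # start:stop
            start, stop = parts
            ids.extend(range(start, stop + 1))
        elif len(parts) == 3:             # start:step:stop
            start, step, stop = parts
            ids.extend(range(start, stop + 1, step))
        else:
            raise ValueError(f'Invalid range: {token!r}')
    return ids


print(parse_variable_ids('31, 40:42, 100:10:130'))
# [31, 40, 41, 42, 100, 110, 120, 130]
```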
- funpack.loadtables.convert_comma_sep_numbers(val: str) ndarray | Literal[nan][source]
Convert a string containing comma-separated numbers into a numpy array.
- funpack.loadtables.convert_comma_sep_text(val: str) List[str] | Literal[nan][source]
Convert a string containing comma-separated text into a list.
- funpack.loadtables.convert_dtype(val: str) dtype | Literal[nan][source]
Convert a string containing a numpy.dtype (e.g. 'float32') into a dtype object.
- funpack.loadtables.convert_type(val: str) CTYPES[source]
Convert a string containing a UK BioBank type into a numerical identifier for that type - see funpack.util.CTYPES.
- funpack.loadtables.identifyUncategorisedVariables(fileinfo: FileInfo, cattable: DataFrame) List[Column][source]
Called by loadTables(). Identifies all variables which are in the data file(s), but which are uncategorised (not present in any categories in the category table).
Such variables might have been overlooked, so the user may need to be warned about them.
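The check amounts to a set difference between the variables found in the data and the variables listed in any category. The sketch below shows that idea with plain integers and a dict; the real function works with FileInfo and Column objects, and the uncategorised name and sample values here are made up for illustration.

```python
from typing import Dict, Iterable, List


def uncategorised(data_vids: Iterable[int],
                  categories: Dict[str, List[int]]) -> List[int]:
    """Return the IDs in data_vids which appear in no category."""
    categorised = {vid for vids in categories.values() for vid in vids}
    # Preserve the order in which variables appear in the data.
    return [vid for vid in data_vids if vid not in categorised]


cats = {'demographics': [31, 34], 'body': [50]}
print(uncategorised([31, 34, 50, 21001], cats))
# [21001]
```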
- funpack.loadtables.loadCategoryTable(catfile: str | None = None) DataFrame[source]
Loads the category table from the given file.
- Parameters:
catfile – Path to the category file.
- funpack.loadtables.loadProcessingTable(procfile: str | None = None, skipProcessing: bool = False, prependProcess: Sequence[Tuple[List[int], str]] | None = None, appendProcess: Sequence[Tuple[List[int], str]] | None = None, cattable: DataFrame | None = None, **kwargs) DataFrame[source]
Loads the processing table from the given file.
- Parameters:
procfile – Path to the processing table file.
skipProcessing – If True, the processing table is not loaded from procfile. The prependProcess and appendProcess arguments are still applied.
prependProcess – Sequence of (varids, procstr) mappings specifying processes to prepend to the beginning of the processing table.
appendProcess – Sequence of (varids, procstr) mappings specifying processes to append to the end of the processing table.
cattable – pandas.DataFrame containing variable categories. If not provided, any processing rules which refer to categories cannot be processed, and will result in an error.
All other keyword arguments are ignored.
- funpack.loadtables.loadTableBases() Tuple[DataFrame, DataFrame][source]
Loads the UK Biobank variable and data coding schema files.
This function is called by loadVariableTable(). It loads the UK Biobank variable and data coding schema files (available from the UK Biobank data showcase web site), and returns the information contained within as two pandas.DataFrame objects. These dataframes are then used as bases for the funpack variable table.
Information in the base tables is loaded from the following files:
field.txt: A list of all UKB variables
encoding.txt: A list of all UKB data codings
type.txt: A list of vid : type mappings for certain variables, where type is the name of a numpy data type (e.g. float32).
- Returns:
A tuple containing:
a pandas.DataFrame to be used as the base for the variable table
a pandas.DataFrame to be used as the base for the datacoding table
- funpack.loadtables.loadTables(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, procfile: str | None = None, catfile: str | None = None, **kw) Tuple[DataFrame, DataFrame, DataFrame, List[Column], List[Column]][source]
Loads the data tables used to run funpack.
- Parameters:
fileinfo – FileInfo object describing the input data files.
varfiles – Path to one or more partial variable table files
dcfiles – Path to one or more partial data coding table files
typefile – Path to the type table file
procfile – Path to the processing table file
catfile – Path to the category table file
All other arguments are passed through to the loadVariableTable() and loadProcessingTable() functions.
- funpack.loadtables.loadVariableTable(fileinfo: FileInfo, varfiles: Sequence[str] | None = None, dcfiles: Sequence[str] | None = None, typefile: str | None = None, noBuiltins: bool = False, naValues: Dict[int, str] | None = None, childValues: Dict[int, Tuple[str, str]] | None = None, recoding: Dict[int, Tuple[str, str]] | None = None, clean: Dict[int, str] | None = None, typeClean: Dict[CTYPES, str] | None = None, globalClean: str | None = None, dropAbsent: bool = True, **kwargs) Tuple[DataFrame, Sequence[Column], Sequence[Column]][source]
Given variable table and datacoding table file names, builds and returns the variable table.
- Parameters:
fileinfo – FileInfo object describing the input data files.
varfiles – Path(s) to one or more variable files
dcfiles – Path(s) to one or more data coding files
typefile – Path to the type file
noBuiltins – If True, the built-in variable and datacoding base tables are not loaded.
naValues – Dictionary of {vid : values} mappings, specifying values which should be replaced with NA. The values are expected to be strings of comma-separated values.
childValues – Dictionary of {vid : (exprs, values)} mappings, specifying parent value expressions, and corresponding child values. The expressions and values are expected to be strings of comma-separated values of the same length.
recoding – Dictionary of {vid : (rawlevel, newlevel)} mappings. The raw and new levels are expected to be strings of comma-separated values of the same length.
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean. The expressions are expected to be strings.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file. The expressions are expected to be strings.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after variable-specific cleaning in the variable table, or specified via clean or typeClean. The expression is expected to be a string.
dropAbsent – If True (the default), remove all variables from the variable table which are not present in the data file(s).
All other keyword arguments are ignored.
- Returns:
A tuple containing:
A pandas.DataFrame containing the variable table
A sequence of Column objects representing variables which were present in the data files but not in the variable table, and which were added to the variable table.
A sequence of Column objects representing variables which were present in the data files and in the variable table, but which did not have any cleaning rules specified.
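The naValues and recoding arguments both take strings of comma-separated values keyed by variable ID. The sketch below shows what such dictionaries might look like and checks the "same length" requirement for recoding. The variable IDs and values are invented for illustration, and lengths_match is a hypothetical helper rather than part of funpack.

```python
from typing import Dict, Tuple

# Illustrative values only: variable 1000 is not a real reference.
na_values: Dict[int, str] = {1000: '-1,-3'}
recoding: Dict[int, Tuple[str, str]] = {1000: ('1,2,3', '0,1,2')}


def lengths_match(raw: str, new: str) -> bool:
    """Check that two comma-separated strings list the same number of values."""
    return len(raw.split(',')) == len(new.split(','))


for vid, (raw, new) in recoding.items():
    print(vid, lengths_match(raw, new))
```

Dictionaries of this shape would then be passed as the naValues and recoding keyword arguments.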
- funpack.loadtables.mergeCleanFunctions(vartable: DataFrame, tytable: DataFrame, clean: Dict[int, str], typeClean: Dict[str, str], globalClean: str)[source]
Merges custom clean functions into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table.
tytable – The type table
clean – Dictionary of {vid : expr} mappings containing cleaning functions to apply - this will override any cleaning specified in the variable file, and any cleaning specified in typeClean.
typeClean – Dictionary of {type : expr} mappings containing cleaning functions to apply to all variables of a specific type - this will override any cleaning specified in the type file.
globalClean – Expression containing cleaning functions to apply to every variable - this will be performed after variable-specific cleaning in the variable table, or specified via clean or typeClean.
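The precedence rules above can be summarised with a small sketch that works on plain strings and dicts. It is a simplified reading of the documentation rather than funpack's code: effective_clean is a hypothetical function, the expression names are placeholders, and the rule that a per-variable expression from the variable file takes priority over a per-type expression is an assumption, since the text only states this for the clean argument.

```python
from typing import Dict, Optional


def effective_clean(vid: int,
                    vtype: str,
                    var_file: Dict[int, str],
                    type_file: Dict[str, str],
                    clean: Optional[Dict[int, str]] = None,
                    type_clean: Optional[Dict[str, str]] = None,
                    global_clean: Optional[str] = None) -> Optional[str]:
    """Work out which cleaning expression ends up applied to a variable."""
    clean = clean or {}
    type_clean = type_clean or {}
    # The clean argument overrides the variable file.
    var_expr = clean.get(vid, var_file.get(vid))
    # The typeClean argument overrides the type file.
    type_expr = type_clean.get(vtype, type_file.get(vtype))
    # Assumed: a per-variable expression wins over a per-type one.
    expr = var_expr if var_expr is not None else type_expr
    # Global cleaning is performed after everything else.
    if global_clean is not None:
        expr = global_clean if expr is None else f'{expr},{global_clean}'
    return expr


var_file = {1: 'cleanA'}
type_file = {'Integer': 'cleanT'}
print(effective_clean(1, 'Integer', var_file, type_file, clean={1: 'cleanB'}))
print(effective_clean(2, 'Integer', var_file, type_file, global_clean='cleanG'))
```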
- funpack.loadtables.mergeDataCodingTable(vartable: DataFrame, dctable: DataFrame)[source]
Merges information from the data coding table into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table.
dctable – The data coding table.
- funpack.loadtables.mergeIntoVariableTable(vartable: DataFrame, cols: Sequence[str], mapping: str | Dict[int, Any])[source]
Merge data from mapping into the variable table.
Called by loadVariableTable().
- Parameters:
vartable – The variable table
cols – Names of columns in the variable table
mapping – Dict of {vid : values} mappings containing the data to copy in.
- funpack.loadtables.mergeTableFiles(base: DataFrame, fnames: List[str], what: str, dtypes: Dict[str, Type], converters: Dict[str, Callable], columns: List[str]) DataFrame[source]
Load and merge one or more table files.
This function is called by loadVariableTable() to load the variable, data coding, and type table files.
The variable and datacoding tables can be loaded from multiple files, with each file containing part of the full table. All provided files are merged into one table. The final table for a given set of files is the outer join on the index column (assumed to be the first column in each file), where non-NA values in overlapping columns from later files will overwrite the values in earlier files.
- Parameters:
base – Table containing base information - used for the variable and datacoding tables (see loadTableBases()).
fnames – List of files to load. If None, the base table is returned (or an empty table if a base is not given).
what – Name of the table files being loaded - used solely for log messages
dtypes – Dict containing {column : datatype} mappings
converters – Dict containing {column : convertfunc} mappings
columns – Expected column names
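The merge rule (outer join on the index, with non-NA values from later files overwriting earlier ones) can be illustrated without pandas. In this sketch each table is a dict keyed by ID, each row is a dict of column values, and None stands in for NA. merge_tables and the sample rows are hypothetical and only demonstrate the semantics described above.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]
Table = Dict[int, Row]


def merge_tables(tables: List[Table]) -> Table:
    """Outer-join tables on their keys; later non-NA values win."""
    merged: Table = {}
    for table in tables:                      # earlier files first
        for key, row in table.items():
            target = merged.setdefault(key, {})
            for column, value in row.items():
                # A later NA (None) never overwrites an earlier value.
                if value is not None or column not in target:
                    target[column] = value
    # Give every row the full set of columns, as an outer join would.
    columns = {c for row in merged.values() for c in row}
    for row in merged.values():
        for column in columns:
            row.setdefault(column, None)
    return merged


first = {1: {'NAValues': '-1', 'Clean': None},
         2: {'NAValues': '-3', 'Clean': 'x'}}
second = {2: {'NAValues': None, 'Clean': 'y'},
          3: {'NAValues': '-7'}}
print(merge_tables([first, second]))
```

Row 2 keeps its NAValues from the first table because the second table holds NA there, while its Clean value is replaced.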
- funpack.loadtables.sanitiseVariableTable(vartable: DataFrame, cols: Sequence[Column], dropAbsent: bool) List[Column][source]
Ensures that the variable table contains an entry for every variable in the input data.
Called by loadVariableTable().
- Parameters:
vartable – pandas.DataFrame containing the variable table.
cols – Sequence of Column objects representing the columns in the input data.
dropAbsent – If True, entries in the table for variables which are not in cols will be removed.
- Returns:
A list of unknown Column objects, i.e. representing variables which were not in the variable table.