`funpack.processing_functions_core`

This module contains utility algorithms and functions used by the processing_functions module.

isSparse

Returns True if the given data looks sparse, False otherwise.

naCorrelation

Compares the missingness correlation of every pair of columns in namask.

pairwiseRedundantColumns

Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.

matrixRedundantColumns

Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.

binariseCategorical

Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.

expandCompound

Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.

funpack.processing_functions_core.binariseCategorical(data: DataFrame, minpres: int | None = None, take: DataFrame | None = None, token: str | None = None, njobs: int | None = None) → Tuple[ndarray, ndarray][source]

Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.

Parameters:

data – A pandas.DataFrame containing the input columns
minpres – Optional threshold - values with fewer than this number of occurrences will not be included in the output
take – Optional pandas.DataFrame containing values to use in the output. Instead of using binary 0/1 values, rows where a unique value is present will be populated with the corresponding value from take, and rows where a unique value is not present will be populated with np.nan. Must contain the same number of columns (and rows) as data.
token – Unique token to identify this function call, to make it re-entrant (in case multiple calls are made in a parallel execution environment).
njobs – Number of jobs to parallelise tasks with - the generateBinaryColumns() function is called in parallel for different blocks of unique values.

Returns:

A tuple containing:

A (nrows, nvalues) numpy array containing the generated binary columns
A 1D numpy array of length nvalues containing the unique values that are encoded in the binary columns.
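
The shape of the two return values can be illustrated with a minimal numpy-only sketch. This is not the funpack implementation: `binarise_sketch` is a hypothetical name, it ignores the take, token and njobs options, it assumes missing values are encoded as NaN, and it assumes that minpres counts the rows in which a value is present.

```python
import numpy as np

def binarise_sketch(data, minpres=None):
    """Hypothetical, simplified stand-in for binariseCategorical()."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, np.newaxis]

    # Sorted unique values across all input columns, ignoring NaN
    # (assumption: missing entries are encoded as NaN).
    uniq = np.unique(data[~np.isnan(data)])

    # One output column per unique value: 1 where the value occurs
    # anywhere in the row, 0 otherwise.
    binarised = np.zeros((data.shape[0], len(uniq)), dtype=np.uint8)
    for i, value in enumerate(uniq):
        binarised[:, i] = (data == value).any(axis=1)

    # Assumption: minpres is compared against the number of rows in
    # which the value is present.
    if minpres is not None:
        keep = binarised.sum(axis=0) >= minpres
        binarised, uniq = binarised[:, keep], uniq[keep]

    return binarised, uniq

data = np.array([[1, 2],
                 [2, np.nan],
                 [3, 1],
                 [2, 2]])
binarised, uniq = binarise_sketch(data, minpres=2)
```

Here the value 3 is present in only one row, so with minpres=2 it is dropped, leaving a (4, 2) array with one column each for the values 1 and 2.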

funpack.processing_functions_core.expandCompound(data: Series) → ndarray[source]

Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.

The returned array has shape (X, Y), where X is the number of rows in the data, and Y is the maximum number of values for any of the entries.

Parameters:

data – pandas.Series containing the compound data.

Returns:

2D numpy array containing expanded data.
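
The (X, Y) output shape can be illustrated with a small sketch. The entry format and the padding value are not specified above, so this example assumes comma-separated numeric strings and pads short rows with NaN. `expand_compound_sketch` is a hypothetical name and it operates on a plain list rather than a pandas.Series.

```python
import numpy as np

def expand_compound_sketch(entries, sep=","):
    """Hypothetical, simplified stand-in for expandCompound()."""
    # Assumption: each entry is a separator-delimited string of
    # numbers, or None/empty when the entry is missing.
    parsed = []
    for entry in entries:
        if entry is None or entry == "":
            parsed.append([])
        else:
            parsed.append([float(v) for v in entry.split(sep)])

    # Y is the largest number of values found in any entry.
    ncols = max((len(p) for p in parsed), default=0)

    # Assumption: shorter rows are padded with NaN.
    out = np.full((len(parsed), ncols), np.nan)
    for i, values in enumerate(parsed):
        if values:
            out[i, :len(values)] = values
    return out

expanded = expand_compound_sketch(["1,2,3", "4", None, "5,6"])
```

With four entries and at most three values per entry, `expanded` has shape (4, 3).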

funpack.processing_functions_core.generateBinaryColumns(offset, nvals, fill, token)[source]

Called by binariseCategorical(). Generates binarised columns for a subset of unique values for a data set.

The following arguments are passed internally from binariseCategorical() to this function:

Sequence of unique values

Numpy array containing the data

Numpy array to store the output

Numpy array to take values from

Parameters:

offset – Offset into the sequence of unique values
nvals – Number of unique values to binarise
fill – Default value
token – Unique token used to retrieve the data for one invocation of this function.

funpack.processing_functions_core.isSparse

Returns True if the given data looks sparse, False otherwise.

Used by removeIfSparse() - see that function for details.

By default, the minstd test is only performed on numeric columns, and the mincat/maxcat tests are only performed on integer/categorical columns. This behaviour can be overridden by passing in a ctype of None, in which case all specified tests will be performed.

Parameters:

data – pandas.Series containing the data
ctype – The series column type (one of the util.CTYPES values), or None to disable type-specific logic.

Returns:

A tuple containing:

True if the data is sparse, False otherwise.
If the data is sparse, one of 'minpres', 'minstd', 'mincat', or 'maxcat', indicating the cause of its sparsity. None if the data is not sparse.
If the data is sparse, the value of the criterion that caused the data to fail the test. None if the data is not sparse.
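
The three-element return value can be illustrated with a sketch covering two of the four tests. The threshold semantics are not defined in this section, so the following are assumptions: minpres is taken to be a minimum number of non-missing values, minstd a minimum standard deviation, and missing values are encoded as NaN. `looks_sparse_sketch` is a hypothetical name, and the type-specific (ctype) logic is omitted.

```python
import numpy as np

def looks_sparse_sketch(values, minpres=None, minstd=None):
    """Hypothetical illustration of the (sparse, cause, value) result."""
    values = np.asarray(values, dtype=float)
    present = values[~np.isnan(values)]

    # Assumption: minpres is a minimum count of non-missing values.
    if minpres is not None and len(present) < minpres:
        return True, 'minpres', len(present)

    # Assumption: minstd is a minimum (sample) standard deviation.
    if minstd is not None:
        std = float(present.std(ddof=1)) if len(present) > 1 else 0.0
        if std <= minstd:
            return True, 'minstd', std

    return False, None, None

too_few = looks_sparse_sketch([1, np.nan, np.nan, 2], minpres=3)
constant = looks_sparse_sketch([5, 5, 5, 5], minstd=0.1)
dense = looks_sparse_sketch([1, 2, 3, 4], minpres=2, minstd=0.1)
```

The second and third elements are None whenever the first is False, matching the description above.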

funpack.processing_functions_core.matrixRedundantColumns(data: DataFrame, corrthres: float, nathres: float | None = None, precision: str | None = None) → List[Tuple[int, int]][source]

Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.

Parameters:

data – pandas.DataFrame containing the data
corrthres – Correlation threshold - columns with a correlation greater than this are identified as redundant.
nathres – Correlation threshold to use for missing values. If provided, columns must have a correlation greater than corrthres and a missing-value correlation greater than nathres to be identified as redundant.
precision – 'double' (the default) or 'single', specifying the floating point precision to use. Note that the algorithm used to calculate the correlation values is unstable for data with a range larger than ~10e5, so double precision is used by default.

Returns:

Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
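
The idea of obtaining a correlation matrix from dot products can be sketched as follows. This is a simplified illustration rather than the funpack algorithm: `matrix_redundant_sketch` is a hypothetical name, missing values and nathres are not handled, the absolute correlation is compared against corrthres, and the later column of each pair is reported as the redundant one. The last two choices are assumptions.

```python
import numpy as np

def matrix_redundant_sketch(data, corrthres):
    """Hypothetical, simplified stand-in for matrixRedundantColumns()."""
    data = np.asarray(data, dtype=np.float64)

    # Centre each column and scale it to unit length, so that the dot
    # product of two columns equals their Pearson correlation.
    centred = data - data.mean(axis=0)
    norms = np.sqrt((centred ** 2).sum(axis=0))
    norms[norms == 0] = 1.0  # constant columns end up with correlation 0
    unit = centred / norms

    # A single matrix product yields the full correlation matrix.
    corr = unit.T @ unit

    # Look at each pair once via the strict lower triangle (a > b).
    # Assumptions: absolute correlation is used, and the later column
    # a is reported as redundant with respect to the earlier column b.
    a, b = np.where(np.tril(np.abs(corr), k=-1) > corrthres)
    return [(int(i), int(j)) for i, j in zip(a, b)]

data = np.array([[1.0, 2.0, 4.0],
                 [2.0, 4.0, 1.0],
                 [3.0, 6.0, 3.0],
                 [4.0, 8.0, 2.0]])
redundant = matrix_redundant_sketch(data, corrthres=0.99)
```

Column 1 is exactly twice column 0, so the only pair reported is (1, 0).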

funpack.processing_functions_core.naCorrelation(namask: ndarray, nathres: float, nacounts: ndarray | None = None) → ndarray[source]

Compares the missingness correlation of every pair of columns in namask.

Parameters:

namask – 2D numpy array of shape (nsamples, ncolumns) containing binary missingness classification
nathres – Threshold above which column pairs should be classed as correlated.
nacounts – 1D numpy array containing the number of missing values in each column. Calculated if not provided.

Returns:

A (ncolumns, ncolumns) boolean numpy array containing True for column pairs which exceed nathres, False otherwise.
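
A minimal sketch of a missingness correlation test, assuming that the comparison is a Pearson correlation between the binary columns of namask. `na_correlation_sketch` is a hypothetical name, the nacounts argument is omitted, and columns with constant missingness are given a correlation of 0, which is an assumption and may differ from what funpack does.

```python
import numpy as np

def na_correlation_sketch(namask, nathres):
    """Hypothetical, simplified stand-in for naCorrelation()."""
    mask = np.asarray(namask, dtype=float)

    # Pearson correlation between every pair of binary columns.
    centred = mask - mask.mean(axis=0)
    norms = np.sqrt((centred ** 2).sum(axis=0))
    cov = centred.T @ centred
    denom = np.outer(norms, norms)

    # Assumption: columns whose missingness never varies are given a
    # correlation of 0 instead of dividing by zero.
    corr = np.zeros_like(cov)
    np.divide(cov, denom, out=corr, where=denom > 0)

    return corr > nathres

namask = np.array([[1, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 0, 0]])
correlated = na_correlation_sketch(namask, nathres=0.9)
```

Columns 0 and 1 have identical missingness patterns, so `correlated[0, 1]` is True, while column 2 does not exceed the threshold with either of them.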

funpack.processing_functions_core.pairwiseRedundantColumns(data: DataFrame, colpairs: ndarray, corrthres: float, token: str | None = None) → List[Tuple[int, int]][source]

Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.

Parameters:

data – pandas.DataFrame containing the data
colpairs – numpy array of shape (N, 2), containing indices of the column pairs to be evaluated.
corrthres – Correlation threshold - columns with a correlation greater than this are identified as redundant.
token – Identifier string for log messages.

Returns:

Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
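
The pair-by-pair approach can be sketched as a loop over colpairs. This is an illustration under stated assumptions, not the funpack code: `pairwise_redundant_sketch` is a hypothetical name, each correlation is computed over rows where both columns are present, the absolute correlation is compared against corrthres, and the second column of each pair is reported as the redundant one.

```python
import numpy as np

def pairwise_redundant_sketch(data, colpairs, corrthres):
    """Hypothetical, simplified stand-in for pairwiseRedundantColumns()."""
    data = np.asarray(data, dtype=float)
    redundant = []

    for a, b in colpairs:
        # Assumption: only rows where both columns have a value are used.
        both = ~np.isnan(data[:, a]) & ~np.isnan(data[:, b])
        if both.sum() < 2:
            continue
        x, y = data[both, a], data[both, b]

        # Correlation is undefined for constant columns, so skip them.
        if x.std() == 0 or y.std() == 0:
            continue

        corr = np.corrcoef(x, y)[0, 1]

        # Assumption: the second column of the pair is flagged as
        # redundant with respect to the first.
        if abs(corr) > corrthres:
            redundant.append((int(b), int(a)))

    return redundant

data = np.array([[1.0, 2.0, 4.0],
                 [2.0, 4.0, 1.0],
                 [3.0, np.nan, 3.0],
                 [4.0, 8.0, 2.0]])
colpairs = np.array([[0, 1], [0, 2], [1, 2]])
redundant = pairwise_redundant_sketch(data, colpairs, corrthres=0.99)
```

Comparing pairs one at a time lets each pair use its own set of jointly present rows, at the cost of one correlation computation per pair.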
