funpack.processing_functions_core
This module contains utility algorithms and functions used by the
processing_functions module.
isSparse() – Returns True if the given data looks sparse, False otherwise.
naCorrelation() – Compares the missingness correlation of every pair of columns in namask.
pairwiseRedundantColumns() – Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.
matrixRedundantColumns() – Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.
binariseCategorical() – Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.
expandCompound() – Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.
- funpack.processing_functions_core.binariseCategorical(data: DataFrame, minpres: int | None = None, take: DataFrame | None = None, token: str | None = None, njobs: int | None = None) -> Tuple[ndarray, ndarray]
Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.
- Parameters:
data – A pandas.DataFrame containing the input columns.
minpres – Optional threshold. Values with fewer than this number of occurrences will not be included in the output.
take – Optional pandas.DataFrame containing values to use in the output. Instead of binary 0/1 values, rows where a unique value is present will be populated with the corresponding value from take, and rows where a unique value is not present will be populated with np.nan. Must contain the same number of columns (and rows) as data.
token – Unique token to identify this function call, to make it re-entrant (in case multiple calls are made in a parallel execution environment).
njobs – Number of jobs to parallelise tasks with. The generateBinaryColumns() function is called in parallel for different blocks of unique values.
- Returns:
A tuple containing:
A (nrows, nvalues) numpy array containing the generated binary columns.
A 1D numpy array of length nvalues containing the unique values that are encoded in the binary columns.
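The binarisation described above can be sketched in plain Python. This is a minimal, dependency-free illustration and not funpack's implementation: the helper name `binarise` is hypothetical, missing values are represented as `None`, and `minpres` is assumed to count the rows in which a value is present.

```python
def binarise(columns, minpres=None):
    """Hypothetical helper: build one 0/1 column per unique value.

    columns: list of equal-length lists (None marks a missing value).
    Returns (binary, uniques), where binary[row][i] is 1 if uniques[i]
    occurs in any input column of that row, and 0 otherwise.
    """
    nrows = len(columns[0])

    # Count the rows in which each value is present at least once
    counts = {}
    for r in range(nrows):
        for v in {col[r] for col in columns if col[r] is not None}:
            counts[v] = counts.get(v, 0) + 1

    # Assumption: values present in fewer than minpres rows are dropped
    uniques = sorted(v for v, n in counts.items()
                     if minpres is None or n >= minpres)

    binary = [[1 if any(col[r] == v for col in columns) else 0
               for v in uniques]
              for r in range(nrows)]
    return binary, uniques


cols = [[1, 2, 3, 1], [2, None, None, None]]
binary, uniques = binarise(cols)
filtered, kept = binarise(cols, minpres=2)
```

The first return value has one row per input row and one column per retained unique value, mirroring the `(nrows, nvalues)` shape documented above.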
- funpack.processing_functions_core.expandCompound(data: Series) -> ndarray
Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.
The returned array has shape (X, Y), where X is the number of rows in the data, and Y is the maximum number of values for any of the entries.
- Parameters:
data – pandas.Series containing the compound data.
- Returns:
2D numpy array containing expanded data.
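The `(X, Y)` output shape can be illustrated with a short plain-Python sketch. It assumes each entry is already a sequence of numbers and that shorter entries are padded with NaN; the padding value and the helper name `expand` are assumptions, since the text above does not state them.

```python
import math


def expand(entries, pad=math.nan):
    """Hypothetical helper: pad variable-length entries to a rectangle."""
    # Y is the maximum number of values in any entry
    width = max((len(e) for e in entries), default=0)
    # Each output row is the entry followed by as much padding as needed
    return [list(e) + [pad] * (width - len(e)) for e in entries]


rows = expand([[1.0, 2.0, 3.0], [4.0], []])
```

Here `rows` has three rows (X = 3) of three values each (Y = 3), with the second and third rows padded.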
- funpack.processing_functions_core.generateBinaryColumns(offset, nvals, fill, token)
Called by binariseCategorical(). Generates binarised columns for a subset of unique values for a data set.
The following arguments are passed internally from binariseCategorical() to this function:
Sequence of unique values
Numpy array containing the data
Numpy array to store the output
Numpy array to take values from
- Parameters:
offset – Offset into the sequence of unique values
nvals – Number of unique values to binarise
fill – Default value
token – Unique token used to retrieve the data for one invocation of this function.
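One way to picture the offset and nvals arguments is as a partition of the unique values into contiguous blocks, one per parallel call. The sketch below is an assumption about how such blocks could be computed, not funpack's scheduling logic, and `blocks` is a hypothetical helper.

```python
def blocks(nunique, njobs):
    """Hypothetical helper: split nunique values into (offset, nvals) blocks."""
    # Ceiling division, with a minimum block size of one value
    size = max(1, -(-nunique // njobs))
    return [(offset, min(size, nunique - offset))
            for offset in range(0, nunique, size)]


jobs = blocks(10, 3)
```

Each `(offset, nvals)` pair would then identify the slice of unique values that a single call is responsible for.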
- funpack.processing_functions_core.isSparse(data: Series, ctype: Enum | None = None, minpres: float | None = None, minstd: float | None = None, mincat: float | None = None, maxcat: float | None = None, abspres: bool = True, abscat: bool = True, naval: Any | None = None) -> Tuple[bool, str | None, Any]
Returns True if the given data looks sparse, False otherwise.
Used by removeIfSparse() - see that function for details.
By default, the minstd test is only performed on numeric columns, and the mincat/maxcat tests are only performed on integer/categorical columns. This behaviour can be overridden by passing in a ctype of None, in which case all specified tests will be performed.
- Parameters:
data – pandas.Series containing the data.
ctype – The series column type (one of the util.CTYPES values), or None to disable type-specific logic.
- Returns:
A tuple containing:
True if the data is sparse, False otherwise.
If the data is sparse, one of 'minpres', 'minstd', 'mincat', or 'maxcat', indicating the cause of its sparsity. None if the data is not sparse.
If the data is sparse, the value of the criterion which caused the data to fail the test. None if the data is not sparse.
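The three-part return value can be illustrated with a simplified sketch. The exact test definitions live in removeIfSparse() and are not given here, so the semantics below are assumptions: minpres is a minimum count of non-missing values, minstd a minimum standard deviation, and maxcat a maximum size for the largest category. The function name `looks_sparse` is hypothetical, and the mincat test and type-specific logic are omitted.

```python
import statistics


def looks_sparse(values, minpres=None, minstd=None, maxcat=None):
    """Hypothetical helper returning (is_sparse, cause, value)."""
    present = [v for v in values if v is not None]

    # Assumed test: too few non-missing values
    if minpres is not None and len(present) < minpres:
        return True, 'minpres', len(present)

    # Assumed test: standard deviation below the threshold
    if minstd is not None and present:
        std = statistics.pstdev(present)
        if std < minstd:
            return True, 'minstd', std

    # Assumed test: the largest category dominates the data
    if maxcat is not None and present:
        largest = max(present.count(v) for v in set(present))
        if largest > maxcat:
            return True, 'maxcat', largest

    return False, None, None


result = looks_sparse([1, None, None, None], minpres=2)
```

The first failing test determines the reported cause, and a non-sparse result carries `None` in both the cause and value positions.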
- funpack.processing_functions_core.matrixRedundantColumns(data: DataFrame, corrthres: float, nathres: float | None = None, precision: str | None = None) -> List[Tuple[int, int]]
Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.
- Parameters:
data – pandas.DataFrame containing the data.
corrthres – Correlation threshold. Columns with a correlation greater than this are identified as redundant.
nathres – Correlation threshold to use for missing values. If provided, columns must have a correlation greater than corrthres and a missing-value correlation greater than nathres to be identified as redundant.
precision – 'double' (the default) or 'single', specifying the floating point precision to use. Note that the algorithm used to calculate the correlation values is unstable for data with a range larger than ~10e5, so double precision is used as default.
- Returns:
Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
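The dot-product approach can be sketched as follows: once each column is centred and scaled to unit length, the correlation of two columns is simply their dot product, so the full correlation matrix is a single matrix product (with numpy, `U.T @ U`). The plain-Python version below is an illustration under assumptions, not funpack's code: it ignores missing values and nathres, requires non-constant columns, compares the absolute correlation to the threshold, and reports the later column of each pair as the redundant one.

```python
import math


def unit_columns(columns):
    """Centre each column and scale it to unit length."""
    unit = []
    for col in columns:
        mean = sum(col) / len(col)
        centred = [v - mean for v in col]
        norm = math.sqrt(sum(c * c for c in centred))
        unit.append([c / norm for c in centred])
    return unit


def matrix_redundant(columns, corrthres):
    """Hypothetical helper: threshold a dot-product correlation matrix."""
    unit = unit_columns(columns)
    ncols = len(unit)
    # corr[a][b] is the dot product of unit columns a and b
    corr = [[sum(x * y for x, y in zip(unit[a], unit[b]))
             for b in range(ncols)]
            for a in range(ncols)]
    # Assumption: for each correlated pair, the later column is redundant
    return [(a, b)
            for b in range(ncols)
            for a in range(b + 1, ncols)
            if abs(corr[a][b]) > corrthres]


cols = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.1],
        [4.0, 1.0, 3.0, 2.0]]
pairs = matrix_redundant(cols, 0.99)
```

In this example only the second column is flagged, because it is almost an exact multiple of the first.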
- funpack.processing_functions_core.naCorrelation(namask: ndarray, nathres: float, nacounts: ndarray | None = None) -> ndarray
Compares the missingness correlation of every pair of columns in namask.
- Parameters:
namask – 2D numpy array of shape (nsamples, ncolumns) containing binary missingness classification.
nathres – Threshold above which column pairs should be classed as correlated.
nacounts – 1D numpy array containing the number of missing values in each column. Calculated if not provided.
- Returns:
A (ncolumns, ncolumns) boolean numpy array containing True for column pairs which exceed nathres, False otherwise.
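For binary missingness masks, the Pearson correlation of two columns can be written in terms of three counts: the number of samples, the number of missing values per column, and the number of rows where both columns are missing. The sketch below uses that identity in plain Python. It is an illustration rather than funpack's implementation; the helper name `na_correlated` and the choice to report `False` for columns with no variation in missingness are assumptions.

```python
import math


def na_correlated(namask, nathres, nacounts=None):
    """Hypothetical helper: boolean matrix of missingness correlations."""
    nrows, ncols = len(namask), len(namask[0])
    if nacounts is None:
        # Number of missing values in each column
        nacounts = [sum(row[c] for row in namask) for c in range(ncols)]

    result = [[False] * ncols for _ in range(ncols)]
    for i in range(ncols):
        for j in range(ncols):
            # Rows where both columns are missing (dot product of the masks)
            both = sum(row[i] * row[j] for row in namask)
            ni, nj = nacounts[i], nacounts[j]
            denom = math.sqrt(ni * (nrows - ni) * nj * (nrows - nj))
            # Assumption: undefined correlations are reported as False
            if denom > 0:
                corr = (nrows * both - ni * nj) / denom
                result[i][j] = corr > nathres
    return result


namask = [[1, 1, 0],
          [1, 1, 1],
          [0, 0, 0],
          [0, 0, 1]]
flags = na_correlated(namask, 0.9)
```

The first two columns share an identical missingness pattern and are flagged as correlated, while the third is only correlated with itself.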
- funpack.processing_functions_core.pairwiseRedundantColumns(data: DataFrame, colpairs: ndarray, corrthres: float, token: str | None = None) -> List[Tuple[int, int]]
Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.
- Parameters:
data – pandas.DataFrame containing the data.
colpairs – numpy array of shape (N, 2), containing indices of the column pairs to be evaluated.
corrthres – Correlation threshold. Columns with a correlation greater than this are identified as redundant.
token – Identifier string for log messages.
- Returns:
Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
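The pair-by-pair strategy can be sketched in plain Python. The version below is a simplified illustration, not funpack's implementation: it assumes missing values are marked with `None`, computes each correlation over the rows where both columns are present, compares the absolute correlation to the threshold, and returns each qualifying pair in the order it was supplied. The helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_redundant(columns, colpairs, corrthres):
    """Hypothetical helper: test each (a, b) pair independently."""
    redundant = []
    for a, b in colpairs:
        # Keep only the rows where both columns have a value
        rows = [(x, y) for x, y in zip(columns[a], columns[b])
                if x is not None and y is not None]
        xs, ys = zip(*rows)
        if abs(pearson(xs, ys)) > corrthres:
            redundant.append((a, b))
    return redundant


cols = [[1.0, 2.0, 3.0, 4.0, 5.0],
        [2.0, None, 6.0, 8.0, 10.0],
        [5.0, 1.0, 4.0, 2.0, 3.0]]
pairs = pairwise_redundant(cols, [(1, 0), (2, 0), (2, 1)], 0.9)
```

Handling each pair separately lets every comparison use its own set of jointly observed rows, at the cost of one pass over the data per pair.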