funpack.processing_functions_core
This module contains utility algorithms and functions used by the processing_functions module.
isSparse – Returns True if the given data looks sparse, False otherwise.
naCorrelation – Compares the missingness correlation of every pair of columns in namask.
pairwiseRedundantColumns – Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.
matrixRedundantColumns – Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.
binariseCategorical – Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.
expandCompound – Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.
- funpack.processing_functions_core.binariseCategorical(data: DataFrame, minpres: int | None = None, take: DataFrame | None = None, token: str | None = None, njobs: int | None = None) → Tuple[ndarray, ndarray]
Takes one or more columns containing categorical data, and generates a new set of binary columns (as np.uint8), one for each unique categorical value, containing 1 in rows where the value was present, 0 otherwise.
- Parameters:
data – A pandas.DataFrame containing the input columns.
minpres – Optional threshold. Values with fewer than this number of occurrences will not be included in the output.
take – Optional pandas.DataFrame containing values to use in the output. Instead of using binary 0/1 values, rows where a unique value is present will be populated with the corresponding value from take, and rows where a unique value is not present will be populated with np.nan. Must contain the same number of columns (and rows) as data.
token – Unique token to identify this function call, to make it re-entrant (in case multiple calls are made in a parallel execution environment).
njobs – Number of jobs to parallelise tasks with. The generateBinaryColumns() function is called in parallel for different blocks of unique values.
- Returns:
A tuple containing:
A (nrows, nvalues) numpy array containing the generated binary columns.
A 1D numpy array of length nvalues containing the unique values that are encoded in the binary columns.
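The behaviour described above can be illustrated with a minimal sketch. This is not funpack's implementation: binarise_sketch is a hypothetical helper, it assumes numeric category codes, it ignores token and njobs, and it assumes that when take is given the value is read from the first column in which the category occurs.

```python
import numpy as np
import pandas as pd


def binarise_sketch(data, minpres=None, take=None):
    """Hypothetical stand-in illustrating the documented behaviour."""
    values = data.to_numpy(dtype=float)
    # Unique non-missing values across all input columns
    uniq = np.unique(values[~np.isnan(values)])
    takevals = None if take is None else take.to_numpy(dtype=float)

    columns, kept = [], []
    for v in uniq:
        mask = values == v              # (nrows, ncols) boolean
        present = mask.any(axis=1)      # rows in which v occurs
        if minpres is not None and present.sum() < minpres:
            continue                    # too few occurrences, skip
        if takevals is None:
            col = present.astype(np.uint8)
        else:
            # Assumption: use the value from the first matching column
            col = np.full(len(data), np.nan)
            rows = np.where(present)[0]
            first = mask[rows].argmax(axis=1)
            col[rows] = takevals[rows, first]
        columns.append(col)
        kept.append(v)

    if not columns:
        return np.empty((len(data), 0)), np.array(kept)
    return np.column_stack(columns), np.array(kept)


data = pd.DataFrame({'a': [1, 2, np.nan], 'b': [2, 3, np.nan]})
take = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]})

binary, uniq_vals = binarise_sketch(data)
frequent, frequent_vals = binarise_sketch(data, minpres=2)
taken, _ = binarise_sketch(data, take=take)
```

Here binary has one column per unique value (1, 2 and 3), frequent keeps only the value 2 because it is the only one present in at least two rows, and taken holds values from take in place of the 1 entries.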
- funpack.processing_functions_core.expandCompound(data: Series) → ndarray
Takes a pandas.Series containing sequence data (potentially of variable length), and returns a 2D numpy array containing the parsed data.
The returned array has shape (X, Y), where X is the number of rows in the data, and Y is the maximum number of values for any of the entries.
- Parameters:
data – pandas.Series containing the compound data.
- Returns:
2D numpy array containing expanded data.
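The (X, Y) shape rule can be sketched as follows. This is an illustration only: expand_compound_sketch is a hypothetical helper, it assumes every entry is already a Python sequence of numbers, and it assumes shorter entries are padded with NaN, which this page does not state.

```python
import numpy as np
import pandas as pd


def expand_compound_sketch(data):
    """Hypothetical stand-in: expand variable-length sequences to 2D."""
    nrows = len(data)
    # Y is the length of the longest entry
    ncols = max((len(v) for v in data), default=0)
    # Assumption: missing trailing values are padded with NaN
    out = np.full((nrows, ncols), np.nan)
    for i, v in enumerate(data):
        out[i, :len(v)] = v
    return out


series = pd.Series([[1, 2, 3], [4], [5, 6]])
expanded = expand_compound_sketch(series)
```

With the three entries above, expanded has shape (3, 3), and the second row contains 4 followed by two padding values.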
- funpack.processing_functions_core.generateBinaryColumns(offset, nvals, fill, token)
Called by binariseCategorical(). Generates binarised columns for a subset of unique values for a data set.
The following arguments are passed internally from binariseCategorical() to this function:
Sequence of unique values
Numpy array containing the data
Numpy array to store the output
Numpy array to take values from
- Parameters:
offset – Offset into the sequence of unique values
nvals – Number of unique values to binarise
fill – Default value
token – Unique token used to retrieve the data for one invocation of this function.
- funpack.processing_functions_core.isSparse(data: Series, ctype: Enum | None = None, minpres: float | None = None, minstd: float | None = None, mincat: float | None = None, maxcat: float | None = None, abspres: bool = True, abscat: bool = True, naval: Any | None = None) → Tuple[bool, str | None, Any]
Returns True if the given data looks sparse, False otherwise.
Used by removeIfSparse(); see that function for details.
By default, the minstd test is only performed on numeric columns, and the mincat/maxcat tests are only performed on integer/categorical columns. This behaviour can be overridden by passing in a ctype of None, in which case all specified tests will be performed.
- Parameters:
data – pandas.Series containing the data.
ctype – The series column type (one of the util.CTYPES values), or None to disable type-specific logic.
- Returns:
A tuple containing:
True if the data is sparse, False otherwise.
If the data is sparse, one of 'minpres', 'minstd', 'mincat', or 'maxcat', indicating the cause of its sparsity. None if the data is not sparse.
If the data is sparse, the value of the criterion which caused the data to fail the test. None if the data is not sparse.
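The shape of the returned tuple can be illustrated with a simplified sketch. The threshold semantics are assumptions, since this page defers to removeIfSparse() for details: looks_sparse_sketch is a hypothetical helper that treats minpres as an absolute count of present values and minstd as a minimum population standard deviation, and it omits the mincat/maxcat tests and the ctype logic entirely.

```python
import numpy as np
import pandas as pd


def looks_sparse_sketch(data, minpres=None, minstd=None):
    """Hypothetical stand-in returning (is_sparse, cause, value)."""
    present = data.dropna()
    # Assumption: minpres is an absolute number of present values
    if minpres is not None and len(present) < minpres:
        return True, 'minpres', len(present)
    # Assumption: minstd is compared against the population std. dev.
    if minstd is not None and len(present) > 0:
        std = float(present.std(ddof=0))
        if std < minstd:
            return True, 'minstd', std
    return False, None, None


mostly_missing = pd.Series([1.0, np.nan, np.nan, np.nan])
constant = pd.Series([3.0, 3.0, 3.0, 3.0])
varied = pd.Series([1.0, 2.0, 3.0, 4.0])

r_missing = looks_sparse_sketch(mostly_missing, minpres=2)
r_constant = looks_sparse_sketch(constant, minpres=2, minstd=0.5)
r_varied = looks_sparse_sketch(varied, minpres=2, minstd=0.5)
```

The first two calls report the failing test and the offending value, while the third returns (False, None, None).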
- funpack.processing_functions_core.matrixRedundantColumns(data: DataFrame, corrthres: float, nathres: float | None = None, precision: str | None = None) → List[Tuple[int, int]]
Identifies redundant columns based on their correlation with each other using dot products to calculate a correlation matrix.
- Parameters:
data – pandas.DataFrame containing the data.
corrthres – Correlation threshold. Columns with a correlation greater than this are identified as redundant.
nathres – Correlation threshold to use for missing values. If provided, columns must have a correlation greater than corrthres and a missing-value correlation greater than nathres to be identified as redundant.
precision – 'double' (the default) or 'single', specifying the floating point precision to use. Note that the algorithm used to calculate the correlation values is unstable for data with a range larger than ~10e5, so double precision is used by default.
- Returns:
Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
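The idea of obtaining a correlation matrix from dot products can be sketched as below. This is a simplified illustration, not the function's actual algorithm: matrix_redundant_sketch is a hypothetical helper, it assumes the data has no missing values, it compares the absolute correlation to the threshold, and the ordering within each reported pair (lower index first) is arbitrary.

```python
import numpy as np
import pandas as pd


def matrix_redundant_sketch(data, corrthres):
    """Hypothetical stand-in: correlation matrix via one matrix product."""
    x = data.to_numpy(dtype=np.float64)
    n = x.shape[0]
    # Standardise each column to zero mean and unit population std. dev.
    x = x - x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = np.nan      # constant columns: correlation undefined
    x = x / std
    # Dot products of standardised columns give Pearson correlations
    corr = (x.T @ x) / n
    # Assumption: absolute correlation is compared to the threshold;
    # NaN entries compare as False. Only the upper triangle is scanned.
    rows, cols = np.where(np.triu(np.abs(corr) > corrthres, k=1))
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [3.0, 5.0, 7.0, 9.0, 11.0],   # y = 2x + 1
    'z': [5.0, 1.0, 4.0, 2.0, 3.0],
})
redundant = matrix_redundant_sketch(df, corrthres=0.99)
```

Only the pair of columns x and y (indices 0 and 1) exceeds the threshold here, since y is a linear function of x.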
- funpack.processing_functions_core.naCorrelation(namask: ndarray, nathres: float, nacounts: ndarray | None = None) → ndarray
Compares the missingness correlation of every pair of columns in namask.
- Parameters:
namask – 2D numpy array of shape (nsamples, ncolumns) containing binary missingness classification.
nathres – Threshold above which column pairs should be classed as correlated.
nacounts – 1D numpy array containing the number of missing values in each column. Calculated if not provided.
- Returns:
A (ncolumns, ncolumns) boolean numpy array containing True for column pairs which exceed nathres, False otherwise.
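One way to compute a missingness correlation for every column pair at once is the Pearson (phi) coefficient of the binary mask columns, sketched below. Whether funpack uses exactly this measure is an assumption. na_correlation_sketch is a hypothetical helper, and it treats pairs with undefined correlation (a column with no missing values, or with all values missing) as not correlated.

```python
import numpy as np


def na_correlation_sketch(namask, nathres):
    """Hypothetical stand-in: pairwise correlation of binary masks."""
    m = namask.astype(np.float64)
    n = m.shape[0]
    counts = m.sum(axis=0)          # missing values per column
    joint = m.T @ m                 # rows where both columns are missing
    # Phi coefficient: (n*n11 - n1*n2) / sqrt(n1*(n-n1) * n2*(n-n2))
    cov = n * joint - np.outer(counts, counts)
    var = counts * (n - counts)
    denom = np.sqrt(np.outer(var, var))
    with np.errstate(divide='ignore', invalid='ignore'):
        corr = cov / denom
    # Assumption: undefined correlations (0 / 0) count as uncorrelated
    return np.nan_to_num(corr, nan=0.0) > nathres


namask = np.array([[1, 1, 0],
                   [0, 0, 0],
                   [1, 1, 1],
                   [0, 0, 0]], dtype=bool)
correlated = na_correlation_sketch(namask, nathres=0.9)
```

The first two columns have identical missingness patterns and are flagged as correlated, while the third column is not correlated strongly enough with either of them.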
- funpack.processing_functions_core.pairwiseRedundantColumns(data: DataFrame, colpairs: ndarray, corrthres: float, token: str | None = None) → List[Tuple[int, int]]
Identifies redundant columns based on their correlation with each other by comparing each pair of columns one by one.
- Parameters:
data – pandas.DataFrame containing the data.
colpairs – numpy array of shape (N, 2), containing indices of the column pairs to be evaluated.
corrthres – Correlation threshold. Columns with a correlation greater than this are identified as redundant.
token – Identifier string for log messages.
- Returns:
Sequence of tuples of column indices, where each tuple (a, b) indicates that column a is redundant with respect to column b.
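The pair-by-pair approach can be sketched with pandas.Series.corr, which computes a Pearson correlation using only rows where both values are present. This is an illustration under assumptions, not the function's implementation: pairwise_redundant_sketch is a hypothetical helper, it compares the absolute correlation to the threshold, and it reports each pair in the order given in colpairs.

```python
import numpy as np
import pandas as pd


def pairwise_redundant_sketch(data, colpairs, corrthres):
    """Hypothetical stand-in: test each listed column pair in turn."""
    redundant = []
    for a, b in colpairs:
        cola = data.iloc[:, a]
        colb = data.iloc[:, b]
        # Pearson correlation over rows where both values are present
        corr = cola.corr(colb)
        # Assumption: absolute correlation is used; NaN compares False
        if abs(corr) > corrthres:
            redundant.append((int(a), int(b)))
    return redundant


frame = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [3.0, 5.0, 7.0, 9.0, 11.0],   # y = 2x + 1
    'z': [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = np.array([[0, 1], [0, 2]])
pairwise_result = pairwise_redundant_sketch(frame, pairs, corrthres=0.99)
```

Of the two pairs evaluated, only (0, 1) is reported, because columns x and z are only weakly correlated.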