funpack.processing
This module contains functionality for running “processes”, processing steps which are applied to one or more variables, and which may result in columns being removed from, or new columns being added to, the data table. These processes are specified in the processing table.
The functionality in this module is also used to manage cleaning functions, which are specified in the Clean column of the variable table. Definitions of all available (pre-)processing functions are in the cleaning_functions and processing_functions modules.
The processData() function is also defined here - it executes the processes defined in the processing table.
Logic for parsing processing step strings can be found in the funpack.parsing.process module.
Special processing functions can be applied to a variable’s data by adding them to the Clean column of the variable table, or to the Process column of the processing table. Processing is specified as a comma-separated list of process functions - for example:
process1, process2(), process3('arg1', arg2=1234)
The parseProcesses() function parses such a line, and returns a list of Process objects which can be used to query the process name and arguments, and to run each process.
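As a rough illustration of what parsing such a line involves, the sketch below splits a specification into (name, positional arguments, keyword arguments) tuples using Python's standard ast module. It is not the parser in funpack.parsing.process; the function name parse_spec and the tuple layout are hypothetical, and it assumes that all arguments are Python literals.

```python
import ast


def parse_spec(spec):
    """Parse a process specification into (name, args, kwargs) tuples.

    Hypothetical helper for illustration - not part of FUNPACK.
    """
    # Wrapping the specification in brackets turns it into a Python list
    # expression, which the standard ast module can parse for us.
    tree = ast.parse(f"[{spec}]", mode="eval")
    parsed = []
    for node in tree.body.elts:
        if isinstance(node, ast.Name):
            # Bare name, e.g. "process1": no arguments
            parsed.append((node.id, [], {}))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Call syntax, e.g. "process3('arg1', arg2=1234)"
            args = [ast.literal_eval(a) for a in node.args]
            kwargs = {k.arg: ast.literal_eval(k.value) for k in node.keywords}
            parsed.append((node.func.id, args, kwargs))
        else:
            raise ValueError(f"Unsupported process specification: {spec!r}")
    return parsed


steps = parse_spec("process1, process2(), process3('arg1', arg2=1234)")
for name, args, kwargs in steps:
    print(name, args, kwargs)
```

Using ast.literal_eval means that only literal argument values (numbers, strings, lists and so on) are accepted, so arbitrary expressions in a specification are rejected rather than evaluated.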
The processing table
Within the processing table, the above process specification is preceded by a variable definition, which states which variables the process should be applied to, and how it should be applied - either independently, or together. This information is used to parallelise the processing where possible.
For each process, the processing table contains a “process variable type”, a list of vids, and the process itself. The process variable type is one of the following:
No process variable type, simply a comma-separated list of vids. The process is applied to the specified vids:
1,2,3 processName
'independent': The process is applied independently to the specified vids:
independent,1,2,3 processName
'all': The process is applied to all vids:
all processName
'all_independent': The process is applied independently to all vids:
all_independent processName
'all_except': The process is applied to all vids except the specified ones:
all_except,1,2,3 processName
'all_independent_except': The process is applied independently to all vids except the specified ones:
all_independent_except,1,2,3 processName
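The rule formats above can be pulled apart with a few lines of Python. This is a minimal sketch rather than FUNPACK's own table loader: split_rule is a hypothetical name, and the sketch assumes that the variable definition contains no whitespace and that vids are written as plain integers.

```python
# The process variable types listed above
PROCESS_VARIABLE_TYPES = {
    "independent",
    "all",
    "all_independent",
    "all_except",
    "all_independent_except",
}


def split_rule(rule):
    """Split a rule into (process variable type, vids, process string).

    Hypothetical helper for illustration - not part of FUNPACK.
    """
    # Assumption: the first run of whitespace separates the variable
    # definition from the process specification.
    variables, process = rule.split(None, 1)
    tokens = variables.split(",")
    ptype = None
    if tokens[0] in PROCESS_VARIABLE_TYPES:
        ptype = tokens.pop(0)
    # Whatever remains is the list of variable IDs (possibly empty)
    vids = [int(t) for t in tokens]
    return ptype, vids, process


print(split_rule("1,2,3 processName"))
print(split_rule("all_independent_except,1,2,3 processName"))
```

A rule with no process variable type yields None as its type, which keeps the plain list-of-vids case distinct from the named types.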
For example, the processing_functions.binariseCategorical() function applies its logic independently to each variable, so it makes sense to specify that it should be applied independently to a set of variables:
independent,1,2,3,4,5 binariseCategorical
However, the removeIfRedundant() function works on a collection of variables (probably all variables in the data set), and this must be specified in the processing table:
all,1,2,3,4,5 removeIfRedundant(0.99)
Broadcasting arguments
Warning
Broadcasting arguments are deprecated, and will be removed in FUNPACK 4.0.0. The alternative to broadcasting is for processing functions to perform their own parallelisation.
When a process is applied independently to more than one variable, the input arguments to the process may need to be different for each variable. This can be accomplished by using a broadcast argument - simply prefix the argument name with 'broadcast_', and then specify a list containing the argument values for each variable. For example, the following specification will result in the processing_functions.binariseCategorical() process being applied independently to variables 1, 2, and 3, with values taken from variables 4, 5, and 6 respectively:
independent,1,2,3 binariseCategorical(broadcast_take=[4, 5, 6])
Note that broadcast arguments are useful as a performance optimisation - the above specification is functionally equivalent to:
1 binariseCategorical(take=4)
2 binariseCategorical(take=5)
3 binariseCategorical(take=6)
however the latter example cannot take advantage of parallelism in a multi-core environment.
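The equivalence described above can be sketched as a small expansion step, which turns one broadcast specification into one set of keyword arguments per variable. This is an illustration only, not FUNPACK's implementation; expand_broadcast is a hypothetical name, and the sketch assumes each broadcast list has exactly one entry per variable.

```python
def expand_broadcast(vids, kwargs):
    """Expand 'broadcast_' arguments into per-variable keyword arguments.

    Hypothetical helper for illustration - not part of FUNPACK.
    """
    prefix = "broadcast_"
    calls = []
    for i, vid in enumerate(vids):
        expanded = {}
        for name, value in kwargs.items():
            if name.startswith(prefix):
                # Assumption: one broadcast value per variable
                if len(value) != len(vids):
                    raise ValueError(
                        f"{name} must contain {len(vids)} values")
                # Strip the prefix and pick this variable's value
                expanded[name[len(prefix):]] = value[i]
            else:
                # Ordinary arguments are passed to every variable unchanged
                expanded[name] = value
        calls.append((vid, expanded))
    return calls


for vid, args in expand_broadcast([1, 2, 3], {"broadcast_take": [4, 5, 6]}):
    print(vid, args)
```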
- funpack.processing.processData(dtable)
Applies all processing specified in the processing table to the data.
- Parameters:
dtable – The DataTable instance.
- funpack.processing.retrieveProcess(dtable, procIdx)
Used by processData(). Retrieves the process at index procIdx in the processing table, and generates the variable groups that the process should be applied to.
- Parameters:
dtable – The DataTable
procIdx – Index into the processing table
- Returns:
A tuple containing:
    - A dict of { name : Process } mappings containing the Process objects to be executed.
    - A list of lists, each list containing a group of variable IDs that the process should be applied to.
    - True if the process can be applied in parallel across the variable groups, False otherwise.
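To make the idea of variable groups concrete, the sketch below derives groups from a process variable type and a list of vids. It is not the logic used by retrieveProcess(); variable_groups is a hypothetical name, and the sketch assumes that independent types produce one group per variable which may be run in parallel, while all other types produce a single group.

```python
def variable_groups(ptype, vids, all_vids):
    """Return (groups, parallel) for a process variable type.

    Hypothetical helper for illustration - not part of FUNPACK.
    ptype is None when the rule is a plain list of vids.
    """
    if ptype in ("all", "all_independent"):
        selected = list(all_vids)
    elif ptype in ("all_except", "all_independent_except"):
        excluded = set(vids)
        selected = [v for v in all_vids if v not in excluded]
    else:
        selected = list(vids)

    independent = ptype in (
        "independent", "all_independent", "all_independent_except")
    if independent:
        # Assumption: one group per variable, safe to run in parallel
        return [[v] for v in selected], True
    # Assumption: a single group containing all selected variables
    return [selected], False


print(variable_groups("independent", [1, 2, 3], [1, 2, 3, 4, 5]))
print(variable_groups("all_except", [1, 2, 3], [1, 2, 3, 4, 5]))
```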
- funpack.processing.runParallelProcess(proc, dtable, vids, workDir, broadcastIndex=None)
Used by runProcess(). Calls proc.run, and returns its result and the (possibly modified) dtable.
- Parameters:
proc – Process to run
dtable – DataTable instance (probably a subtable)
vids – List of variable IDs
workDir – Directory to save new columns to if running in a worker process - this is used to reduce the amount of data that must be transferred between processes.
broadcastIndex – Deprecated. Index to use for broadcast arguments - passed through to the Process.run() method.
- Returns:
A tuple containing:
    - A reference to dtable
    - The result of proc.run()
- funpack.processing.runProcess(proc, dtable, vids, workDir, parallel)
Called by processData(). Runs the given process, and updates the DataTable as needed.
- Parameters:
proc – Process to run.
dtable – DataTable containing the data.
vids – List of lists, groups of variable IDs to run the process on.
workDir – Directory to save/load new columns to/from, if the processing is performed in a worker process.
parallel – If True, each variable group is processed in parallel. Otherwise they are processed sequentially.
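The effect of the parallel flag can be sketched with the standard library. This is not how runProcess() is implemented: run_groups is a hypothetical name, and a thread pool is used here only to keep the example self-contained, whereas the documentation above refers to worker processes which exchange new columns via workDir.

```python
from concurrent.futures import ThreadPoolExecutor


def run_groups(func, groups, parallel):
    """Apply func to each variable group, in parallel or sequentially.

    Hypothetical helper for illustration - not part of FUNPACK.
    """
    if parallel:
        with ThreadPoolExecutor() as pool:
            # map() yields results in the same order as the input groups
            return list(pool.map(func, groups))
    return [func(group) for group in groups]


# A stand-in "process" which simply sums the variable IDs in each group
results = run_groups(sum, [[1, 2], [3], [4, 5, 6]], parallel=True)
print(results)
```

Both branches return results in group order, so the caller can handle them identically regardless of how the groups were executed.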
- funpack.processing.unpackResults(proc, result)
Parses the results returned by a Process. See the processing_functions module for details on what can be returned by a processing function.
- Parameters:
proc – The Process
result – The value returned by Process.run().
- Returns:
A tuple containing:
    - List of columns to remove
    - List of new series to add
    - List of variable IDs for new series
    - List of Column keyword arguments for new series