funpack.importing.core
This module provides the importData() function, which implements the data importing stage of the funpack sequence.
- funpack.importing.core.MERGE_AXIS = 'variables'
Default merge axis when loading multiple data files - see mergeData().
- funpack.importing.core.MERGE_AXIS_OPTIONS = ['0', 'rows', 'subjects', '1', 'cols', 'columns', 'variables']
Values accepted for the axis option to the mergeData() function.
- funpack.importing.core.MERGE_STRATEGY = 'intersection'
Default merge strategy when loading multiple data files - see mergeData().
- funpack.importing.core.MERGE_STRATEGY_OPTIONS = ['naive', 'union', 'intersection', 'inner', 'outer']
Values accepted for the strategy option to the mergeData() function.
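The 'intersection'/'inner' and 'union'/'outer' strategies behave like the corresponding pandas index joins. A minimal sketch with plain pandas (the subject IDs and column names here are made up for illustration, and this is not FUNPACK's actual merge code):

```python
import pandas as pd

# Two hypothetical input files sharing some subject IDs (index = subject ID).
a = pd.DataFrame({'1-0.0': [1, 2, 3]}, index=[100, 101, 102])
b = pd.DataFrame({'2-0.0': [4, 5]},    index=[101, 102])

# 'intersection' (a.k.a. 'inner'): keep only subjects present in every file.
inner = a.join(b, how='inner')

# 'union' (a.k.a. 'outer'): keep all subjects, padding missing values with NaN.
outer = a.join(b, how='outer')

print(inner.index.tolist())  # [101, 102]
print(outer.index.tolist())  # [100, 101, 102]
```

With 'union', subject 100 is retained but has a missing value for the column it never appeared in.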
- funpack.importing.core.NUM_ROWS = 10000
Default number of rows read at a time by loadData() - it reads the data in chunks.
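Chunked reading of this kind can be done directly with pandas via the chunksize argument, which returns an iterator of DataFrames instead of one large frame. A small self-contained sketch (with a tiny chunk size rather than the 10000-row default, and made-up data):

```python
import io
import pandas as pd

NUM_ROWS = 2  # small chunk size for demonstration only

data = io.StringIO('eid,1-0.0\n100,1\n101,2\n102,3\n103,4\n')

# With chunksize set, read_csv yields one DataFrame per chunk, so each
# chunk can be filtered/processed without loading the whole file at once.
chunks = list(pd.read_csv(data, index_col='eid', chunksize=NUM_ROWS))
print([len(c) for c in chunks])  # [2, 2]
```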
- exception funpack.importing.core.VariableMissing[source]
Bases: ValueError
Exception raised by importData() when --fail_if_missing / failIfMissing is active, and a requested variable is not present in the input files.
- funpack.importing.core.coerceToNumeric(vartable, series, column)[source]
Coerces the given column to numeric, if necessary.
- Parameters:
vartable – The variable table
series – pandas.Series containing the data to be coerced.
column – Column object representing the column to coerce.
- Returns:
Coerced pandas.Series
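The underlying pandas mechanism for this kind of coercion is pandas.to_numeric, whose errors='coerce' mode replaces unparseable entries with NaN. A hedged illustration (not FUNPACK's actual implementation, which also consults the variable table):

```python
import pandas as pd

# A column read as strings because it contains one unparseable value.
series = pd.Series(['1', '2', 'bad', '4'])

# errors='coerce' turns unparseable entries into NaN, producing a
# numeric (float) dtype instead of an object dtype.
coerced = pd.to_numeric(series, errors='coerce')
print(coerced.dtype)         # float64
print(int(coerced.isna().sum()))  # 1
```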
- funpack.importing.core.importData(fileinfo, vartable, proctable, cattable, variables=None, colnames=None, excludeColnames=None, categories=None, subjects=None, subjectExprs=None, exclude=None, excludeVariables=None, excludeCategories=None, trustTypes=False, mergeAxis=None, mergeStrategy=None, indexVisits=False, dropNaRows=False, addAuxVars=False, njobs=1, mgr=None, dryrun=False, failIfMissing=False)[source]
The data import stage.
This function does the following:
1. Figures out which columns to load (using the columnsToLoad() function).
2. Loads the data (using loadFiles()).
3. Creates and returns a DataTable.
- Parameters:
fileinfo – FileInfo object describing the input file(s).
vartable – The data coding table
proctable – The processing table
cattable – The category table
variables – List of variable IDs to import
colnames – List of names/glob-style wildcard patterns specifying columns to import.
excludeColnames – List of column name suffixes specifying columns to exclude.
categories – List of category names/IDs to import
subjects – List of subjects to include
subjectExprs – List of subject inclusion expressions
exclude – List of subjects to exclude
excludeVariables – List of variables to exclude
excludeCategories – List of category names/IDs to exclude
trustTypes – If True, it is assumed that columns with a known data type do not contain any bad/unparseable values. This improves performance, but will cause an error if the assumption does not hold.
mergeAxis – Merging axis to use when loading multiple data files - see the mergeData() function.
mergeStrategy – Merging strategy to use when loading multiple data files - see the mergeData() function.
indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.
dropNaRows – If True, rows which do not contain data for any columns are not loaded.
addAuxVars – If True, data fields which are referred to in the processing rules are selected for import if present in the input file(s) and not already selected. See the take argument to processing_functions.binariseCategorical() for an example.
njobs – Number of processes to use for parallelising tasks.
mgr – multiprocessing.Manager object for parallelisation
dryrun – If True, the data is not loaded.
failIfMissing – Defaults to False. If True, and a requested variable is not present in the input files, a VariableMissing error is raised.
- Returns:
A tuple containing:
- A DataTable, which contains references to the data, and the variable and processing tables.
- A list of Column objects that were not loaded from each input file.
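The colnames parameter accepts glob-style wildcard patterns. One way such matching can be expressed is with the standard fnmatch module; this is an illustrative sketch under that assumption, with made-up column names, not FUNPACK's actual selection code:

```python
import fnmatch

# All columns present in a hypothetical input file.
allcols  = ['eid', '1-0.0', '1-1.0', '2-0.0', '31-0.0']

# Requested names/patterns, as might be passed via colnames.
patterns = ['1-*', '31-0.0']

# Keep every column that matches at least one name or pattern.
selected = [c for c in allcols
            if any(fnmatch.fnmatch(c, p) for p in patterns)]
print(selected)  # ['1-0.0', '1-1.0', '31-0.0']
```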
- funpack.importing.core.loadChunk(i, offset, fname, vartable, header, allcols, toload, index, nrows, subjects, subjectExprs, exclude, indexByVisit, dropNaRows, encoding, trustTypes, dtypes, datecols, dlargs)[source]
Loads a chunk of nrows rows from fname, starting at offset.
- Parameters:
i – Chunk number, just used for logging.
offset – Row number to start reading from.
fname – Path to the data file
vartable – Variable table
header – True if the file has a header row, False otherwise.
allcols – Sequence of Column objects describing all columns in the file.
toload – Sequence of Column objects describing the columns that should be loaded, as generated by columnsToLoad().
index – List containing position(s) of index column(s) (starting from 0).
nrows – Number of rows to read at a time. Defaults to NUM_ROWS.
subjects – List of subjects to include.
subjectExprs – List of subject inclusion expressions
exclude – List of subjects to exclude
indexByVisit – None, or the return value of generateReindexedColumns(), which will be used to re-arrange the loaded data so it is indexed by visit number, in addition to row ID.
dropNaRows – If True, rows which do not contain data for any columns are not loaded.
encoding – Character encoding (or sequence of encodings, one for each data file). Defaults to latin1.
trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.
dtypes – A dict of { column_name : dtype } mappings containing a suitable internal data type to use for some columns.
datecols – List of column names denoting columns which should be interpreted as dates/times.
dlargs – Dict of arguments to pass through to pandas.read_csv.
- Returns:
pandas.DataFrame
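Reading one chunk at a given row offset maps naturally onto the skiprows and nrows arguments of pandas.read_csv. A self-contained sketch of that mechanism (the helper name load_chunk and the sample data are hypothetical, and this is not FUNPACK's implementation):

```python
import io
import pandas as pd

text = 'eid,1-0.0\n100,1\n101,2\n102,3\n103,4\n104,5\n'

def load_chunk(f, offset, nrows, names):
    # Skip the header row plus `offset` data rows, then read `nrows` rows.
    # Column names must be supplied because the header row is skipped.
    return pd.read_csv(f, skiprows=offset + 1, nrows=nrows,
                       header=None, names=names, index_col='eid')

chunk = load_chunk(io.StringIO(text), offset=2, nrows=2, names=['eid', '1-0.0'])
print(chunk.index.tolist())  # [102, 103]
```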
- funpack.importing.core.loadFile(fname, fileinfo, vartable, toload, nrows=None, subjects=None, subjectExprs=None, exclude=None, indexVisits=False, dropNaRows=False, trustTypes=False, pool=None)[source]
Loads data from the specified file. The file is loaded in chunks of nrows rows, using the loadChunk() function.
- Parameters:
fname – Path to the data file
fileinfo – FileInfo object describing the input file.
vartable – Variable table
toload – Sequence of Column objects describing the columns that should be loaded, as generated by columnsToLoad().
nrows – Number of rows to read at a time. Defaults to NUM_ROWS.
subjects – List of subjects to include.
subjectExprs – List of subject inclusion expressions
exclude – List of subjects to exclude
indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.
dropNaRows – If True, rows which do not contain data for any columns are not loaded.
trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.
pool – multiprocessing.Pool object for running tasks in parallel.
- Returns:
A tuple containing:
- A pandas.DataFrame containing the data.
- A list of Column objects representing the columns that were loaded. Note that this may be different from toload, if indexByVisit was set.
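The dropNaRows behaviour (dropping rows with no data in any column) corresponds to pandas' dropna with how='all'. A brief illustration with made-up data, sketching the idea rather than FUNPACK's code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'1-0.0': [1,      np.nan, np.nan],
                   '2-0.0': [np.nan, np.nan, 3]},
                  index=[100, 101, 102])

# how='all' drops only rows that are missing in every column,
# keeping rows that have at least one non-missing value.
dropped = df.dropna(how='all')
print(dropped.index.tolist())  # [100, 102]
```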
- funpack.importing.core.loadFiles(fileinfo, vartable, columns, nrows=None, subjects=None, subjectExprs=None, exclude=None, trustTypes=False, mergeAxis=None, mergeStrategy=None, indexVisits=False, dropNaRows=False, pool=None, dryrun=False)[source]
Load data from datafiles, using mergeDataFrames() if multiple files are provided.
- Parameters:
fileinfo – FileInfo object describing the input file(s).
vartable – Variable table
columns – Dict of { file : [Column] } mappings, defining the columns to load, as returned by columnsToLoad().
nrows – Number of rows to read at a time. Defaults to NUM_ROWS.
subjects – List of subjects to include.
subjectExprs – List of subject inclusion expressions
exclude – List of subjects to exclude
trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.
mergeAxis – Merging axis to use when loading multiple data files - see the mergeDataFrames() function. Defaults to MERGE_AXIS.
mergeStrategy – Strategy for merging multiple data files - see the mergeData() function. Defaults to MERGE_STRATEGY.
indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.
dropNaRows – If True, rows which do not contain data for any columns are not loaded.
pool – multiprocessing.Pool object for running tasks in parallel.
dryrun – If True, the data is not loaded.
- Returns:
A tuple containing:
- A pandas.DataFrame containing the data, or None if dryrun is True.
- A list of Column objects representing the columns that were loaded.
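The two merge axes correspond to the two directions of a pandas concatenation: on the 'variables' axis, files contribute different columns for the same subjects; on the 'subjects' axis, they contribute different subjects for the same columns. A hedged sketch with made-up data, not FUNPACK's merge code:

```python
import pandas as pd

a = pd.DataFrame({'1-0.0': [1, 2]}, index=[100, 101])
b = pd.DataFrame({'2-0.0': [3, 4]}, index=[100, 101])
c = pd.DataFrame({'1-0.0': [5, 6]}, index=[102, 103])

# 'variables' axis: concatenate side by side (columns grow).
byvars = pd.concat([a, b], axis=1)

# 'subjects' axis: stack vertically (rows grow).
bysubj = pd.concat([a, c], axis=0)

print(byvars.shape)  # (2, 2)
print(bysubj.shape)  # (4, 1)
```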