funpack.importing.core

This module provides the importData() function, which implements the data importing stage of the funpack sequence.

funpack.importing.core.MERGE_AXIS = 'variables'

Default merge axis when loading multiple data files - see mergeData().

funpack.importing.core.MERGE_AXIS_OPTIONS = ['0', 'rows', 'subjects', '1', 'cols', 'columns', 'variables']

Values accepted for the axis option to the mergeData() function.

funpack.importing.core.MERGE_STRATEGY = 'intersection'

Default merge strategy when loading multiple data files - see mergeData().

funpack.importing.core.MERGE_STRATEGY_OPTIONS = ['naive', 'union', 'intersection', 'inner', 'outer']

Values accepted for the strategy option to the mergeData() function.
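The two merge axes can be illustrated with a minimal pandas sketch. This is not funpack's own merge code (that lives in mergeData()/mergeDataFrames()); it only shows the intended semantics, using hypothetical subject IDs and column names:

```python
import pandas as pd

# Two hypothetical input files, indexed by subject ID.
df1 = pd.DataFrame({'1-0.0': [1, 2]}, index=[10, 11])
df2 = pd.DataFrame({'2-0.0': [3, 4]}, index=[10, 11])
df3 = pd.DataFrame({'1-0.0': [5]},    index=[12])

# axis='variables' ('1', 'cols', 'columns'): each file contributes
# new columns for the same subjects.
byvars = pd.concat([df1, df2], axis=1)

# axis='subjects' ('0', 'rows'): each file contributes new rows
# with the same columns.
bysubj = pd.concat([df1, df3], axis=0)

print(byvars.shape, bysubj.shape)  # (2, 2) (3, 1)
```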

funpack.importing.core.NUM_ROWS = 10000

Default number of rows read at a time by loadData() - it reads the data in chunks.
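The chunked-reading behaviour can be sketched with pandas' own chunksize option. funpack's loader instead dispatches explicit offset/row-count pairs to loadChunk() (which allows parallel loading), so this is only an illustration of the idea, with a deliberately tiny chunk size:

```python
import io
import pandas as pd

NUM_ROWS = 2  # tiny chunk size for illustration; the default is 10000

data = io.StringIO('eid\t1-0.0\n10\t1\n11\t2\n12\t3\n13\t4\n14\t5\n')

# Read NUM_ROWS rows at a time, then concatenate the chunks into
# a single DataFrame.
chunks = pd.read_csv(data, sep='\t', index_col=0, chunksize=NUM_ROWS)
df = pd.concat(chunks)
print(len(df))  # 5
```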

exception funpack.importing.core.VariableMissing[source]

Bases: ValueError

Exception raised by importData() when --fail_if_missing / failIfMissing is active, and a requested variable is not present in the input files.

funpack.importing.core.coerceToNumeric(vartable, series, column)[source]

Coerces the given column to numeric, if necessary.

Parameters:
  • vartable – The variable table

  • series – pandas.Series containing the data to be coerced.

  • column – Column object representing the column to coerce.

Returns:

Coerced pandas.Series
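The kind of coercion this performs can be sketched with pandas.to_numeric. This is an assumption about the behaviour, not funpack's actual implementation: unparseable values in a nominally numeric column are turned into NaN rather than raising an error.

```python
import pandas as pd

# A column expected to be numeric, containing one unparseable value.
series = pd.Series(['1', '2', 'bad', '4'])

# errors='coerce' turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(series, errors='coerce')
print(coerced.isna().sum())  # 1
```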

funpack.importing.core.importData(fileinfo, vartable, proctable, cattable, variables=None, colnames=None, excludeColnames=None, categories=None, subjects=None, subjectExprs=None, exclude=None, excludeVariables=None, excludeCategories=None, trustTypes=False, mergeAxis=None, mergeStrategy=None, indexVisits=False, dropNaRows=False, addAuxVars=False, njobs=1, mgr=None, dryrun=False, failIfMissing=False)[source]

The data import stage.

This function does the following:

  1. Figures out which columns to load (using the columnsToLoad() function).

  2. Loads the data (using loadFiles()),

  3. Creates and returns a DataTable.

Parameters:
  • fileinfo – FileInfo object describing the input file(s).

  • vartable – The variable table

  • proctable – The processing table

  • cattable – The category table

  • variables – List of variable IDs to import

  • colnames – List of names/glob-style wildcard patterns specifying columns to import.

  • excludeColnames – List of column name suffixes specifying columns to exclude.

  • categories – List of category names/IDs to import

  • subjects – List of subjects to include

  • subjectExprs – List of subject inclusion expressions

  • exclude – List of subjects to exclude

  • excludeVariables – List of variables to exclude

  • excludeCategories – List of category names/IDs to exclude

  • trustTypes – If True, it is assumed that columns with a known data type do not contain any bad/unparseable values. This improves performance, but will cause an error if the assumption does not hold.

  • mergeAxis – Merging axis to use when loading multiple data files - see the mergeData() function.

  • mergeStrategy – Merging strategy to use when loading multiple data files - see the mergeData() function.

  • indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.

  • dropNaRows – If True, rows which do not contain data for any columns are not loaded.

  • addAuxVars – If True, data fields which are referred to in the processing rules are selected for import, if present in the input file(s) and not already selected. See the take argument to processing_functions.binariseCategorical() for an example.

  • njobs – Number of processes to use for parallelising tasks.

  • mgr – multiprocessing.Manager object for parallelisation.

  • dryrun – If True the data is not loaded.

  • failIfMissing – Defaults to False. If True, and a requested variable is not present in the input files, a VariableMissing error is raised.

Returns:

A tuple containing:

  • A DataTable, which contains references to the data, and the variable and processing tables.

  • A list of Column objects that were not loaded from each input file.
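The colnames option accepts glob-style wildcard patterns. A minimal sketch of that matching semantics, using Python's fnmatch module and hypothetical UK Biobank-style column names (funpack's actual matching may differ in detail):

```python
import fnmatch

# Hypothetical column names as they might appear in an input file.
allcols  = ['eid', '1-0.0', '1-1.0', '2-0.0', '31-0.0']

# Glob-style patterns passed via colnames select matching columns,
# e.g. every visit/instance of data field 1:
colnames = ['1-*']
selected = [c for c in allcols
            if any(fnmatch.fnmatch(c, p) for p in colnames)]
print(selected)  # ['1-0.0', '1-1.0']
```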

funpack.importing.core.loadChunk(i, offset, fname, vartable, header, allcols, toload, index, nrows, subjects, subjectExprs, exclude, indexByVisit, dropNaRows, encoding, trustTypes, dtypes, datecols, dlargs)[source]

Loads a chunk of nrows from fname, starting at offset.

Parameters:
  • i – Chunk number, just used for logging.

  • offset – Row number to start reading from.

  • fname – Path to the data file

  • vartable – Variable table

  • header – True if the file has a header row, False otherwise.

  • allcols – Sequence of Column objects describing all columns in the file.

  • toload – Sequence of Column objects describing the columns that should be loaded, as generated by columnsToLoad().

  • index – List containing position(s) of index column(s) (starting from 0).

  • nrows – Number of rows to read at a time. Defaults to NUM_ROWS.

  • subjects – List of subjects to include.

  • subjectExprs – List of subject inclusion expressions

  • exclude – List of subjects to exclude

  • indexByVisit – None, or the return value of generateReindexedColumns(), which will be used to re-arrange the loaded data so it is indexed by visit number, in addition to row ID.

  • dropNaRows – If True, rows which do not contain data for any columns are not loaded.

  • encoding – Character encoding (or sequence of encodings, one for each data file). Defaults to latin1.

  • trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.

  • dtypes – A dict of { column_name : dtype } mappings containing a suitable internal data type to use for some columns.

  • datecols – List of column names denoting columns which should be interpreted as dates/times.

  • dlargs – Dict of arguments to pass through to pandas.read_csv.

Returns:

pandas.DataFrame
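The offset/nrows mechanics can be sketched with pandas.read_csv. This is only an illustration under the assumption that offset counts data rows after the header; loadChunk() itself handles many more details (types, encodings, subject filtering):

```python
import io
import pandas as pd

text = 'eid\t1-0.0\n' + ''.join(f'{10 + i}\t{i}\n' for i in range(6))

# Chunk i = 1 with nrows = 2 starts at data-row offset = 2: skip the
# first two data rows (but not the header), then read nrows rows.
offset, nrows = 2, 2
chunk = pd.read_csv(io.StringIO(text), sep='\t', index_col=0,
                    skiprows=range(1, offset + 1), nrows=nrows)
print(list(chunk.index))  # [12, 13]
```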

funpack.importing.core.loadFile(fname, fileinfo, vartable, toload, nrows=None, subjects=None, subjectExprs=None, exclude=None, indexVisits=False, dropNaRows=False, trustTypes=False, pool=None)[source]

Loads data from the specified file. The file is loaded in chunks of nrows rows, using the loadChunk() function.

Parameters:
  • fname – Path to the data file

  • fileinfo – FileInfo object describing the input file.

  • vartable – Variable table

  • toload – Sequence of Column objects describing the columns that should be loaded, as generated by columnsToLoad().

  • nrows – Number of rows to read at a time. Defaults to NUM_ROWS.

  • subjects – List of subjects to include.

  • subjectExprs – List of subject inclusion expressions

  • exclude – List of subjects to exclude

  • indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.

  • dropNaRows – If True, rows which do not contain data for any columns are not loaded.

  • trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.

  • pool – multiprocessing.Pool object for running tasks in parallel.

Returns:

A tuple containing:

  • A pandas.DataFrame containing the data.

  • A list of Column objects representing the columns that were loaded. Note that this may be different from toload, if indexVisits was set.

funpack.importing.core.loadFiles(fileinfo, vartable, columns, nrows=None, subjects=None, subjectExprs=None, exclude=None, trustTypes=False, mergeAxis=None, mergeStrategy=None, indexVisits=False, dropNaRows=False, pool=None, dryrun=False)[source]

Load data from datafiles, using mergeDataFrames() if multiple files are provided.

Parameters:
  • fileinfo – FileInfo object describing the input file(s).

  • vartable – Variable table

  • columns – Dict of { file : [Column] } mappings, defining the columns to load, as returned by columnsToLoad().

  • nrows – Number of rows to read at a time. Defaults to NUM_ROWS.

  • subjects – List of subjects to include.

  • subjectExprs – List of subject inclusion expressions

  • exclude – List of subjects to exclude

  • trustTypes – Assume that columns with known data type do not contain any bad/unparseable values.

  • mergeAxis – Merging axis to use when loading multiple data files - see the mergeDataFrames() function. Defaults to MERGE_AXIS.

  • mergeStrategy – Strategy for merging multiple data files - see the mergeDataFrames() function. Defaults to MERGE_STRATEGY.

  • indexVisits – Re-arrange the data so that rows are indexed by subject ID and visit, rather than visits being split into separate columns. Only applied to variables which are labelled with Instancing 2.

  • dropNaRows – If True, rows which do not contain data for any columns are not loaded.

  • pool – multiprocessing.Pool object for running tasks in parallel.

  • dryrun – If True, the data is not loaded.

Returns:

A tuple containing:

  • A pandas.DataFrame containing the data, or None if dryrun is True.

  • A list of Column objects representing the columns that were loaded.
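The difference between the 'intersection' and 'union' merge strategies can be sketched with pandas joins. Again, this is only an illustration of the intended semantics on hypothetical subject IDs, not funpack's mergeDataFrames() implementation:

```python
import pandas as pd

df1 = pd.DataFrame({'1-0.0': [1, 2]}, index=[10, 11])
df2 = pd.DataFrame({'2-0.0': [3, 4]}, index=[11, 12])

# strategy='intersection' ('inner'): keep only subjects present
# in every file.
inner = pd.concat([df1, df2], axis=1, join='inner')

# strategy='union' ('outer'): keep every subject, padding missing
# values with NaN.
outer = pd.concat([df1, df2], axis=1, join='outer')

print(list(inner.index), list(outer.index))  # [11] [10, 11, 12]
```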