funpack.importing.filter

This module contains functions used by the core.importData() function to identify which columns should be imported, and to filter rows from a data frame after it has been loaded.

funpack.importing.filter.REMOVE_DUPLICATE_COLUMN_IDENTIFIER = '.REMOVE_DUPLICATE'

Identifier which is appended to the names of duplicate columns that are to be removed. Use of this identifier is not hard-coded anywhere - this module is just a convenient location for its definition. See the funpack.main.doImport() function.
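As an illustration only, the sketch below shows one way a suffix like this could be used to tag repeated column names so they can be dropped later. The helper tag_duplicates and the sample column names are hypothetical and are not part of funpack; only the identifier value is taken from the definition above.

```python
# Value copied from the definition above.
REMOVE_DUPLICATE_COLUMN_IDENTIFIER = '.REMOVE_DUPLICATE'


def tag_duplicates(names):
    """Hypothetical helper: append the identifier to repeated names."""
    seen = set()
    tagged = []
    for name in names:
        if name in seen:
            # Second and later occurrences are marked for removal.
            tagged.append(name + REMOVE_DUPLICATE_COLUMN_IDENTIFIER)
        else:
            seen.add(name)
            tagged.append(name)
    return tagged


tagged = tag_duplicates(['eid', '1-0.0', '1-0.0'])

# Columns carrying the suffix can then be filtered out by name.
kept = [n for n in tagged
        if not n.endswith(REMOVE_DUPLICATE_COLUMN_IDENTIFIER)]
```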

funpack.importing.filter.addAuxillaryVariables(fileinfo: FileInfo, proctable: DataFrame, variables: Sequence[int] | None = None, exclude: Sequence[int] | None = None) -> Tuple[Sequence[int] | None, Sequence[int] | None]

Checks that auxiliary variables referred to by processing rules are to be loaded.

Parameters:
  • fileinfo – FileInfo object describing the input file(s).

  • proctable – Processing table

  • variables – Variables to load, as returned by restrictVariables()

  • exclude – Variables to exclude, as returned by restrictVariables()

Returns:

A tuple containing:

  • a sequence of variables to load, or None if all variables should be loaded.

  • a sequence of variables to exclude, or None if no variables should be excluded.

funpack.importing.filter.columnsToLoad(fileinfo, vartable, variables, exclude=None, colnames=None, excludeColnames=None)

Determines which columns should be loaded from the data files.

Peeks at the first line of the data file (assumed to contain column names), then uses the variable table to determine which of them should be loaded.

Parameters:
  • fileinfo – FileInfo object describing the input file(s).

  • vartable – Variable table

  • variables – List of variables to load.

  • exclude – List of variables to exclude.

  • colnames – List of column names/glob-style wildcard patterns, specifying columns to load.

  • excludeColnames – List of column name suffixes specifying columns to exclude. This overrides colnames.

Returns:

A tuple containing:

  • A dict of { file : [Column] } mappings, the Column objects to load from each input file. The columns (including the index column) are ordered as they appear in the file.

  • A list containing the Column objects to ignore.
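The interaction between colnames and excludeColnames can be sketched with plain strings. This is a minimal illustration, not funpack's implementation: the real function works with the variable table and Column objects, whereas the hypothetical helper select_columns below only matches names, using the standard-library fnmatch module for the glob-style patterns. The header names are made up.

```python
from fnmatch import fnmatchcase


def select_columns(header, colnames=None, exclude_suffixes=None):
    """Hypothetical helper: split header names into (load, ignore)."""
    load, ignore = [], []
    for name in header:
        # With no patterns given, every column is a candidate.
        wanted = (colnames is None or
                  any(fnmatchcase(name, p) for p in colnames))
        # Suffix exclusions take priority over the inclusion patterns.
        if exclude_suffixes and name.endswith(tuple(exclude_suffixes)):
            wanted = False
        (load if wanted else ignore).append(name)
    # Both lists keep the order in which names appear in the header.
    return load, ignore


header = ['eid', '31-0.0', '20002-0.0', '20002-1.0']
load, ignore = select_columns(header, ['eid', '20002-*'], ['-1.0'])
```

Here '20002-1.0' matches the inclusion pattern but is still ignored, because the suffix exclusion overrides it.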

funpack.importing.filter.evaluateSubjectExpression(data, expr, cols)

Evaluates the given variable expression for each row in the data.

Parameters:
  • data – A pandas.DataFrame instance.

  • expr – String containing a variable expression

  • cols – Dict of {vid : [Column]} mappings

Returns:

A boolean numpy array containing the result of evaluating the expression at each row, or None indicating that the expression was not evaluated (and every row passed).

funpack.importing.filter.evaluateSubjectExpressions(data, allcols, subjectExprs)

Evaluates subjectExprs to identify which subjects (rows) should be retained in the data.

Parameters:
  • data – A pandas.DataFrame instance.

  • allcols – List of Column objects describing every column in the data set.

  • subjectExprs – List of strings containing expressions which identify subjects to be included. Subjects for which any expression evaluates to True will be included.

Returns:

A 1D boolean numpy array containing True for subjects to be included and False for subjects to be excluded, or None, indicating that the expressions were not evaluated (and all rows passed).
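The "any expression evaluates to True" rule amounts to a logical OR across per-expression masks. The sketch below illustrates that rule with plain Python lists rather than numpy arrays. The helper combine_masks is hypothetical, and skipping unevaluated (None) masks while others are present is an assumption, not something the documentation states.

```python
def combine_masks(masks):
    """Hypothetical helper: OR together per-expression row masks.

    Each entry is a list of booleans (one per row), or None when the
    expression was not evaluated.
    """
    evaluated = [m for m in masks if m is not None]
    if not evaluated:
        # Nothing was evaluated, so every row passes.
        return None
    # A row is kept when any evaluated expression is True for it.
    return [any(row) for row in zip(*evaluated)]


included = combine_masks([[True, False, False],
                          None,
                          [False, False, True]])
nothing_evaluated = combine_masks([None, None])
```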

funpack.importing.filter.filterSubjects(data, cols, subjects=None, subjectExprs=None, exclude=None)

Removes rows (subjects) from the data based on subjects to include, conditional subjectExprs, and subjects to exclude.

Parameters:
  • data – A pandas.DataFrame instance.

  • cols – List of Column objects describing every column in the data set.

  • subjects – List of subjects to include.

  • subjectExprs – List of subject inclusion expressions

  • exclude – List of subjects to exclude

Returns:

A pandas.DataFrame, potentially with rows removed.

funpack.importing.filter.restrictVariables(cattable: DataFrame, variables: Sequence[int] | None = None, categories: Sequence[int | str] | None = None, excludeVariables: Sequence[int] | None = None, excludeCategories: Sequence[int | str] | None = None) -> Tuple[Sequence[int] | None, Sequence[int] | None]

Determines which variables should be loaded (and the order in which they should appear in the output), and which variables should be excluded, based on the given variables, categories, excludeVariables, and excludeCategories.

Parameters:
  • cattable – The category table

  • variables – List of variable IDs to import.

  • categories – List of category names or IDs to import.

  • excludeVariables – List of variable IDs to exclude.

  • excludeCategories – List of category names or IDs to exclude.

Returns:

A tuple containing:

  • a sequence of variables to load, or None if all variables should be loaded.

  • a sequence of variables to exclude, or None if no variables should be excluded.
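A rough sketch of how an ordered load list might be assembled is given below. It is a simplification under stated assumptions: the category table is replaced by a hypothetical dict mapping category names to variable IDs, explicitly listed variables are placed before category members, and exclusions are removed directly from the load list rather than returned separately. None of these details are confirmed by the documentation above, and the helper restrict is not part of funpack.

```python
def restrict(categories_to_vars, variables=None, categories=None,
             exclude_variables=None):
    """Hypothetical helper: build an ordered, de-duplicated load list."""
    if variables is None and categories is None:
        # No restriction requested: None stands for "load everything".
        return None
    ordered = list(variables or [])
    for cat in categories or []:
        ordered.extend(categories_to_vars[cat])
    excluded = set(exclude_variables or [])
    # dict.fromkeys keeps the first occurrence of each ID, in order.
    return [v for v in dict.fromkeys(ordered) if v not in excluded]


# Made-up category contents, for illustration only.
cats = {'age': [34, 52], 'sex': [31]}
to_load = restrict(cats, variables=[31, 21022],
                   categories=['age', 'sex'], exclude_variables=[52])
everything = restrict(cats)
```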