funpack.fileinfo

This module contains the FileInfo class, and the sniff() and fileinfo() functions, for getting information about input data files.

class funpack.fileinfo.FileInfo(datafiles, indexes=None, loaders=None, encodings=None, renameDuplicates=False, renameSuffix=None)[source]

Bases: object

The FileInfo class is a container for the information generated by the fileinfo() function for a collection of input datafiles.

property allColumns: Sequence[Column]

Returns a list containing all columns from all data files. The result is just a concatenation of the lists returned by columns() for each data file.

property allVariables: Sequence[int]

Returns a list containing all variable IDs from all data files. Duplicates are removed, and the IDs are sorted.

columns(datafile)[source]

Return a list of Column objects representing each of the columns that are present in the given datafile.

property datafiles

Return a list containing the data files.

dialect(datafile)[source]

Return the CSV dialect type for the given datafile.

encoding(datafile)[source]

Return the encoding for the given datafile, or None if no custom encoding was specified.

header(datafile)[source]

Return True if the given datafile has a header row, False otherwise.

index(datafile)[source]

Return the index column for the given data file.

loader(datafile)[source]

Return the custom loader for the given datafile, or None if there is no custom loader.

sniffer(datafile)[source]

Return the custom sniffer for the given datafile, or None if there is no custom sniffer. This is equivalent to loader(), as sniffer/loader functions are always paired.

funpack.fileinfo.fileinfo(datafiles, indexes=None, sniffers=None, encodings=None, renameDuplicates=False, renameSuffix=None)[source]

Identifies the format of each input data file, and extracts/generates column names and variable IDs for every column.

Parameters:
  • datafiles – Sequence of data files to be loaded.

  • indexes – Dict containing {filename : [index]} mappings, specifying which column(s) to use as the index. Defaults to 0 (the first column).

  • sniffers – Dict containing {file : snifferName} mappings, specifying custom sniffers to be used for specific files. See the custom module.

  • encodings – Dict of {datafile : encoding} mappings, specifying non-standard file encodings. If not specified, latin1 is assumed.

  • renameDuplicates – Defaults to False. If True, columns which have the same name are renamed - see renameDuplicateColumns().

  • renameSuffix – Passed as suffix to renameDuplicateColumns(), if renameDuplicates is True.

Returns:

A tuple containing:

  • List of csv dialect types

  • List of booleans, indicating whether or not each file has a header row.

  • List of lists of Column objects, one list per file, representing the columns in each file.

funpack.fileinfo.has_header(sample, dialect, candidateTypes=None, missingValues=None)[source]

Used in place of the csv.Sniffer.has_header method.

The Sniffer.has_header method can fail in some circumstances, e.g.:

  • for files which only contain a single column.

  • for files which contain lots of missing values.

This function works in essentially the same manner as the csv.Sniffer.has_header function, but handles the above situations.

Parameters:
  • sample – Text sample.

  • dialect – CSV dialect as returned by the csv.Sniffer.sniff method, or a string describing the dialect (e.g. 'whitespace').

  • candidateTypes – Sequence of types to check. Defaults to [float].

  • missingValues – Sequence of missing values to ignore. Defaults to ['', 'na', 'n/a', 'nan']. The missing value test is case insensitive.

Returns:

True if the sample looks like it contains a header, False otherwise.
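
The following is a minimal sketch of the kind of check described above, not the funpack implementation. The function name looks_like_header and its delimiter argument are hypothetical. It assumes that a column signals a header when every non-missing data value converts to one of the candidate types but the first-row value does not. Because it inspects each column independently and skips missing values, it copes with single-column files and sparse data.

```python
def looks_like_header(sample, delimiter=',', candidate_types=(float,),
                      missing_values=('', 'na', 'n/a', 'nan')):
    """Hypothetical stand-in for a header check; not funpack's own code.

    Pass delimiter=None to split on runs of whitespace.
    """
    rows = [line.split(delimiter)
            for line in sample.splitlines() if line.strip()]
    if len(rows) < 2:
        return False

    first, rest = rows[0], rows[1:]
    missing = {m.lower() for m in missing_values}

    def is_missing(value):
        # The missing value test is case insensitive
        return value.strip().lower() in missing

    def converts(value, typ):
        try:
            typ(value.strip())
            return True
        except ValueError:
            return False

    for col, top in enumerate(first):
        if is_missing(top):
            continue
        # Non-missing values from the data rows of this column
        values = [r[col] for r in rest
                  if col < len(r) and not is_missing(r[col])]
        if not values:
            continue
        for typ in candidate_types:
            # Data rows are all typed, but the first row is not
            if all(converts(v, typ) for v in values) and not converts(top, typ):
                return True
    return False


single_column = looks_like_header('eid\n1\n2\n3')
no_header = looks_like_header('1\n2\n3')
sparse = looks_like_header('name,age\nx,na\ny,NaN\nz,41')
```

In this sketch a file made up entirely of text columns is reported as having no header, since no column gives any evidence either way.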

funpack.fileinfo.renameDuplicateColumns(cols, suffix=None)[source]

Identifies any columns which have the same name, and renames the subsequent ones. If N columns share the name X, the first keeps the name X and the rest are renamed X.1<suffix>, X.2<suffix>, ..., X.<N-1><suffix>.

The name attribute of each Column object is modified in-place.

Parameters:
  • cols – Sequence of Column objects.

  • suffix – String to append to the name of all renamed columns. Defaults to an empty string.
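
The renaming scheme can be illustrated with a short sketch. This is not the funpack source: Col is a hypothetical stand-in for the Column class with only a name attribute, and rename_duplicates is an illustrative name. The sketch does not guard against a generated name colliding with a name that already exists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Col:
    """Hypothetical stand-in for Column; only the name is modelled."""
    name: str


def rename_duplicates(cols, suffix=None):
    """Rename repeated names in-place as X.1<suffix>, X.2<suffix>, ..."""
    suffix = suffix or ''
    seen = defaultdict(int)  # original name -> occurrences so far
    for col in cols:
        count = seen[col.name]
        seen[col.name] += 1
        if count > 0:
            # The first occurrence keeps its name; later ones are numbered
            col.name = f'{col.name}.{count}{suffix}'


cols = [Col('X'), Col('Y'), Col('X'), Col('X')]
rename_duplicates(cols, suffix='_dup')
names = [c.name for c in cols]
# names is now ['X', 'Y', 'X.1_dup', 'X.2_dup']
```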

funpack.fileinfo.sniff(datafile, encoding=None)[source]

Identifies the format of the given input data file.

Parameters:
  • datafile – Input data file.

  • encoding – File encoding (default: 'latin1').

Returns:

A tuple containing:

  • A csv dialect type

  • List of Column objects. The name attributes will be None if the file does not have a header row. The variable, visit, and instance attributes will be None if the file does not have UKB-style column names.
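
As a rough illustration of the two steps involved, the sketch below detects a dialect with the standard library csv.Sniffer and then parses the header names. It is not how sniff() is implemented. It assumes that UKB-style column names take the form variable-visit.instance (for example 31-0.0). The UKB_NAME pattern and the parse_ukb_name helper are hypothetical.

```python
import csv
import re

# Assumed UKB-style naming: <variable>-<visit>.<instance>
UKB_NAME = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def parse_ukb_name(name):
    """Return (variable, visit, instance), or three Nones if no match."""
    match = UKB_NAME.match(name)
    if match is None:
        return None, None, None
    return tuple(int(g) for g in match.groups())


sample = 'eid\t31-0.0\t21-0.0\n1\t0\t25\n2\t1\t30\n'

# Restrict the candidate delimiters to tab and comma for this sample
dialect = csv.Sniffer().sniff(sample, delimiters='\t,')

# Read the first row using the detected dialect
header = next(csv.reader(sample.splitlines(), dialect))
parsed = [parse_ukb_name(name) for name in header]
```

Here the eid column does not match the assumed pattern, so its variable, visit, and instance values are all None, mirroring the behaviour described for non-UKB-style names.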