funpack.fileinfo
This module contains the FileInfo class, and the sniff() and fileinfo() functions, for getting information about input data files.
- class funpack.fileinfo.FileInfo(datafiles, indexes=None, loaders=None, encodings=None, renameDuplicates=False, renameSuffix=None)
Bases: object
The FileInfo class is a container for the information generated by the fileinfo() function for a collection of input datafiles.
- property allColumns: Sequence[Column]
Returns a list containing all columns from all data files. The result is just a concatenation of the lists returned by columns() for each data file.
- property allVariables: Sequence[int]
Returns a list containing all variable IDs from all data files. Duplicates are removed, and the IDs are sorted.
- columns(datafile)
Return a list of Column objects representing each of the columns that are present in the given datafile.
- property datafiles
Return a list containing the data files.
- encoding(datafile)
Return the encoding for the given datafile, or None if no custom encoding was specified.
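The relationship between columns(), allColumns and allVariables described above can be sketched with a small stand-in container. This is not funpack's implementation: SketchColumn and SketchFileInfo are hypothetical names, the sample data is made up, and skipping columns without a variable ID is an assumption. Only the concatenation and the de-duplicate-then-sort rules come from the descriptions above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class SketchColumn:
    """Hypothetical stand-in for a funpack Column (illustration only)."""
    datafile: str
    name: str
    variable: Optional[int] = None


class SketchFileInfo:
    """Hypothetical container mirroring the documented properties."""

    def __init__(self, columns: Dict[str, List[SketchColumn]]):
        # {datafile : [columns]} mappings, kept in insertion order
        self._columns = columns

    @property
    def datafiles(self) -> List[str]:
        return list(self._columns)

    def columns(self, datafile: str) -> List[SketchColumn]:
        return list(self._columns[datafile])

    @property
    def allColumns(self) -> Sequence[SketchColumn]:
        # Concatenate the per-file column lists, file by file
        return [c for f in self.datafiles for c in self.columns(f)]

    @property
    def allVariables(self) -> Sequence[int]:
        # Remove duplicates, then sort. Skipping columns without a
        # variable ID is an assumption made for this sketch.
        return sorted({c.variable for c in self.allColumns
                       if c.variable is not None})


# Made-up example data: two files which share an index column
info = SketchFileInfo({
    'a.tsv': [SketchColumn('a.tsv', 'eid', 0),
              SketchColumn('a.tsv', '31-0.0', 31)],
    'b.tsv': [SketchColumn('b.tsv', 'eid', 0),
              SketchColumn('b.tsv', '21-0.0', 21)],
})
variables = info.allVariables
```

Here allColumns keeps all four columns, duplicates included, whereas allVariables collapses the shared ID and returns the remaining IDs in ascending order.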
- funpack.fileinfo.fileinfo(datafiles, indexes=None, sniffers=None, encodings=None, renameDuplicates=False, renameSuffix=None)
Identifies the format of each input data file, and extracts/generates column names and variable IDs for every column.
- Parameters:
datafiles – Sequence of data files to be loaded.
indexes – Dict containing {filename : [index]} mappings, specifying which column(s) to use as the index. Defaults to 0 (the first column).
sniffers – Dict containing {file : snifferName} mappings, specifying custom sniffers to be used for specific files. See the custom module.
encodings – Dict of {datafile : encoding} mappings, specifying non-standard file encodings. If not specified, latin1 is assumed.
renameDuplicates – Defaults to False. If True, columns which have the same name are renamed - see renameDuplicateColumns().
renameSuffix – Passed as suffix to renameDuplicateColumns(), if renameDuplicates is True.
- Returns:
A tuple containing:
List of csv dialect types.
List of booleans, indicating whether or not each file has a header row.
List of lists, Column objects representing the columns in each file.
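To illustrate what a csv dialect type is, the sketch below reads a sample of a file using the latin1 default encoding and passes it to the standard library's csv.Sniffer. It uses only the Python standard library; whether funpack calls the sniffer in exactly this way is an assumption, and the file contents are made up.

```python
import csv
import os
import tempfile

# Write a small tab-separated file (made-up data, illustration only)
text = 'eid\t31-0.0\t21003-0.0\n1\t0\t45\n2\t1\t60\n'
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='latin1') as f:
    f.write(text)
    path = f.name

try:
    # Read a sample, decoding with the documented default encoding
    with open(path, 'rt', encoding='latin1') as f:
        sample = f.read(4096)
finally:
    os.remove(path)

# csv.Sniffer.sniff returns a csv.Dialect subclass (a type, not an
# instance). Restricting the candidate delimiters is a choice made
# for this sketch.
dialect = csv.Sniffer().sniff(sample, delimiters='\t,')
delimiter = dialect.delimiter
```

The returned dialect can be passed directly to csv.reader to parse the remainder of the file.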
- funpack.fileinfo.has_header(sample, dialect, candidateTypes=None, missingValues=None)
Used in place of the csv.Sniffer.has_header method.
The Sniffer.has_header method can fail in some circumstances, e.g.:
for files which only contain a single column.
for files which contain lots of missing values.
This function works in essentially the same manner as the csv.Sniffer.has_header function, but handles the above situations.
- Parameters:
sample – Text sample.
dialect – CSV dialect as returned by the csv.Sniffer.sniff method, or a string describing the dialect (e.g. 'whitespace').
candidateTypes – Sequence of types to check. Defaults to [float].
missingValues – Sequence of missing values to ignore. Defaults to ['', 'na', 'n/a', 'nan']. The missing value test is case insensitive.
- Returns:
True if the sample looks like it contains a header, False otherwise.
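The idea of a header test which tolerates single-column files and missing values can be sketched as follows. This is a simplified, hypothetical heuristic, not funpack's algorithm: sketch_has_header is an invented name, it takes a delimiter rather than a dialect, and the voting rule is an assumption. The default candidate types and missing values are taken from the parameter descriptions above.

```python
import csv
import io


def sketch_has_header(sample, delimiter=',', candidateTypes=None,
                      missingValues=None):
    """Hypothetical, simplified header test (not funpack's code)."""
    if candidateTypes is None:
        candidateTypes = [float]
    if missingValues is None:
        missingValues = ['', 'na', 'n/a', 'nan']
    missing = {m.lower() for m in missingValues}

    def converts(value):
        # True if any candidate type accepts the value
        for t in candidateTypes:
            try:
                t(value)
                return True
            except (ValueError, TypeError):
                pass
        return False

    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    if len(rows) < 2:
        return False
    first, rest = rows[0], rows[1:]

    votes = 0
    for i, head in enumerate(first):
        # Ignore missing values in the data rows (case-insensitive)
        values = [r[i].strip() for r in rest if i < len(r)]
        values = [v for v in values if v.lower() not in missing]
        if not values:
            continue
        if all(converts(v) for v in values):
            # Typed data below an untyped first cell suggests a header
            votes += 1 if not converts(head) else -1
    return votes > 0


single_column = sketch_has_header('age\n45\nNA\n60\n')
no_header = sketch_has_header('1\n2\n3\n')
many_missing = sketch_has_header('eid,bmi\n1,NA\n2,n/a\n3,25.5\n')
```

In this sketch a column only votes when it has at least one non-missing value, so a file dominated by missing values can still be classified from the few values that are present.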
- funpack.fileinfo.renameDuplicateColumns(cols, suffix=None)
Identifies any columns which have the same name, and re-names the subsequent ones. If N columns have the same name X, they are renamed X, X.1<suffix>, X.2<suffix>, ..., X.<N-1><suffix>.
The name attribute of each Column object is modified in-place.
- Parameters:
cols – Sequence of Column objects.
suffix – String to append to the name of all renamed columns. Defaults to an empty string.
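The renaming scheme described above can be sketched in a few lines. NamedColumn and sketch_rename_duplicates are hypothetical names used for illustration only; this is not funpack's code, and it does not handle the case where a generated name collides with a name that already exists.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class NamedColumn:
    """Hypothetical stand-in for a funpack Column (illustration only)."""
    name: str


def sketch_rename_duplicates(cols, suffix=None):
    """Rename repeated names to X.1<suffix>, X.2<suffix>, ... in-place."""
    if suffix is None:
        suffix = ''
    seen = Counter()
    for col in cols:
        # Number of earlier columns which had this (original) name
        count = seen[col.name]
        seen[col.name] += 1
        if count > 0:
            # The first occurrence keeps its name, later ones are renamed
            col.name = f'{col.name}.{count}{suffix}'


cols = [NamedColumn(n) for n in ['a', 'b', 'a', 'a']]
sketch_rename_duplicates(cols, suffix='_dup')
names = [c.name for c in cols]
```

As with the documented function, the objects are modified in-place and nothing is returned.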
- funpack.fileinfo.sniff(datafile, encoding=None)
Identifies the format of the given input data file.
- Parameters:
datafile – Input data file
encoding – File encoding (default: 'latin1')
- Returns:
A tuple containing:
A csv dialect type.
List of Column objects. The name attributes will be None if the file does not have a header row. The variable, visit, and instance attributes will be None if the file does not have UKB-style column names.
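The way variable, visit, and instance may be derived from a column name can be sketched as below. The sketch assumes that a UKB-style column name has the form <variable>-<visit>.<instance> (for example 21003-2.0); the exact patterns which funpack accepts are not stated here, and sketch_parse_name is a hypothetical helper, not part of funpack.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <variable>-<visit>.<instance>, e.g. '21003-2.0'
UKB_NAME = re.compile(r'^(\d+)-(\d+)\.(\d+)$')

Parsed = Tuple[Optional[int], Optional[int], Optional[int]]


def sketch_parse_name(name: Optional[str]) -> Parsed:
    """Return (variable, visit, instance), or three Nones if no match."""
    if name is None:
        # No header row, so there is no name to parse
        return None, None, None
    match = UKB_NAME.match(name.strip())
    if match is None:
        # Not a UKB-style name
        return None, None, None
    variable, visit, instance = (int(g) for g in match.groups())
    return variable, visit, instance


ukb = sketch_parse_name('21003-2.0')
other = sketch_parse_name('age')
```

A name which does not match the assumed pattern yields None for all three attributes, mirroring the behaviour described for files without UKB-style column names.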