FUNPACK release history
3.8.1 (Wednesday 15th May 2024)
Fixed
Addressed deprecation warnings arising from changes to
pandas.
3.8.0 (Thursday 14th December 2023)
Added
New
--fail_if_missingoption, which causes FUNPACK to abort if a requested data-field is not present in the input file(s).
3.7.1 (Friday 22nd September 2023)
Changed
Replaced
setup.pywithpyproject.tomlfor storing metadata and building wheels.
Fixed
The
fillVisits(method='mode')cleaning function will no longer crash when applied to a column with all values equal toNaN.Fixed an inverted logic bug when parsing the
--no_downloadoption.
3.7.0 (Monday 17th April 2023)
Added
Replacement values used with the
--recodingand--child_valuesoptions can now be specified as simple value expressions which may contain simple arithmetic statements, and calls tomin()andmax()functions. For example, using--recoding 1 "1,2" "100,max()+1"will cause value1to be replaced with100, and value2to be replaced one plus the data maximum, for data field1.New
--no_downloadcommand-line option, which forces FUNPACK to use local copies of UK Biobank schema metadata files instead of downloading them.
Fixed
Miscellaneous changes to allow FUNPACK to be used with Pandas 2.x.
3.6.0 (Thursday 2nd February 2023)
Changed
The UK Biobank metadata files containing data field properties, encoding dictionaries, and categorical coding mappings are now downloaded automatically each time FUNPACK is executed, instead of being loaded from pre-downloaded copies in
funpack/schema/(which are only updated on each new FUNPACK release). The pre-downloaded files will still be used as back-ups in case of download failures.Internal reorganisation associated with the above change - the
codingandhierarchymodules have been moved into a newfunpack.schemasub-package, which also contains the pre-downloaded UK Biobank metadata files previously stored infunpack/data/.Variable categories can now be listed in processing rules, e.g.
"all,1,2,cat5 someproc()"will applysomeproc()to data fields 1, 2, and all data fields from category 5.
3.5.2 (Tuesday 16th August 2022)
Fixed
Fixed an issue with calculating the order in which child value replacement expressions should be evaluated.
3.5.1 (Monday 8th August 2022)
Fixed
Fixed an issue with retrieving the FUNPACK version string in a development installation.
3.5.0 (Friday 5th August 2022)
Added
New
binarisedoption to thecreateDiagnosisColumns()processing function, allowing it to be applied to data fields that have been passed through ~:func:.processing_functions.binariseCategorical.
3.4.1 (Tuesday 2nd August 2022)
Fixed
Fixed an issue in the
binariseCategorical()processing function which was causing leading0characters to be dropped in generated column names which, in turn, could result in duplicate column names and downstream errors.
3.4.0 (Thursday 28th July 2022)
Added
New
createDiagnosisColumns()processing function, which can be used to add binary indicator columns for ICD9, ICD10, OPSC3, and OPSC4 diagnosis/operative procedure codes, marking them as either primary or secondary diagnoses.New
removeField()processing function, which can be used to simply remove all columns associated with a variabl/data-field, at any point in the processing stage.New
schemeoption to thecodeToNumeric()cleaning function, allowing a choice of two different conversion schemes.
Changed
Changed the default naming convention for columns generated by
binariseCategorical()when called withacrossVisits=TrueandacrossInstances=True- generated columns are now named<vid>-<value>rather than<vid>-0.<value>.
3.3.1 (Tuesday 28th June 2022)
Fixed
Fixed an issue with the
findConfigDir()function, meaning that plugin files were not being located in some circumstances.
3.3.0 (Monday 27th June 2022)
Added
New
--add_aux_varsoption, which causes all auxillary data-fields specified in the processing rules to be selected for import, if present in the data, and not already selected. This option is enabled in thefmribconfiguration profile.The
$FUNPACK_CONFIG_DIRenvironment variable can be used to direct FUNPACK to a directory containing configuration, table, and plugin files.
Changed
The
fmribconfiguration profile has been removed from FUNPACK, and is now being released independently as a dependency of FUNPACK (see https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config). However, it will be installed automatically alongside FUNPACK, so from an end-user’s perspective, this change will have no effect on usage.
Deprecated
The use of
broadcast_arguments with processing functions have been deprecated, and will be removed in FUNPACK 4.0.0. The alternative to broadcast arguments is for processing functions to perform their own parallelisation internally.
3.2.3 (Wednesday 1st June 2022)
Changed
New additions to
fmribcategories.
3.2.2 (Tuesday 31st May 2022)
Changed
Updated internal datafield/encoding schemas to the latest available on the UK BioBank showcase.
3.2.1 (Tuesday 31st May 2022)
Changed
Adjustment to the categories that are loaded in the
fmribprofile.
3.2.0 (Friday 13th May 2022)
Added
New
--exclude_variableand--exclude_categoryoptions, allowing variables / categories to be explicitly excluded from being imported.
Changed
The
removeIfRedundant()processing function now usesdoubleprecision by default.
3.1.1 (Tuesday 10th May 2022)
Changed
Updated categorisations in the
fmribconfiguration profile.
3.1.0 (Thursday 5th May 2022)
Added
New
numericandconvertNumericoptions to theflattenHierarchical()cleaning function, which allow it to be used with unique numeric labels (node IDs) for hierarchical data encodings, rather than coding labels (which are not necessarily numeric, or unique) (!88).
Changed
Updated internal data field and data coding to the latest from the UK BioBank showcase.
Fixed
Fixed an issue with the interpretation of hierarchical data codings, which was causing the
flattenHierarchicalcleaning function to crash (!88).
3.0.0 (Wednesday 5th January 2022)
Changed
The
funpackcommand has been renamed tofmrib_unpack, to avoid naming conflicts with other packages.
2.9.1 (Wednesday 29th December 2021)
Fixed
Fixed some compatibility issues with Python 3.10.
2.9.0 (Tuesady 28th December 2021)
Added
New
--remove_duplicatesoption, which causes columns with duplicate names to be removed, with only the first retained.
Changed
Explicitly use the
forkmethod for themultiprocessingmodule.
2.8.0 (Thursday 19th August 2021)
Added
New
skipUnknownsoption to theremoveIfRedundantprocessing function, whereby if a column is found to be redundant with respect to a column of an unkknown or uncategorised data field, it is not dropped. This option is used in thefmribconfiguration.
Changed
Added a recoding rule for data coding 100348 to the
fmribconfiguration.
2.7.1 (Tuesday 22nd June 2021)
Fixed
Fixed some issues with the
binariseCategoricalandremoveIfSparseprocessing functions, which could cause them to cause a crash when processing empty columns.
2.7.0 (Friday 14th May 2021)
Changed
The
--na_values,--child_valuesand--recodingoptions can now be applied to non-numeric data fields.Updated internal data field and data coding to the latest from the UK BioBank showcase.
Updates to FMRIB categories and processing rules.
2.6.0 (Monday 29th March 2021)
Changed
Documentation for FUNPACK can now be found online at https://open.win.ox.ac.uk/pages/fsl/funpack/.
The way that “uncategorised” variables are identified has been changed. Previously, a variable which was not in a category and which had no cleaning/processing rules specified was classified as uncategorised. This has been simplified so that a variable which not in a category is classified as uncategorised. This has been done to avoid the possibility of missing newly added variables which use an existing data coding, and therefore may implicitly have cleaning rules specified for them.
2.5.2 (Monday 15th March 2021)
Changed
A warning message is now emitted if a processing step is requested for a variable that is not present in the input data.
Fixed
Fixed an issue with the
binariseCatgegoricalprocessing function where, when it was applied to multiple variables, each with a separatetakevariable (as is the case in thefmribconfiguration), would cause an error if any of thetakevariables were not present.
2.5.1 (Wednesday 3rd March 2021)
Added
New
--escape_newlinesoption, which causes non-numeric values containing escape characters (e.g.\n,\t, etc) to be output literally.
Fixed
Fixed a bug where subject inclusion expressions were causing zero rows to be imported if the variable was not present in the input file.
Fixed some compatibility issues with Pandas 1.2.
2.5.0 (Wednesday 9th December 2020)
Added
New
--rename_duplicatesoption, which can be used to give duplicate columns unique names.
Changed
Conditional expressions used with the
--subjectoption can now be used on date/time variables.Internal re-organisation of the
funpack.fileinfomodule, and minor changes to function interfaces in thefunpack.importingandfunpack.loadtablesmodules.
2.4.0 (Thursday 26th November 2020)
Added
New
--ids_onlyoption which causes only the subject IDs to be saved, suitable for use with the--subjectoption in subsequent calls tofunpack.
Changed
Changed how binary subject selection expressions are evaluated when the different variables have different numbers of columns.
All internal UKB showcase data tables have been updated to their latest versions.
The node name for Chapter V - Supplementary classification … in the ICD9 data table has been changed from Chapter V to Chapter V sup to avoid collisions with the Chapter V node.
2.3.3 (Wednesday 5th October 2020)
Changed
Improved performance when appying the
--columncommand-line argument.
2.3.2 (Wednesday 10th June 2020)
Fixed
Fixed an issue which was preventing FUNPACK from being used on Windows platforms (!65).
2.3.1 (Sunday 17th May 2020)
Changed
The
removeIfSparse()processing function can now parallelise the check across columns, rather than relying on theprocessingmodule to parallelise calls across variables (!64).
Fixed
The improved
fmribdate/time normalisation routines were not convertingNaTs(Not-a-Time) correctly (!62).Fixed a problem in the FMRIB configuration - diagnosis timestamps were not being injected into binarised ICD variables (!63).
2.3.0 (Tuesday 12th May 2020)
Changed
Modified the
processing_functions.binariseCategorical()function so that it parallelises tasks internally, instead of being called in parallel for different variables. This should give superior performance (!60).Revisited the
DataTable.merge()to optimise performance in all scenarios (!60).Improved performance of the
fmribdate/time normalisation routines, and changed their usage so they are now applied as “cleaning” functions after data import, rather than just before export. This means that date/ time columns can be subjected to the redundancy check (as they will have a numeric type), and will improve data export performance (!60, !61).
2.2.1 (Monday 4th May 2020)
Fixed
Reverted some changes to
DataTable.merge()which caused performance degradations.
2.2.0 (Friday 1st May 2020)
Changed
Substantial performance improvements to the
cleaning_functions.codeToNumeric()cleaning function, and toprocessing_functions.removeIfRedundant(),processing_functions.binariseCategorical(), and other processing functions.The default implementation of
processing_functions.removeIfRedundant()now uses matrix algebra rather thsn pairwise comparisons. This requires more memory, but is much faster.Added threadpoolctl as a dependency, for setting the number of threads to use when parallelising
numpyoperations.
Fixed
The
removeIfRedundant()processing function was not testing columns with no missing values when a NA correlation threshold was being used.removeIfRedundant()was also potentially producing inconsistent results for columns with no present values, or with a constant value.
2.1.0 (Tuesday 21st April 2020)
Added
New
--drop_na_rowsoption, which tellsfunpackto drop rows which do not contain a value for any column.
Changed
Internal changes to improve performance.
2.0.0 (Tuesday 7th April 2020)
Changed
The
fmribandfmrib_logsconfiguration profiles no longer define the variables/categories to be loaded - by default all variables in the input file will be loaded and processed.The
--non_numeric_fileoption has been replaced with--suppress_non_numerics(which tells FUNPACK to only save numeric columns to the main output file), and the--write_non_numericsand--non_numerics_fileoptions, which tell FUNPACK to save non-numeric columns to an auxillary output file.The
--tsv_var_formatoption has been renamed to--var_format, and is applied to all export formats.The default output file format is now inferred from the output file suffix - one of
tsv,csv, orh5.The format of the
--unknown_vars_filehas changed - theprocessedcolumn has been removed (as with the removal of the--import_alloption, it is now equivalent to theexportedcolumn), and uncategorised columns now have aclassofuncategorisedinstead ofunprocessed.
Removed
Removed several obscure, redundant, or deprecated options, including
--import_all,--remove_unknown,--pass_through,--output_id_column,--column_pattern,--column_name,--low_memory, and--work_dir.Removed the unused
funpack.storagemodule.Removed the unused
DataTable.order()method.
1.9.1 (Sunday 29th March 2020)
Changed
Updates to FMRIB categories.
1.9.0 (Friday 28th February 2020)
Added
New
--write_log,--write_unknown_vars,--write_icd10_map,--write_description, and--write_summaryoptions, which will save the respective auxillary output file using a default naming convention which is based on the name of the main output file. Exact names can still be specified via the--log_file,--unknown_vars_file,--icd10_map_file,--description_file, and--summary_fileoptions.
Changed
Refactored the
fmribconfiguration profile.fmribnow just applies cleaning/processing rules.fmrib_logsappliesfmrib, and also specifies logging/auxillary output files.
Removed
Removed the built-in
ukbconfiguration.
Deprecated
The
--pass_throughoption is deprecated - the same behaviour can be achieved by running FUNPACK without specifying any cleaning or processing steps.
1.8.2 (Monday 24th February 2020)
Changed
The
--config_fileoption can be used more than once, and can also be used from within a configuration file (i.e. one configuration file may “include” another).Changed the way that the
processing_functions.removeIfRedundant()process splits up the data set for parallel processing.
1.8.1 (Wednesday 19th February 2020)
Added
New
navaloption to theprocessing_functions.removeIfSparse()processing function.
Changed
Changes to the
fmribconfiguration, to correctly apply sparsity check to variables 41202, 41203, 41270 and 41271.
1.8.0 (Tuesday 18th February 2020)
Added
New
takeoption to theprocessing_functions.binariseCategorical()processing function, which allows the generated columns to contain values from another column, instead of containing binary labels.New
fillvaloption to theprocessing_functions.binariseCategorical()processing function, which can be used in conjunction withtake, to specify the fill value for missing rows.Argument broadcasting for processing functions - when a process is applied independently to more than one variable, the input arguments to the process may need to be different for each variable. This can be accomplished by using a _broadcast_ argument - simply prefix the argument name with
'broadcast_', and then specify a list containing the argument.Processing functions can now be passed lists of values.
Changed
Changes to the
fmribconfiguration - variables 41202, 41203, 41270, and 41271 are binarised, and the binarised values replaced with diagnosis dates from the corresponding date variables.The processing function interface has been changed - processing functions which return metadata for newly added columns must now return a sequence of dicts containing arguments to the
Columnconstructor, which can include metadata.
Fixed
Fixed a bug whereby only the first two conditions were being parsed in an expression comprising multiple identical chained boolean operations (e.g. v10 == 1 || v20 == 2 || v30 == 3).
1.7.1 (Thursday 30th January 2020)
Added
New built-in
ukbconfiguration, which applies NA insertion, categorical recoding, and child value replacement rules from thefmribconfiguration.
Fixed
Fixed a bug which arose from combining the
--import_alland--columnoptions.
1.7.0 (Friday 24th January 2020)
Added
New
--index_visitsoption, which re-arranges variables with separate columns per visit into single columns indexed by both subject ID and visit.
Changed
The
--indexoption now supports specification of multiple index columns for each input file.The
fileinfo.has_header()function has been modified to be more lenient.The
importingmodule has been internally refactored to improve code cleanliness.Various minor internal API changes.
The
removeIfRedundant()processsing function will now drop columns which are redundant with respect to other columns which have already been dropped.Update to the FMRIB configuration (handling variable 6150).
The
'codingdesc'metaproc function takes into account possible categorical recodings when retrieving the description for a particular value.
Fixed
The
removeIfRedundant()processsing function was unnecessarily evaluating column pairs more than once, when run in parallel.
1.6.0 (Wednesday 11th December 2019)
Added
Non-numeric variables can now be used in conditional expressions, e.g.
'v41202 == "A009"'. Within such expressions, the value must be contained within single or double quotes.New
containsoperator, for use within conditional expressions to test presence of sub-strings.
Changed
Parallelisation is now disabled by default, and must be explicitly enabled via the
--num_jobsoption. This is done in thefmribconfiguration.Subject inclusion expressions are now evaluated during, rather than after, data import. They are now therefore performed in parallel on different chunks of the input file(s) (when parallelisaton is enabled).
1.5.0 (Monday 9th December 2019)
Added
New
util.wc()function to count the rows (lines) of a file; this is simply a wrapper around the UNIXwctool.New
util.cat()function to concatenate multiple files together; this is simply a wrapper around the UNIXcattool.New
util.inMainProcess()function so a process can determine whether it is the main process or a worker process.New
DataTable.subtable()andDataTable.merge()methods, to aid in passing data to/from worker processes.Processing functions can now be specified to run independently on a subset of variables by using
'independent'in the variable list.New
anyandalloperations which can be used in conditional statements to control how the conditional results are combined across multiple columns for one variable. These can be used with the--subjectoption.
Changed
FUNPACK will now parallelise tasks by default; previously it would only parallelise tasks if
--low_memorymode were selected.The data import stage is parallelised by using multiple processes to read different chunks of the input file(s), and then concatenating the resulting
pandas.DataFrameobjects afterwards.Cleaning functions are executed on each variable in parallel.
Each processing step is executed in parallel where possible (e.g.
independentprocesses), but processing steps are still executed sequentially. New columns created by processing functions are saved to disk, and re-loaded by the main process, rather than being passed back to the main process via inter-process communication.The
removeIfRedundantprocess now compares pairs of columns in parallel.The data export stage is parallelised by writing chunks of rows to different files, and then concatenating them into a single output file afterwards.
The
--variable,--subjectand--excludeoptions now accept comma-separated mixtures of IDs and MATLAB-style ranges.Updates to FMRIB categories.
Updates to FMRIB processing rules, to take advantage of parallelism.
The ,:mod:icd10 module must now be initialised via the
icd10.initialise()function, when it is to be used in a multiprocessing context. This is not necessary whenfunpackis configured to not parallelise tasks (e.g. with--num_jobs 1).
Deprecated
The
--low_memoryand--work_diroptions have been deprecated, and no longer have any effect. Thestoragemodule is no longer used, but is still present for possible future usage.
1.4.5 (Thursday 5th December 2019)
Changed
The
funpack_demonotebook is now executed from a temporary directory, so it does not require write-permissions to the FUNPACK installation directory.
Fixed
Fixed a bug where non-numeric variables (e.g. 41271) were being interpreted by
pandasas being numeric.
1.4.4 (Friday 15th November 2019)
Changed
Updates to the FMRIB categories and configuration.
1.4.3 (Monday 11th November 2019)
Changed
Updated internal variable and data coding tables to the latest available from the UK Biobank showcase.
Increased the file sample size used by
fileinfo.sniff().
1.4.2 (Tuesday 6th August 2019)
Changed
Minor changes to the FMRIB configuration.
1.4.1 (Monday 8th July 2019)
Added
New
--trust_typescommand-line flag which tells FUNPACK to assume that the data in known-to-be-numeric columns is parseable (i.e. that there are no bad/unparseable values). This option improves import performance, but at the cost of causing FUNPACK to crash if the assumption is not true.
1.4.0 (Sunday 7th July 2019
Added
Added a new
InternalTypecolumn to the variable table, which can be used to specify the type to use internally for a given variable (e.g.float64). This is so that the default type offloat32can be overridden for specific variables for which this is problematic, such as variable 20003. This column is initially populated fromfunpack/data/type.txt.New
funpack.codingmodule, for retrieving descriptive information about data codings. The information is stored in thefunpack/data/coding/directory. Hierarchical data codings are still accessed via thehierarchymodule.New
hierarchicalDescriptionFromCode(),hierarchicalDescriptionFromNumeric(), andcodingDescriptionFromValue()metaprocessing functions.
Changed
The hierarchical coding name no longer needs to be specified when using the
cleaning_functions.codeToNumeric()cleaning function - the coding is automatically looked up.Variable 4288 has been moved from
cognitive phenotypestomiscellaneousin the FMRIB categories.Variable 20003 is now binarised in the FMRIB categories.
Changed the meta-processing function signature - these functions are now passed the
DataTableand variable ID, in addition the value.
Fixed
Now using an internal type of
float64for variable 20003, as it potentially has values which cannot be represented infloat32.
Deprecated
Deprecated the xDescriptionFromCode and xDescriptionFromNumeric metaprocessing functions.
1.3.2 (Tuesday 4th June 2019)
Changed
Minor adjustments to the FMRIB categories.
1.3.1 (Thursday 30th May 2019)
Changed
Updates to documentation.
1.3.0 (Wednesday 29th May 2019)
Added
New
cleaning_functions.codeToNumeric()cleaning function, for transforming hierarhical variable codes.New
hierarchy.codeToNumeric()andhierarchy.numericToCode()functions.New meta-process functions for generating descriptions for ICD9, OPCS3 and OPCS4 hierarchical variables.
Variable, data coding, processing, category and type files in the
funpack/configdirectory can be specified on the command line and in configuration files as relative paths, and using a “dot” syntax, e.g.fmrib/categories.tsv, orfmrib.categories.
Changed
Built-in cleaning and processing rules are no longer applied by default - they are now a part of the built-in
fmribconfiguration, and can be applied via-cfg fmrib.Updates to built-in
fmribprocessing.The
flattenHierarchicalprocessing function accepts anameargument, allowing the hierarchical data type name to be specified. If not provided, the type is inferred from the variable ID if possible.
Fixed
Fixed a bug where a processing step attempted to add a new column with the same name as an existing one.
Deprecated
The
convertICD10Codes()cleaning function has been replaced by the newcleaning_functions.codeToNumeric()function, which can be used with any hierarchical variable.The
icd10.codeToNumeric()andicd10.numericToCode()functions have been replaced by thehierarchy.codeToNumeric()andhierarchy.numericToCode()functions.The
loadDefaultTables()function is obsolete and has been deprecated.
1.2.1 (Tuesday 28th May 2019)
Changed
Minor changes to built-in variable categories.
1.2.0 (Saturday 25th May 2019)
Added
New
--summary_fileoption, which exports a summary of the cleaning/processing steps that have been applied to each variable.
Changed
Built-in recoding, NA insertion, and child value replacement rules have been revised and updated.
1.1.4 (Friday 17th May 2019)
Changed
Changed default processing rules so a column with standard deviation less than
1e-6is deemed sparse, and dropped.
1.1.3 (Thursday 16th May 2019)
Changed
The
isSparse()function has been changed so that, when themincatormaxcattests are specified as proportions, they are applied relative to the number of non-missing data points, rather than the total number of data points.
1.1.2 (Thursday 16th May 2019)
Fixed
Fixed a bug in
flattenHierarchical()with respect to handling missing values.
1.1.1 (Wednesday 15th May 2019)
Fixed
Changed the
isSparse()function to avoid issues with non-numaric data.
1.1.0 (Tuesday 14th May 2019)
Changed
The
--visit/-vicommand line option will no longer be applied to variables which do not have an instancing code 2. This is implemented in thekeepVisits()function.The
remove()andkeepVisits()function signatures have changed - they now require the variable table to be passed in as the first argument.
1.0.2 (Tuesday 14th May 2019)
Changed
The
removeIfSparse()processing function accepts anignoreTypeparameter which forces all tests to be run, regardless of the variable type.
Fixed
The
isSparse()function was skipping themincat/maxcattests for non-numeric categorical variables.
1.0.1 (Friday 9th May 2019)
Changed
Python package name changed from
fmrib_unpacktofmrib-unpack.
1.0.0 (Friday 9th May 2019)
Changed
ukbparseis now calledFUNPACK- the FMRIB UKBiobank Normalisation, Processing And Cleaning Kit.
Removed
The
ukbparse_htmlparse,ukbparse_join, andukbparse_compare_tablesscripts have been removed.The
ukbparse.icd10.readICD10CodingFilefunction andukbparse.icd10.ICD10Hierarchyclass have been removed (their functionality was replaced by thehierarchymodule)The
processing_functions.removeIfSparse()andprocessing_functions_core.removeIfSparse()functions no longer accept anabsoluteargument.
0.21.1 (Thursday 8th May 2019)
Changed
Addd categories 1, 2 and 99 to the
fmribconfiguration.
0.21.0 (Thursday 8th May 2019)
Added
Columnobjects now have ametadataattribute which may be used in the column description (if the--description_fileoption is used). Processing functions can set the metadata for newly added columns.New
metaprocplugin type to manipulate column metadata.All processing functions accept a
metaprocargument, allowing ametaprocfunction to be applied to any column metadata that is returned by the processing function..
Changed
The
processing_functions.binariseCategorical()function sets the categorical value as column metadata on the new binarised columns.
0.20.1 (Wednesday 8th May 2019)
Fixed
Fixed some typos in the
READMEfile.
0.20.0 (Tuesday 7th May 2019)
Added
The
isSparse()andremoveIfSparse()functions accept a new option,mincat, which allows a categorical to be deemed sparse if the size of its smallest category is below a given threshold.New
--description_fileoption which, for UK BioBank data, saves the description for each column to a text file.
Changed
The
absoluteparameter to theisSparse()andremoveIfSparse()functions is deprecated. Instead, they now acceptabspresandabscatarguments, allowing the absoluteness/proportionality of theminpresandmincat/maxcatoptions to be specified separately.Changed default processing rules so that ICD10 variables undergo a slightly different sparsity test.
Fixed
Fixed a bug in the categorical recoding rules for Data Coding 100012.
0.19.2 (Friday 26th April 2019)
Changed
Changes to built-in categories and to
fmribconfiguration.
0.19.1 (Thursday 25th April 2019)
Changed
Changed the default processing rules for ICD10 variables 40001, 40002, 40006, 41202, and 41204.
Added ICD10 variables 41201 and 41270 to the default cleaning/processing rules.
0.19.0 (Wednesday 24th April 2019)
Added
The
--columnoption now accepts a file which contains a list of column names to import.
Changed
The
icd10.codeToNumeric()andicd10.numericToCode()functions have been changed to use the integer node IDs in the ICD10 hierarchy file. The previous approach could not handle parent categories, nor a small number of ICD10 codes which do not have a<letter><number>structure.The
fileinfo.has_header()function has been made more lenient for files with a small number of columns.
0.18.0 (Tuesday 23rd April 2019)
Added
New
icd10.numericToCode()function for converting from a numeric ICD10 code representation back to its alphanumeric representation.
Changed
The default binarised ICD10 column name format has been changed from
[variable_id][numeric_code]-[visit].0to[variable_id]-[visit].[numeric_code].The
--non_numeric_filewill not be created if there are not any non-numeric columns.The built-in
fmribconfiguration now includes verbosity and logging settings.The
isSparse()function now returns the reason and value for columns which fail the sparsity test.
0.17.0 (Monday 22nd April 2019)
Added
New
--non_numeric_fileoption allows non-numeric columns to be saved to a separate file (TSV export only).Built-in
fmrib.cfgconfiguration file, which can be used via-cfg fmrib.
Changed
The file generated by
--unknown_vars_filenow includes variables which are known, but are not in an existing category, and do not have any cleaning or processing rules specified for them.Built-in categories have been updated.
Fixed
A bug in the column names generated for binarised ICD10 categorical codes has been fixed. This bug would potentially have resulted in collisions between column names for different ICD10 codes.
0.16.0 (Friday 22nd March 2019)
Changed
Full variable and datacoding table files no longer need to be provided -
ukbparseusesukbparse/data/field.txtandukbparse/data/encoding.txtfiles, obtained from the UK Biobank showcase website, as the basis for recognising variables and data codings. The--variable_file/-vfand--datacoding_file/-dfoptions now accept partial table definitions - these will be merged with the built-in rules (still stored inukbparse/data/variables_*.tsvandukbparse/data/datacodings_*.tsv) whenukbparseis invoked.
Deprecated
The
ukbparse_htmlparse,ukbparse_join, andukbparse_compare_tablescommands.
Removed
The
--icd10_filecommand-line option has been removed.
0.15.1 (Thursday 21st March 2019)
Fixed
Fixed a bug which arose when using the
--rename_columnoption.
0.15.0 (Monday 18th March 2019)
Added
New cleaning function,
flattenHierarchical(), for use with hierarchical variables (e.g. ICD10). The function can be used to replace leaf values with parent values.New
hierarchymodule which contains helper functions and data structures for working with hierarchical variables.Definitions for all hierarchical UK Biobank variables are located in the
ukbparse/data/hierarchy/directory.
Deprecated
The
readICD10ConfigFile()function has been replaced with theloadHierarchyFile()function.The
ICD10Hierarchyclass has been replaced with theHierarchyclass .
0.14.8 (Monday 18th March 2019)
Fixed
Fixed an issue with the
processing_functions.binariseCategorical()processing function being applied to ICD10 codes.
0.14.7 (Sunday 17th March 2019)
Changed
Changes to default cleaning rules - negative values for integer/categorical types are no longer discarded.
0.14.6 (Saturday 16th March 2019)
Fixed
Fixed a
KeyErrorwhich was occurring during the child-value replacement stage for input files which did not have column names of the form[variable]-[visit].[instance].Fixed some issues introduced by behavioural changes in the
pandas.HDFStoreclass.
0.14.5 (Thursday 17th January 2019)
Fixed
Implemented a workaround for a bug in the Python
argparsemodule.
0.14.4 (Friday 11th January 2019)
Changed
Updated the default processing rules for variable [1120-1150](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=1120).
0.14.3 (Tuesday 8th January 2019)
Fixed
Fixed a regression introduced in 0.14.2, where column loading restrictions (e.g.
--variable) were not being honoured
0.14.2 (Monday 7th January 2019)
Fixed
Fixed a regression introduced in 0.14.1, where using the
--variableand--visitoptions together could cause a crash.
0.14.1 (Monday 7th January 2019)
Fixed
If the index columns for each input file have different names, the output index column was unnamed. It is now given the name of the index column in the first input file.
When the
--columnand--variableoptions were used together, only columns which passed both tests were being loaded. Now, columns which pass either test are loaded.
0.14.0 (Tuesday 25th December 2018)
Added
New
--columnoption, allowing columns to be selected by name/name pattern.ukbparsecan now be installed from conda-forge.
Changed
The index column in the output file no longer defaults to being named
'eid'. It defaults to the name of the index in the input file, but can still be overridden by the--output_id_columnoption.
Fixed
Blank lines are now allowed in configuration files (#2)
Fix to derived column names for ICD10 variables in default processing rules.
0.13.1 (Thursday 20th December 2018)
Added
Unit test to make sure that
ukbparsecrashes if given bad input arguments.
0.13.0 (Thursday 20th Deember 2018)
Added
New
--indexoption, allowing the position of the index column in input files to be specified.The
--variable,--subject, and--excludeoptions now accept comma-separated lists, in addition to IDs, ID ranges, and text files.
Fixed
Memory usage estimates in log messages were wrong under Linux.
0.12.3 (Tuesday 18th December 2018
Changed
Changes to new
fileinfo.has_header()function to improve robustness.
0.12.2 (Monday 17th December 2018)
Changed
Now using a custom implementation of
csv.Sniffer.has_header, as the standard library version does not handle some scenarios.
0.12.1 (Saturday 15th December 2018)
Added
Added some instructions for generating your own variable and data coding tables to the README.
Changed
The
ukbparse_demoscript ensures that the Jupyterbash_kernelis installed.The
ukbparse_compare_tables,ukbparse_htmlparseandukbparse_joinscripts print some help documentation when called without any arguments.Added
lxmlas a dependency (required bybeautifulsoup4).
0.12.0 (Tuesday 11th December 2018)
Added
The
join,compare_tables, andhtmlparsescripts are now installed as entry points calledukbparse_join,ukbparse_compare_tables, andukbparse_htmlparse.Jupyter notebook, demonstrating most of the features in
ukbparse, atukbparse/demo/ukbparse_demonstration.ipynb. You can run the demo via theukbparse_demoentry point.
Changed
Moved the
scripts/directory into theukbparse/directory.Improved string representation of process functions.
Fixed
Fix to configuration file parsing code -
shlex.splitis now used instead ofstr.split.Fixed mixed data type issues when merging the data coding and type tables into the variable table.
0.11.3 (Monday 10th December 2018)
Changed
Made the
vid,visit, andinstanceparameters to theColumnclass optional, to make life easier for custom sniffer functions.
0.11.2 (Monday 10th December 2018)
Fixed
Fixed a bug in the handling of new variable IDs returned by processing functions.
0.11.1 (Monday 10th December 2018)
Fixed
Fixed a bug in the
removeIfSparse()processing function.
0.11.0 (Monday 10th December 2018)
Added
New
--no_builtinsoption, which causes the built-in variable, data coding, type, and category table files to be bypassed.New
PluginRegistry.get()function for getting a reference to a plugin function.Cleaning/processing functions are listed in command-line help.
0.10.5 (Saturday 8th December 2018)
Changed
The
minpresoption to theremoveIfSparse()processing function is ignored if it is specified as an absolute value, and the data set length is less than it.
0.10.4 (Friday 7th December 2018)
Fixed
Fixed an issue with the –subject command line option.
0.10.3 (Friday 7th December 2018)
Fixed
Made use of the standard library
resourcemodule conditional, as it is not present on Windows.
0.10.2 (Friday 7th December 2018)
Fixed
Removed relative imports from test modules.
0.10.1 (Friday 7th December 2018)
Fixed
The
ukbparse.pluginspackage was missing an__init__.py, and was not being included in PyPI packages.
0.10.0 (Thursday 6th December 2018)
Added
New
--na_values,--recoding, and--child_valuescommand-line options for specifying/overriding NA insertion, categorical recodings, and child variable value replacement.--dry_runmode now prints information about columns that would not be loaded.
Fixed
Fixed a bug in the
calculateExpressionEvaluationOrder()function.
0.9.0 (Thursday 6th December 2018)
Added
Infrastructure for automatic deployment to PyPI and Zenodo.
Changed
Improved
--dry_runoutput formatting.
0.8.0
Added
New
--dry_runcommand-line option, which prints a summary of the cleaning and processing that would take place.
0.7.1
Fixed
Fixed a bug in the
icd10.saveCodes()function.
0.7.0
Changed
Small refactorings in
ukbparse.configso that command line arguments can be logged easily.
0.6.3
Changed
Minor updates to avoid deprecation warnings.
0.6.2
Fixed
Fixed a bug with the
--import_alloption, where an error would be thrown if a specifically requested variable was removed during processing.
0.6.1
Changed
Changed default processing for variables 41202/41204 so they are binarised within visit.
0.6.0
Added
New
--import_alland--unknown_vars_fileoptions for outputting information about previously unknown variables/columns.
Changed
Changed processing function return value interface - see the
processing_functionsmodule for details.
0.5.0
Added
Ability to export a mapping file containing the numeric values that ICD10 codes have been converted into - see the
--icd10_map_fileargument.
Changed
Changes to default processing - all ICD10 variables are binarised by default. Sparsity/redundancy tests happen at the end, so that columns generated by previous steps are tested.
Fixed
HDFStoreCollection.loc()method returns apandas.DataFramewhen a list of columns are indexed, and apandas.Serieswhen a single column is indexed.
0.4.1
Changed
Updates to variable table for UKBiobank spirometry variables.
0.4.0
Added
New
parseSpirometryData()function for parsing spirometry data (i.e. UKBiobank variable 3066
Removed
Removed the
--disable_renamecommand line option, because having the columns renamed is really annoying.
0.3.3
Changed
Reverted the behaviour of
isSparse().
0.3.2
Changed
Changed the behaviour of
isSparse()so that series which are greater than theminpresthreshold pass, rather than greater than or equal to.
0.3.1
Changed
The
isSparse()function ignores theminpresargument if it is larger than the number of samples in the data set.
Fixed
The
processing_functions.binariseCategorical()function now works on data with missing values.
0.3.0
Added
New
DataTable.addColumns()method, so processing functions can now add new columns.New
processing_functions.binariseCategorical()processing function, which expands a categorical column into multiple binary columns, one for each unique value in the data.New
processing_functions.expandCompound()processing function, which expands a compound column into columns, one for each value in the compound data.Keyword arguments can now be used when specifying processing.
Fixed
Fixed handling of non-numeric categorical variables
0.2.0
Added
Added a changelog file
Changed
Updated variable/datacoding files to bring them in line with the latest Biobank data release.