FUNPACK overview

Note: If you have FUNPACK installed, you can start an interactive version of this page by running fmrib_unpack_demo.

FUNPACK is a command-line program which you can use to extract data from UK Biobank (and other tabular) data. You can run FUNPACK by calling the fmrib_unpack command.

You can give FUNPACK one or more input files (e.g. .csv, .tsv), and it will merge them together, perform some preprocessing, and produce a single output file.

A large number of rules are built into FUNPACK which are specific to UK Biobank data. But you can control and customise everything that FUNPACK does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.

Important: The examples in this notebook assume that you have installed FUNPACK 4.0.0 or newer.

Note: The fmrib_unpack command was called funpack in older versions of FUNPACK, but was changed to fmrib_unpack in 3.0.0 to avoid a naming conflict with an unrelated software package.

[1]:
fmrib_unpack -V
funpack 4.0.1.dev6+gd8aa574c1

Note: If the above command produces a NameError, you may need to change the Jupyter Notebook kernel type to Bash - you can do so via the Kernel -> Change Kernel menu option.

Contents

  1. Overview

  2. Examples

  3. Import examples

  4. Cleaning examples

  5. Processing examples

  6. Custom cleaning, processing and loading - FUNPACK plugins

  7. Miscellaneous topics

Overview

FUNPACK performs the following steps:

  1. Import

  2. Cleaning

  3. Processing

  4. Export

1. Import

All data files are loaded in, unwanted data-fields and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.

Note: FUNPACK refers to UK Biobank Data-fields as variables. The two terms can be considered interchangeable. Also note that multiple columns may map to one data-field (e.g. for data-fields collected at each assessment centre visit).

2. Cleaning

The following cleaning steps are applied to each column:

  1. NA value replacement: Specific values for some data-fields are replaced with NA, for example, data-fields where a value of -1 indicates Do not know.

  2. Data-field-specific cleaning functions: Certain data-fields are re-formatted; for example, the ICD10 disease codes can be converted to integer representations.

  3. Categorical recoding: Certain categorical data-fields are re-coded.

  4. Child value replacement: NA values within some data-fields which are dependent upon other data-fields may have values inserted based on the values of their parent data-fields.

3. Processing

During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.

A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
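As a conceptual illustration (not FUNPACK's actual implementation), expanding a categorical column into a set of binary columns can be sketched like this:

```python
# Sketch: expand a categorical column into one binary (0/1) column per
# category, as FUNPACK's processing stage can do. Conceptual only.
def expand_categorical(values):
    """Map a list of category labels to {category: [0/1 flags]}."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

col    = ["A", "B", "A", "C"]
binary = expand_categorical(col)
# binary["A"] == [1, 0, 1, 0]
```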

4. Export

The processed data can be saved as a .csv, .tsv, or .hdf5 file.

Examples

Throughout these examples, we are going to use a few command-line options which you will probably not need in normal use:

  • -ow (short for --overwrite): This tells fmrib_unpack not to complain if the output file already exists.

  • -nd (short for --no_download): This tells fmrib_unpack to use built-in copies of the UK Biobank schema files, instead of downloading the latest versions.

  • -q (short for --quiet): This tells fmrib_unpack to be quiet. Without the -q option, fmrib_unpack can be quite verbose, which can be annoying, but is very useful when things go wrong. When using fmrib_unpack for real, a good strategy is to tell it to produce verbose output using the --noisy (-n for short) option, and to send all of its output to a log file with the --log_file (or -lf) option. For example:

    fmrib_unpack -n -n -n -lf log.txt out.tsv in.tsv
    
[2]:
alias fmrib_unpack="fmrib_unpack -ow -nd -q"

Here’s the first example input data set, with UK Biobank-style column names:

[3]:
cat data_01.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
1       31      65      10      11      84      22      56      65      90      12
2       56      52      52      42      89      35      3       65      50      67
3       45      84      20      84      93      36      96      62      48      59
4       7       46      37      48      80      20      18      72      37      27
5       8       86      51      68      80      84      11      28      69      10
6       6       29      85      59      7       46      14      60      73      80
7       24      49      41      46      92      23      39      68      7       63
8       80      92      97      30      92      83      98      36      6       23
9       84      59      89      79      16      12      95      73      2       62
10      23      96      67      41      8       20      97      57      59      23

The numbers in each column name typically represent:

  1. The data-field ID

  2. The “instance”, usually corresponding to the assessment centre visit, for data-fields which were collected at multiple points in time.

  3. The index, for multi-valued data-fields.

FUNPACK understands UK Biobank column names which conform to any of these formats:

  • <datafield>-<instance>.<index> (e.g. 31-0.0)

  • f.<datafield>.<instance>.<index> (e.g. f.31.0.0)

  • p<datafield>[_i<instance>][_a<index>] (e.g. p31)
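The three formats above can be sketched as regular expressions - this is an illustration of the naming scheme, not FUNPACK's own parser:

```python
import re

# Sketch: parse the three UK Biobank column-name formats listed above.
# Returns (datafield, instance, index); instance/index default to 0.
PATTERNS = [
    re.compile(r"^(\d+)-(\d+)\.(\d+)$"),             # 31-0.0
    re.compile(r"^f\.(\d+)\.(\d+)\.(\d+)$"),         # f.31.0.0
    re.compile(r"^p(\d+)(?:_i(\d+))?(?:_a(\d+))?$"), # p31, p31_i0, p31_i0_a0
]

def parse_column(name):
    for pat in PATTERNS:
        m = pat.match(name)
        if m:
            df, inst, idx = m.groups()
            return int(df), int(inst or 0), int(idx or 0)
    raise ValueError(f"Not a UK Biobank-style column name: {name}")
```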

Note that one data-field is typically associated with several columns, although we’re keeping things simple for this first example - there is only one visit for each data-field, and there are no multi-valued data-fields.

Most, but not all, data-fields in the UK Biobank contain data collected at different visits - the times at which participants attended a UK Biobank assessment centre. The instance number in the column name identifies the assessment centre visit. However, there are some data-fields (e.g. ICD10 diagnosis codes) for which this is not the case.

Import examples

Selecting data-fields (columns)

You can specify which data-fields you want to load in the following ways, using the --variable (-v for short), --category (-c for short) and --column (-co for short) command line options:

  • By data-field ID.

  • By data-field ranges.

  • By a text file which contains the IDs you want to keep.

  • By pre-defined data-field categories.

  • By column name.

Selecting individual data-fields

Simply provide the IDs of the data-fields you want to extract:

[4]:
fmrib_unpack -v 1,5 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   5-0.0
1       31      84.0
2       56      89.0
3       45      93.0
4       7       80.0
5       8       80.0
6       6       7.0
7       24      92.0
8       80      92.0
9       84      16.0
10      23      8.0

Selecting data-field ranges

The --variable/-v option accepts MATLAB-style ranges of the form start:step:stop (where the stop is inclusive):

[5]:
fmrib_unpack -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   4-0.0   7-0.0   10-0.0
1       31      11.0    56      12
2       56      42.0    3       67
3       45      84.0    96      59
4       7       48.0    18      27
5       8       68.0    11      10
6       6       59.0    14      80
7       24      46.0    39      63
8       80      30.0    98      23
9       84      79.0    95      62
10      23      41.0    97      23
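The inclusive-stop expansion of these ranges can be sketched in Python (a conceptual illustration; a two-part range such as 1:8 is assumed to imply a step of 1):

```python
# Sketch: expand a MATLAB-style start:step:stop range (stop inclusive),
# as accepted by --variable. "start:stop" is treated as step 1.
def expand_range(spec):
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 2:
        start, step, stop = parts[0], 1, parts[1]
    else:
        start, step, stop = parts
    return list(range(start, stop + 1, step))

expand_range("1:3:10")  # [1, 4, 7, 10] - matches the example above
```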

Selecting data-fields with a file

If your data-fields of interest are listed in a plain-text file, you can simply pass that file:

[6]:
echo -e "1\n6\n9" > vars.txt
fmrib_unpack -v vars.txt out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   6-0.0   9-0.0
1       31      22.0    90
2       56      35.0    50
3       45      36.0    48
4       7       20.0    37
5       8       84.0    69
6       6       46.0    73
7       24      23.0    7
8       80      83.0    6
9       84      12.0    2
10      23      20.0    59

Selecting data-fields from pre-defined categories

Some UK Biobank-specific categories are built into FUNPACK, but you can also define your own categories - you just need to create a .tsv file, and pass it to FUNPACK via the --category_file option (-cf for short):

[7]:
echo -e "ID\tCategory\tVariables"     > custom_categories.tsv
echo -e "1\tGood data-fields\t1:5,7" >> custom_categories.tsv
echo -e "2\tBad data-fields\t6,8:10" >> custom_categories.tsv
cat custom_categories.tsv
ID      Category        Variables
1       Good data-fields        1:5,7
2       Bad data-fields 6,8:10

Use the --category option (-c for short) to select categories to output. You can refer to categories by their ID:

[8]:
fmrib_unpack -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   7-0.0
1       31      65      10.0    11.0    84.0    56
2       56      52      52.0    42.0    89.0    3
3       45      84      20.0    84.0    93.0    96
4       7       46      37.0    48.0    80.0    18
5       8       86      51.0    68.0    80.0    11
6       6       29      85.0    59.0    7.0     14
7       24      49      41.0    46.0    92.0    39
8       80      92      97.0    30.0    92.0    98
9       84      59      89.0    79.0    16.0    95
10      23      96      67.0    41.0    8.0     97

Or by name:

[9]:
fmrib_unpack -cf custom_categories.tsv -c bad out.tsv data_01.tsv
cat out.tsv
eid     6-0.0   8-0.0   9-0.0   10-0.0
1       22.0    65      90      12
2       35.0    65      50      67
3       36.0    62      48      59
4       20.0    72      37      27
5       84.0    28      69      10
6       46.0    60      73      80
7       23.0    68      7       63
8       83.0    36      6       23
9       12.0    73      2       62
10      20.0    57      59      23

Selecting column names

If you are working with data that has non-UK Biobank style column names, you can use the --column option (-co for short) to select individual columns by their name, rather than by the data-field with which they are associated. The --column option accepts full column names, and also shell-style wildcard patterns:

[10]:
fmrib_unpack -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
eid     4-0.0   10-0.0
1       11.0    12
2       42.0    67
3       84.0    59
4       48.0    27
5       68.0    10
6       59.0    80
7       46.0    63
8       30.0    23
9       79.0    62
10      41.0    23
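The wildcard matching behaves like Python's fnmatch module - this sketch shows why the "??-0.0" pattern above selects only the 10-0.0 column:

```python
import fnmatch

# Sketch: shell-style wildcard matching of column names, as --column does.
# "?" matches exactly one character, so "??-0.0" needs a two-digit prefix.
columns  = ["eid", "1-0.0", "4-0.0", "10-0.0"]
selected = [c for c in columns if fnmatch.fnmatch(c, "??-0.0")]
# selected == ["10-0.0"]
```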

Selecting subjects (rows)

FUNPACK assumes that the first column in every input file is a subject ID, usually labelled eid. You can specify which subjects you want to load via the --subject (-s for short) option. You can specify subjects in the same way that you specified data-fields above, and also:

  • By specifying a conditional expression on data-field values - only subjects for which the expression evaluates to true will be imported

  • By specifying subjects to exclude

Selecting individual subjects

[11]:
fmrib_unpack -s 1,3,5 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
1       31      65      10.0    11.0    84.0    22.0    56      65      90      12
3       45      84      20.0    84.0    93.0    36.0    96      62      48      59
5       8       86      51.0    68.0    80.0    84.0    11      28      69      10

Selecting subject ranges

[12]:
fmrib_unpack -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
2       56      52      52.0    42.0    89.0    35.0    3       65      50      67
4       7       46      37.0    48.0    80.0    20.0    18      72      37      27
6       6       29      85.0    59.0    7.0     46.0    14      60      73      80
8       80      92      97.0    30.0    92.0    83.0    98      36      6       23
10      23      96      67.0    41.0    8.0     20.0    97      57      59      23

Selecting subjects from a file

[13]:
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
fmrib_unpack -s subjects.txt out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
5       8       86      51.0    68.0    80.0    84.0    11      28      69      10
6       6       29      85.0    59.0    7.0     46.0    14      60      73      80
7       24      49      41.0    46.0    92.0    23.0    39      68      7       63
8       80      92      97.0    30.0    92.0    83.0    98      36      6       23
9       84      59      89.0    79.0    16.0    12.0    95      73      2       62
10      23      96      67.0    41.0    8.0     20.0    97      57      59      23

Selecting subjects by data-field value

The --subject option accepts data-field expressions (a.k.a. variable expressions) - you can write an expression performing numerical comparisons on data-fields (denoted with a leading v), and combine these expressions using boolean algebra. Only subjects for which the expression evaluates to true will be imported. For example, to only import subjects where data-field 1 is greater than 10, and data-field 2 is less than 70, you can type:

[14]:
fmrib_unpack -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
1       31      65      10.0    11.0    84.0    22.0    56      65      90      12
2       56      52      52.0    42.0    89.0    35.0    3       65      50      67
7       24      49      41.0    46.0    92.0    23.0    39      68      7       63
9       84      59      89.0    79.0    16.0    12.0    95      73      2       62

The following symbols can be used in data-field expressions:

Symbol     Meaning
--------   --------------------------------
==         equal to
!=         not equal to
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
na         N/A
&&         logical and
||         logical or
~          logical not
contains   contains sub-string
all        all columns must meet condition
any        any column must meet condition
()         to denote precedence

The all and any symbols allow you to control how an expression is evaluated across multiple columns which are associated with one data-field (e.g. separate columns for each visit). We will give an example of this in the section on selecting visits, below.

Non-numeric (i.e. string) data-fields can be used in these expressions in conjunction with the ==, !=, and contains operators, and date/time data-fields can be compared using the ==, !=, >, >=, <, and <= operators. For example, imagine that we have the following data set:

[15]:
cat data_02.tsv
eid     1-0.0   33-0.0
1       A       1948-07-14
2       B       1954-11-18
3       B       1977-09-16
4       A       1962-04-26
5       B       1945-02-08
6       A       1978-07-24
7       B       1950-07-25
8       A       1965-07-25
9       B       1967-07-16
10      A       1950-11-16

And we want to identify subjects who were born during or after 1965 (data-field 33), and who have a value of B for data-field 1:

[16]:
fmrib_unpack -s "v33 >= 1965-01-01 && v1 == 'B'" out.tsv data_02.tsv
cat out.tsv
eid     1-0.0   33-0.0
3       B       1977-09-16
9       B       1967-07-16

When comparing dates and times, you must use the format YYYY-MM-DD and YYYY-MM-DD HH:MM:SS respectively. When comparing strings, you must surround values with single or double quotes.
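The subject filtering that such an expression performs can be sketched in Python - the logic of "v1 > 10 && v2 < 70" is written out directly here, rather than parsed as FUNPACK does:

```python
# Sketch: how a data-field expression selects subjects. Values are taken
# from the first three rows of data_01.tsv above.
subjects = {
    1: {"v1": 31, "v2": 65},
    2: {"v1": 56, "v2": 52},
    3: {"v1": 45, "v2": 84},
}

# equivalent of -s "v1 > 10 && v2 < 70"
keep = [eid for eid, row in subjects.items()
        if row["v1"] > 10 and row["v2"] < 70]
# keep == [1, 2] - subject 3 is dropped because v2 is 84
```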

Evaluating a data-field expression requires the data for every subject to be loaded into memory, so that the conditional expression can be evaluated, which can be problematic on systems with limited RAM.

A useful strategy, if you intend to work with the same subset of subjects more than once, is to use FUNPACK once, to identify the subjects of interest, save their IDs to a text file (e.g. subjects.txt), and on subsequent calls to FUNPACK, use the text file to select subjects (e.g. fmrib_unpack -s subjects.txt). This means that subsequent FUNPACK runs will require less memory, and run much faster.

The --ids_only option can be used to generate an output file which only contains the IDs of the rows which would have been output, so can be used to generate a subject ID file. Let’s re-run one of the examples above, but this time with the --ids_only option:

[17]:
fmrib_unpack --ids_only -v 1,2 -s "v1 > 10 && v2 < 70" subjects.txt data_01.tsv
cat subjects.txt
1
2
7
9

Now we can use subjects.txt with the -s option in subsequent FUNPACK calls, to select the same subjects without having to re-evaluate the data-field expression.

Excluding subjects

The --exclude (-ex for short) option allows you to exclude subjects - it accepts individual IDs, an ID range, or a file containing IDs. The --exclude/-ex option takes precedence over the --subject/-s option:

[18]:
fmrib_unpack -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
1       31      65      10.0    11.0    84.0    22.0    56      65      90      12
2       56      52      52.0    42.0    89.0    35.0    3       65      50      67
3       45      84      20.0    84.0    93.0    36.0    96      62      48      59
4       7       46      37.0    48.0    80.0    20.0    18      72      37      27

Selecting visits

Many data-fields in the UK Biobank data contain observations at multiple points in time, or visits. FUNPACK allows you to specify which visits you are interested in. Here is an example data set with data-fields that have data for multiple visits (remember that the second number in the column names denotes the visit):

[19]:
cat data_03.tsv
eid     1-0.0   2-0.0   2-1.0   2-2.0   3-0.0   3-1.0   4-0.0   5-0.0
1       86      76      82      75      34      99      50      5
2       20      25      40      44      30      57      54      44
3       85      2       48      42      23      77      84      27
4       23      30      18      97      44      55      97      20
5       83      45      76      51      18      64      8       33

We can use the --visit (-vi for short) option to get just the last visit for each data-field:

[20]:
fmrib_unpack -vi last out.tsv data_03.tsv
cat out.tsv
eid     1-0.0   2-2.0   3-1.0   4-0.0   5-0.0
1       86      75      99.0    50.0    5.0
2       20      44      57.0    54.0    44.0
3       85      42      77.0    84.0    27.0
4       23      97      55.0    97.0    20.0
5       83      51      64.0    8.0     33.0

You can also specify which visit you want by its number:

[21]:
fmrib_unpack -vi 1 out.tsv data_03.tsv
cat out.tsv
eid     2-1.0   3-1.0
1       82      99.0
2       40      57.0
3       48      77.0
4       18      55.0
5       76      64.0

Data-fields which are not associated with specific visits (e.g. ICD10 diagnosis codes) will not be affected by the -vi option. Specifically, the -vi option is only applied to data-fields which use an “instancing” of 2 (Assessment Centre Visit).

Evaluating expressions across visits

The data-field expressions described above in the section on selecting subjects are applied to all of the columns associated with a data-field. By default, an expression evaluates to true where the values in any column associated with the data-field evaluate to true. For example, we can extract the data for subjects where the value of any column of data-field 2 was less than 50:

[22]:
fmrib_unpack -v 2 -s 'v2 < 50' out.tsv data_03.tsv
cat out.tsv
eid     2-0.0   2-1.0   2-2.0
2       25      40      44
3       2       48      42
4       30      18      97
5       45      76      51

We can use the any and all operators to control how an expression is evaluated across the columns of a data-field. For example, we may only be interested in subjects for whom all columns of data-field 2 were less than 50:

[23]:
fmrib_unpack -v 2 -s 'all(v2 < 50)' out.tsv data_03.tsv
cat out.tsv
eid     2-0.0   2-1.0   2-2.0
2       25      40      44
3       2       48      42

We can use any and all in expressions involving multiple data-fields:

[24]:
fmrib_unpack -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_03.tsv
cat out.tsv
eid     2-0.0   2-1.0   2-2.0   3-0.0   3-1.0
4       30      18      97      44.0    55.0
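The any/all semantics across the columns of one data-field map directly onto Python's built-in any() and all() - a conceptual sketch using the visit columns of data-field 2 for subject 4:

```python
# Sketch: evaluating "v2 < 50" across the multiple visit columns of one
# data-field. The default behaviour corresponds to any().
visits = {"2-0.0": 30, "2-1.0": 18, "2-2.0": 97}

any_lt_50 = any(v < 50 for v in visits.values())  # default: any column
all_lt_50 = all(v < 50 for v in visits.values())  # all(...) operator
# any_lt_50 is True, all_lt_50 is False (2-2.0 is 97)
```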

Merging multiple input files

If your data is split across multiple files, you can specify how FUNPACK should merge them together.

Merging two UK Biobank data sets with a bridge file

Every UK Biobank application is assigned a different set of subject IDs, for anonymisation/privacy purposes. Some researchers are part of multiple applications, and occasionally need to work with data from both applications. To facilitate this, the UK Biobank may provide these researchers with a bridge file which contains a mapping between the subject IDs from both applications. You can use FUNPACK to merge two data sets together with a bridge file.

For example, let’s say we have two input files (shown side-by-side) with different sets of subject IDs:

[25]:
echo " " | paste data_04.tsv - data_04_2.tsv
eid     1-0.0   2-0.0   3-0.0           eid     4-0.0   5-0.0   6-0.0
1       89      47      26              100     28      78      89
2       94      37      70              200     42      13      70
3       63      5       97              300     17      79      86
4       98      97      91              400     60      19      38
5       37      10      11              500     26      76      6

And we have a bridge file which maps the two sets of subject IDs to each other:

[26]:
cat bridge.csv
eid1,eid2
1,100
2,200
3,300
4,400
5,500

We can merge these two datasets together by running FUNPACK twice - first, we will merge the bridge file and the second data set together. We use the --index <filename> <column> option to tell FUNPACK to use the second set of subject IDs from the bridge file.

[27]:
fmrib_unpack --index bridge.csv 1 temp.tsv bridge.csv data_04_2.tsv
cat temp.tsv
eid2    eid1    4-0.0   5-0.0   6-0.0
100     1       28.0    78.0    89.0
200     2       42.0    13.0    70.0
300     3       17.0    79.0    86.0
400     4       60.0    19.0    38.0
500     5       26.0    76.0    6.0

Then we can run FUNPACK again to merge in the first data set. This time, we use the --index <filename> <column> again to tell FUNPACK to use the first set of subject IDs from the intermediate file.

[28]:
fmrib_unpack --index temp.tsv 1 out.tsv data_04.tsv temp.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   eid2    4-0.0   5-0.0   6-0.0
1       89      47      26.0    100     28.0    78.0    89.0
2       94      37      70.0    200     42.0    13.0    70.0
3       63      5       97.0    300     17.0    79.0    86.0
4       98      97      91.0    400     60.0    19.0    38.0
5       37      10      11.0    500     26.0    76.0    6.0
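The two-step bridge merge amounts to a join through the ID mapping - a stdlib sketch of the idea (FUNPACK does this internally with data frames):

```python
# Sketch: merge two data sets whose subject IDs differ, via a bridge
# mapping eid2 -> eid1. Values taken from the first two rows above.
bridge = {100: 1, 200: 2}                          # eid2 -> eid1
data_a = {1: {"1-0.0": 89}, 2: {"1-0.0": 94}}      # keyed by eid1
data_b = {100: {"4-0.0": 28}, 200: {"4-0.0": 42}}  # keyed by eid2

merged = {}
for eid2, row in data_b.items():
    eid1 = bridge[eid2]                 # translate to application-1 ID
    merged[eid1] = {**data_a[eid1], **row}
# merged[1] == {"1-0.0": 89, "4-0.0": 28}
```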

FUNPACK also has more flexible options for merging datasets, as outlined below.

Merging by subject

For example, let’s say we have these two input files (shown side-by-side):

[29]:
echo " " | paste data_04.tsv - data_05.tsv
eid     1-0.0   2-0.0   3-0.0           eid     4-0.0   5-0.0   6-0.0
1       89      47      26              2       19      17      62
2       94      37      70              3       41      12      7
3       63      5       97              4       8       86      9
4       98      97      91              5       7       65      71
5       37      10      11              6       3       23      15

Note that each file contains different data-fields, and different, but overlapping, subjects. By default, when you pass these files to FUNPACK, it will output the intersection of the two files (more formally known as an inner join), i.e. subjects which are present in both files:

[30]:
fmrib_unpack out.tsv data_04.tsv data_05.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0
2       94      37      70.0    19.0    17.0    62.0
3       63      5       97.0    41.0    12.0    7.0
4       98      97      91.0    8.0     86.0    9.0
5       37      10      11.0    7.0     65.0    71.0

If you want to keep all subjects, you can instruct FUNPACK to output the union (a.k.a. outer join) via the --merge_strategy (-ms for short) option:

[31]:
fmrib_unpack -ms outer out.tsv data_04.tsv data_05.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0
1       89.0    47.0    26.0
2       94.0    37.0    70.0    19.0    17.0    62.0
3       63.0    5.0     97.0    41.0    12.0    7.0
4       98.0    97.0    91.0    8.0     86.0    9.0
5       37.0    10.0    11.0    7.0     65.0    71.0
6                               3.0     23.0    15.0
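The difference between the default inner join and -ms outer can be sketched with plain dicts keyed by subject ID (a conceptual illustration, not FUNPACK's merge code):

```python
# Sketch: inner vs outer join on subject ID.
a = {1: {"1-0.0": 89}, 2: {"1-0.0": 94}, 3: {"1-0.0": 63}}
b = {2: {"4-0.0": 19}, 3: {"4-0.0": 41}, 4: {"4-0.0": 8}}

# inner join (default): only subjects present in both files
inner = {k: {**a[k], **b[k]} for k in a.keys() & b.keys()}
# outer join (-ms outer): all subjects, missing values left empty
outer = {k: {**a.get(k, {}), **b.get(k, {})} for k in a.keys() | b.keys()}
# sorted(inner) == [2, 3]; sorted(outer) == [1, 2, 3, 4]
```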

Merging by column

Your data may be organised in a different way. For example, these next two files contain different groups of subjects, but overlapping columns:

[32]:
echo " " | paste data_06.tsv - data_07.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0           eid     2-0.0   3-0.0   4-0.0   5-0.0   6-0.0
1       69      80      70      60      42              4       17      36      56      90      12
2       64      15      82      99      67              5       63      16      87      57      63
3       33      67      58      96      26              6       43      19      84      53      63

In this case, we need to tell FUNPACK to merge along the row axis, rather than along the column axis. We can do this with the --merge_axis (-ma for short) option:

[33]:
fmrib_unpack -ma rows out.tsv data_06.tsv data_07.tsv
cat out.tsv
eid     2-0.0   3-0.0   4-0.0   5-0.0
1       80      70.0    60.0    42.0
2       15      82.0    99.0    67.0
3       67      58.0    96.0    26.0
4       17      36.0    56.0    90.0
5       63      16.0    87.0    57.0
6       43      19.0    84.0    53.0

Again, if we want to retain all columns, we can tell FUNPACK to perform an outer join with the -ms option:

[34]:
fmrib_unpack -ma rows -ms outer out.tsv data_06.tsv data_07.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0
1       69.0    80      70.0    60.0    42.0
2       64.0    15      82.0    99.0    67.0
3       33.0    67      58.0    96.0    26.0
4               17      36.0    56.0    90.0    12.0
5               63      16.0    87.0    57.0    63.0
6               43      19.0    84.0    53.0    63.0

Naive merging

Finally, your data may be organised such that you simply want to “paste”, or concatenate them together, along either rows or columns. For example, your data files might look like this:

[35]:
echo " " | paste data_08.tsv - data_09.tsv
eid     1-0.0   2-0.0   3-0.0           eid     4-0.0   5-0.0   6-0.0
1       30      99      57              1       16      54      60
2       3       6       75              2       43      59      9
3       13      91      36              3       71      73      38

Here, we have columns for different data-fields on the same set of subjects, and we just need to concatenate them together horizontally. We do this by using --merge_strategy naive (-ms naive for short):

[36]:
fmrib_unpack -ms naive out.tsv data_08.tsv data_09.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0   5-0.0   6-0.0
1       30      99      57.0    16.0    54.0    60.0
2       3       6       75.0    43.0    59.0    9.0
3       13      91      36.0    71.0    73.0    38.0

For files which need to be concatenated vertically, such as these:

[37]:
echo " " | paste data_10.tsv - data_11.tsv
eid     1-0.0   2-0.0   3-0.0           eid     1-0.0   2-0.0   3-0.0
1       16      34      10              4       40      89      58
2       62      78      16              5       25      75      9
3       72      29      53              6       28      74      57

We need to tell FUNPACK which axis to concatenate along, again using the -ma option:

[38]:
fmrib_unpack -ms naive -ma rows out.tsv data_10.tsv data_11.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0
1       16      34      10.0
2       62      78      16.0
3       72      29      53.0
4       40      89      58.0
5       25      75      9.0
6       28      74      57.0

Cleaning examples

Once the data has been imported, a set of cleaning steps are applied to each column.

NA insertion

For some data-fields it may make sense to discard or ignore certain values. For example, if an individual selects Do not know to a question such as How much milk did you drink yesterday?, that answer will be coded with a specific value (e.g. -1). It does not make any sense to include these values in most analyses, so FUNPACK can be used to mark such values as Not Available (NA).

A large number of NA insertion rules, specific to UK Biobank data-fields, are coded into FUNPACK, and are applied when you use the -cfg fmrib option (see the section below on built-in rules). You can also specify your own rules via the --na_values (-nv for short) option.

Let’s say we have this data set:

[39]:
cat data_12.tsv
eid     1-0.0   2-0.0   3-0.0
1       4       1       6
2       2       6       0
3       7       0       -1
4       -1      6       1
5       2       8       4
6       0       2       7
7       -1      0       0
8       7       7       2
9       4       -1      -1
10      8       -1      2

For data-field 1, we want to ignore values of -1, for data-field 2 we want to ignore -1 and 0, and for data-field 3 we want to ignore 1 and 2:

[40]:
fmrib_unpack -nv 1 " -1" -nv 2 " -1,0" -nv 3 "1,2" out.tsv data_12.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0
1       4.0     1.0     6.0
2       2.0     6.0     0.0
3       7.0             -1.0
4               6.0
5       2.0     8.0     4.0
6       0.0     2.0     7.0
7                       0.0
8       7.0     7.0
9       4.0             -1.0
10      8.0

The --na_values option expects two arguments:

  • The data-field ID

  • A comma-separated list of values to replace with NA
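The effect of these rules can be sketched as a simple per-data-field replacement (conceptual only; FUNPACK operates on the full data frame):

```python
# Sketch: per-data-field NA insertion, as configured with --na_values.
na_rules = {1: {-1}, 2: {-1, 0}, 3: {1, 2}}  # data-field -> values to drop

def apply_na(datafield, values):
    na = na_rules.get(datafield, set())
    return [None if v in na else v for v in values]

apply_na(2, [1, 6, 0, -1])  # [1, 6, None, None]
```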

Data-field-specific cleaning functions

A small number of cleaning/preprocessing functions are built into FUNPACK, which can be applied to specific data-fields. For example, some data-fields in the UK Biobank contain ICD10 disease codes, which may be more useful if converted to a numeric format (e.g. to make them easy to load into MATLAB). Imagine that we have some data with ICD10 codes:

[41]:
cat data_13.tsv
eid     1-0.0
1       A481
2       A590
3       B391
4       D596
5       Z980

We can use the --clean (-cl for short) option with the built-in codeToNumeric cleaning function to convert the codes to a numeric representation*:

[42]:
fmrib_unpack -cl 1 "codeToNumeric('icd10')" out.tsv data_13.tsv
cat out.tsv
eid     1-0.0
1       5340
2       5960
3       9320
4       21590
5       191430

*The codeToNumeric function will replace each ICD10 code with the corresponding Node number, as defined in the UK Biobank ICD10 data coding.

The --clean option expects two arguments:

  • The data-field ID

  • The cleaning function to apply to that data-field

You can find a list of the built-in cleaning functions here. You can also define your own cleaning functions by passing them in as a --plugin_file (see the section on custom plugins below).

Example: flattening hierarchical data

Several data-fields in the UK Biobank (including the ICD10 disease categorisations) are organised in a hierarchical manner - each value is a child of a more general parent category. The flattenHierarchical cleaning function can be used to replace each value in a data set with the value that corresponds to a parent category. Let’s apply this to our example ICD10 data set.

[43]:
fmrib_unpack -cl 1 "flattenHierarchical(name='icd10')" out.tsv data_13.tsv
cat out.tsv
eid     1-0.0
1       Chapter I
2       Chapter I
3       Chapter I
4       Chapter III
5       Chapter XXI
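Hierarchical flattening amounts to walking each code up a child-to-parent tree until the top level is reached. A minimal sketch - the codes and parent links here are illustrative, not the real ICD10 coding tree:

```python
# Sketch: flatten a hierarchical coding by walking child -> parent links
# up to the top-level category. Illustrative mapping only.
parents = {
    "A48.1":   "A30-A49",    # specific code -> block
    "A30-A49": "Chapter I",  # block -> chapter (top level)
}

def flatten(code):
    while code in parents:
        code = parents[code]
    return code

flatten("A48.1")  # "Chapter I"
```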

Aside: ICD10 mapping file

FUNPACK has a feature specific to these ICD10 disease categorisations - you can use the --icd10_map_file (-imf for short) option to tell FUNPACK to save a file which contains a list of all ICD10 codes that were present in the input data, and the corresponding numerical codes that FUNPACK generated:

[44]:
fmrib_unpack -cl 1 "codeToNumeric('icd10')" -imf icd10_codes.tsv out.tsv data_13.tsv
cat icd10_codes.tsv
code    value   description     parent_descs
A481    5340    A48.1 Legionnaires' disease [Chapter I Certain infectious and parasitic diseases] [A30-A49 Other bacterial diseases] [A48 Other bacterial diseases, not elsewhere classified]
A590    5960    A59.0 Urogenital trichomoniasis [Chapter I Certain infectious and parasitic diseases] [A50-A64 Infections with a predominantly sexual mode of transmission] [A59 Trichomoniasis]
B391    9320    B39.1 Chronic pulmonary histoplasmosis capsulati        [Chapter I Certain infectious and parasitic diseases] [B35-B49 Mycoses] [B39 Histoplasmosis]
D596    21590   D59.6 Haemoglobinuria due to haemolysis from other external causes      [Chapter III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism] [D55-D59 Haemolytic anaemias] [D59 Acquired haemolytic anaemia]
Z980    191430  Z98.0 Intestinal bypass and anastomosis status  [Chapter XXI Factors influencing health status and contact with health services] [Z80-Z99 Persons with potential health hazards related to family and personal history and certain conditions influencing health status] [Z98 Other postsurgical states]

Categorical recoding

You may have some categorical data which is coded in an awkward manner, such as in this example, which encodes the amount of some item that an individual has consumed:

data coding example

You can use the --recoding (-re for short) option to recode data like this into something more useful. For example, given this data:

[45]:
cat data_14.tsv
eid     1-0.0
1       1
2       555
3       444
4       2
5       300
6       444
7       2
8       2

Let’s recode it to be more monotonic:

[46]:
fmrib_unpack -re 1 "300,444,555" "3,0.25,0.5" out.tsv data_14.tsv
cat out.tsv
eid     1-0.0
1       1.0
2       0.5
3       0.25
4       2.0
5       3.0
6       0.25
7       2.0
8       2.0

The --recoding option expects three arguments:

  • The data-field ID

  • A comma-separated list of the values to be replaced

  • A comma-separated list of the values to replace them with
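Conceptually, the recoding step is a value-for-value substitution; values not listed are left unchanged. A minimal pure-Python sketch (not FUNPACK's actual implementation):

```python
def recode(values, old, new):
    """Replace each value listed in `old` with its counterpart in `new`."""
    mapping = dict(zip(old, new))
    return [mapping.get(v, v) for v in values]

# Data-field 1 from data_14.tsv, recoded as in the example above
raw = [1, 555, 444, 2, 300, 444, 2, 2]
print(recode(raw, [300, 444, 555], [3, 0.25, 0.5]))
# [1, 0.5, 0.25, 2, 3, 0.25, 2, 2]
```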

Child value replacement

Imagine that we have these two questions:

  • 1: Do you currently smoke cigarettes?

  • 2: How many cigarettes did you smoke yesterday?

Now, question 2 was only asked if the answer to question 1 was Yes. So for all individuals who answered No to question 1, we will have a missing value for question 2. But for some analyses, it would make more sense to have a value of 0, rather than NA, for these subjects.

FUNPACK can handle these sorts of dependencies by way of child value replacement. For question 2, we can define a conditional variable expression such that when both question 2 is NA and question 1 is No, we can insert a value of 0 into question 2.

This scenario is demonstrated in this example data set (where, for question 1 values of 1 and 0 represent Yes and No respectively):

[47]:
cat data_15.tsv
eid     1-0.0   2-0.0
1       1       7
2       1       4
3       1       1
4       0
5       0
6       0
7       1       25
8       0

We can fill in the values for data-field 2 by using the --child_values (-cv for short) option:

[48]:
fmrib_unpack -cv 2 "v1 == 0" "0" out.tsv data_15.tsv
cat out.tsv
eid     1-0.0   2-0.0
1       1       7.0
2       1       4.0
3       1       1.0
4       0       0.0
5       0       0.0
6       0       0.0
7       1       25.0
8       0       0.0

The --child_values option expects three arguments:

  • The data-field ID

  • An expression evaluating some condition on the parent data-field(s)

  • A value to replace NA with where the expression evaluates to true.
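The rule itself is simple: wherever the child value is missing and the parent expression evaluates to true, the replacement value is inserted. A pure-Python sketch of the logic, using None for NA:

```python
def child_replace(child, parent, condition, fill):
    """Fill missing child values where the parent condition holds."""
    return [fill if (c is None and condition(p)) else c
            for c, p in zip(child, parent)]

q1 = [1, 1, 1, 0, 0, 0, 1, 0]               # 1: currently smokes? (1=Yes, 0=No)
q2 = [7, 4, 1, None, None, None, 25, None]  # 2: cigarettes smoked yesterday
print(child_replace(q2, q1, lambda v: v == 0, 0))
# [7, 4, 1, 0, 0, 0, 25, 0]
```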

Processing examples

After every column has been cleaned, the entire data set undergoes a series of processing steps. The processing stage may result in columns being removed or manipulated, or new columns being added.

The processing stage can be controlled with these options:

  • --prepend_process (-ppr for short): Apply a processing function before the built-in processing

  • --append_process (-apr for short): Apply a processing function after the built-in processing

A default set of processing steps is applied when you use the fmrib configuration profile via -cfg fmrib - see the section on built-in rules.

The --prepend_process and --append_process options require two arguments:

  • The data-field ID(s) to apply the function to, or all to denote all variables.

  • The processing function to apply. The available processing functions are listed here and in the command line help, or you can write your own and pass it in as a plugin file (see below).

Sparsity check

The removeIfSparse process will remove columns that are deemed to have too many missing values. If we take this data set:

[49]:
cat data_16.tsv
eid     1-0.0   2-0.0
1       7       24
2       2       37
3       4       14
4       6
5               77
6       7       10
7
8       3       13
9               62
10              74

Imagine that our analysis requires at least 8 values per data-field to work. We can use the minpres option to removeIfSparse to drop any columns which do not meet this threshold:

[50]:
fmrib_unpack -apr all "removeIfSparse(minpres=8)" out.tsv data_16.tsv
cat out.tsv
eid     2-0.0
1       24.0
2       37.0
3       14.0
4
5       77.0
6       10.0
7
8       13.0
9       62.0
10      74.0

You can also specify minpres as a proportion, rather than an absolute number, e.g.:

[51]:
fmrib_unpack -apr all "removeIfSparse(minpres=0.65, abspres=False)" out.tsv data_16.tsv
cat out.tsv
eid     2-0.0
1       24.0
2       37.0
3       14.0
4
5       77.0
6       10.0
7
8       13.0
9       62.0
10      74.0
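The underlying test is just a count of non-missing values per column, compared against minpres either as an absolute count or, with abspres=False, as a proportion. A pure-Python sketch of the test, using None for NA:

```python
def is_sparse(column, minpres, abspres=True):
    """True if the column has too few present (non-missing) values."""
    present = sum(v is not None for v in column)
    if not abspres:
        present = present / len(column)
    return present < minpres

col1 = [7, 2, 4, 6, None, 7, None, 3, None, None]    # 1-0.0: 6 of 10 present
col2 = [24, 37, 14, None, 77, 10, None, 13, 62, 74]  # 2-0.0: 8 of 10 present
print(is_sparse(col1, minpres=8))                    # True  -> dropped
print(is_sparse(col2, minpres=8))                    # False -> kept
print(is_sparse(col1, minpres=0.65, abspres=False))  # True  -> dropped
```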

Redundancy check

You may wish to remove columns which contain redundant information. The removeIfRedundant process calculates the pairwise correlation between all columns, and removes one column from each pair whose correlation exceeds a threshold that you provide. Imagine that we have this data set:

[52]:
cat data_17.tsv
eid     1-0.0   2-0.0   3-0.0
1       1       10      1
2       2       20      6
3       3       30      3
4       4       40      7
5       5       50      2
6       6       60      3
7       5       50      2
8       4       40      9
9       3       30      8
10      2       20      5

The data in column 2-0.0 is effectively equivalent to the data in column 1-0.0, so is not of any use to us. We can tell FUNPACK to remove it like so:

[53]:
fmrib_unpack -apr all "removeIfRedundant(0.9)" out.tsv data_17.tsv
cat out.tsv
eid     1-0.0   3-0.0
1       1       1.0
2       2       6.0
3       3       3.0
4       4       7.0
5       5       2.0
6       6       3.0
7       5       2.0
8       4       9.0
9       3       8.0
10      2       5.0
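The redundancy test boils down to a Pearson correlation between pairs of columns. A self-contained sketch of the comparison (not FUNPACK's actual implementation):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx  = sum((x - mx) ** 2 for x in xs)
    vy  = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

col1 = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2]            # 1-0.0
col2 = [10, 20, 30, 40, 50, 60, 50, 40, 30, 20]  # 2-0.0 = 10 * 1-0.0
print(abs(pearson(col1, col2)) > 0.9)  # True -> one of the pair is removed
```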

The removeIfRedundant process can also calculate the correlation between the patterns of missing values in each pair of data-fields. Consider this example:

[54]:
cat data_18.tsv
eid     1-0.0   2-0.0   3-0.0
1       1       10      100
2       2       20      200
3                       300
4       4       40      400
5                       500
6       4       40
7       3               300
8       2       20      200
9       1       10      100
10                      0

All three columns are highly correlated, but the pattern of missing values in column 3-0.0 is different to that of the other columns.

If we use the nathres option, FUNPACK will only remove columns where the correlations of both the values and the missingness patterns meet their respective thresholds. Note that the column which contains more missing values will be the one that gets removed:

[55]:
fmrib_unpack -apr all "removeIfRedundant(0.9, nathres=0.6)" out.tsv data_18.tsv
cat out.tsv
eid     1-0.0   3-0.0
1       1.0     100.0
2       2.0     200.0
3               300.0
4       4.0     400.0
5               500.0
6       4.0
7       3.0     300.0
8       2.0     200.0
9       1.0     100.0
10              0.0
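The nathres check works on the missingness patterns of the two columns: each column is reduced to a 0/1 indicator of which values are present, and those indicators are then correlated in the same way as the values themselves. A sketch of the indicator step, using None for NA:

```python
def na_pattern(column):
    """1 where a value is present, 0 where it is missing."""
    return [int(v is not None) for v in column]

col1 = [1, 2, None, 4, None, 4, 3, 2, 1, None]            # 1-0.0
col3 = [100, 200, 300, 400, 500, None, 300, 200, 100, 0]  # 3-0.0
print(na_pattern(col1))  # [1, 1, 0, 1, 0, 1, 1, 1, 1, 0]
print(na_pattern(col3))  # [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
```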

Categorical binarisation

The binariseCategorical process takes a column containing categorical labels, and replaces it with a set of new binary columns, one for each category. Imagine that we have this data:

[56]:
cat data_19.tsv
eid     1-0.0
1       1
2       2
3       3
4       2
5       2
6       3
7       1
8       4
9       1
10      3

We can use the binariseCategorical process to split column 1-0.0 into a separate column for each category:

[57]:
fmrib_unpack -apr 1 "binariseCategorical" out.tsv data_19.tsv
cat out.tsv
eid     1-0.1   1-0.2   1-0.3   1-0.4
1       1       0       0       0
2       0       1       0       0
3       0       0       1       0
4       0       1       0       0
5       0       1       0       0
6       0       0       1       0
7       1       0       0       0
8       0       0       0       1
9       1       0       0       0
10      0       0       1       0

There are a few options to binariseCategorical, including controlling whether the original column is removed, and also the naming of the newly created columns:

[58]:
fmrib_unpack -apr 1 "binariseCategorical(replace=False, nameFormat='{vid}:{value}')" out.tsv data_19.tsv
cat out.tsv
eid     1-0.0   1:1     1:2     1:3     1:4
1       1       1       0       0       0
2       2       0       1       0       0
3       3       0       0       1       0
4       2       0       1       0       0
5       2       0       1       0       0
6       3       0       0       1       0
7       1       1       0       0       0
8       4       0       0       0       1
9       1       1       0       0       0
10      3       0       0       1       0
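Conceptually, binarisation builds one 0/1 indicator column per distinct label. A minimal pure-Python sketch (the column naming and ordering here are illustrative, not FUNPACK's):

```python
def binarise(values):
    """Return one 0/1 indicator column per distinct category."""
    return {c: [int(v == c) for v in values] for c in sorted(set(values))}

labels = [1, 2, 3, 2, 2, 3, 1, 4, 1, 3]  # data-field 1 from data_19.tsv
cols   = binarise(labels)
print(cols[1])  # [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(cols[4])  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```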

Custom cleaning, processing and loading - FUNPACK plugins

If you want to apply some specific cleaning or processing function to a variable, you can code your functions up in Python, and then tell FUNPACK to apply them.

As an example, let’s say we have some data like this:

[59]:
cat data_20.tsv
eid     1-0.0   2-0.0   3-0.0
1       28      65      18
2       12      71      63
3       17      60      95
4       36      80      38
5       91      55      36
6       4       97      71
7       20      3       88
8       58      64      65
9       87      27      26
10      36      17      22

Custom cleaning functions

For our analysis, we are only interested in the even values in columns 1 and 2. Let’s write a cleaning function which replaces all odd values with NA:

[60]:
cat plugin_1.py | pygmentize
#!/usr/bin/env python

import numpy as np

from funpack import cleaner

# Cleaner functions are passed:
#
#   - dtable: An object which provides access to the data set.
#   - vid:    The variable ID of the column(s) to be cleaned.
#
# Cleaner functions should modify the data in-place.
@cleaner()
def drop_odd_values(dtable, vid):

    # Remember that a variable may be
    # associated with multiple columns
    cols = dtable.columns(vid)

    # the columns() method returns a list of
    # Column objects,  each of which contains
    # information about one column in the data.
    for col in cols:
        col = col.name
        mask = dtable[:, col] % 2 != 0
        dtable[mask, col] = np.nan

To use our custom cleaner function, we simply pass our plugin file to FUNPACK using the --plugin_file (-p for short) option:

[61]:
fmrib_unpack -p plugin_1.py -cl 1 drop_odd_values -cl 2 drop_odd_values out.tsv data_20.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0
1       28.0            18.0
2       12.0            63.0
3               60.0    95.0
4       36.0    80.0    38.0
5                       36.0
6       4.0             71.0
7       20.0            88.0
8       58.0    64.0    65.0
9                       26.0
10      36.0            22.0

Custom processing functions

Recall that cleaning functions are applied independently to each column, whereas processing functions may be applied to multiple columns simultaneously, and may add and/or remove columns. Let’s say we want to derive a new column from columns 1-0.0 and 2-0.0 in our example data set. Our plugin file might look like this:

[62]:
cat plugin_2.py | pygmentize
#!/usr/bin/env python

import itertools as it

from funpack import processor

# Processor functions are passed:
#
#   - dtable: An object which provides access to the data set.
#   - vids:   A list of the variable IDs to be processed.
#
# Processor functions can do any of the following:
#
#   - modify existing columns in place,
#   - return a list of columns that should be removed
#   - return a list of columns that should be added
@processor()
def sum_squares(dtable, vids):

    cols    = it.chain(*[dtable.columns(v) for v in vids])
    series  = [dtable[:, c.name] for c in cols]
    squares = [s * s for s in series]
    sumsq   = sum(squares)

    sumsq.name = 'sum_square({})'.format(','.join([str(v) for v in vids]))

    # The value returned by a processor function differs
    # depending on what it wishes to do. In this case,
    # we are returning a list of new pandas.Series to be
    # added as columns, and a list of integer variable
    # IDs, one for each new column. The variable IDs are
    # optional, so we are just returning None instead.
    return [sumsq], None

Again, to use our plugin, we pass it to FUNPACK via the --plugin_file/-p option:

[63]:
fmrib_unpack -p plugin_2.py -apr "1,2" "sum_squares" out.tsv data_20.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   sum_square(1,2)
1       28      65      18.0    5009
2       12      71      63.0    5185
3       17      60      95.0    3889
4       36      80      38.0    7696
5       91      55      36.0    11306
6       4       97      71.0    9425
7       20      3       88.0    409
8       58      64      65.0    7460
9       87      27      26.0    8298
10      36      17      22.0    1585

Custom file loaders

You might want to load some auxiliary data which is in an awkward format that FUNPACK cannot parse automatically. For example, you may have a file which has acquisition date information separated into year, month and day columns, e.g.:

[64]:
cat data_21.tsv
eid     year    month   day
1       2018    6       1
2       2018    6       17
3       2018    7       5
4       2018    3       11
5       2018    2       21
6       2018    8       8
7       2018    11      30
8       2018    12      5
9       2018    4       13

These three columns would be better loaded as a single column. So we can write a plugin to load this file for us. We need to write two functions:

  • A sniffer function, which returns information about the columns contained in the file.

  • A loader function which loads the file, returning it as a pandas.DataFrame.

[65]:
cat plugin_3.py | pygmentize
#!/usr/bin/env python

import pandas as pd

from funpack import sniffer, loader, Column

# Sniffer and loader functions are defined in
# pairs. A pair is denoted by its functions
# being given the same label, passed to the
# @sniffer/@loader decorators
# ("my_datefile_loader" in this example). The
# function names are irrelevant.

@sniffer('my_datefile_loader')
def columns_datefile(infile):

    # A sniffer function must return a
    # sequence of Column objects which
    # describe the file (after it has
    # been loaded by the loader function).
    #
    # The values passed to a Column object
    # are:
    #   - file name
    #   - column name
    #   - column index (starting from 0)
    return [Column(infile, 'eid',              0),
            Column(infile, 'acquisition_date', 1)]

@loader('my_datefile_loader')
def load_datefile(infile):

    def create_date(row):
        return pd.Timestamp(row['year'], row['month'], row['day'])

    df = pd.read_csv(infile, index_col=0, sep=r'\s+')

    df['acquisition_date'] = df.apply(create_date, axis=1)

    df.drop(['year', 'month', 'day'], axis=1, inplace=True)

    return df

And to see it in action:

[66]:
fmrib_unpack -p plugin_3.py -l data_21.tsv my_datefile_loader out.tsv data_21.tsv
cat out.tsv
eid     acquisition_date
1       2018-06-01T00:00:00
2       2018-06-17T00:00:00
3       2018-07-05T00:00:00
4       2018-03-11T00:00:00
5       2018-02-21T00:00:00
6       2018-08-08T00:00:00
7       2018-11-30T00:00:00
8       2018-12-05T00:00:00
9       2018-04-13T00:00:00

Miscellaneous topics

Non-numeric data

Many UK Biobank variables contain non-numeric data, such as alpha-numeric codes and unstructured text. If you want to select subjects on the basis of such columns, data-field expressions contain some simple mechanisms for doing so. Here is an example of a file containing both numeric and non-numeric data:

[67]:
cat data_22.tsv | column -t -s $'\t'
eid  1-0.0  2-0.0  3-0.0               4-0.0
1    A481   54     red green blue      0.6
2    A590   24     yellow aqua         3.2
3    B391   17     brown black purple  1.4
4    D596   77     green               7.3
5    Z980   26     white red           5.1

Let’s say we are only interested in subjects where data-field 1 contains a value in the A category:

Note the use of the -nb (--no_builtins) option here - this tells FUNPACK to ignore its built-in data-field table, which contains information about the type of each UK Biobank data-field, and would otherwise interfere with this example. You would almost never need to use this option normally.

[68]:
fmrib_unpack -nb -s "v1 contains 'A'" out.tsv data_22.tsv
cat out.tsv
eid     1-0.0   2-0.0   3-0.0   4-0.0
1       A481    54      red green blue  0.6
2       A590    24      yellow aqua     3.2
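The contains operator reduces to a substring test on each value; rows for which the test fails are dropped. A pure-Python sketch of the selection:

```python
# Values of data-field 1 from data_22.tsv, keyed by subject ID
field1 = {1: 'A481', 2: 'A590', 3: 'B391', 4: 'D596', 5: 'Z980'}

# Keep only subjects whose value contains the substring 'A'
kept = [eid for eid, value in field1.items() if 'A' in value]
print(kept)  # [1, 2]
```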

By default, FUNPACK will save columns containing non-numeric data to the main output file, just like any other column. However, there are a couple of options you can use to control what FUNPACK does with non-numeric data.

If you only care about numeric columns, you can use the --suppress_non_numerics option (-esn for short) - this tells FUNPACK to discard all columns that are not numeric:

[69]:
fmrib_unpack -nb -esn out.tsv data_22.tsv
cat out.tsv
eid     2-0.0   4-0.0
1       54      0.6
2       24      3.2
3       17      1.4
4       77      7.3
5       26      5.1

You can also tell FUNPACK to save all non-numeric columns to a separate file, using the --write_non_numerics option (-wnn for short):

[70]:
fmrib_unpack -nb -esn -wnn out.tsv data_22.tsv
cat out.tsv
cat out_non_numerics.tsv
eid     2-0.0   4-0.0
1       54      0.6
2       24      3.2
3       17      1.4
4       77      7.3
5       26      5.1
eid     1-0.0   3-0.0
1       A481    red green blue
2       A590    yellow aqua
3       B391    brown black purple
4       D596    green
5       Z980    white red

Dry run

The --dry_run (-d for short) option allows you to see what FUNPACK is going to do - it is useful to perform a dry run before running a large processing job, which could take a long time. For example, if we have a complicated configuration such as the following, we can use the --dry_run option to check that FUNPACK is going to do what we expect:

[71]:
fmrib_unpack                             \
  -nb -d                                 \
  -nv  1   "7,8,9"                       \
  -re  2   "1,2,3" "100,200,300"         \
  -cv  3   "v4 != 20" "25"               \
  -cl  4   "makeNa('< 50')"              \
  -apr all "removeIfSparse(minpres=0.5)" \
  out.tsv data_01.tsv
funpack 4.0.1.dev6+gd8aa574c1 dry run

Input data
  Loaded columns:        11
  Ignored columns:       0
  Unknown columns:       10
  Uncategorised columns: 0
  Loaded variables:      11

Cleaning

  NA Insertion: True
    1: [7.0, 8.0, 9.0]

  Cleaning functions: True
    4: [makeNa[cleaner](< 50)]

  Child value replacement: True
    3: [v4 != 20] -> [25.0]
  Categorical recoding: True
    2: [1.0, 2.0, 3.0] -> [100.0, 200.0, 300.0]

Processing: True
  1: ['all' list([])] -> [removeIfSparse[processor](minpres=0.5)]

Built-in rules

FUNPACK has a large number of hand-crafted rules built in, which are specific to UK Biobank data-fields. These rules are part of the fmrib configuration, which can be used by adding -cfg fmrib to the command-line options.

We can use the --dry_run (-d for short) option, along with some dummy data files which just contain the UK Biobank column names, to get a summary of these rules:

[72]:
fmrib_unpack -cfg fmrib -d out.tsv ukbcols.csv
funpack 4.0.1.dev6+gd8aa574c1 dry run

Input data
  Loaded columns:        25198
  Ignored columns:       0
  Unknown columns:       1
  Uncategorised columns: 1403
  Loaded variables:      7240

Cleaning

  NA Insertion: True
    19: [3.0, 6.0, 7.0]
    21: [-1.0, 4.0]
    23: [6.0, 7.0, 9.0]
    84: [-1.0, -3.0]
    87: [-1.0, -3.0]
    92: [-1.0, -3.0]
    129: [-1.0]
    130: [-1.0]
    400: [0.0]
    670: [-3.0]
    680: [-3.0]
    699: [-1.0, -3.0]
    709: [-1.0, -3.0]
    728: [-1.0, -3.0]
    738: [-1.0, -3.0]
    757: [-1.0, -3.0]
    767: [-1.0, -3.0]
    777: [-1.0, -3.0]
    796: [-1.0, -3.0]
    806: [-1.0, -3.0]
    816: [-1.0, -3.0]
    826: [-1.0, -3.0]
    845: [-1.0, -2.0, -3.0]
    864: [-1.0, -3.0]
    874: [-1.0, -3.0]
    884: [-1.0, -3.0]
    894: [-1.0, -3.0]
    904: [-1.0, -3.0]
    914: [-1.0, -3.0]
    924: [-3.0]
    943: [-1.0, -3.0]
    971: [-1.0, -3.0]
    981: [-1.0, -3.0]
    991: [-1.0, -3.0]
    1001: [-1.0, -3.0]
    1011: [-1.0, -3.0]
    1021: [-1.0, -3.0]
    1031: [-1.0, -3.0]

...
[Long output - 3676 more lines hidden]
...

All of these rules are coded in a set of .cfg and .tsv files which are installed alongside the FUNPACK source code:

[73]:
cfgdir=$(python -c "import funpack; print(funpack.findConfigDir())")
ls -l $cfgdir/
ls -l $cfgdir/fmrib/
total 32
-rw-r--r--. 1 root root  236 Apr 13 12:27 README
drwxr-xr-x. 2 root root 4096 Apr 13 12:26 fmrib
-rw-r--r--. 1 root root 2403 Apr 13 12:26 fmrib.cfg
-rw-r--r--. 1 root root  752 Apr 13 12:26 fmrib_cats.cfg
-rw-r--r--. 1 root root  607 Apr 13 12:26 fmrib_logs.cfg
-rw-r--r--. 1 root root  599 Apr 13 12:26 fmrib_new_release.cfg
-rw-r--r--. 1 root root  168 Apr 13 12:26 fmrib_standard.cfg
-rw-r--r--. 1 root root  411 Apr 13 12:26 local.cfg
total 44
-rw-r--r--. 1 root root 12410 Apr 13 12:26 categories.tsv
-rw-r--r--. 1 root root  3601 Apr 13 12:26 datacodings_navalues.tsv
-rw-r--r--. 1 root root  2595 Apr 13 12:26 datacodings_recoding.tsv
-rw-r--r--. 1 root root    61 Apr 13 12:26 datetime_formatting.tsv
-rw-r--r--. 1 root root  3396 Apr 13 12:26 processing.tsv
-rw-r--r--. 1 root root    60 Apr 13 12:26 variables_clean.tsv
-rw-r--r--. 1 root root  5703 Apr 13 12:26 variables_parentvalues.tsv

The key files are:

  • variables_*.tsv: Child value replacement, and cleaning rules for each data-field.

  • datacodings_*.tsv: NA insertion and recoding rules for data-codings - these are used when rules are not explicitly specified in the variables_*.tsv files, for a variable which uses a given data-coding.

  • processing.tsv: List of all processing functions that are applied, in order.

  • categories.tsv: Data-field categories, for use with the --category/-c option.

Note that these rules are released as a separate package called fmrib-unpack-fmrib-config, but are automatically installed when you install FUNPACK.

If you are not happy with some of the rules defined in these files, you have the following options:

  1. Override them on the command-line, or in a configuration file (see below)

  2. Modify them in place.

  3. Create your own versions of the files, and pass them via the following command-line options:

    • --variable_file (-vf for short)

    • --datacoding_file (-df for short)

    • --type_file (-tf for short)

    • --processing_file (-pf for short)

    • --category_file (-cf for short)

Rules which you provide on the command-line or via the -vf and -df options will override built-in rules for the same data-fields/data-codings.

The data-field and data-coding rules can be stored across multiple files - for example, you may want to write all of the NA insertion rules in one file, and all of the child value replacement rules in another. FUNPACK will merge all of the built-in files and any additionally provided files together, so you do not need to maintain large and unwieldy data-field/data-coding tables.

Using a configuration file

FUNPACK has an extensive command-line interface, but you don’t need to pass all of the settings via the command-line. Instead, you can put them into a file, and give that file to FUNPACK with the --config (-cfg for short) option. You need to use the long-form of each command-line option, without the leading --.

Let’s take our example from the dry run section, and put all of the options into a configuration file:

[74]:
cat <<EOF > config.txt
no_builtins
quiet
overwrite
na_values      1   "7,8,9"
recoding       2   "1,2,3" "100,200,300"
child_values   3   "v4 != 20" "25"
clean          4   "makeNa('< 50')"
append_process all "removeIfSparse(minpres=0.5)"
EOF

cat config.txt
no_builtins
quiet
overwrite
na_values      1   "7,8,9"
recoding       2   "1,2,3" "100,200,300"
child_values   3   "v4 != 20" "25"
clean          4   "makeNa('< 50')"
append_process all "removeIfSparse(minpres=0.5)"

Now we can pass this file to FUNPACK instead of having to pass all of the command line options:

[75]:
fmrib_unpack -d -cfg config.txt out.tsv data_01.tsv
funpack 4.0.1.dev6+gd8aa574c1 dry run

Input data
  Loaded columns:        11
  Ignored columns:       0
  Unknown columns:       10
  Uncategorised columns: 0
  Loaded variables:      11

Cleaning

  NA Insertion: True
    1: [7.0, 8.0, 9.0]

  Cleaning functions: True
    4: [makeNa[cleaner](< 50)]

  Child value replacement: True
    3: [v4 != 20] -> [25.0]
  Categorical recoding: True
    2: [1.0, 2.0, 3.0] -> [100.0, 200.0, 300.0]

Processing: True
  1: ['all' list([])] -> [removeIfSparse[processor](minpres=0.5)]

Working with unknown/uncategorised variables

Future UK Biobank data releases may contain new data-fields that are not present in the built-in data-field table used by FUNPACK, and thus are not recognised as UKB data-fields.

Furthermore, if you are working with the hand-crafted data-field categories from the built-in fmrib configuration, or your own categories, a new data release may contain data-fields which are not included in any category.

To help you identify these new data-fields, FUNPACK has the ability to inform you about data-fields that it does not know about, or that are not categorised.

To demonstrate, let’s create a dummy data-field table which lists all of the data-fields that we do know about - data-fields 1 to 5. We’ll also create a custom category file, which adds data-fields 1 to 3 to a category:

[76]:
echo -e "ID\n1\n2\n3\n4\n5\n"      > custom_variables.tsv
echo -e "ID\tCategory\tVariables"  > custom_categories.tsv
echo -e "1\tmy category\t1:3"     >> custom_categories.tsv
cat custom_variables.tsv
cat custom_categories.tsv
ID
1
2
3
4
5

ID      Category        Variables
1       my category     1:3

Now we can use the --write_unknown_vars option (-wu for short) to generate a summary of the data-fields which FUNPACK did not recognise, or which were not in any category:

Again we are using the -nb (--no_builtins) option, which tells FUNPACK not to load its built-in table of UK Biobank variables.

[77]:
fmrib_unpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -wu out.tsv data_01.tsv
cat out_unknown_vars.txt
name    file    class   exported
6-0.0   {{ dir_prefix }}/data_01.tsv    unknown 1
7-0.0   {{ dir_prefix }}/data_01.tsv    unknown 1
8-0.0   {{ dir_prefix }}/data_01.tsv    unknown 1
9-0.0   {{ dir_prefix }}/data_01.tsv    unknown 1
10-0.0  {{ dir_prefix }}/data_01.tsv    unknown 1
4-0.0   {{ dir_prefix }}/data_01.tsv    uncategorised   1
5-0.0   {{ dir_prefix }}/data_01.tsv    uncategorised   1

This file contains a summary of the columns that were uncategorised or not recognised, and whether they were exported to the output file. If an unknown/uncategorised data-field was not exported (e.g. it was removed during processing by a sparsity or redundancy check), the exported column will contain a 0 for that data-field.

If you are specifically interested in querying or working with these unknown/uncategorised data-fields, you can select them with the automatically-generated unknown/uncategorised categories:

[78]:
fmrib_unpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -c unknown       unknowns.tsv      data_01.tsv
fmrib_unpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -c uncategorised uncategorised.tsv data_01.tsv
echo "Unknown data-fields:"
cat unknowns.tsv
echo "Uncategorised data-fields:"
cat uncategorised.tsv
Unknown data-fields:
eid     6-0.0   7-0.0   8-0.0   9-0.0   10-0.0
1       22      56      65      90      12
2       35      3       65      50      67
3       36      96      62      48      59
4       20      18      72      37      27
5       84      11      28      69      10
6       46      14      60      73      80
7       23      39      68      7       63
8       83      98      36      6       23
9       12      95      73      2       62
10      20      97      57      59      23
Uncategorised data-fields:
eid     4-0.0   5-0.0
1       11      84
2       42      89
3       84      93
4       48      80
5       68      80
6       59      7
7       46      92
8       30      92
9       79      16
10      41      8