Tutorial
Tutorial¶
This tutorial walks you through the following stages of working with UK biobank data:
- Importing UK biobank data and converting to Parquet format.
- Filtering and exporting the dataset from the command line.
- Loading and filtering the dataset from Python.
Importing dataset¶
This tutorial assumes that you have access to UK Biobank data and have downloaded and extracted the data to either comma separated (.CSV) or tab separated (.TSV) format.
Use the biobank import
command to import UK Biobank data.
$ biobank import --help
Usage: biobank import [OPTIONS] FILENAME
Import biobank dataset.
Arguments:
FILENAME [required]
Options:
--dictionary PATH
--force / --no-force [default: no-force]
--help Show this message and exit.
Depending on the size of the dataset, the import command could take several minutes.
$ biobank import ukb12345.csv
loading dictionary from https://biobank.ctsu.ox.ac.uk
---> 100%
importing ukb12345.csv
---> 100%
saving dataset to .biobank/dataset.parquet
---> 100%
Working with data dictionary¶
With Biobank Tools, you can search the data dictionary for fields by ID or by search terms.
If you already know the field ID, you can check whether the field is available in your dataset.
$ biobank fields 20006 20007
FieldID ValueType Units Field
849 20006 Continuous micrometres Interpolated Year when cancer first diagnosed
850 20007 Continuous micrometres Interpolated Age of participant when cancer fi...
You can get the same results using regular expressions as well.
$ biobank fields "2000[67]"
FieldID ValueType Units Field
849 20006 Continuous micrometres Interpolated Year when cancer first diagnosed
850 20007 Continuous micrometres Interpolated Age of participant when cancer fi...
If you don't know the field ID, you can use a search term to find the field IDs.
$ biobank fields --search cancer
FieldID ValueType Units Field
849 20006 Continuous micrometres Interpolated Year when cancer first diagnosed
850 20007 Continuous micrometres Interpolated Age of participant when cancer fi...
851 20008 Continuous micrometres Interpolated Year when non-cancer illness firs...
852 20009 Continuous micrometres Interpolated Age of participant when non-cance...
Finally, you can combine field ID and search terms to further narrow the results.
$ biobank fields 200.. --search age
FieldID ValueType Units Field
850 20007 Continuous micrometres Interpolated Age of participant when cancer fi...
852 20009 Continuous micrometres Interpolated Age of participant when non-cance...
854 20011 Continuous micrometres Interpolated Age of participant when operation...
Excluding withdrawn participants¶
When you receive a list of withdrawn participants, you can exclude them from the dataset permanently.
$ biobank exclude data/w12345_*.csv
removing 20 records from 2 files:
- data/w12345_20220101.csv
- data/w12345_20220201.csv
saving dataset to .biobank/dataset.parquet
---> 100%
Extracting data¶
There are two ways to work with dataset created by Biobank Tools.
- You can export the data to a .CSV file
- You can load the dataset directly in Python
Exporting to .CSV¶
With biobank select
command, you can export a subset of the dataset that
only contains the fields you're interested in.
$ biobank select "2000[67]" --output data/ukb_cancer.csv
---> 100%
writing data/ukb_cancer.csv
If you omit the --output
option, the biobank select
command will output the
results to the screen instead.
$ biobank select "2000[67]"
[########################################] | 100% Completed | 9.8s
20006-0.0 20006-0.1 20006-0.2 20006-0.3 20006-0.4 20006-0.5 20006-1.0 20006-1.1 ... 20007-1.3 20007-2.0 20007-2.1 20007-2.2 20007-2.3 20007-2.4 20007-3.0 20007-3.1
eid ...
1000016 1951.771450 1959.965254 1959.758862 2108.331238 1793.185997 2052.090334 ...
1000048 1927.954080 1780.621043 1768.346392 2166.976559 1092.823152 1958.248591 ... 57.135394 52.331757 49.393116 64.879196 68.577423 51.484781 66.384343
1000057 1783.957575 2055.652053 2055.706534 1854.930873 1398.633216 1942.852454 ...
1000059 2036.818137 2222.614293 2239.943216 1932.932354 2801.628978 1920.073145 ...
1000063 2037.248163 2142.236323 2146.492165 2092.512233 1534.692268 1844.065170 ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9999968 2175.387378 2169.194698 2177.700380 1996.962135 1942.853501 1916.324173 ...
9999979 1994.938137 1830.922253 1823.217556 1993.885899 1436.625129 1899.035208 ...
9999980 1958.735897 2086.810754 2103.205981 2000.599709 3291.533833 1963.010565 ...
9999986 2002.481976 2039.945909 2038.244240 2068.425323 1250.346010 2066.319885 ...
9999992 1977.196447 1949.820874 1951.181053 1881.749111 2033.254502 1808.468535 ...
[599995 rows x 34 columns]
Loading dataset in Python¶
You can load the dataset in Python and apply the same field filters that you
would at the command line. The select
method returns the data as a Pandas
DataFrame.
from biobank import Dataset
dataset = Dataset()
cancer_data = dataset.select(fields=["2000[67]"])
print(cancer_data)
20006-0.0 20006-0.1 20006-0.2 20006-0.3 20006-0.4 20006-0.5 20006-1.0 20006-1.1 ... 20007-1.3 20007-2.0 20007-2.1 20007-2.2 20007-2.3 20007-2.4 20007-3.0 20007-3.1
eid ...
1000016 1951.771450 1959.965254 1959.758862 2108.331238 1793.185997 2052.090334 ...
1000048 1927.954080 1780.621043 1768.346392 2166.976559 1092.823152 1958.248591 ... 57.135394 52.331757 49.393116 64.879196 68.577423 51.484781 66.384343
1000057 1783.957575 2055.652053 2055.706534 1854.930873 1398.633216 1942.852454 ...
1000059 2036.818137 2222.614293 2239.943216 1932.932354 2801.628978 1920.073145 ...
1000063 2037.248163 2142.236323 2146.492165 2092.512233 1534.692268 1844.065170 ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9999968 2175.387378 2169.194698 2177.700380 1996.962135 1942.853501 1916.324173 ...
9999979 1994.938137 1830.922253 1823.217556 1993.885899 1436.625129 1899.035208 ...
9999980 1958.735897 2086.810754 2103.205981 2000.599709 3291.533833 1963.010565 ...
9999986 2002.481976 2039.945909 2038.244240 2068.425323 1250.346010 2066.319885 ...
9999992 1977.196447 1949.820874 1951.181053 1881.749111 2033.254502 1808.468535 ...
[599995 rows x 34 columns]