Working with data: FHIRflat

This Jupyter notebook shows how to load a sample FHIRflat folder, compute simple statistics, and draw plots. You can view a live version of this notebook on Google Colab or MyBinder by clicking the ‘Launch’ button (rocket icon) in the top right corner.

Note

On Google Colab, you will need to install the polyflame package first. You can use pip to install the package by typing into an empty code cell:

!pip install git+https://github.com/globaldothealth/polyflame

First we import the necessary functions:

import pandas as pd
import polyflame.samples
from polyflame import load_taxonomy, plot, plot_unpacked, with_readable_terms
from polyflame.fhirflat import (
    use_source,
    list_parts,
    read_part,
    condition_proportion,
    condition_upset,
    age_pyramid
)

Loading a source

Then we load a source using the polyflame.fhirflat.use_source function. A checksum must be specified so that the integrity of the FHIRflat data can be verified, which keeps outputs reproducible.

source = use_source(polyflame.samples.fhirflat, checksum=polyflame.samples.checksum_fhirflat)
tx = load_taxonomy("fhirflat-isaric3")
source
{'N': 10,
 'checksum': '03cc8e28d97a6a3ab20926d7c3f891f14e119eb882c6e8d3deb07e1b79eed089',
 'checksum_file': 'sha256sums.txt',
 'path': PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/polyflame/envs/latest/lib/python3.11/site-packages/polyflame/samples/sample-fhirflat')}
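To make the integrity check concrete, here is a minimal sketch of how a folder could be verified against a checksum file. It is not PolyFLAME's implementation: the helper names sha256_of_file and verify_folder are hypothetical, and the sketch assumes that the 'checksum' value is the SHA-256 of sha256sums.txt and that this file lists one "digest  filename" entry per line, as the sha256sum tool writes it.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_folder(folder: Path, expected_checksum: str,
                  manifest_name: str = "sha256sums.txt") -> bool:
    """Check the manifest's own hash, then every file listed in it.

    Assumption: the manifest uses the sha256sum layout, one
    "<hex digest>  <file name>" entry per line.
    """
    manifest = folder / manifest_name
    if sha256_of_file(manifest) != expected_checksum:
        return False
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        digest, name = line.split(maxsplit=1)
        # sha256sum prefixes the name with "*" in binary mode
        if sha256_of_file(folder / name.strip().lstrip("*")) != digest:
            return False
    return True
```

With this layout, pinning a single checksum is enough to detect a change to any file in the folder, because changing a data file invalidates its manifest entry and changing the manifest invalidates the pinned checksum.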

A source is a Python dictionary with pre-specified keys that tells data processing and visualization functions where to get information from. Some source types, such as FHIRflat, also have parts, which can be read in separately – in the case of FHIRflat, parts correspond to FHIR resources, with one parquet file for each resource. A list of parts for a source can be obtained using the polyflame.fhirflat.list_parts function:

list_parts(source)
['condition', 'encounter', 'patient']
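As a rough illustration of what parts are, the sketch below lists the parquet files in a folder by name. The function parquet_parts is hypothetical, and the sketch assumes that each resource is stored as a file named after it (for example patient.parquet) directly inside the source path; the actual list_parts function may work differently.

```python
from pathlib import Path


def parquet_parts(folder: Path) -> list[str]:
    """Return the sorted stems of the parquet files directly inside folder."""
    return sorted(p.stem for p in folder.glob("*.parquet"))
```

Under that assumption, calling parquet_parts(source["path"]) would return the same kind of list as shown above.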

We can read parts as a DataFrame using the polyflame.fhirflat.read_part function:

read_part(source, "patient")
   extension.age.code              extension.age.unit  extension.age.value  extension.birthSex.code             id
0  [https://unitsofmeasure.org|a]  Years               49                   [http://snomed.info/sct|248152002]  4308f3e1-76e9-47ee-920e-a06fb472b9cc
1  [https://unitsofmeasure.org|a]  Years               19                   [http://snomed.info/sct|248152002]  7ef87588-55c7-4f48-bbf2-896e1a837454
2  [https://unitsofmeasure.org|a]  Years               7                    [http://snomed.info/sct|248153007]  9ded1a45-69b7-4200-9d4d-34f9996cbea6
3  [https://unitsofmeasure.org|a]  Years               97                   [http://snomed.info/sct|248153007]  b4a9a271-2bc0-42b9-ab7f-f99c7a66b983
4  [https://unitsofmeasure.org|a]  Years               90                   [http://snomed.info/sct|248153007]  9a5be5ad-637a-44cb-a525-80abd55e9961
5  [https://unitsofmeasure.org|a]  Years               81                   [http://snomed.info/sct|248152002]  6f3681fb-dd4f-49a7-9d1a-ccb8fb7940f1
6  [https://unitsofmeasure.org|a]  Years               41                   [http://snomed.info/sct|248152002]  540f5e28-6fc6-4625-9c4a-f66a0fefb7aa
7  [https://unitsofmeasure.org|a]  Years               60                   [http://snomed.info/sct|248153007]  cfd5ad0f-cace-4ecd-891d-2b4ab8a10245
8  [https://unitsofmeasure.org|a]  Years               5                    [http://snomed.info/sct|248153007]  0f0836fe-8e72-4e2b-8869-2807b3599beb
9  [https://unitsofmeasure.org|a]  Years               34                   [http://snomed.info/sct|248153007]  6e86a167-c523-48bc-8af0-8b7d004cfc01

Columns in FHIRflat resource parquet files are named after the nested FHIR attribute they hold, such as extension.birthSex.code. These dotted names can be cumbersome to work with, which is why read_part() accepts a mapping from original column names to new ones. As the output shows, only the mapped columns are returned, under their new names:

patient = read_part(
    source, "patient",
    {
        "extension.birthSex.code": "gender",
        "extension.age.value": "age",
        "extension.age.code": "age_unit",
        "id": "subject",
    }
)
patient
   gender                              age  age_unit                        subject
0  [http://snomed.info/sct|248152002]  49   [https://unitsofmeasure.org|a]  4308f3e1-76e9-47ee-920e-a06fb472b9cc
1  [http://snomed.info/sct|248152002]  19   [https://unitsofmeasure.org|a]  7ef87588-55c7-4f48-bbf2-896e1a837454
2  [http://snomed.info/sct|248153007]  7    [https://unitsofmeasure.org|a]  9ded1a45-69b7-4200-9d4d-34f9996cbea6
3  [http://snomed.info/sct|248153007]  97   [https://unitsofmeasure.org|a]  b4a9a271-2bc0-42b9-ab7f-f99c7a66b983
4  [http://snomed.info/sct|248153007]  90   [https://unitsofmeasure.org|a]  9a5be5ad-637a-44cb-a525-80abd55e9961
5  [http://snomed.info/sct|248152002]  81   [https://unitsofmeasure.org|a]  6f3681fb-dd4f-49a7-9d1a-ccb8fb7940f1
6  [http://snomed.info/sct|248152002]  41   [https://unitsofmeasure.org|a]  540f5e28-6fc6-4625-9c4a-f66a0fefb7aa
7  [http://snomed.info/sct|248153007]  60   [https://unitsofmeasure.org|a]  cfd5ad0f-cace-4ecd-891d-2b4ab8a10245
8  [http://snomed.info/sct|248153007]  5    [https://unitsofmeasure.org|a]  0f0836fe-8e72-4e2b-8869-2807b3599beb
9  [http://snomed.info/sct|248153007]  34   [https://unitsofmeasure.org|a]  6e86a167-c523-48bc-8af0-8b7d004cfc01
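For readers more familiar with plain pandas, the column mapping behaves like a column selection followed by a rename. The sketch below reproduces that behaviour on a two-row frame; select_and_rename is a hypothetical helper written for illustration, not part of PolyFLAME.

```python
import pandas as pd


def select_and_rename(df: pd.DataFrame, mapping: dict[str, str]) -> pd.DataFrame:
    """Keep only the mapped columns, in mapping order, under their new names."""
    return df[list(mapping)].rename(columns=mapping)


raw = pd.DataFrame(
    {
        "extension.age.unit": ["Years", "Years"],
        "extension.age.value": [49, 19],
        "id": [
            "4308f3e1-76e9-47ee-920e-a06fb472b9cc",
            "7ef87588-55c7-4f48-bbf2-896e1a837454",
        ],
    }
)
renamed = select_and_rename(raw, {"extension.age.value": "age", "id": "subject"})
print(list(renamed.columns))  # → ['age', 'subject']
```

Columns that are not in the mapping, such as extension.age.unit here, are dropped from the result.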

This is more readable, but the field values are still coded terms from standard terminologies such as SNOMED and LOINC. While this is good for reproducibility and precision, readable names are easier for us to work with. A helper function, with_readable_terms (imported from polyflame above), maps clinical coded terms to readable terms given a taxonomy file. A taxonomy is a TOML file containing these mappings, with one section for each type of variable:

[outcome]
"https://snomed.info/sct|371827001" = "alive"
"https://snomed.info/sct|32485007" = "censored"    # still hospitalised
"https://snomed.info/sct|306685000" = "censored"   # transferred
"https://snomed.info/sct|419099009" = "death"
"https://snomed.info/sct|306237005" = "censored"   # palliative care
"https://snomed.info/sct|225928004" = "discharged"

[gender]
"http://snomed.info/sct|248153007" = "male"
"http://snomed.info/sct|248152002" = "female"

[presenceAbsence]
"https://snomed.info/sct|373066003" = true
"https://snomed.info/sct|373067005" = false

PolyFLAME ships with a small taxonomy file to work with sample data. In actual use cases, you would have to provide this file yourself.

with_readable_terms(patient, tx, [{"term_column": "gender"}])
   gender  age  age_unit                        subject
0  female  49   [https://unitsofmeasure.org|a]  4308f3e1-76e9-47ee-920e-a06fb472b9cc
1  female  19   [https://unitsofmeasure.org|a]  7ef87588-55c7-4f48-bbf2-896e1a837454
2  male    7    [https://unitsofmeasure.org|a]  9ded1a45-69b7-4200-9d4d-34f9996cbea6
3  male    97   [https://unitsofmeasure.org|a]  b4a9a271-2bc0-42b9-ab7f-f99c7a66b983
4  male    90   [https://unitsofmeasure.org|a]  9a5be5ad-637a-44cb-a525-80abd55e9961
5  female  81   [https://unitsofmeasure.org|a]  6f3681fb-dd4f-49a7-9d1a-ccb8fb7940f1
6  female  41   [https://unitsofmeasure.org|a]  540f5e28-6fc6-4625-9c4a-f66a0fefb7aa
7  male    60   [https://unitsofmeasure.org|a]  cfd5ad0f-cace-4ecd-891d-2b4ab8a10245
8  male    5    [https://unitsofmeasure.org|a]  0f0836fe-8e72-4e2b-8869-2807b3599beb
9  male    34   [https://unitsofmeasure.org|a]  6e86a167-c523-48bc-8af0-8b7d004cfc01
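The lookup itself is simple, and a simplified version can be sketched in a few lines of pandas. This is not how with_readable_terms is implemented: the function readable below is hypothetical, the taxonomy is written as a plain dictionary mirroring the [gender] section, and the sketch assumes that each cell holds a list containing a single "system|code" string, as the bracketed output above suggests.

```python
import pandas as pd

# Mirrors the [gender] section of the taxonomy shown above.
taxonomy = {
    "gender": {
        "http://snomed.info/sct|248153007": "male",
        "http://snomed.info/sct|248152002": "female",
    }
}


def readable(df, taxonomy, term_column, section=None):
    """Return a copy of df with coded terms in term_column replaced by labels.

    Assumptions: each cell holds a list with a single "system|code" string,
    and the taxonomy section defaults to the column name.
    """
    lookup = taxonomy[section or term_column]
    out = df.copy()
    # Codes missing from the taxonomy become None rather than raising.
    out[term_column] = out[term_column].map(lambda codes: lookup.get(codes[0]))
    return out


coded = pd.DataFrame(
    {
        "gender": [
            ["http://snomed.info/sct|248152002"],
            ["http://snomed.info/sct|248153007"],
        ],
        "age": [49, 7],
    }
)
print(readable(coded, taxonomy, "gender")["gender"].tolist())  # → ['female', 'male']
```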

Most standard analyses, such as those described in the next section, shouldn’t require you to perform these transformations yourself, as they are handled by the FHIRflat adapter. The transformations are useful when you want to develop your own analyses using FHIRflat data.

Analysis

Once we have a source, we can start looking at standard analyses, such as the proportion of patients having a particular condition:

plot(condition_proportion(source, tx))
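To show the kind of calculation behind such a plot, here is a small pandas sketch that computes the share of patients with each condition. The records and the column names subject, condition and present are made up for illustration; they are not the sample data, and condition_proportion may compute its result differently.

```python
import pandas as pd

# Made-up records for illustration only; not taken from the sample data.
conditions = pd.DataFrame(
    {
        "subject": ["p1", "p1", "p2", "p3", "p3"],
        "condition": ["fever", "cough", "fever", "cough", "fever"],
        "present": [True, False, True, True, True],
    }
)
n_patients = conditions["subject"].nunique()

# Count distinct patients with each condition present, then divide by
# the total number of patients.
proportion = (
    conditions[conditions["present"]]
    .groupby("condition")["subject"]
    .nunique()
    .div(n_patients)
    .sort_values(ascending=False)
)
print(proportion)
```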

Or, an UpSet plot showing top conditions and their co-occurrence:

plot(condition_upset(source))

We can also look at the age pyramid, grouped by outcome type:

plot(age_pyramid(source))
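The counts behind an age pyramid can be sketched with pandas by binning ages and counting per group. The example below uses the ages and sexes from the sample patient table above; the 20-year bands are an arbitrary choice, and the grouping column is sex only because outcome is not part of that table. This illustrates the idea and is not the age_pyramid implementation.

```python
import pandas as pd

# Ages and sexes copied from the sample patient table above.
patients = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male", "male",
                   "female", "female", "male", "male", "male"],
        "age": [49, 19, 7, 97, 90, 81, 41, 60, 5, 34],
    }
)

# 20-year bands, closed on the left: [0, 20), [20, 40), ...
bins = [0, 20, 40, 60, 80, 120]
labels = ["0-19", "20-39", "40-59", "60-79", "80+"]
patients["age_group"] = pd.cut(patients["age"], bins=bins, labels=labels, right=False)

# One row per age band, one column per group, cells are patient counts.
pyramid = (
    patients.groupby(["age_group", "gender"], observed=False)
    .size()
    .unstack(fill_value=0)
)
print(pyramid)
```

Any other categorical column, such as an outcome, could replace gender in the groupby call.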

The examples above use the standard FHIRflat analyses, but the plotting functions accept any dataframe that has the shape a given plot type expects. Here we use the plot_unpacked() function, which takes a dataframe directly instead of expecting it as part of a dictionary, as plot() does. For example, a hypothetical UpSet plot showing how often movie genres intersect:

df = pd.DataFrame({'crime': [1, 0, 1], 'fantasy': [0, 1, 1], 'drama': [1, 0, 0]})
df
   crime  fantasy  drama
0  1      0        1
1  0      1        0
2  1      1        0
plot_unpacked(df, "upset")
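An UpSet plot summarises how many rows share each combination of indicator values. As a sketch of that underlying count (not necessarily how plot_unpacked computes it), the intersections for the dataframe above can be tallied with a groupby over all columns:

```python
import pandas as pd

df = pd.DataFrame({"crime": [1, 0, 1], "fantasy": [0, 1, 1], "drama": [1, 0, 0]})

# Count rows per distinct combination of indicator values.
intersections = df.groupby(list(df.columns)).size().sort_values(ascending=False)

for membership, count in intersections.items():
    # membership is a tuple of 0/1 flags in column order
    genres = [name for name, flag in zip(df.columns, membership) if flag]
    print(" & ".join(genres), count)
```

Each of the three rows here has a different genre combination, so every intersection has a count of one.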

Because plot_unpacked() is a generic function, PolyFLAME is easy to extend to other data source types, such as REDCap, or to your own source.

The API reference contains the full list of analyses that this adapter supports.