Quickstart

If you want to get up and running quickly with creating an ingest package and running it, this quickstart guide is the best place to start. It's OK if you do not yet understand what each step is doing. The tutorial section walks through the exact same demonstration but includes detailed explanations for each step.

Generate a new ingest package

Each study that needs to be ingested will have a single outer directory and one or more inner directories, each containing the configuration files needed by the ingest pipeline to extract, transform, and load parts of the study into a target data store or service. These inner directories are known as ingest packages.

Note

Each study gets its own outer directory because, for practical and logistical reasons, completing an entire study often requires multiple ingest packages.

The first thing to do after installing the ingest library CLI is to create an outer directory for a study named my_study and inside it a new ingest package named ingest_package_1:

$ kidsfirst new --dest_dir=my_study/ingest_package_1

Your new ingest package is extremely basic. It has one test source data file and the Python modules needed to extract, clean, and transform the data for the target service. If you look inside it, it will look like this:

my_study/
└── ingest_package_1/
    ├── ingest_package_config.py
    ├── transform_module.py
    ├── data/
    │   └── clinical.tsv
    ├── extract_configs/
    │   └── extract_config.py
    └── tests/
        ├── conftest.py
        └── test_custom_counts.py

And the included test data looks like this:

my_study/ingest_package_1/data/clinical.tsv

family  subject  sample  analyte  diagnosis  gender
f1      PID001   SP001A  dna (1)  flu        Female
f1      PID001   SP001B  rna (2)  cold       Female
f1      PID002   SP002A  dna (1)  flu        Female
f1      PID002   SP002B  rna (2)  cold       Female
f1      PID003   SP003A  dna (1)  flu        Male
f1      PID003   SP003B  rna (2)  cold       Male
f2      PID004   SP004A  dna (1)  flu        Male
f2      PID004   SP004B  rna (2)  cold       Male
f2      PID005   SP005A  dna (1)  flu        Female
f2      PID005   SP005B  rna (2)  flu        Female
f3      PID006   SP006   rna (2)  flu        Male
f3      PID007   SP007   rna (2)  flu        Male
f3      PID008   SP008A  dna (1)  flu        Female
f3      PID008   SP008B  rna (2)  flu        Female
f3      PID009   SP009A  dna (1)  flu        Male
f3      PID009   SP009B  rna (2)  flu        Male

Test

We can make sure everything works with the test command:

$ kidsfirst test my_study/ingest_package_1 --no_validate

The logs on your screen should indicate a successful unvalidated test run.

$ kidsfirst test my_study/ingest_package_1 --no_validate
2020-11-27 11:39:26,161 - DataIngestPipeline - Thread: MainThread - WARNING - Ingest will run with validation disabled!
...
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - Ingest skipped validation!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - ✅ Ingest pipeline completed execution!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - END data ingestion

If you scroll up through the log output, you should see that it pretends to load families, participants, diagnoses, and biospecimens from clinical.tsv.

Add another extract config

We're going to create a second extract config to extract data from the following source data file, which is different from the file you already have and is hosted remotely:

family_and_phenotype.tsv

[ignore]  [ignore]     [ignore]  [ignore]  [ignore]  [ignore]        [ignore]   [ignore]   [ignore]  [ignore]   [ignore]
[ignore]  participant  mother    father    gender    specimens       age (hrs)  CLEFT_EGO  CLEFT_ID  age (hrs)  EXTRA_EARDRUM
[ignore]  PID001       2         3         F         SP001A,SP001B   4          TRUE       FALSE     4          FALSE
[ignore]  PID002                                     SP002A; SP002B  435        TRUE       FALSE     435        FALSE
[ignore]  PID003                                     SP003A;SP003B   34         TRUE       FALSE     34         FALSE
[ignore]  PID004       5         6         M         SP004A; SP004B  4          TRUE       TRUE      4          FALSE
[ignore]  PID005                                     SP005A, SP005B  345        TRUE       TRUE      34         FALSE
[ignore]  PID006                                     SP006           34         TRUE       TRUE      43545      FALSE
[ignore]  PID007       8         9         M         SP007           34         TRUE       FALSE     5          TRUE
[ignore]  PID008                                     SP008A,SP008B   43545      TRUE       TRUE      52         TRUE
[ignore]  PID009                                     SP009A,SP009B   5          FALSE      TRUE      25         TRUE
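Notice that the specimens column mixes comma and semicolon delimiters, with inconsistent spacing. The extract config you are about to download splits on both characters. Here is a standalone sketch of that splitting logic (split_specimens is a hypothetical helper, and the whitespace stripping is our own addition for illustration):

```python
import re

# Split a specimens cell on either delimiter and strip stray whitespace,
# so "SP001A,SP001B" and "SP004A; SP004B" both normalize cleanly.
def split_specimens(cell):
    return [s.strip() for s in re.split("[,;]", cell)]

print(split_specimens("SP001A,SP001B"))   # ['SP001A', 'SP001B']
print(split_specimens("SP004A; SP004B"))  # ['SP004A', 'SP004B']
```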

Download the extract configuration for this new data, family_and_phenotype.py, and put it in my_study/ingest_package_1/extract_configs like this:

my_study/
└── ingest_package_1/
    └── extract_configs/
       └── family_and_phenotype.py

The full tutorial has a detailed explanation of what this file does, but for now the file should look like this:

my_study/ingest_package_1/extract_configs/family_and_phenotype.py
import re
from kf_lib_data_ingest.common import constants
from kf_lib_data_ingest.common.concept_schema import CONCEPT
from kf_lib_data_ingest.common.pandas_utils import Split
from kf_lib_data_ingest.etl.extract.operations import *

source_data_url = "https://raw.githubusercontent.com/kids-first/kf-lib-data-ingest/master/docs/data/family_and_phenotype.tsv"

source_data_read_params = {
    "header": 1,
    "usecols": lambda x: x != "[ignore]"
}


def observed_yes_no(x):
    if isinstance(x, str):
        x = x.lower()
    if x in {"true", "yes", 1}:
        return constants.PHENOTYPE.OBSERVED.YES
    elif x in {"false", "no", 0}:
        return constants.PHENOTYPE.OBSERVED.NO
    elif x in {"", None}:
        return None


operations = [
    value_map(
        in_col="participant",
        m={
            r"PID(\d+)": lambda x: int(x),  # strip PID and 0-padding
        },
        out_col=CONCEPT.PARTICIPANT.ID
    ),
    keep_map(
        in_col="mother",
        out_col=CONCEPT.PARTICIPANT.MOTHER_ID
    ),
    keep_map(
        in_col="father",
        out_col=CONCEPT.PARTICIPANT.FATHER_ID
    ),
    value_map(
        in_col="gender",
        # Don't worry about mother/father gender here.
        # We can create them in a later phase.
        m={
            "F": constants.GENDER.FEMALE,
            "M": constants.GENDER.MALE
        },
        out_col=CONCEPT.PARTICIPANT.GENDER
    ),
    value_map(
        in_col="specimens",
        m=lambda x: Split(re.split("[,;]", x)),
        out_col=CONCEPT.BIOSPECIMEN.ID
    ),
    [
        value_map(
            in_col=6,  # age (hrs) (first)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "CLEFT_EGO": "Cleft ego",
                "CLEFT_ID": "Cleft id"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ],
    [
        value_map(
            in_col=9,  # age (hrs) (second)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "EXTRA_EARDRUM": "Extra eardrum"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ]
]
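The source_data_read_params above are passed through to the underlying pandas-style file reader. As a standalone sketch (plain pandas, no ingest library) of what header=1 and the usecols callable accomplish on a file shaped like family_and_phenotype.tsv, using a tiny in-memory stand-in:

```python
import io
import pandas as pd

# A miniature stand-in for family_and_phenotype.tsv: the first row is all
# "[ignore]", the second row holds the real column names, and the first
# column is itself named "[ignore]".
tsv = (
    "[ignore]\t[ignore]\t[ignore]\n"
    "[ignore]\tparticipant\tgender\n"
    "x\tPID001\tF\n"
    "x\tPID004\tM\n"
)

# header=1 tells pandas to use row index 1 (the second row) as the header,
# discarding the row above it; the usecols callable then drops every
# column whose name is "[ignore]".
df = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=1,
    usecols=lambda c: c != "[ignore]",
)
print(list(df.columns))  # ['participant', 'gender']
print(len(df))           # 2
```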

Modify the Transform module

Next we want to merge our extracted tables together to properly connect the data needed for generating complete target entities.

For the quickstart, this will be performed by my_study/ingest_package_1/transform_module.py which currently looks like this:

my_study/ingest_package_1/transform_module.py
"""
Auto-generated transform module

Replace the contents of transform_function with your own code

See documentation at
https://kids-first.github.io/kf-lib-data-ingest/ for information on
implementing transform_function.
"""

from kf_lib_data_ingest.common.concept_schema import CONCEPT  # noqa F401

# Use these merge funcs, not pandas.merge
from kf_lib_data_ingest.common.pandas_utils import (  # noqa F401
    merge_wo_duplicates,
    outer_merge,
)
from kf_lib_data_ingest.config import DEFAULT_KEY


def transform_function(mapped_df_dict):
    """
    Merge DataFrames in mapped_df_dict into 1 DataFrame if possible.

    Return a dict that looks like this:

    {
        DEFAULT_KEY: all_merged_data_df
    }

    If not possible to merge all DataFrames into a single DataFrame then
    you can return a dict that looks something like this:

    {
        '<name of target concept>': df_for_<target_concept>,
        DEFAULT_KEY: all_merged_data_df
    }

    Target concept instances will be built from the default DataFrame unless
    another DataFrame is explicitly provided via a key, value pair in the
    output dict. The key must match the name of an existing target concept.
    The value will be the DataFrame to use when building instances of the
    target concept.

    A typical example would be:

    {
        'family_relationship': family_relationship_df,
        'default': all_merged_data_df
    }

    """
    df = mapped_df_dict["extract_config.py"]

    # df = outer_merge(
    #     mapped_df_dict['extract_config.py'],
    #     mapped_df_dict['family_and_phenotype.py'],
    #     on=CONCEPT.BIOSPECIMEN.ID,
    #     with_merge_detail_dfs=False
    # )

    return {DEFAULT_KEY: df}

Find the line near the bottom of that file that says

df = mapped_df_dict["extract_config.py"]

and replace that line with

df = outer_merge(
    mapped_df_dict['extract_config.py'],
    mapped_df_dict['family_and_phenotype.py'],
    on=CONCEPT.BIOSPECIMEN.ID,
    with_merge_detail_dfs=False
)

You can also just uncomment the block immediately below it, which contains the same new code in a Python comment. Either way, df is now defined as the combination of the two data files, joined together according to the values in their biospecimen identifier columns.
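Conceptually, this outer merge behaves like a pandas outer join on the shared biospecimen ID column. Here is a standalone sketch with plain pandas and made-up column names (the real pipeline uses CONCEPT attribute names and the library's outer_merge helper, which adds extra bookkeeping on top of the join):

```python
import pandas as pd

# Toy versions of the two extracted tables, keyed on a shared ID column.
clinical = pd.DataFrame({
    "BIOSPECIMEN|ID": ["SP001A", "SP001B", "SP002A"],
    "PARTICIPANT|GENDER": ["Female", "Female", "Female"],
})
phenotypes = pd.DataFrame({
    "BIOSPECIMEN|ID": ["SP001A", "SP001B", "SP003A"],
    "PHENOTYPE|NAME": ["Cleft ego", "Cleft ego", "Extra eardrum"],
})

# An outer join keeps every row from both sides: matched rows are combined,
# and unmatched rows from either side survive with NaN in the other side's
# columns.
merged = pd.merge(clinical, phenotypes, on="BIOSPECIMEN|ID", how="outer")
print(len(merged))  # 4: two matched rows plus one unmatched from each side
```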

Test Again

Then again run

$ kidsfirst test my_study/ingest_package_1 --no_validate

Now you should see that it also pretends to load phenotypes from the new data in addition to the rest of the information.