Quickstart

If you want to get up and running quickly with creating an ingest package and running it, this quickstart guide is the best place to start. It's OK if you do not yet understand what each step is doing. The tutorial section walks through the exact same demonstration but includes detailed explanations for each step.

Generate a new ingest package

Each study that needs to be ingested will have a single outer directory and one or more inner directories, each containing the configuration files needed by the ingest pipeline to extract, transform, and load parts of the study into a target data store or service. These inner directories are known as ingest packages.

Note

Each study gets its own outer directory because, for practical and logistical reasons, completing an entire study often requires multiple ingest packages.

The first thing to do after installing the ingest library CLI is to create an outer directory for a study named my_study and inside it a new ingest package named ingest_package_1:

$ kidsfirst new --dest_dir=my_study/ingest_package_1

Your new ingest package is extremely basic. It has one test source data file and the Python modules needed to extract, clean, and transform the data for the target service. If you look inside it, it will look like this:

my_study/
└── ingest_package_1/
    ├── ingest_package_config.py
    ├── transform_module.py
    ├── data/
    │   └── clinical.tsv
    ├── extract_configs/
    │   └── extract_config.py
    └── tests/
        ├── conftest.py
        └── test_custom_counts.py

And the included test data looks like this:

my_study/ingest_package_1/data/clinical.tsv

family  subject  sample  analyte  diagnosis  gender
f1      PID001   SP001A  dna (1)  flu        Female
f1      PID001   SP001B  rna (2)  cold       Female
f1      PID002   SP002A  dna (1)  flu        Female
f1      PID002   SP002B  rna (2)  cold       Female
f1      PID003   SP003A  dna (1)  flu        Male
f1      PID003   SP003B  rna (2)  cold       Male
f2      PID004   SP004A  dna (1)  flu        Male
f2      PID004   SP004B  rna (2)  cold       Male
f2      PID005   SP005A  dna (1)  flu        Female
f2      PID005   SP005B  rna (2)  flu        Female
f3      PID006   SP006   rna (2)  flu        Male
f3      PID007   SP007   rna (2)  flu        Male
f3      PID008   SP008A  dna (1)  flu        Female
f3      PID008   SP008B  rna (2)  flu        Female
f3      PID009   SP009A  dna (1)  flu        Male
f3      PID009   SP009B  rna (2)  flu        Male

Test

We can make sure everything works with the test command:

$ kidsfirst test my_study/ingest_package_1 --no_validate

The logs on your screen should indicate a successful unvalidated test run.

$ kidsfirst test my_study/ingest_package_1 --no_validate
2020-11-27 11:39:26,161 - DataIngestPipeline - Thread: MainThread - WARNING - Ingest will run with validation disabled!
...
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - Ingest skipped validation!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - ✅ Ingest pipeline completed execution!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - END data ingestion

If you scroll up through the log output, you should see that it pretends to load families, participants, diagnoses, and biospecimens from clinical.tsv.

Add another extract config

We're going to create a second extract config to extract data from the following source data file, which is different from the file you already have and is hosted remotely:

family_and_phenotype.tsv

[ignore]  [ignore]     [ignore]  [ignore]  [ignore]  [ignore]        [ignore]   [ignore]   [ignore]  [ignore]   [ignore]
[ignore]  participant  mother    father    gender    specimens       age (hrs)  CLEFT_EGO  CLEFT_ID  age (hrs)  EXTRA_EARDRUM
[ignore]  PID001       2         3         F         SP001A,SP001B   4          TRUE       FALSE     4          FALSE
[ignore]  PID002                                     SP002A; SP002B  435        TRUE       FALSE     435        FALSE
[ignore]  PID003                                     SP003A;SP003B   34         TRUE       FALSE     34         FALSE
[ignore]  PID004       5         6         M         SP004A; SP004B  4          TRUE       TRUE      4          FALSE
[ignore]  PID005                                     SP005A, SP005B  345        TRUE       TRUE      34         FALSE
[ignore]  PID006                                     SP006           34         TRUE       TRUE      43545      FALSE
[ignore]  PID007       8         9         M         SP007           34         TRUE       FALSE     5          TRUE
[ignore]  PID008                                     SP008A,SP008B   43545      TRUE       TRUE      52         TRUE
[ignore]  PID009                                     SP009A,SP009B   5          FALSE      TRUE      25         TRUE
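Notice that the specimens column mixes comma and semicolon delimiters, with inconsistent spacing. The extract config you are about to download splits on both characters. Here is a standalone sketch of that splitting logic (split_specimens is a hypothetical helper, and the whitespace stripping is our own addition for illustration):

```python
import re

# Split a specimens cell on either delimiter and strip stray whitespace,
# so "SP001A,SP001B" and "SP004A; SP004B" both normalize cleanly.
def split_specimens(cell):
    return [s.strip() for s in re.split("[,;]", cell)]

print(split_specimens("SP001A,SP001B"))   # ['SP001A', 'SP001B']
print(split_specimens("SP004A; SP004B"))  # ['SP004A', 'SP004B']
```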

Download the extract configuration for this new data, family_and_phenotype.py, and put it in my_study/ingest_package_1/extract_configs like this:

my_study/
└── ingest_package_1/
    └── extract_configs/
       └── family_and_phenotype.py

The full tutorial has a detailed explanation of what this file does, but for now the file should look like this:

my_study/ingest_package_1/extract_configs/family_and_phenotype.py
import re
from kf_lib_data_ingest.common import constants
from kf_lib_data_ingest.common.concept_schema import CONCEPT
from kf_lib_data_ingest.common.pandas_utils import Split
from kf_lib_data_ingest.etl.extract.operations import *

source_data_url = "https://raw.githubusercontent.com/kids-first/kf-lib-data-ingest/master/docs/data/family_and_phenotype.tsv"

source_data_read_params = {
    "header": 1,
    "usecols": lambda x: x != "[ignore]"
}


def observed_yes_no(x):
    if isinstance(x, str):
        x = x.lower()
    if x in {"true", "yes", 1}:
        return constants.PHENOTYPE.OBSERVED.YES
    elif x in {"false", "no", 0}:
        return constants.PHENOTYPE.OBSERVED.NO
    elif x in {"", None}:
        return None


operations = [
    value_map(
        in_col="participant",
        m={
            r"PID(\d+)": lambda x: int(x),  # strip PID and 0-padding
        },
        out_col=CONCEPT.PARTICIPANT.ID
    ),
    keep_map(
        in_col="mother",
        out_col=CONCEPT.PARTICIPANT.MOTHER_ID
    ),
    keep_map(
        in_col="father",
        out_col=CONCEPT.PARTICIPANT.FATHER_ID
    ),
    value_map(
        in_col="gender",
        # Don't worry about mother/father gender here.
        # We can create them in a later phase.
        m={
            "F": constants.GENDER.FEMALE,
            "M": constants.GENDER.MALE
        },
        out_col=CONCEPT.PARTICIPANT.GENDER
    ),
    value_map(
        in_col="specimens",
        m=lambda x: Split(re.split("[,;]", x)),
        out_col=CONCEPT.BIOSPECIMEN.ID
    ),
    [
        value_map(
            in_col=6,  # age (hrs) (first)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "CLEFT_EGO": "Cleft ego",
                "CLEFT_ID": "Cleft id"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ],
    [
        value_map(
            in_col=9,  # age (hrs) (second)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "EXTRA_EARDRUM": "Extra eardrum"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ]
]
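The source_data_read_params above are passed through to the underlying pandas-style file reader. As a standalone sketch (plain pandas, no ingest library) of what header=1 and the usecols callable accomplish on a file shaped like family_and_phenotype.tsv, using a tiny in-memory stand-in:

```python
import io
import pandas as pd

# A miniature stand-in for family_and_phenotype.tsv: the first row is all
# "[ignore]", the second row holds the real column names, and the first
# column is itself named "[ignore]".
tsv = (
    "[ignore]\t[ignore]\t[ignore]\n"
    "[ignore]\tparticipant\tgender\n"
    "x\tPID001\tF\n"
    "x\tPID004\tM\n"
)

# header=1 tells pandas to use row index 1 (the second row) as the header,
# discarding the row above it; the usecols callable then drops every
# column whose name is "[ignore]".
df = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=1,
    usecols=lambda c: c != "[ignore]",
)
print(list(df.columns))  # ['participant', 'gender']
print(len(df))           # 2
```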

Modify the Transform module

Next we want to merge our extracted tables together to properly connect the data needed for generating complete target entities.

For the quickstart, this will be performed by my_study/ingest_package_1/transform_module.py which currently looks like this:

my_study/ingest_package_1/transform_module.py
"""
Auto-generated transform module

Replace the contents of transform_function with your own code

See documentation at
https://kids-first.github.io/kf-lib-data-ingest/ for information on
implementing transform_function.
"""

from kf_lib_data_ingest.common.concept_schema import CONCEPT  # noqa F401

# Use these merge funcs, not pandas.merge
from kf_lib_data_ingest.common.pandas_utils import (  # noqa F401
    merge_wo_duplicates,
    outer_merge,
)
from kf_lib_data_ingest.config import DEFAULT_KEY


def transform_function(mapped_df_dict):
    """
    Merge DataFrames in mapped_df_dict into 1 DataFrame if possible.

    Return a dict that looks like this:

    {
        DEFAULT_KEY: all_merged_data_df
    }

    If not possible to merge all DataFrames into a single DataFrame then
    you can return a dict that looks something like this:

    {
        '<name of target concept>': df_for_<target_concept>,
        DEFAULT_KEY: all_merged_data_df
    }

    Target concept instances will be built from the default DataFrame unless
    another DataFrame is explicitly provided via a key, value pair in the
    output dict. The key must match the name of an existing target concept.
    The value will be the DataFrame to use when building instances of the
    target concept.

    A typical example would be:

    {
        'family_relationship': family_relationship_df,
        'default': all_merged_data_df
    }

    """
    df = mapped_df_dict["extract_config.py"]

    # df = outer_merge(
    #     mapped_df_dict['extract_config.py'],
    #     mapped_df_dict['family_and_phenotype.py'],
    #     on=CONCEPT.BIOSPECIMEN.ID,
    #     with_merge_detail_dfs=False
    # )

    return {DEFAULT_KEY: df}

Find the line near the bottom of that file that says

df = mapped_df_dict["extract_config.py"]

and replace that line with

df = outer_merge(
    mapped_df_dict['extract_config.py'],
    mapped_df_dict['family_and_phenotype.py'],
    on=CONCEPT.BIOSPECIMEN.ID,
    with_merge_detail_dfs=False
)

You can also just uncomment the block immediately below it, which contains the same new code in a Python comment. Either way, df is now defined as the combination of the two data files, joined together according to the values in their biospecimen identifier columns.
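Conceptually, this outer merge behaves like a pandas outer join on the shared biospecimen ID column. Here is a standalone sketch with plain pandas and made-up column names (the real pipeline uses CONCEPT attribute names and the library's outer_merge helper, which adds extra bookkeeping on top of the join):

```python
import pandas as pd

# Toy versions of the two extracted tables, keyed on a shared ID column.
clinical = pd.DataFrame({
    "BIOSPECIMEN|ID": ["SP001A", "SP001B", "SP002A"],
    "PARTICIPANT|GENDER": ["Female", "Female", "Female"],
})
phenotypes = pd.DataFrame({
    "BIOSPECIMEN|ID": ["SP001A", "SP001B", "SP003A"],
    "PHENOTYPE|NAME": ["Cleft ego", "Cleft ego", "Extra eardrum"],
})

# An outer join keeps every row from both sides: matched rows are combined,
# and unmatched rows from either side survive with NaN in the other side's
# columns.
merged = pd.merge(clinical, phenotypes, on="BIOSPECIMEN|ID", how="outer")
print(len(merged))  # 4: two matched rows plus one unmatched from each side
```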

Test Again

Then again run

$ kidsfirst test my_study/ingest_package_1 --no_validate

Now you should see that it also pretends to load phenotypes from the new data in addition to the rest of the information.