Quickstart
If you want to get up and running quickly with creating an ingest package and running it, this quickstart guide is the best place to start. It's OK if you do not yet understand what each step does. The tutorial section walks through the exact same demonstration with detailed explanations for each step.
Generate a new ingest package
Each study that needs to be ingested will have a single outer directory and one or more inner directories that each contain the configuration files needed by the ingest pipeline to extract, transform, and load parts of the study into a target data store or service. The inner directories are known as the ingest packages.
Note
We use an outer directory for each study because, for practical and logistical reasons, completing a study often requires multiple ingest packages.
The first thing to do after installing the ingest library CLI is to create an outer directory for a study named my_study and, inside it, a new ingest package named ingest_package_1:
$ kidsfirst new --dest_dir=my_study/ingest_package_1
Your new ingest package is extremely basic. It contains one test source data file and the Python modules needed to extract, clean, and transform the data for the target service. If you look inside it, it will look like this:
my_study/
└── ingest_package_1/
├── ingest_package_config.py
├── transform_module.py
├── data/
│ └── clinical.tsv
├── extract_configs/
│ └── extract_config.py
└── tests/
├── conftest.py
└── test_custom_counts.py
And the included test data looks like this:

| family | subject | sample | analyte | diagnosis | gender |
|---|---|---|---|---|---|
| f1 | PID001 | SP001A | dna (1) | flu | Female |
| f1 | PID001 | SP001B | rna (2) | cold | Female |
| f1 | PID002 | SP002A | dna (1) | flu | Female |
| f1 | PID002 | SP002B | rna (2) | cold | Female |
| f1 | PID003 | SP003A | dna (1) | flu | Male |
| f1 | PID003 | SP003B | rna (2) | cold | Male |
| f2 | PID004 | SP004A | dna (1) | flu | Male |
| f2 | PID004 | SP004B | rna (2) | cold | Male |
| f2 | PID005 | SP005A | dna (1) | flu | Female |
| f2 | PID005 | SP005B | rna (2) | flu | Female |
| f3 | PID006 | SP006 | rna (2) | flu | Male |
| f3 | PID007 | SP007 | rna (2) | flu | Male |
| f3 | PID008 | SP008A | dna (1) | flu | Female |
| f3 | PID008 | SP008B | rna (2) | flu | Female |
| f3 | PID009 | SP009A | dna (1) | flu | Male |
| f3 | PID009 | SP009B | rna (2) | flu | Male |
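The test file is an ordinary tab-separated table, so you can inspect it yourself with pandas if you like. A minimal sketch (using a small inline sample of the same shape rather than the real file path):

```python
import io

import pandas as pd

# A few rows shaped like the generated clinical.tsv (tab-separated)
tsv = (
    "family\tsubject\tsample\tanalyte\tdiagnosis\tgender\n"
    "f1\tPID001\tSP001A\tdna (1)\tflu\tFemale\n"
    "f1\tPID001\tSP001B\trna (2)\tcold\tFemale\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df.shape)  # (2, 6)
```

To read the real file, point pd.read_csv at my_study/ingest_package_1/data/clinical.tsv instead of the inline sample.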
Test
We can make sure everything works with the test command:
$ kidsfirst test my_study/ingest_package_1 --no_validate
The logs on your screen should indicate a successful unvalidated test run.
$ kidsfirst test my_study/ingest_package_1 --no_validate
2020-11-27 11:39:26,161 - DataIngestPipeline - Thread: MainThread - WARNING - Ingest will run with validation disabled!
...
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - Ingest skipped validation!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - ✅ Ingest pipeline completed execution!
2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - END data ingestion
If you scroll up through the log output, you should see that it pretends to load families, participants, diagnoses, and biospecimens from clinical.tsv.
Add another extract config
We're going to create a second extract config to extract data from the following source data file, which is different from the file you already have and is stored in the cloud:
| [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] |
|---|---|---|---|---|---|---|---|---|---|---|
| [ignore] | participant | mother | father | gender | specimens | age (hrs) | CLEFT_EGO | CLEFT_ID | age (hrs) | EXTRA_EARDRUM |
| [ignore] | PID001 | 2 | 3 | F | SP001A,SP001B | 4 | TRUE | FALSE | 4 | FALSE |
| [ignore] | PID002 |  |  |  | SP002A; SP002B | 435 | TRUE | FALSE | 435 | FALSE |
| [ignore] | PID003 |  |  |  | SP003A;SP003B | 34 | TRUE | FALSE | 34 | FALSE |
| [ignore] | PID004 | 5 | 6 | M | SP004A; SP004B | 4 | TRUE | TRUE | 4 | FALSE |
| [ignore] | PID005 |  |  |  | SP005A, SP005B | 345 | TRUE | TRUE | 34 | FALSE |
| [ignore] | PID006 |  |  |  | SP006 | 34 | TRUE | TRUE | 43545 | FALSE |
| [ignore] | PID007 | 8 | 9 | M | SP007 | 34 | TRUE | FALSE | 5 | TRUE |
| [ignore] | PID008 |  |  |  | SP008A,SP008B | 43545 | TRUE | TRUE | 52 | TRUE |
| [ignore] | PID009 |  |  |  | SP009A,SP009B | 5 | FALSE | TRUE | 25 | TRUE |
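Notice that this file has a filler first header row and several columns literally named "[ignore]". With pandas you can skip both when reading. A standalone sketch of that idea, using inline sample data rather than the real file:

```python
import io

import pandas as pd

# Two header rows, as in family_and_phenotype.tsv: the first is filler
tsv = (
    "[ignore]\t[ignore]\t[ignore]\n"
    "[ignore]\tparticipant\tgender\n"
    "x\tPID001\tF\n"
)
# header=1 takes the second row as the column names, and the usecols
# callable drops every column literally named "[ignore]"
df = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    header=1,
    usecols=lambda c: c != "[ignore]",
)
print(list(df.columns))  # ['participant', 'gender']
```

These are the same header and usecols settings you will see in the extract config's source_data_read_params below.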
Download the extract configuration for this new data, family_and_phenotype.py, and put it in my_study/ingest_package_1/extract_configs like this:
my_study/
└── ingest_package_1/
└── extract_configs/
└── family_and_phenotype.py
The full tutorial has a detailed explanation of what this file does, but for now the file should look like this:
```python
import re

from kf_lib_data_ingest.common import constants
from kf_lib_data_ingest.common.concept_schema import CONCEPT
from kf_lib_data_ingest.common.pandas_utils import Split
from kf_lib_data_ingest.etl.extract.operations import *

source_data_url = "https://raw.githubusercontent.com/kids-first/kf-lib-data-ingest/master/docs/data/family_and_phenotype.tsv"

source_data_read_params = {
    "header": 1,
    "usecols": lambda x: x != "[ignore]"
}


def observed_yes_no(x):
    if isinstance(x, str):
        x = x.lower()
    if x in {"true", "yes", 1}:
        return constants.PHENOTYPE.OBSERVED.YES
    elif x in {"false", "no", 0}:
        return constants.PHENOTYPE.OBSERVED.NO
    elif x in {"", None}:
        return None


operations = [
    value_map(
        in_col="participant",
        m={
            r"PID(\d+)": lambda x: int(x),  # strip PID and 0-padding
        },
        out_col=CONCEPT.PARTICIPANT.ID
    ),
    keep_map(
        in_col="mother",
        out_col=CONCEPT.PARTICIPANT.MOTHER_ID
    ),
    keep_map(
        in_col="father",
        out_col=CONCEPT.PARTICIPANT.FATHER_ID
    ),
    value_map(
        in_col="gender",
        # Don't worry about mother/father gender here.
        # We can create them in a later phase.
        m={
            "F": constants.GENDER.FEMALE,
            "M": constants.GENDER.MALE
        },
        out_col=CONCEPT.PARTICIPANT.GENDER
    ),
    value_map(
        in_col="specimens",
        m=lambda x: Split(re.split("[,;]", x)),
        out_col=CONCEPT.BIOSPECIMEN.ID
    ),
    [
        value_map(
            in_col=6,  # age (hrs) (first)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "CLEFT_EGO": "Cleft ego",
                "CLEFT_ID": "Cleft id"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ],
    [
        value_map(
            in_col=9,  # age (hrs) (second)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "EXTRA_EARDRUM": "Extra eardrum"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ]
]
```
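If you are curious how the trickier mappings in this config behave, the underlying pieces are plain Python. A hypothetical mini-demo of just those pieces (the value_map, melt_map, and Split machinery themselves belong to kf_lib_data_ingest and are not reproduced here):

```python
import re

# The regex key r"PID(\d+)" captures the digits of a participant ID,
# and int() drops any zero-padding: "PID001" becomes 1
m = re.match(r"PID(\d+)", "PID001")
print(int(m.group(1)))  # 1

# The specimens column mixes "," and ";" separators, so the config
# splits on the character class [,;]
print(re.split("[,;]", "SP001A,SP001B"))  # ['SP001A', 'SP001B']

# Ages arrive in hours; dividing by 24 converts them to days
print(int("4") / 24)
```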
Modify the Transform module
Next we want to merge our extracted tables together to properly connect the data needed for generating complete target entities. For the quickstart, this merge is performed by my_study/ingest_package_1/transform_module.py, which currently looks like this:
```python
"""
Auto-generated transform module

Replace the contents of transform_function with your own code

See documentation at
https://kids-first.github.io/kf-lib-data-ingest/ for information on
implementing transform_function.
"""
from kf_lib_data_ingest.common.concept_schema import CONCEPT  # noqa F401

# Use these merge funcs, not pandas.merge
from kf_lib_data_ingest.common.pandas_utils import (  # noqa F401
    merge_wo_duplicates,
    outer_merge,
)
from kf_lib_data_ingest.config import DEFAULT_KEY


def transform_function(mapped_df_dict):
    """
    Merge DataFrames in mapped_df_dict into 1 DataFrame if possible.

    Return a dict that looks like this:
    {
        DEFAULT_KEY: all_merged_data_df
    }

    If not possible to merge all DataFrames into a single DataFrame then
    you can return a dict that looks something like this:
    {
        '<name of target concept>': df_for_<target_concept>,
        DEFAULT_KEY: all_merged_data_df
    }

    Target concept instances will be built from the default DataFrame unless
    another DataFrame is explicitly provided via a key, value pair in the
    output dict. The key must match the name of an existing target concept.
    The value will be the DataFrame to use when building instances of the
    target concept.

    A typical example would be:
    {
        'family_relationship': family_relationship_df,
        'default': all_merged_data_df
    }
    """
    df = mapped_df_dict["extract_config.py"]

    # df = outer_merge(
    #     mapped_df_dict['extract_config.py'],
    #     mapped_df_dict['family_and_phenotype.py'],
    #     on=CONCEPT.BIOSPECIMEN.ID,
    #     with_merge_detail_dfs=False
    # )

    return {DEFAULT_KEY: df}
```
Find the line near the bottom that says

```python
df = mapped_df_dict["extract_config.py"]
```

and replace it with:
```python
df = outer_merge(
    mapped_df_dict['extract_config.py'],
    mapped_df_dict['family_and_phenotype.py'],
    on=CONCEPT.BIOSPECIMEN.ID,
    with_merge_detail_dfs=False
)
```
You can also just uncomment the block immediately below that line, which contains the same new code in a Python comment. This now defines df as the combination of the two data files joined together according to the values in their biospecimen identifier columns.
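If outer joins are unfamiliar, here is a rough standalone sketch of the same idea using plain pandas and made-up column names (the library's outer_merge wraps similar behavior with extra bookkeeping):

```python
import pandas as pd

# Made-up stand-ins for the two extracted tables, keyed on a shared
# biospecimen ID column (the real configs use CONCEPT.BIOSPECIMEN.ID)
clinical = pd.DataFrame({
    "biospecimen_id": ["SP001A", "SP001B", "SP002A"],
    "gender": ["Female", "Female", "Female"],
})
phenotype = pd.DataFrame({
    "biospecimen_id": ["SP001A", "SP001B"],
    "phenotype": ["Cleft ego", "Cleft ego"],
})
# how="outer" keeps rows from both sides even without a match, so
# SP002A survives with a missing phenotype value
merged = clinical.merge(phenotype, on="biospecimen_id", how="outer")
print(merged.shape)  # (3, 3)
```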
Test Again
Then run the test again:
$ kidsfirst test my_study/ingest_package_1 --no_validate
Now you should see that it also pretends to load phenotypes from the new data in addition to the rest of the information.