Transform Stage
The transform stage does one major thing:
Merge extracted tables together so that the data needed for generating complete target entities from the extracted data is properly connected.
Guided Transform Module
This is a Python module in your ingest package which must include a method
called transform_function. This method merges the extract stage’s output
Pandas DataFrames into one or more composite DataFrames and returns the
result.
If you used kidsfirst new to create your ingest package, you should already
have a transform_module.py file with the correct method signature, sample
code, and return type for transform_function.
Let’s take a look:
    """
    Auto-generated transform module

    Replace the contents of transform_function with your own code

    See documentation at
    https://kids-first.github.io/kf-lib-data-ingest/ for information on
    implementing transform_function.
    """
    from kf_lib_data_ingest.common.concept_schema import CONCEPT  # noqa F401

    # Use these merge funcs, not pandas.merge
    from kf_lib_data_ingest.common.pandas_utils import (  # noqa F401
        merge_wo_duplicates,
        outer_merge,
    )
    from kf_lib_data_ingest.config import DEFAULT_KEY


    def transform_function(mapped_df_dict):
        """
        Merge DataFrames in mapped_df_dict into 1 DataFrame if possible.

        Return a dict that looks like this:

        {
            DEFAULT_KEY: all_merged_data_df
        }

        If not possible to merge all DataFrames into a single DataFrame then
        you can return a dict that looks something like this:

        {
            '<name of target concept>': df_for_<target_concept>,
            DEFAULT_KEY: all_merged_data_df
        }

        Target concept instances will be built from the default DataFrame unless
        another DataFrame is explicitly provided via a key, value pair in the
        output dict. The key must match the name of an existing target concept.
        The value will be the DataFrame to use when building instances of the
        target concept.

        A typical example would be:

        {
            'family_relationship': family_relationship_df,
            'default': all_merged_data_df
        }
        """
        df = mapped_df_dict["extract_config.py"]

        # df = outer_merge(
        #     mapped_df_dict['extract_config.py'],
        #     mapped_df_dict['family_and_phenotype.py'],
        #     on=CONCEPT.BIOSPECIMEN.ID,
        #     with_merge_detail_dfs=False
        # )

        return {DEFAULT_KEY: df}
The transform_function method takes a single argument, mapped_df_dict, which
is a dict that maps each extract config file path to the Pandas DataFrame
produced by that extract config.
In the Example Extract Configuration section of this guide, we explored an extract configuration for a file called family_and_phenotype.tsv. The function shown above has a commented-out section of code that would, if we chose to name that new configuration family_and_phenotype.py, merge the two extracted outputs together. To join them on their respective participant ID columns, the on argument should be CONCEPT.PARTICIPANT.ID; note that the commented-out code as shown uses CONCEPT.BIOSPECIMEN.ID.
Modify your transform_function so that it merges your extracted DataFrames appropriately according to the function docstring shown above.
There are a few important things to note in the commented-out example code:
We outer merge DataFrames so that we don’t lose data
We use a CONCEPT attribute such as CONCEPT.PARTICIPANT.ID, not the string “PARTICIPANT|ID” itself
We do not use the pandas.merge method to merge the DataFrames
Merge Strategy
You will likely never want to inner merge/join your DataFrames, since an inner merge keeps only the records that match in both DataFrames and may therefore cause you to lose records. Depending on circumstances, however, you may sometimes want a left or right join.
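To make the difference concrete, here is a minimal sketch using plain pandas on two toy tables. The data, the extra column names, and the PARTICIPANT_ID constant (a stand-in for CONCEPT.PARTICIPANT.ID, used only so the snippet runs without the ingest library) are all illustrative assumptions; in a real transform module you would use the ingest library's merge functions instead.

```python
import pandas as pd

# Stand-in for CONCEPT.PARTICIPANT.ID so the sketch runs on its own.
PARTICIPANT_ID = "PARTICIPANT|ID"

# Two toy "extracted" tables that only partially overlap.
clinical = pd.DataFrame({PARTICIPANT_ID: ["P1", "P2", "P3"], "age": [7, 9, 12]})
phenotype = pd.DataFrame(
    {PARTICIPANT_ID: ["P2", "P3", "P4"], "phenotype": ["a", "b", "c"]}
)

# Inner merge: keeps only participants present in BOTH tables.
inner = pd.merge(clinical, phenotype, on=PARTICIPANT_ID, how="inner")

# Outer merge: keeps every participant from either table.
outer = pd.merge(clinical, phenotype, on=PARTICIPANT_ID, how="outer")

# Left merge: keeps every participant from the left table only.
left_join = pd.merge(clinical, phenotype, on=PARTICIPANT_ID, how="left")

inner_ids = sorted(inner[PARTICIPANT_ID])
outer_ids = sorted(outer[PARTICIPANT_ID])
left_ids = sorted(left_join[PARTICIPANT_ID])

print("inner:", inner_ids)
print("outer:", outer_ids)
print("left: ", left_ids)
```

The inner merge drops P1 and P4 entirely, which is the kind of record loss the outer merge avoids.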
Use concept schema to reference columns
The value of CONCEPT.PARTICIPANT.ID equates to “PARTICIPANT|ID”, a string
representing the participant concept’s identifier.
You should always use the CONCEPT class from concept schema, and not strings,
to reference join columns. This way, if the value of the concept attribute
changes (say, to “PARTICIPANT.IDENTIFIER”), your code won’t break silently.
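The following sketch illustrates the failure mode described above. The ParticipantV1 and ParticipantV2 classes are hypothetical stand-ins for two versions of a concept schema, not the real kf_lib_data_ingest classes; only the attribute-lookup pattern matters here.

```python
# Hypothetical stand-ins for two versions of a concept schema.
class ParticipantV1:
    ID = "PARTICIPANT|ID"


class ParticipantV2:
    ID = "PARTICIPANT.IDENTIFIER"  # the underlying string has been renamed


def id_via_constant(participant_schema):
    """Build a row keyed by the schema constant and read it back the same way."""
    row = {participant_schema.ID: "P1"}
    return row.get(participant_schema.ID)


def id_via_string(participant_schema):
    """Build the same row, but read it back with a hard-coded string."""
    row = {participant_schema.ID: "P1"}
    return row.get("PARTICIPANT|ID")


# The constant keeps working across the rename.
print(id_via_constant(ParticipantV1), id_via_constant(ParticipantV2))

# The hard-coded string silently stops matching after the rename.
print(id_via_string(ParticipantV1), id_via_string(ParticipantV2))
```

With the hard-coded string, the lookup after the rename returns None without raising any error, which is exactly the silent breakage the constant protects against. A misspelled attribute, by contrast, fails immediately with an AttributeError.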
Avoid pandas.merge - use the ingest library’s pandas_utils.py
You may use the pandas.merge method if you want, but the ingest library provides merge functions that add useful functionality on top of pandas.merge.
outer_merge automatically fills some of the data holes that will naturally
result from doing multiple sequential merges on partially-overlapping data.
It also has a keyword argument called with_merge_detail_dfs that will output 3 additional DataFrames useful for debugging:
a DataFrame of rows that matched in both the left and right DataFrames (equivalent to the DataFrame returned by an inner merge)
a DataFrame of rows that were ONLY in the left DataFrame
a DataFrame of rows that were ONLY in the right DataFrame
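To get a feel for what these three detail DataFrames contain, the sketch below reproduces the same split with plain pandas and its indicator option. This is an illustration of the concept only, not the ingest library's implementation; the toy data and the PARTICIPANT_ID stand-in for CONCEPT.PARTICIPANT.ID are assumptions.

```python
import pandas as pd

# Stand-in for CONCEPT.PARTICIPANT.ID so the sketch runs on its own.
PARTICIPANT_ID = "PARTICIPANT|ID"

left = pd.DataFrame({PARTICIPANT_ID: ["P1", "P2", "P3"], "age": [7, 9, 12]})
right = pd.DataFrame(
    {PARTICIPANT_ID: ["P2", "P3", "P4"], "phenotype": ["a", "b", "c"]}
)

# indicator=True adds a "_merge" column that records where each row came from:
# "both", "left_only", or "right_only".
merged = pd.merge(left, right, on=PARTICIPANT_ID, how="outer", indicator=True)

# Rows that matched in both tables (what an inner merge would return).
both = merged[merged["_merge"] == "both"].drop(columns="_merge")

# Rows found only in the left table.
left_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Rows found only in the right table.
right_only = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

print(sorted(both[PARTICIPANT_ID]))
print(sorted(left_only[PARTICIPANT_ID]))
print(sorted(right_only[PARTICIPANT_ID]))
```

Inspecting the left-only and right-only rows is usually the quickest way to spot mismatched or malformed identifiers between two source files.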
If you need to do a non-outer merge, you should use the merge_wo_duplicates
method, which is what provides outer_merge’s automatic hole-filling behavior.
See kf_lib_data_ingest.common.pandas_utils for details.
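As a rough illustration of what "hole-filling" can mean, the sketch below merges two tables that share a non-key column and coalesces the two copies of that column into one. The coalescing_merge helper is hypothetical and is only one plausible approach, assuming a single join column; consult pandas_utils for what merge_wo_duplicates actually does.

```python
import pandas as pd

# Stand-in for CONCEPT.PARTICIPANT.ID so the sketch runs on its own.
PARTICIPANT_ID = "PARTICIPANT|ID"


def coalescing_merge(left, right, on, how="outer"):
    """Hypothetical helper: merge on one column, then collapse duplicated
    non-key columns by taking the left value and falling back to the right."""
    merged = pd.merge(left, right, on=on, how=how, suffixes=("__left", "__right"))
    shared = [c for c in left.columns if c in right.columns and c != on]
    for col in shared:
        # combine_first keeps the left value unless it is missing.
        merged[col] = merged[col + "__left"].combine_first(merged[col + "__right"])
        merged = merged.drop(columns=[col + "__left", col + "__right"])
    return merged


# Both tables carry a "gender" column, each with gaps the other can fill.
left = pd.DataFrame({PARTICIPANT_ID: ["P1", "P2"], "gender": ["F", None]})
right = pd.DataFrame({PARTICIPANT_ID: ["P2", "P3"], "gender": ["M", "F"]})

result = coalescing_merge(left, right, on=PARTICIPANT_ID)
gender_by_id = result.set_index(PARTICIPANT_ID)["gender"].to_dict()
print(gender_by_id)
```

A plain pandas merge of these tables would instead leave two partially empty gender columns side by side.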
Optional - Return more than 1 DataFrame
Sometimes it isn’t possible to cleanly merge all of your extracted data into one monolithic DataFrame. In this case, you might merge subsets of your data into less-than-everything DataFrames which you only use to build instances of particular target concepts.
You can do this by setting a key in the transform function’s output dict specifically named after a particular target concept, with its value set to the DataFrame containing the data to build instances of that target concept.
Note
To further understand this, read: What the Load Stage Expects
For example:
    from kf_lib_data_ingest.common.concept_schema import CONCEPT

    # Use these merge funcs, not pandas.merge
    from kf_lib_data_ingest.common.pandas_utils import (
        merge_wo_duplicates,
        outer_merge
    )
    from kf_lib_data_ingest.config import DEFAULT_KEY


    def transform_function(mapped_df_dict):
        clinical_df = mapped_df_dict['extract_config.py']
        family_and_phenotype_df = mapped_df_dict['family_and_phenotype.py']

        merged = outer_merge(
            clinical_df,
            family_and_phenotype_df,
            on=CONCEPT.PARTICIPANT.ID,
            with_merge_detail_dfs=False
        )

        family_relationship_df = mapped_df_dict['pedigree_to_fam_rels.py']

        return {
            'family_relationship': family_relationship_df,
            DEFAULT_KEY: merged
        }
This transform stage output signals that the family_relationship_df DataFrame
should be used to build instances of family_relationship, and that the merged
DataFrame should be used to build instances of all the other target concepts
(e.g. participant, biospecimen, genomic_file).
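The selection rule described above can be sketched as a simple dict lookup with a fallback. The df_for_target_concept helper is hypothetical, DEFAULT_KEY is assumed here to be the string "default" (as the docstring example suggests), and plain strings stand in for the DataFrames so the snippet has no dependencies.

```python
# Assumed value of kf_lib_data_ingest.config.DEFAULT_KEY, for illustration only.
DEFAULT_KEY = "default"


def df_for_target_concept(transform_output, target_concept):
    """Hypothetical helper: pick the concept-specific DataFrame if one was
    provided, otherwise fall back to the default DataFrame."""
    if target_concept in transform_output:
        return transform_output[target_concept]
    return transform_output[DEFAULT_KEY]


# Strings stand in for the DataFrames returned by transform_function.
transform_output = {
    "family_relationship": "family_relationship_df",
    DEFAULT_KEY: "merged",
}

for concept in ["family_relationship", "participant", "biospecimen"]:
    print(concept, "->", df_for_target_concept(transform_output, concept))
```

Only family_relationship resolves to its own DataFrame; every other target concept falls through to the default one.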
Test Your Package
If you run your ingest package now with kidsfirst test and everything goes
well, your log should indicate that the transform function was applied and
that the GuidedTransformStage began and ended.
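If the transform stage fails, it can help to check the shape of your function's return value by hand. The sketch below is a hypothetical helper, not part of the ingest library or the kidsfirst CLI; it only checks the two properties the docstring describes, namely that the output is a dict and that each value is a DataFrame.

```python
import pandas as pd


def find_output_problems(output):
    """Hypothetical helper: return a list of human-readable problems with a
    transform_function return value (an empty list means none were found)."""
    if not isinstance(output, dict):
        return ["transform_function should return a dict, got "
                + type(output).__name__]
    problems = []
    if not output:
        problems.append("output dict is empty")
    for key, value in output.items():
        if not isinstance(value, pd.DataFrame):
            problems.append(
                f"value for key {key!r} is {type(value).__name__}, not a DataFrame"
            )
    return problems


# Example usage with toy values; "default" is the assumed value of DEFAULT_KEY.
good = {"default": pd.DataFrame({"PARTICIPANT|ID": ["P1"]})}
bad = {"default": "not a dataframe"}

print(find_output_problems(good))
print(find_output_problems(bad))
```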