Transform Stage

The transform stage does one major thing:

  1. Merge extracted tables together so that the data needed for generating complete target entities from the extracted data is properly connected.

Guided Transform Module

This is a Python module in your ingest package which must include a method called transform_function. This method merges the extract stage’s output Pandas DataFrames into one or more composite DataFrames and returns the result.

If you used kidsfirst new to create your ingest package, you should already have a transform_module.py file with the correct method signature, sample code, and return type for transform_function.

Let’s take a look:

my_study/ingest_package_1/transform_module.py
"""
Auto-generated transform module

Replace the contents of transform_function with your own code

See documentation at
https://kids-first.github.io/kf-lib-data-ingest/ for information on
implementing transform_function.
"""

from kf_lib_data_ingest.common.concept_schema import CONCEPT  # noqa F401

# Use these merge funcs, not pandas.merge
from kf_lib_data_ingest.common.pandas_utils import (  # noqa F401
    merge_wo_duplicates,
    outer_merge,
)
from kf_lib_data_ingest.config import DEFAULT_KEY


def transform_function(mapped_df_dict):
    """
    Merge DataFrames in mapped_df_dict into 1 DataFrame if possible.

    Return a dict that looks like this:

    {
        DEFAULT_KEY: all_merged_data_df
    }

    If not possible to merge all DataFrames into a single DataFrame then
    you can return a dict that looks something like this:

    {
        '<name of target concept>': df_for_<target_concept>,
        DEFAULT_KEY: all_merged_data_df
    }

    Target concept instances will be built from the default DataFrame unless
    another DataFrame is explicitly provided via a key, value pair in the
    output dict. The key must match the name of an existing target concept.
    The value will be the DataFrame to use when building instances of the
    target concept.

    A typical example would be:

    {
        'family_relationship': family_relationship_df,
        'default': all_merged_data_df
    }

    """
    df = mapped_df_dict["extract_config.py"]

    # df = outer_merge(
    #     mapped_df_dict['extract_config.py'],
    #     mapped_df_dict['family_and_phenotype.py'],
    #     on=CONCEPT.PARTICIPANT.ID,
    #     with_merge_detail_dfs=False
    # )

    return {DEFAULT_KEY: df}

The transform_function method takes a single argument, mapped_df_dict, which is a dict that maps each extract config file path to the Pandas DataFrame produced by that extract config.
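To make the shape of mapped_df_dict concrete, here is a minimal sketch that builds a stand-in dict by hand and summarizes its contents. The file names come from this guide, but the column names and values are invented for illustration. In a real run the ingest library builds this dict for you.

```python
import pandas as pd

# Hand-built stand-in for the dict passed to transform_function.
# Keys are extract config file paths; values are the extracted DataFrames.
# Column names and values are invented for illustration only.
mapped_df_dict = {
    "extract_config.py": pd.DataFrame(
        {"PARTICIPANT|ID": ["P1", "P2"], "BIOSPECIMEN|ID": ["S1", "S2"]}
    ),
    "family_and_phenotype.py": pd.DataFrame(
        {"PARTICIPANT|ID": ["P1", "P3"], "FAMILY|ID": ["F1", "F2"]}
    ),
}

# Summarize each extracted table before deciding how to merge them.
summary = {
    config_path: (df.shape, list(df.columns))
    for config_path, df in mapped_df_dict.items()
}
for config_path, (shape, columns) in summary.items():
    print(f"{config_path}: shape={shape}, columns={columns}")
```

Looking at which columns the tables share is usually the quickest way to pick a join column.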

In the Example Extract Configuration section of this guide, we explored an extract configuration for a file called family_and_phenotype.tsv. Note that the function shown above has a commented-out section of code that would, if we chose to name that new configuration family_and_phenotype.py, merge the two extracted outputs together by joining them on their respective participant ID columns.

Modify your transform_function so that it merges your extracted DataFrames appropriately according to the function docstring shown above.

There are a few important things to note in the commented-out example code:

  1. We outer merge DataFrames so that we don’t lose data

  2. We use CONCEPT.PARTICIPANT.ID, not the string “PARTICIPANT|ID” itself

  3. We do not use the pandas.merge function to merge the DataFrames

Merge Strategy

You will rarely want to inner merge/join your DataFrames, since the result contains only the records that match in both DataFrames, and every record that appears in just one of them is dropped. Depending on circumstances, however, you may sometimes want a left or right join.
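The difference between the join types can be sketched with two small, partially overlapping tables. This sketch uses plain pandas.merge and a literal column name only so that it runs without the ingest library installed. The data is invented, and in an ingest package you would use the library's merge functions and the CONCEPT class instead.

```python
import pandas as pd

KEY = "PARTICIPANT|ID"  # literal string used only to keep the sketch standalone

# Invented example data: P2 is the only participant present in both tables.
clinical = pd.DataFrame({KEY: ["P1", "P2"], "age": [10, 12]})
family = pd.DataFrame({KEY: ["P2", "P3"], "family": ["F1", "F2"]})

inner = pd.merge(clinical, family, on=KEY, how="inner")  # matches only
outer = pd.merge(clinical, family, on=KEY, how="outer")  # everything
left = pd.merge(clinical, family, on=KEY, how="left")    # all of clinical

print("inner:", sorted(inner[KEY]))
print("outer:", sorted(outer[KEY]))
print("left: ", sorted(left[KEY]))
```

The inner merge keeps only P2, so P1 and P3 are lost. The outer merge keeps all three participants, and the left merge keeps P1 and P2.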

Use concept schema to reference columns

The value of CONCEPT.PARTICIPANT.ID equates to “PARTICIPANT|ID”, a string representing the participant concept’s identifier.

You should always use the CONCEPT class from concept schema, not literal strings, to reference join columns. This way, if the underlying string value of a concept attribute changes (to say, “PARTICIPANT.IDENTIFIER”), your code keeps working instead of silently referring to a column name that no longer exists.
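The benefit of referencing columns through constants can be sketched as follows. The make_concept helper and the classes it builds are hypothetical stand-ins for the library's CONCEPT class, written so the example runs without kf_lib_data_ingest installed. They are not the library's implementation.

```python
import pandas as pd


def make_concept(participant_id_value):
    """Build a hypothetical stand-in for the library's CONCEPT class."""

    class Participant:
        ID = participant_id_value

    class Concept:
        PARTICIPANT = Participant

    return Concept


def participant_ids(df, concept):
    # The column is looked up through the constant, never a literal string,
    # so this function does not care what the underlying string is.
    return sorted(df[concept.PARTICIPANT.ID].unique())


# The same code works before and after the underlying string changes.
for value in ("PARTICIPANT|ID", "PARTICIPANT.IDENTIFIER"):
    CONCEPT = make_concept(value)
    df = pd.DataFrame({CONCEPT.PARTICIPANT.ID: ["P2", "P1", "P1"]})
    print(value, participant_ids(df, CONCEPT))
```

Had participant_ids used the literal "PARTICIPANT|ID", it would have needed editing as soon as the string value changed.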

Avoid pandas.merge - use the ingest library’s pandas_utils.py

You may use the pandas.merge function if you want, but the ingest library provides merge functions that add useful functionality on top of pandas.merge.

outer_merge automatically fills some of the data holes that will naturally result from doing multiple sequential merges on partially-overlapping data.

It also has a keyword argument called with_merge_detail_dfs that, when enabled, outputs 3 additional DataFrames useful for debugging:

  1. a DataFrame of rows that matched in both the left and right DataFrames (equivalent to the DataFrame returned by an inner merge)

  2. a DataFrame of rows that were ONLY in the left DataFrame

  3. a DataFrame of rows that were ONLY in the right DataFrame
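To show what these three detail DataFrames contain, the sketch below reproduces the same breakdown with plain pandas and its indicator option. It is an approximation for illustration, using invented data, and is not the ingest library's own code or return format.

```python
import pandas as pd

KEY = "PARTICIPANT|ID"  # literal string used only to keep the sketch standalone

# Invented example data: P2 is the only participant present in both tables.
left_df = pd.DataFrame({KEY: ["P1", "P2"], "age": [10, 12]})
right_df = pd.DataFrame({KEY: ["P2", "P3"], "family": ["F1", "F2"]})

# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(left_df, right_df, on=KEY, how="outer", indicator=True)

in_both = merged[merged["_merge"] == "both"]           # same rows as an inner merge
left_only = merged[merged["_merge"] == "left_only"]    # rows only in left_df
right_only = merged[merged["_merge"] == "right_only"]  # rows only in right_df

print("both:      ", in_both[KEY].tolist())
print("left only: ", left_only[KEY].tolist())
print("right only:", right_only[KEY].tolist())
```

Non-empty left-only or right-only tables are a quick signal that identifiers do not line up between the two sources.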

If you need to do a non-outer merge, you should use the merge_wo_duplicates function, which is what provides outer_merge’s automatic hole-filling behavior.
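As a rough picture of what filling data holes can look like, the sketch below merges two tables that both carry a gender column, each with gaps, and coalesces the two copies into one. It assumes that hole filling means filling missing values in one copy of a duplicated column from the other copy. It is written with plain pandas and invented data, and it is not the implementation of merge_wo_duplicates.

```python
import pandas as pd

KEY = "PARTICIPANT|ID"  # literal string used only to keep the sketch standalone

# Both tables have a "gender" column, and each knows things the other doesn't.
left_df = pd.DataFrame({KEY: ["P1", "P2"], "gender": ["F", None]})
right_df = pd.DataFrame({KEY: ["P2", "P3"], "gender": ["M", "F"]})

# A plain merge keeps both copies of the shared column as gender_x / gender_y.
merged = pd.merge(left_df, right_df, on=KEY, how="outer", suffixes=("_x", "_y"))

# Coalesce: prefer the left value, fall back to the right value where missing.
merged["gender"] = merged["gender_x"].fillna(merged["gender_y"])
filled = (
    merged.drop(columns=["gender_x", "gender_y"])
    .sort_values(KEY)
    .reset_index(drop=True)
)

print(filled)
```

Without the coalescing step, later merges would have to deal with two partially filled gender columns instead of one complete column.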

See kf_lib_data_ingest.common.pandas_utils for details.

Optional - Return more than 1 DataFrame

Sometimes it isn’t possible to cleanly merge all of your extracted data into one monolithic DataFrame. In this case, you might merge subsets of your data into smaller DataFrames, each of which is used only to build instances of particular target concepts.

You can do this by setting a key in the transform function’s output dict specifically named after a particular target concept, with its value set to the DataFrame containing the data to build instances of that target concept.

Note

To further understand this, read: What the Load Stage Expects

For example:

from kf_lib_data_ingest.common.concept_schema import CONCEPT
# Use these merge funcs, not pandas.merge
from kf_lib_data_ingest.common.pandas_utils import (
    merge_wo_duplicates,
    outer_merge
)
from kf_lib_data_ingest.config import DEFAULT_KEY


def transform_function(mapped_df_dict):
    clinical_df = mapped_df_dict['extract_config.py']
    family_and_phenotype_df = mapped_df_dict['family_and_phenotype.py']
    merged = outer_merge(
        clinical_df,
        family_and_phenotype_df,
        on=CONCEPT.PARTICIPANT.ID,
        with_merge_detail_dfs=False
    )

    family_relationship_df = mapped_df_dict['pedigree_to_fam_rels.py']

    return {
        'family_relationship': family_relationship_df,
        DEFAULT_KEY: merged
    }

This transform stage output indicates that the family_relationship_df DataFrame should be used to build instances of family_relationship and that the merged DataFrame should be used to build instances of all the other target concepts (e.g. participant, biospecimen, genomic_file).
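The lookup rule described above can be sketched as a small function. The helper name df_for_target_concept is hypothetical, and DEFAULT_KEY is assumed here to be the string 'default', as in the docstring's example. This illustrates the rule with invented data and is not the library's actual code.

```python
import pandas as pd

DEFAULT_KEY = "default"  # assumed value, matching the docstring's example


def df_for_target_concept(transform_output, concept_name):
    """Hypothetical helper: pick the DataFrame used for one target concept."""
    # Use the concept-specific DataFrame if one was provided,
    # otherwise fall back to the default DataFrame.
    return transform_output.get(concept_name, transform_output[DEFAULT_KEY])


# Invented stand-ins for the DataFrames returned by transform_function.
merged = pd.DataFrame({"PARTICIPANT|ID": ["P1", "P2"]})
family_relationship_df = pd.DataFrame({"relationship": ["mother"]})

transform_output = {
    "family_relationship": family_relationship_df,
    DEFAULT_KEY: merged,
}

for name in ("participant", "biospecimen", "family_relationship"):
    chosen = df_for_target_concept(transform_output, name)
    print(f"{name}: columns={list(chosen.columns)}")
```

Only family_relationship resolves to its own DataFrame; participant and biospecimen both fall back to the default one.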

Test Your Package

If you run your ingest package now with kidsfirst test and everything goes well, your log should indicate that the transform function was applied and that the GuidedTransformStage began and ended.