.. _Tutorial-Transform-Stage: =============== Transform Stage =============== The transform stage does one major thing: 1. Merge extracted tables together so that the data needed for generating complete target entities from the extracted data is properly connected. Guided Transform Module ======================= This is a Python module in your ingest package which must include a method called ``transform_function``. This method will merge the extract stage's output Pandas DataFrames into one or more composite DataFrame(s) and then returns the result. If you used ``kidsfirst new`` to create your ingest package, you should already have a `transform_module.py` file with the correct method signature, sample code, and return type for ``transform_function``. Let's take a look: .. literalinclude:: ../../../kf_lib_data_ingest/templates/my_ingest_package/transform_module.py :language: python :caption: my_study/ingest_package_1/transform_module.py The ``transform_function`` method has only one argument, the ``mapped_df_dict`` which is a dict of extract config file paths and the corresponding Pandas DataFrames produced by the extract configs. In the :ref:`Extract-Example` section of this guide, we explored an extract configuration for a file called `family_and_phenotype.tsv`. Note that the function shown above has a commented-out section of code that would, if we chose to name that new configuration `family_and_phenotype.py`, merge the two extracted outputs together by joining them on their respective participant ID columns. Modify your transform_function so that it merges your extracted DataFrames appropriately according to the function docstring shown above. There are a few important things to note in the commented-out example code: 1. We outer merge DataFrames so that we don't lose data 2. We use CONCEPT.PARTICIPANT.ID, not the string "PARTICIPANT|ID" itself 3. We do **not** use the ``Pandas.merge`` method to merge the DataFrames Merge Strategy ^^^^^^^^^^^^^^ You will likely never want to inner merge/join your DataFrames since this will result in a DataFrame with records that only match in both DataFrames. This may cause you to lose records. You may sometimes want to left/right join, however, depending on circumstances. Use concept schema to reference columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The value of ``CONCEPT.PARTICIPANT.ID`` equates to "PARTICIPANT|ID", a string representing the participant concept's identifier. You should always use the ``CONCEPT`` class from concept schema and not strings to reference join columns. This way, if the value of the concept attribute changes (to say, "PARTICIPANT.IDENTIFIER"), your code won't break silently. Avoid Pandas.merge - use ingest library's pandas_utils.py ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You `may` use the Pandas.merge method if you want, but the ingest library provides merge functions that add useful functionality on top of Pandas.merge. ``outer_merge`` automatically fills some of the data holes that will naturally result from doing multiple sequential merges on partially-overlapping data. It also has a keyword argument called `with_merge_detail_dfs` that will output 3 additional DataFrames useful for debugging: 1. a DataFrame of rows that matched in both the left and right DataFrames (equivalent to the DataFrame returned by an inner merge) 2. a DataFrame of rows that were ONLY in the left DataFrame 3. a DataFrame of rows that were ONLY in the right DataFrame If you need to do a non-outer merge, you should use the ``merge_wo_duplicates`` method, which is what provides ``outer_merge``'s automatic hole-filling behavior. See ``kf_lib_data_ingest.common.pandas_utils`` for details. Optional - Return more than 1 DataFrame ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Sometimes it isn't possible to cleanly merge all of your extracted data into one monolithic DataFrame. In this case, you might merge subsets of your data into less-than-everything DataFrames which you only use to build instances of particular target concepts. You can do this by setting a key in the transform function's output dict specifically named after a particular target concept, with its value set to the DataFrame containing the data to build instances of that target concept. .. note:: To further understand this, read: :ref:`Tutorial-Load-Stage-Expects` For example: .. code-block:: python from kf_lib_data_ingest.common.concept_schema import CONCEPT # Use these merge funcs, not pandas.merge from kf_lib_data_ingest.common.pandas_utils import ( merge_wo_duplicates, outer_merge ) from kf_lib_data_ingest.config import DEFAULT_KEY def transform_function(mapped_df_dict): clinical_df = mapped_df_dict['extract_config.py'] family_and_phenotype_df = mapped_df_dict['family_and_phenotype.py'] merged = outer_merge( clinical_df, family_and_phenotype_df, on=CONCEPT.PARTICIPANT.ID, with_merge_detail_dfs=False ) family_relationship_df = mapped_df_dict['pedigree_to_fam_rels.py'] return { 'family_relationship': family_relationship_df, DEFAULT_KEY: merged } This transform stage output signals to use the ``family_relationship_df`` DataFrame to build instances of ``family_relationship`` and the ``merged`` DataFrame to build instances of all the other target concepts (e.g. participant, biospecimen, genomic_file) Test Your Package ================= If you run your ingest package now with ``kidsfirst test`` and everything goes well, your log should indicate that the transform function was applied and that the GuidedTransformStage began and ended.