.. _Quickstart: ********** Quickstart ********** If you want to get up and running quickly with creating an ingest package and running it, then the quick start guide is the best place to start. It's ok if you do not understand what each of the steps are doing. The tutorial section goes through the exact same demonstration, but it includes detailed explanations for each step. Generate a new ingest package ============================= Each study that needs to be ingested will have a single outer directory and one or more inner directories that each contain the configuration files needed by the ingest pipeline to extract, transform, and load parts of the study into a target data store or service. The inner directories are known as the ingest packages. .. note:: We want to have an outer directory for each study, because we often need multiple ingest packages to complete an entire study for practical/logistical reasons. The first thing to do after installing the ingest library CLI is to create an outer directory for a study named ``my_study`` and inside it a new ingest package named ``ingest_package_1``: .. code-block:: text $ kidsfirst new --dest_dir=my_study/ingest_package_1 Your new ingest package is extremely basic. It has 1 test source data file and the Python modules needed to extract, clean, and transform the data for the target service. If you look inside it, it will look like this:: my_study/ └── ingest_package_1/ ├── ingest_package_config.py ├── transform_module.py ├── data/ │ └── clinical.tsv ├── extract_configs/ │ └── extract_config.py └── tests/ ├── conftest.py └── test_custom_counts.py And the included test data looks like .. csv-table:: my_study/ingest_package_1/data/clinical.tsv :delim: tab :header-rows: 1 :file: ../../kf_lib_data_ingest/templates/my_ingest_package/data/clinical.tsv Test ==== We can make sure everything works with the ``test`` command: .. code-block:: text $ kidsfirst test my_study/ingest_package_1 --no_validate The logs on your screen should indicate a successful unvalidated test run. .. code-block:: text $ kidsfirst test my_study/ingest_package_1 --no_validate 2020-11-27 11:39:26,161 - DataIngestPipeline - Thread: MainThread - WARNING - Ingest will run with validation disabled! ... 2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - Ingest skipped validation! 2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - ✅ Ingest pipeline completed execution! 2020-11-27 11:39:26,575 - DataIngestPipeline - Thread: MainThread - INFO - END data ingestion If you scroll up through the log output, you should see that it pretends to load families, participants, diagnoses, and biospecimens from clinical.tsv. Add another extract config ========================== We're going to create a second extract config to extract data from the following source data (different from the file you already have and stored in the cloud): .. csv-table:: family_and_phenotype.tsv :delim: tab :header-rows: 2 :file: ../../docs/data/family_and_phenotype.tsv Download the extract configuration for this new data :download:`family_and_phenotype.py` and put it in ``my_study/ingest_package_1/extract_configs`` like this:: my_study/ └── ingest_package_1/ └── extract_configs/    └── family_and_phenotype.py The full tutorial has a detailed explanation of what this file does, but for now the file should look like this: .. literalinclude:: tutorial/extract/family_and_phenotype.py :caption: my_study/ingest_package_1/extract_configs/family_and_phenotype.py Modify the Transform module =========================== Next we want to merge our extracted tables together to properly connect the data needed for generating complete target entities. For the quickstart, this will be performed by ``my_study/ingest_package_1/transform_module.py`` which currently looks like this: .. literalinclude:: ../../kf_lib_data_ingest/templates/my_ingest_package/transform_module.py :caption: my_study/ingest_package_1/transform_module.py :emphasize-lines: 53 Find the line near the bottom that says (highlighted above) .. code-block:: Python df = mapped_df_dict["extract_config.py"] and replace that line with .. code-block:: Python df = outer_merge( mapped_df_dict['extract_config.py'], mapped_df_dict['family_and_phenotype.py'], on=CONCEPT.BIOSPECIMEN.ID, with_merge_detail_dfs=False ) You can also just uncomment the block immediately below it that contains the same new code in a Python comment. This now defines ``df`` as the combination of the two data files joined together according to the values in their biospecimen identifier columns. Test Again ========== Then again run .. code-block:: text $ kidsfirst test my_study/ingest_package_1 --no_validate Now you should see that it also pretends to load phenotypes from the new data in addition to the rest of the information.