New Ingest Package Setup¶
Each study that needs to be ingested will have a single outer directory and one or more inner directories that each contain the configuration files needed by the ingest pipeline to extract, transform, and load parts of the study into a target data store or service. The inner directories are known as the ingest packages.
Note
We want to have an outer directory for each study, because we often need multiple ingest packages to complete an entire study for practical/logistical reasons.
Create an outer directory for my_study
and inside it a new ingest package
named ingest_package_1
:
$ kidsfirst new --dest_dir=my_study/ingest_package_1
This will generate a minimal study directory for you that looks like this:
my_study/
└── ingest_package_1/
├── ingest_package_config.py
├── transform_module.py
├── data
│ └── clinical.tsv
└── extract_configs/
└── extract_config.py
Ingest Package Components¶
Each ingest package contains three components:
1. A top-level configuration file called ingest_package_config.py
. This
defines study metadata such as the study’s target ID as well as paths to the
other two components and logging parameters. It will look something like:
project = 'SD_ME0WME0W' extract_config_dir = 'extract_configs' transform_function_path = 'transform_module.py' log_level = 'debug' overwrite_log = True
2. A directory of Extract Configuration files. These files define how the extract stage interprets each of the data files provided by the investigator. These are described in the Extract Stage section.
3. A guided Transform Module file. This is used by the transform stage and replaces automatic graph-based transformation with a predictable set of user-programmed operations. This is described in the Transform Stage section.