Library Code Repository: https://github.com/kids-first/kf-lib-data-ingest
Overview¶
The Kids First Data Ingest Library is both an ETL (extract, transform, load) framework and a library that standardizes the ingestion of raw Kids First study data into target Kids First services.
Framework and Library¶
Since this is a library as well as a framework, there are components of the
kf_lib_data_ingest
package that can be used as standalone tools or
utilities. The ExtractStage
for example can be used as a standalone tool
for data selection and mapping (see Extract Stage for details).
Anything in kf_lib_data_ingest.utilities
can be used as a standalone tool.
Additionally, there are a handful of helper functions in
kf_lib_data_ingest.common.pandas_utils
that could be used outside of the
framework for data wrangling tasks. See <TODO> for details.
Ingest App¶
The library comes with a built-in command-line-interface-based (CLI) app which is the primary user interface for executing the ingest pipeline. Most users will use this app to create new ingest packages, test packages, and ingest Kids First study datasets into the Kids First Data Service.
Ingest Packages¶
In order to use this library to ingest a study, users must create ingest packages which contain all of the necessary configuration defining basic ingest input parameters and also how to extract and transform data.
Ingest Pipeline¶
Users¶
Users of the ingest system will likely fall into 3 categories:
Ingest Operator
Will likely just be running existing ingest packages
May need to learn how to modify configuration by inspection of existing ingest package configurations
Ingest Package Developer
Understands how to create new ingest packages and modify existing ones
Knows Python well and likely knows Pandas fairly well
Target Service Plugin Developer
Understands the intricacies of how to query and submit data to their target service
Understands the Load process and plugin API defined in this documentation and explored in existing target service plugins
Knows Python well and likely knows Pandas fairly well
Inputs¶
Local or remote source data files
Configuration files
Source data files might look like this:
p id |
gender |
sample id |
---|---|---|
PID001 |
f |
SS001 |
PID002 |
female |
SS002 |
PID003 |
m |
SS003 |
Outputs¶
Ingest log - A log file containing runtime details of an executed ingest job
Serialized stage output - The output of each stage execution in a serializable form written to disk
Extract Stage¶
The extract stage does the following:
Retrieve - Fetch the source data files, local or remote, and read them into memory
Select - Extract the desired subset of data from the source data files
Clean - Clean the source data (remove trailing spaces, select substrings, etc)
Map - Map the cleaned data to a set of standard Kids First attributes and values
The standard set of attributes and values are defined in the Kids First Standard Concept Schema. (See kf_lib_data_ingest/common/concept_schema.py and kf_lib_data_ingest/common/constants.py)
Extract stage output might look like this:
CONCEPT.PARTICIPANT.ID |
CONCEPT.PARTICIPANT.GENDER |
CONCEPT.BIOSPECIMEN.ID |
---|---|---|
PID001 |
Female |
SS001 |
PID002 |
Female |
SS002 |
PID003 |
Male |
SS003 |
Transform Stage¶
Using a user-defined transform function, the transform stage combines the individual tables created by the Extract Stage into a new set of “records” tables where each table row contains all of the information for an instance of a real world entity or event.
Example: If one of your extracted files contains participant age and consent information, and another extracted file contains participant race and ethnicity information, join the two tables on the participant ID column to produce a new participant records table that contains age, consent, race, and ethnicity for your participants.
Example: If half of your specimens are recorded in one file, and the other half are recorded in another file, concatenate the extracted tables from those two files together to produce one unified specimen records table.
Load Stage¶
The Load Stage builds target entity payloads from the records emitted by the transform stage, determines whether each entity already exists in the target service by its uniquely identifying components, and then submits each completed entity payload to the target service.