Understanding the Load Stage

What the Load Stage Expects
The Load stage expects to receive a dictionary of pandas DataFrames, keyed by hints about which data each table contains.
Example input:

{
    "default": <pandas DataFrame 1>,
    "phenotypes": <pandas DataFrame 2>,
    "family_relationships": <pandas DataFrame 3>,
    ...
}
The keys must either be the value of a class_name attribute from one of your target service plugin's included entity builders or "default" (the "default" entry will be used for any entity builder that isn't explicitly named in the dict), and each DataFrame must contain all of the necessary data used by its respective target service entity builder.
In the above example, if building "phenotypes" requires knowing who is being described, what characteristic they have, and when the characteristic was observed, then the attached DataFrame must include those values.
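As a concrete illustration, such an input dictionary might be assembled like this. The column names here are hypothetical; the real required columns are determined by your target service plugin's entity builders, not by this example:

```python
import pandas as pd

# Hypothetical columns answering "who", "what", and "when" for phenotypes.
phenotypes_df = pd.DataFrame(
    {
        "participant_id": ["P1", "P2"],
        "characteristic": ["Cleft lip", "Asthma"],
        "observed_at": ["2020-01-15", "2021-06-02"],
    }
)

# A fallback table used by any entity builder not explicitly keyed below.
default_df = pd.DataFrame({"participant_id": ["P1", "P2"]})

# One entry per entity builder class_name, plus the "default" fallback.
load_input = {
    "default": default_df,
    "phenotypes": phenotypes_df,
}
```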
Using The Load Stage On Its Own
If your data is already appropriately standardized, you don't need to use the Extract and Transform stages in order to use the Load stage. The Load stage can be invoked on its own in Python like so:
from kf_lib_data_ingest.common.io import read_df
from kf_lib_data_ingest.etl.load.load_shim import LoadStage

path_to_my_target_service_plugin = "foo/bar/my_plugin.py"
target_service_base_url = "https://my_service:8080"
list_of_class_names_to_load = ["patient", "family_relationship", "specimen"]
study_id = "SD_ME0WME0W"
path_to_cache_storage_directory = "foo/bar/my_output"

LoadStage(
    path_to_my_target_service_plugin,
    target_service_base_url,
    list_of_class_names_to_load,
    study_id,
    path_to_cache_storage_directory,
).run(
    {
        "default": read_df("my_input/default.tsv"),
        "family_relationship": read_df("my_input/family_relationships.tsv"),
    }
)
As mentioned, you may choose to do this if all of your incoming data is guaranteed to be in a strict standardized format that requires no special slicing, chopping, or blending.
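For instance, files like the TSVs read above can be produced with plain pandas. This is only a sketch: the file names mirror the example, but the column names are illustrative assumptions; your plugin's entity builders define the columns that are actually required:

```python
import pandas as pd

# Illustrative standardized tables (column names are assumptions, not
# prescribed by the library).
default_df = pd.DataFrame(
    {
        "participant_id": ["P1", "P2", "P3"],
        "gender": ["Female", "Male", "Female"],
    }
)
family_df = pd.DataFrame(
    {
        "person1": ["P1", "P1"],
        "person2": ["P2", "P3"],
        "relationship": ["Mother", "Mother"],
    }
)

# Write tab-separated files of the kind the Load stage example reads back
# in with read_df.
default_df.to_csv("default.tsv", sep="\t", index=False)
family_df.to_csv("family_relationships.tsv", sep="\t", index=False)
```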