Design Overview
The ultimate goal for ingest is to take raw investigator data and compose from it a series of simple and unambiguous factual statements of the form:
There is a thing X.
Thing X has properties A, B, and C.
Thing Y comes from thing X.
Example:
There is a participant P1.
Participant P1 has properties age=7, sex=male, and race=unknown.
There is a biospecimen S1.
Biospecimen S1 comes from Participant P1.
For the most part, the investigator’s raw data tables already implicitly encode the above statements as row colinearity. When the investigator puts a specimen ID in the same row as a participant ID, usually that means that the specimen came from that participant.
For instance, the above statements might have been written like this:
Participant ID | Specimen ID | Participant Age | Participant Sex | Participant Race |
---|---|---|---|---|
P1 | S1 | 7 | m | unknown |
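To make the idea of row colinearity concrete, the following minimal sketch reads the statements off the single row above. It is only an illustration: the function name `statements_from_row` and the dictionary representation of a row are assumptions made for this example, not part of the ingest library.

```python
# Hypothetical sketch: the row layout and function name are illustrative
# assumptions, not the ingest library's actual API.
row = {
    "Participant ID": "P1",
    "Specimen ID": "S1",
    "Participant Age": "7",
    "Participant Sex": "m",
    "Participant Race": "unknown",
}


def statements_from_row(row):
    """Spell out the statements that one table row implicitly encodes."""
    participant = row["Participant ID"]
    specimen = row["Specimen ID"]
    # Values that share a row with the participant ID describe that participant.
    properties = {
        "age": row["Participant Age"],
        "sex": row["Participant Sex"],
        "race": row["Participant Race"],
    }
    property_text = ", ".join(f"{name}={value}" for name, value in properties.items())
    return [
        f"There is a participant {participant}.",
        f"Participant {participant} has properties {property_text}.",
        f"There is a biospecimen {specimen}.",
        # The specimen ID shares a row with the participant ID, so it is
        # taken to come from that participant.
        f"Biospecimen {specimen} comes from participant {participant}.",
    ]


for statement in statements_from_row(row):
    print(statement)
```

Note that the raw value `m` is carried through unchanged here; mapping it to a standard value such as `male` would be a separate step.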
The Extract stage exists partly to fix any source data that either does not encode the three fundamental statements by row colinearity or accidentally encodes a relationship that does not actually exist among the things being described.
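As one hypothetical case of source data that does not encode a relationship by row colinearity, suppose an investigator writes each participant ID only once and leaves it blank on the following rows, as happens with merged spreadsheet cells. The sketch below repairs that by copying the ID downward. The data, the column names, and the `fill_down` helper are all assumptions made for illustration and do not show how the Extract stage is actually implemented.

```python
# Hypothetical raw rows: the participant ID appears once per group and is
# left blank on the rows beneath it.
raw_rows = [
    {"Participant ID": "P1", "Specimen ID": "S1"},
    {"Participant ID": "", "Specimen ID": "S2"},
    {"Participant ID": "P2", "Specimen ID": "S3"},
]


def fill_down(rows, column):
    """Return copies of the rows with blank cells in `column` filled from above."""
    filled = []
    last_seen = None
    for row in rows:
        value = row[column] or last_seen
        if value is None:
            # A blank cell with nothing above it cannot be repaired.
            raise ValueError(f"No value available to fill column {column!r}")
        last_seen = value
        filled.append({**row, column: value})
    return filled


for row in fill_down(raw_rows, "Participant ID"):
    print(row)
```

After this repair every row carries both IDs, so the row holding `S2` now states that specimen S2 comes from participant P1.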