Multiple Meltable Groups¶
This can get tricky. There are many ways to deal with this, each with their own pros and cons.
Scenario¶
You want to create KFDRC dataservice Participant and Phenotype entities from data that looks like this:
Participant |
Age |
Mass |
Cleft Ear |
Mass/Cleft Age |
Tennis Fingers |
Tennis Age |
---|---|---|---|---|---|---|
P1 |
70 |
yes |
yes |
60 |
no |
65 |
P2 |
80 |
no |
yes |
80 |
no |
45 |
Note that there two different groups of phenotype measurements recorded at different ages.
Correct Output¶
Participant |
Age |
Phenotype Name |
Phenotype Observed |
Phenotype Age |
---|---|---|---|---|
P1 |
70 |
Mass |
yes |
60 |
P2 |
80 |
Mass |
no |
80 |
P1 |
70 |
Cleft Ear |
yes |
60 |
P2 |
80 |
Cleft Ear |
yes |
80 |
P1 |
70 |
Tennis Fingers |
no |
65 |
P2 |
80 |
Tennis Fingers |
no |
45 |
Non-Melting Solution¶
You can avoid melting and use basic Parallel Column Stacking with a triplet of operations for each phenotype. Just be careful about operation order.
operations = [
# participants
keep_map(
in_col="Participant",
out_col=CONCEPTS.PARTICIPANT.ID
),
keep_map(
in_col="Age",
out_col=CONCEPTS.PARTICIPANT.AGE
),
# massless phenotypes
keep_map(
in_col="Mass/Cleft Age",
out_col=CONCEPTS.PHENOTYPE.AGE_AT_OBSERVATION
),
constant_map(
m="Mass",
out_col=CONCEPTS.PHENOTYPE.NAME
),
value_map(
in_col="Mass",
out_col=CONCEPTS.PHENOTYPE.OBSERVED,
m={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
# cleft elbow phenotypes
keep_map(
in_col="Mass/Cleft Age",
out_col=CONCEPTS.PHENOTYPE.AGE_AT_OBSERVATION
),
constant_map(
m="Cleft Ear",
out_col=CONCEPTS.PHENOTYPE.NAME
),
value_map(
in_col="Cleft Ear",
out_col=CONCEPTS.PHENOTYPE.OBSERVED,
m={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
# tennis finger phenotypes
keep_map(
in_col="Tennis Age",
out_col=CONCEPTS.PHENOTYPE.AGE_AT_OBSERVATION
),
constant_map(
m="Tennis Fingers",
out_col=CONCEPTS.PHENOTYPE.NAME
),
value_map(
in_col="Tennis Fingers",
out_col=CONCEPTS.PHENOTYPE.OBSERVED,
m={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
]
As long as you consistently put the grouped operations together, the result should be correct. If, however, you were to swap the two value_map operations with each other, then you would be associating the wrong observation to each phenotype.
Melting Solution¶
Use melt, and wrap the melt groups in Nested Operations Sublists to safely navigate the column-length consequences of having multiple columns melted together and then grouped with another shorter column which is then lengthened by Parallel Column Stacking from another melt group.
operations = [
# participants
keep_map(
in_col="Participant",
out_col=CONCEPTS.PARTICIPANT.ID
),
keep_map(
in_col="Age",
out_col=CONCEPTS.PARTICIPANT.AGE
),
# mass/cleft phenotypes group
[
keep_map(
in_col="Mass/Cleft Age",
out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
),
melt_map(
var_name=CONCEPT.PHENOTYPE.NAME,
map_for_vars={
"Mass": "Mass",
"Cleft Ear": "Cleft Ear"
},
value_name=CONCEPT.PHENOTYPE.OBSERVED,
map_for_values={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
],
# tennis fingers phenotype group
[
keep_map(
in_col="Tennis Age",
out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
),
melt_map(
var_name=CONCEPT.PHENOTYPE.NAME,
map_for_vars={
"Tennis Fingers": "Tennis Fingers"
},
value_name=CONCEPT.PHENOTYPE.OBSERVED,
map_for_values={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
],
Caution
Without nested sublists clustering the operations into groups, the result would be wrong.
Melting Solution With Multiple Smaller Extracts¶
Nothing says that you need to extract the whole file all at once. You can also choose to virtually divide the source data into simple chunks, which is another way of dealing with the complex length stacking problem:
Participant |
Age |
---|---|
P1 |
70 |
P2 |
80 |
operations = [
keep_map(
in_col="Participant",
out_col=CONCEPTS.PARTICIPANT.ID
),
keep_map(
in_col="Age",
out_col=CONCEPTS.PARTICIPANT.AGE
)
]
and
Participant |
Mass/Cleft Age |
Mass |
Cleft Ear |
---|---|---|---|
P1 |
60 |
yes |
yes |
P2 |
80 |
no |
yes |
operations = [
keep_map(
in_col="Participant",
out_col=CONCEPTS.PARTICIPANT.ID
),
keep_map(
in_col="Mass/Cleft Age",
out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
),
melt_map(
var_name=CONCEPT.PHENOTYPE.NAME,
map_for_vars={
"Mass": "Mass",
"Cleft Ear": "Cleft Ear"
},
value_name=CONCEPT.PHENOTYPE.OBSERVED,
map_for_values={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
]
and
Participant |
Tennis Age |
Tennis Fingers |
---|---|---|
P1 |
65 |
no |
P2 |
45 |
no |
operations = [
keep_map(
in_col="Participant",
out_col=CONCEPTS.PARTICIPANT.ID
),
keep_map(
in_col="Tennis Age",
out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
),
melt_map(
var_name=CONCEPT.PHENOTYPE.NAME,
map_for_vars={
"Tennis Fingers": "Tennis Fingers"
},
value_name=CONCEPT.PHENOTYPE.OBSERVED,
map_for_values={
"yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
"no": constants.PHENOTYPE.OBSERVED.NEGATIVE
}
)
]