Extract Stage

The extract stage does three main things:

  1. Select the desired subset of data from the source data files

  2. Clean the selected data (remove trailing whitespace, etc.)

  3. Map the cleaned data’s attributes and values to Kids First entity attributes and acceptable values.

The extract configuration files instruct the extract stage how to accomplish the above.

Example Extract Configuration

Every data file provided by the investigator will require at least one extract configuration file. The reasons why will hopefully become clear after reading this section.

We will step through how to write an extract configuration file for a source data file.

Here’s an example source data file:

family_and_phenotype.tsv

| [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] | [ignore] |
| [ignore] | participant | mother | father | gender | specimens | age (hrs) | CLEFT_EGO | CLEFT_ID | age (hrs) | EXTRA_EARDRUM |
| [ignore] | PID001 | 2 | 3 | F | SP001A,SP001B | 4 | TRUE | FALSE | 4 | FALSE |
| [ignore] | PID002 |  |  |  | SP002A; SP002B | 435 | TRUE | FALSE | 435 | FALSE |
| [ignore] | PID003 |  |  |  | SP003A;SP003B | 34 | TRUE | FALSE | 34 | FALSE |
| [ignore] | PID004 | 5 | 6 | M | SP004A; SP004B | 4 | TRUE | TRUE | 4 | FALSE |
| [ignore] | PID005 |  |  |  | SP005A, SP005B | 345 | TRUE | TRUE | 34 | FALSE |
| [ignore] | PID006 |  |  |  | SP006 | 34 | TRUE | TRUE | 43545 | FALSE |
| [ignore] | PID007 | 8 | 9 | M | SP007 | 34 | TRUE | FALSE | 5 | TRUE |
| [ignore] | PID008 |  |  |  | SP008A,SP008B | 43545 | TRUE | TRUE | 52 | TRUE |
| [ignore] | PID009 |  |  |  | SP009A,SP009B | 5 | FALSE | TRUE | 25 | TRUE |

In order to ingest this data into the Kids First ecosystem, we will need to:

  • Ignore everything marked with [ignore], meaning our table actually starts at the second row and second column

  • Convert the column headers into a standardized set of concepts that the toolchain can understand. The values are defined in the Standard Concept Schema located in kf_lib_data_ingest.common.concept_schema.CONCEPT

  • Unify the formats of the participant IDs and the mother/father IDs

  • Convert the M and F in the gender column to standardized values for Male/Female found in kf_lib_data_ingest.common.constants, because we want to use standard constant codes wherever possible.

  • Split delimited specimens apart into their own rows

  • Convert age (hrs) to age in days

  • Reshape the CLEFT_EGO, CLEFT_ID, and EXTRA_EARDRUM columns into observation events

  • Convert the TRUE and FALSE strings into standardized observation codes

The following extract configuration file accomplishes all of those needs. (We will explain how each piece works after showing the whole thing.)

family_and_phenotype.py
import re
from kf_lib_data_ingest.common import constants
from kf_lib_data_ingest.common.concept_schema import CONCEPT
from kf_lib_data_ingest.common.pandas_utils import Split
from kf_lib_data_ingest.etl.extract.operations import *

source_data_url = "https://raw.githubusercontent.com/kids-first/kf-lib-data-ingest/master/docs/data/family_and_phenotype.tsv"

source_data_read_params = {
    "header": 1,
    "usecols": lambda x: x != "[ignore]"
}


def observed_yes_no(x):
    if isinstance(x, str):
        x = x.lower()
    if x in {"true", "yes", 1}:
        return constants.PHENOTYPE.OBSERVED.YES
    elif x in {"false", "no", 0}:
        return constants.PHENOTYPE.OBSERVED.NO
    elif x in {"", None}:
        return None


operations = [
    value_map(
        in_col="participant",
        m={
            r"PID(\d+)": lambda x: int(x),  # strip PID and 0-padding
        },
        out_col=CONCEPT.PARTICIPANT.ID
    ),
    keep_map(
        in_col="mother",
        out_col=CONCEPT.PARTICIPANT.MOTHER_ID
    ),
    keep_map(
        in_col="father",
        out_col=CONCEPT.PARTICIPANT.FATHER_ID
    ),
    value_map(
        in_col="gender",
        # Don't worry about mother/father gender here.
        # We can create them in a later phase.
        m={
            "F": constants.GENDER.FEMALE,
            "M": constants.GENDER.MALE
        },
        out_col=CONCEPT.PARTICIPANT.GENDER
    ),
    value_map(
        in_col="specimens",
        m=lambda x: Split(re.split("[,;]", x)),
        out_col=CONCEPT.BIOSPECIMEN.ID
    ),
    [
        value_map(
            in_col=6,  # age (hrs) (first)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "CLEFT_EGO": "Cleft ego",
                "CLEFT_ID": "Cleft id"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ],
    [
        value_map(
            in_col=9,  # age (hrs) (second)
            m=lambda x: int(x) / 24,
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "EXTRA_EARDRUM": "Extra eardrum"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values=observed_yes_no
        )
    ]
]

First, download family_and_phenotype.py and put it in the extract_configs directory like this:

my_study/
└── ingest_package_1/
    └── extract_configs/
       └── family_and_phenotype.py

Now let’s break down how it works…

Imports!

import re
from kf_lib_data_ingest.common import constants
from kf_lib_data_ingest.common.concept_schema import CONCEPT
from kf_lib_data_ingest.common.pandas_utils import Split
from kf_lib_data_ingest.etl.extract.operations import *

It’s a Python module! Cool! That lets us do all kinds of neat stuff like import predefined constants and functions and use arbitrarily complex function code to process our data.

Fetching the data

source_data_url = "https://raw.githubusercontent.com/kids-first/kf-lib-data-ingest/master/docs/data/family_and_phenotype.tsv"

The first thing that the extractor does for every config file is fetch the related source data. This specifies where the file lives so that the code can fetch it.

Supported protocol prefixes are: file://, s3://, http://, https://

For files stored by the Kids First Data Tracker, see the “Source Data stored by the Kids First Data Tracker” tutorial.

If you downloaded the test data file and want to load your local copy instead of from a remote server, you would use:

source_data_url = "file://path/to/family_and_phenotype.tsv"

Reading the file

source_data_read_params = {
    "header": 1,
    "usecols": lambda x: x != "[ignore]"
}

Extract tries to automatically pick the right pandas file reader for the given file extension (read_json for json files, read_csv for csv/tsv, read_excel for excel files), but sometimes the reader needs guidance from input parameters. To pass keyword parameters to the file reader, we define a dict called source_data_read_params which corresponds with the Python pandas IO parameters described in http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

The example data file contains tab-separated values (hence the filename ending with “.tsv”) with a non-standard layout where we need to ignore the first row and the first column.

If the data had had the simplest layout (the column headers being on the first row, etc), then it would get loaded correctly by default without needing any parameters here, but with complex arrangements we have to configure the reader.
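For intuition, here is roughly the pandas call that those parameters translate into for a .tsv file. This is only a sketch; the extract stage chooses the reader and passes the parameters for you:

import pandas as pd

# A sketch of what the extract stage effectively does with our parameters
# when it reads a .tsv file (the library handles this for you):
df = pd.read_csv(
    "family_and_phenotype.tsv",
    sep="\t",                           # tab-separated values
    header=1,                           # column headers are on the second row
    usecols=lambda x: x != "[ignore]",  # drop every column named "[ignore]"
)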

Extract operations

The operations list

operations = [
    ...
]

This is a list of special functions that the extract stage will execute to select subsets of source data and then clean and map that data to the desired attributes and value formats. The most useful functions are already written for you. You just have to invoke them appropriately.

For more information about extract operation functions, read Predefined Mapping Operations.

A value map operation with functional replacements

value_map(
    in_col="participant",
    m={
        r"PID(\d+)": lambda x: int(x),  # strip PID and 0-padding
    },
    out_col=CONCEPT.PARTICIPANT.ID
)

This says “Use the "participant" column as input; replace everything that matches (m={...}) the regular expression pattern ^PID(\d+)$ with just the captured part, removing the zero padding by running the captured part through the function lambda x: int(x); and then output the result to a standard concept column for the participant ID.”
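To see what that dict entry does in plain Python (a standalone illustration, not library code):

import re

# The capture group grabs just the digits, and the mapped function
# converts them to an int, which strips any zero padding:
captured = re.match(r"^PID(\d+)$", "PID001").group(1)  # "001"
print(int(captured))  # 1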

The resulting intermediate output will look like:

| index | <CONCEPT.PARTICIPANT.ID> |
|-------|--------------------------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |

lambda x: int(x) could be replaced by just int, since the two expressions are functionally equivalent (both single-argument functions that effectively strip the leading zeros).

We could also have kept these IDs as they were and instead converted the mother/father IDs, but, in the absence of an overriding directive such as input from the investigators about their preferences, it doesn’t really make a difference which way we choose.

A keep-the-original-values map operation

keep_map(
    in_col="mother",
    out_col=CONCEPT.PARTICIPANT.MOTHER_ID
)

This says “Put all of the values from the "mother" column into a standard concept column for the ID of the participant’s mother, but keep all of the values the same.” keep_map is the same as a value_map where every value is mapped to itself.

The resulting intermediate output will look like:

| index | <CONCEPT.PARTICIPANT.MOTHER_ID> |
|-------|---------------------------------|
| 0 | 2 |
| 1 |  |
| 2 |  |
| 3 | 5 |
| 4 |  |
| 5 |  |
| 6 | 8 |
| 7 |  |
| 8 |  |

A value map operation with variable replacements

value_map(
    in_col="gender",
    m={
        "F": constants.GENDER.FEMALE,
        "M": constants.GENDER.MALE
    },
    out_col=CONCEPT.PARTICIPANT.GENDER
)

This says “Use the gender column as input, replace everything that matches the regular expression pattern ^F$ with the standard code for Female and replace everything that matches ^M$ with the standard code for Male, and then output the result to a standard concept column for participant gender.”

The resulting intermediate output will look like:

| index | <CONCEPT.PARTICIPANT.GENDER> |
|-------|------------------------------|
| 0 | Female |
| 1 |  |
| 2 |  |
| 3 | Male |
| 4 |  |
| 5 |  |
| 6 | Male |
| 7 |  |
| 8 |  |

A value map that splits cells apart

value_map(
    in_col="specimens",
    m=lambda x: Split(re.split("[,;]", x)),
    out_col=CONCEPT.BIOSPECIMEN.ID
)

This says “Use the specimens column as input, split any , or ; delimited values apart into their own entries, and then output the result to a standard concept column for biospecimen ID.”

Note

We use the special Split() object for lists of values that we want to split apart into multiple rows. Just returning a list will not split the contained items apart. Read Splitting Cells.
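As a plain illustration of what the mapping function produces for one cell:

import re
from kf_lib_data_ingest.common.pandas_utils import Split

# re.split returns a plain list; wrapping it in Split marks the values
# for expansion into separate rows by the extract stage:
parts = re.split("[,;]", "SP001A,SP001B")  # ["SP001A", "SP001B"]
marked = Split(parts)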

The resulting intermediate output will look like:

| index | <CONCEPT.BIOSPECIMEN.ID> |
|-------|--------------------------|
| 0 | SP001A |
| 0 | SP001B |
| 1 | SP002A |
| 1 | SP002B |
| 2 | SP003A |
| 2 | SP003B |
| 3 | SP004A |
| 3 | SP004B |
| 4 | SP005A |
| 4 | SP005B |
| 5 | SP006 |
| 6 | SP007 |
| 7 | SP008A |
| 7 | SP008B |
| 8 | SP009A |
| 8 | SP009B |


A melt map operation

melt_map(
    var_name=CONCEPT.PHENOTYPE.NAME,
    map_for_vars={
        "CLEFT_EGO": "Cleft ego",
        "CLEFT_ID": "Cleft id"
    },
    value_name=CONCEPT.PHENOTYPE.OBSERVED,
    map_for_values=observed_yes_no
)

This says “Generate new standard concept columns for phenotype name and observation by melting the CLEFT_EGO and CLEFT_ID columns (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) into the variables Cleft ego and Cleft id, and map the TRUE/FALSE values by passing them through the included observed_yes_no function.”
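If you want to see the reshaping part in isolation, here is plain pandas melt doing the same kind of pivot (a sketch; melt_map layers the variable renaming and value mapping on top of this):

import pandas as pd

df = pd.DataFrame({
    "CLEFT_EGO": ["TRUE", "FALSE"],
    "CLEFT_ID": ["FALSE", "TRUE"],
})
# melt turns the two columns into (variable, value) pairs, one row each:
melted = pd.melt(df, var_name="phenotype_name", value_name="observed")
print(melted)
#   phenotype_name observed
# 0      CLEFT_EGO     TRUE
# 1      CLEFT_EGO    FALSE
# 2       CLEFT_ID    FALSE
# 3       CLEFT_ID     TRUE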

The resulting intermediate output will look like:

| index | <CONCEPT.PHENOTYPE.NAME> | <CONCEPT.PHENOTYPE.OBSERVED> |
|-------|--------------------------|------------------------------|
| 0 | Cleft ego | Positive |
| 1 | Cleft ego | Positive |
| 2 | Cleft ego | Positive |
| 3 | Cleft ego | Positive |
| 4 | Cleft ego | Positive |
| 5 | Cleft ego | Positive |
| 6 | Cleft ego | Positive |
| 7 | Cleft ego | Positive |
| 8 | Cleft ego | Negative |
| 0 | Cleft id | Negative |
| 1 | Cleft id | Negative |
| 2 | Cleft id | Negative |
| 3 | Cleft id | Positive |
| 4 | Cleft id | Positive |
| 5 | Cleft id | Positive |
| 6 | Cleft id | Negative |
| 7 | Cleft id | Positive |
| 8 | Cleft id | Positive |

A nested operation sub-list

[
    value_map(
        in_col=6,  # age (hrs) (first)
        m=lambda x: int(x) / 24,
        out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
    ),
    melt_map(
        var_name=CONCEPT.PHENOTYPE.NAME,
        map_for_vars={
            "CLEFT_EGO": "Cleft ego",
            "CLEFT_ID": "Cleft id"
        },
        value_name=CONCEPT.PHENOTYPE.OBSERVED,
        map_for_values=observed_yes_no
    )
]

Having a sub-list says “Treat the enclosed operations as a single logically-linked unit”.

For this particular scenario, it gives us a way to say that these phenotype columns go with this age column and not with the other age column. It should also always be possible to accomplish the same thing by moving those operations into a separate extract configuration file.

The resulting intermediate output for both of these operations together will look like:

| index | <CONCEPT.PHENOTYPE.EVENT_AGE_DAYS> | <CONCEPT.PHENOTYPE.NAME> | <CONCEPT.PHENOTYPE.OBSERVED> |
|-------|------------------------------------|--------------------------|------------------------------|
| 0 | 0.166667 | Cleft ego | Positive |
| 1 | 18.125 | Cleft ego | Positive |
| 2 | 1.416667 | Cleft ego | Positive |
| 3 | 0.166667 | Cleft ego | Positive |
| 4 | 14.375 | Cleft ego | Positive |
| 5 | 1.416667 | Cleft ego | Positive |
| 6 | 1.416667 | Cleft ego | Positive |
| 7 | 1814.375 | Cleft ego | Positive |
| 8 | 0.208333 | Cleft ego | Negative |
| 0 | 0.166667 | Cleft id | Negative |
| 1 | 18.125 | Cleft id | Negative |
| 2 | 1.416667 | Cleft id | Negative |
| 3 | 0.166667 | Cleft id | Positive |
| 4 | 14.375 | Cleft id | Positive |
| 5 | 1.416667 | Cleft id | Positive |
| 6 | 1.416667 | Cleft id | Negative |
| 7 | 1814.375 | Cleft id | Positive |
| 8 | 0.208333 | Cleft id | Positive |

The final Extraction product

Once all of the operations are complete and the extract stage has done its magic, the final extracted result given the data and our configuration is:

| index | <CONCEPT.PARTICIPANT.ID> | <CONCEPT.PARTICIPANT.MOTHER_ID> | <CONCEPT.PARTICIPANT.FATHER_ID> | <CONCEPT.PARTICIPANT.GENDER> | <CONCEPT.BIOSPECIMEN.ID> | <CONCEPT.PHENOTYPE.EVENT_AGE_DAYS> | <CONCEPT.PHENOTYPE.NAME> | <CONCEPT.PHENOTYPE.OBSERVED> |
|-------|--------------------------|---------------------------------|---------------------------------|------------------------------|--------------------------|------------------------------------|--------------------------|------------------------------|
| 0 | 1 | 2 | 3 | Female | SP001A | 0.166666667 | Cleft ego | Positive |
| 0 | 1 | 2 | 3 | Female | SP001B | 0.166666667 | Cleft ego | Positive |
| 1 | 2 |  |  |  | SP002A | 18.125 | Cleft ego | Positive |
| 1 | 2 |  |  |  | SP002B | 18.125 | Cleft ego | Positive |
| 2 | 3 |  |  |  | SP003A | 1.416666667 | Cleft ego | Positive |
| 2 | 3 |  |  |  | SP003B | 1.416666667 | Cleft ego | Positive |
| 3 | 4 | 5 | 6 | Male | SP004A | 0.166666667 | Cleft ego | Positive |
| 3 | 4 | 5 | 6 | Male | SP004B | 0.166666667 | Cleft ego | Positive |
| 4 | 5 |  |  |  | SP005A | 14.375 | Cleft ego | Positive |
| 4 | 5 |  |  |  | SP005B | 14.375 | Cleft ego | Positive |
| 5 | 6 |  |  |  | SP006 | 1.416666667 | Cleft ego | Positive |
| 6 | 7 | 8 | 9 | Male | SP007 | 1.416666667 | Cleft ego | Positive |
| 7 | 8 |  |  |  | SP008A | 1814.375 | Cleft ego | Positive |
| 7 | 8 |  |  |  | SP008B | 1814.375 | Cleft ego | Positive |
| 8 | 9 |  |  |  | SP009A | 0.208333333 | Cleft ego | Negative |
| 8 | 9 |  |  |  | SP009B | 0.208333333 | Cleft ego | Negative |
| 0 | 1 | 2 | 3 | Female | SP001A | 0.166666667 | Cleft id | Negative |
| 0 | 1 | 2 | 3 | Female | SP001B | 0.166666667 | Cleft id | Negative |
| 1 | 2 |  |  |  | SP002A | 18.125 | Cleft id | Negative |
| 1 | 2 |  |  |  | SP002B | 18.125 | Cleft id | Negative |
| 2 | 3 |  |  |  | SP003A | 1.416666667 | Cleft id | Negative |
| 2 | 3 |  |  |  | SP003B | 1.416666667 | Cleft id | Negative |
| 3 | 4 | 5 | 6 | Male | SP004A | 0.166666667 | Cleft id | Positive |
| 3 | 4 | 5 | 6 | Male | SP004B | 0.166666667 | Cleft id | Positive |
| 4 | 5 |  |  |  | SP005A | 14.375 | Cleft id | Positive |
| 4 | 5 |  |  |  | SP005B | 14.375 | Cleft id | Positive |
| 5 | 6 |  |  |  | SP006 | 1.416666667 | Cleft id | Positive |
| 6 | 7 | 8 | 9 | Male | SP007 | 1.416666667 | Cleft id | Negative |
| 7 | 8 |  |  |  | SP008A | 1814.375 | Cleft id | Positive |
| 7 | 8 |  |  |  | SP008B | 1814.375 | Cleft id | Positive |
| 8 | 9 |  |  |  | SP009A | 0.208333333 | Cleft id | Positive |
| 8 | 9 |  |  |  | SP009B | 0.208333333 | Cleft id | Positive |
| 0 | 1 | 2 | 3 | Female | SP001A | 0.166666667 | Extra eardrum | Negative |
| 0 | 1 | 2 | 3 | Female | SP001B | 0.166666667 | Extra eardrum | Negative |
| 1 | 2 |  |  |  | SP002A | 18.125 | Extra eardrum | Negative |
| 1 | 2 |  |  |  | SP002B | 18.125 | Extra eardrum | Negative |
| 2 | 3 |  |  |  | SP003A | 1.416666667 | Extra eardrum | Negative |
| 2 | 3 |  |  |  | SP003B | 1.416666667 | Extra eardrum | Negative |
| 3 | 4 | 5 | 6 | Male | SP004A | 0.166666667 | Extra eardrum | Negative |
| 3 | 4 | 5 | 6 | Male | SP004B | 0.166666667 | Extra eardrum | Negative |
| 4 | 5 |  |  |  | SP005A | 1.416666667 | Extra eardrum | Negative |
| 4 | 5 |  |  |  | SP005B | 1.416666667 | Extra eardrum | Negative |
| 5 | 6 |  |  |  | SP006 | 1814.375 | Extra eardrum | Negative |
| 6 | 7 | 8 | 9 | Male | SP007 | 0.208333333 | Extra eardrum | Positive |
| 7 | 8 |  |  |  | SP008A | 2.166666667 | Extra eardrum | Positive |
| 7 | 8 |  |  |  | SP008B | 2.166666667 | Extra eardrum | Positive |
| 8 | 9 |  |  |  | SP009A | 1.041666667 | Extra eardrum | Positive |
| 8 | 9 |  |  |  | SP009B | 1.041666667 | Extra eardrum | Positive |

Preprocessing Before the Operations List

Source data can’t always be mapped right away; sometimes your preferred mapping operations require an initial pre-processing step, such as transposing the table to turn rows into columns.

Initial manipulations can be done in a do_after_read function that you may optionally define in your extract configuration file like this:

# This function will be called before performing the mapping operations list.
def do_after_read(df):
    # do something with df
    return new_df
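For example, if the source table were oriented sideways, a minimal do_after_read might transpose it. This is only a sketch; what you do here depends entirely on your data:

def do_after_read(df):
    # Turn rows into columns so that the mapping operations can
    # address the data as normal columns.
    return df.transpose()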

Files That Need Custom Readers

Source data can’t always be read automatically by the default tabular read functions. We provide readers for CSV/TSV/Excel sheets, but maybe the data you need is embedded in a PDF or PowerPoint file.

If the right reader cannot be automatically inferred, you can define a custom function called source_data_read_func that takes a local file path and keyword arguments (also given in source_data_read_params) and returns a pandas DataFrame:

# a special custom file reader which must return a pandas DataFrame
def source_data_read_func(filepath, **kwargs):
    return my_df
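As a sketch, a custom reader for a fixed-width text file could delegate to pandas (read_fwf is a standard pandas reader; your real reader depends on your format):

import pandas as pd

def source_data_read_func(filepath, **kwargs):
    # kwargs are populated from source_data_read_params
    return pd.read_fwf(filepath, **kwargs)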

Predefined Mapping Operations

Note

Mapping operations share common input parameters m, in_col, and out_col.

in_col: The name or numerical index of the column in the input data file.

out_col: The name of the standard concept to use for the output column in the DataFrame containing extracted data.

m: The magic (or the “map”). Depending on the mapping operation, this will be a dict, function, or constant value.

Keep Map

def keep_map(in_col, out_col):

The keep_map operation copies all of the values from in_col to out_col without modifying the values.

keep_map is a special case of value_map with m=lambda x: x.

Constant Map

def constant_map(m, out_col):

The constant_map operation writes N iterations of constant value m to out_col, where N is the number of rows in the source data table.

constant_map is a special case of value_map with m=lambda x: <some constant>.
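For example, to stamp every row with one value that appears nowhere in the source data (the output column name here is just a placeholder):

constant_map(
    m="Example value",              # written to every output row
    out_col="<example_output_col>"
)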

Caution

Only use constant_map for columns that are not in the data. Why? What if the column is 99% one value, so you think “I’ll just constant map to that value”, but then you change the other 1% of values accidentally because you never saw them? Or what if the next version of the data introduces a new value? Your error. Don’t do it. Use value_map instead for existing columns.

Value Map

def value_map(m, in_col, out_col):

The value_map operation takes values from in_col, performs some transformation m on them, and then writes the result to out_col in your result dataframe.

m can be a function, dict, or regex pattern string (examples of all three follow the note below):

  • If it’s a function, the returned column values will be the result of applying that function to each value in in_col.

  • If it’s a dict, each key is a regex pattern string, and each value is either a direct replacement or a function applied to values in in_col matching the pattern.

  • If it’s a regex pattern string, the result is the same as if it had been {m: lambda x: x}.

Note

  • Regex patterns without ^ and $ anchors will have them automatically added.

  • m dict regex pattern matches are checked in order, and values from in_col are each only matched once, so {"a": "b", "b": "c"} will never convert “a” into “c”.

  • If a regex pattern includes capture groups, the receiving function will receive only the captures instead of the whole cell value.
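Here are the three forms of m side by side (sketches using placeholder output columns):

# m as a function: applied to every value in in_col
value_map(
    in_col="age (hrs)",
    m=lambda x: int(x) / 24,
    out_col="<example_output_col>"
)

# m as a dict: regex keys with replacement values or functions
value_map(
    in_col="gender",
    m={"F": constants.GENDER.FEMALE, "M": constants.GENDER.MALE},
    out_col="<example_output_col>"
)

# m as a regex pattern string: same as {m: lambda x: x}, so matching
# values are kept (only the capture, if the pattern has one)
value_map(
    in_col="participant",
    m=r"PID(\d+)",
    out_col="<example_output_col>"
)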

Row Map

def row_map(m, out_col):

The row_map operation builds a single output column from entire rows of source data, for when each output value needs to combine data from multiple cells in its row. m is a function that takes an entire row as input and returns a single value derived from that row.
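For example, combining two cells from each row into one output value (the input column names here are hypothetical):

row_map(
    # m receives a whole row and returns one value
    m=lambda row: "{}-{}".format(row["family_id"], row["participant_id"]),
    out_col="<example_output_col>"
)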

Column Map

def column_map(m, in_col, out_col):

The column_map operation lets you pass an entire column of data into a custom function m which returns a new column (of the same length) as the result.

Note

Needing to use column_map is a sign that the original data is poorly constructed.
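For example, assuming the column arrives as a pandas Series, a whole-column transformation might look like this (a sketch):

column_map(
    # m receives the whole column and must return a column of equal length
    m=lambda col: col.str.strip().str.upper(),
    in_col="specimens",
    out_col="<example_output_col>"
)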

Melt Map

def melt_map(var_name, map_for_vars, value_name, map_for_values):

The melt_map operation combines the pandas melt function and the value_map operation. It both reshapes data and maps values.

  • var_name is the melt’s resulting var column name.

  • map_for_vars is a dict with keys equal to column names or numeric indices from the source data and values equal to replacement var names.

  • value_name is the melt’s resulting value column name.

  • map_for_values is the same as value_map’s m.

Note

Any use of melt_map can be replaced by multiple uses of value_map with appropriate column stacking.

DF Map

def df_map(m):

The df_map operation is for when you just can’t figure out how to do what you want to do with the other predefined operations. Take a whole DataFrame, perform some function m on it, and return a new DataFrame of the same size.

Note

You’ll probably never use df_map. That’s fine. It’s there for geometric completeness. Needing to use it is a sign that the original data is poorly constructed.

Important Strategic Details

Avoid Hard-coded Strings

Avoid using hard-coded strings for output column headers and values. As much as possible, favor variables in common.constants for cell values and variables in common.concept_schema.CONCEPT for out_col assignment. That way everyone has the chance to use the same shared set of values.
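In other words, prefer the first form below to the second (the strings in the second form are exactly the kind of hard-coding to avoid):

# Preferred: shared constants that everyone can rely on
value_map(
    in_col="gender",
    m={"F": constants.GENDER.FEMALE, "M": constants.GENDER.MALE},
    out_col=CONCEPT.PARTICIPANT.GENDER
)

# Avoid: hard-coded strings that nobody else can rely on
value_map(
    in_col="gender",
    m={"F": "Female", "M": "Male"},
    out_col="gender"
)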

Repeated Information

Multiple identical declarations don’t add new information and therefore don’t matter. They also don’t hurt anything. This table means exactly the same thing as if it had only one row:

| Participant ID | Specimen ID | Participant Age | Participant Sex | Participant Race |
|----------------|-------------|-----------------|-----------------|------------------|
| P1 | S1 | 7 | m | unknown |
| P1 | S1 | 7 | m | unknown |

You Don’t Need To Extract The Whole Document All At Once

If it is easier for you to create multiple extract configurations for a single source data file, producing several smaller tables during Extract that can then be manipulated in a Transform function, than to build one complex table for the whole file, then do it. If not, that’s fine too.

Parallel Column Stacking

If the same out_col is used multiple times, the column output from successive operations will be vertically concatenated.

[Image: ../../_images/columns.svg — parallel column stacking]

Length Rectification

The length disparity between output columns produced by Parallel Column Stacking is then rectified by least-common-multiple repetition of output columns, which gives a final result that preserves row colinearity with the original source data. Note in the next image that A is in line with A, B is in line with B, etc.

[Image: ../../_images/length_rectification.svg — length rectification]

Caution

The lengths of all output columns must be multiples of the original source data length. If they aren’t, the library will raise an error.
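As a sketch of the arithmetic (not library code): with stacked output columns of lengths 9 and 18 from a 9-row source, the least common multiple is 18, so the 9-row column is repeated end-to-end twice and original row i stays in line with itself in every output column:

import math

source_len = 9
col_lengths = [9, 18]                      # lengths of two stacked output columns
lcm = math.lcm(*col_lengths)               # 18, a multiple of source_len
repeats = [lcm // n for n in col_lengths]  # [2, 1]: whole-column repetitions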

Splitting Cells

Sometimes you will have source documents that lump multiple distinct values into the same table cell. For instance, study participants might have multiple specimens reported together as follows:

| Participant | Specimens |
|-------------|-----------|
| P1 | SP1a/SP1b |
| P2 | SP2 |
| P3 | SP3a/SP3b/SP3c |

Because of the practical implications of the Value Principles, you will want to have independent associations (P1, SP1a), (P1, SP1b), etc, so you need to separate the “/”-delimited specimen values into separate entries. Replacing each of the Specimens cells with the appropriate Split object will tell the Extract Stage to do the right thing.

The Split Object

The Split object (common.pandas_utils.Split) is instantiated with a list of values and optionally a group ID. After all of the extract mapping operations are finished, any Split objects found in your output will trigger a “split” of the row into multiple rows with identical non-Split cells and each with a different value from among the Split members.

For instance, from:

| A | B | C |
|---|---|---|
| Split([1, 2, 3]) | Split([4, 5, 6]) | 7 |

to:

| A | B | C |
|---|---|---|
| 1 | 4 | 7 |
| 1 | 5 | 7 |
| 1 | 6 | 7 |
| 2 | 4 | 7 |
| 2 | 5 | 7 |
| 2 | 6 | 7 |
| 3 | 4 | 7 |
| 3 | 5 | 7 |
| 3 | 6 | 7 |

Without groups, multiple Splits on the same row will multiply to produce the cartesian product of the values in those Splits. However, if Splits are assigned a group value, then Splits on the same row in the same group will be linked to each other so that they do not form a cartesian product with each other.

So this goes from:

| A | B | C |
|---|---|---|
| Split([8, 9], group=1) | Split([10, 11, 12], group=1) | 13 |

to:

| A | B | C |
|---|---|---|
| 8 | 10 | 13 |
| 9 | 11 | 13 |
| None | 12 | 13 |

And if you combine both of these forms together, you go from:

| A | B | C |
|---|---|---|
| Split([1, 2], group=1) | Split([3, 4, 5], group=1) | Split([6, 7]) |

to:

| A | B | C |
|---|---|---|
| 1 | 3 | 6 |
| 2 | 4 | 6 |
| None | 5 | 6 |
| 1 | 3 | 7 |
| 2 | 4 | 7 |
| None | 5 | 7 |

Going back to the participants and specimens example, you might use a Split object as part of a value_map operation like this:

value_map(
    in_col="Specimens",
    out_col="<example_output_col>",
    m=lambda x: Split(x.split("/"))
)

Note

Interpretation of Split objects happens after all the mapping operations are complete, so the newly created rows will not be affected by the length-multiple restriction cautioned in Parallel Column Stacking.

Nested Operations Sublists

You can nest groups of operations inside of sublists to do length rectification on just the group before joining it with the rest of the operations output.

It looks like this:

operations = [
    [
        # operations group 1
    ],
    [
        # operations group 2
    ],
    # etc
]

The reason for doing this is that Parallel Column Stacking can lead to row mismatch problems in scenarios where several columns form logical groups with each other and the mapping operations don’t all produce the same length output.

Consider the situation of extracting KFDRC phenotypes from:

| Participant | Age | Mass | Cleft Ear | Mass/Cleft Age | Tennis Fingers | Tennis Age |
|-------------|-----|------|-----------|----------------|----------------|------------|
| P1 | 70 | yes | yes | 60 | no | 65 |
| P2 | 80 | no | yes | 80 | no | 45 |

Note that there are two different groups of phenotype measurements recorded at different ages.

A naïve and incorrect approach would be to do this:

operations = [
    melt_map(
        var_name=CONCEPT.PHENOTYPE.NAME,
        map_for_vars={
            "Mass": "Mass",
            "Cleft Ear": "Cleft Ear"
        },
        value_name=CONCEPT.PHENOTYPE.OBSERVED,
        map_for_values={
            "yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
            "no": constants.PHENOTYPE.OBSERVED.NEGATIVE
        }
    ),
    keep_map(
        in_col="Mass/Cleft Age",
        out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
    ),
    melt_map(
        var_name=CONCEPT.PHENOTYPE.NAME,
        map_for_vars={
            "Tennis Fingers": "Tennis Fingers"
        },
        value_name=CONCEPT.PHENOTYPE.OBSERVED,
        map_for_values={
            "yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
            "no": constants.PHENOTYPE.OBSERVED.NEGATIVE
        }
    ),
    keep_map(
        in_col="Tennis Age",
        out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
    )
]

But this will produce the following bad result, because the operations stack output columns of different lengths:

[Image: ../../_images/bad_melt.svg — stacked output columns of mismatched lengths]

Then the Length Rectification will produce this mess:

[Image: ../../_images/wrong.svg — incorrect result after length rectification]

If you need to group values together and also stack columns that belong to the groups, all of the columns in each group must be the same length, otherwise groups will invade each others’ space. When melting multiple columns, this is naturally not going to be the case.

Putting each of the groups in its own nested sublist solves this problem by rectifying lengths for each of the groups independently first.

operations = [
    [   # mass/cleft group
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "Mass": "Mass",
                "Cleft Ear": "Cleft Ear"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values={
                "yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
                "no": constants.PHENOTYPE.OBSERVED.NEGATIVE
            }
        ),
        keep_map(
            in_col="Mass/Cleft Age",
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        ),
    ],
    [   # tennis fingers group
        melt_map(
            var_name=CONCEPT.PHENOTYPE.NAME,
            map_for_vars={
                "Tennis Fingers": "Tennis Fingers"
            },
            value_name=CONCEPT.PHENOTYPE.OBSERVED,
            map_for_values={
                "yes": constants.PHENOTYPE.OBSERVED.POSITIVE,
                "no": constants.PHENOTYPE.OBSERVED.NEGATIVE
            }
        ),
        keep_map(
            in_col="Tennis Age",
            out_col=CONCEPT.PHENOTYPE.EVENT_AGE_DAYS
        )
    ]
]

So now you get this instead:

[Image: ../../_images/good_melt.svg — correct result with grouped sublists]