Running the Full Matrix Pipeline
Now that your Docker environment is optimized for large data processing, you can run the complete Matrix pipeline. There are two main approaches depending on your needs:
- Full e2e run: run the data engineering pipeline first to process the raw data sources and create your own data release, then use that release as input to the feature/modelling pipeline.
- Using existing releases: use a release already created by the EC team and run only the feature & modelling pipeline.
Note that even though the feature/modelling pipeline takes less time, both approaches are very time- and compute-intensive because they process very large datasets. When parallelizing the pipeline on our Kubernetes cluster, we can complete an e2e run in under 16 hours; when limited to a single instance, expect it to run for more than 24 hours.
Option 1: Full e2e run
The first step is to run the data engineering pipeline with the desired raw data sources. By default, all non-proprietary raw data sources are enabled in our pipeline:
DYNAMIC_PIPELINES_MAPPING = lambda: disable_private_datasets(
generate_dynamic_pipeline_mapping(
{
"cross_validation": {
"n_cross_val_folds": 3,
},
"integration": [
{"name": "rtx_kg2", "integrate_in_kg": True, "is_private": False},
{"name": "spoke", "integrate_in_kg": True, "is_private": True}, # NOTE: will only be ingested by users who are part of matrix project & have granted access to proprietary datasets
{"name": "embiology", "integrate_in_kg": True, "is_private": True}, # NOTE: will only be ingested by users who are part of matrix project & have granted access to proprietary datasets
{"name": "robokop", "integrate_in_kg": True, "is_private": False},
{"name": "ec_medical_team", "integrate_in_kg": True},
{"name": "drug_list", "integrate_in_kg": False, "has_edges": False},
{"name": "disease_list", "integrate_in_kg": False, "has_edges": False},
{"name": "ground_truth", "integrate_in_kg": False, "has_nodes": False},
# {"name": "drugmech", "integrate_in_kg": False, "has_nodes": False},
{"name": "ec_clinical_trials", "integrate_in_kg": False},
{"name": "off_label", "integrate_in_kg": False, "has_nodes": False},
],
}
)
)
If you don't want a dataset integrated into the knowledge graph, set integrate_in_kg to False. Note however that to successfully run the pipeline e2e, you will need to ingest, normalize and process:
- A Knowledge Graph - to be used to calculate topological embeddings for drugs & diseases
- A Ground Truth Set - to train a predictive ML model
- Drug List & Disease List - to run inference on combinations of ~60M drug-disease pairs
- Evaluation sets (e.g. clinical trials & off label) - to run tests evaluating performance
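Conceptually, the disable_private_datasets wrapper in the mapping above removes (or disables) the entries flagged is_private for users without proprietary-data access. The following is a minimal sketch of that behaviour under the assumption that private entries are simply dropped; the actual implementation in the Matrix repo may differ.

```python
# Hypothetical sketch of disable_private_datasets; the real function
# lives in the Matrix codebase and may behave differently. Here we
# assume it drops integration entries flagged is_private=True.

def disable_private_datasets(mapping: dict) -> dict:
    """Remove integration entries marked as private (assumed behaviour)."""
    integration = mapping.get("integration", [])
    mapping["integration"] = [
        ds for ds in integration if not ds.get("is_private", False)
    ]
    return mapping

config = disable_private_datasets(
    {
        "integration": [
            {"name": "rtx_kg2", "integrate_in_kg": True, "is_private": False},
            {"name": "spoke", "integrate_in_kg": True, "is_private": True},
        ]
    }
)
names = [ds["name"] for ds in config["integration"]]
print(names)  # ['rtx_kg2']
```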
Modifying data versions
It's also possible to modify which exact version of each dataset you want to ingest - if you are interested in this, please go to the walkthrough.
2. Kick off the run
Once you have disabled/enabled the datasets of interest, you can kick off the Matrix pipeline simply by running:
kedro run -p data_engineering -e base
Option 2: Run from a specific release
The first step is to modify your .env setup to ensure you are using the right pipeline configuration.
Step 1: Set Environment Variables
Create or modify your .env file:
# Set a unique run name for your run
RUN_NAME=my-full-data-run
# Set the release version; this has to match the version of the existing release you want to use
RELEASE_VERSION=v0.7.0
RELEASE_FOLDER_NAME=releases
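The variables above can be loaded however your setup expects (Docker, python-dotenv, or your shell). As a minimal dependency-free sketch, a simple KEY=VALUE parser for such a file could look like this; the function name and approach are illustrative, not part of the Matrix codebase.

```python
# Minimal sketch of parsing a .env file without extra dependencies.
# In practice you'd likely use python-dotenv or let Docker inject the
# variables; this just illustrates the expected key/value format.

def load_env(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Usage: env = load_env(".env"); os.environ.update(env)
```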
Step 2: Feature/Modelling pipeline
The feature pipeline can be used to extract only the subgraph of interest from the release. As mentioned in the first steps section, you can adjust the parameters file to select which features/graphs you want to exclude or keep for your run.
filtering:
node_filters:
filter_sources:
_object: matrix.pipelines.filtering.filters.KeepRowsContaining
column: upstream_data_source
keep_list:
- rtxkg2
# - robokop # Uncomment to include ROBOKOP data
# ...
edge_filters:
filter_sources:
_object: matrix.pipelines.filtering.filters.KeepRowsContaining
column: upstream_data_source
keep_list:
- rtxkg2
# - robokop
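The effect of a KeepRowsContaining-style filter is to retain only the nodes/edges whose upstream_data_source value is in keep_list. The sketch below illustrates those semantics in plain Python; the real filter runs on the pipeline's dataframes and may match values differently (e.g. substring matching), so treat this as an assumption.

```python
# Illustrative sketch of a KeepRowsContaining-style filter: keep rows
# whose `column` value appears in `keep_list`. The actual Matrix filter
# operates on dataframes inside the pipeline; this only shows the idea.

def keep_rows_containing(rows, column, keep_list):
    keep = set(keep_list)
    return [row for row in rows if row.get(column) in keep]

nodes = [
    {"id": "n1", "upstream_data_source": "rtxkg2"},
    {"id": "n2", "upstream_data_source": "robokop"},
]
kept = keep_rows_containing(nodes, "upstream_data_source", ["rtxkg2"])
print([r["id"] for r in kept])  # ['n1']
```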
You might also want to tune the embedding parameters:
embeddings.topological_estimator:
_object: matrix.pipelines.embeddings.graph_algorithms.GDSNode2Vec
concurrency: 4
embedding_dim: 512
random_seed: 42
iterations: 10
walk_length: 30
walks_per_node: 10
window_size: 10
Neo4J Requirements
At the moment, the feature pipeline also relies on a Neo4j instance with a large amount of memory. Make sure your Docker instance of Neo4j has an appropriate amount of memory allocated (as we have specified in the cloud parameters).
Neo4J Memory Settings
Please add the following to the Neo4j environment file or settings (as per your Neo4j installation instructions). Adjust the values to your hardware.
NEO4J_server_memory_heap_initial__size=40g
NEO4J_server_memory_heap_max__size=40g
NEO4J_server_memory_pagecache_size=8g
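If you run Neo4j via Docker Compose, the same settings can be passed as environment variables. This fragment is a sketch, not part of the Matrix repo: the image tag, ports, and service name are assumptions to adapt to your own setup.

```yaml
# Hypothetical docker-compose fragment; image tag and ports are assumptions.
services:
  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474"   # HTTP browser
      - "7687:7687"   # Bolt
    environment:
      - NEO4J_server_memory_heap_initial__size=40g
      - NEO4J_server_memory_heap_max__size=40g
      - NEO4J_server_memory_pagecache_size=8g
```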
Once you have that ready, you can run
# Run after data engineering runs to completion
kedro run -e base -p feature
After completing the feature extraction step, you should be ready to kick off the modelling run. As mentioned in the first steps section, make sure you select the classifier and train-test split of interest: by default we use a stratified randomized train-test split with an ensemble XGBoost model.
Once you are ready, you can kick off:
# Run after the feature pipeline runs to completion
kedro run -e base -p modelling_run
Pipeline Output
After successful completion, you'll have intermediate data products saved in your data directory, with each pipeline writing to its own subdirectory.