Ground Truth Lists

The MATRIX pipeline integrates several ground truth datasets to train ML models and to validate and evaluate our drug repurposing predictions. Each dataset represents relationships between drugs and diseases, following our standard edge schema with subject (drug), object (disease), predicate, and metadata fields. Many of these relationships map directly to edges in our Knowledge Graph; some, however, are extracted independently of the KG, which makes them valuable for evaluation.
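The edge schema described above can be pictured as a small record type. This is a minimal sketch, not the pipeline's actual schema definition: the class name `GroundTruthEdge`, the `key` helper, the `biolink:treats` predicate, and the example identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruthEdge:
    """One drug-disease relationship in subject/object/predicate/metadata form."""

    subject: str  # drug identifier (CURIE)
    object: str  # disease identifier (CURIE)
    predicate: str  # relationship type
    metadata: dict = field(default_factory=dict)  # provenance, labels, etc.

    def key(self) -> tuple:
        """Identity of the edge, ignoring metadata."""
        return (self.subject, self.predicate, self.object)


# Illustrative values only; not taken from any MATRIX dataset.
edge = GroundTruthEdge(
    subject="CHEBI:6801",
    object="MONDO:0005148",
    predicate="biolink:treats",
    metadata={"source": "illustrative"},
)
print(edge.key())
```

Keeping metadata out of `key()` means two records of the same drug-disease pair from different sources compare as the same relationship.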

All of these data sources require identifiers (subjects/objects) to follow the KGX format (see Matrix Validator). Like other data sources in our pipeline, each dataset has a dedicated transformer class that handles transformation and integration. Each dataset also goes through normalization so that all CURIEs end up in the same identifier space.
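To make the identifier requirements concrete, here is a rough sketch of a CURIE shape check followed by a normalization step. It is not the Matrix Validator or the pipeline's normalizer: the regular expression is a simplified `prefix:local_id` test, and the `equivalents` mapping is a hypothetical stand-in for whatever normalization source the pipeline uses.

```python
import re

# Simplified CURIE shape: a prefix, a colon, then a non-empty local ID.
# Assumption: the real validation rules are stricter than this.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def is_curie(identifier: str) -> bool:
    """Return True if the identifier looks like prefix:local_id."""
    return bool(CURIE_PATTERN.match(identifier))


def normalize_curie(identifier: str, equivalents: dict) -> str:
    """Map an identifier to its preferred form, or keep it if unmapped."""
    return equivalents.get(identifier, identifier)


# Hypothetical mapping from a source-specific ID to a preferred ID.
equivalents = {"EXAMPLE.SOURCE:123": "EXAMPLE.PREFERRED:abc"}

for raw in ["EXAMPLE.SOURCE:123", "EXAMPLE.PREFERRED:xyz", "not a curie"]:
    if is_curie(raw):
        print(raw, "->", normalize_curie(raw, equivalents))
    else:
        print(raw, "-> rejected")
```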

Training Datasets

These datasets are used to train our ML classifiers for predicting drug repurposing candidates. They can be used individually or in combination, depending on the parameters specified in the modelling configuration file. Note that private datasets are only accessible to internal MATRIX developers.
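As a loose illustration of selecting training sources, the sketch below merges the drug-disease pairs of whichever datasets are enabled. The dataset keys (`kgml_xdtd`, `ec`), the `enabled` list, and the function name are hypothetical; the actual parameter names in the modelling configuration file are not shown here.

```python
def select_training_pairs(datasets: dict, enabled: list) -> list:
    """Union of (drug, disease) pairs from the enabled datasets, first-seen order."""
    seen = set()
    combined = []
    for name in enabled:
        for pair in datasets[name]:
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined


# Hypothetical dataset contents, keyed by hypothetical dataset names.
datasets = {
    "kgml_xdtd": [("DRUG:1", "DISEASE:1"), ("DRUG:2", "DISEASE:2")],
    "ec": [("DRUG:2", "DISEASE:2"), ("DRUG:3", "DISEASE:3")],
}

single = select_training_pairs(datasets, ["ec"])
both = select_training_pairs(datasets, ["kgml_xdtd", "ec"])
print(len(single), len(both))
```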

KGML-xDTD Ground Truth Dataset

This ground truth dataset was developed and published as part of the KGML-xDTD publication and is used for model training and validation. It was specifically designed and validated for use with the RTX-KG2 knowledge graph, providing a comprehensive set of validated drug-disease treatment associations; however, it has also been used with other KGs (e.g. ROBOKOP). Versions of this dataset are linked to versions of the RTX-KG2 knowledge graph. Integration is handled by the KGMLTruthTransformer class.

EC Ground Truth Dataset

This ground truth dataset was developed and published as part of the MATRIX project and can be found in the Matrix Indication List repo. It is a curated dataset developed collaboratively with medical experts to provide good-quality training pairs for the model. It should be KG-agnostic, as it is extracted directly from regulatory authorities. Versions of this dataset are linked to the releases in the GitHub repository.

Evaluation Datasets

These datasets are used to evaluate how well our ML classifiers predict good-quality drug repurposing candidates. Note that we always evaluate model performance on a standard train-test split in a K-fold cross-validated fashion; here, however, we describe only the withheld / external datasets.

Clinical Trials Data

Clinical trials data (version 20230309) was manually curated by a member of the EveryCure medical team and contains drug-disease pairs from March to September 2023 clinical trials. Provided that the KG time cut-off is before 2023.03.09 (e.g. RTX-KG 2.7.3), we can use this dataset both for unseen test validation and for time-split validation of ML models. The ClinicalTrialsTransformer class handles preprocessing to standardize entity identifiers and remove duplicates.
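The cut-off condition and the duplicate removal can be sketched as follows. This is an illustration under assumptions, not the ClinicalTrialsTransformer itself: the function names are hypothetical, and dropping pairs that already appear in the KG is one plausible reading of "unseen", not a documented step.

```python
from datetime import date

# Taken from the dataset version string 20230309.
CLINICAL_TRIALS_VERSION = date(2023, 3, 9)


def usable_for_time_split(kg_cutoff: date) -> bool:
    """Time-split validation only makes sense if the KG predates the dataset."""
    return kg_cutoff < CLINICAL_TRIALS_VERSION


def held_out_pairs(trial_pairs: list, kg_pairs: list) -> list:
    """Deduplicate trial pairs (keeping order) and drop pairs already in the KG."""
    known = set(kg_pairs)
    unique = dict.fromkeys(trial_pairs)  # dict preserves first-seen order
    return [pair for pair in unique if pair not in known]


# Hypothetical pairs for illustration.
trial_pairs = [("DRUG:1", "DISEASE:1"), ("DRUG:1", "DISEASE:1"), ("DRUG:2", "DISEASE:2")]
kg_pairs = [("DRUG:2", "DISEASE:2")]

if usable_for_time_split(date(2023, 1, 1)):
    print(held_out_pairs(trial_pairs, kg_pairs))
```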

Off-Label Data

The off-label dataset contains documented cases of drugs being used for non-approved indications. This prototype dataset (v0.1) currently combines off-label usage information; for details, see a notebook in our experiment repository. It can be used to validate model performance on the unseen test set. Integration is managed through the OffLabelTransformer class. In future releases, this will be replaced with more comprehensive off-label data sourced directly from DrugBank.

These datasets are integrated into our pipeline through the integration module, with configurations specified in settings.py. The data versions are centrally managed through our globals.yml configuration. Each transformer follows the standard Transformer interface, ensuring consistent data processing and schema compliance across all data sources.
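The idea of a shared transformer interface can be illustrated with a small abstract base class. This is only a sketch: `TransformerSketch`, `PairListTransformer`, the `transform` signature, and the input column names `drug` and `disease` are hypothetical, and the real Transformer interface in the integration module may look quite different.

```python
from abc import ABC, abstractmethod


class TransformerSketch(ABC):
    """Hypothetical stand-in for a common transformer interface."""

    @abstractmethod
    def transform(self, rows: list) -> list:
        """Turn raw source rows into schema-compliant edge dicts."""


class PairListTransformer(TransformerSketch):
    """Converts raw drug/disease rows into edges and removes duplicates."""

    def __init__(self, predicate: str = "biolink:treats", source: str = "example"):
        self.predicate = predicate  # assumed predicate, for illustration
        self.source = source

    def transform(self, rows: list) -> list:
        edges, seen = [], set()
        for row in rows:
            key = (row["drug"], row["disease"])
            if key in seen:
                continue  # skip repeated drug-disease pairs
            seen.add(key)
            edges.append(
                {
                    "subject": row["drug"],
                    "object": row["disease"],
                    "predicate": self.predicate,
                    "metadata": {"source": self.source},
                }
            )
        return edges


rows = [
    {"drug": "DRUG:1", "disease": "DISEASE:1"},
    {"drug": "DRUG:1", "disease": "DISEASE:1"},
    {"drug": "DRUG:2", "disease": "DISEASE:2"},
]
edges = PairListTransformer(source="off_label_example").transform(rows)
print(len(edges))
```

Because every transformer returns the same edge shape, downstream steps do not need to know which ground truth source an edge came from.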