Cross-run evaluation with the run_comparison pipeline
The run_comparison pipeline generates various performance metrics and visualisations that allow us to compare several sets of drug-disease predictions across all drugs and diseases.
As well as predictions generated by the MATRIX modelling pipeline,
it supports any custom set of predictions satisfying the assumptions and schema described below.
The pipeline includes the following metrics:
- Full matrix ranking. Recall@n vs. n curve for on-label indications, off-label indications and known contraindications.
- Disease-specific ranking. Disease-specific Hit@k vs. k curve for on-label indications and off-label indications.
- Known positive vs. known negative classification. Precision-recall curve.
- Prevalence of frequent flyers. Drug and Disease Entropy@n vs. n curves.
- Similarity between models. Commonality@n vs. n curve.
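As a rough illustration of the first metric, full-matrix Recall@n ranks every drug-disease pair by score and measures what fraction of known positives land in the top n. A minimal sketch with hypothetical column names (the pipeline's own columns are described later in this page):

```python
import pandas as pd

def recall_at_n(df: pd.DataFrame, n: int, score_col: str = "score",
                label_col: str = "is_on_label") -> float:
    """Fraction of known positives ranked among the top-n pairs by score."""
    top_n = df.nlargest(n, score_col)      # the n highest-scoring pairs
    total_positives = df[label_col].sum()  # all known positives in the matrix
    return top_n[label_col].sum() / total_positives

# Toy example: 4 pairs, 2 known positives.
df = pd.DataFrame({
    "score": [0.9, 0.8, 0.3, 0.1],
    "is_on_label": [True, False, True, False],
})
print(recall_at_n(df, n=2))  # one of two positives in the top 2 -> 0.5
```

Plotting this value for a range of n gives the Recall@n vs. n curve; the other ranking metrics follow the same pattern with different groupings.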
An overview of these metrics is given in the evaluation suite deep dive.
In addition, the pipeline includes the following features:
- Uncertainty estimation. The pipeline applies multifold uncertainty estimation and bootstrap uncertainty estimation.
- Data consistency and harmonisation. Utilities to ensure a consistent and fair evaluation, such as taking intersection between drug and disease lists for all sets of predictions.
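The idea behind bootstrap uncertainty estimation can be sketched as follows: resample the evaluation pairs with replacement, recompute the metric on each resample, and report quantiles of the resulting distribution. This is a generic illustration, not the pipeline's actual implementation:

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, metric, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for a metric over samples."""
    rng = np.random.default_rng(seed)
    stats = [metric(rng.choice(values, size=len(values), replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# 95% interval for the mean of a small sample of per-pair scores.
lo, hi = bootstrap_ci(np.array([0.2, 0.4, 0.6, 0.8]), np.mean)
print(lo, hi)  # interval bracketing the point estimate 0.5
```

Multifold uncertainty estimation is analogous but uses the spread of the metric across folds rather than resampling.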
How do I use the run_comparison pipeline?
To use the run comparison pipeline, follow these steps:
- Ensure that the MATRIX repository is cloned and the environment is set up. Create and check out a new branch off `main`.
- Modify the parameters configuration file for the run comparison pipeline, `conf/base/run_comparison/parameters.yml`, to specify the prediction dataframes you would like to include in the evaluation. Details are given below. Note: ensure that the input prediction dataframes do not include pairs with known labels used in training (synthesised negatives are OK).
- Run the command (hint: ensure your Docker daemon is running):
  `kedro experiment run --pipeline=run_comparison --username=<your name> --run-name=<your run name>`
- View the results in GCS:
  `gs://mtrx-us-central1-hub-dev-storage/kedro/data/run_comparison/runs/<your run name>/`
How do I configure the parameters.yml file?
Specify data consistency procedure
If the input_data.apply_harmonization parameter is set to true, then post-processing will be performed on the input predictions to ensure that the data allows for consistent evaluation.
If it is set to false, then an error is raised unless the raw input predictions are already consistent.
Either option allows for a fair comparison.
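For context, the switch sits under the `input_data` key of the parameters file; a minimal hypothetical fragment might look like:

```yaml
input_data:
  apply_harmonization: true  # set to false to require already-consistent inputs
```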
More precisely, when `input_data.apply_harmonization` is `true`, we perform the following operations, which we refer to as matrix harmonisation:
- Take the intersection of drug and disease lists across models.
- For each fold, take the union of exclusion sets (i.e. training set) across models.
- For each fold, take the intersection of test set across models. Any pairs that are in a given test set for one model but not another are added to the exclusion set.
When `input_data.apply_harmonization` is set to false, the pipeline throws an error unless the drug list, disease list, exclusion set and test sets are all consistent across models for each fold.
Note that we require that drug and disease lists are consistent between folds for each single model, regardless of whether input_data.apply_harmonization is true or false.
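The three harmonisation operations above can be sketched with plain set operations (hypothetical in-memory data structures; the real pipeline operates on dataframes):

```python
from functools import reduce

# Per-model drug and disease vocabularies.
models = {
    "model_a": {"drugs": {"d1", "d2", "d3"}, "diseases": {"x", "y"}},
    "model_b": {"drugs": {"d2", "d3", "d4"}, "diseases": {"y", "z"}},
}

# 1. Intersection of drug and disease lists across models.
drugs = reduce(set.intersection, (m["drugs"] for m in models.values()))
diseases = reduce(set.intersection, (m["diseases"] for m in models.values()))

# Per-fold exclusion (training) and test sets, as sets of (drug, disease) pairs.
fold = {
    "model_a": {"exclude": {("d2", "y")}, "test": {("d3", "y")}},
    "model_b": {"exclude": {("d3", "y")}, "test": {("d3", "y"), ("d2", "y")}},
}

# 2. Union of exclusion sets across models.
exclude = reduce(set.union, (f["exclude"] for f in fold.values()))

# 3. Intersection of test sets; any pair in one model's test set but not
#    another's is moved into the exclusion set.
test = reduce(set.intersection, (f["test"] for f in fold.values()))
exclude |= reduce(set.union, (f["test"] for f in fold.values())) - test

print(sorted(drugs), sorted(test), sorted(exclude))
```

In this toy example, `("d2", "y")` appears only in model_b's test set, so it is dropped from the shared test set and added to the shared exclusion set.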
Specify filepaths for input predictions
Input paths are specified under the `input_data.input_paths` key, which allows brace expansion to input predictions over several folds.
Usage is best described by an example:
```yaml
input_paths:
  - name: <name of model to appear in output>
    fold_paths_list:
      - "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v<data release>/runs/<run name>/datasets/matrix_transformations/fold_{0..4}/transformed_matrix"
    file_format: "parquet"  # "csv" also allowed
    score_col_name: "transformed_treat_score"
```
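The `{0..4}` pattern in the example expands into one path per fold. A hedged sketch of how such an expansion behaves (the pipeline's own parsing may differ):

```python
import re

def expand_braces(path: str) -> list[str]:
    """Expand a single {start..end} integer range into one path per value."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", path)
    if not m:
        return [path]  # no range present: return the path unchanged
    start, end = int(m.group(1)), int(m.group(2))
    return [path[:m.start()] + str(i) + path[m.end():]
            for i in range(start, end + 1)]

paths = expand_braces("gs://bucket/fold_{0..4}/transformed_matrix")
print(len(paths))   # 5
print(paths[0])     # gs://bucket/fold_0/transformed_matrix
```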
Assumptions on custom input predictions
When inputting custom predictions, ensure that the following assumptions and schema are adhered to.
Important: The run comparison pipeline assumes that all drug-disease pairs appearing in the training set of the model have been taken out of the input dataframe.
We make the following assumptions on the input data:
- Each row of the input dataframe corresponds to a drug-disease pair, with a column indicating the score and boolean columns indicating whether each pair belongs to each test set.
- The set of pairs described by the
- The schema of the dataframe should be as follows:
  - source: drug ID
  - target: disease ID
  - The names of the Boolean columns for the test sets are those specified in the `available_ground_truth_cols` key.
  - The score column name must correspond to that specified in the corresponding entry under the `input_paths` key.
Any additional columns will be ignored.
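Putting the schema together, a valid custom input dataframe might look like the following. The identifiers, the `treat_score` column name, and the two Boolean column names are all illustrative; in practice the Boolean columns must match your `available_ground_truth_cols` entries and the score column must match your `score_col_name`:

```python
import pandas as pd

predictions = pd.DataFrame({
    "source": ["CHEMBL25", "CHEMBL25", "CHEMBL112"],                # drug IDs
    "target": ["MONDO:0005148", "MONDO:0004975", "MONDO:0005148"],  # disease IDs
    "treat_score": [0.91, 0.12, 0.47],     # must match score_col_name
    "is_on_label": [True, False, False],   # one Boolean column per test set
    "is_off_label": [False, False, True],
})

# Training pairs must already be removed: every row is a scored candidate pair.
assert not predictions.duplicated(subset=["source", "target"]).any()
```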
(Optional) Custom evaluations
- Evaluations are specified under the `evaluations` key using classes found in `src/matrix/pipelines/run_comparison/evaluations.py`.
- Evaluations may be easily disabled and enabled by modifying the `DYNAMIC_PIPELINES_MAPPING` `run_comparison` value in the file `src/matrix/pipelines/settings.py`.
The class hierarchy structure of evaluations is summarised by the following diagram:
The abstract base class for all evaluations is ComparisonEvaluation, which requires two methods: evaluate and plot_results.
All model-specific evaluations, that is, those that produce one curve per model, such as Recall@n or Entropy@n, inherit from the abstract subclass ComparisonEvaluationModelSpecific,
which handles the multifold uncertainty estimation logic and plotting.
Furthermore, evaluations using bootstrap uncertainty estimation inherit from ComparisonEvaluationModelSpecificBootstrap.
Evaluations that are not model-specific, such as Commonality@n, which produces one curve per pair of models, inherit directly from ComparisonEvaluation.
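A skeletal version of this hierarchy, with illustrative method bodies (the real signatures live in `src/matrix/pipelines/run_comparison/evaluations.py`):

```python
from abc import ABC, abstractmethod

class ComparisonEvaluation(ABC):
    """Abstract base class: every evaluation computes results and plots them."""
    @abstractmethod
    def evaluate(self, predictions): ...
    @abstractmethod
    def plot_results(self, results): ...

class ComparisonEvaluationModelSpecific(ComparisonEvaluation):
    """One curve per model; centralises the shared plotting logic."""
    def plot_results(self, results):
        for model_name, curve in results.items():
            print(f"plotting {model_name}: {curve}")

class RecallAtN(ComparisonEvaluationModelSpecific):
    """A concrete model-specific evaluation (curve computation stubbed)."""
    def evaluate(self, predictions):
        return {name: [0.0] for name in predictions}

print(RecallAtN().evaluate({"model_a": None}))  # {'model_a': [0.0]}
```

Subclasses only implement `evaluate`; the model-specific base class supplies `plot_results`, mirroring how the pipeline factors uncertainty estimation and plotting out of individual metrics.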