Diseases List

The goal of the MATRIX project is to develop strong candidate suggestions for drug repurposing (see here for a press release).

The MATRIX disease list is an effort to construct a list of diseases that can be targeted by drugs. Many disease terminologies contain entries that are either disease groupings (such as "hereditary disease" or "cancer"), which are too broad to be considered specifically targetable by drugs, or diseases that are so specific that they don't have any differentiable criteria relevant to their treatment procedures. The goal of the MATRIX disease list is, therefore, to separate these cases from "drug-targetable diseases".

The list will be used for communication and navigation purposes.

Availability on HuggingFace Hub

The EC Disease List is available on HuggingFace Hub at everycure/disease-list under the CC-BY-4.0 license. The HuggingFace dataset is automatically updated with each minor and major release (patch releases are not included).

The public HuggingFace version is designed to be clear, stable, and broadly usable. As part of that goal, certain internal-use fields that are still evolving are not included in the public release.

EC team members looking for the full dataset, including all internal fields, should refer to the datasets repository.

Maintainer team

Contributor | Organisation | ORCID
Melissa Haendel | Monarch Initiative, Tislab, UNC | https://orcid.org/0000-0001-9114-8737
Sabrina Toro | Monarch Initiative, Tislab, UNC | https://orcid.org/0000-0002-4142-7153
Elliott Sharp | Every Cure | https://orcid.org/0000-0003-2955-4640
Nico Matentzoglu | Monarch Initiative, Independent Consultant | https://orcid.org/0000-0002-7356-1779
Kevin Schaper | Monarch Initiative, Tislab, UNC | https://orcid.org/0000-0003-3311-7320

Workflow and Method for creating the MATRIX disease list

This document outlines the basic workflow behind the MATRIX disease list.

  1. The disease list corresponds to a subset of the Mondo disease ontology, which itself can be considered "an ontology of terminologies", integrating other widely used terminologies such as OMIM, NCIT (for neoplasms), Orphanet (for Rare Diseases) and Disease Ontology. It also includes mappings to resources such as MedGen, UMLS and many others.
  2. The list is being kept in sync with the development in Mondo, which includes the addition and removal of disease concepts, synonyms and mappings. This means that if a new disease is added to Mondo, it will be added to the disease list within the same week.
  3. The list works by separating diagnosable, clinically actionable diseases from groupings and theoretical disease subtypes without differentiable diagnostic criteria. This separation works in three steps:
    1. Creating a default designation based on heuristics
    2. Manually curating ambiguous entries according to special prioritisation metrics (essentially, we have scores that are indicators of diagnosable diseases, and we review cases with very low scores)
    3. Keeping the evolving list open to crowd-curation and having the community provide feedback when they come across a missing or wrong entry
  4. Basic workflow:
    1. Download the latest version of the Mondo disease ontology
    2. Extract all information from Mondo relevant to the disease list as a TSV file, including
      • disease concept metadata such as synonyms and definitions
      • filter criteria such as "grouping" and "subtype" designations
    3. Filter the TSV file according to the currently agreed heuristics
    4. Submit the updated list for review by a member of the MATRIX disease list core team
    5. Merge and publish the disease list as a versioned artefact on GitHub in various formats, including TSV and XLSX.
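Steps 2 and 3 of the basic workflow can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`mondo_id`, `label`, `is_leaf`, `is_grouping`), the `EX:` identifiers, and the single combined rule are all assumptions made for the example.

```python
import csv
import io

# Hypothetical extract from Mondo; column names and IDs are assumptions.
SAMPLE_TSV = (
    "mondo_id\tlabel\tis_leaf\tis_grouping\n"
    "EX:1\texample disease group\tFalse\tTrue\n"
    "EX:2\texample specific disease\tTrue\tFalse\n"
)


def read_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def apply_heuristics(rows):
    """Keep rows flagged as leaf terms and not flagged as groupings."""
    return [
        row for row in rows
        if row["is_leaf"] == "True" and row["is_grouping"] != "True"
    ]


filtered = apply_heuristics(read_rows(SAMPLE_TSV))
print([row["mondo_id"] for row in filtered])
```

In practice the agreed heuristics are more numerous than this single rule; the sketch only shows the shape of a "read TSV, filter, hand over for review" step.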

Default filters for Disease List

Here we outline and motivate the various filters we apply to construct the diagnosable, clinically actionable diseases.

As described in our workflow specification, these filters serve as heuristics and are gradually overwritten by community and internal expert feedback.
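The relationship between heuristic defaults and curated feedback can be pictured as a simple lookup in which a curated decision, when present, replaces the default. The function and variable names below are hypothetical, and the sketch assumes designations are plain "include"/"exclude" strings.

```python
def final_designation(term_id, heuristic_default, curated_overrides):
    """Return the curated decision if one exists, else the heuristic default."""
    return curated_overrides.get(term_id, heuristic_default)


# Hypothetical curated feedback: one term was reviewed and excluded by hand.
curated_overrides = {"EX:2": "exclude"}

print(final_designation("EX:1", "include", curated_overrides))
print(final_designation("EX:2", "include", curated_overrides))
```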

List of current heuristics for the MATRIX disease list:

Leaf Filter

Heuristic
  1. If a disease has no ontological children, it is, by default, included in the list.
Background

"Leaf diseases", i.e. the most specific disease terms in the ontology, most often represent specific diagnosable diseases. For genetic diseases, these represent diseases caused by variation in a specific gene.
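As an illustration of the leaf heuristic, the sketch below finds terms without ontological children in a small hierarchy. The `parents_of` mapping (child to direct parents) and the `EX:` identifiers are assumptions for the example; the real list is derived from Mondo itself.

```python
def find_leaves(parents_of):
    """Return terms that never appear as a parent of another term."""
    all_terms = set(parents_of)
    has_children = set()
    for parents in parents_of.values():
        all_terms.update(parents)
        has_children.update(parents)
    return all_terms - has_children


# Hypothetical hierarchy: child -> set of direct parents.
parents_of = {
    "EX:disease": set(),
    "EX:group": {"EX:disease"},
    "EX:subtype_a": {"EX:group"},
    "EX:subtype_b": {"EX:group"},
}

leaves = find_leaves(parents_of)
print(sorted(leaves))
```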

Orphanet Subtype Filter

Heuristic
  1. If a disease term in Mondo corresponds to a subtype of a disorder according to Orphanet, it is, by default, included in the list.
Background

Orphanet organized their rare disorders as "group of disorders", "disorders", and "subtype of disorders". They define "subtype of disorders" as sub-forms of a disease based on distinct presentation, etiology, or histological aspect, see details here. The Mondo disease terms representing diseases considered as "subtype of disorders" in Orphanet are annotated with the 'ordo_subtype_of_a_disorder' subset. These diseases are most often the most specific disease terms (most often "leaf diseases").

Orphanet Disorder Filter

Heuristic
  1. If a disease term in Mondo corresponds to an Orphanet disorder, it is, by default, included in the list [CHECK!].
Background

Orphanet organized their rare disorders as "group of disorders", "disorders", and "subtype of disorders". They define "disorders" as entities including diseases, syndromes, anomalies and particular clinical situations. "Disorders" are clinically homogeneous entities described in at least two independent individuals, confirming that the clinical signs are not associated by fortuity. [REF] Orphanet considers this level of classification as a "diagnosable" disorder. These diseases are most often the ontological parents of "disease subtypes".

ClinGen Filter

Heuristic
  1. If a disease is used by the Clinical Genome Resource (ClinGen), it is, by default, included in the list.
Background

The Clinical Genome Resource (ClinGen, https://clinicalgenome.org/) is a National Institutes of Health (NIH)-funded resource dedicated to building a central resource that defines the clinical relevance of genes and variants for use in precision medicine and research.

We consider ClinGen diseases/disorders as diagnosable since they are reported in the database and all have directly associated variant information. ClinGen uses Mondo directly during curation, see for example https://search.clinicalgenome.org/kb/conditions/MONDO:0020119.

OMIM Filter

Heuristic
  1. If a disease has an exact match to an OMIM identifier (i.e. a disease entry), it is, by default, included in the list.
Background

The Online Mendelian Inheritance in Man (OMIM, https://www.omim.org/) catalogs human genes and genetic disorders and traits. All OMIM genetic disorders have direct, equivalent correspondences in Mondo. We consider OMIM diseases/disorders as diagnosable (since they are reported in the database).

ICD10 CM Filter

Heuristic
  1. If a disease has an exact match to an ICD 10 category code, it is, by default, included in the list.
  2. If a disease has an exact match to an ICD 10 chapter or chapter header code, it is, by default, excluded from the list.
Background

There are a few different types of ICD-10 codes that can be roughly identified by their structure:

  1. Chapter codes (or block codes), for example A00-B99 (Certain infectious and parasitic diseases). These codes can be recognised by containing a dash (-) character.
  2. Chapter headers (or chapter titles), for example A00 (Cholera). These can be identified by neither containing a dash, nor a period (.) character.
  3. Category codes (or subcategory codes), for example: A01.1 (Paratyphoid fever A). These can be recognized by containing a period (.) character.

Usually, we can assume the following:

  1. The codes with dashes (chapter codes) represent broad categories of diseases.
  2. The codes with periods (category/subcategory codes) represent more specific diagnoses.
  3. The codes without dashes or periods (chapter headers) are usually the top-level categories within each chapter.

In clinical and coding contexts, people often refer to the codes with periods as the "billable codes" or "billable ICD-10 codes" because these are typically the ones used for specific diagnoses in medical billing and record-keeping. Codes without a period (chapter headers) are generally not billable, and codes with dashes (chapter codes/block codes) are never billable.

However, it's important to note that not all codes with periods are billable. Some may require additional digits for specificity. The exact rules can vary slightly depending on the specific implementation of ICD-10 (such as ICD-10-CM in the United States), but generally, the most specific codes (usually those with periods) are the billable ones.
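The structural rules above lend themselves to a small classifier. The sketch below uses this page's terminology (chapter code, chapter header, category code) and the heuristic stated earlier in this section; the function names are hypothetical and the logic looks only at the dash and period characters, so it does not check whether a code actually exists or is billable.

```python
def classify_icd10_code(code):
    """Classify an ICD-10 code by its structure, using this page's terms."""
    if "-" in code:
        return "chapter_code"      # e.g. A00-B99
    if "." in code:
        return "category_code"     # e.g. A01.1
    return "chapter_header"        # e.g. A00


def default_designation(code):
    """Category codes are included by default; the other kinds are excluded."""
    if classify_icd10_code(code) == "category_code":
        return "include"
    return "exclude"


for example in ("A00-B99", "A00", "A01.1"):
    print(example, classify_icd10_code(example), default_designation(example))
```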

OMIMPS Filter

Heuristic
  1. If a disease has an exact match to an OMIMPS identifier, it is, by default, excluded from the list.
Background

OMIM Phenotypic Series (OMIMPS) group diseases based on similar phenotypes. An OMIMPS most often refers to the general disease when the OMIM terms are gene-specific subtypes of the disease. For example, the OMIMPS "Usher syndrome" includes all subtypes of Usher syndrome. Sometimes, an OMIMPS groups terms based on phenotype similarities, for example "Intellectual developmental disorder, X-linked syndromic". By nature, Mondo terms representing an OMIMPS entry are not actual diseases but groups of diseases.

OMIMPS descendant Filter

Heuristic
  1. If a disease is a subclass of a disease that corresponds to an OMIM Phenotypic Series, it is, by default, included in the list.
Background

Since OMIMPS group diseases, we determined that ontological children of OMIMPS should be diseases that we would want to include. These would include terms corresponding to OMIM terms, and possibly other disease terms.
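A rough sketch of how such descendants could be collected is shown below. The `children_of` mapping and the `EX:` identifiers are made up for illustration, and the traversal is a generic depth-first walk rather than the project's actual implementation.

```python
def descendants(term, children_of):
    """Collect all ontological descendants of a term (iterative DFS)."""
    seen = set()
    stack = list(children_of.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children_of.get(node, ()))
    return seen


# Hypothetical hierarchy: parent -> list of direct children.
children_of = {
    "EX:series": ["EX:type_1", "EX:type_2"],
    "EX:type_1": ["EX:type_1b"],
}
omimps_terms = {"EX:series"}  # terms with an exact match to an OMIMPS entry

included_by_default = set()
for series in omimps_terms:
    included_by_default |= descendants(series, children_of)
print(sorted(included_by_default))
```

Note that the series term itself is not part of the result, which matches the OMIMPS Filter excluding it by default.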

Grouping subset Filter

Heuristic
  1. If a disease term in Mondo corresponds to a group of disorders according to Orphanet (has an 'ordo_group_of_disorders' subset annotation), it is, by default, excluded from the list
  2. If a disease term in Mondo has a 'disease_grouping' subset annotation, it is, by default, excluded from the list
  3. If a disease term in Mondo has a 'harrisons_view' subset annotation, it is, by default, excluded from the list
  4. If a disease term in Mondo has a 'rare_grouping' subset annotation, it is, by default, excluded from the list
Background

By nature, Mondo terms in the following subsets are not actual diseases but groups of diseases.

  • 'ordo_group_of_disorders' subset: Orphanet organized their rare disorders as "group of disorders", "disorders", and "subtype of disorders". They define "group of disorders" as a collection of disease/clinical entities sharing a given characteristic. [REF]
  • 'disease_grouping' subset: Terms in this subset have been manually curated and determined to be grouping terms.
  • 'harrisons_view' subset: Mondo's high-level classification was created based on the Harrison’s Principles of Internal Medicine textbook. Terms representing this high-level classification are annotated with the 'harrisons_view' subset.
  • 'rare_grouping' subset: The ontological parent of rare diseases (see Mondo rare disease subset here).

Grouping Subset Ancestor Filter

Heuristic
  1. If a disease is an ontological parent of a disease that is a grouping term (as defined in the "Grouping Subset Filter" section), it is, by default, excluded from the list.
Background

Ontologically, a parent of a grouping class would itself be a grouping class.
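One way to picture this filter is to walk upwards from every known grouping term and mark everything reached as excluded by default. The sketch below assumes a `parents_of` mapping (child to direct parents) and uses invented `EX:` identifiers; it is an illustration of the idea, not the project's code.

```python
def ancestors(term, parents_of):
    """Collect all ontological ancestors of a term (iterative traversal)."""
    seen = set()
    stack = list(parents_of.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents_of.get(node, ()))
    return seen


# Hypothetical hierarchy: child -> list of direct parents.
parents_of = {
    "EX:specific_disease": ["EX:grouping"],
    "EX:grouping": ["EX:broad_area"],
    "EX:broad_area": ["EX:disease_root"],
}
grouping_terms = {"EX:grouping"}  # terms matched by the grouping subset filter

excluded_by_default = set()
for grouping in grouping_terms:
    excluded_by_default |= ancestors(grouping, parents_of)
print(sorted(excluded_by_default))
```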

Leaf Direct Parent Filter

Heuristic
  1. This filter indicates if a disease is a direct parent of a leaf term
  2. This filter is for information purposes and is not used to include/exclude terms from the list [NOT TRUE!].
Background

This filter exists for information purposes. We think that the majority of the "leaf direct parent" terms would also be in the "orphanet disorder" subset and in the "OMIM" subset, and therefore would be included in the list.

Subtype Subset Descendant Filter

Heuristic
  1. This filter indicates if a disease is an ontological child of a "subtype of disorders" term.
  2. This filter is for information purposes and is not used to include/exclude terms from the list.
Background

This filter exists for information purposes. We think that the majority of the "orphanet subtype of disorder" terms would be leaf terms as they are specific. However, if there is a term that is an ontological child of an "orphanet subtype of disorder", it might need to be included in the list.

Limitations of the filtering approach created in this section:

  • Mondo mappings to ICD10CM are currently incomplete, therefore the filter will result in false positives and false negatives.

Disease list subsets / tagging / grouping

We have developed a system for grouping diseases into categories (called here tagging or grouping).

This has various purposes:

  1. Model evaluation. Similar to the approach of the TxGNN paper, we need to be able to evaluate whether a model trained for one disease area will also work for another. In other words, we do the train - test split by disease "area" (say, endocrine disease for training and cancer for testing).
  2. Display. Sometimes, we want to be able to offer sensible facets in a search interface to a user, like "endocrine system diseases".
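The model evaluation use case can be sketched as a split in which whole disease areas are held out for testing. The disease identifiers and the `split_by_area` helper below are hypothetical; only the idea of splitting by area comes from the text above.

```python
def split_by_area(disease_to_area, test_areas):
    """Put diseases from held-out areas in the test set, the rest in training."""
    train, test = [], []
    for disease, area in sorted(disease_to_area.items()):
        if area in test_areas:
            test.append(disease)
        else:
            train.append(disease)
    return train, test


# Hypothetical disease -> area tags.
disease_to_area = {
    "EX:1": "endocrine_system_disorder",
    "EX:2": "endocrine_system_disorder",
    "EX:3": "cancer",
}

train, test = split_by_area(disease_to_area, test_areas={"cancer"})
print(train, test)
```

Because the split is made on the area tag rather than on individual diseases, no disease from a held-out area can leak into the training set.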

There are three major ways we add these groupings:

  1. Mondo subset tags. Mondo itself curates certain subsets, such as the "Harrisons view", which corresponds to the disease categories from a popular medical textbook.
  2. LLM-generated subset tags. We have developed a Kedro pipeline which generates disease tags flexibly using LLMs.
  3. Manual curation of subset grouping classes. We have provided a way to manually curate subset groupings, so that new subsets can be defined as needed.

Mondo subset tags

(Note this list may be out of date)

Mondo provides two major subsets:

  1. Its manually curated class hierarchy. Everything under the human disease concept is considered a subset. The advantage of this subset is that it will always cover the entirety of all human diseases in Mondo.
  2. The Harrison subset. Corresponding to a popular medical textbook, this subset contains most Mondo classes, using the groupings provided by the textbook. Note, at the time of this writing the Harrison view and the manually curated disease hierarchy are the same (this was not the case, for example last month, and might change again in the future.)

Manually curated subset tags

We have developed a template-based approach using a spreadsheet to manually specify subsets.

The curator simply specifies the root nodes of the Mondo ontology to include in any given subset, say "endocrine disorder" and "cancer".

The pipeline then creates tags based on those groups, which are then exported into the disease grouping list.

Outputs

A table is generated that looks like this:

category_class | label | harrisons_view | matrix_txgnn_grouping | mondo_top_grouping
MONDO:0000004 | adrenocortical insufficiency | endocrine_system_disorder | endocrine_system_disorder | endocrine_system_disorder
MONDO:0000005 | alopecia, isolated | hereditary_disease | other | integumentary_system_disorder
MONDO:0000009 | inherited bleeding disorder, platelet-type | hereditary_disease | other | hematologic_disorder
MONDO:0000014 | colorblindness, partial | other | psychiatric_disorder | nervous_system_disorder
MONDO:0000015 | classic complement early component deficiency | hereditary_disease | other | hereditary_disease

The first column (category_class) is the tagged disease in Mondo, and the second column (label) the corresponding disease name.

All the remaining columns correspond to different groupings. In the example we show three groupings:

  1. harrisons_view
  2. matrix_txgnn_grouping
  3. mondo_top_grouping

The values are the tags, that is, the specific disease group a disease corresponds to under a given grouping. For example: "adrenocortical insufficiency" (MONDO:0000004) corresponds to an endocrine_system_disorder according to the harrisons_view.
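A lookup of this kind might look like the sketch below. It rebuilds the first two rows of the example table as tab-separated text; the tab delimiter and the `tag_for` helper are assumptions, since the actual file format of the grouping output is not specified here.

```python
import csv
import io

HEADER = ["category_class", "label", "harrisons_view",
          "matrix_txgnn_grouping", "mondo_top_grouping"]
ROWS = [
    ["MONDO:0000004", "adrenocortical insufficiency",
     "endocrine_system_disorder", "endocrine_system_disorder",
     "endocrine_system_disorder"],
    ["MONDO:0000005", "alopecia, isolated",
     "hereditary_disease", "other", "integumentary_system_disorder"],
]
# Assumed serialisation: one tab-separated line per row.
GROUPING_TSV = "\n".join("\t".join(cells) for cells in [HEADER] + ROWS)


def tag_for(rows, mondo_id, grouping):
    """Return the tag of a disease under a given grouping, or None if absent."""
    for row in rows:
        if row["category_class"] == mondo_id:
            return row[grouping]
    return None


rows = list(csv.DictReader(io.StringIO(GROUPING_TSV), delimiter="\t"))
print(tag_for(rows, "MONDO:0000004", "harrisons_view"))
```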

Pipeline

The general pipeline works like this:

  1. All externally provided subsets (llm-based, manually curated) are provided as ROBOT templates.
  2. All subsets are included as ontology annotations into Mondo during the release by converting the templates into ontology annotations.
  3. For ontology-based subsets, the grouping classes are selected and downfilled according to the class hierarchy (so all diseases under "endocrine disorder" get an "endocrine disorder" tag).
  4. During the disease list processing, those subsets are extracted and formatted into the "disease grouping" spreadsheet.
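The "downfilling" in step 3 can be illustrated as follows. This is a simplified sketch with invented `EX:` identifiers; it assumes that the grouping class itself also receives its own tag and that a disease under several grouping classes collects all of their tags, neither of which is confirmed by the description above.

```python
def downfill(grouping_roots, children_of):
    """Propagate each grouping root's tag to the root and all its descendants."""
    tags = {}
    for root, tag in grouping_roots.items():
        seen = set()
        stack = [root]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            tags.setdefault(node, set()).add(tag)
            stack.extend(children_of.get(node, ()))
    return tags


# Hypothetical hierarchy: parent -> list of direct children.
children_of = {
    "EX:endocrine": ["EX:a"],
    "EX:a": ["EX:b"],
    "EX:cancer": ["EX:c"],
}
grouping_roots = {"EX:endocrine": "endocrine_disorder", "EX:cancer": "cancer"}

tags = downfill(grouping_roots, children_of)
print(sorted(tags["EX:b"]))
```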