
EC Drug List

Overview

The EC Drug List is a manually curated list of drug entities used by the MATRIX project for drug repurposing predictions. It provides stable identifiers for drugs throughout the MATRIX pipeline, so that drug repurposing opportunities can be tracked consistently over time, and cross-links each drug to external IDs (CURIEs) using the NCATS Node Normalizer and Name Resolver services. For reference, the list contains ~1,800 drugs with metadata including approval status, drug class flags, therapeutic annotations, and ATC classifications.

Key characteristics of the EC Drug List:

  • Consolidated to drug active ingredients: The list includes only single active drug ingredients and consolidates different salt forms and formulations of the same drug. Combination therapies are listed as individual ingredients. Similar drugs (e.g., drugs and prodrugs) may be aggregated into a single entry, as noted in the aggregated_with column.
  • Regulatory approved drugs: The list is collated based on treatment guidelines in the US and UK, and may expand to broader geographies in the future.
  • Therapeutic drugs only: The list includes only drugs with therapeutic uses and excludes approved non-therapeutic products (e.g., diagnostics, vaccines, allergens).
  • Curated annotations: The list includes manually curated annotations for therapeutic area, drug function, drug target, and drug class to enable analysis and display features.

Location on GCS

The EC Drug List is stored in the public GCS bucket and can be accessed at the following path:

gs://data.dev.everycure.org/data/01_RAW/drug_list/{version}/ec-drug-list.tsv

The list is also available in the datasets repository.
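As a minimal sketch, the versioned path can be built from the template above. The helper name `drug_list_uri` and the version value `v0.2.0` are illustrative only and are not part of the MATRIX codebase; reading the file straight from `gs://` with pandas would additionally assume that the optional `gcsfs` dependency is installed.

```python
# Template copied from the path above; {version} is filled in by the caller.
GCS_TEMPLATE = (
    "gs://data.dev.everycure.org/data/01_RAW/drug_list/{version}/ec-drug-list.tsv"
)


def drug_list_uri(version: str) -> str:
    """Return the GCS URI of the drug list TSV for one release (hypothetical helper)."""
    return GCS_TEMPLATE.format(version=version)


uri = drug_list_uri("v0.2.0")  # example version tag, not necessarily a real release
print(uri)

# With pandas and gcsfs installed, the file could then be loaded with:
#   import pandas as pd
#   drugs = pd.read_csv(uri, sep="\t")
```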

Availability on HuggingFace Hub

The EC Drug List is also published on HuggingFace Hub at everycure/drug-list under the CC-BY-4.0 license. The HF dataset is updated automatically on minor and major releases (patch releases are skipped).

New versions of the drug list are released through the core-entities pipelines. The latest version can be found in GCS or in the datasets repository; to adopt it, update the version number in the MATRIX configuration (pipelines/matrix/conf/base/globals.yml). The EC Data Products team follows a versioning system in which users should always use at least the current minor release to ensure compatibility with other systems:

  • PATCH (v0.1.0 → v0.1.1): Fixed incorrect data, corrected values (including changes in translator_id) or added a column
  • MINOR (v0.1.1 → v0.2.0): Changes in spine (EC_ID) or translator_id (Name Resolver or Node Normalizer)
  • MAJOR (v0.2.0 → v1.0.0): Removed columns, changed data types, restructured
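The versioning rules above can be sketched in a few lines of Python. The function names are hypothetical, and the reading of "at least the current minor release" as "same or newer major.minor pair" is an assumption rather than a documented rule.

```python
def parse_version(tag: str) -> tuple:
    """Turn a tag such as 'v0.2.0' into a (major, minor, patch) tuple of ints."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_bump(old: str, new: str) -> str:
    """Label the step from `old` to `new` as PATCH, MINOR, or MAJOR."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


def meets_current_minor(pinned: str, latest: str) -> bool:
    """Assumed rule: the pinned release must not lag the latest major.minor."""
    return parse_version(pinned)[:2] >= parse_version(latest)[:2]


print(classify_bump("v0.1.1", "v0.2.0"))
print(meets_current_minor("v0.1.1", "v0.2.0"))
```

A check like `meets_current_minor` could run before a pipeline execution to warn when the version pinned in globals.yml has fallen behind.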

How the List is Generated

The EC Drug List is manually curated and maintained by the Every Cure team with the following principles:

  1. Drug collation: Drugs are listed by curators as they go through treatment guidelines, consolidating entries to the active ingredient level in a protected Google Sheet. At the request of the EC Medical Team, multiple similar drugs are aggregated into a single entry in some edge cases (e.g., drugs and prodrugs) to avoid the repeated effort of reviewing redundant pairs. Where this is the case, it is noted in the aggregated_with column.
  2. Manual annotations: Manually curated columns are appended to the list in the Google Sheet. See below for which columns in the list are manual annotations.
  3. Stable identifier assignment: The Google Sheet is ingested into Core Entities and each drug receives a unique EC identifier (format: EC:XXXXX) that will not be removed or modified over time, only superseded if necessary.
  4. Multi-source identifier mapping: Each EC identifier is mapped to multiple identifier systems:
       • Translator IDs (CURIEs): Used to join with knowledge graphs in the MATRIX pipeline (e.g., CHEBI, UNII, DRUGBANK). The drug name or synonym is first put through NCATS Name Resolver to return a CURIE, then normalized with NCATS Node Normalizer to be consistent with the MATRIX KG-based pipeline.
       • DrugBank IDs: Links to the DrugBank database for additional drug information. This mapping is done through direct string matching of the drug name or synonym.
  5. Additional feature generation: Using the mappings above, additional features are programmatically generated for each drug. See below for which columns in the list are programmatically generated.
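The two-step Translator ID mapping can be illustrated with a small sketch. It does not call the real NCATS services: `resolve_name` and `normalize_curie` are hypothetical callables standing in for Name Resolver and Node Normalizer, the lookup tables contain made-up placeholder CURIEs, and trying the name before the synonyms in order is an assumption.

```python
from typing import Callable, Iterable, Optional


def map_to_translator_id(
    names: Iterable[str],
    resolve_name: Callable[[str], Optional[str]],
    normalize_curie: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the first normalized CURIE found for the drug name or its synonyms."""
    for candidate in names:
        curie = resolve_name(candidate)  # step 1: name -> CURIE
        if curie is None:
            continue
        normalized = normalize_curie(curie)  # step 2: CURIE -> preferred CURIE
        if normalized is not None:
            return normalized
    return None


# Stand-in lookup tables with placeholder identifiers (not real CURIEs).
FAKE_RESOLVER = {"Levothyroxine": "EXAMPLE:1"}
FAKE_NORMALIZER = {"EXAMPLE:1": "EXAMPLE:PREFERRED_1"}

translator_id = map_to_translator_id(
    ["Unknown name", "Levothyroxine"], FAKE_RESOLVER.get, FAKE_NORMALIZER.get
)
print(translator_id)
```

Passing the two services in as callables keeps the sketch testable offline; a real implementation would wrap the HTTP clients for each service behind the same two signatures.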

Data Validation

The EC Drug List undergoes validation during the ingestion pipeline with Pandera schema checks for:

  • Unique EC identifiers: Each id must be unique across the list
  • Unique translator IDs: Each translator_id must be unique across the list
  • Unique DrugBank ID: Each drugbank_id must be unique across the list
  • Format validation: EC identifiers must follow the EC: prefix format
  • Boolean flag validation: Therapeutic flag fields must contain valid boolean values
  • String format validation: Names and synonyms must be capitalized
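The checks listed above can be approximated in plain Python to show what each rule means. This is not the project's Pandera schema; it is a dependency-free sketch in which rows are assumed to be dictionaries with already-parsed values, missing IDs are skipped in the uniqueness checks, and "capitalized" is assumed to mean that the first character is uppercase.

```python
from collections import Counter

UNIQUE_COLUMNS = ("id", "translator_id", "drugbank_id")
FLAG_COLUMNS = ("is_antipsychotic", "is_sedative")  # subset of the flags, for brevity


def validate_drug_rows(rows):
    """Return a list of problems found; an empty list means every check passed."""
    errors = []
    # Uniqueness checks; missing values are skipped (an assumption).
    for column in UNIQUE_COLUMNS:
        counts = Counter(row[column] for row in rows if row.get(column) is not None)
        errors += [f"duplicate {column}: {value}" for value, n in counts.items() if n > 1]
    for row in rows:
        # Format check on the EC identifier prefix.
        if not row["id"].startswith("EC:"):
            errors.append(f"id without EC: prefix: {row['id']}")
        # Flag fields must hold real booleans.
        for flag in FLAG_COLUMNS:
            if not isinstance(row.get(flag), bool):
                errors.append(f"{row['id']}: {flag} is not a boolean")
        # Names and synonyms must start with an uppercase character.
        for text in [row["name"], *row.get("synonyms", [])]:
            if not text[:1].isupper():
                errors.append(f"{row['id']}: '{text}' is not capitalized")
    return errors


example_row = {
    "id": "EC:00001", "name": "Drug one", "translator_id": "EXAMPLE:1",
    "drugbank_id": None, "synonyms": ["Alias one"],
    "is_antipsychotic": False, "is_sedative": True,
}
print(validate_drug_rows([example_row]))
```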

Integration with MATRIX Pipeline

The drug list is integrated into the MATRIX pipeline through the ingestion stage:

  • Primary identifier: EC identifiers serve as the primary key for tracking drugs through the pipeline
  • CURIE mapping: Translator IDs enable mapping to nodes in various knowledge graphs (RTX-KG2, SPOKE, Robokop, PrimeKG, etc.). Currently only one ID is provided for each drug, but we may expand this to an array of CURIEs in the future.
  • Normalization stability: EC identifiers do not change during the normalization process, preventing conflicts in downstream systems like Orchard
  • Dual identifier tracking: Both EC and translator IDs are maintained through matrix generation to support both curation and knowledge graph operations
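The dual identifier idea can be shown with a short sketch: the translator ID is used only to look up a knowledge graph node, while the EC identifier is carried through unchanged. The node structure (a dictionary with an `id` key) and all identifier values are assumptions made for illustration.

```python
def attach_kg_nodes(drugs, kg_nodes):
    """Left-join drug rows to KG nodes on translator_id, keeping the EC id as key."""
    nodes_by_curie = {node["id"]: node for node in kg_nodes}
    joined = []
    for drug in drugs:
        node = nodes_by_curie.get(drug["translator_id"])
        joined.append(
            {
                "ec_id": drug["id"],  # stable key used for curation
                "translator_id": drug["translator_id"],  # key used for KG joins
                "kg_node_found": node is not None,
            }
        )
    return joined


# Placeholder identifiers, not real list entries.
drugs = [
    {"id": "EC:00001", "translator_id": "EXAMPLE:1"},
    {"id": "EC:00002", "translator_id": "EXAMPLE:2"},
]
kg_nodes = [{"id": "EXAMPLE:1", "name": "example node"}]

for row in attach_kg_nodes(drugs, kg_nodes):
    print(row)
```

Using a left join means a drug without a matching node still appears in the output, so gaps in KG coverage stay visible instead of silently dropping drugs.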

For more details on the migration decision and its impact on the MATRIX pipeline, see the Drug List Migration ADR.

Features of the List

The EC Drug List is provided as both a tab-separated values (TSV) file and a Parquet file, with the following columns:

| Column | Type | Description | Agent type |
|---|---|---|---|
| id | String | Unique EC identifier for the drug (format: EC:XXXXX). This is the primary key and remains stable over time. | Manual |
| name | String | Human-readable name of the drug | Manual |
| translator_id | String | CURIE identifier used for mapping to knowledge graph nodes (e.g., CHEBI:421707, UNII:AVK0I6HY2U) | Programmatic |
| drugbank_id | String | DrugBank database identifier (e.g., DB01048) | Programmatic |
| synonyms | Array[String] | Alternative human-readable names of the drug | Manual |
| aggregated_with | Array[String] | Similar drugs that have been aggregated with this drug entity to avoid repeated effort of reviewing redundant pairs (e.g., drugs and prodrugs) | Manual |
| drug_class | String | Classification of drugs based on mechanism or similar use in clinical practice (e.g., ACE inhibitor, Statin). See below for more details | Manual |
| therapeutic_area | String | Primary therapeutic area of the diseases the drug is used to treat | Manual |
| drug_function | String | Physiological function of the drug (e.g., Antiinflammatory, Immune checkpoint inhibitor) | Manual |
| drug_target | String | Primary mechanism of action/molecular target(s) of the drug | Manual |
| approved_usa | Categorical | Indicates whether the drug is FDA-approved in the United States (APPROVED, NOT_APPROVED, DISCONTINUED) | Manual |
| is_antipsychotic | Boolean | Flag indicating if the drug is classified as an antipsychotic medication | Manual |
| is_sedative | Boolean | Flag indicating if the drug is classified as a sedative medication | Manual |
| is_antimicrobial | Boolean | Flag indicating if the drug is classified as an antimicrobial agent | Manual |
| is_glucose_regulator | Boolean | Flag indicating if the drug regulates glucose levels | Manual |
| is_chemotherapy | Boolean | Flag indicating if the drug is a traditional cytotoxic chemotherapy treatment | Manual |
| is_steroid | Boolean | Flag indicating if the drug is a steroid medication | Manual |
| is_analgesic | Boolean | Flag indicating if the drug is primarily an analgesic medication | Manual |
| is_cardiovascular | Boolean | Flag indicating if the drug is used to treat cardiovascular conditions | Manual |
| is_cell_therapy | Boolean | Flag indicating if the drug is a cell or gene therapy | Manual |
| smiles | String | Simplified Molecular Input Line Entry System (SMILES) representation of the chemical structure | Programmatic |
| atc_main | String | ATC main code (abbreviated form) | Programmatic |
| atc_level_1 | String | ATC level 1 code - anatomical main group (e.g., H for systemic hormonal preparations) | Programmatic |
| atc_level_2 | String | ATC level 2 code - therapeutic subgroup (e.g., H03 for thyroid therapy) | Programmatic |
| atc_level_3 | String | ATC level 3 code - pharmacological subgroup (e.g., H03A for thyroid preparations) | Programmatic |
| atc_level_4 | String | ATC level 4 code - chemical subgroup (e.g., H03AA for thyroid hormones) | Programmatic |
| atc_level_5 | String | ATC level 5 code - chemical substance (e.g., H03AA01 for levothyroxine sodium) | Programmatic |
| l1_label | String | Human-readable label for ATC level 1 (e.g., "systemic hormonal preparations, excl. sex hormones and insulins") | Programmatic |
| l2_label | String | Human-readable label for ATC level 2 (e.g., "thyroid therapy") | Programmatic |
| l3_label | String | Human-readable label for ATC level 3 (e.g., "thyroid preparations") | Programmatic |
| l4_label | String | Human-readable label for ATC level 4 (e.g., "thyroid hormones") | Programmatic |
| l5_label | String | Human-readable label for ATC level 5 (e.g., "levothyroxine sodium") | Programmatic |
| deleted | Boolean | Flag indicating if the drug has been deprecated from the list | Manual |
| deleted_reason | String | Description of the rationale for the drug being deprecated | Manual |
| new_id | String | EC identifier of the drug that should supersede the deleted drug, if any (format: EC:XXXXX) | Manual |
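The deleted and new_id columns together describe how a deprecated entry is superseded. A minimal sketch of resolving an EC identifier to its current replacement follows; the function name, the placeholder IDs, and the assumption that supersession can chain through several entries are all illustrative rather than taken from the pipeline.

```python
def resolve_current_id(ec_id, rows_by_id):
    """Follow new_id links from a deleted entry to the entry that supersedes it.

    Returns None when a deleted entry has no replacement.
    """
    seen = set()
    while True:
        if ec_id in seen:
            raise ValueError(f"cycle in new_id links at {ec_id}")
        seen.add(ec_id)
        row = rows_by_id[ec_id]
        if not row["deleted"]:
            return ec_id  # still active, nothing to follow
        if not row.get("new_id"):
            return None  # deprecated without a successor
        ec_id = row["new_id"]


# Placeholder rows keyed by EC identifier (not real list entries).
rows_by_id = {
    "EC:00001": {"deleted": True, "new_id": "EC:00002"},
    "EC:00002": {"deleted": False, "new_id": None},
    "EC:00003": {"deleted": True, "new_id": None},
}

print(resolve_current_id("EC:00001", rows_by_id))
```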

Understanding the ATC Classification System

The Anatomical Therapeutic Chemical (ATC) Classification System is a hierarchical system maintained by the World Health Organization (WHO) for classifying drugs. The EC Drug List includes all 5 levels of this classification:

  • Level 1: Anatomical main group (single letter, e.g., H = systemic hormonal preparations)
  • Level 2: Therapeutic subgroup (3 characters, e.g., H03 = thyroid therapy)
  • Level 3: Pharmacological subgroup (4 characters, e.g., H03A = thyroid preparations)
  • Level 4: Chemical subgroup (5 characters, e.g., H03AA = thyroid hormones)
  • Level 5: Chemical substance (7 characters, e.g., H03AA01 = levothyroxine sodium)

Each level is provided with both the code and a human-readable label for easier interpretation and analysis.
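Because each ATC level is a prefix of the level 5 code, the five code columns can be derived from a single level 5 code using the character counts listed above. The sketch below illustrates this relationship; the function name is hypothetical, and it is not necessarily how the columns are produced in the pipeline.

```python
# Number of characters in the code at each ATC level, as listed above.
ATC_PREFIX_LENGTHS = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_levels(level_5_code: str) -> dict:
    """Split a 7-character ATC level 5 code into its five hierarchical codes."""
    code = level_5_code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {level_5_code!r}")
    return {
        f"atc_level_{level}": code[:length]
        for level, length in ATC_PREFIX_LENGTHS.items()
    }


print(atc_levels("H03AA01"))
```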

Curated Annotation Columns

The EC Drug List includes manually curated annotation columns that provide additional context for each drug:

  • drug_class: Classification of the drug based on mechanism or similar clinical use (e.g., "ACE inhibitor", "Statin"). This field is populated for all drugs.
  • therapeutic_area: Primary therapeutic areas where the drug is used (e.g., "Cardiovascular", "Central nervous system"). This field is populated for all drugs in the list. Where a drug is commonly used in multiple therapeutic areas, these are separated with ; (e.g., Targeted cancer therapy; Dermatology).
    The 26 therapeutic area options (not including combinations of therapeutic areas) are: Antidote, Antimicrobial, Cardiovascular, Cell therapy, Chemotherapy adjunct, Central nervous system, Dermatology, Endocrine, Gastrointestinal, Gene therapy, Genitourinary, Hematology, Immune, Metabolic, Misc, Musculoskeletal, Obstetrics, Ophthalmic, Peripheral nervous system, Renal, Respiratory, Sex hormone, Supplement, Targeted cancer therapy, Traditional cancer therapy, Urological.
  • drug_function: Functional classification describing the drug's mechanism or role (e.g., "Kinase inhibitor", "Receptor antagonist", "Enzyme replacement"). This field is populated for most drugs with a small proportion of NULL for edge cases.
  • drug_target: Primary molecular target(s) of the drug (e.g., "COX-1 and COX-2 inhibitor", "MEK inhibitor"). This field is populated for most drugs, with a small proportion of NULL for edge cases.

These annotations enable more sophisticated filtering, grouping, and analysis of drugs within the MATRIX pipeline and support various display and visualization features in downstream applications.
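Since therapeutic_area can hold several values separated by ;, filtering on it usually means splitting the field first. A minimal sketch follows; the helper names and sample rows are illustrative, and empty or missing values are assumed to mean "no area".

```python
def split_therapeutic_areas(value):
    """Split a therapeutic_area cell on ';' and trim whitespace around each part."""
    if not value:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def drugs_in_area(rows, area):
    """Return names of drugs whose therapeutic_area includes the given area."""
    return [
        row["name"]
        for row in rows
        if area in split_therapeutic_areas(row["therapeutic_area"])
    ]


# Placeholder rows; the drug names are not real list entries.
rows = [
    {"name": "Drug one", "therapeutic_area": "Targeted cancer therapy; Dermatology"},
    {"name": "Drug two", "therapeutic_area": "Cardiovascular"},
]

print(split_therapeutic_areas(rows[0]["therapeutic_area"]))
print(drugs_in_area(rows, "Dermatology"))
```

Matching on the split values rather than on a substring avoids false positives when one area name is contained in another.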

Drug class distribution

The drug_class column is designed to group drugs that share a mechanism of action or drug target, to enable navigation and linking between similar drugs. At present there are ~680 unique drug classes, and ~50% of them contain a single drug (i.e., n=1 classes). These n=1 classes are largely named after the drug itself (e.g., name = "Pizotifen", drug_class = "Pizotifen").

Drug class is defined orthogonally to the therapeutic_area column, as drugs with a similar mechanism can be applied in different clinical settings (e.g., Retinoid drugs appear in the Targeted cancer therapy, Dermatology, and Musculoskeletal therapeutic areas).

The plot below shows the distribution of the 100 largest drug classes in the drug list:

[Figure: Drug Class Distribution (Top 100 classes)]
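The class size statistics described in this section can be computed from the drug_class column alone. The sketch below uses a tiny made-up input, so its numbers do not reflect the real list; the function name and the returned keys are illustrative.

```python
from collections import Counter


def class_size_summary(drug_classes, top_n=100):
    """Summarize how many drugs fall into each drug class."""
    sizes = Counter(drug_classes)
    singletons = sum(1 for n in sizes.values() if n == 1)
    return {
        "unique_classes": len(sizes),
        # Share of classes that contain exactly one drug (n=1 classes).
        "singleton_share": singletons / len(sizes) if sizes else 0.0,
        # Largest classes first, as (class, count) pairs.
        "top": sizes.most_common(top_n),
    }


# Illustrative values only; one entry per drug in the list.
summary = class_size_summary(["Statin", "Statin", "ACE inhibitor", "Pizotifen"])
print(summary["unique_classes"], summary["top"][0])
```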

Therapeutic Classification Flags

The drug list includes nine boolean flags that identify drugs belonging to specific therapeutic categories of interest for drug repurposing research:

  • is_antipsychotic: Drugs used to treat psychotic disorders
  • is_sedative: Drugs with sedative or hypnotic effects
  • is_antimicrobial: Antibacterial, antiviral, antifungal, and antiparasitic agents
  • is_glucose_regulator: Drugs affecting glucose metabolism (antidiabetic agents, insulin, etc.)
  • is_chemotherapy: Drugs used as traditional cytotoxic chemotherapies
  • is_steroid: Corticosteroids, which are often used to manage a wide array of symptoms
  • is_analgesic: Drugs primarily used for pain relief
  • is_cardiovascular: Drugs used to treat cardiovascular conditions
  • is_cell_therapy: Cell or gene therapy products

These flags enable rapid filtering and analysis of specific drug categories within the MATRIX pipeline and are particularly useful for identifying potential safety concerns or drug-drug interactions.
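As a minimal sketch of this kind of filtering, the helper below drops every drug for which any of the chosen flags is set. The function name and the sample rows are hypothetical, and flag values are assumed to be parsed into Python booleans already.

```python
def exclude_flagged(rows, flags):
    """Keep only rows where none of the given boolean flag columns is True."""
    return [row for row in rows if not any(row[flag] for flag in flags)]


# Placeholder rows with a subset of the flag columns.
rows = [
    {"name": "Drug one", "is_chemotherapy": True, "is_antimicrobial": False},
    {"name": "Drug two", "is_chemotherapy": False, "is_antimicrobial": True},
    {"name": "Drug three", "is_chemotherapy": False, "is_antimicrobial": False},
]

kept = exclude_flagged(rows, ["is_chemotherapy", "is_antimicrobial"])
print([row["name"] for row in kept])
```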