Attribution
General acknowledgement
Work in the Matrix project was made possible by the contributions of countless free-software developers and open data curators, whose work underpins much of our modern infrastructure.
In the following we acknowledge some of the core resources that drive our success.
- Data sources
- First-level knowledge sources are the immediate data sources included in the Matrix project, typically for the purpose of pair prediction.
- Primary knowledge sources are the raw data sources that are part of the knowledge graphs used for pair prediction.
- Ground Truth lists serve as evaluation data for drug-disease pair prediction algorithms.
- Mondo Disease Ontology is used as the backbone for the Every Cure disease list.
- Core Non-KG data resources
- Software sources
Data and knowledge sources
First-level knowledge sources
First-level knowledge sources are the data sources that we use directly in the Matrix pipeline. A brief attribution with citation for each follows. For more information about the use cases of these specific sources and their applicability to drug repurposing, see here.
RTX-KG2
MATRIX integrates information from RTX-KG2, a large-scale biomedical knowledge graph developed by the Translator RTX team. RTX-KG2 aggregates and harmonizes knowledge from dozens of authoritative biomedical databases and ontologies into a single, semantically consistent graph aligned with the Biolink Model. It provides a rich source of curated biological and clinical associations that support reasoning and drug repurposing use cases. For details on sources and construction, see the RTX-KG2 documentation and this publication:
Wood, E.C., Glen, A.K., Kvarfordt, L.G. et al.
RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine.
BMC Bioinformatics 23, 400 (2022).
doi: 10.1186/s12859-022-04932-3
ROBOKOP
MATRIX builds on resources from ROBOKOP (Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways), a question-answering system and knowledge graph developed as part of the NCATS Translator program. ROBOKOP combines graph reasoning services with biomedical knowledge integration, enabling exploration of mechanistic hypotheses across diseases, drugs, and biological processes. Its graph-based reasoning services have informed MATRIX’s approach to query expansion, pathway exploration, and candidate prioritization. For details, see the ROBOKOP portal and this publication:
Bizon C, Cox S, Balhoff J, Kebede Y, Wang P, Morton K, Fecho K, Tropsha A. ROBOKOP KG and KGB: Integrated Knowledge Graphs from Federated Sources. J Chem Inf Model. 2019 Dec 23;59(12):4968-4973. doi: 10.1021/acs.jcim.9b00683. Epub 2019 Dec 12. PMID: 31769676; PMCID: PMC11646564.
SPOKE
Private data source
Note that Every Cure uses this data source in the MATRIX pipeline but does not distribute it. Please reach out to the data owners directly for access.
MATRIX builds on SPOKE (Scalable Precision Medicine Oriented Knowledge Engine), a large heterogeneous biomedical knowledge graph developed at UCSF. SPOKE integrates a wide variety of biomedical databases into a single graph, capturing relationships among genes, proteins, diseases, drugs, and clinical concepts. Its graph-based representations have informed MATRIX’s downstream analyses for identifying novel therapeutic opportunities. For details, see the SPOKE portal and this publication:
Morris JH, Soman K, Akbas RE, Zhou X, Smith B, Meng EC, Huang CC, Cerono G, Schenk G, Rizk-Jackson A, Harroud A, Sanders L, Costes SV, Bharat K, Chakraborty A, Pico AR, Mardirossian T, Keiser M, Tang A, Hardi J, Shi Y, Musen M, Israni S, Huang S, Rose PW, Nelson CA, Baranzini SE. The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information. Bioinformatics. 2023 Feb 3;39(2):btad080. doi: 10.1093/bioinformatics/btad080. PMID: 36759942; PMCID: PMC9940622.
EmBiology
Private data source
Note that Every Cure uses this data source in the MATRIX pipeline but does not distribute it. Please reach out to the data owners directly for access.
MATRIX leverages EmBiology, a proprietary dataset from Elsevier that encodes curated relationships among biomedical entities extracted from the scientific literature. EmBiology combines large-scale natural language processing with expert curation to capture connections between diseases, drugs, targets, and mechanisms of action to better understand disease biology. Further information on EmBiology is available from Elsevier.
PrimeKG
MATRIX builds on resources from PrimeKG, a precision medicine knowledge graph developed to support drug repurposing and clinical translation research. PrimeKG integrates a wide range of biomedical entities — including diseases, drugs, genes, and biological pathways — into a single harmonized framework. For details, see the GitHub repo and this publication:
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci Data 10, 67 (2023). https://doi.org/10.1038/s41597-023-01960-3
Core Non-KG data resources
Core Non-KG data resources are resources that are used throughout the Matrix ecosystem, but do not directly feed into its data infrastructure.
Monarch Initiative
- Provides additional information about diseases during drug-disease medical reviews.
- API access (https://api-v3.monarchinitiative.org/openapi.json)
MalaCards
Provides additional information about diseases during drug-disease medical reviews.
PubMed (NCBI)
Literature search to provide additional evidence for drug-disease pairs during medical reviews.
PubChem (NCBI)
Provides additional information about drugs during drug-disease medical reviews.
DrugBank
Provides additional information about drugs during drug-disease medical reviews.
ClinicalTrials.gov
Clinical trials search to provide additional evidence for drug-disease pairs during medical reviews.
Mondo Disease Ontology
MATRIX builds on resources from Mondo Disease Ontology (MONDO), an open, community-driven ontology that harmonizes disease definitions across numerous medical vocabularies. MONDO provides a unified, semantically consistent set of disease identifiers, enabling interoperability across biomedical datasets and facilitating disease-centric reasoning within MATRIX. Its integrative approach to aligning rare and common diseases is especially valuable for drug repurposing applications. For details, see the Mondo project page and this publication:
Nicole A Vasilevsky et al.
Mondo: Unifying diseases for the world, by the world.
medRxiv 2022.04.13.22273750
doi: 10.1101/2022.04.13.22273750
Software
Here we acknowledge a few of the central pieces in our ecosystem. This list is not exhaustive. If you think a piece of software is worth highlighting here, let us know on our issue tracker.
Kedro
A Python framework for building reproducible, maintainable data science pipelines, used in MATRIX to structure and orchestrate ETL workflows.
PySpark
The Python interface to Apache Spark, enabling distributed data processing and large-scale transformations in MATRIX’s integration pipeline.
Docker
A containerization platform that ensures MATRIX software components run in consistent, portable environments.
Neo4j
A graph database optimized for querying and exploring biomedical relationships, used to host and analyze the integrated MATRIX knowledge graph.
Acknowledgment
Neo4j provides Every Cure with a free license as part of its Graphs4Good scheme (https://neo4j.com/graphs4good/).
MLflow
An open-source platform for tracking machine learning experiments and supporting reproducibility within MATRIX’s AI workflows.
Kubernetes (K8s)
A container orchestration system that manages scaling, deployment, and resilience of MATRIX’s cloud-native components.
Argo Workflows
A Kubernetes-native workflow engine used for defining and running MATRIX’s complex, multi-step data pipelines.
Terraform / Terragrunt
Infrastructure-as-code tools that provision and manage MATRIX’s cloud environments in a reproducible and versioned way.
Docker Compose
A tool for defining and running multi-container applications locally, supporting MATRIX development and testing.
NCATS Node Normalizer
A Translator service for mapping biomedical entity identifiers across vocabularies, supporting consistent normalization in MATRIX.
NCATS Name Resolver
A Translator service for resolving biomedical entity names into standardized identifiers used throughout MATRIX.
ARAX Node Normalizer
An alternative node normalization service developed by the ARAX team, leveraged in MATRIX for identifier harmonization and redundancy checks.
LiteLLM
A lightweight library that unifies APIs for large language models, enabling MATRIX workflows to flexibly integrate multiple LLM providers.
GitHub
A collaborative platform for version control and open-source development, hosting MATRIX code, issues, and community contributions.
Acknowledgment
Every Cure benefits from GitHub for Nonprofits.
Slack
A team communication platform used by MATRIX collaborators for coordination, discussions, and rapid issue resolution.
Acknowledgment
Every Cure benefits from Slack for Charities.
Anthropic
An AI research company providing advanced language models, integrated in MATRIX’s exploratory workflows for curation and analysis.
Acknowledgment
Anthropic provides Every Cure with free credits.
Google Cloud
A cloud services platform supporting MATRIX infrastructure, including compute, storage, and scalable deployment environments.
Acknowledgment
Google Cloud provides Every Cure with free credits.