v0.12.0
Breaking Changes 🛠
No breaking changes in this release.
Data Release Summary
Knowledge Graph v0.12.0 contains RTX-KG2, ROBOKOP, and PrimeKG, see our release history page for versioning details of each KG. Please note that to make PrimeKG Biolink compliant, we have unmerged diseases which were merged into one concept, therefore this KG is not exactly the same as the KG used for TxGNN. This release also introduces an ABox/TBox classification on all edges in the integrated graph based on the Biolink edge type, which can be used for filtering/modeling experiments. The v0.12.0 release of the EC Integrated KG is constructed with the version 2.3.26 of Node Normalizer without issue, and can be used for modelling experiments with the new drugs list (see below)
Drug List We are now using a new manually curated drug list in the MATRIX pipeline, which is shorter than the previous drugs list and mainly focused on FDA-approved drugs of therapeutic value which the EC Medical Team find most relevant with persisting EC IDs (more documentation to come). This means the size of the matrix will now be smaller, which is expected to change how modelling evaluation metrics look. The new drug list file can be found here (will also be available through core entities release) The MATRIX pipeline will consume the new drug list by default, but is backwards compatible with the previous drug list (any release before v0.11.3)
Exciting New Features 🎉
-
Automated primary knowledge source documentation pipeline: Introduced a new documentation pipeline that automatically generates content for primary knowledge sources, streamlining the documentation process and ensuring consistency across knowledge graph sources #1846
-
ABox/TBox node classification: Added support for distinguishing between ABox (assertional) and TBox (terminological) edges in the knowledge graph, enabling better ontological reasoning and knowledge representation #1895
Experiments 🧪
-
AggPath: AggPath is a transformer-based path classifier that is trained using drug-disease pair indication data via aggregation functions. link to report
-
Ground Truth Reshuffling: We examined whether with esentially random training data we get 'non-random' ranking predictions of drug diseases pairs. link to notebook
-
CBR-X Explainer: Evaluation of a case-based reasoning explainer (CBR-X) for drug–disease link prediction that is designed to be both predictive and mechanistically interpretable. link to notebook
-
Structural Bias in Drug Repurposing Model Predictions: An experiment to understanding and quantifying the effect of structural bias in drug repurposing models link to notebook
-
Improved LLM descriptions of drug and diseases: Experiment with LLM descriptions to Improve Drug/Disease Embeddings. link to notebook
-
Inclusion of additional information for drug and diseases: Experiment with MONDO Hierarchy and SMILES to Improve Drug/Disease Embeddings. link to notebook
Bugfixes 🐛
-
MLflow image pull issue resolution: Fixed critical MLflow deployment issues caused by Bitnami registry changes, ensuring reliable experiment tracking and model management #1891
-
Release patch pipeline fix: Added missing
document_kgto the release patch pipeline, ensuring all necessary components are included in patch releases #1913 -
PKS markdown generation variable fix: Corrected variable usage in primary knowledge source markdown generation, preventing template rendering errors #1909
-
Infrastructure typo fix: Fixed minor typo in infrastructure file comments for improved code clarity #1902
Technical Enhancements 🧰
-
New cross-validation strategy: Implemented an improved cross-validation approach for model training, enhancing model evaluation robustness and reliability. #1847
-
Drug list ingestion refactor: Refactored the matrix pipeline to support the new drug list ingestion format, improving data processing efficiency and maintainability #1885
-
Memory-efficient predictions: Created a memory-efficient restrict predictions node and migrated to partitioned datasets, significantly reducing memory footprint for large-scale inference tasks #1898
-
BigQuery location support: Added location parameter to SparkDatasetWithBQExternalTable for better multi-region support and data locality #1897
-
Epistemic robustness documentation: Enhanced knowledge source pages with epistemic robustness information, providing transparency about data quality and reliability #1896
-
Spot instance improvements: Disabled spot instances for non-dev environments and added conditional spot node pool configuration for improved production stability #1907
-
Spot instance removal: Completely removed spot instances from both dev and prod environments to ensure consistent infrastructure performance #1910
-
Orchard compute IAM configuration: Added orchard compute service accounts to IAM configuration for enhanced access management #1912
-
Py4J gateway timeout: Added configurable Py4J gateway startup timeout to Spark configuration, preventing connection failures in resource-constrained environments #1903
-
Workbench IAM improvements: Added IAM member resource for Service Account User role in workbench configuration, streamlining user access management #1883
-
LiteLLM Redis cache support: Added supported call types for Redis cache configuration in litellm, improving caching capabilities for LLM operations #1881
Documentation ✏️
- Attribution documentation: Added comprehensive attribution documentation for the Matrix project, properly crediting data sources and collaborators #1867
Other Changes
-
Argo Events dependency update: Updated argo-events dependency to version 2.4.16 and synchronized subproject commit for latest features and fixes #1915
-
Neo4j query logging: Enabled Neo4j query logging by default for improved debugging and performance monitoring #1906
-
BigQuery permissions: Added read permissions for the evidence project to access BigQuery datasets #1901