v0.10.0

Breaking Changes 🛠

  • Migration to UV Package Manager: Major dependency management overhaul that replaces requirements.txt with a UV workspace. The new workspace structure extracts individual libraries (matrix-auth, matrix-fabricator, matrix-gcp-datasets, matrix-mlflow-utils) from the main pipeline. This improves dependency isolation and build times, but developers must now run uv sync instead of pip install -r requirements.txt #1768

Exciting New Features 🎉

  • Orchard Feedback Dataset Addition: Added Orchard feedback dataset integration for external validation and feedback loop improvements #1740

  • Orchard Feedback Data Integration: Updated orchard transformer to map feedback data to MATRIX format, enabling integration of external validation data #1782

  • Enhanced Validation for Fabricator Pipeline: Added comprehensive data validation to the fabricator pipeline using Pandera schemas, improving data quality assurance and early error detection during synthetic data generation #1714

  • DrugBank & EC Ground Truth Lists Integration: Integrated authoritative drug and indication lists from DrugBank and Every Cure, expanding the knowledge base with high-quality ground truth data for improved drug repurposing predictions #1763

  • EC Indication List Ingestion: Added support for ingesting Every Cure's curated indication list, providing additional ground truth data for model training and validation #1787

  • Docker Image Cleanup Automation: Implemented automated cleanup of Docker images on workflow success, reducing storage costs and improving resource management in the CI/CD pipeline #1805

Experiments 🧪

  • Features and Modelling Integration: Added features and modelling components to the weekly pipeline run, enabling regular evaluation of model performance and feature engineering improvements #1631
  • Graph Rewiring: Experimented with randomly shuffling edges, and separately embeddings, to assess the impact on model performance (Report)
  • Graph Slicing: Experimented with filtering out certain node types to assess the impact on model performance when 'noise' is removed (Report)
  • Ground Truth Experiments: Benchmarked different ground truth sets for training our ML system (several reports here)
  • Ground Truth Experiments (Negative Sampling): Benchmarked different ground truth sets for training our ML system, comparing different negative sampling strategies (several reports here)
  • Evidence Synthesis: Evidence synthesis benchmark and comparison with MATRIX predictions (Report)
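
For readers unfamiliar with the rewiring and slicing controls mentioned above, the following is a minimal sketch of what such perturbations could look like. It is not the code used in the experiments: the triple-based edge representation, the function names (rewire_edges, shuffle_embeddings, slice_graph) and the node_types lookup are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical edge representation: (subject, predicate, object) triples.
Edge = tuple[str, str, str]


def rewire_edges(edges: list[Edge], seed: int = 0) -> list[Edge]:
    """Randomly reassign edge objects; subjects and predicates stay in place."""
    rng = random.Random(seed)
    objects = [obj for _, _, obj in edges]
    rng.shuffle(objects)
    return [(s, p, o) for (s, p, _), o in zip(edges, objects)]


def shuffle_embeddings(
    embeddings: dict[str, list[float]], seed: int = 0
) -> dict[str, list[float]]:
    """Randomly reassign embedding vectors across the same set of node ids."""
    rng = random.Random(seed)
    vectors = list(embeddings.values())
    rng.shuffle(vectors)
    return dict(zip(embeddings.keys(), vectors))


def slice_graph(
    edges: list[Edge], node_types: dict[str, str], excluded_types: set[str]
) -> list[Edge]:
    """Keep only edges whose two endpoints are outside the excluded node types."""
    return [
        (s, p, o)
        for s, p, o in edges
        if node_types[s] not in excluded_types
        and node_types[o] not in excluded_types
    ]
```

In this sketch, rewiring keeps the number of edges per subject and the overall set of objects unchanged, so a change in model performance would reflect the loss of the true pairings rather than a change in graph size. Slicing, by contrast, shrinks the graph by removing every edge that touches an excluded node type.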

Bugfixes 🐛

  • Neo4j Topological Embeddings Fix: Resolved critical issue in Neo4j configuration that was preventing proper generation of topological embeddings, restoring graph-based feature extraction capabilities #1815

  • Module Name Correction: Fixed incorrect module names that were causing import errors in production deployments #1821

  • Release Process UV Command Issue: Fixed missing UV command in the automated release process that was preventing proper dependency resolution during release builds #1825

  • BigQuery SQL Query Fixes: Corrected broken SQL queries in the KG dashboard that were preventing proper data visualization and reporting #1808

  • Node Normalization Error Logging: Improved error logging in core node normalization process to provide better debugging information when data processing fails #1806

  • Ground Truth Table Names Update: Updated ground truth table references in the KG dashboard to match the new table naming conventions #1817

  • Release History Page Fix: Fixed broken release history page generation and display, ensuring proper documentation of version history #1792

  • Requirements.txt Synchronization: Fixed synchronization issues with requirements.txt to ensure consistent dependency versions across environments #1774

  • Documentation .gitignore Fix: Added docs data directory to .gitignore to prevent accidental commit of generated documentation files #1828

Technical Enhancements 🧰

  • Spot Instance Implementation: Migrated MATRIX pipeline runs to GKE Spot Instances with fallback mechanisms, reducing infrastructure costs by up to 80% while maintaining reliability #1771, #1788

  • Artifact Registry with Cleanup Policies: Added comprehensive Artifact Registry module with automated cleanup policies and documentation, improving container image lifecycle management #1717

  • GKE Node Capacity Increase: Bumped GKE node disk size to 1.5 TB and disabled the image deletion policy to support larger workloads and improve storage reliability #1798

  • Spark Temporary Directory Configuration: Enhanced Spark configuration with proper temporary directory management, preventing disk space issues during large data processing jobs #1816

  • Enhanced Node Deduplication: Improved category assignment logic in node deduplication process, resulting in better data quality and reduced redundancy #1786

  • Matrix Transformations Output Repartitioning: Optimized data partitioning for matrix transformation outputs, improving processing performance and reducing memory pressure #1726

  • Ephemeral Volume Management: Created generic ephemeral volumes with persistent disk CSI tied to pods, improving storage performance and cost efficiency #1799

  • Argo Workflows Archive Logging: Enabled archive logs for Argo Workflows controller, improving debugging capabilities and workflow monitoring #1795

  • Enhanced Monitoring Configuration: Updated kube-state-metrics configuration to include pod containers in metric labels, providing better observability #1733

  • Dynamic Ground Truth Ingestion: Made ground truth data ingestion more dynamic and configurable, allowing for easier addition of new data sources #1766

  • Weekly Dependency Updates: Added automated weekly workflow to update MATRIX dependencies, ensuring security patches and performance improvements are regularly applied #1775

  • Cost Optimization Infrastructure: Multiple cost-cutting measures including removal of local SSDs, backup agent configuration optimization, and improved resource allocation #1796, #1731

Documentation ✏️

  • Installation Instructions Update: Enhanced Linux installation guide with pyenv setup steps and improved developer onboarding documentation #1748

  • External Contributor Documentation: Updated documentation to reflect lessons learned from public external contributor testing, improving the contribution experience #1764

Other Changes

  • Neo4j Ingestion Optimization: Modified Neo4j ingestion to only occur on monthly minor releases, reducing resource usage and improving pipeline efficiency #1823

  • BigQuery Output Optimization: Only write final filtered tables to BigQuery, reducing storage costs and improving query performance #1819

  • BigQuery Access Permissions: Allowed MATRIX PROD environment to access Orchard Datasets in BigQuery for cross-project data integration #1803

  • Clinical Trials Data Migration: Moved Clinical Trials and off-label data to public datasets, improving data accessibility and compliance #1760

  • Payload Size Optimization: Increased payload size limits and fixed string conversion issues for better data handling capacity #1773, #1776

  • PySpark Version Update: Updated PySpark to version 3.5.6 for improved performance and bug fixes #1753

  • Disease List Ingestion Refactor: Refactored disease list ingestion to use pandas.CSVDataset for better data handling and validation #1750

  • Argo Configuration for Stability Pipeline: Added Argo configuration to the core stability pipeline for better workflow management #1747

  • Node Category Filtering: Added node category filters to the filtering pipeline, improving data quality and reducing noise #1730

  • Release History Link: Added release history link to KG dashboard home page for better user navigation #1790

  • Sampling Pipeline Schedule: Modified sampling pipeline to run only on weekdays, optimizing resource usage #1804

  • Platform Documentation: Added comprehensive platform refactor and standardization documentation #1706