v0.10.0
Breaking Changes
- Migration to UV Package Manager: Major dependency management overhaul replacing requirements.txt with UV workspace. This introduces a new workspace structure with individual libraries (matrix-auth, matrix-fabricator, matrix-gcp-datasets, matrix-mlflow-utils) extracted from the main pipeline. This change improves dependency isolation and build times but requires developers to use uv sync instead of pip install -r requirements.txt #1768
Exciting New Features
- Orchard Feedback Dataset Addition: Added Orchard feedback dataset integration for external validation and feedback loop improvements #1740
- Orchard Feedback Data Integration: Updated orchard transformer to map feedback data to MATRIX format, enabling integration of external validation data #1782
- Enhanced Validation for Fabricator Pipeline: Added comprehensive data validation to the fabricator pipeline using Pandera schemas, improving data quality assurance and early error detection during synthetic data generation #1714
- DrugBank & EC Ground Truth Lists Integration: Integrated authoritative drug and indication lists from DrugBank and Every Cure, expanding the knowledge base with high-quality ground truth data for improved drug repurposing predictions #1763
- EC Indication List Ingestion: Added support for ingesting Every Cure's curated indication list, providing additional ground truth data for model training and validation #1787
- Docker Image Cleanup Automation: Implemented automated cleanup of Docker images on workflow success, reducing storage costs and improving resource management in the CI/CD pipeline #1805
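The fabricator validation entry above refers to Pandera schemas, which these notes do not show. As a rough illustration of the kind of early check such a schema expresses, here is a plain-Python stand-in. It is deliberately not Pandera's API, and the `Column` class, the `validate_rows` helper, the column names and the allowed categories are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Column:
    """One column rule: expected type, optional value check, nullability."""
    dtype: type
    check: Callable[[Any], bool] = lambda value: True
    nullable: bool = False


def validate_rows(rows, schema):
    """Return a list of error messages; an empty list means every row passes."""
    errors = []
    for index, row in enumerate(rows):
        for name, column in schema.items():
            if name not in row:
                errors.append(f"row {index}: missing column '{name}'")
                continue
            value = row[name]
            if value is None:
                if not column.nullable:
                    errors.append(f"row {index}: '{name}' is null")
                continue
            if not isinstance(value, column.dtype):
                errors.append(f"row {index}: '{name}' has type {type(value).__name__}")
            elif not column.check(value):
                errors.append(f"row {index}: '{name}' failed its check")
    return errors


# Hypothetical schema for synthetic node rows (names are assumptions).
NODE_SCHEMA = {
    "id": Column(str, check=lambda v: ":" in v),  # CURIE-like "PREFIX:local"
    "category": Column(str, check=lambda v: v in {"Drug", "Disease"}),
}
```

Collecting all errors before failing, rather than stopping at the first one, is what makes this style of check useful during synthetic data generation: a single run reports every malformed row.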
Experiments 🧪
- Features and Modelling Integration: Added features and modelling components to the weekly pipeline run, enabling regular evaluation of model performance and feature engineering improvements #1631
- Graph Rewiring: Experiment with random shuffling of edges and of embeddings to assess the impact on model performance (Report)
- Graph Slicing: Experiment filtering out certain node types to assess the impact on model performance when 'noise' is removed (Report)
- Ground Truth Experiments: Experiment benchmarking different ground truth sets for training our ML system (Several Reports Here)
- Ground Truth Experiments - Negative Sampling: Experiment benchmarking different ground truth sets for training our ML system, comparing different negative sampling strategies (Several Reports Here)
- Evidence Synthesis: Evidence synthesis benchmark and comparison with Matrix predictions (Report)
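The graph rewiring and graph slicing experiments listed above can be pictured with a small sketch. It assumes a graph held as a list of `(source, target)` tuples and a dict of node types, and it is not the experiments' actual code. The function names are hypothetical, and the rewiring shown here (shuffling targets across edges) is only one possible scheme; the reports may use a different one.

```python
import random


def rewire_edges(edges, seed=0):
    """Shuffle targets across edges. Sources and edge count are preserved,
    but self-loops and duplicate edges may appear (an accepted simplification)."""
    rng = random.Random(seed)
    targets = [target for _, target in edges]
    rng.shuffle(targets)
    return [(source, target) for (source, _), target in zip(edges, targets)]


def shuffle_embeddings(embeddings, seed=0):
    """Reassign embedding vectors to random nodes; the set of vectors is unchanged."""
    rng = random.Random(seed)
    nodes = list(embeddings)
    vectors = [embeddings[node] for node in nodes]
    rng.shuffle(vectors)
    return dict(zip(nodes, vectors))


def slice_graph(node_types, edges, excluded_types):
    """Drop nodes of the excluded types and every edge that touches them."""
    kept = {n: t for n, t in node_types.items() if t not in excluded_types}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges


# Toy graph with made-up node types.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
node_types = {"a": "Drug", "b": "Disease", "c": "Publication", "d": "Drug"}

rewired = rewire_edges(edges, seed=1)
sliced_nodes, sliced_edges = slice_graph(node_types, edges, {"Publication"})
```

Training on the rewired or embedding-shuffled graph gives a baseline for how much the model depends on real graph structure, while training on the sliced graph shows whether the removed node types were contributing signal or noise.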
Bugfixes
- Neo4j Topological Embeddings Fix: Resolved critical issue in Neo4j configuration that was preventing proper generation of topological embeddings, restoring graph-based feature extraction capabilities #1815
- Module Name Correction: Fixed incorrect module names that were causing import errors in production deployments #1821
- Release Process UV Command Issue: Fixed missing UV command in the automated release process that was preventing proper dependency resolution during release builds #1825
- BigQuery SQL Query Fixes: Corrected broken SQL queries in the KG dashboard that were preventing proper data visualization and reporting #1808
- Node Normalization Error Logging: Improved error logging in core node normalization process to provide better debugging information when data processing fails #1806
- Ground Truth Table Names Update: Updated ground truth table references in the KG dashboard to match the new table naming conventions #1817
- Release History Page Fix: Fixed broken release history page generation and display, ensuring proper documentation of version history #1792
- Requirements.txt Synchronization: Fixed synchronization issues with requirements.txt to ensure consistent dependency versions across environments #1774
- Documentation .gitignore Fix: Added docs data directory to .gitignore to prevent accidental commit of generated documentation files #1828
Technical Enhancements 🧰
- Spot Instance Implementation: Migrated MATRIX pipeline runs to GKE Spot Instances with fallback mechanisms, reducing infrastructure costs by up to 80% while maintaining reliability #1771, #1788
- Artifact Registry with Cleanup Policies: Added comprehensive Artifact Registry module with automated cleanup policies and documentation, improving container image lifecycle management #1717
- GKE Node Capacity Increase: Bumped GKE node disk size to 1.5TB and disabled image deletion policy to support larger workloads and improve storage reliability #1798
- Spark Temporary Directory Configuration: Enhanced Spark configuration with proper temporary directory management, preventing disk space issues during large data processing jobs #1816
- Enhanced Node Deduplication: Improved category assignment logic in node deduplication process, resulting in better data quality and reduced redundancy #1786
- Matrix Transformations Output Repartitioning: Optimized data partitioning for matrix transformation outputs, improving processing performance and reducing memory pressure #1726
- Ephemeral Volume Management: Created generic ephemeral volumes with persistent disk CSI tied to pods, improving storage performance and cost efficiency #1799
- Argo Workflows Archive Logging: Enabled archive logs for Argo Workflows controller, improving debugging capabilities and workflow monitoring #1795
- Enhanced Monitoring Configuration: Updated kube-state-metrics configuration to include pod containers in metric labels, providing better observability #1733
- Dynamic Ground Truth Ingestion: Made ground truth data ingestion more dynamic and configurable, allowing for easier addition of new data sources #1766
- Weekly Dependency Updates: Added automated weekly workflow to update MATRIX dependencies, ensuring security patches and performance improvements are regularly applied #1775
- Cost Optimization Infrastructure: Multiple cost-cutting measures including removal of local SSDs, backup agent configuration optimization, and improved resource allocation #1796, #1731
Documentation ✏️
- Installation Instructions Update: Enhanced Linux installation guide with pyenv setup steps and improved developer onboarding documentation #1748
- External Contributor Documentation: Updated documentation to reflect lessons learned from public external contributor testing, improving the contribution experience #1764
Other Changes
- Neo4j Ingestion Optimization: Modified Neo4j ingestion to only occur on monthly minor releases, reducing resource usage and improving pipeline efficiency #1823
- BigQuery Output Optimization: Only write final filtered tables to BigQuery, reducing storage costs and improving query performance #1819
- BigQuery Access Permissions: Allowed MATRIX PROD environment to access Orchard Datasets in BigQuery for cross-project data integration #1803
- Clinical Trials Data Migration: Moved Clinical Trials and off-label data to public datasets, improving data accessibility and compliance #1760
- Payload Size Optimization: Increased payload size limits and fixed string conversion issues for better data handling capacity #1773, #1776
- PySpark Version Update: Updated PySpark to version 3.5.6 for improved performance and bug fixes #1753
- Disease List Ingestion Refactor: Refactored disease list ingestion to use pandas.CSVDataset for better data handling and validation #1750
- ARGO Configuration for Stability Pipeline: Added ARGO configuration to core stability pipeline for better workflow management #1747
- Node Category Filtering: Added node category filters to the filtering pipeline, improving data quality and reducing noise #1730
- Release History Link: Added release history link to KG dashboard home page for better user navigation #1790
- Sampling Pipeline Schedule: Modified sampling pipeline to run only on weekdays, optimizing resource usage #1804
- Platform Documentation: Added comprehensive platform refactor and standardization documentation #1706