Matrix Pipeline on the Cluster
This guide covers running the complete Matrix pipeline on the Kubernetes cluster using Argo Workflows. The cluster provides distributed computing capabilities that allow for parallel processing of large datasets.
Warning
This section focuses heavily on infrastructure that is specific to the Matrix project and its GCP setup, so it is only directly applicable if you have access to our infrastructure.
If you intend to adapt the Matrix codebase and infrastructure to your own cloud system, these instructions may still give you an idea of how we use the cluster, but they will not be 1:1 transferable.
Prerequisites
Before running on the cluster, ensure you have:
- Cluster Access: Completed cluster setup
- Authentication: Valid GCP credentials and cluster access
- Environment Variables: Proper `.env` configuration for your target environment
- Resource Understanding: Knowledge of Argo resource configuration
Environment Configuration
Required Environment Variables
Configure your `.env` file based on your target environment:

Development:

```bash
# GCP Project and Storage
RUNTIME_GCP_PROJECT_ID=mtrx-hub-dev-3of
RUNTIME_GCP_BUCKET=mtrx-us-central1-hub-dev-storage

# MLflow Configuration
MLFLOW_URL=https://mlflow.platform.dev.everycure.org/

# Argo Platform
ARGO_PLATFORM_URL=https://argo.platform.dev.everycure.org

# Authentication
GOOGLE_APPLICATION_CREDENTIALS=/Users/<YOUR_USERNAME>/.config/gcloud/application_default_credentials.json

# Dataset Access (development only has public datasets)
INCLUDE_PRIVATE_DATASETS=0
```

Production:

```bash
# GCP Project and Storage
RUNTIME_GCP_PROJECT_ID=mtrx-hub-prod-sms
RUNTIME_GCP_BUCKET=mtrx-us-central1-hub-prod-storage

# MLflow Configuration
MLFLOW_URL=https://mlflow.platform.prod.everycure.org/

# Argo Platform
ARGO_PLATFORM_URL=https://argo.platform.prod.everycure.org

# Authentication
GOOGLE_APPLICATION_CREDENTIALS=/Users/<YOUR_USERNAME>/.config/gcloud/application_default_credentials.json

# Dataset Access (production includes private datasets)
INCLUDE_PRIVATE_DATASETS=1
```
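Missing or mistyped variables are a common cause of failed cluster runs, so it can pay off to check them before launching. The sketch below is not part of the Matrix codebase; it simply validates that the variables listed above are set:

```python
import os

# Variables the cluster pipeline expects (taken from the examples above).
REQUIRED_VARS = [
    "RUNTIME_GCP_PROJECT_ID",
    "RUNTIME_GCP_BUCKET",
    "MLFLOW_URL",
    "ARGO_PLATFORM_URL",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "INCLUDE_PRIVATE_DATASETS",
]

def missing_env_vars(environ: dict) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example: a partially configured environment still has gaps to fix.
env = {
    "RUNTIME_GCP_PROJECT_ID": "mtrx-hub-dev-3of",
    "MLFLOW_URL": "https://mlflow.platform.dev.everycure.org/",
}
missing = missing_env_vars(env)  # variables still left to configure
```

In a real shell session you would pass `os.environ` instead of a hand-built dict.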
Run Configuration
Set a unique run name and release version:
```bash
# Unique identifier for your run
RUN_NAME=my-full-cluster-run

# Release version for output organization
RELEASE_VERSION=v0.7.0
RELEASE_FOLDER_NAME=releases
```
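These variables determine where outputs land in the bucket. The helper below is a hypothetical illustration (the exact layout is defined by the pipeline's path configuration, not this function) of how a release directory is composed from them:

```python
def release_dir(bucket: str, release_folder: str, release_version: str) -> str:
    """Compose the GCS directory for a release (illustrative layout)."""
    return f"gs://{bucket}/{release_folder}/{release_version}"

# Using the example values from above:
path = release_dir("mtrx-us-central1-hub-dev-storage", "releases", "v0.7.0")
# -> "gs://mtrx-us-central1-hub-dev-storage/releases/v0.7.0"
```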
Cloud Environment Overview
The cloud environment is specifically designed for pipeline execution on GCP using our Kubernetes Cluster with Argo orchestration. Key differences from the base environment include:
- Storage Strategy: Uses GCS buckets and BigQuery instead of local filesystem
- Scalability: Enables cloud-scale parallel processing vs local resource limitations
- Data Paths: All paths point to GCS buckets following the same structure:
```yaml
paths:
  raw: ${dev_gcs_bucket}/kedro/data/01_raw
  ingestion: ${release_dir}/datasets/ingestion
  integration: ${release_dir}/datasets/integration
  # ... etc
```
- MLflow Integration: Uses a live MLflow service deployed on the cluster for metrics and parameter tracking
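The `${...}` placeholders are resolved by the config loader's interpolation (OmegaConf under Kedro). The toy resolver below is not the real mechanism, but it shows the substitution idea, including how `${release_dir}` can itself reference `${dev_gcs_bucket}`:

```python
import re

def resolve(template: str, context: dict) -> str:
    """Repeatedly substitute ${key} references until none remain."""
    pattern = re.compile(r"\$\{([^}]+)\}")
    while pattern.search(template):
        template = pattern.sub(lambda m: context[m.group(1)], template)
    return template

# Hypothetical values mirroring the snippet above:
context = {
    "dev_gcs_bucket": "gs://mtrx-us-central1-hub-dev-storage",
    "release_dir": "${dev_gcs_bucket}/releases/v0.7.0",
}
resolved = resolve("${release_dir}/datasets/ingestion", context)
# -> "gs://mtrx-us-central1-hub-dev-storage/releases/v0.7.0/datasets/ingestion"
```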
Argo Resource Configuration
The Matrix pipeline uses ArgoNode and ArgoResourceConfig to request specific Kubernetes resources for each pipeline step. This ensures optimal resource allocation and parallel execution.
Default Resource Configuration
The pipeline uses these default resources per node:
```python
# Memory (GiB)
KUBERNETES_DEFAULT_LIMIT_RAM = 52
KUBERNETES_DEFAULT_REQUEST_RAM = 52

# CPU (cores)
KUBERNETES_DEFAULT_LIMIT_CPU = 14
KUBERNETES_DEFAULT_REQUEST_CPU = 4

# GPUs
KUBERNETES_DEFAULT_NUM_GPUS = 0
```
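These defaults ultimately become a Kubernetes container `resources` spec on the generated Argo pod. The function below is an illustrative sketch (not the Matrix implementation) of that mapping, with the defaults above as parameter defaults:

```python
def k8s_resources(request_cpu=4, limit_cpu=14, request_ram_gib=52,
                  limit_ram_gib=52, num_gpus=0):
    """Render resource settings as a Kubernetes container `resources` spec."""
    resources = {
        "requests": {"cpu": str(request_cpu), "memory": f"{request_ram_gib}Gi"},
        "limits": {"cpu": str(limit_cpu), "memory": f"{limit_ram_gib}Gi"},
    }
    if num_gpus:
        # GPUs are requested via the limits section in Kubernetes.
        resources["limits"]["nvidia.com/gpu"] = str(num_gpus)
    return resources

spec = k8s_resources()  # the default node: 4 CPU requested, 14 CPU limit, 52Gi RAM
```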
Custom Resource Requests
For compute-intensive steps, you can specify custom resources using predefined configurations:
```python
from matrix.kedro4argo_node import ArgoNode, ArgoResourceConfig

ArgoNode(
    func=nodes.reduce_embeddings_dimension,
    inputs={
        "df": "embeddings.feat.graph.node_embeddings@spark",
        "unpack": "params:embeddings.dimensionality_reduction",
    },
    outputs="embeddings.feat.graph.pca_node_embeddings",
    name="apply_pca",
    tags=["argowf.fuse", "argowf.fuse-group.node_embeddings"],
    argo_config=ArgoResourceConfig(
        cpu_request=14,
        cpu_limit=14,
        memory_limit=120,
        memory_request=120,
        ephemeral_storage_request=256,
        ephemeral_storage_limit=256,
    ),
)
```
Note that this `ArgoNode` is just a wrapper around a Kedro node, designed specifically for cluster runs, that lets us granularly control resources for individual parts of the pipeline. Alternatively, you can use one of the pre-existing Argo node configurations:

```python
from matrix.kedro4argo_node import ArgoNode, ARGO_GPU_NODE_MEDIUM

# Example: Using GPU resources for embedding computation
ArgoNode(
    func=compute_embeddings,
    inputs=["processed_graph"],
    outputs=["embeddings"],
    name="compute_embeddings",
    tags=["embeddings"],
    argo_config=ARGO_GPU_NODE_MEDIUM,  # Requests GPU resources
)
```

Fuse Tags
You might have noticed that, in addition to the Argo resource configuration, there are some Argo-related tags. These tags fuse Kedro nodes into one Argo node: a series of Kedro nodes under a specific fuse-group tag (e.g. `argowf.fuse-group.<group_name>`) is executed on a single machine. This is beneficial when we don't want to persist a transient intermediate data product (e.g. bucketized embeddings for parallel processing).
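The fuse-group mechanism described above can be illustrated with a small sketch. This is not the Matrix implementation, just a toy that partitions node names by their `argowf.fuse-group.<group_name>` tag; nodes without a fuse tag end up alone in their own group (i.e. their own machine):

```python
FUSE_GROUP_PREFIX = "argowf.fuse-group."

def fuse_groups(nodes: list[tuple[str, list[str]]]) -> dict[str, list[str]]:
    """Group node names by fuse-group tag; untagged nodes stand alone."""
    groups: dict[str, list[str]] = {}
    for name, tags in nodes:
        group = next(
            (t[len(FUSE_GROUP_PREFIX):] for t in tags if t.startswith(FUSE_GROUP_PREFIX)),
            name,  # no fuse tag: the node forms its own group
        )
        groups.setdefault(group, []).append(name)
    return groups

# Hypothetical pipeline: two embedding steps share a machine, training does not.
nodes = [
    ("apply_pca", ["argowf.fuse", "argowf.fuse-group.node_embeddings"]),
    ("bucketize", ["argowf.fuse", "argowf.fuse-group.node_embeddings"]),
    ("train_model", []),
]
grouped = fuse_groups(nodes)
# -> {"node_embeddings": ["apply_pca", "bucketize"], "train_model": ["train_model"]}
```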
MLflow Integration
The cluster pipeline automatically integrates with MLflow for experiment tracking and model versioning.
MLflow Configuration
MLflow is configured through the mlflow.yml file in your environment configuration:
```yaml
mlflow:
  tracking_uri: ${MLFLOW_URL}
  experiment_name: ${RUN_NAME}
  artifact_root: ${gcs_bucket}/runs/${run_name}/mlflow
  registry_uri: ${MLFLOW_URL}
```
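The environment-style placeholders in this file are filled in by the configuration machinery at load time. As a rough stdlib sketch of that substitution step (the real resolution is handled by the config loader, not code like this):

```python
from string import Template

# mlflow.yml values with ${VAR}-style placeholders, as shown above.
raw_cfg = {
    "tracking_uri": "${MLFLOW_URL}",
    "experiment_name": "${RUN_NAME}",
}

def resolve_mlflow_config(raw: dict, environ: dict) -> dict:
    """Substitute ${VAR} placeholders from the given environment mapping."""
    return {key: Template(value).substitute(environ) for key, value in raw.items()}

env = {
    "MLFLOW_URL": "https://mlflow.platform.dev.everycure.org/",
    "RUN_NAME": "my-full-cluster-run",
}
cfg = resolve_mlflow_config(raw_cfg, env)
# cfg["tracking_uri"] now points at the live MLflow service for the run
```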