Cloud Environment Guide
The cloud environment is designed for production-scale pipeline execution on Google Cloud Platform (GCP), using our Kubernetes cluster with Argo orchestration. It configures the MATRIX pipeline to read from and write to cloud storage and BigQuery, enabling distributed processing and scalable data operations.
Use the cloud environment for production pipeline execution on Kubernetes clusters, large-scale distributed data processing, and multi-user collaboration with centralized data storage.
Key Differences from Base Environment
Data Storage Strategy
| Aspect | Base Environment | Cloud Environment |
|---|---|---|
| Storage Location | Local filesystem | GCS + BigQuery |
| Scalability | Limited by local resources | Cloud-scale, Parallelized |
Data Path Structure
As mentioned, all data paths point to our GCS buckets. The path convention is identical to the other environments; the only difference is that the main parent directory now resolves to a GCS storage bucket.
runtime_gcs_bucket: gs://${oc.env:RUNTIME_GCP_BUCKET}
runtime_gcp_project: ${oc.env:RUNTIME_GCP_PROJECT_ID}
dev_gcs_bucket: gs://mtrx-us-central1-hub-dev-storage
prod_gcs_bucket: gs://mtrx-us-central1-hub-prod-storage
# Public GCS bucket for public datasets
public_gcs_bucket: gs://data.dev.everycure.org
# ...
paths:
  # Raw data (read-only from central buckets)
  raw: ${dev_gcs_bucket}/data/01_RAW
  raw_private: ${prod_gcs_bucket}/data/01_RAW
  # Public data sources
  raw_public: ${public_gcs_bucket}/data/01_RAW
  # Release-based storage
  ingestion: ${release_dir}/datasets/ingestion
  integration: ${release_dir}/datasets/integration
  release: ${release_dir}/datasets/release
  # Run-based storage
  filtering: ${run_dir}/datasets/filtering
  embeddings: ${run_dir}/datasets/embeddings
  modelling: ${run_dir}/datasets/modelling
  evaluation: ${run_dir}/datasets/evaluation
  matrix_generation: ${run_dir}/datasets/matrix_generation
  inference: ${run_dir}/datasets/inference
  # Distributed cache
  cache: ${runtime_gcs_bucket}/kedro/data/cache
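The `${...}` references above are resolved by OmegaConf-style interpolation: `${oc.env:VAR}` pulls a value from an environment variable, while `${dev_gcs_bucket}` references another config key. The stdlib sketch below illustrates how the composed GCS paths resolve; it is a simplified stand-in for the real interpolation, and the runtime bucket default is a placeholder, not a real bucket name.

```python
import os

# Placeholder default: on the cluster, RUNTIME_GCP_BUCKET is set for you.
os.environ.setdefault("RUNTIME_GCP_BUCKET", "my-runtime-bucket")

# Top-level keys resolve first (env-var interpolation)...
config = {
    "runtime_gcs_bucket": "gs://" + os.environ["RUNTIME_GCP_BUCKET"],
    "dev_gcs_bucket": "gs://mtrx-us-central1-hub-dev-storage",
}

# ...then paths compose on top of them (key interpolation).
paths = {
    "raw": f"{config['dev_gcs_bucket']}/data/01_RAW",
    "cache": f"{config['runtime_gcs_bucket']}/kedro/data/cache",
}

print(paths["raw"])  # gs://mtrx-us-central1-hub-dev-storage/data/01_RAW
```

The key point is that only the parent-directory keys change between environments; every downstream path is derived from them.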
MLflow Cloud Configuration
The cloud environment configures MLflow for distributed execution:
tracking:
  run:
    # Ensures stable naming during distributed execution
    name: ${oc.env:WORKFLOW_ID}
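Conceptually, the run name is taken from the `WORKFLOW_ID` environment variable set by the orchestrator, so every distributed step of a workflow reports to the same MLflow run. A minimal sketch of that lookup, assuming a hypothetical helper and fallback that are not part of the MATRIX codebase:

```python
import os

def stable_run_name(default_prefix: str = "local") -> str:
    """Return a stable run name for tracking.

    On the cluster, WORKFLOW_ID is set by the orchestrator, so all pods
    in one workflow share a run name; locally we fall back to a
    placeholder. (Illustrative helper, not part of MATRIX.)
    """
    workflow_id = os.environ.get("WORKFLOW_ID")
    return workflow_id if workflow_id else f"{default_prefix}-run"

print(stable_run_name())
```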
Running Cloud Environment
Running the cloud environment locally is not recommended, as it relies heavily on GCS and on services that are live on our cluster (e.g. the live MLflow instance). However, the cross-environment section provides instructions on how to connect to a cloud environment run and continue locally in your base environment.
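For reference, selecting the cloud environment on the cluster uses the standard Kedro environment flag; the pipeline name below is a hypothetical example, not a required value.

```shell
# Run with the cloud environment's configuration (conf/cloud).
# "--pipeline ingestion" is illustrative; substitute your actual pipeline.
kedro run --env cloud --pipeline ingestion
```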