GCP Environments
This guide explains how to use development and production GCP environments with Kedro in the matrix project.
Warning
Note that project names presented here are related to the Matrix Project & Matrix GCP infrastructure. If you have set up MATRIX codebase on your own cloud platform, these instructions might not be directly applicable for you and might vary.
Terminology
Kedro environments
Refers to kedro environments (e.g. base, test, cloud) as defined by kedro documentation here.
Primarily, they have an influence on the location and type of inputs and outputs of your data pipeline, controlled by the data catalog.
GCP environments
It is abstracted away from the user (you don't set it, it's determined automatically). It is used as a shorthand for GCP project a pipeline is sent to for execution.
As such, gcp environment prod refers to GCP project mtrx-hub-prod-sms and dev to mtrx-hub-dev-3of. Note that prod environment is only available to a restrictred group of users at the moment.
Runtime variables
Not an environment, but a related concept.
Variables such as RUNTIME_GCP_BUCKET or RUNTIME_GCP_PROJECT refer to the bucket or project corresponding to the GCP project you want to run the pipeline in.
Example 1: You want to run the pipeline in prod, so your .env file has:
RUNTIME_GCP_PROJECT_ID=mtrx-hub-prod-sms
Example 2: You want to run the pipeline in dev, so your .env file has the following (commented out):
# RUNTIME_GCP_PROJECT_ID=mtrx-hub-prod-sms
and your .env.defaults file has:
RUNTIME_GCP_PROJECT_ID=mtrx-hub-dev-3of
Understanding GCP Environments
In our project, we have two distinct GCP environments (projects):
- Development (
dev): uses GCP projectmtrx-hub-dev-3of - Production (
prod): uses GCP projectmtrx-hub-prod-sms
These environments are separate from Kedro environments (base, cloud, test, etc.) which control pipeline configurations.
Note
Only administrators have access to the production environment.
The GCP environment controls:
- Which GCP project is used for data processing
- Which storage buckets are accessed
- Whether private datasets are available
- Security and access controls
Note
Be aware that production contains strictly confidential, private datasets.
Using GCP Environments with Kedro
Setting the GCP Environment
When using kedro experiment run or kedro run you can specify the GCP environment using the values in your .env file (for prod), or have those commented out, in which case .env.defaults will be applied, which always point to dev.
Environment Variables
To run pipelines in production (with private datasets), your .env file should contain:
RUNTIME_GCP_PROJECT_ID=mtrx-hub-prod-sms
RUNTIME_GCP_BUCKET=mtrx-us-central1-hub-prod-storage
MLFLOW_URL=https://mlflow.platform.prod.everycure.org/
ARGO_PLATFORM_URL=https://argo.platform.prod.everycure.org
GOOGLE_APPLICATION_CREDENTIALS=/Users/<YOUR_USERNAME>/.config/gcloud/application_default_credentials.json
INCLUDE_PRIVATE_DATASETS=1
To run pipelines in development (only public datasets), the above values in your .env should be commented out.
Commands kedro experiment run and kedro run are environment agnostic.
Implications of GCP Environment Selection
Data Access
- Private Datasets: In development, private datasets are automatically excluded from pipelines
- Storage Locations: Different GCP buckets are used for
devvsprodgcp-env pipeline outputs - Public Datasets: When running in production gcp-env, public datasets are still ingested from the bucket in the dev GCP project
Info
The CI pipeline is not extended to prod - testing is done in the same way in both environments. Currently, releases are only triggered in the dev environment (this will change to prod in the future)