Environments overview
Environments
We have five environments declared in the Kedro project for MATRIX:
- base: The base environment, which reads the real data from GCS and operates in your local compute environment.
- cloud: The cloud environment with real data. All data is read from and written to a GCP project as configured (see below). Assumes fully stateless local machine operations (e.g. in Docker containers).
- test: A fully local environment that executes an end-to-end smoke test of the pipeline using mock data and simplified parameters (e.g. 2-dimensional PCA instead of 100) to test the pipeline lineage quickly and without computationally expensive operations.
- local: A default environment which you can use for local adjustments and tweaks. Changes here are not usually committed to git, as they are unique to every developer.
- sample: Contains a sample of the data and is useful for fast iterations on the pipeline from the embeddings pipeline onwards.
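In a standard Kedro project layout, each environment corresponds to a folder under conf/. The sketch below follows Kedro's conventions; the exact contents of each folder in this project may differ:

```
conf/
├── base/     # default catalog and parameters: real data from GCS, local compute
├── cloud/    # overrides for fully cloud-based runs on GCP
├── test/     # mock data and simplified parameters for smoke tests
├── local/    # per-developer tweaks, not usually committed to git
└── sample/   # sampled data for fast iteration from the embeddings pipeline on
```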
Info
Remember the .env.default and .env files mentioned in the repository structure? Our cloud environment is equipped with environment variables that allow for controlling your credentials (e.g. a GitHub token) or configuring the GCP project to use (more about this in the deep dive).
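As a sketch, such an .env file might look like the following; the variable names and values here are illustrative assumptions, not the project's actual keys:

```shell
# .env — illustrative example; the real variable names may differ
GITHUB_TOKEN=<your-github-token>   # hypothetical: credential for private repos
GCP_PROJECT_ID=<your-gcp-project>  # hypothetical: target GCP project for cloud runs
```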
You can run any of the environments using the --env flag. For example, to run the pipeline in the cloud environment, you will use the following command:
kedro run --env cloud # NOTE: this is just an example; do not run it
Note that our cloud environment both reads and writes all intermediate data products to our Google Cloud Storage. In general, it should only be used for pipeline runs executed on our Kubernetes cluster.
Run with fake data locally
To run the full pipeline locally with fake data, you can use the following command:
kedro run --env test -p test
This runs the full pipeline with fake data. This is exactly what we did as part of make integration_test in the previous section, but now without the make wrapper.
Run with real data locally
To run the full pipeline with real data, first copy the raw data from the central GCS bucket and then run everything locally from the default environment. We've set up an intermediate ingestion pipeline that copies the data once, to avoid repeatedly pulling it from the cloud.
# Copy data from cloud to local
kedro run -p ingestion
Afterwards, you can run the default pipeline.
# Default pipeline in default environment
kedro run -p data_engineering
Run with sample data locally
To run the pipeline from the embeddings step onwards with a smaller dataset for testing or development purposes, use the sample environment:
# Run pipeline with sample data
kedro run -e sample -p test_sample
Info
Environments are abstracted away by Kedro's data catalog which is, in turn, defined as configuration in YAML. The catalog is dynamic, in the sense that it can combine the base environment with another environment during execution. This allows for overriding some of the configuration in base such that data can flow into different systems according to the selected environment.
The image below represents a pipeline configuration across three environments: base, cloud, and test. By default, the pipeline reads from Google Cloud Storage (GCS) and writes to the local filesystem. The cloud environment redefines the output dataset to write to BigQuery (as opposed to the local filesystem). The test environment redefines the input dataset to read the output of the fabricator pipeline, so that the pipeline runs on synthetic data.
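As a sketch of how such overrides look in Kedro's catalog YAML (the dataset names, paths, and GCP identifiers below are made up for illustration; each section represents a separate catalog file):

```yaml
# --- conf/base/catalog.yml: read from GCS, write locally ---
model_input:
  type: pandas.ParquetDataset
  filepath: gs://matrix-bucket/model_input.parquet  # hypothetical bucket

model_output:
  type: pandas.ParquetDataset
  filepath: data/07_model_output/model_output.parquet

# --- conf/cloud/catalog.yml: redefine the output to write to BigQuery ---
model_output:
  type: pandas.GBQTableDataset  # hypothetical dataset/project names
  dataset: matrix
  table_name: model_output
  project: my-gcp-project

# --- conf/test/catalog.yml: redefine the input to read fabricator output ---
model_input:
  type: pandas.ParquetDataset
  filepath: data/01_raw/fabricated/model_input.parquet
```

At run time, Kedro merges the selected environment's catalog on top of base, so only the redefined datasets change while everything else is inherited.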
Now that you have a good understanding of different environments, we can run the pipeline with a sample of real data.