Tech Stack

Our drug repurposing pipeline relies on many libraries and components; some of the most critical are:

  • Python - Our primary programming language, used throughout the codebase for data processing, analysis and machine learning
  • Docker - Containerization platform that ensures consistent environments across development and production
  • uv - Modern Python package installer and resolver that we use instead of pip for faster, more reliable dependency management
  • Java Virtual Machine (JVM) - Required for Apache Spark, which we use for distributed data processing
  • gcloud SDK - Google Cloud Platform tools that enable interaction with our cloud infrastructure and services

Google

Our platform uses Google Cloud Platform (GCP) as its cloud provider. Many parts of the onboarding deep-dive guides and documentation depend on or refer to GCP; these are marked appropriately.

While it's not essential to understand every part of the stack to contribute to or run the pipeline, we encourage everyone to learn more about these technologies.

Pipeline framework: Kedro

The most essential library in our pipeline is Kedro, our data pipelining framework. It provides crucial structure and modularity to our codebase, enabling reproducible data science workflows and making our pipeline maintainable and scalable.

Info

Kedro is an open-source framework for writing modular data science code. We recommend checking out the Kedro documentation website, as well as the deep dive into Kedro and our custom extensions in the deep-dive section.

Below is a 5-minute intro video to Kedro:

As mentioned in the video, Kedro is a fairly lightweight framework built around the following key concepts:

  1. Project template: Standard directory structure to streamline project layout, i.e., configuration, data, and pipelines.
  2. Data catalog: A lightweight abstraction for datasets, abstracting references to the file system, in a compact configuration file.
  3. Pipelines: A pipeline object abstraction composed of nodes that read from and write to datasets defined in the data catalog.¹
  4. Environments: Environments allow for codifying the execution environment of the pipeline.
  5. Visualization: Out-of-the-box pipeline visualization based directly on the source code.
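To make the node/catalog idea concrete, here is a minimal conceptual sketch in plain Python. It mimics how Kedro resolves a node's declared inputs from the catalog and stores its outputs back; it is not Kedro's actual API (real nodes come from kedro.pipeline.node and the catalog is configured in YAML), and all dataset and function names are illustrative:

```python
# Conceptual sketch of how Kedro wires nodes to catalog entries.
# Not the real Kedro API; names below are hypothetical.

def normalise_scores(raw_scores):
    """A plain function; Kedro nodes wrap ordinary functions like this."""
    total = sum(raw_scores.values())
    return {drug: score / total for drug, score in raw_scores.items()}

# The "catalog" maps dataset names to data (in Kedro, to files/tables).
catalog = {"raw_scores": {"drug_a": 2.0, "drug_b": 6.0}}

# A "node" declares which catalog entries it reads and writes.
node = {"func": normalise_scores, "inputs": "raw_scores", "outputs": "scores"}

# A minimal runner: resolve inputs from the catalog, store outputs back.
catalog[node["outputs"]] = node["func"](catalog[node["inputs"]])

print(catalog["scores"])  # {'drug_a': 0.25, 'drug_b': 0.75}
```

Because nodes only name their inputs and outputs, Kedro can infer the execution order of a whole pipeline from these declarations, which is also what powers its out-of-the-box visualization.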

Our core pipeline's Kedro project lives in the pipelines/matrix directory, with an associated README.md containing instructions.
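Inside a Kedro project's conf directory, the data catalog mentioned above is typically a small YAML file. A hedged sketch of what an entry looks like (dataset names, types, and paths below are illustrative, and exact type names vary across kedro-datasets versions):

```yaml
# conf/base/catalog.yml -- illustrative entries, not our actual catalog
raw_scores:
  type: pandas.CSVDataset
  filepath: data/01_raw/raw_scores.csv

scores:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/scores.parquet
```

Nodes then refer to these datasets purely by name ("raw_scores", "scores"), so storage details stay out of the pipeline code.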

Let's now look at how Kedro fits into our repository structure.


  ¹ Kedro allows fine-grained control over pipeline execution through the kedro run command.
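As a rough illustration of that fine-grained control, here are a few kedro run invocations. Pipeline and node names are hypothetical, and exact flag names can differ between Kedro versions, so check kedro run --help in your environment:

```shell
# Illustrative kedro invocations (names and flags are examples only).

# Run the full default pipeline:
kedro run

# Run only a named pipeline, in a specific configuration environment:
kedro run --pipeline=data_processing --env=cloud

# Run a slice of the pipeline between two nodes:
kedro run --from-nodes=clean_data --to-nodes=train_model
```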