This document provides an overview and guide for the shared Kubernetes (k8s) cluster running on GCP in our hub and spoke project setup. The cluster resides in the hub project and is accessible by all development teams / working groups from their respective spoke projects through the shared Virtual Private Cloud (VPC).
Gateway API Setup
Our project utilizes the Kubernetes Gateway API, which represents the next generation of Kubernetes Ingress, Load Balancing, and Service Mesh APIs. It's designed to be generic, expressive, and role-oriented. Google's GKE documentation visualizes the core idea of this setup well in a diagram (see the Gateway API on GKE links under Additional Resources below).
Components
1. External DNS
We've configured External DNS to work with Gateway API resources:
```yaml
external-dns:
  provider:
    name: google
  extraArgs:
    - --source=gateway-httproute
  rbac:
    additionalPermissions:
      - apiGroups: ["gateway.networking.k8s.io"]
        resources:
          ["gateways", "httproutes", "grpcroutes", "tlsroutes", "tcproutes", "udproutes"]
        verbs: ["get", "watch", "list"]
      - apiGroups: [""]
        resources: ["namespaces"]
        verbs: ["get", "watch", "list"]
```
This configuration allows External DNS to manage DNS records based on Gateway API resources, particularly HTTPRoutes.
2. Cert Manager
We've enabled Cert Manager to work with Gateway API:
```yaml
cert-manager:
  config:
    enableGatewayAPI: true
```
This allows Cert Manager to provision and manage TLS certificates for Gateway resources.
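For illustration, a Gateway that delegates certificate management to Cert Manager might look roughly like the sketch below. This is not our actual `external-http` definition; the issuer name, gateway class, and secret name are assumptions:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: external-http
  annotations:
    # Cert Manager watches Gateways carrying this annotation and issues
    # certificates into the secrets referenced by the TLS listeners.
    cert-manager.io/cluster-issuer: letsencrypt-prod # illustrative issuer name
spec:
  gatewayClassName: gke-l7-global-external-managed # assumed GKE gateway class
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "whoami-test.platform.dev.everycure.org"
      tls:
        mode: Terminate
        certificateRefs:
          - name: whoami-tls # Cert Manager creates and renews this secret
```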
Example: Whoami Service
We've set up a simple "whoami" service to demonstrate the use of Gateway API:
```yaml
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: whoami-route
spec:
  parentRefs:
    - kind: Gateway
      name: external-http
  hostnames:
    - "whoami-test.platform.dev.everycure.org"
  rules:
    - backendRefs:
        - name: whoami
          port: 80
```
This HTTPRoute resource:
- Associates with a Gateway named "external-http"
- Routes traffic for the hostname "whoami-test.platform.dev.everycure.org"
- Directs traffic to the "whoami" service on port 80
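Taken together with the components above, this one resource drives the whole flow: because External DNS watches HTTPRoutes, the hostname is published as a DNS record pointing at the Gateway's external address, and, provided the Gateway carries a Cert Manager issuer annotation, a matching TLS certificate is provisioned automatically. The service is then reachable at the hostname above.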
Key Concepts
- GatewayClass: Defines a set of Gateways with a common configuration and behavior.
- Gateway: Describes how traffic can be translated to Services within the cluster.
- HTTPRoute: Describes how HTTP requests should be routed by a Gateway.
Our setup leverages these concepts to provide a flexible and powerful routing solution.
Additional Resources
- Using Gateway for Ingress: https://gateway-api.sigs.k8s.io/guides/
- External DNS & Gateway: https://kubernetes-sigs.github.io/external-dns/v0.13.1/tutorials/gateway-api/
- Cert Manager Configuration:
    - ACME: https://cert-manager.io/docs/configuration/acme/
    - Cert Manager and Gateway: https://cert-manager.io/docs/usage/gateway/
- Gateway API on GKE:
    - How it Works: https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api
    - Securing with IAP: https://cloud.google.com/kubernetes-engine/docs/how-to/secure-gateway
Deploying applications to the cluster
Service Deployment Configuration
Services are deployed to the cluster using Helm charts. The default configuration provisions services on nodes with a maximum capacity of 32 GB RAM, managed through taints and node selectors.
For services requiring larger compute resources (e.g., >32 GB RAM), explicit taint tolerations must be configured in the Helm chart. This ensures proper scheduling on nodes with the required capacity.
Implementation Examples
Reference implementations for high-memory workloads can be found in:
- `gke.tf`: Infrastructure configuration
- `pipelines/matrix/templates/argo_wf_spec.tmpl`: Pipeline node configuration
For instance, our pipeline nodes require 64 GB RAM and demonstrate the necessary taint toleration configuration.
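As a rough sketch (the taint key, value, and node label below are illustrative assumptions; check `gke.tf` for the actual values), a pod targeting high-memory nodes might be configured like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-memory-example
spec:
  # Illustrative node label; the real node pool labels are defined in gke.tf.
  nodeSelector:
    workload: high-memory
  # Tolerate the (assumed) taint that keeps ordinary pods off these nodes.
  tolerations:
    - key: workload
      operator: Equal
      value: high-memory
      effect: NoSchedule
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          memory: 64Gi # matches the 64 GB requirement of our pipeline nodes
```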
Using the cluster for Spark processing
Warning: not yet implemented, volunteers welcome.
Using the cluster for Ray processing
Warning: not yet implemented, volunteers welcome.
Access and Permissions
Submitting workflows
We enabled Google Groups RBAC for the Argo Workflows application. Access to Argo Workflows is managed through Google Groups, specifically:

- The cluster is configured with Google Groups for RBAC using the security group `gke-security-groups@everycure.org`
- Argo Workflows access is granted to:
    - Individual users (can be added in the configuration)
    - Members of the `techteam@everycure.org` Google Group
The configuration is managed in the `values.yaml` file for the `developer-iam` Argo application:
```yaml
argo:
  namespace: argo-workflows
  rbac:
    users:
      # add new users here that should be able to submit workflows
      # - user@example.com
    groups:
      - techteam@everycure.org
```
To request access to submit workflows:
- Ensure you are a member of the `techteam@everycure.org` Google Group
- For individual access, request to be added to the `users` section in the configuration (should rarely be needed)
For more details on Google Groups RBAC setup, refer to the official documentation.
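Once access is granted, workflows can be submitted to the `argo-workflows` namespace. A minimal sketch (image and names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello- # Argo appends a random suffix
  namespace: argo-workflows
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox:1.36 # illustrative image
        command: ["echo", "hello from the shared cluster"]
```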
Administering the cluster
We will provide `roles/container.developer` to everyone in the MATRIX project who should be able to administer the cluster.
Accessing applications running on the cluster
Applications are exposed through web-based interfaces accessible from the browser. Those that should not be publicly accessible are protected by Identity-Aware Proxy (IAP); see the Google documentation linked above for more details.
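On GKE, IAP for Gateway backends is configured through a `GCPBackendPolicy` attached to the Service. A minimal sketch (client ID, secret, and service names are illustrative; see the "Securing with IAP" guide linked above):

```yaml
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: whoami-iap-policy
spec:
  default:
    iap:
      enabled: true
      oauth2ClientSecret:
        name: iap-oauth-secret # secret holding the OAuth client secret
      clientID: example-client-id.apps.googleusercontent.com # illustrative
  targetRef:
    group: ""
    kind: Service
    name: whoami
```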
Cluster Configuration
Cluster Setup
The shared Kubernetes cluster is hosted in the hub project. Key configuration details include:
- Region: `us-central1` (by default)
- Node Pools: Configured with autoscaling enabled
- Network: Shared VPC
- GPU nodes: Currently not enabled, but we plan to add these as a scale-to-zero autoscaling group
Networking
Networking between the hub and spoke projects leverages a shared VPC:
- VPC Name: `matrix-hub-dev-nw`
- Subnets: one per region, e.g. EU and US
- Firewall Rules: Configured to allow necessary communication between the hub and spoke projects, as well as outgoing HTTPS and incoming SSH via GCP IAP
Secrets
All secrets are created through our `cloud_secrets` Terraform module, which reads our encrypted YAML file from disk and creates a cloud secret for each entry.
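As a rough sketch of the file the module consumes (the keys and values below are illustrative assumptions, not our actual secrets):

```yaml
# Illustrative structure only: each top-level key in the encrypted YAML file
# becomes one cloud secret with the corresponding value.
some-service-api-key: "<redacted>"
db-password: "<redacted>"
```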