GitHub Actions Self-Hosted Runners Deployment Guide

Overview

This guide sets up GitHub Actions self-hosted runners on the GKE cluster using Actions Runner Controller (ARC) with the following features:

Ephemeral Runners: One pod per job for security and reliability.
Docker-in-Docker: Full Docker build support.
Auto-scaling: Scales between 0 and 50 runners based on the GitHub Actions job queue.
Management Node Isolation: ARC controller runs on dedicated management nodes.

Custom Runner Image

To optimize CI performance and avoid installing dependencies on every job run, we use a custom GitHub runner image that includes pre-installed tools and dependencies.

Image Location

The custom image is hosted in our Artifact Registry:

us-central1-docker.pkg.dev/mtrx-hub-dev-3of/github-runner-images/github-runner:latest

Pre-installed Tools

The custom image extends the official GitHub Actions runner image and includes:

Python Environment:
  pyenv for Python version management
  Python 3.11 as the default version
Build Tools:
  make, build-essential, gcc toolchain
  SSL, compression, and development libraries
Java Runtime:
  OpenJDK 17 (JDK and JRE)
System Dependencies:
  curl, wget, ca-certificates
  Various development libraries (libssl-dev, zlib1g-dev, etc.)

Image Source

The Dockerfile and setup scripts are located in:

/infra/github-runner-image/Dockerfile
/infra/github-runner-image/setup_pyenv.sh

Image Build Process

The custom image is automatically built and pushed to Artifact Registry using GitHub Actions:

Workflow: .github/workflows/build_and_upload_image_for_github_runner_set_k8s.yml
Triggers:
  Changes to files in infra/github-runner-image/ directory
  Changes to the workflow file itself
  Manual dispatch via GitHub UI
Branches: Builds on main and feature branches
Tags: Creates both latest and SHA-based tags for each build
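
The triggers and tagging described above might look roughly like the following in workflow YAML. This is a minimal sketch, not the contents of the actual workflow file: the workflow name, job name, runs-on value, and the use of the full commit SHA as the tag are assumptions, and registry authentication is left out.

```yaml
name: Build GitHub runner image   # hypothetical workflow name

on:
  push:
    # No branch filter, so this runs on main and on feature branches.
    paths:
      - "infra/github-runner-image/**"
      - ".github/workflows/build_and_upload_image_for_github_runner_set_k8s.yml"
  workflow_dispatch: {}   # manual dispatch from the GitHub UI

env:
  IMAGE: us-central1-docker.pkg.dev/mtrx-hub-dev-3of/github-runner-images/github-runner

jobs:
  build-and-push:            # hypothetical job name
    runs-on: ubuntu-latest   # assumption
    steps:
      - uses: actions/checkout@v4

      # Authentication to Artifact Registry is omitted here; a real
      # workflow needs a step that logs Docker in to the registry.

      - name: Build image with latest and SHA tags
        run: |
          # GITHUB_SHA is the full commit SHA; the real tag format may differ.
          docker build \
            -t "$IMAGE:latest" \
            -t "$IMAGE:$GITHUB_SHA" \
            infra/github-runner-image

      - name: Push both tags
        run: docker push --all-tags "$IMAGE"
```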

Benefits

By using this custom image, we achieve:

Faster Job Startup: No need to install common dependencies on each run
Consistent Environment: All jobs use the same pre-configured base environment
Reduced Network Usage: Dependencies are baked into the image
Better Reliability: Pre-tested tool combinations

Deployment Steps

1. Set Up GitHub App

Create a GitHub App in the everycure-org organization
Install it on the organization
Get the App ID, Installation ID, and Private Key
Add these three values to the secrets consumed by the runner deployment
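
For reference, the gha-runner-scale-set chart can read GitHub App credentials from a Kubernetes Secret that carries the three keys shown below. This is a minimal sketch, assuming a plain Secret is used: the Secret name is hypothetical, the angle-bracket values are placeholders, and this repository may manage the values through a secret manager instead.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gha-runner-github-app        # hypothetical name
  namespace: actions-runner-system
type: Opaque
stringData:
  github_app_id: "<APP_ID>"
  github_app_installation_id: "<INSTALLATION_ID>"
  # Paste the contents of the private key (.pem) file downloaded from the GitHub App page.
  github_app_private_key: |
    <PRIVATE_KEY_PEM_CONTENTS>
```

The chart's githubConfigSecret value would then be set to the name of this Secret.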

2. Deploy via ArgoCD

The ArgoCD applications have been added to app-of-apps. ArgoCD syncs them automatically once the changes are pushed to git.
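
For readers unfamiliar with ArgoCD, an Application with automated sync generally has the shape below. This is only an illustrative sketch: the application name, ArgoCD namespace, project, target revision, and source path are all assumptions and do not reflect the actual manifests in app-of-apps.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gha-runner-scale-set     # assumption
  namespace: argocd              # assumption: namespace where ArgoCD runs
spec:
  project: default               # assumption
  source:
    repoURL: https://github.com/everycure-org/matrix
    targetRevision: main         # assumption
    path: infra/argo/gha-runner-scale-set   # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: actions-runner-system
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from git
      selfHeal: true   # revert manual changes made in the cluster
```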

3. Verify GitHub Integration

Go to https://github.com/everycure-org/matrix/settings/actions/runners
You should see "gha-runner-scale-set" listed as a runner set
It initially shows 0 runners because runners are only created on demand

Usage in GitHub Actions

Use this in your .github/workflows/*.yml files:

name: Example Workflow
on: [push, pull_request]

jobs:
  build:
    runs-on: gha-runner-scale-set # <-- Use this value
    steps:
      - uses: actions/checkout@v4

      - name: Build with Docker
        run: |
          docker build -t my-app .
          docker run --rm my-app npm test

Key Features

Cost Optimization

Scale to Zero: No runners when no jobs are queued
Right-sizing: e2-standard-8 instances (8 vCPUs, 32GB RAM)

Security & Reliability

Ephemeral Runners: Fresh pod for each job
Workload Isolation: Runners are scheduled on dedicated spot nodes, kept separate from other workloads by taints
Resource Limits: CPU/memory constraints prevent abuse
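
The taints and resource limits listed above could be expressed in the runner scale set's Helm values roughly as follows. This is a sketch under stated assumptions: the taint key and value and all CPU and memory figures are hypothetical, and the real values.yaml may be structured differently. The cloud.google.com/gke-spot label is the one GKE sets on spot nodes.

```yaml
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-spot: "true"
    tolerations:
      - key: "workload"          # hypothetical taint key
        operator: "Equal"
        value: "github-runner"   # hypothetical taint value
        effect: "NoSchedule"
    containers:
      - name: runner
        # image and command omitted for brevity
        resources:
          requests:
            cpu: "2"        # hypothetical
            memory: "8Gi"   # hypothetical
          limits:
            cpu: "4"        # hypothetical
            memory: "16Gi"  # hypothetical
```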

Docker Support

Docker-in-Docker: Full Docker daemon per runner.
Image Caching: Persistent storage for Docker images is possible but requires extra configuration.
Multi-stage Builds: Full Docker feature support.
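
In the gha-runner-scale-set chart, a per-runner Docker daemon is typically enabled through the containerMode value, which adds a DinD sidecar to each runner pod. The fragment below is a minimal sketch that assumes this deployment uses that mechanism together with the custom image; the actual values.yaml may define the DinD container by hand instead.

```yaml
containerMode:
  type: "dind"   # the chart injects a Docker-in-Docker sidecar per runner pod

template:
  spec:
    containers:
      - name: runner
        image: us-central1-docker.pkg.dev/mtrx-hub-dev-3of/github-runner-images/github-runner:latest
        command: ["/home/runner/run.sh"]
```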

Limitation

Docker Compose often runs into a situation where a container gets stuck during initialization. Possible causes include:

Storage Driver Compatibility & Performance
Security and Privileges
Networking and Docker Compose Behavior

Avoid running docker compose on these runners. Instead, start each component directly with docker run.
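
As an illustration of replacing a compose file with plain docker run, the workflow below starts one dependency and the application on a shared user-defined network. It is a sketch only: the Postgres image, the network and container names, and the DATABASE_URL variable are hypothetical stand-ins for whatever the compose file would have defined.

```yaml
jobs:
  integration:
    runs-on: gha-runner-scale-set
    steps:
      - uses: actions/checkout@v4

      - name: Start dependencies with docker run
        run: |
          # A user-defined network lets containers reach each other by name.
          docker network create ci-net
          docker run -d --name db --network ci-net \
            -e POSTGRES_PASSWORD=postgres postgres:16
          # Wait up to 60 seconds for the database to accept connections.
          timeout 60 sh -c 'until docker exec db pg_isready -U postgres; do sleep 1; done'

      - name: Build and test
        run: |
          docker build -t my-app .
          docker run --rm --network ci-net \
            -e DATABASE_URL=postgres://postgres:postgres@db:5432/postgres \
            my-app npm test

      - name: Clean up
        if: always()
        run: |
          docker rm -f db || true
          docker network rm ci-net || true
```

Because runners are ephemeral, the clean-up step is not strictly required, but it keeps the job well behaved if it is ever run elsewhere.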

Monitoring & Troubleshooting

Check Runner Status

# List runner scale sets (the gha-runner-scale-set chart exposes them as AutoscalingRunnerSet resources)
kubectl get autoscalingrunnersets -n actions-runner-system

# Check individual runners
kubectl get pods -n actions-runner-system -l app=gha-runner-scale-set

# View controller logs
kubectl logs -n actions-runner-system deployment/gha-runner-scale-set-controller

GitHub Actions Queue

# Check if runners are being created for queued jobs
kubectl describe autoscalingrunnerset gha-runner-scale-set -n actions-runner-system

Common Issues

No runners scaling: Check GitHub App permissions and installation
Pod scheduling failures: Verify node taints/tolerations and ensure spot nodes are available
Docker issues: Check DinD container logs in runner pods
Cannot connect to Docker daemon: Ensure DinD sidecar is running properly
Docker build failures: Check that both runner and DinD containers have adequate resources
Containers stuck when started through docker compose: Avoid docker compose and start each component with docker run instead

Scaling Configuration

Current limits:

Min Runners: 0 (cost optimization)
Max Runners: 50 (can be increased in values.yaml)
Scale Up: 2x factor, 30s grace period
Scale Down: 60s after job completion
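
The minimum and maximum correspond to two values in the gha-runner-scale-set chart, sketched below. Only these two keys are shown because they are the ones that can be stated with confidence; where the scale-up factor and the grace periods are configured is not covered here.

```yaml
minRunners: 0    # scale to zero when no jobs are queued
maxRunners: 50   # raise this to allow more concurrent jobs
```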

Cost Tracking

Runners are labeled for billing:

billing-category: github-actions
cost-center: compute-workloads
workload-category: ci-cd
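
One way such labels end up on runner pods is through the pod template in the chart's values. The fragment below is a minimal sketch that assumes the labels are applied this way; the real configuration may attach them elsewhere, for example at the node pool level.

```yaml
template:
  metadata:
    labels:
      billing-category: github-actions
      cost-center: compute-workloads
      workload-category: ci-cd
```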