Spot Instance Implementation for Matrix Pipeline Infrastructure
Overview
This document outlines the comprehensive changes made to implement Google Cloud Platform (GCP) Spot instances across the Matrix pipeline infrastructure. The implementation focuses on cost optimization while maintaining high availability and reliability for pipeline workloads.
Executive Summary
- Cost Impact: Up to 80% cost reduction on compute resources by leveraging GCP Spot instances.
- Reliability: Graceful fallback to regular instances ensures pipeline reliability.
- Scope: Full coverage of all pipeline workloads (CPU and GPU compute).
Changes Made
1. GKE Infrastructure Changes (infra/modules/stacks/compute_cluster/gke.tf)
1.1 New Spot Node Pools
N2D Highmem Spot Node Pools
n2d_spot_node_pools = [for size in [8, 16, 32, 48, 64] : {
  name         = "n2d-highmem-${size}-spot-nodes"
  machine_type = "n2d-highmem-${size}"
  spot         = true
  max_count    = 20 # Higher than regular pools for availability
  # ... other configuration
}]
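For readers less familiar with Terraform `for` expressions, the following Python sketch mirrors how the list above expands into one pool definition per machine size. It reproduces only the attributes shown in the snippet (the elided configuration is left out), and the function name is hypothetical.

```python
def n2d_spot_node_pools(sizes=(8, 16, 32, 48, 64)):
    """Mirror the Terraform for-expression: one spot pool per machine size."""
    return [
        {
            "name": f"n2d-highmem-{size}-spot-nodes",
            "machine_type": f"n2d-highmem-{size}",
            "spot": True,
            "max_count": 20,
        }
        for size in sizes
    ]


pools = n2d_spot_node_pools()
for pool in pools:
    print(pool["name"], "->", pool["machine_type"])
```

The result is five spot pools, from n2d-highmem-8-spot-nodes through n2d-highmem-64-spot-nodes.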
GPU Spot Node Pools
gpu_spot_node_pools = [
  {
    name              = "g2-standard-16-l4-spot-nodes"
    machine_type      = "g2-standard-16"
    accelerator_count = 1
    accelerator_type  = "nvidia-l4"
    spot              = true
    max_count         = 30 # Higher for spot availability
    # ... other configuration
  }
]
1.2 Node Pool Configuration Updates
Increased Max Counts for Spot Pools
- N2D Spot pools: max_count = 20 (vs 10 for regular)
- GPU Spot pools: max_count = 30 (vs 20 for regular)
- Rationale: Higher capacity to handle spot instance preemptions
Node Pool Integration
node_pools_combined = concat(
  local.n2d_node_pools,
  local.gpu_node_pools,
  local.management_node_pools,
  local.n2d_spot_node_pools, # Added
  local.gpu_spot_node_pools  # Added
)
1.3 Taints and Labels
Spot Node Taints
"g2-standard-16-l4-spot-nodes" = [
  {
    key    = "nvidia.com/gpu"
    value  = "present"
    effect = "NO_SCHEDULE"
  },
  {
    key    = "spot"
    value  = "true"
    effect = "NO_SCHEDULE"
  },
  {
    key    = "workload"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
]
Node Labels for Cost Tracking
node_pools_labels = {
  for pool in local.node_pools_combined : pool.name => merge(
    {
      spot_node = lookup(pool, "spot", false) ? "true" : "false"
      # Yields gpu-compute, gpu-compute-spot, cpu-compute or cpu-compute-spot.
      # Assumption: GPU pools are the ones that set accelerator_count.
      billing-category = join("-", compact([
        lookup(pool, "accelerator_count", 0) > 0 ? "gpu" : "cpu",
        "compute",
        lookup(pool, "spot", false) ? "spot" : "",
      ]))
    }
  )
}
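The labelling rule is easier to check outside Terraform. The Python sketch below mirrors the intent described by the comments in the snippet above: a spot flag that defaults to false, and a billing category chosen by pool type and spot status. How a GPU pool is detected is an assumption (a non-zero `accelerator_count`); the function name and the regular pool name are hypothetical.

```python
def node_pool_labels(pool):
    """Return the cost-tracking labels for one node pool definition."""
    # Same default as lookup(pool, "spot", false) in the Terraform code.
    is_spot = pool.get("spot", False)
    # Assumption: a pool is a GPU pool when it declares accelerators.
    is_gpu = pool.get("accelerator_count", 0) > 0
    category = ("gpu" if is_gpu else "cpu") + "-compute"
    if is_spot:
        category += "-spot"
    return {
        "spot_node": "true" if is_spot else "false",
        "billing-category": category,
    }


gpu_spot = {"name": "g2-standard-16-l4-spot-nodes", "accelerator_count": 1, "spot": True}
cpu_regular = {"name": "n2d-highmem-8-nodes"}  # hypothetical regular pool, no spot attribute

print(node_pool_labels(gpu_spot))
print(node_pool_labels(cpu_regular))
```

A pool without a `spot` attribute is labelled as non-spot, which is the behaviour the `lookup` default is meant to provide.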
1.4 Fixed Terraform Compatibility Issues
Solution: Used lookup(pool, "spot", false) for safe attribute access. The call returns false for any pool that does not define a spot attribute, which avoids having to explicitly add the attribute to every non-spot pool as well.
2. Argo Workflow Template Changes (pipelines/matrix/templates/argo_wf_spec.tmpl)
2.1 Node Affinity Configuration
Spot-First Scheduling Strategy
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # Highest priority: GKE spot instances
      - weight: 100
        preference:
          matchExpressions:
            - key: cloud.google.com/gke-spot
              operator: In
              values: ["true"]
Result: Pipeline workloads now prefer spot instances. Because the rule is a preference (preferredDuringSchedulingIgnoredDuringExecution) rather than a hard requirement, the scheduler tries spot nodes first and falls back to standard nodes when no spot node is available.
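To illustrate why a preferred rule gives "spot first, standard otherwise", here is a deliberately simplified Python model of preference scoring. It is a sketch only: the real Kubernetes scheduler combines node affinity with other scoring plugins, and the node names and helper functions here are hypothetical.

```python
SPOT_PREFERENCE = [
    {"weight": 100, "key": "cloud.google.com/gke-spot", "values": ["true"]},
]


def affinity_score(node_labels, preferences):
    """Sum the weights of all preferred terms that the node's labels satisfy."""
    return sum(
        pref["weight"]
        for pref in preferences
        if node_labels.get(pref["key"]) in pref["values"]
    )


def pick_node(nodes, preferences):
    """Pick the schedulable node with the highest score; None if there are no nodes."""
    if not nodes:
        return None
    return max(nodes, key=lambda n: affinity_score(n["labels"], preferences))


regular = {"name": "regular-b", "labels": {}}
spot = {"name": "spot-a", "labels": {"cloud.google.com/gke-spot": "true"}}

print(pick_node([regular, spot], SPOT_PREFERENCE)["name"])  # spot-a
print(pick_node([regular], SPOT_PREFERENCE)["name"])        # regular-b
```

A node that matches no preference scores zero but stays eligible, which is what makes the fallback to standard nodes possible.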
2.2 Tolerations for Spot Nodes
Updated Tolerations
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "node-memory-size"
    operator: "Equal"
    value: "large"
    effect: "NoSchedule"
  - key: "spot" # Added for spot instances
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
Result: Tolerations allow pods to run on spot nodes; affinity weights them as the first choice. If no spot nodes exist, they will still schedule on regular nodes.
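The interaction between taints and tolerations can be checked with a small Python sketch. It models only the `Equal` operator used in the template. The GPU spot taints are taken from section 1.3; the CPU spot taints are an assumption, since this document does not list them, and the helper names are hypothetical.

```python
def tolerates(toleration, taint):
    """Equal-operator match: key, value and effect must all agree."""
    return all(toleration[field] == taint[field] for field in ("key", "value", "effect"))


def untolerated_taints(tolerations, taints):
    """Return the taints that would still block scheduling on a node."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


def taint(key, value):
    return {"key": key, "value": value, "effect": "NoSchedule"}


# Tolerations from the workflow template.
pod_tolerations = [taint("workload", "true"), taint("node-memory-size", "large"), taint("spot", "true")]

# Assumed taints on a CPU spot pool (not shown in this document).
cpu_spot_taints = [taint("spot", "true"), taint("workload", "true")]
# Taints on the GPU spot pool, as listed in section 1.3.
gpu_spot_taints = [taint("nvidia.com/gpu", "present"), taint("spot", "true"), taint("workload", "true")]

print(untolerated_taints(pod_tolerations, cpu_spot_taints))
print([t["key"] for t in untolerated_taints(pod_tolerations, gpu_spot_taints)])
```

Under these assumptions the three tolerations are enough for the CPU spot pool, while the GPU spot pool additionally requires a toleration for nvidia.com/gpu. Whether that toleration is supplied elsewhere is not shown in this document.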
2.3 Retry strategy and fallback
To handle short-lived interruptions caused by spot instance reclamations, the Argo workflow template includes a targeted retryStrategy that will automatically retry a task when the failure looks like a spot eviction, and avoid retrying for other failure classes (for example, out-of-memory failures).
Behaviour summary:
- Retries are only attempted when the last failure message matches common eviction indicators such as pod deleted, imminent node shutdown or node is draining.
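- Retries are explicitly disabled for OOM failures by requiring lastRetry.exitCode != 137. Exit code 137 is 128 + 9, meaning the container was terminated by SIGKILL, which is the signal the Linux OOM killer sends.
- The configured limits and backoff are conservative: limit: 3 with an exponential backoff starting at duration: "1" (one second, since Argo treats a bare number as seconds) and factor: "5", so successive retries wait roughly 1, 5 and 25 seconds.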
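As a quick check of the backoff arithmetic, the sketch below computes the wait before each retry. It assumes the delay before retry n is duration × factor^(n-1), with no maximum duration configured; the function name is hypothetical.

```python
def backoff_delays(limit=3, duration_s=1, factor=5):
    """Seconds waited before each retry under a simple exponential backoff."""
    return [duration_s * factor ** attempt for attempt in range(limit)]


delays = backoff_delays()
print(delays)       # [1, 5, 25]
print(sum(delays))  # 31
```

With the configured values, a task that is evicted on every attempt spends about half a minute in backoff in total before the workflow gives up.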
Snippet from the template (pipelines/matrix/templates/argo_wf_spec.tmpl):
retryStrategy:
  limit: 3
  expression: |
    (
      lastRetry.message matches '.*pod deleted.*' ||
      lastRetry.message matches '.*imminent node shutdown.*' ||
      lastRetry.message matches '.*node is draining.*'
    ) && lastRetry.exitCode != 137
  backoff:
    duration: "1"
    factor: "5"
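The decision encoded in the expression can be mirrored in Python to test candidate failure messages before changing the template. This is a sketch of the logic only, not of how Argo evaluates expressions: it treats the exit code as an integer, and the function name and sample messages are hypothetical.

```python
import re

# Same patterns as the retryStrategy expression.
EVICTION_PATTERNS = [
    r".*pod deleted.*",
    r".*imminent node shutdown.*",
    r".*node is draining.*",
]
OOM_EXIT_CODE = 137  # 128 + SIGKILL (9)


def should_retry(message, exit_code):
    """Retry only when the failure looks like an eviction and is not an OOM kill."""
    looks_like_eviction = any(re.search(p, message) for p in EVICTION_PATTERNS)
    return looks_like_eviction and exit_code != OOM_EXIT_CODE


print(should_retry("pod deleted", -1))                               # True
print(should_retry("Pod was rejected: imminent node shutdown", -1))  # True
print(should_retry("OOMKilled (exit code 137)", 137))                # False
print(should_retry("Error (exit code 1)", 1))                        # False
```

An ordinary application error is not retried either, because its message matches none of the eviction patterns.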
Why this helps:
- Spot nodes are reclaimed unpredictably. The retry strategy lets a short interruption be retried automatically so the pod can be rescheduled (potentially onto a non-spot node given the template's preferred-but-not-required spot affinity).
- Explicitly excluding OOM ensures we don't waste retries on failures that are unlikely to succeed by rescheduling.
Where to change it:
- Update the retryStrategy block in pipelines/matrix/templates/argo_wf_spec.tmpl to adjust match patterns, retry limits, or backoff behaviour.
- If you change node affinity weights or tolerations, consider whether the retry behaviour should be updated as well (e.g., increasing retries if you expect more spot contention).
Technical Architecture
Scheduling Flow
- First Choice: Kubernetes scheduler attempts spot node placement
  - Uses cloud.google.com/gke-spot: true label
  - Highest weight (100) for maximum preference
- Fallback Strategy: If spot nodes unavailable
  - Secondary preference for custom spot_node labels
  - Graceful degradation to regular node pools
Deployment Notes
Prerequisites
- Proper IAM permissions for spot instance creation
Rollback Strategy
- Remove spot node pools from node_pools_combined
- Revert workflow template affinity changes
- All workloads will fall back to regular instances
Conclusion
With the above changes, we aim to reduce compute cost substantially. This is feasible because our workloads are fault-tolerant and stateless by design. The implementation keeps the following properties in mind:
- Non-disruptive: Graceful fallback ensures continuity
- Cost-effective: Maximizes use of cheaper spot instances
- Operationally sound: Maintains monitoring and troubleshooting capabilities