Docker Image Cleanup Implementation
Overview
We've implemented an Exit Handler Approach to automatically delete Docker images from Google Artifact Registry when Argo Workflows complete successfully.
What Was Done
1. Modified Workflow Template
- File: pipelines/matrix/templates/argo_wf_spec.tmpl
- Changes:
  - Added onExit: cleanup-handler to the workflow spec. This ensures that the cleanup-handler is executed when a workflow exits.
  - Added cleanup-handler template that runs only when workflow status is "Succeeded"
  - Added delete-artifact-images template that uses gcloud to delete the Docker image. This uses the sa-k8s-node SA.
  - Passes image parameter explicitly to cleanup template to ensure proper access.
2. Updated GKE Service Account Permissions
- File: infra/modules/stacks/compute_cluster/iam.tf
- Changes:
  - Added roles/artifactregistry.admin to the existing sa-k8s-node service account
  - Removed redundant roles/artifactregistry.writer and roles/artifactregistry.reader roles
  - No new service accounts created - reuses existing GKE node service account
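One way to confirm the role change after Terraform has been applied is to query the IAM policy with gcloud. The sketch below assumes the role is bound at the project level; the function name sa_has_role and its arguments are hypothetical and not part of the repository.

```shell
# Sketch only: sa_has_role is a hypothetical helper, and it assumes the
# role is granted on the project (not on an individual repository).

# sa_has_role PROJECT_ID SA_EMAIL ROLE
# Returns 0 if SA_EMAIL holds ROLE in the project's IAM policy, 1 otherwise.
sa_has_role() {
    project_id="$1"
    sa_email="$2"
    role="$3"

    # List every role bound to the service account, one per line,
    # then look for an exact match of the requested role.
    gcloud projects get-iam-policy "${project_id}" \
        --flatten="bindings[].members" \
        --filter="bindings.members:serviceAccount:${sa_email}" \
        --format="value(bindings.role)" \
        | grep -Fxq "${role}"
}
```

Calling sa_has_role with your project ID, the full email of the sa-k8s-node service account, and roles/artifactregistry.admin should succeed once the change is live, while the same call with roles/artifactregistry.writer should fail.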
How It Works
- Workflow Execution: Your normal workflow runs as usual
- On Success: The exit handler runs when the workflow exits, and its cleanup step proceeds only when the workflow status is "Succeeded"
- Parameter Passing: Image parameter is explicitly passed as input to cleanup template
- Image Deletion: The cleanup step runs gcloud artifacts docker images delete
- Authentication: Uses existing GKE node service account with Artifact Registry admin permissions
- Error Handling: If image deletion fails, it logs the error but doesn't fail the workflow
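The non-blocking error handling described above can be sketched as a small shell function. This is an illustration under assumptions, not the script from the template: the function name delete_image and the log messages are hypothetical, while the gcloud command and flags are the ones named in this document.

```shell
# Sketch only: delete_image is a hypothetical name, and the log wording
# is illustrative rather than copied from the workflow template.

# delete_image IMAGE_FULL
# Attempts to delete one image and always returns 0, so a failed
# deletion never changes the workflow's final status.
delete_image() {
    image_full="$1"

    echo "Attempting to delete image: ${image_full}"
    if gcloud artifacts docker images delete "${image_full}" --quiet --delete-tags; then
        echo "Deleted image: ${image_full}"
    else
        # Log and continue: a missing image or a permission error is not fatal.
        echo "WARNING: could not delete ${image_full}; continuing" >&2
    fi
    return 0
}
```

Returning 0 unconditionally is the design choice that keeps cleanup from affecting workflow success; the trade-off is that failed deletions are only visible in the logs.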
Key Benefits
✅ Uses existing infrastructure - Leverages existing sa-k8s-node service account
✅ No new secrets or service accounts - Reuses GKE's built-in authentication
✅ Automatic cleanup - Runs only when workflow succeeds
✅ Non-blocking - Image cleanup failures don't affect workflow success
✅ Selective cleanup - Only deletes the specific image used by the workflow
✅ Proper permissions - Single admin role instead of multiple overlapping roles
Implementation Steps Completed
- ✅ Modified workflow template with exit handler
- ✅ Updated IAM permissions for sa-k8s-node service account
- ✅ Tested workflow execution - cleanup handler triggers correctly
- ✅ Apply Terraform changes to grant Artifact Registry admin permissions
Next Steps
- Apply Terraform changes:
      cd infra/deployments/wg1  # or your deployment directory
      terraform plan
      terraform apply
- Test the complete flow with a new workflow run to verify cleanup works
Image Cleanup Details
The cleanup process:
- Input: Receives image parameter explicitly from workflow: {{inputs.parameters.image_to_delete}}
- Validation: Checks if image path is non-empty before proceeding
- Deletion: Uses gcloud artifacts docker images delete "$IMAGE_FULL" --quiet --delete-tags
- Logging: Provides detailed logs for debugging and monitoring
- Error Handling: Gracefully handles missing images or permission errors
- Example: Deletes images like us-central1-docker.pkg.dev/mtrx-hub-dev-3of/matrix-images/matrix:nelson-sample-run-71940567
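The validation step can be sketched as follows. The document only states that the path is checked for being non-empty; the additional format check against the Artifact Registry naming pattern is an assumption added here for illustration, and validate_image_path is a hypothetical name.

```shell
# Sketch only: validate_image_path is a hypothetical helper. The non-empty
# check comes from the description above; the format check is an added
# assumption and may not exist in the actual template.

# validate_image_path IMAGE_FULL
# Returns 0 when the path is non-empty and looks like
# LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE[:TAG]; returns 1 otherwise.
validate_image_path() {
    image_full="$1"

    if [ -z "${image_full}" ]; then
        echo "No image path provided; skipping cleanup" >&2
        return 1
    fi

    case "${image_full}" in
        # Require three non-empty path segments after the registry host.
        *-docker.pkg.dev/?*/?*/?*) return 0 ;;
        *)
            echo "Unexpected image path format: ${image_full}" >&2
            return 1
            ;;
    esac
}
```

With this sketch, the example path above passes validation, while an empty string or a bare name such as matrix:latest is rejected before any gcloud call is made.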
Security & Best Practices
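- Consolidated Permissions: Uses a single roles/artifactregistry.admin role instead of multiple roles. Note that admin is a superset of writer and reader, so this reduces the number of role bindings rather than the scope of access.
- No Stored Secrets: Leverages GKE's built-in service account authentication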
- Existing Infrastructure: No new service accounts, secrets, or complex setups
- Error Isolation: Cleanup failures don't impact workflow success