# Architecture Decision Record: Adding LiteLLM to Matrix Infrastructure Kubernetes Stack

Date: 04-09-2025
Status: Proposed
Author: Platform Engineering Team
## Context
The MATRIX platform at EveryCure requires a robust LLM gateway to manage multiple AI model providers efficiently. As we expand our AI-driven drug repurposing capabilities using Gemini 2.0 and other LLM providers, we need a unified interface that provides:
- Consistent API interface across multiple LLM providers
- Cost tracking and budget management
- Rate limiting and load balancing
- Caching for improved performance and cost reduction
- Audit logging and compliance
- High availability and scalability
Our current infrastructure runs in the MATRIX project, and we need a solution that integrates seamlessly with our existing Kubernetes stack and can be lifted and shifted to another project later down the line.
## Decision
We will deploy the LiteLLM Proxy Server as our LLM gateway within our Matrix infrastructure Kubernetes stack, with the following components:

- LiteLLM Proxy deployed via Helm chart
- PostgreSQL for persistent storage and user management, with PgBouncer for connection pooling (Patroni to be set up later when we need active/passive failover and HA)
- Redis for caching and distributed state management
- A new Cloud DNS private zone for internal DNS resolution
- An ExternalDNS controller to tie A records to services
## Implementation Details
### Components & Deployment Strategy

- **LiteLLM Core Deployment**
  - Managed via ArgoCD `Application` (`infra/argo/app-of-apps/templates/litellm.yaml`)
  - Sources the upstream Helm chart from https://github.com/BerriAI/litellm.git (`deploy/charts/litellm-helm`), pinned to `targetRevision: v1.76.1-stable`
  - Runs in namespace `litellm` (auto-created via the `CreateNamespace=true` sync option)
  - Replicas: `replicaCount: 3` for basic HA
  - Container image: `ghcr.io/berriai/litellm:main-stable`
  - Service: ClusterIP on port `4000`
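For orientation, a minimal sketch of what this Application could look like, assuming the chart exposes the `replicaCount` and `image` keys referenced above (the authoritative manifest is `infra/argo/app-of-apps/templates/litellm.yaml`):

```yaml
# Sketch only: the real manifest lives in infra/argo/app-of-apps/templates/litellm.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: litellm
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "10"  # core app syncs before the gateway route (wave "11")
spec:
  project: default  # assumption; the actual ArgoCD project may differ
  source:
    repoURL: https://github.com/BerriAI/litellm.git
    path: deploy/charts/litellm-helm
    targetRevision: v1.76.1-stable
    helm:
      values: |
        replicaCount: 3
        image:
          repository: ghcr.io/berriai/litellm
          tag: main-stable
  destination:
    server: https://kubernetes.default.svc
    namespace: litellm
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```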
- **Gateway / Routing**
  - Separate ArgoCD `Application` (`litellm-gateway`) with a later sync-wave (`"11"`) to ensure core LiteLLM (wave `"10"`) is healthy first
  - Exposes an HTTP entry point via a Kubernetes `HTTPRoute` (Gateway API) referencing the shared `api-gateway` in namespace `gateway`
  - Public hostname pattern: `litellm.api.<env>.everycure.org` (injected via a Helm parameter), with an `external-dns` annotation for automated DNS record creation
  - A `HealthCheckPolicy` (GKE-specific) configures GCLB backend health monitoring on `/` at port 4000
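A sketch of the two routing objects, assuming the chart names the Service `litellm` (the hostname below shows `dev` in place of `<env>`):

```yaml
# Sketch only: object names and the Service reference are assumptions
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: litellm
  namespace: litellm
spec:
  parentRefs:
    - name: api-gateway
      namespace: gateway
  hostnames:
    - litellm.api.dev.everycure.org  # <env> resolved per environment via Helm parameter
  rules:
    - backendRefs:
        - name: litellm
          port: 4000
---
# GKE-specific health check so the GCLB probes the proxy directly
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: litellm
  namespace: litellm
spec:
  default:
    config:
      type: HTTP
      httpHealthCheck:
        port: 4000
        requestPath: /  # per the bullet above; a dedicated health endpoint may be preferable
  targetRef:
    group: ""
    kind: Service
    name: litellm
```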
- **PostgreSQL**
  - Uses the existing CloudNativePG cluster (pooler); no embedded DB deployment (`db.useExisting: true`, `deployStandalone: false`)
  - Connection URL template (in values): `postgresql://litellm:$(DATABASE_PASSWORD)@postgresql-cloudnative-pg-cluster-pooler-rw.postgresql.svc.cluster.local:5432/app?schema=litellm`
  - Schema-level isolation via `?schema=litellm` rather than a separate database
  - Credentials delivered through an ExternalSecret -> Kubernetes secret `postgres`
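In the Helm values this wiring might look as follows; the key layout mirrors the upstream chart's `db` block, but should be verified against its `values.yaml`:

```yaml
# Sketch only: verify key names against the upstream chart's values.yaml
db:
  useExisting: true
  deployStandalone: false
  # $(DATABASE_PASSWORD) is substituted at runtime from the `postgres` secret
  url: postgresql://litellm:$(DATABASE_PASSWORD)@postgresql-cloudnative-pg-cluster-pooler-rw.postgresql.svc.cluster.local:5432/app?schema=litellm
  secret:
    name: postgres
    usernameKey: username  # assumption; actual key names may differ
    passwordKey: password
```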
- **Redis**
  - External in-cluster Redis service `redis.redis.svc.cluster.local:6379`, leveraged for:
    - Request routing / rate state (`router_settings.redis_host`)
    - Response + embedding cache (`litellm_settings.cache_params`)
  - Cache configuration: TTL 86400 seconds (1 day), namespace `litellm_cache`, flush batch size 100
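In the proxy config, the Redis wiring described above maps to roughly the following (parameter names follow LiteLLM's caching settings; the batch-size key name is an assumption):

```yaml
# Sketch only: the batch-size key name is an assumption
router_settings:
  redis_host: redis.redis.svc.cluster.local
  redis_port: 6379
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis.redis.svc.cluster.local
    port: 6379
    ttl: 86400            # 1 day
    namespace: litellm_cache
    batch_size: 100       # flush batch size
```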
- **Secrets & Key Management**
  - External Secrets (`infra/argo/applications/litellm-gateway/templates/external_secrets.yaml`) provision:
    - `litellm-master-key` (master key for virtual key issuance)
    - `litellm-provider-keys` (OpenAI + Anthropic API keys)
    - `postgres` (DB username/password)
  - Remote secret backend: `ClusterSecretStore` `gcp-store`
  - Refresh interval: 1h
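One of the three ExternalSecrets might look like this sketch (the remote key and data mapping are assumptions; the real templates live at the path above):

```yaml
# Sketch only: remote key name and data mapping are assumptions
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: litellm-master-key
  namespace: litellm
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-store
  target:
    name: litellm-master-key
  data:
    - secretKey: masterkey
      remoteRef:
        key: litellm-master-key  # GCP Secret Manager entry
```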
- **Model Registry (Initial Set)**
  - OpenAI: `gpt-4o`, `gpt-4o-mini`, `text-embedding-3-large`, `text-embedding-3-small`
  - Anthropic: `claude-3-5-sonnet-20240620`, `claude-4-sonnet-20250514`, `claude-3-5-haiku-20241022`, `claude-3-haiku-20240307`
  - Caching enabled selectively (`model_info.cache: true`) for frequently reused models
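Illustrative `model_list` entries in LiteLLM's proxy config format; the env-var names are assumptions about how `litellm-provider-keys` is exposed to the pods:

```yaml
# Sketch only: env-var names are assumptions
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      cache: true  # selective caching for frequently reused models
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
```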
- **Runtime Settings**
  - Retries: `num_retries: 2` (router level)
  - Telemetry disabled initially (`general_settings.telemetry: false`); will revisit once an observability baseline is in place
  - UI enabled for admin operations (`ui.enabled: true`)
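Consolidated, the runtime block might read as below; whether `ui.enabled` belongs in the Helm values or the proxy config depends on the chart version:

```yaml
# Sketch only
router_settings:
  num_retries: 2
general_settings:
  telemetry: false
ui:
  enabled: true  # admin UI toggle
```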
- **Sync & Delivery**
  - ArgoCD automated sync with prune + selfHeal ensures drift correction
  - Ordered deployment by sync-wave enforces the dependency: core app before route/DNS exposure
## Rollout Plan
| Phase | Action | Success Criteria |
|---|---|---|
| 0 | Apply secrets + DB schema migration (Terraform) | Secrets synced; schema present |
| 1 | Deploy LiteLLM app (wave 10) | Pods Ready=3/3; health endpoints responsive |
| 2 | Deploy gateway route + health check (wave 11) | External DNS record resolves; health check green |
| 3 | Smoke test API (list models, simple completion, embedding) | All model calls succeed < 2s p95 (initial) |
| 4 | Enable caching validation | Cache hit ratio > 20% after warm-up |
| 5 | Onboard first service via virtual key | Successful auth & billing metadata recorded |
## Operations & SRE Considerations

Scaling: horizontal scaling by adjusting `replicaCount`; Redis centralizes the shared state. Consider an HPA on CPU plus a custom QPS metric, as sketched below.
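A CPU-based starting point for such an HPA; a QPS target would additionally require a custom-metrics adapter (e.g. Prometheus Adapter):

```yaml
# Sketch only: maxReplicas and the target utilization are assumptions to tune
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm
  namespace: litellm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm  # assumption; use the chart's actual Deployment name
  minReplicas: 3   # matches replicaCount for basic HA
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```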
Failure Modes:
- Redis outage: falls back to no shared cache/routing state (elevated latency and cost); document a runbook for temporarily disabling the cache params
- DB outage: key issuance, logging, and user management fail; read-only proxy operations that don't require DB state may still function (confirm upstream behavior)
- Provider rate limit: built-in retry (2 attempts); consider circuit breaking if elevated 429s exceed a threshold
Backups: Rely on existing Postgres backup policy (CloudNativePG). No Redis persistence assumed (ephemeral cache acceptable).
Disaster Recovery: Recreate Redis + rehydrate warm cache; re-sync ArgoCD apps; DB schema managed declaratively (Terraform + migrations).
## Security
| Area | Control |
|---|---|
| API Keys | Stored in GCP Secret Manager, synced via ExternalSecrets |
| Master Key | Rotatable via updating secret backend; triggers pod restart on change |
| Network Exposure | Restricted to gateway-managed hostname; ClusterIP internally |
| Namespace Isolation | Dedicated litellm namespace for policy scoping |
| Least Privilege | DB user limited to litellm schema |
| Future Enhancements | Add OPA/Gatekeeper constraints; enable request audit logging |
## Observability
Telemetry is disabled initially. Next steps:
- Expose a Prometheus metrics endpoint (if supported by the upstream chart) and add a ServiceMonitor (see the sketch after this list)
- Add structured logging config (JSON) for ingestion by existing log pipeline
- Define SLOs: Availability (99.5%), p95 latency (<1500ms for GPT-4o), error rate (<2%)
- Add synthetic probe calls for early provider degradation detection
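If the metrics endpoint is enabled, the ServiceMonitor might look like this sketch (the label selector and port name are assumptions about the chart's Service):

```yaml
# Sketch only: selector labels and port name are assumptions
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: litellm
  namespace: litellm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: litellm
  endpoints:
    - port: http       # named port on the ClusterIP Service
      path: /metrics
      interval: 30s
```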
## Open Questions / Future Work
- Introduce rate limiting & per-team quota enforcement (needs Redis + custom policy layer)
- Evaluate multi-region active/passive for Postgres (Patroni or Cloud-native failover) when usage justifies
- Add additional providers (Mistral, Google Gemini) once keys & compliance cleared
- Evaluate enabling upstream telemetry for usage analytics once privacy review complete
- Consider Private Service Connect / Internal LB if cross-project consumption grows
## Alternatives Considered

### DNS Resolution (Load Balancer vs. Other)

A key decision was whether to use a GCP private load balancer. Our current networking layer already fulfils the present L7 and L4 requirements, so we can introduce a private load balancer later if and when needs evolve.
## Architecture Diagram

```mermaid
%% Diagram shows dual access paths (external via HTTPS hostname + internal via cluster DNS)
graph TD
    %% External & Internal Clients
    EXT[External Clients / Internet]
    INT[Internal Platform Services]

    %% Subgraphs
    subgraph "Kubernetes Cluster"
        GW["API Gateway (Gateway API)<br/>Namespace: gateway"]
        C[LiteLLM Proxy<br/>Namespace: litellm]
        D["PostgreSQL (CloudNativePG Pooler)<br/>Schema: litellm"]
        E[Redis<br/>Cache & Routing State]
    end

    subgraph "External Providers"
        J[OpenAI]
        K[Other LLM Providers]
    end

    %% Access Paths
    EXT -->|HTTPS: litellm.api.<env>.everycure.org| GW
    INT -->|Cluster DNS: litellm.litellm.svc.cluster.local:4000| C
    GW --> C

    %% Internal Dependencies
    C --> D
    C --> E

    %% Outbound to Providers
    C --> J
    C --> K
```