Infrastructure & Deployment

Kubernetes-based, cloud-native deployment with a GitOps workflow for production operation. The kiva-infra-dev repository is the single source of truth for deploying all services to the Azure Kubernetes cluster.

Overview

The KIVA infrastructure repository orchestrates all microservices with Helmfile and provides central management for multi-environment deployments (dev, prod, init).

Core Features

Helmfile-based deployment: Orchestrates all microservices from a single configuration
Multi-environment support: Dev, prod, and init environments with separate configurations
Azure Container Registry (ACR): Central storage for Docker images and Helm charts as OCI artifacts
GitLab CI/CD integration: Automated build, test, and deployment pipelines
SSL/TLS management: Self-signed certificates with CA management for secure HTTPS connections
Persistent storage: Azure Managed CSI Premium for database and model persistence

Architecture Diagram

DevOps-Workflow

3-Step Deployment Process

Step 1: Service Update

Developers change code in the service repositories:

kiva-vllm: Inference Engine
litellm: LLM Gateway
open-webui-backend-kiva: Backend API
open-webui-frontend-kiva: Frontend UI
api-gw: Kong Gateway
kiva-monitoring: Monitoring Stack

# Example: update the vLLM service
cd kiva-vllm
git checkout -b feature/new-model
# ... code changes ...
git add .
git commit -m "Add support for Llama-3"
git push origin feature/new-model

Step 2: CI/CD Build & Push

The GitLab pipeline creates the artifacts automatically:

# .gitlab-ci.yml (simplified)
stages:
  - build
  - package
  - push

docker-build:
  stage: build
  script:
    - docker build -t exxkiregistry.azurecr.io/kiva-vllm:$CI_COMMIT_SHA ./
    - docker push exxkiregistry.azurecr.io/kiva-vllm:$CI_COMMIT_SHA

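helm-package:
  stage: package
  script:
    # Helm only accepts SemVer 2 chart versions, so the commit SHA is
    # appended as a pre-release suffix. The base version 0.1.0 is a placeholder.
    - export CHART_VERSION="0.1.0-sha${CI_COMMIT_SHORT_SHA}"
    - helm package ./helm --version "$CHART_VERSION"
    - helm push "kiva-vllm-${CHART_VERSION}.tgz" oci://exxkiregistry.azurecr.io/helm

Artifacts in ACR:

Docker image: exxkiregistry.azurecr.io/kiva-vllm:<commit-sha>
Helm chart: oci://exxkiregistry.azurecr.io/helm/kiva-vllm:0.1.0-sha<short-sha>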
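
Helm rejects chart versions that are not valid SemVer 2 strings, so a version derived from a commit SHA has to be wrapped in one. The following is a minimal sketch of a guard a pipeline could run before `helm package`. The function names `is_semver` and `chart_version` and the `0.1.0-sha<SHA>` scheme are illustrative assumptions, and the pattern is a simplified check rather than the full SemVer grammar.

```shell
#!/bin/sh
# Simplified SemVer 2 check (assumption: sufficient as a CI guard; it does
# not reject leading zeros in numeric pre-release identifiers).
is_semver() {
  printf '%s\n' "$1" | grep -Eq \
    '^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$'
}

# Hypothetical versioning scheme: a fixed base version with the commit SHA
# appended as a pre-release suffix.
chart_version() {
  printf '0.1.0-sha%s\n' "$1"
}
```

In a job script this could be used as `is_semver "$(chart_version "$CI_COMMIT_SHORT_SHA")" || exit 1` before packaging.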

Step 3: Infrastructure Deployment

Central deployment via kiva-infra-dev:

# 1. Clone the repository into the kiva-infra-dev directory
git clone https://gitlab.opencode.de/baden-wuerttemberg/innenministerium/kiva.ops.git kiva-infra-dev
cd kiva-infra-dev

# 2. Azure login
az login
az acr login -n exxkiregistry

# 3. Deploy all services
helmfile -e dev apply

# 4. Verify the deployment
kubectl get pods -n kiva
kubectl get svc -n kiva

Supported Environments

Cloud Platforms

Platform	Status	Documentation
Azure AKS	✅ Primary	AKS Docs
AWS EKS	✅ Supported	EKS Docs
Google GKE	✅ Supported	GKE Docs
Local (Minikube)	✅ Development	Minikube Docs
Local (k3s)	✅ Development	k3s Docs

Environment Configuration

# Development
helmfile -e dev apply

# Production
helmfile -e prod apply

# Initial setup (first-time installation)
helmfile -e init apply

Differences between environments:

Setting	Dev	Prod	Init
Replicas	1	3	1
Resources	Low	High	Low
SSL	Self-signed	Let's Encrypt	Self-signed
Monitoring	Optional	Mandatory	Optional
Backups	No	Daily	No
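
Because the three environments differ this much, a small wrapper can help avoid typos in the environment name and make a diff review the default. This is a sketch under assumptions: `deploy_env` is a hypothetical helper, not part of the repository, and it relies on the helm-diff plugin being installed for `helmfile diff`.

```shell
#!/bin/sh
# Hypothetical helper: accept only the environment names used in this guide,
# show the pending changes, then apply them.
deploy_env() {
  env_name="$1"
  case "$env_name" in
    dev|prod|init) ;;
    *)
      echo "unknown environment: $env_name" >&2
      return 2
      ;;
  esac
  # Print the pending changes first (needs the helm-diff plugin).
  helmfile -e "$env_name" diff || return 1
  helmfile -e "$env_name" apply
}
```

Calling `deploy_env prod` then runs the diff and the apply in sequence, while `deploy_env staging` fails before Helmfile is invoked.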

Deployed Components

The infrastructure repository manages the following services:

1. Open WebUI

Frontend: Svelte/SvelteKit UI (Port 80)
Backend: Python API server (Port 8080)
Ollama: Local LLM hosting (Port 11434)
PostgreSQL: Database for users and chats

2. LiteLLM Proxy

Service: LLM Gateway (Port 4000)
PostgreSQL: Configuration and service accounts
Monitoring: Prometheus metrics at /metrics

3. vLLM Inference

Service: GPU-Inference Engine (Port 8000)
PVC: Model Cache (50Gi)
GPU: NVIDIA V100/A100

4. Kong API Gateway

Proxy: Port 8000 (HTTP), 8443 (HTTPS)
Admin: Port 8001 (API), 8002 (GUI)
Routes: Frontend, Backend, LiteLLM, Monitoring

5. Monitoring Stack

Prometheus: Metrics Collection (Port 9090)
Grafana: Dashboards (Port 3000)
Alertmanager: Alerting (Port 9093)

Prerequisites

1. Kubernetes Cluster

Version: 1.20 or later
Nodes: At least 3 worker nodes
GPU nodes: For vLLM (optional but recommended)

2. Local Tools

# Install Helm (macOS)
brew install helm

# Install Helm (Linux)
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Install Helmfile
brew install helmfile  # macOS

# Helm Diff plugin
helm plugin install https://github.com/databus23/helm-diff

# Install kubectl
brew install kubectl  # macOS
# or (Linux, amd64)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
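
Before the first deployment it can be useful to confirm that every tool listed above is actually on the PATH. A minimal sketch follows; `missing_tools` and `has_helm_diff` are hypothetical helper names, and the plugin check assumes that `helm plugin list` prints a header row followed by one plugin per line with the name in the first column.

```shell
#!/bin/sh
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if the helm-diff plugin shows up in the plugin list.
# Assumption: the first output line is a header, the first column the name.
has_helm_diff() {
  helm plugin list 2>/dev/null | awk 'NR > 1 { print $1 }' | grep -qx 'diff'
}
```

For example, `missing_tools helm helmfile kubectl az` prints nothing when all four tools are installed.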

3. Azure Container Registry Access

# Azure CLI login
az login

# ACR login
az acr login -n exxkiregistry

# Create the image pull secret
kubectl create secret docker-registry acr-secret \
  --docker-server=exxkiregistry.azurecr.io \
  --docker-username=<username> \
  --docker-password=<password> \
  --namespace=kiva

4. Storage Class

# Check for Azure Managed CSI Premium
kubectl get storageclass

# Create it if it does not exist
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-premium
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

Installation

Full Installation

# 1. Create the namespace
kubectl create namespace kiva

# 2. Clone the repository into the kiva-infra-dev directory
git clone https://gitlab.opencode.de/baden-wuerttemberg/innenministerium/kiva.ops.git kiva-infra-dev
cd kiva-infra-dev

# 3. Azure ACR login
az login
az acr login -n exxkiregistry

# 4. Create secrets
kubectl create secret docker-registry acr-secret \
  --docker-server=exxkiregistry.azurecr.io \
  --docker-username="$ACR_USERNAME" \
  --docker-password="$ACR_PASSWORD" \
  --namespace=kiva

# 5. Create the storage class (if needed)
kubectl apply -f kiva-namespace-and-storageclass.yaml

# 6. Deploy all services
helmfile -e dev apply

# 7. Check the deployment
kubectl get pods -n kiva --watch

Deploying Individual Services

# Deploy only LiteLLM
helmfile -e dev -l name=litellm apply

# Deploy only vLLM
helmfile -e dev -l name=kiva-vllm apply

# Deploy only monitoring
helmfile -e dev -l name=prometheus apply
helmfile -e dev -l name=grafana apply

Update Process

Updating a Service

# 1. Push the new version to the service repository
cd kiva-vllm
git push origin main
# -> GitLab CI/CD builds new images

# 2. Switch to the infrastructure repository
#    (assumes both repositories are checked out side by side)
cd ../kiva-infra-dev

# 3. Review the changes (dry run)
helmfile -e dev diff

# 4. Apply the updates
helmfile -e dev apply

# 5. Monitor the rollout
kubectl rollout status deployment/kiva-vllm -n kiva

Updating the Helm Chart Version

# Edit helmfile.yaml
releases:
  - name: kiva-vllm
    chart: oci://exxkiregistry.azurecr.io/helm/kiva-vllm
    version: "new-version"  # change the version

# Apply
helmfile -e dev apply

SSL/TLS Certificate Management

The infrastructure uses self-signed certificates for HTTPS connectivity in the dev and init environments; the environment table above lists Let's Encrypt for prod.

Obtaining the CA Certificate

Option 1: From the infrastructure team

# File: ca-cert.pem

Option 2: Extract from the cluster

kubectl get secret kong-ssl-certs -n kiva -o jsonpath='{.data.ca\.crt}' | base64 -d > ca-cert.pem

Installing the Certificate

Windows

# Open the certificate manager
certmgr.msc

# Import:
# Trusted Root Certification Authorities > Certificates
# Right-click > All Tasks > Import
# Select ca-cert.pem

macOS

# Open Keychain Access
open -a "Keychain Access"

# Import the certificate
# File > Import Items > select ca-cert.pem
# Choose the System keychain
# Double-click the certificate > Trust > Always Trust

Linux (Ubuntu/Debian)

sudo cp ca-cert.pem /usr/local/share/ca-certificates/kiva-ca.crt
sudo update-ca-certificates

Firefox (all platforms)

Settings > Privacy & Security > Certificates > View Certificates
Authorities > Import > ca-cert.pem
✓ Trust this CA to identify websites

Verifying the Certificate

# Test the HTTPS connection
curl https://kiva.example.com:8443/health

# Show certificate details (stdin is closed so the command returns immediately)
openssl s_client -connect kiva.example.com:8443 -showcerts </dev/null
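
The connection can also be checked against the CA file directly, without installing it into a system trust store first. A small sketch, assuming the `ca-cert.pem` file from the previous steps; `check_https` is a hypothetical helper name and the URL is the same placeholder host used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: request a URL and trust only the given CA file.
# curl exits non-zero on TLS verification errors and, because of --fail,
# on HTTP error responses.
check_https() {
  ca_file="$1"
  url="$2"
  curl --fail --silent --show-error --cacert "$ca_file" "$url"
}
```

Usage: `check_https ca-cert.pem https://kiva.example.com:8443/health`.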

Monitoring & Observability

Service Access

After deployment, all services are reachable through the Kong Gateway:

# Determine the Kong external IP
KONG_IP=$(kubectl get svc kong-gateway-service -n kiva -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo "Services available at:"
echo "- Open WebUI: http://$KONG_IP:8000/"
echo "- LiteLLM: http://$KONG_IP:8000/litellm"
echo "- Grafana: http://$KONG_IP:8000/grafana"
echo "- Prometheus: http://$KONG_IP:8000/prometheus"
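
Right after a deployment the load balancer may not have an external IP yet, in which case the jsonpath query returns an empty string. The following sketch polls until an address appears. `wait_for_kong_ip` is a hypothetical helper, and the default attempt count and delay are arbitrary assumptions.

```shell
#!/bin/sh
# Poll the Kong service until an external IP is assigned.
# Arguments (both optional): number of attempts, delay in seconds.
wait_for_kong_ip() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip=$(kubectl get svc kong-gateway-service -n kiva \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no external IP assigned after $attempts attempts" >&2
  return 1
}
```

Usage: `KONG_IP=$(wait_for_kong_ip) || exit 1`.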

Checking Deployment Status

# All pods
kubectl get pods -n kiva

# Services
kubectl get svc -n kiva

# Persistent volume claims
kubectl get pvc -n kiva

# Helm releases
helm list -n kiva

# Helmfile status
helmfile -e dev status

Retrieving Logs

# All pods of a service
kubectl logs -l app.kubernetes.io/name=kiva-vllm -n kiva --tail=100 -f

# Specific pod
kubectl logs kiva-vllm-6d8f9c5b7-xk4ql -n kiva

# All containers in a pod
kubectl logs kiva-vllm-6d8f9c5b7-xk4ql --all-containers=true -n kiva

# Save logs to a file
kubectl logs -l app.kubernetes.io/name=litellm -n kiva > litellm-logs.txt

Troubleshooting

Problem: Pods do not start (ImagePullBackOff)

Cause: Missing or invalid ACR secrets

Solution:

# Check the secret
kubectl get secret acr-secret -n kiva

# Recreate the secret
kubectl delete secret acr-secret -n kiva
kubectl create secret docker-registry acr-secret \
  --docker-server=exxkiregistry.azurecr.io \
  --docker-username="$ACR_USERNAME" \
  --docker-password="$ACR_PASSWORD" \
  --namespace=kiva

# Restart the pods
kubectl rollout restart deployment -n kiva
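
As an alternative to deleting and recreating the secret, the manifest can be rendered client-side and piped into `kubectl apply`, which creates the secret if it is missing and updates it otherwise. A minimal sketch; `apply_acr_secret` is a hypothetical helper that expects `ACR_USERNAME` and `ACR_PASSWORD` in the environment.

```shell
#!/bin/sh
# Hypothetical helper: create or update the pull secret in a single step.
apply_acr_secret() {
  if [ -z "${ACR_USERNAME:-}" ] || [ -z "${ACR_PASSWORD:-}" ]; then
    echo "ACR_USERNAME and ACR_PASSWORD must be set" >&2
    return 1
  fi
  # --dry-run=client renders the manifest without contacting the cluster;
  # kubectl apply then creates or updates the object.
  kubectl create secret docker-registry acr-secret \
    --docker-server=exxkiregistry.azurecr.io \
    --docker-username="$ACR_USERNAME" \
    --docker-password="$ACR_PASSWORD" \
    --namespace=kiva \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Running pods only pick up the new credentials on their next image pull, so the rollout restart above still applies.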

Problem: PVC stays in Pending status

Cause: Storage class not available, or a node-affinity conflict

Solution:

# Check storage classes
kubectl get storageclass

# PVC details
kubectl describe pvc <pvc-name> -n kiva

# If it is a node-affinity problem:
kubectl get pv | grep <pvc-name>
kubectl describe pv <pv-name>

# Create the storage class (if missing)
kubectl apply -f kiva-namespace-and-storageclass.yaml

Problem: helmfile apply fails

Cause: Helm Diff plugin is missing

Solution:

# Install Helm Diff
helm plugin install https://github.com/databus23/helm-diff

# Check the Helmfile version
helmfile version

# Verbose output for debugging
helmfile -e dev --debug apply

Problem: Service not reachable

Cause: Kong route or service configuration

Solution:

# Check Kong services
kubectl exec -it deployment/kong-gateway -n kiva -- kong config db_export /tmp/kong.yml
kubectl exec -it deployment/kong-gateway -n kiva -- cat /tmp/kong.yml

# Check service endpoints
kubectl get endpoints -n kiva

# Port-forward for direct access
kubectl port-forward svc/litellm-service 4000:4000 -n kiva

# Test
curl http://localhost:4000/health

Best Practices

1. Version Control

# Always use feature branches
git checkout -b feature/update-vllm-version
git add helmfile.yaml
git commit -m "Update vLLM to version 0.3.0"
git push origin feature/update-vllm-version

2. Back Up Before Major Updates

# Save the Helmfile state
helmfile -e prod list > backup-$(date +%Y%m%d).txt

# Back up the PVC definitions (manifests only, not the volume data)
kubectl get pvc -n kiva -o yaml > pvc-backup-$(date +%Y%m%d).yaml

# Database backup
kubectl exec deployment/postgresql -n kiva -- pg_dumpall -U postgres > db-backup-$(date +%Y%m%d).sql

3. Staged Rollouts

# 1. Test in dev first
helmfile -e dev apply
kubectl get pods -n kiva --watch

# 2. Then deploy to prod
helmfile -e prod apply

# 3. If problems occur: roll back
helm rollback <release-name> -n kiva
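
The apply, watch, and rollback steps can be chained for a single release so that a failed rollout is reverted automatically. This is a sketch under assumptions: `apply_with_rollback` is a hypothetical helper, the 300s default timeout is arbitrary, and it presumes that the Helm release and its Deployment share the same name, which may not hold for every chart.

```shell
#!/bin/sh
# Hypothetical helper: apply one release, wait for its rollout, and roll
# back to the previous revision if the rollout does not finish in time.
# Arguments: environment, release name, optional timeout (default 300s).
apply_with_rollback() {
  env_name="$1"
  release="$2"
  rollout_timeout="${3:-300s}"
  helmfile -e "$env_name" -l "name=$release" apply || return 1
  if kubectl rollout status "deployment/$release" -n kiva \
      --timeout="$rollout_timeout"; then
    return 0
  fi
  echo "rollout of $release failed, rolling back" >&2
  # Without a revision argument, helm rolls back to the previous release.
  helm rollback "$release" -n kiva
  return 1
}
```

Usage: `apply_with_rollback dev litellm` first, then `apply_with_rollback prod litellm` once dev looks healthy.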

4. Set Resource Limits

# values.yaml
resources:
  limits:
    cpu: "2000m"
    memory: "4Gi"
  requests:
    cpu: "500m"
    memory: "1Gi"

Performance Optimization

Node Affinity for GPU Workloads

# Schedule vLLM on GPU nodes
nodeSelector:
  kubernetes.azure.com/agentpool: permgpu

tolerations:
- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"

Horizontal Pod Autoscaling

# HPA for LiteLLM
kubectl autoscale deployment litellm -n kiva \
  --cpu-percent=80 \
  --min=2 \
  --max=10

# Check the status
kubectl get hpa -n kiva

PVC Optimization

# High-performance storage for databases
storageClassName: managed-csi-premium
accessModes:
  - ReadWriteOnce
resources:
  requests:
    storage: 100Gi

Security

Network Policies

# Only Kong may reach the LiteLLM pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-kong
  namespace: kiva
spec:
  podSelector:
    matchLabels:
      app: litellm
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: kong-gateway

Secret Management

# Never commit secrets to Git!
echo "secrets/" >> .gitignore

# Use Sealed Secrets (recommended)
kubectl create secret generic db-password \
  --from-literal=password=my-secret-password \
  --dry-run=client -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml

Links & Resources

Infrastructure repository: kiva.ops
Deployment guide: README.de.md
Helmfile documentation: helmfile.readthedocs.io
Helm documentation: helm.sh
Kubernetes documentation: kubernetes.io