Inference Engine (vLLM)

High-performance model inference for open-source language models with vLLM.

Overview

vLLM serves as the local LLM inference backend and exposes OpenAI-compatible APIs for model serving.

Core Features

GPU optimization: NVIDIA V100, A100, A10G, T4 with CUDA 11.8+
PagedAttention: Efficient KV-cache memory management in fixed-size blocks
Continuous batching: High throughput for concurrent requests
Tensor parallelism: Distributed processing across multiple GPUs
Streaming & quantization: Token-by-token responses, and quantized models with a reduced memory footprint
Hugging Face integration: Automatic model download

Technology Stack

Component	Technology
Inference engine	vLLM
GPU framework	CUDA 11.8+
Orchestration	Kubernetes with NVIDIA GPU Operator
Deployment	Helm charts
Storage	Persistent Volumes (PVC)

Prerequisites

Hardware Requirements

GPU: NVIDIA V100, A100, A10G, T4 or newer
GPU memory: At least 16 GB (varies with model size)
CUDA: Version 11.8 or later
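
How much GPU memory a model needs depends mostly on its parameter count and numeric precision. The following is a back-of-the-envelope sketch, not a vLLM API: the function name `weight_memory_gib` and the nominal parameter counts are illustrative assumptions, and the result covers the weights only. KV cache and activations come on top.

```python
def weight_memory_gib(num_params: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    # Nominal parameter counts (assumption); real checkpoints differ slightly.
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        fp16 = weight_memory_gib(params, 16)
        int4 = weight_memory_gib(params, 4)
        print(f"{name}: ~{fp16:.1f} GiB at 16 bits, ~{int4:.1f} GiB at 4 bits")
```

By this estimate a 7B model in 16-bit precision already occupies most of a 16 GB card, which is why larger models need more GPU memory, quantization, or several GPUs.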

Kubernetes Requirements

Cluster: Kubernetes 1.20+
GPU Operator: NVIDIA GPU Operator installed
Storage class: For model caching

Installation

1. Check GPU availability

kubectl get nodes -l kubernetes.azure.com/agentpool=permgpu
kubectl describe node <gpu-node> | grep nvidia.com/gpu
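
The same check can be scripted. This is a minimal sketch that reads the JSON printed by `kubectl get nodes -o json` and reports the allocatable `nvidia.com/gpu` count per node. The helper names `gpu_nodes` and `fetch_nodes` are hypothetical, and it assumes `kubectl` is on the PATH and already points at the right cluster.

```python
import json
import subprocess


def gpu_nodes(node_list: dict) -> dict[str, int]:
    """Map node name to allocatable nvidia.com/gpu count, skipping nodes without GPUs."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def fetch_nodes() -> dict:
    """Run kubectl and return the parsed node list (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

Calling `gpu_nodes(fetch_nodes())` returns an empty dictionary when no node advertises GPUs, which usually points to a missing or unhealthy GPU Operator.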

2. Deploy the Helm chart

# Clone the repository
git clone https://gitlab.opencode.de/baden-wuerttemberg/innenministerium/kiva.apps/kiva-vllm.git
cd kiva-vllm/helm

# Installation
helm install kiva-vllm . --namespace kiva

3. Verify the deployment

# Pod status
kubectl get pods -l app.kubernetes.io/name=kiva-vllm -n kiva

# Follow the logs
kubectl logs -l app.kubernetes.io/name=kiva-vllm -f -n kiva

# Model loading progress
kubectl logs -l app.kubernetes.io/name=kiva-vllm -n kiva | grep "Loading model"

Configuration

Model Selection

# values.yaml
args:
  - "--host=0.0.0.0"
  - "--port=8000"
  - "--model=meta-llama/Llama-2-7b-chat-hf"  # Hugging Face Model ID
  - "--trust-remote-code"
  - "--max-model-len=4096"
  - "--gpu-memory-utilization=0.95"

GPU Resources

resources:
  limits:
    nvidia.com/gpu: "1"  # Number of GPUs
  requests:
    nvidia.com/gpu: "1"

Persistent Storage

persistence:
  enabled: true
  size: 50Gi  # For the model cache
  storageClass: managed-csi-premium

API Usage

OpenAI-Compatible API

# Chat Completions
curl -X POST http://kiva-vllm:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# List available models
curl http://kiva-vllm:8000/v1/models

# Health Check
curl http://kiva-vllm:8000/health
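
Because loading a model can take several minutes, clients may want to poll the health endpoint before sending requests. The sketch below uses only the Python standard library. The function names and the retry defaults are assumptions for illustration, and it treats any HTTP 200 from `/health` as "ready".

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


def wait_until_ready(check, attempts: int = 60, delay: float = 5.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: is_healthy("http://kiva-vllm:8000"))`. Passing the check as a callable keeps the retry loop independent of the transport.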

Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://kiva-vllm:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain AI in simple terms"}
    ]
)

print(response.choices[0].message.content)
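
Streaming is listed among the core features. With the OpenAI client it is enabled by passing `stream=True`. To show what travels over the wire, here is a standard-library sketch that requests a streamed chat completion and extracts the text deltas from the server-sent events. The helper names are hypothetical, and the parser assumes the usual OpenAI-style `data: {...}` lines terminated by `data: [DONE]`.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_stream_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for a streamed chat completion."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style event stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(base_url: str, model: str, prompt: str) -> None:
    """Print the answer piece by piece as it arrives (needs a reachable server)."""
    request = build_stream_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        for text in iter_stream_text(response):
            print(text, end="", flush=True)
    print()
```

For example, `stream_chat("http://kiva-vllm:8000/v1", "meta-llama/Llama-2-7b-chat-hf", "Explain AI in simple terms")` prints tokens as they are generated instead of waiting for the full answer.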

Monitoring

Metrics Endpoint

# Prometheus metrics
curl http://kiva-vllm:8000/metrics

Key metrics:

- vllm:num_requests_running: Requests currently being processed
- vllm:num_requests_waiting: Requests waiting in the queue
- vllm:gpu_cache_usage_perc: GPU KV-cache utilization
- vllm:time_to_first_token_seconds: Latency until the first token
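
For quick checks without a Prometheus server, the text returned by `/metrics` can be parsed directly. The sketch below is a simplified reader for the Prometheus text exposition format, not a full parser: it assumes one sample per line without trailing timestamps, and the function name `parse_metrics` is hypothetical. Exact metric names and labels depend on the deployed vLLM version.

```python
def parse_metrics(text: str, prefix: str = "vllm") -> dict[str, float]:
    """Return {sample name including labels: value} for samples starting with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name.startswith(prefix):
            samples[name] = float(value)
    return samples


if __name__ == "__main__":
    # Made-up sample text for illustration, not captured from a real server.
    example = "\n".join([
        "# TYPE vllm:num_requests_running gauge",
        'vllm:num_requests_running{model_name="demo"} 2.0',
        'vllm:num_requests_waiting{model_name="demo"} 0.0',
        "process_cpu_seconds_total 12.5",
    ])
    for name, value in parse_metrics(example).items():
        print(name, value)
```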

Grafana Dashboard

A preconfigured dashboard is available in: kiva-monitoring

Troubleshooting

Problem: Out-of-memory (OOM) error

Solution:

# Option 1: reduce GPU memory utilization
args:
  - "--gpu-memory-utilization=0.85"  # Default: 0.90

# Option 2: limit max-model-len instead
args:
  - "--max-model-len=2048"
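
To judge which of the two settings helps, it is useful to estimate how many tokens fit into the KV cache. The sketch below is rough arithmetic, not how vLLM measures memory internally: the function names are hypothetical, activation and framework overhead are folded into a single optional parameter, and the architecture values in the example (32 layers, 32 KV heads, head dimension 128) are taken to be those of Llama-2-7B.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_tokens(gpu_mem_gib: float, utilization: float, weights_gib: float,
                    bytes_per_token: int, overhead_gib: float = 0.0) -> int:
    """Tokens that fit into the memory left after weights and overhead."""
    budget_gib = gpu_mem_gib * utilization - weights_gib - overhead_gib
    return max(0, int(budget_gib * 2**30 // bytes_per_token))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    for utilization in (0.85, 0.95):
        tokens = kv_cache_tokens(24, utilization, 13.0, per_token)
        print(f"utilization={utilization}: room for about {tokens} tokens")
```

If the estimated capacity is smaller than `--max-model-len`, a single full-length request cannot fit, so lowering `--max-model-len` is the more direct fix. Lowering `--gpu-memory-utilization` mainly leaves headroom for memory that other processes or the framework allocate outside the vLLM budget.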

Problem: Slow model download

Solution: Use a persistent volume for model caching

persistence:
  enabled: true
  mountPath: /root/.cache/huggingface

Problem: GPU not detected

Solution:

# Install the NVIDIA GPU Operator via its Helm chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace

# Check node labels
kubectl get nodes --show-labels | grep nvidia

Performance Optimization

Tensor Parallelism (Multi-GPU)

args:
  - "--tensor-parallel-size=2"  # For 2 GPUs

resources:
  limits:
    nvidia.com/gpu: "2"
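
Two things have to line up for tensor parallelism: the model's attention heads must divide evenly across the GPUs, and the pod must request as many GPUs as the tensor-parallel size. The sketch below checks both before deploying. The function name is hypothetical, and it assumes a single pod without pipeline parallelism.

```python
def check_tensor_parallel(num_attention_heads: int, tp_size: int,
                          gpus_requested: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the settings agree."""
    if tp_size < 1:
        return ["tensor-parallel-size must be at least 1"]
    problems = []
    if num_attention_heads % tp_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads are not divisible by "
            f"tensor-parallel-size={tp_size}"
        )
    if gpus_requested != tp_size:
        problems.append(
            f"pod requests {gpus_requested} GPU(s) but tensor-parallel-size is {tp_size}"
        )
    return problems


if __name__ == "__main__":
    # 32 heads is the value assumed here for Llama-2-7B.
    for tp, gpus in [(2, 2), (3, 3), (2, 1)]:
        print(tp, gpus, check_tensor_parallel(32, tp, gpus) or "ok")
```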

Quantization

args:
  - "--quantization=awq"  # AWQ quantization
  # Alternatively (set only one method):
  # - "--quantization=gptq"  # GPTQ quantization

Continuous Batching Tuning

args:
  - "--max-num-batched-tokens=8192"
  - "--max-num-seqs=256"
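
The two limits interact: `--max-num-seqs` caps how many sequences are processed per step, and `--max-num-batched-tokens` caps how many tokens are processed per step. The toy model below illustrates that interaction for waiting prompts. It is a deliberately simplified first-come-first-served admission rule with a hypothetical function name, not vLLM's actual scheduler, which also handles running sequences, preemption, and chunked prefill.

```python
def admit_prefill_batch(prompt_lengths: list[int], max_num_batched_tokens: int,
                        max_num_seqs: int) -> tuple[list[int], int]:
    """Admit waiting prompts in order until either limit would be exceeded.

    Returns the indices of admitted prompts and the total token count.
    """
    admitted = []
    tokens = 0
    for index, length in enumerate(prompt_lengths):
        if len(admitted) >= max_num_seqs:
            break  # sequence limit reached
        if tokens + length > max_num_batched_tokens:
            break  # token budget reached; keep arrival order, do not skip ahead
        admitted.append(index)
        tokens += length
    return admitted, tokens


if __name__ == "__main__":
    long_prompts = [3000, 3000, 3000, 100]
    short_prompts = [10] * 300
    print(admit_prefill_batch(long_prompts, 8192, 256))
    print(len(admit_prefill_batch(short_prompts, 8192, 256)[0]))
```

In this model, long prompts exhaust the token budget first, while many short prompts hit the sequence limit first. Which limit to raise therefore depends on the typical request shape.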

Links & Resources

Deployment guide: README.de.md
vLLM documentation: docs.vllm.ai
Helm Chart Repository: kiva-vllm
Monitoring Setup: kiva-monitoring
