perf: ollama cpu-cap (2 cores) gegen ingest-spikes

Embedding-Inferenz ist CPU-only und skaliert sonst auf alle Cores. cpus: "2.0" + OLLAMA_NUM_PARALLEL=1 halten die Last konstant bei ~2 statt Peaks bis 8 Cores. Bewusster Trade-off: ~5x langsamere Bulk- Laufzeit, dafuer predictable Host-Last (selten laufender Workload). README dokumentiert, dass Coolify dieselben Limits spiegeln muss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:59:30 +02:00
parent 4b9280a972
commit cce17f3517
2 changed files with 15 additions and 0 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -27,6 +27,12 @@ services:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
+    # Cap CPU so embedding peaks don't starve the host. Mirror these
+    # limits in the production Coolify config — Ollama otherwise scales
+    # inference threads to all available cores.
+    cpus: "2.0"
+    environment:
+      OLLAMA_NUM_PARALLEL: "1"

 volumes:
  qdrant_data: