
Models

The gateway exposes two machines:

  • 5080 — RTX 5080, 16 GB VRAM. Fast, small/medium models.
  • Spark — DGX Spark, 128 GB unified memory. Large models, slower per-token.

Naming convention

| Pattern | Behavior |
|---|---|
| `5080/<model>` | Force routing to the 5080 |
| `spark/<model>` | Force routing to the Spark |
| `<model>` (bare name) | Auto-routed to the least-busy backend that has it |
| `<big-model>` (70B+) | Only on the Spark; the bare name routes there |
| `coder/<model>` | Code-specialized models (5080) |
| `embed/<model>` | Embedding models (5080) |
| `whisper-large-v3` | Transcription (5080) |
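Assuming the gateway speaks an OpenAI-compatible chat API (the base URL below is a placeholder, not confirmed by this page), routing is controlled entirely by the `model` string — a sketch:

```python
# Placeholder host; substitute the real gateway address.
GATEWAY_URL = "http://gateway.local/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build a chat request body. Only the `model` field decides
    which backend (5080 or Spark) serves it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

pinned = chat_payload("5080/llama3.2:3b", "hi")  # forced onto the 5080
auto = chat_payload("llama3.1:8b", "hi")         # least-busy backend
big = chat_payload("llama3.3:70b", "hi")         # 70B+: always the Spark
```

The same payload shape works for every prefix in the table; only the model name changes.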

Catalog

General chat

| Model | Best for | Where | Notes |
|---|---|---|---|
| `llama3.2:3b` (via `5080/llama3.2:3b`) | Ultra-low-latency tasks | 5080 | 2 GB. Fastest response. |
| `llama3.1:8b` | General-purpose chat | auto (5080 or Spark) | Balanced quality/speed. |
| `qwen3:8b` | General-purpose chat, multilingual | auto | Alternative to llama3.1. |
| `phi4:14b` (via `5080/phi4:14b`) | Reasoning-heavy tasks | 5080 | Strong for its size. |
| `gpt-oss:20b` | General use, mid-size | 5080 | OpenAI's 20B MoE release. |
| `qwen3:32b` | Mid-size quality | Spark | Smaller member of the Qwen3 family. |
| `llama3.3:70b` | Best general-purpose | Spark | Meta's best dense 70B. |
| `qwen2.5:72b` | Alternative large dense model | Spark | |
| `qwen3-next:80b` | MoE, faster than dense 70B | Spark | Alibaba's MoE. |

Multimodal / vision

| Model | Where | Notes |
|---|---|---|
| `5080/llama3.2-vision:11b` | 5080 | OCR, image Q&A. Send images as base64 data URLs in `content`. |
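Images go into the message `content` as base64 data URLs. A minimal sketch, assuming the gateway accepts the OpenAI-style vision content format (a list mixing `text` and `image_url` parts):

```python
import base64

def image_content(image_bytes: bytes, prompt: str) -> list:
    """Build a vision message content list: the prompt text plus the
    image encoded as a base64 data URL. The part shapes follow the
    OpenAI vision format, which is an assumption here."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]
```

Use it as the `content` of a user message sent to `5080/llama3.2-vision:11b`.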

Code

| Model | Where | Notes |
|---|---|---|
| `coder/qwen2.5-coder:7b` | 5080 | Faster, lower quality. |
| `coder/qwen2.5-coder:14b` | 5080 | Recommended default for IDE integration. |

Embeddings

| Model | Where | Dimension |
|---|---|---|
| `embed/nomic-embed-text` | 5080 | 768 |
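For RAG-style retrieval you send texts to the embedding model and compare the returned 768-dimensional vectors. A sketch, assuming an OpenAI-style embeddings request body (the endpoint shape is an assumption):

```python
import math

EMBED_MODEL = "embed/nomic-embed-text"  # returns 768-dim vectors

def embed_payload(texts: list) -> dict:
    """Request body for an OpenAI-style embeddings endpoint (assumed)."""
    return {"model": EMBED_MODEL, "input": texts}

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two returned embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Rank candidate chunks by `cosine(query_vec, chunk_vec)` and keep the top few.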

Transcription

| Model | Where | Notes |
|---|---|---|
| `whisper-large-v3` | 5080 | Whisper Large v3 via faster-whisper. |
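Transcription takes the audio as a multipart file upload rather than JSON. A sketch of the form pieces, assuming an OpenAI-style `/v1/audio/transcriptions` endpoint (the endpoint path and field names are assumptions based on that API, not confirmed by this page):

```python
import io

def transcription_request(audio: io.IOBase, filename: str = "audio.wav") -> dict:
    """Multipart form pieces for a whisper-large-v3 transcription call:
    a `model` form field plus the audio as a `file` part. Pass the two
    dicts as `data=` and `files=` to an HTTP client such as requests."""
    return {
        "data": {"model": "whisper-large-v3"},
        "files": {"file": (filename, audio, "audio/wav")},
    }
```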

Choosing a model

  • Need it fast? Use llama3.2:3b on 5080.
  • General chat? Use bare llama3.1:8b — auto-routes to whichever backend is idle.
  • Need quality? Use llama3.3:70b — it's the strongest general model available here.
  • Code completion in an IDE? Use coder/qwen2.5-coder:14b.
  • OCR or looking at images? 5080/llama3.2-vision:11b.
  • Embeddings for RAG? embed/nomic-embed-text.
  • Transcribing audio? whisper-large-v3.
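The decision list above can be collapsed into a small lookup. The task labels are illustrative, not part of the gateway:

```python
def pick_model(task: str) -> str:
    """Mirror of the 'Choosing a model' list; raises KeyError on
    an unknown task label."""
    table = {
        "fast": "5080/llama3.2:3b",
        "chat": "llama3.1:8b",            # bare name: auto-routed
        "quality": "llama3.3:70b",        # Spark only
        "code": "coder/qwen2.5-coder:14b",
        "vision": "5080/llama3.2-vision:11b",
        "embed": "embed/nomic-embed-text",
        "transcribe": "whisper-large-v3",
    }
    return table[task]
```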

Cold-load cost

Models that have not been used recently must load from disk. The 5080 loads from spinning disks, so the first request after idle can be slow:

| Size | Cold load |
|---|---|
| 2–5 GB | 10–30 s |
| 9 GB (phi4, qwen2.5-coder 14B) | ~55 s |
| 14 GB (gpt-oss 20B) | ~90 s |

After the first load, models stay resident in VRAM for 30 minutes after the last request.

The Spark uses NVMe — its large models still take real time to load (they're huge) but the per-byte cost is much lower.
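One way to avoid paying the cold-load cost interactively is to warm a model with a one-token request before real traffic arrives. A sketch, assuming an OpenAI-style `max_tokens` field (the field name and the warm list below are illustrative, not part of the gateway):

```python
def warmup_payload(model: str) -> dict:
    """One-token chat request whose only purpose is to pull `model`
    from disk into VRAM ahead of real traffic."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }

# Example: the slowest 5080 cold loaders from the table above.
WARM_LIST = ["5080/phi4:14b", "gpt-oss:20b"]
payloads = [warmup_payload(m) for m in WARM_LIST]
```

Since models stay resident for 30 minutes after the last request, rerunning this on a ~25-minute timer keeps them warm indefinitely.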