# Models
The gateway exposes two machines:
- 5080 — RTX 5080, 16 GB VRAM. Fast, small/medium models.
- Spark — DGX Spark, 128 GB unified memory. Large models, slower per-token.
## Naming convention

| Pattern | Behavior |
|---|---|
| `5080/<model>` | Force routing to the 5080 |
| `spark/<model>` | Force routing to the Spark |
| `<model>` (bare name) | Auto-routed to the least-busy backend that has it |
| `<big-model>` (70B+) | Only on Spark; bare name routes there |
| `coder/<model>` | Code-specialized models (5080) |
| `embed/<model>` | Embedding models (5080) |
| `whisper-large-v3` | Transcription (5080) |
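The routing rules above can be sketched as a small resolver. This is an illustrative client-side model only, not the gateway's actual implementation; the `BIG_MODELS` set and function name are assumptions based on the catalog below.

```python
# Spark-only 70B+ models, per the catalog (assumed set, keep in sync with it).
BIG_MODELS = {"llama3.3:70b", "qwen2.5:72b", "qwen3-next:80b"}

def resolve(model: str) -> str:
    """Return which backend a requested model name routes to:
    '5080', 'spark', or 'auto' (least-busy backend that has it)."""
    # Explicit prefixes and 5080-only families pin the request to the 5080.
    if model.startswith(("5080/", "coder/", "embed/")) or model == "whisper-large-v3":
        return "5080"
    # Explicit spark/ prefix, or a bare 70B+ name, lands on the Spark.
    if model.startswith("spark/") or model in BIG_MODELS:
        return "spark"
    # Everything else is auto-routed.
    return "auto"
```

For example, `resolve("llama3.1:8b")` returns `"auto"`, while `resolve("llama3.3:70b")` returns `"spark"` even without a prefix.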
## Catalog

### General chat

| Model | Best for | Where | Notes |
|---|---|---|---|
| `llama3.2:3b` via `5080/llama3.2:3b` | Ultra-low-latency tasks | 5080 | 2 GB. Fastest response. |
| `llama3.1:8b` | General-purpose chat | auto (5080 or Spark) | Balanced quality/speed. |
| `qwen3:8b` | General-purpose chat, multilingual | auto | Alternative to llama3.1. |
| `phi4:14b` via `5080/phi4:14b` | Reasoning-heavy tasks | 5080 | Strong for its size. |
| `gpt-oss:20b` | General use, mid-size | 5080 | OpenAI's 20B MoE release. |
| `qwen3:32b` | Mid-size quality | Spark | Smaller member of the Qwen3 family. |
| `llama3.3:70b` | Best general-purpose | Spark | Meta's best dense 70B. |
| `qwen2.5:72b` | Alternative large dense model | Spark | — |
| `qwen3-next:80b` | MoE, faster than dense 70B | Spark | Alibaba's MoE. |
### Multimodal / vision

| Model | Where | Notes |
|---|---|---|
| `5080/llama3.2-vision:11b` | 5080 | OCR, image Q&A. Send images as base64 data URLs in `content`. |
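A minimal sketch of packaging an image as a base64 data URL, assuming the gateway accepts OpenAI-style chat messages with `image_url` content parts (the exact request schema is an assumption; only the data-URL encoding itself is fixed):

```python
import base64

def image_message(image_bytes: bytes, question: str, mime: str = "image/png") -> dict:
    """Build a hypothetical user message carrying an image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # The image travels inline as a data URL, not as a file reference.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }
```

The resulting dict would go in the `messages` array of a chat request addressed to `5080/llama3.2-vision:11b`.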
### Code

| Model | Where | Notes |
|---|---|---|
| `coder/qwen2.5-coder:7b` | 5080 | Faster, lower quality. |
| `coder/qwen2.5-coder:14b` | 5080 | Recommended default for IDE integration. |
### Embeddings

| Model | Where | Dimension |
|---|---|---|
| `embed/nomic-embed-text` | 5080 | 768 |
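A typical use of the 768-dimensional vectors from `embed/nomic-embed-text` is ranking RAG chunks by cosine similarity. The vectors below are toy values; in practice both would come from the gateway's embeddings endpoint:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank candidate chunks against a query vector, highest similarity first.
def rank(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[int]:
    return sorted(range(len(chunk_vecs)),
                  key=lambda i: cosine(query_vec, chunk_vecs[i]),
                  reverse=True)
```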
### Transcription

| Model | Where | Notes |
|---|---|---|
| `whisper-large-v3` | 5080 | Whisper Large v3, served via faster-whisper. |
## Choosing a model

- Need it fast? Use `llama3.2:3b` on the 5080.
- General chat? Use bare `llama3.1:8b`; it auto-routes to whichever backend is idle.
- Need quality? Use `llama3.3:70b`, the strongest general model available here.
- Code completion in an IDE? Use `coder/qwen2.5-coder:14b`.
- OCR or image understanding? `5080/llama3.2-vision:11b`.
- Embeddings for RAG? `embed/nomic-embed-text`.
- Transcribing audio? `whisper-large-v3`.
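The decision list above can be collapsed into a lookup table for client code. The task keys here are illustrative and not part of the gateway's API; only the model names come from the catalog:

```python
# Hypothetical task -> recommended model mapping, mirroring the bullets above.
DEFAULT_MODEL = {
    "fast": "5080/llama3.2:3b",
    "chat": "llama3.1:8b",
    "quality": "llama3.3:70b",
    "code": "coder/qwen2.5-coder:14b",
    "vision": "5080/llama3.2-vision:11b",
    "embed": "embed/nomic-embed-text",
    "transcribe": "whisper-large-v3",
}

def pick(task: str) -> str:
    """Return the recommended model name for a task, or raise KeyError."""
    return DEFAULT_MODEL[task]
```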
## Cold-load cost

Models not recently used must be loaded from disk. The 5080 uses spinning disks, so the first request after idle can be slow:

| Size | Cold load |
|---|---|
| 2–5 GB | 10–30 s |
| 9 GB (phi4, qwen-coder 14B) | ~55 s |
| 14 GB (gpt-oss 20B) | ~90 s |

After the first load, models stay resident in VRAM for 30 minutes after the last request.

The Spark uses NVMe; its large models still take real time to load (they are huge), but the per-byte cost is much lower.
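The cold-load figures above work out to roughly 160 MB/s sustained read on the 5080 (9 GB in ~55 s, 14 GB in ~90 s). A back-of-envelope estimator, where the throughput constants are inferred from the table rather than measured (the Spark's NVMe figure in particular is a placeholder guess):

```python
# Approximate sustained read throughput per backend, in MB/s.
# 5080 value is inferred from the cold-load table; spark value is a guess.
DISK_MBPS = {"5080": 160, "spark": 2000}

def cold_load_seconds(size_gb: float, backend: str = "5080") -> float:
    """Rough time to stream a model of size_gb from disk into memory."""
    return size_gb * 1024 / DISK_MBPS[backend]
```

For example, `cold_load_seconds(14)` gives about 90 seconds, matching the gpt-oss 20B row.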