# Models
The gateway exposes two machines:
- 5080 — RTX 5080, 16 GB VRAM. Fast, small/medium models.
- Spark — DGX Spark, 128 GB unified memory. Large models, slower per-token.
## Naming convention

| Pattern | Behavior |
|---|---|
| `5080/<model>` | Force routing to the 5080 |
| `spark/<model>` | Force routing to the Spark |
| `<model>` (bare name) | Auto-routed to the least-busy backend that has it |
| `<big-model>` (70B+) | Only on Spark; bare name routes there |
| `coder/<model>` | Code-specialized models (5080) |
| `embed/<model>` | Embedding models (5080) |
| `whisper-large-v3` | Transcription (5080) |
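The routing rules above can be sketched as a small resolver. This is an illustrative client-side model only, not the gateway's actual implementation; the `BIG_MODELS` set and function name are assumptions based on the catalog below.

```python
# Spark-only 70B+ models, per the catalog (assumed set, keep in sync with it).
BIG_MODELS = {"llama3.3:70b", "qwen2.5:72b", "qwen3-next:80b"}

def resolve(model: str) -> str:
    """Return which backend a requested model name routes to:
    '5080', 'spark', or 'auto' (least-busy backend that has it)."""
    # Explicit prefixes and 5080-only families pin the request to the 5080.
    if model.startswith(("5080/", "coder/", "embed/")) or model == "whisper-large-v3":
        return "5080"
    # Explicit spark/ prefix, or a bare 70B+ name, lands on the Spark.
    if model.startswith("spark/") or model in BIG_MODELS:
        return "spark"
    # Everything else is auto-routed.
    return "auto"
```

For example, `resolve("llama3.1:8b")` returns `"auto"`, while `resolve("llama3.3:70b")` returns `"spark"` even without a prefix.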
## Catalog

### General chat

| Model | Best for | Where | Notes |
|---|---|---|---|
| `llama3.2:3b` via `5080/llama3.2:3b` | Ultra-low-latency tasks | 5080 | 2 GB. Fastest response. |
| `llama3.1:8b` | General-purpose chat | auto (5080 or Spark) | Balanced quality/speed. |
| `qwen3:8b` | General-purpose chat, multilingual | auto | Alternative to llama3.1. |
| `phi4:14b` via `5080/phi4:14b` | Reasoning-heavy tasks | 5080 | Strong for its size. |
| `gpt-oss:20b` | General use, mid-size | 5080 | OpenAI's 20B MoE release. |
| `qwen3:32b` | Mid-size quality | Spark | Smaller member of the Qwen3 family. |
| `llama3.3:70b` | Best general-purpose | Spark | Meta's best dense 70B. |
| `qwen2.5:72b` | Alternative large dense model | Spark | — |
| `qwen3-next:80b` | MoE, faster than dense 70B | Spark | Alibaba's MoE. |
### Multimodal / vision

| Model | Where | Notes |
|---|---|---|
| `5080/llama3.2-vision:11b` | 5080 | OCR, image Q&A. Send images as base64 data URLs in `content`. |
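A minimal sketch of packaging an image as a base64 data URL, assuming the gateway accepts OpenAI-style chat messages with `image_url` content parts (the exact request schema is an assumption; only the data-URL encoding itself is fixed):

```python
import base64

def image_message(image_bytes: bytes, question: str, mime: str = "image/png") -> dict:
    """Build a hypothetical user message carrying an image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # The image travels inline as a data URL, not as a file reference.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }
```

The resulting dict would go in the `messages` array of a chat request addressed to `5080/llama3.2-vision:11b`.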
### Code

| Model | Where | Notes |
|---|---|---|
| `coder/qwen2.5-coder:7b` | 5080 | Faster, lower quality. |
| `coder/qwen2.5-coder:14b` | 5080 | Recommended default for IDE integration. |
### Embeddings

| Model | Where | Dimension |
|---|---|---|
| `embed/nomic-embed-text` | 5080 | 768 |
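A typical use of the 768-dimensional vectors from `embed/nomic-embed-text` is ranking RAG chunks by cosine similarity. The vectors below are toy values; in practice both would come from the gateway's embeddings endpoint:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank candidate chunks against a query vector, highest similarity first.
def rank(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[int]:
    return sorted(range(len(chunk_vecs)),
                  key=lambda i: cosine(query_vec, chunk_vecs[i]),
                  reverse=True)
```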
### Transcription

| Model | Where | Notes |
|---|---|---|
| `whisper-large-v3` | 5080 | Whisper Large v3, served via faster-whisper. |
## Choosing a model

- Need it fast? Use `llama3.2:3b` on the 5080.
- General chat? Use bare `llama3.1:8b`; it auto-routes to whichever backend is idle.
- Need quality? Use `llama3.3:70b`, the strongest general model available here.
- Code completion in an IDE? Use `coder/qwen2.5-coder:14b`.
- OCR or image understanding? `5080/llama3.2-vision:11b`.
- Embeddings for RAG? `embed/nomic-embed-text`.
- Transcribing audio? `whisper-large-v3`.
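The decision list above can be collapsed into a lookup table for client code. The task keys here are illustrative and not part of the gateway's API; only the model names come from the catalog:

```python
# Hypothetical task -> recommended model mapping, mirroring the bullets above.
DEFAULT_MODEL = {
    "fast": "5080/llama3.2:3b",
    "chat": "llama3.1:8b",
    "quality": "llama3.3:70b",
    "code": "coder/qwen2.5-coder:14b",
    "vision": "5080/llama3.2-vision:11b",
    "embed": "embed/nomic-embed-text",
    "transcribe": "whisper-large-v3",
}

def pick(task: str) -> str:
    """Return the recommended model name for a task, or raise KeyError."""
    return DEFAULT_MODEL[task]
```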
## Cold-load cost

Models not recently used must be loaded from disk. The 5080 uses spinning disks, so the first request after idle can be slow:

| Size | Cold load |
|---|---|
| 2–5 GB | 10–30 s |
| 9 GB (phi4, qwen-coder 14B) | ~55 s |
| 14 GB (gpt-oss 20B) | ~90 s |

After the first load, models stay resident in VRAM for 30 minutes after the last request.

The Spark uses NVMe; its large models still take real time to load (they are huge), but the per-byte cost is much lower.
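The cold-load figures above work out to roughly 160 MB/s sustained read on the 5080 (9 GB in ~55 s, 14 GB in ~90 s). A back-of-envelope estimator, where the throughput constants are inferred from the table rather than measured (the Spark's NVMe figure in particular is a placeholder guess):

```python
# Approximate sustained read throughput per backend, in MB/s.
# 5080 value is inferred from the cold-load table; spark value is a guess.
DISK_MBPS = {"5080": 160, "spark": 2000}

def cold_load_seconds(size_gb: float, backend: str = "5080") -> float:
    """Rough time to stream a model of size_gb from disk into memory."""
    return size_gb * 1024 / DISK_MBPS[backend]
```

For example, `cold_load_seconds(14)` gives about 90 seconds, matching the gpt-oss 20B row.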