Coding Agents
CLI-style autonomous coding agents — read files, run commands, propose edits. This page covers connecting them to the gateway.
Expectation-setting
These agents were built and tuned around frontier hosted models (Claude, GPT-5, Gemini). Pointing them at a local 70B — even a good one — is a real step down in reliability. Tool-call formatting errors, stuck loops, and refusals to edit files are normal. Use local agents for cheap/offline work; reach for hosted for hard problems.
Best local models for agentic coding work (in order):
llama3.3:70b → qwen3-next:80b → qwen2.5-coder:14b → gpt-oss:20b.
Claude Code
Claude Code is Anthropic's official CLI. It natively speaks the Anthropic Messages API. LiteLLM exposes an Anthropic-compatible endpoint (/v1/messages) that translates to any model in your catalog — so Claude Code can drive a local llama/qwen through this gateway.
1. Install
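Claude Code ships as an npm package; a global install is the standard route:

```shell
npm install -g @anthropic-ai/claude-code
```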
2. Point it at the gateway
Claude Code uses Anthropic-flavored env vars. Set these in your shell profile:
```shell
export ANTHROPIC_BASE_URL="https://api.chris.hellotopia.io"
export ANTHROPIC_AUTH_TOKEN="sk-your-key-here"
export ANTHROPIC_MODEL="llama3.3:70b"
export ANTHROPIC_SMALL_FAST_MODEL="qwen2.5-coder:7b"
```
- `ANTHROPIC_BASE_URL` — server root (Claude Code appends `/v1/messages`). Do not add `/v1` yourself.
- `ANTHROPIC_AUTH_TOKEN` — your gateway API key, sent as a Bearer token.
- `ANTHROPIC_MODEL` — main reasoning model. Any gateway model ID works.
- `ANTHROPIC_SMALL_FAST_MODEL` — used for cheap subtasks (summaries, routing decisions). Pick something ≤7B for latency.
3. Run
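Launch it from your project directory; the npm package installs a `claude` binary:

```shell
claude
```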
You should see the normal Claude Code interface. Type a request and it'll read files, propose edits, and run shell commands (with your approval) using the local model.
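If the session hangs instead, a quick smoke test against the gateway's Anthropic-compatible endpoint isolates whether the problem is Claude Code or the gateway. This is the standard Messages API request shape; swap in your real key:

```shell
curl -s https://api.chris.hellotopia.io/v1/messages \
  -H "x-api-key: sk-your-key-here" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "llama3.3:70b", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hello."}]}'
```

A JSON response with a `content` array means the gateway side is fine and the issue is in the agent.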
4. Per-project config (optional)
Drop a .claude/settings.json in a project to override defaults locally:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.chris.hellotopia.io",
    "ANTHROPIC_MODEL": "qwen3-next:80b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen2.5-coder:7b"
  }
}
```
Model picks for Claude Code
| Use | Main model | Small model | Notes |
|---|---|---|---|
| Best quality, patient | `llama3.3:70b` | `qwen2.5-coder:7b` | ~4–6 tok/s on the main model. Feels sluggish in interactive sessions. |
| Balanced (recommended) | `qwen3-next:80b` | `qwen2.5-coder:7b` | MoE, ~10B active. 3–4× faster than the dense 70B. |
| Pure speed | `qwen2.5-coder:14b` | `llama3.2:3b` | Stays on the 5080, fastest round-trips. Weaker reasoning. |
| Biggest brain (once installed) | `gpt-oss:120b` | `qwen2.5-coder:7b` | MoE, ~5B active. Fast and strong; install with `ollama pull gpt-oss:120b` on the Spark. |
Known limitations vs hosted Claude
- No prompt caching — every turn re-sends the full context. Long sessions get expensive in tokens and slow.
- No extended thinking — the local model won't do Claude's explicit reasoning block.
- Tool-call brittleness — Claude Code's tool format is strict. Local models occasionally emit malformed tool JSON, causing a retry or a stall. Restart the session if it gets stuck.
- Context window — gateway default is 16K. Override with `ANTHROPIC_MAX_TOKENS` / per-model Ollama options if you need more.
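On the Ollama side, one way to get a bigger context is a derived model with a larger `num_ctx`. A sketch, assuming you manage models through Ollama Modelfiles (the `-32k` tag is an arbitrary name):

```shell
# Create a 32K-context variant of llama3.3:70b
cat > Modelfile <<'EOF'
FROM llama3.3:70b
PARAMETER num_ctx 32768
EOF
ollama create llama3.3:70b-32k -f Modelfile
```

Then point `ANTHROPIC_MODEL` at the new tag. Note the KV cache roughly doubles going from 16K to 32K, so check VRAM headroom first.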
Aider
Aider is a battle-tested OSS coding CLI. It speaks OpenAI natively — no proxy needed.
```shell
pip install aider-chat

export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_API_BASE="https://api.chris.hellotopia.io/v1"

aider --model openai/llama3.3:70b
```
Use `--model openai/coder/qwen2.5-coder:14b` for faster turns on smaller tasks. The `openai/` prefix tells Aider to route via the OpenAI adapter (required for custom base URLs).
Aider also has explicit support for non-hosted models: if the model struggles with Aider's default edit format, switch to `--edit-format whole` or `--edit-format diff-fenced`.
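For example, to fall back to whole-file edits when diff-style edits keep failing:

```shell
aider --model openai/llama3.3:70b --edit-format whole
```

Whole-file mode re-sends entire files, so it costs more tokens per edit, but it is the most forgiving format for local models.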
OpenAI Codex CLI
Codex CLI is OpenAI's terminal-based coding agent. It accepts custom OpenAI-compatible endpoints:
```shell
npm install -g @openai/codex

export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_BASE_URL="https://api.chris.hellotopia.io/v1"

codex --model llama3.3:70b
```
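Recent Codex CLI releases can also persist this in `~/.codex/config.toml` instead of env vars. The sketch below assumes the `model_providers` table format from Codex's config docs; field names have changed across versions, so treat it as a starting point:

```toml
model = "llama3.3:70b"
model_provider = "gateway"

[model_providers.gateway]
name = "Local gateway"
base_url = "https://api.chris.hellotopia.io/v1"
env_key = "OPENAI_API_KEY"
```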
Cline / Roo Code (VS Code)
Cline (formerly Claude Dev) and Roo Code are VS Code extensions that run a Claude-style agent inside the editor. Both support OpenAI-compatible providers natively — no proxy needed.
Configure in the extension settings:
- Provider: OpenAI Compatible
- Base URL: `https://api.chris.hellotopia.io/v1`
- API Key: `sk-your-key-here`
- Model ID: `llama3.3:70b` (or any gateway model)
For agentic coding tasks, pick a model with strong tool-use training. llama3.3:70b, qwen3-next:80b, and gpt-oss:20b are the strongest options here.
Goose
Goose is Block's open-source agent. Configure a custom OpenAI provider via ~/.config/goose/config.yaml:
```yaml
GOOSE_PROVIDER: openai
OPENAI_HOST: https://api.chris.hellotopia.io
OPENAI_BASE_PATH: /v1
OPENAI_API_KEY: sk-your-key-here
GOOSE_MODEL: llama3.3:70b
```
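Then start an interactive session (in current Goose releases the entry point is `goose session`):

```shell
goose session
```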
Tips that apply to all of them
- Keep context tight. Local models degrade fast past 8–12K tokens of context. Agents that love to read everything in sight will pay in latency and quality.
- Expect retries. Tool-call formatting is the #1 failure mode. If an agent stalls, interrupt and rephrase.
- Pick the right model per task. Use
coder/qwen2.5-coder:7bfor snappy edits,llama3.3:70borqwen3-next:80bfor reasoning-heavy work. - Watch the cold-load cost. Switching between big models forces reloads. Pick one main model per session and stick with it.
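Before blaming any agent, confirm the gateway is reachable and see exactly which model IDs it exposes. The standard OpenAI-compatible models listing works for this (swap in your real key):

```shell
curl -s https://api.chris.hellotopia.io/v1/models \
  -H "Authorization: Bearer sk-your-key-here"
```

The IDs in the response are what every `--model` flag and config field on this page must match.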