Coding Agents
CLI-style autonomous coding agents — read files, run commands, propose edits. This page covers connecting them to the gateway.
Expectation-setting
These agents were built and tuned around frontier hosted models (Claude, GPT-5, Gemini). Pointing them at a local 70B — even a good one — is a real step down in reliability. Tool-call formatting errors, stuck loops, and refusals to edit files are normal. Use local agents for cheap/offline work; reach for hosted for hard problems.
Best local models for agentic coding work (in order):
llama3.3:70b → qwen3-next:80b → qwen2.5-coder:14b → gpt-oss:20b.
Claude Code
Claude Code is Anthropic's official CLI. It natively speaks the Anthropic Messages API. LiteLLM exposes an Anthropic-compatible endpoint (/v1/messages) that translates to any model in your catalog — so Claude Code can drive a local llama/qwen through this gateway.
1. Install
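Claude Code ships as an npm package; a global install is the standard route:

```shell
npm install -g @anthropic-ai/claude-code
```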
2. Point it at the gateway
Claude Code uses Anthropic-flavored env vars. Set these in your shell profile:
```shell
export ANTHROPIC_BASE_URL="https://api.chris.hellotopia.io"
export ANTHROPIC_AUTH_TOKEN="sk-your-key-here"
export ANTHROPIC_MODEL="llama3.3:70b"
export ANTHROPIC_SMALL_FAST_MODEL="qwen2.5-coder:7b"
```
- `ANTHROPIC_BASE_URL` — server root (Claude Code appends `/v1/messages`). Do not add `/v1` yourself.
- `ANTHROPIC_AUTH_TOKEN` — your gateway API key, sent as a Bearer token.
- `ANTHROPIC_MODEL` — main reasoning model. Any gateway model ID works.
- `ANTHROPIC_SMALL_FAST_MODEL` — used for cheap subtasks (summaries, routing decisions). Pick something ≤7B for latency.
3. Run
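Launch it from your project directory; the npm package installs a `claude` binary:

```shell
claude
```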
You should see the normal Claude Code interface. Type a request and it'll read files, propose edits, and run shell commands (with your approval) using the local model.
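If the session hangs instead, a quick smoke test against the gateway's Anthropic-compatible endpoint isolates whether the problem is Claude Code or the gateway. This is the standard Messages API request shape; swap in your real key:

```shell
curl -s https://api.chris.hellotopia.io/v1/messages \
  -H "x-api-key: sk-your-key-here" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "llama3.3:70b", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hello."}]}'
```

A JSON response with a `content` array means the gateway side is fine and the issue is in the agent.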
4. Per-project config (optional)
Drop a .claude/settings.json in a project to override defaults locally:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.chris.hellotopia.io",
    "ANTHROPIC_MODEL": "qwen3-next:80b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen2.5-coder:7b"
  }
}
```
Model picks for Claude Code
| Use | Main model | Small model | Notes |
|---|---|---|---|
| Best quality, patient | `llama3.3:70b` | `qwen2.5-coder:7b` | ~4–6 tok/s on the main model. Feels sluggish in interactive sessions. |
| Balanced (recommended) | `qwen3-next:80b` | `qwen2.5-coder:7b` | MoE, ~10B active. 3–4× faster than the dense 70B. |
| Pure speed | `qwen2.5-coder:14b` | `llama3.2:3b` | Stays on the 5080, fastest round-trips. Weaker reasoning. |
| Biggest brain (once installed) | `gpt-oss:120b` | `qwen2.5-coder:7b` | MoE, ~5B active. Fast and strong; install with `ollama pull gpt-oss:120b` on the Spark. |
Known limitations vs hosted Claude
- No prompt caching — every turn re-sends the full context. Long sessions get expensive in tokens and slow.
- No extended thinking — the local model won't do Claude's explicit reasoning block.
- Tool-call brittleness — Claude Code's tool format is strict. Local models occasionally emit malformed tool JSON, causing a retry or a stall. Restart the session if it gets stuck.
- Context window — gateway default is 16K. Override with `ANTHROPIC_MAX_TOKENS` / per-model Ollama options if you need more.
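On the Ollama side, one way to get a bigger context is a derived model with a larger `num_ctx`. A sketch, assuming you manage models through Ollama Modelfiles (the `-32k` tag is an arbitrary name):

```shell
# Create a 32K-context variant of llama3.3:70b
cat > Modelfile <<'EOF'
FROM llama3.3:70b
PARAMETER num_ctx 32768
EOF
ollama create llama3.3:70b-32k -f Modelfile
```

Then point `ANTHROPIC_MODEL` at the new tag. Note the KV cache roughly doubles going from 16K to 32K, so check VRAM headroom first.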
Aider
Aider is a battle-tested OSS coding CLI. It speaks OpenAI natively — no proxy needed.
```shell
pip install aider-chat

export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_API_BASE="https://api.chris.hellotopia.io/v1"

aider --model openai/llama3.3:70b
```
Use `--model openai/coder/qwen2.5-coder:14b` for faster turns on smaller tasks. The `openai/` prefix tells Aider to route via the OpenAI adapter (required for custom base URLs).
Aider also has explicit support for non-hosted models: if the model struggles with Aider's default edit format, switch to `--edit-format whole` or `--edit-format diff-fenced`.
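For example, to fall back to whole-file edits when diff-style edits keep failing:

```shell
aider --model openai/llama3.3:70b --edit-format whole
```

Whole-file mode re-sends entire files, so it costs more tokens per edit, but it is the most forgiving format for local models.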
OpenAI Codex CLI
Codex CLI is OpenAI's terminal-based coding agent. It accepts custom OpenAI-compatible endpoints:
```shell
npm install -g @openai/codex

export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_BASE_URL="https://api.chris.hellotopia.io/v1"

codex --model llama3.3:70b
```
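Recent Codex CLI releases can also persist this in `~/.codex/config.toml` instead of env vars. The sketch below assumes the `model_providers` table format from Codex's config docs; field names have changed across versions, so treat it as a starting point:

```toml
model = "llama3.3:70b"
model_provider = "gateway"

[model_providers.gateway]
name = "Local gateway"
base_url = "https://api.chris.hellotopia.io/v1"
env_key = "OPENAI_API_KEY"
```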
Cline / Roo Code (VS Code)
Cline (formerly Claude Dev) and Roo Code are VS Code extensions that run a Claude-style agent inside the editor. Both support OpenAI-compatible providers natively — no proxy needed.
Configure in the extension settings:
- Provider: OpenAI Compatible
- Base URL: `https://api.chris.hellotopia.io/v1`
- API Key: `sk-your-key-here`
- Model ID: `llama3.3:70b` (or any gateway model)
For agentic coding tasks, pick a model with strong tool-use training. llama3.3:70b, qwen3-next:80b, and gpt-oss:20b are the strongest options here.
Goose
Goose is Block's open-source agent. Configure a custom OpenAI provider via ~/.config/goose/config.yaml:
```yaml
GOOSE_PROVIDER: openai
OPENAI_HOST: https://api.chris.hellotopia.io
OPENAI_BASE_PATH: /v1
OPENAI_API_KEY: sk-your-key-here
GOOSE_MODEL: llama3.3:70b
```
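Then start an interactive session (in current Goose releases the entry point is `goose session`):

```shell
goose session
```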
Tips that apply to all of them
- Keep context tight. Local models degrade fast past 8–12K tokens of context. Agents that love to read everything in sight will pay in latency and quality.
- Expect retries. Tool-call formatting is the #1 failure mode. If an agent stalls, interrupt and rephrase.
- Pick the right model per task. Use
coder/qwen2.5-coder:7bfor snappy edits,llama3.3:70borqwen3-next:80bfor reasoning-heavy work. - Watch the cold-load cost. Switching between big models forces reloads. Pick one main model per session and stick with it.
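Before blaming any agent, confirm the gateway is reachable and see exactly which model IDs it exposes. The standard OpenAI-compatible models listing works for this (swap in your real key):

```shell
curl -s https://api.chris.hellotopia.io/v1/models \
  -H "Authorization: Bearer sk-your-key-here"
```

The IDs in the response are what every `--model` flag and config field on this page must match.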