
Coding Agents

CLI-style autonomous coding agents — read files, run commands, propose edits. This page covers connecting them to the gateway.

Expectation-setting

These agents were built and tuned around frontier hosted models (Claude, GPT-5, Gemini). Pointing them at a local 70B — even a good one — is a real step down in reliability. Tool-call formatting errors, stuck loops, and refusals to edit files are normal. Use local agents for cheap/offline work; reach for hosted for hard problems.

Best local models for agentic coding work, in rough order: llama3.3:70b, qwen3-next:80b, qwen2.5-coder:14b, gpt-oss:20b.

Claude Code

Claude Code is Anthropic's official CLI. It natively speaks the Anthropic Messages API. LiteLLM exposes an Anthropic-compatible endpoint (/v1/messages) that translates to any model in your catalog — so Claude Code can drive a local llama/qwen through this gateway.

1. Install

npm install -g @anthropic-ai/claude-code

2. Point it at the gateway

Claude Code uses Anthropic-flavored env vars. Set these in your shell profile:

export ANTHROPIC_BASE_URL="https://api.chris.hellotopia.io"
export ANTHROPIC_AUTH_TOKEN="sk-your-key-here"
export ANTHROPIC_MODEL="llama3.3:70b"
export ANTHROPIC_SMALL_FAST_MODEL="qwen2.5-coder:7b"
  • ANTHROPIC_BASE_URL — server root (Claude Code appends /v1/messages). Do not add /v1 yourself.
  • ANTHROPIC_AUTH_TOKEN — your gateway API key, sent as a bearer token.
  • ANTHROPIC_MODEL — main reasoning model. Any gateway model ID works.
  • ANTHROPIC_SMALL_FAST_MODEL — used for cheap subtasks (summaries, routing decisions). Pick something ≤7B for latency.
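The main model is the setting you'll change most often. A small shell helper (hypothetical, just wrapping the same env vars) keeps switching cheap:

```shell
# Drop into ~/.bashrc or ~/.zshrc. Model IDs are the gateway catalog names
# used on this page.
claude_profile() {
  case "$1" in
    quality)  export ANTHROPIC_MODEL="llama3.3:70b" ;;
    balanced) export ANTHROPIC_MODEL="qwen3-next:80b" ;;
    fast)     export ANTHROPIC_MODEL="qwen2.5-coder:14b" ;;
    *) echo "usage: claude_profile quality|balanced|fast" >&2; return 1 ;;
  esac
  export ANTHROPIC_SMALL_FAST_MODEL="qwen2.5-coder:7b"
  echo "ANTHROPIC_MODEL=$ANTHROPIC_MODEL"
}
```

Then `claude_profile fast` before a session of small edits, `claude_profile quality` when you need the big model.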

3. Run

cd ~/some-project
claude

You should see the normal Claude Code interface. Type a request and it'll read files, propose edits, and run shell commands (with your approval) using the local model.
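If the interface comes up but requests fail, it's worth confirming the gateway's /v1/messages endpoint directly, outside the agent. A minimal stdlib sketch — the base URL and key are the placeholders used throughout this page, and the last line is commented out so nothing is sent until you uncomment it:

```python
import json
import urllib.request

BASE_URL = "https://api.chris.hellotopia.io"  # placeholder gateway root
API_KEY = "sk-your-key-here"                  # placeholder gateway key

# Anthropic Messages API request shape — this is what Claude Code sends.
payload = {
    "model": "llama3.3:70b",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with one word."}],
}

req = urllib.request.Request(
    BASE_URL + "/v1/messages",
    data=json.dumps(payload).encode(),
    headers={
        "content-type": "application/json",
        "x-api-key": API_KEY,               # Anthropic-style auth header
        "anthropic-version": "2023-06-01",  # required by the Messages API
    },
)
# print(urllib.request.urlopen(req).read().decode())  # uncomment to send
```

A JSON response with a `content` array means the gateway side is fine and any remaining trouble is in the agent loop itself.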

4. Per-project config (optional)

Drop a .claude/settings.json in a project to override defaults locally:

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.chris.hellotopia.io",
    "ANTHROPIC_MODEL": "qwen3-next:80b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen2.5-coder:7b"
  }
}

Model picks for Claude Code

| Use | Main model | Small model | Notes |
| --- | --- | --- | --- |
| Best quality, patient | llama3.3:70b | qwen2.5-coder:7b | ~4–6 tok/s main. Feels sluggish in interactive sessions. |
| Balanced (recommended) | qwen3-next:80b | qwen2.5-coder:7b | MoE, ~10B active. 3–4× faster than the 70B dense model. |
| Pure speed | qwen2.5-coder:14b | llama3.2:3b | Stays on the 5080, fastest round-trips. Weaker reasoning. |
| Biggest brain (once installed) | gpt-oss:120b | qwen2.5-coder:7b | MoE, ~5B active. Fast and strong; install with ollama pull gpt-oss:120b on Spark. |

Known limitations vs hosted Claude

  • No prompt caching — every turn re-sends the full context, so long sessions get slow and token-hungry.
  • No extended thinking — the local model won't do Claude's explicit reasoning block.
  • Tool-call brittleness — Claude Code's tool format is strict. Local models occasionally emit malformed tool JSON, causing a retry or a stall. Restart the session if it gets stuck.
  • Context window — gateway default is 16K. Override with ANTHROPIC_MAX_TOKENS / per-model Ollama options if you need more.
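The context-window override is an Ollama-side change. One way to do it, sketched below — the llama3.3:70b-32k variant name is made up, and num_ctx is Ollama's context-length parameter (bigger contexts cost VRAM and prefill time):

```shell
# Bake a larger context into a named model variant (Ollama Modelfile syntax).
# 32768 is an example value; size it to what the hardware can actually hold.
cat > Modelfile <<'EOF'
FROM llama3.3:70b
PARAMETER num_ctx 32768
EOF
# Then, on the machine running Ollama:
#   ollama create llama3.3:70b-32k -f Modelfile
# and point ANTHROPIC_MODEL at llama3.3:70b-32k.
```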

Aider

Aider is a battle-tested OSS coding CLI. It speaks OpenAI natively — no proxy needed.

pip install aider-chat
export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_API_BASE="https://api.chris.hellotopia.io/v1"
aider --model openai/llama3.3:70b

Use --model openai/coder/qwen2.5-coder:14b for faster turns on smaller tasks. The openai/ prefix tells Aider to route via the OpenAI adapter (required for custom base URLs).

Aider has explicit support for non-hosted models via --edit-format whole or --edit-format diff-fenced if the model struggles with Aider's default unified-diff format:

aider --model openai/llama3.3:70b --edit-format diff-fenced
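To avoid retyping those flags every session, the same choices can live in a .aider.conf.yml at the repo root (aider also reads one from the home directory). Key names mirror aider's CLI flags with the leading dashes dropped; the model ID is the gateway catalog name from above:

```yaml
# .aider.conf.yml — per-repo aider defaults
model: openai/llama3.3:70b
edit-format: diff-fenced
```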

OpenAI Codex CLI

Codex CLI is OpenAI's terminal-based coding agent. It accepts custom OpenAI-compatible endpoints:

npm install -g @openai/codex
export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_BASE_URL="https://api.chris.hellotopia.io/v1"
codex --model llama3.3:70b
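Recent Codex CLI releases can also persist this in ~/.codex/config.toml via a custom model provider. A sketch, assuming the model_providers schema from Codex's config documentation — the provider id "gateway" is arbitrary, and it's worth checking your installed version's docs for the exact key names:

```toml
# ~/.codex/config.toml
model = "llama3.3:70b"
model_provider = "gateway"

[model_providers.gateway]
name = "Local gateway"
base_url = "https://api.chris.hellotopia.io/v1"
env_key = "OPENAI_API_KEY"
```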

Cline / Roo Code (VS Code)

Cline (formerly Claude Dev) and Roo Code are VS Code extensions that run a Claude-style agent inside the editor. Both support OpenAI-compatible providers natively — no proxy needed.

Configure in the extension settings:

  • Provider: OpenAI Compatible
  • Base URL: https://api.chris.hellotopia.io/v1
  • API Key: sk-your-key-here
  • Model ID: llama3.3:70b (or any gateway model)

For agentic coding tasks, pick a model with strong tool-use training. llama3.3:70b, qwen3-next:80b, and gpt-oss:20b are the strongest options here.

Goose

Goose is Block's open-source agent. Configure a custom OpenAI provider via ~/.config/goose/config.yaml:

GOOSE_PROVIDER: openai
OPENAI_HOST: https://api.chris.hellotopia.io
OPENAI_BASE_PATH: /v1
OPENAI_API_KEY: sk-your-key-here
GOOSE_MODEL: llama3.3:70b

Tips that apply to all of them

  • Keep context tight. Local models degrade fast past 8–12K tokens of context. Agents that love to read everything in sight will pay in latency and quality.
  • Expect retries. Tool-call formatting is the #1 failure mode. If an agent stalls, interrupt and rephrase.
  • Pick the right model per task. Use coder/qwen2.5-coder:7b for snappy edits, llama3.3:70b or qwen3-next:80b for reasoning-heavy work.
  • Watch the cold-load cost. Switching between big models forces reloads. Pick one main model per session and stick with it.