Can I Run cagent with Local AI Models? Docker Model Runner, Ollama, and Beyond
If you've been experimenting with Docker's cagent and loving the multi-agent YAML magic, you've probably wondered: Can I skip the cloud API keys and run this whole thing locally?
The short answer is yes - and you have options. A lot of developers reach for Ollama first since it's what they already know. And it works. But there's actually a cleaner path that most people don't know about yet: Docker Model Runner (DMR), which is built right into Docker Desktop and has first-class cagent support.
In this post, I'll walk through both approaches - starting with Ollama (since that's probably what brought you here), then showing you why Docker Model Runner might be the better choice for your agent workflows.
Why Run Models Locally?
cagent supports cloud providers like OpenAI, Anthropic, and Google out of the box. That's great for production, but sometimes you want:
- Zero cost - no token charges, no billing surprises
- Full privacy - your prompts never leave your machine
- Offline capability - works without internet once the model is downloaded
- Experimentation freedom - iterate on agent architectures without watching your usage dashboard
Let's look at how to get there.
Option 1: Using Ollama with cagent
If you're already running Ollama, you can connect it to cagent since Ollama exposes an OpenAI-compatible API. Here's how.
Pull a Model
```bash
ollama pull llama3.1:8b
```
Start Ollama
```bash
ollama serve
```
Keep this terminal open - Ollama needs to stay running.
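If you want to confirm the OpenAI-compatible endpoint is actually reachable before wiring up cagent, a quick curl against Ollama's `/v1/models` route works (note the explicit IPv4 address; more on that below):

```bash
# Should return a JSON list of the models Ollama has available locally
curl http://127.0.0.1:11434/v1/models
```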
Create the cagent YAML
version: "2"
models:
local_model:
provider: openai
model: llama3.1:8b
base_url: http://127.0.0.1:11434/v1
agents:
root:
model: local_model
description: A helpful AI assistant
instruction: |
You are a knowledgeable assistant that helps users with various tasks.
Be helpful, accurate, and concise in your responses.
A few things to note:
- `provider: openai` — This isn't a mistake. Ollama's API is OpenAI-compatible, so we tell cagent to use the OpenAI provider and redirect the traffic to Ollama.
- `model: llama3.1:8b` — Must match exactly what `ollama list` shows.
- `base_url: http://127.0.0.1:11434/v1` — This redirects all API calls to Ollama instead of OpenAI's servers.
Set a Dummy API Key
cagent validates that OPENAI_API_KEY is set when using the OpenAI provider. Since Ollama doesn't check keys, just set a placeholder:
```bash
export OPENAI_API_KEY=ollama
```
Run It
```bash
cagent run cagent.yaml
```
And it works. Your local Llama 3.1 is now responding through cagent.
The Gotchas I Hit
IPv6 vs IPv4: I initially used localhost in the base_url and got this:
```
error receiving from stream: Post "http://localhost:11434/v1/chat/completions": dial tcp [::1]:11434: connect: connection refused
```
That [::1] is IPv6 localhost. On macOS, localhost can resolve to IPv6 first, but Ollama only listens on IPv4. Switching to 127.0.0.1 fixed it immediately.
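If you hit the same thing, a quick way to confirm the diagnosis is to curl both loopback addresses and see which one answers:

```bash
# IPv4 loopback: should return JSON if Ollama is serving
curl -s http://127.0.0.1:11434/v1/models

# IPv6 loopback: "connection refused" here means Ollama isn't listening on IPv6
curl -sg http://[::1]:11434/v1/models
```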
Separate terminal required: You need Ollama running in one terminal and cagent in another. If Ollama isn't actively serving, you'll get connection refused errors.
Dummy API key dance: Having to export a fake API key just to satisfy a validation check isn't ideal. It works, but it's a workaround.
This all functions perfectly fine. But as I was setting it up, I kept thinking — there has to be a cleaner way to do this within Docker itself. Turns out there is.
Option 2: Docker Model Runner — The Docker-Native Way
Docker Model Runner (DMR) is built directly into Docker Desktop. It runs local AI models with GPU acceleration, exposes an OpenAI-compatible API, and — here's the key part — has first-class support in cagent. No workarounds needed.
Enable Docker Model Runner
If you're on Docker Desktop 4.40+, go to:
Settings → AI → Enable Docker Model Runner
Or from the command line:
```bash
docker desktop enable model-runner
```
Verify it's ready:
```bash
docker model version
```
Pull a Model
```bash
docker model pull ai/llama3.1:8B-Q4_0
```
Test it:
```bash
docker model run ai/llama3.1:8B-Q4_0 "Hello, how are you?"
```
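Because DMR speaks the OpenAI API, you can also hit the endpoint directly from the host. This is a sketch assuming host-side TCP access is enabled in Settings → AI on the default port 12434; the exact path can vary slightly between Docker Desktop versions:

```bash
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/llama3.1:8B-Q4_0",
        "messages": [{"role": "user", "content": "Hello, how are you?"}]
      }'
```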
Create the cagent YAML
Here's where you see the difference:
version: "2"
agents:
root:
model: dmr/ai/llama3.1:8B-Q4_0
description: A helpful AI assistant
instruction: |
You are a knowledgeable assistant that helps users with various tasks.
Be helpful, accurate, and concise in your responses.
That's the entire file. Compare that to the Ollama version — no models block, no base_url, no provider override, no dummy API key. Just prefix the model name with dmr/ and cagent handles the rest.
Run It
```bash
cagent run cagent.yaml
```
No export OPENAI_API_KEY needed. No separate server process to manage. If the model isn't pulled yet, cagent even prompts you:
```
Model not found locally. Do you want to pull it now? ([y]es/[n]o)
```
It just works.
Side-by-Side: Ollama vs Docker Model Runner
Let me lay out the differences for anyone deciding between the two:
| | Ollama | Docker Model Runner |
|---|---|---|
| Setup | Install separately, run `ollama serve` | Built into Docker Desktop — just enable it |
| cagent integration | Workaround via `provider: openai` + `base_url` | Native `dmr/` prefix — first-class support |
| API key | Dummy key required (`export OPENAI_API_KEY=ollama`) | No API key needed at all |
| Model source | Ollama model library | Docker Hub + any OCI registry + Hugging Face |
| GPU acceleration | Apple Silicon, NVIDIA | Apple Silicon, NVIDIA (+ vLLM engine on Linux) |
| Docker Compose | Manual service configuration | Native `provider: model` support in Compose (see the sketch below) |
| Sharing models | Ollama-specific format | Standard OCI artifacts — same workflow as container images |
| Requires separate process | Yes (`ollama serve` must be running) | No — managed by Docker Desktop |
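Since that Compose row is worth seeing in practice, here's a minimal sketch of what the integration can look like, assuming the `provider: type: model` service syntax available in recent Docker Compose and Docker Desktop releases (the service and image names are hypothetical):

```yaml
services:
  llm:
    provider:
      type: model       # Compose delegates this service to Docker Model Runner
      options:
        model: ai/llama3.1:8B-Q4_0

  app:
    image: my-agent-app  # hypothetical application image
    depends_on:
      - llm              # Compose can inject the model's endpoint into this service via environment variables
```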
If you're already deep in the Ollama ecosystem and just want to wire it to cagent quickly, the OpenAI provider workaround gets the job done. But if you're building within Docker — and especially if you're using Docker Compose, Docker Hub, or working in a team — DMR is the cleaner, more integrated path.
Going Further with Docker Model Runner
Once you're running with DMR, you can take advantage of features that go well beyond basic chat.
Tuning Model Parameters
version: "2"
models:
local-llama:
provider: dmr
model: ai/llama3.1:8B-Q4_0
temperature: 0.7
max_tokens: 8192
agents:
root:
model: local-llama
description: A helpful AI assistant
instruction: |
You are a knowledgeable assistant that helps users with various tasks.
You can also configure context size from the CLI:
```bash
docker model configure --context-size 8192 ai/llama3.1:8B-Q4_0
```
Adding MCP Tools
Give your local agent superpowers with MCP tools:
version: "2"
agents:
root:
model: dmr/ai/llama3.1:8B-Q4_0
description: A helpful assistant with web search
instruction: |
You are a knowledgeable assistant.
Use web search when you need current information.
Write your findings to disk.
toolsets:
- type: mcp
ref: docker:duckduckgo
- type: filesystem
Multi-Agent Teams — Fully Local
Run an entire agent team on your machine:
version: "2"
agents:
root:
model: dmr/ai/llama3.1:8B-Q4_0
description: Task coordinator
instruction: |
You coordinate tasks between specialist agents.
Route research questions to the researcher and
writing tasks to the writer.
sub_agents: [researcher, writer]
researcher:
model: dmr/ai/llama3.1:8B-Q4_0
description: Research specialist
instruction: |
You research topics thoroughly and provide
detailed findings with key facts.
toolsets:
- type: mcp
ref: docker:duckduckgo
writer:
model: dmr/ai/llama3.1:8B-Q4_0
description: Content writer
instruction: |
You write clear, engaging content based on
research provided to you.
toolsets:
- type: filesystem
Three agents, each with their own role and tools — all running locally. No API keys, no per-token costs, no data leaving your machine.
Mix Local and Cloud Models
You can even combine DMR with cloud providers in the same agent team:
version: "2"
agents:
root:
model: anthropic/claude-sonnet-4-5
instruction: |
Coordinate the team. Delegate research to the researcher
and writing to the writer.
sub_agents: [researcher, writer]
researcher:
model: dmr/ai/qwen3
description: Local research agent
toolsets:
- type: mcp
ref: docker:duckduckgo
writer:
model: dmr/ai/llama3.1:8B-Q4_0
description: Local writing agent
toolsets:
- type: filesystem
Use a cloud model for the coordinator that needs maximum intelligence, and local models for the workers that handle volume. Best of both worlds.
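One practical note on the mixed setup, assuming cagent reads the standard provider environment variables: the cloud coordinator still needs its key, while the DMR workers need none.

```bash
# Only the Anthropic coordinator needs a key; the local DMR agents run without one
export ANTHROPIC_API_KEY=sk-ant-...   # placeholder value
cagent run team.yaml                  # hypothetical filename for the config above
```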
Local RAG Pipelines
DMR supports embeddings too, so you can build full Retrieval-Augmented Generation locally:
version: "2"
rag:
my_docs:
docs: [./documents]
strategies:
- type: chunked-embeddings
embedding_model: dmr/ai/embeddinggemma
threshold: 0.5
chunking:
size: 1000
overlap: 100
results:
limit: 5
agents:
root:
model: dmr/ai/llama3.1:8B-Q4_0
instruction: |
You are an assistant with access to an internal knowledge base.
Answer questions using the provided documents.
rag: [my_docs]
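One thing to remember before running it: pull the embedding model the same way you pulled the chat model. cagent can prompt you to fetch missing models, but pulling ahead of time avoids a pause on first run.

```bash
docker model pull ai/embeddinggemma
```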
Your documents, your model, your machine. No data ever leaves your infrastructure.
Choosing the Right Model
Here's a quick reference for model selection:
| Model | Size | Best For |
|---|---|---|
| `ai/smollm2` | ~1.5 GB | Quick tests, lightweight tasks |
| `ai/llama3.1:8B-Q4_0` | ~4.7 GB | Good reasoning, decent tool use (our pick) |
| `ai/qwen3` | ~5 GB | Strong tool calling, great for MCP integrations |
| `ai/gemma3:12B` | ~8 GB | Solid all-rounder from Google |
| `ai/qwen3:14B` | ~9 GB | Excellent agentic capabilities (needs 16GB+ RAM) |
Match the model to your hardware. I initially tried a 20B parameter model on a machine with 13GB of total memory. It technically loaded (24/25 layers offloaded to GPU), but performance was painful. A fast 8B model is far more useful than a crawling 20B one.
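If you're not sure what's already on disk, the model CLI can tell you, assuming the standard `ls` and `rm` subcommands in your Docker Desktop version:

```bash
# List pulled models and their sizes
docker model ls

# Remove a model you no longer need to free up disk space
docker model rm ai/gemma3:12B
```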
Wrapping Up
If you searched for "cagent Ollama" and landed here, hopefully you got what you came for — it works, and the setup is straightforward once you know the tricks around 127.0.0.1 and the dummy API key.
But if you're building within the Docker ecosystem, I'd encourage you to give Docker Model Runner a serious look. The integration with cagent is native, the setup is simpler, and you get benefits like OCI-based model distribution, Docker Compose support, and no extra processes to manage. It's what we use internally and what we'd recommend for anyone building agent workflows with Docker.
Five minutes from now, you could have a fully local multi-agent team running on your machine. Give it a try.