Can I Run cagent with Local AI Models — Docker Model Runner, Ollama and Beyond

If you've been experimenting with Docker's cagent and loving the multi-agent YAML magic, you've probably wondered: Can I skip the cloud API keys and run this whole thing locally?

The short answer is yes - and you have options. A lot of developers reach for Ollama first since it's what they already know. And it works. But there's actually a cleaner path that most people don't know about yet: Docker Model Runner (DMR), which is built right into Docker Desktop and has first-class cagent support.

In this post, I'll walk through both approaches - starting with Ollama (since that's probably what brought you here), then showing you why Docker Model Runner might be the better choice for your agent workflows.

Why Run Models Locally?

cagent supports cloud providers like OpenAI, Anthropic, and Google out of the box. That's great for production, but sometimes you want:

  • Zero cost - no token charges, no billing surprises
  • Full privacy - your prompts never leave your machine
  • Offline capability - works without internet once the model is downloaded
  • Experimentation freedom - iterate on agent architectures without watching your usage dashboard

Let's look at how to get there.


Option 1: Using Ollama with cagent

If you're already running Ollama, you can connect it to cagent since Ollama exposes an OpenAI-compatible API. Here's how.

Pull a Model

ollama pull llama3.1:8b

Start Ollama

ollama serve

Keep this terminal open - Ollama needs to stay running.
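
From another terminal, you can confirm the server is up and speaking the OpenAI-compatible API before involving cagent at all (assuming Ollama's default port of 11434):

curl http://127.0.0.1:11434/v1/models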

Create the cagent YAML

version: "2"

models:
  local_model:
    provider: openai
    model: llama3.1:8b
    base_url: http://127.0.0.1:11434/v1

agents:
  root:
    model: local_model
    description: A helpful AI assistant
    instruction: |
      You are a knowledgeable assistant that helps users with various tasks.
      Be helpful, accurate, and concise in your responses.

A few things to note:

  • provider: openai — This isn't a mistake. Ollama's API is OpenAI-compatible, so we tell cagent to use the OpenAI provider and redirect the traffic to Ollama.
  • model: llama3.1:8b — Must match exactly what ollama list shows.
  • base_url: http://127.0.0.1:11434/v1 — This redirects all API calls to Ollama instead of OpenAI's servers.

Set a Dummy API Key

cagent validates that OPENAI_API_KEY is set when using the OpenAI provider. Since Ollama doesn't check keys, just set a placeholder:

export OPENAI_API_KEY=ollama

Run It

cagent run cagent.yaml

And it works. Your local Llama 3.1 is now responding through cagent.

The Gotchas I Hit

IPv6 vs IPv4: I initially used localhost in the base_url and got this:

error receiving from stream: Post "http://localhost:11434/v1/chat/completions": dial
tcp [::1]:11434: connect: connection refused

That [::1] is IPv6 localhost. On macOS, localhost can resolve to IPv6 first, but Ollama only listens on IPv4. Switching to 127.0.0.1 fixed it immediately.
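
If you want to see exactly which address Ollama is bound to, a quick check on macOS or Linux (assuming lsof is available) will show the listener:

lsof -iTCP:11434 -sTCP:LISTEN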

Separate terminal required: You need Ollama running in one terminal and cagent in another. If Ollama isn't actively serving, you'll get connection refused errors.

Dummy API key dance: Having to export a fake API key just to satisfy a validation check isn't ideal. It works, but it's a workaround.
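
For reference, here's the whole Ollama path in one place, assuming the YAML above is saved as cagent.yaml:

# terminal 1: keep the Ollama server running
ollama serve

# terminal 2: pull the model, satisfy the key check, run the agent
ollama pull llama3.1:8b
export OPENAI_API_KEY=ollama
cagent run cagent.yaml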

This all works perfectly well. But as I was setting it up, I kept thinking there had to be a cleaner way to do this within Docker itself. Turns out there is.


Option 2: Docker Model Runner — The Docker-Native Way

Docker Model Runner (DMR) is built directly into Docker Desktop. It runs local AI models with GPU acceleration, exposes an OpenAI-compatible API, and — here's the key part — has first-class support in cagent. No workarounds needed.

Enable Docker Model Runner

If you're on Docker Desktop 4.40+, go to:

Settings → AI → Enable Docker Model Runner

Or from the command line:

docker desktop enable model-runner

Verify it's ready:

docker model version
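
If that works but you want an extra sanity check that the runner itself is active, recent Docker Desktop builds also include a status subcommand (availability may vary by version):

docker model status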

Pull a Model

docker model pull ai/llama3.1:8B-Q4_0

Test it:

docker model run ai/llama3.1:8B-Q4_0 "Hello, how are you?"
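
To see which models are already downloaded locally, the DMR counterpart of ollama list is (exact output columns may vary by version):

docker model list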

Create the cagent YAML

Here's where you see the difference:

version: "2"

agents:
  root:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: A helpful AI assistant
    instruction: |
      You are a knowledgeable assistant that helps users with various tasks.
      Be helpful, accurate, and concise in your responses.

That's the entire file. Compare that to the Ollama version — no models block, no base_url, no provider override, no dummy API key. Just prefix the model name with dmr/ and cagent handles the rest.

Run It

cagent run cagent.yaml

No export OPENAI_API_KEY needed. No separate server process to manage. If the model isn't pulled yet, cagent even prompts you:

Model not found locally. Do you want to pull it now? ([y]es/[n]o)

It just works.


Side-by-Side: Ollama vs Docker Model Runner

Let me lay out the differences for anyone deciding between the two:

|  | Ollama | Docker Model Runner |
| --- | --- | --- |
| Setup | Install separately, run ollama serve | Built into Docker Desktop — just enable it |
| cagent integration | Workaround via provider: openai + base_url | Native dmr/ prefix — first-class support |
| API key | Dummy key required (export OPENAI_API_KEY=ollama) | No API key needed at all |
| Model source | Ollama model library | Docker Hub + any OCI registry + Hugging Face |
| GPU acceleration | Apple Silicon, NVIDIA | Apple Silicon, NVIDIA (+ vLLM engine on Linux) |
| Docker Compose | Manual service configuration | Native provider: model support in Compose |
| Sharing models | Ollama-specific format | Standard OCI artifacts — same workflow as container images |
| Requires separate process | Yes (ollama serve must be running) | No — managed by Docker Desktop |

If you're already deep in the Ollama ecosystem and just want to wire it to cagent quickly, the OpenAI provider workaround gets the job done. But if you're building within Docker — and especially if you're using Docker Compose, Docker Hub, or working in a team — DMR is the cleaner, more integrated path.
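
If the Compose row in that table is what sways you, this is roughly what the integration looks like. A minimal sketch based on Docker's model provider syntax for Compose; the app service and image name are placeholders, and the exact options may differ across Compose versions:

services:
  app:
    image: my-agent-app        # placeholder for your application image
    depends_on:
      - llm

  llm:
    provider:
      type: model              # Compose delegates this service to Docker Model Runner
      options:
        model: ai/llama3.1:8B-Q4_0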


Going Further with Docker Model Runner

Once you're running with DMR, you can take advantage of features that go well beyond basic chat.

Tuning Model Parameters

version: "2"

models:
  local-llama:
    provider: dmr
    model: ai/llama3.1:8B-Q4_0
    temperature: 0.7
    max_tokens: 8192

agents:
  root:
    model: local-llama
    description: A helpful AI assistant
    instruction: |
      You are a knowledgeable assistant that helps users with various tasks.

You can also configure context size from the CLI:

docker model configure --context-size 8192 ai/llama3.1:8B-Q4_0

Adding MCP Tools

Give your local agent superpowers with MCP tools:

version: "2"

agents:
  root:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: A helpful assistant with web search
    instruction: |
      You are a knowledgeable assistant.
      Use web search when you need current information.
      Write your findings to disk.
    toolsets:
      - type: mcp
        ref: docker:duckduckgo
      - type: filesystem

Multi-Agent Teams — Fully Local

Run an entire agent team on your machine:

version: "2"

agents:
  root:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: Task coordinator
    instruction: |
      You coordinate tasks between specialist agents.
      Route research questions to the researcher and
      writing tasks to the writer.
    sub_agents: [researcher, writer]

  researcher:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: Research specialist
    instruction: |
      You research topics thoroughly and provide
      detailed findings with key facts.
    toolsets:
      - type: mcp
        ref: docker:duckduckgo

  writer:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: Content writer
    instruction: |
      You write clear, engaging content based on
      research provided to you.
    toolsets:
      - type: filesystem

Three agents, each with their own role and tools — all running locally. No API keys, no per-token costs, no data leaving your machine.

Mix Local and Cloud Models

You can even combine DMR with cloud providers in the same agent team:

version: "2"

agents:
  root:
    model: anthropic/claude-sonnet-4-5
    instruction: |
      Coordinate the team. Delegate research to the researcher
      and writing to the writer.
    sub_agents: [researcher, writer]

  researcher:
    model: dmr/ai/qwen3
    description: Local research agent
    toolsets:
      - type: mcp
        ref: docker:duckduckgo

  writer:
    model: dmr/ai/llama3.1:8B-Q4_0
    description: Local writing agent
    toolsets:
      - type: filesystem

Use a cloud model for the coordinator that needs maximum intelligence, and local models for the workers that handle volume. Best of both worlds.
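
One practical note on the mixed setup: the local workers need no credentials, but the cloud coordinator still does. Assuming the Anthropic provider reads the standard environment variable, that's just:

export ANTHROPIC_API_KEY=<your-key>
cagent run cagent.yaml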

Local RAG Pipelines

DMR supports embeddings too, so you can build full Retrieval-Augmented Generation locally:

version: "2"

rag:
  my_docs:
    docs: [./documents]
    strategies:
      - type: chunked-embeddings
        embedding_model: dmr/ai/embeddinggemma
        threshold: 0.5
        chunking:
          size: 1000
          overlap: 100
    results:
      limit: 5

agents:
  root:
    model: dmr/ai/llama3.1:8B-Q4_0
    instruction: |
      You are an assistant with access to an internal knowledge base.
      Answer questions using the provided documents.
    rag: [my_docs]

Your documents, your model, your machine. No data ever leaves your infrastructure.
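
The embedding model follows the same pull workflow as the chat models (or let cagent prompt you, as before). Assuming the tag used above:

docker model pull ai/embeddinggemma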


Choosing the Right Model

Here's a quick reference for model selection:

| Model | Size | Best For |
| --- | --- | --- |
| ai/smollm2 | ~1.5 GB | Quick tests, lightweight tasks |
| ai/llama3.1:8B-Q4_0 | ~4.7 GB | Good reasoning, decent tool use (our pick) |
| ai/qwen3 | ~5 GB | Strong tool calling, great for MCP integrations |
| ai/gemma3:12B | ~8 GB | Solid all-rounder from Google |
| ai/qwen3:14B | ~9 GB | Excellent agentic capabilities (needs 16GB+ RAM) |

Match the model to your hardware. I initially tried a 20B parameter model on a machine with 13GB of total memory. It technically loaded (24/25 layers offloaded to GPU), but performance was painful. A fast 8B model is far more useful than a crawling 20B one.

Wrapping Up

If you searched for "cagent Ollama" and landed here, hopefully you got what you came for — it works, and the setup is straightforward once you know the tricks around 127.0.0.1 and the dummy API key.

But if you're building within the Docker ecosystem, I'd encourage you to give Docker Model Runner a serious look. The integration with cagent is native, the setup is simpler, and you get benefits like OCI-based model distribution, Docker Compose support, and no extra processes to manage. It's what we use internally and what we'd recommend for anyone building agent workflows with Docker.

Five minutes from now, you could have a fully local multi-agent team running on your machine. Give it a try.
