Setting Up Local RAG with Docker Model Runner and cagent

Want to run AI-powered code search without sending your proprietary code to the cloud? This tutorial shows you how to set up Retrieval-Augmented Generation (RAG) using Docker Model Runner and cagent.

With Docker Model Runner (DMR) and cagent, you'll run AI agents with semantic search capabilities entirely on your local machine: your code stays private, you avoid API costs, and everything works offline.

Prerequisites

  • Docker Desktop 4.49+ with Docker Model Runner enabled
  • macOS, Windows, or Linux
  • cagent v1.19.5+

Step 1: Enable Docker Model Runner

On Docker Desktop (macOS/Windows)

  1. Open Docker Desktop
  2. Go to Settings → AI
  3. Check Enable Docker Model Runner
  4. Click Apply & Restart

Verify Installation

docker model version

Expected output:

Docker Model Runner version v1.0.6
Docker Engine Kind: Docker Desktop
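
You can also confirm that the runner itself is active (the exact wording of the status message may vary across versions):

docker model status

Expected output:

Docker Model Runner is running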

Step 2: Install cagent

Using Homebrew (macOS)

brew install cagent

Manual Installation (macOS ARM64)

wget https://github.com/docker/cagent/releases/download/v1.19.5/cagent-darwin-arm64
chmod +x cagent-darwin-arm64
sudo mv cagent-darwin-arm64 /usr/local/bin/cagent

Verify Installation

cagent version

Expected output:

cagent v1.19.5

Step 3: Pull Required Models

A complete RAG setup uses three models: a chat model to answer questions, an embedding model for semantic search, and an optional reranker to improve retrieval accuracy:

3.1 Pull the Chat Model

docker model pull ai/llama3.2

Or for a more powerful model:

docker model pull ai/qwen3

3.2 Pull the Embedding Model

docker model pull ai/embeddinggemma

3.3 (Optional) Pull the Reranker Model

docker model pull hf.co/ggml-org/qwen3-reranker-0.6b-q8_0-gguf

Verify Models

docker model list

Expected output:

MODEL NAME      PARAMETERS  QUANTIZATION    ARCHITECTURE     MODEL ID      CREATED        SIZE
embeddinggemma  302.86 M    Q8_0            gemma-embedding  b6635ddcd4cb  4 months ago   307.13 MiB
llama3.2        3.21 B      IQ2_XXS/Q4_K_M  llama            436bb282b419  10 months ago  1.87 GiB
qwen3           8.19 B      IQ2_XXS/Q4_K_M  qwen3            79fa56c07429  8 months ago   4.68 GiB

Step 4: Create Project Structure

Create a test project with sample files to index:

mkdir -p rag-dmr-test/src rag-dmr-test/docs
cd rag-dmr-test

4.1 Create Sample Source Code

cat > src/app.py << 'EOF'
"""
Main application file for the RAG Demo App.
This app processes customer orders and calculates shipping costs.
"""

def calculate_shipping(weight, distance):
    """Calculate shipping cost based on weight and distance.
    
    Args:
        weight: Package weight in pounds
        distance: Shipping distance in miles
    
    Returns:
        Total shipping cost in dollars
    """
    base_rate = 5.99
    weight_rate = 0.50 * weight
    distance_rate = 0.10 * distance
    return base_rate + weight_rate + distance_rate

def process_order(order_id, items):
    """Process a customer order and return total.
    
    Args:
        order_id: Unique order identifier
        items: List of items with price and quantity
    
    Returns:
        Dictionary with order_id and calculated total
    """
    total = sum(item['price'] * item['quantity'] for item in items)
    return {'order_id': order_id, 'total': total}

def apply_discount(total, discount_code):
    """Apply discount code to order total.
    
    Supported codes:
        - SAVE10: 10% off
        - SAVE20: 20% off
        - FREESHIP: Free shipping ($5.99 off)
    """
    discounts = {
        'SAVE10': 0.10,
        'SAVE20': 0.20,
        'FREESHIP': 5.99
    }
    if discount_code in discounts:
        if discount_code == 'FREESHIP':
            return total - discounts[discount_code]
        return total * (1 - discounts[discount_code])
    return total
EOF
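
Before indexing anything, you can sanity-check the sample code with a quick one-liner; the inputs are hypothetical (a 10 lb package shipped 50 miles):

python3 -c "import sys; sys.path.insert(0, 'src'); from app import calculate_shipping; print(calculate_shipping(10, 50))"

This should print 15.99: the 5.99 base rate, plus 0.50 × 10 for weight, plus 0.10 × 50 for distance.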

4.2 Create Sample Documentation

cat > docs/README.md << 'EOF'
# RAG Demo Project

This project demonstrates local RAG with Docker Model Runner.

## Features

- Order processing system
- Shipping cost calculator
- Discount code support
- Built with Python

## Installation

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Run the app: `python src/app.py`

## Configuration

Set the following environment variables before running:

- `SHIPPING_API_KEY`: API key for shipping provider
- `DATABASE_URL`: Connection string for the database
- `DEBUG`: Set to "true" for debug mode

## API Endpoints

- `POST /orders`: Create a new order
- `GET /orders/{id}`: Get order details
- `POST /shipping/calculate`: Calculate shipping cost

## Discount Codes

- `SAVE10`: 10% off your order
- `SAVE20`: 20% off your order
- `FREESHIP`: Free shipping on any order
EOF
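
At this point the project should contain exactly the two files that will be indexed (listing order may vary):

find . -type f

Expected output:

./src/app.py
./docs/README.md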

Step 5: Create the Agent Configuration

Create agent.yaml in your project root:

cat > agent.yaml << 'EOF'
agents:
  root:
    model: dmr/ai/llama3.2
    instruction: |
      You are a helpful coding assistant.
      Use the RAG tool to search the codebase before answering questions.
    rag: [codebase]
    toolsets:
      - type: filesystem
      - type: shell

rag:
  codebase:
    docs: [./src, ./docs]
    strategies:
      - type: chunked-embeddings
        embedding_model: dmr/ai/embeddinggemma
        vector_dimensions: 768
        database: ./code.db
        chunking:
          size: 1000
          overlap: 100
      - type: bm25
        database: ./bm25.db
EOF
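
If you pulled the larger qwen3 model in Step 3 and prefer it over Llama 3.2, point the agent at it by editing the model field; the dmr/ prefix stays the same:

agents:
  root:
    model: dmr/ai/qwen3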

Configuration Explained

  • `model: dmr/ai/llama3.2`: use Llama 3.2 via Docker Model Runner
  • `rag: [codebase]`: enable RAG with the "codebase" source
  • `docs: [./src, ./docs]`: directories to index
  • `embedding_model: dmr/ai/embeddinggemma`: local embedding model
  • `vector_dimensions: 768`: embedding vector size for embeddinggemma
  • `chunked-embeddings`: semantic search strategy
  • `bm25`: keyword search strategy (combined with embeddings for hybrid search)
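
The two database files referenced in the config (code.db and bm25.db) are generated during indexing, so if this directory is a git repository you may want to keep them out of version control:

cat >> .gitignore << 'EOF'
code.db
bm25.db
EOF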

Step 6: Pre-warm the Models

Models load into memory on first use, which takes time. Pre-warm them before your first query:

# Warm up the chat model
curl http://localhost:12434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/llama3.2", "messages": [{"role": "user", "content": "hi"}]}'

Expected output:

{
  "choices": [{
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "How can I assist you today?"
    }
  }],
  "model": "model.gguf",
  "usage": {
    "completion_tokens": 8,
    "prompt_tokens": 36,
    "total_tokens": 44
  }
}
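
You can warm the embedding model the same way. This sketch assumes DMR also exposes the OpenAI-compatible /v1/embeddings endpoint alongside the chat endpoint used above:

# Warm up the embedding model (endpoint path is an assumption)
curl http://localhost:12434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/embeddinggemma", "input": "warm up"}'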

Step 7: Run the Agent

cagent run agent.yaml

First run behavior:

  1. cagent indexes your ./src and ./docs directories
  2. Creates code.db (vector embeddings) and bm25.db (keyword index)
  3. Loads the chat model
  4. Opens an interactive chat interface

Expected output:

┌─────────────────────────────────────────────────────────────┐
│ cagent                                                      │
├─────────────────────────────────────────────────────────────┤
│                                          Session            │
│                                          New session        │
│                                                             │
│                                          Agent              │
│                                          ▶ root             │
│                                          Provider: dmr      │
│                                          Model: model.gguf  │
│                                                             │
│                                          Tools              │
│                                          13 tools available │
├─────────────────────────────────────────────────────────────┤
│ Type your message here...                                   │
└─────────────────────────────────────────────────────────────┘

Step 8: Test RAG Queries

Try these questions to test your RAG setup:

Question 1: Code Understanding

What does the calculate_shipping function do?

Expected response: The agent searches your codebase and explains the function based on the actual code in src/app.py.

Question 2: Documentation Retrieval

How do I configure this project?

Expected response: The agent retrieves information from docs/README.md about environment variables and configuration.

Question 3: Feature Discovery

What discount codes are available?

Expected response: The agent finds both the code implementation and documentation about SAVE10, SAVE20, and FREESHIP codes.

Question 4: Cross-reference

How is the FREESHIP discount different from percentage discounts?

Expected response: The agent analyzes the apply_discount function and explains that FREESHIP subtracts a flat $5.99 while others apply percentages.
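
You can check the agent's answer against the code directly; the $100 order total here is a hypothetical value:

python3 -c "import sys; sys.path.insert(0, 'src'); from app import apply_discount; print(apply_discount(100, 'SAVE10'), apply_discount(100, 'FREESHIP'))"

This should print 90.0 and 94.01: SAVE10 multiplies the total by 0.90, while FREESHIP subtracts a flat 5.99.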



Summary

You now have a fully local RAG setup with:

  • Chat: Llama 3.2 or Qwen3 (answers questions)
  • Embeddings: embeddinggemma (semantic search)
  • BM25: built into cagent (keyword search)
  • Reranker: qwen3-reranker, optional (improves accuracy)
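
If you later change the files under ./src or ./docs and want a clean re-index, delete the generated databases and start the agent again; this assumes cagent re-creates any missing index files on startup, as it does on first run:

rm -f code.db bm25.db
cagent run agent.yaml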
