What is Docker Model Runner and what problem does it solve?

A GPU-accelerated LLM runner that enables developers to run AI models locally

Docker Model Runner is a new experimental feature introduced in Docker Desktop for Mac 4.40+ that provides a Docker-native experience for running Large Language Models (LLMs) locally. It seamlessly integrates with existing container tooling and workflows, bringing AI capabilities directly into the Docker ecosystem.


What's unique about Docker Model Runner?

Unlike the traditional approach of packaging a large AI model together with its runtime inside a container image (which makes deployment slow), Model Runner separates the model from the runtime, allowing for faster deployment.

What does it mean? Let's understand this in detail.

Traditional Docker images on Docker Hub are stored in a compressed format; when you run docker pull, the image is downloaded and then uncompressed onto your local disk, often roughly doubling your storage requirement.

AI models work fundamentally differently because they consist of mathematical weights - essentially numbers that don't benefit from compression. When you use docker model pull, the model files don't undergo compression or decompression because compressing numerical weights provides minimal benefit and can actually slow down the process.

This means no compression overhead during download, no duplicate storage of compressed and uncompressed versions, faster model loading and deployment, and significantly more efficient disk usage overall.

💡
With Docker Model Runner, the AI model DOES NOT run in a container. Instead, Docker Model Runner uses a host-installed inference server (llama.cpp for now) that runs natively on your Mac rather than containerizing the model.

Model Runner enables local LLM execution. It runs large language models (LLMs) directly on your machine rather than sending data to external API services. This means your data never leaves your infrastructure. All you need to do is pull the model from Docker Hub and start integrating it with your application.

Architecture of Docker Model Runner

Docker Model Runner implements a unique hybrid approach that combines Docker's orchestration capabilities with native host performance for AI inference. Let me break down the key architectural components and explain how they work together:

Core Architecture Components

Host Machine Layer (Mac with Apple Silicon)
The entire system runs on Apple Silicon Macs, taking advantage of the Metal API for GPU acceleration. This is the foundation that enables the performance benefits.

Docker Desktop 4.40+ Container
Docker Desktop provides the management layer and orchestration, but importantly, it doesn't containerize the actual AI inference workload. Instead, it manages the lifecycle and provides interfaces.

llama.cpp Inference Server (Host-Native)
This is the critical architectural decision: the inference server runs directly on the host machine, NOT inside a Docker container. This design choice enables:

  • Direct access to Apple's Metal API for GPU acceleration
  • Elimination of containerization overhead for compute-intensive operations
  • Native performance for model inference

Two-Tier Interface System

CLI Interface
The docker model commands (pull, run, ls, rm, status) provide a familiar Docker-like experience for model management. This maintains consistency with existing Docker workflows while introducing AI-specific operations.

API Interface
The API offers two types of endpoints (a short sketch of both follows the list below):

  • Model Management Endpoints: For administrative operations (listing, pulling, removing models)
  • OpenAI-compatible Endpoints: For inference requests, ensuring seamless integration with existing AI applications
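
To make the two endpoint types concrete, here is a minimal sketch using plain HTTP. The chat-completions path is the one used throughout this article; the models-list route is assumed from the OpenAI-compatible API surface (the OpenAI client examples later in this guide call it via models.list()), and the dedicated management routes for pulling and removing models are not shown here because the docker model CLI covers those operations.

import requests

# Base URL as seen from inside a container; from the host, use
# http://localhost:12434 once TCP support is enabled (see Step 7).
BASE = "http://model-runner.docker.internal/engines/llama.cpp/v1"

# List the models the runner currently knows about (OpenAI-compatible "models" route).
models = requests.get(f"{BASE}/models", timeout=10).json()
print([m["id"] for m in models.get("data", [])])

# Ask for a chat completion (OpenAI-compatible "chat/completions" route).
reply = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "ai/llama3.2:1B-Q8_0",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=60,
).json()
print(reply["choices"][0]["message"]["content"])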

Dual Connection Architecture

The architecture diagram shows two primary connection methods, each serving different use cases (a short sketch of picking the matching base URL follows the list):

1. Docker Socket Connection (/var/run/docker.sock)

  • Used by host applications and containers within the Docker network
  • Enables internal communication between Docker components
  • Connects to Docker Hub for model downloads
  • Ideal for containerized applications that need to access models

2. Host TCP Connection (port 12434)

  • Provides direct network access from external services
  • Allows non-containerized applications to connect
  • Enables integration with services outside the Docker ecosystem
  • Accessible via model-runner.docker.internal from containers or localhost:12434 from the host
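
In practice, the only thing that changes between the two connection methods is the base URL your client points at. Here is a minimal sketch; the /.dockerenv check is a common but imperfect heuristic for detecting whether the code is running inside a container.

import os

def model_runner_base_url() -> str:
    """Pick the Model Runner base URL for the current environment."""
    inside_container = os.path.exists("/.dockerenv")  # heuristic container check
    if inside_container:
        # Docker's internal DNS name, reachable from containers
        return "http://model-runner.docker.internal/engines/llama.cpp/v1/"
    # Host-side TCP access (requires "Enable host-side TCP support", see Step 7)
    return "http://localhost:12434/engines/llama.cpp/v1/"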

Model Storage and Execution Flow

Model Registry and Caching

  • Models are downloaded from Docker Hub as OCI artifacts
  • Cached locally on the host machine for quick access
  • The Model Registry tracks available and loaded models
  • No compression/decompression overhead during model loading

GPU Acceleration Pipeline

  • Direct access to Apple Silicon's integrated GPU
  • Utilizes Metal API for optimized inference performance
  • No virtualization layer between the inference engine and hardware
  • Real-time GPU utilization visible in Activity Monitor

Key Architectural Benefits

Performance Optimization
By running the inference server directly on the host, the architecture eliminates the performance penalty of containerization for compute-intensive AI workloads while maintaining Docker's benefits for orchestration and management.

Flexibility in Integration
The dual connection approach allows both containerized and non-containerized applications to access the same models, providing maximum flexibility for different deployment scenarios.

Resource Efficiency
Models are shared across all applications accessing the Model Runner, avoiding duplication and reducing memory footprint compared to running separate containerized inference servers.

Developer Experience
The familiar Docker CLI interface combined with OpenAI-compatible APIs means developers can integrate AI capabilities without learning new tools or significantly changing their existing workflows.

This architecture represents a thoughtful balance between Docker's container orchestration strengths and the performance requirements of AI inference workloads, creating a system that's both powerful and accessible to developers already familiar with Docker ecosystems.

Let's see it in action.

Getting Started with Model Runner

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3/M4) - Windows support coming soon
  • Docker Desktop 4.40+
  • Python 3.8+ installed on your system
  • Basic familiarity with Python and Docker

Step 1: Install and Configure Docker Desktop

1.1 Download Docker Desktop 4.40+

Download the latest version from Docker's official website.

1.2 Enable Docker Model Runner

Option A: Using CLI

docker desktop enable model-runner

Option B: Using Docker Dashboard

  1. Open Docker Desktop
  2. Go to Settings → Features in development
  3. Enable "Docker Model Runner"
  4. Optionally enable "Enable host-side TCP support" and set port (default: 12434)
  5. Click "Apply & Restart"

1.3 Verify Installation

docker model --help
docker model status

Expected output:

Docker Model Runner is running

Step 2: Download an AI Model

2.1 List Local Models

docker model ls

At this point the list will be empty; it only shows models you have already pulled.

2.2 Pull a Model from Docker Hub

docker model pull ai/llama3.2:1B-Q8_0

2.3 Verify Model Download

docker model ls

Expected output:

MODEL                PARAMETERS  QUANTIZATION  ARCHITECTURE  MODEL ID      CREATED       SIZE
ai/llama3.2:1B-Q8_0  1.24 B      Q8_0          llama         a15c3117eeeb  20 hours ago  1.22 GiB

Step 3: Set Up Python Environment

3.1 Create a Project Directory

mkdir docker-model-runner-python
cd docker-model-runner-python

3.2 Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
# venv\Scripts\activate  # On Windows

3.3 Install Required Python Packages

pip install openai requests python-dotenv

3.4 Create Requirements File

pip freeze > requirements.txt
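
The generated requirements.txt pins whatever is currently installed in your virtual environment, so the exact versions will differ; it should at least contain entries along these lines:

openai==<your version>
python-dotenv==<your version>
requests==<your version>
# ...plus their transitive dependencies (certifi, httpx, urllib3, and so on)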

Step 4: Basic Python Integration

4.1 Create a Simple Chat Script

Create basic_chat.py:

import requests
import json

def chat_with_model(message, model="ai/llama3.2:1B-Q8_0"):
    """
    Send a chat message to Docker Model Runner using direct HTTP requests
    """
    url = "http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions"
    
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": message}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }
    
    headers = {
        "Content-Type": "application/json"
    }
    
    try:
        response = requests.post(url, json=payload, headers=headers)
        response.raise_for_status()
        
        result = response.json()
        return result["choices"][0]["message"]["content"]
    
    except requests.exceptions.RequestException as e:
        return f"Error: {e}"

if __name__ == "__main__":
    message = input("Enter your message: ")
    response = chat_with_model(message)
    print(f"AI Response: {response}")

4.2 Test the Basic Script

python basic_chat.py

Step 5: Using OpenAI Python Client

5.1 Create OpenAI-Compatible Client

Create openai_client.py:

from openai import OpenAI
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

class DockerModelRunner:
    def __init__(self, base_url=None):
        """
        Initialize Docker Model Runner client using OpenAI SDK
        """
        # Fall back to BASE_URL from .env, then to the container-internal default
        base_url = base_url or os.getenv(
            "BASE_URL",
            "http://model-runner.docker.internal/engines/llama.cpp/v1/"
        )
        self.client = OpenAI(
            base_url=base_url,
            api_key="docker-model-runner"  # Dummy key, required by OpenAI client
        )
        self.model = os.getenv("MODEL", "ai/llama3.2:1B-Q8_0")
    
    def chat(self, message, system_message=None):
        """
        Send a chat completion request
        """
        messages = []
        
        if system_message:
            messages.append({"role": "system", "content": system_message})
        
        messages.append({"role": "user", "content": message})
        
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                max_tokens=150,
                temperature=0.7
            )
            
            return response.choices[0].message.content
        
        except Exception as e:
            return f"Error: {e}"
    
    def stream_chat(self, message):
        """
        Stream chat responses
        """
        try:
            stream = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": message}],
                max_tokens=150,
                temperature=0.7,
                stream=True
            )
            
            response_text = ""
            for chunk in stream:
                if chunk.choices[0].delta.content is not None:
                    content = chunk.choices[0].delta.content
                    response_text += content
                    print(content, end="", flush=True)
            
            print()  # New line after streaming
            return response_text
        
        except Exception as e:
            return f"Error: {e}"

if __name__ == "__main__":
    runner = DockerModelRunner()
    
    print("Docker Model Runner Chat (type 'quit' to exit)")
    print("-" * 50)
    
    while True:
        user_input = input("\nYou: ")
        
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break
        
        print("AI: ", end="")
        runner.stream_chat(user_input)

5.2 Create Environment File

Create .env:

MODEL=ai/llama3.2:1B-Q8_0
BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/

5.3 Test OpenAI Client

python openai_client.py

Step 6: Advanced Python Integration

6.1 Create a Flask Web Application

Create app.py:

from flask import Flask, request, jsonify, render_template_string
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

app = Flask(__name__)

# Initialize OpenAI client for Docker Model Runner
client = OpenAI(
    base_url=os.getenv("BASE_URL", "http://model-runner.docker.internal/engines/llama.cpp/v1/"),
    api_key="docker-model-runner"
)

MODEL = os.getenv("MODEL", "ai/llama3.2:1B-Q8_0")

@app.route('/')
def index():
    return render_template_string("""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Docker Model Runner Chat</title>
        <style>
            body { font-family: Arial, sans-serif; max-width: 800px; margin: 50px auto; padding: 20px; }
            .chat-container { border: 1px solid #ddd; height: 400px; overflow-y: scroll; padding: 10px; margin-bottom: 20px; }
            .message { margin: 10px 0; padding: 10px; border-radius: 5px; }
            .user { background-color: #e3f2fd; text-align: right; }
            .ai { background-color: #f5f5f5; }
            input[type="text"] { width: 70%; padding: 10px; }
            button { padding: 10px 20px; }
        </style>
    </head>
    <body>
        <h1>Docker Model Runner Chat</h1>
        <div id="chat-container" class="chat-container"></div>
        <input type="text" id="message-input" placeholder="Type your message..." onkeypress="if(event.key==='Enter') sendMessage()">
        <button onclick="sendMessage()">Send</button>
        
        <script>
            async function sendMessage() {
                const input = document.getElementById('message-input');
                const message = input.value.trim();
                if (!message) return;
                
                addMessage('You', message, 'user');
                input.value = '';
                
                try {
                    const response = await fetch('/chat', {
                        method: 'POST',
                        headers: { 'Content-Type': 'application/json' },
                        body: JSON.stringify({ message })
                    });
                    
                    const data = await response.json();
                    addMessage('AI', data.response, 'ai');
                } catch (error) {
                    addMessage('Error', 'Failed to get response', 'ai');
                }
            }
            
            function addMessage(sender, text, className) {
                const container = document.getElementById('chat-container');
                const div = document.createElement('div');
                div.className = `message ${className}`;
                div.innerHTML = `<strong>${sender}:</strong> ${text}`;
                container.appendChild(div);
                container.scrollTop = container.scrollHeight;
            }
        </script>
    </body>
    </html>
    """)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    message = data.get('message', '')
    
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": message}],
            max_tokens=150,
            temperature=0.7
        )
        
        ai_response = response.choices[0].message.content
        return jsonify({"response": ai_response})
    
    except Exception as e:
        return jsonify({"response": f"Error: {str(e)}"}), 500

if __name__ == '__main__':
    app.run(debug=True, port=5000)

6.2 Install Flask

pip install flask

6.3 Run the Flask Application

python app.py

Visit http://localhost:5000 in your browser.

Step 7: Using TCP Host Support (Alternative Connection Method)

7.1 Enable TCP Support

Using CLI:

docker desktop enable model-runner --tcp 12434

Or via Docker Dashboard: Enable "Enable host-side TCP support" and set port to 12434.

7.2 Create TCP Client Script

Create tcp_client.py:

from openai import OpenAI
import os

class DockerModelRunnerTCP:
    def __init__(self, host="localhost", port=12434):
        """
        Connect to Docker Model Runner via TCP
        """
        base_url = f"http://{host}:{port}/engines/llama.cpp/v1/"
        
        self.client = OpenAI(
            base_url=base_url,
            api_key="docker-model-runner"
        )
        self.model = "ai/llama3.2:1B-Q8_0"
    
    def chat(self, message):
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": message}],
                max_tokens=100,
                temperature=0.7
            )
            
            return response.choices[0].message.content
        except Exception as e:
            return f"Error: {e}"

if __name__ == "__main__":
    runner = DockerModelRunnerTCP()
    
    message = input("Enter your message: ")
    response = runner.chat(message)
    print(f"AI Response: {response}")

Step 8: Error Handling and Best Practices

8.1 Create Robust Client with Error Handling

Create robust_client.py:

import time
import logging
from openai import OpenAI
from typing import Optional

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DockerModelRunnerClient:
    def __init__(self, base_url: str = "http://model-runner.docker.internal/engines/llama.cpp/v1/", 
                 model: str = "ai/llama3.2:1B-Q8_0", 
                 max_retries: int = 3,
                 retry_delay: float = 1.0):
        
        self.client = OpenAI(base_url=base_url, api_key="dummy")
        self.model = model
        self.max_retries = max_retries
        self.retry_delay = retry_delay
    
    def _make_request_with_retry(self, request_func, *args, **kwargs):
        """Make request with retry logic"""
        last_exception = None
        
        for attempt in range(self.max_retries):
            try:
                return request_func(*args, **kwargs)
            except Exception as e:
                last_exception = e
                logger.warning(f"Attempt {attempt + 1} failed: {e}")
                
                if attempt < self.max_retries - 1:
                    time.sleep(self.retry_delay * (2 ** attempt))  # Exponential backoff
        
        raise last_exception
    
    def chat(self, message: str, system_message: Optional[str] = None, 
             max_tokens: int = 150, temperature: float = 0.7) -> str:
        """Send chat completion request with retry logic"""
        
        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": message})
        
        def make_request():
            return self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
        
        try:
            response = self._make_request_with_retry(make_request)
            return response.choices[0].message.content
        except Exception as e:
            logger.error(f"Failed to get response after {self.max_retries} attempts: {e}")
            return f"Error: Unable to get response from model"
    
    def check_model_status(self) -> bool:
        """Check if the model is available"""
        try:
            models = self.client.models.list()
            available_models = [model.id for model in models.data]
            return self.model in available_models
        except Exception as e:
            logger.error(f"Failed to check model status: {e}")
            return False

# Usage example
if __name__ == "__main__":
    client = DockerModelRunnerClient()
    
    # Check if model is available
    if client.check_model_status():
        print("✅ Model is available")
    else:
        print("❌ Model is not available")
    
    # Test chat
    response = client.chat("Hello, how are you?")
    print(f"Response: {response}")

Step 9: Testing and Validation

9.1 Create Test Script

Create test_model_runner.py:

import unittest
from robust_client import DockerModelRunnerClient

class TestDockerModelRunner(unittest.TestCase):
    def setUp(self):
        self.client = DockerModelRunnerClient()
    
    def test_model_availability(self):
        """Test if model is available"""
        self.assertTrue(self.client.check_model_status(), "Model should be available")
    
    def test_basic_chat(self):
        """Test basic chat functionality"""
        response = self.client.chat("Say hello")
        self.assertIsInstance(response, str)
        self.assertGreater(len(response), 0)
    
    def test_system_message(self):
        """Test chat with system message"""
        response = self.client.chat(
            "What's your role?", 
            system_message="You are a helpful assistant."
        )
        self.assertIsInstance(response, str)
        self.assertGreater(len(response), 0)

if __name__ == "__main__":
    unittest.main()

9.2 Run Tests

python test_model_runner.py

Troubleshooting

Common Issues and Solutions

  1. Connection Refused Error
    • Ensure Docker Desktop is running
    • Verify Model Runner is enabled: docker model status
    • Check if the model is downloaded: docker model ls
  2. Model Not Found Error
    • Pull the model: docker model pull ai/llama3.2:1B-Q8_0
    • Verify model name matches exactly
  3. Slow Response Times
    • Check GPU usage in Activity Monitor
    • Consider using a smaller model for testing
    • Ensure sufficient system memory
  4. TCP Connection Issues
    • Verify TCP support is enabled in Docker Desktop
    • Check port 12434 is not used by other applications
  • Use host.docker.internal instead of localhost when connecting from inside a container (the connectivity probe sketched below helps confirm which endpoint is reachable)
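
When a connection problem isn't obvious, a quick probe of both endpoints narrows down whether the issue is the runner itself, the URL, or the environment the script runs in. A minimal sketch, reusing the OpenAI-compatible models route that the client examples above rely on and the default paths and port from this guide:

import requests

def probe(base_url: str) -> None:
    """Report whether Docker Model Runner answers at base_url and which models it lists."""
    try:
        resp = requests.get(f"{base_url.rstrip('/')}/models", timeout=5)
        resp.raise_for_status()
        model_ids = [m.get("id") for m in resp.json().get("data", [])]
        print(f"OK      {base_url} -> {model_ids}")
    except requests.exceptions.RequestException as e:
        print(f"FAILED  {base_url} -> {e}")

if __name__ == "__main__":
    probe("http://localhost:12434/engines/llama.cpp/v1/")               # host (TCP enabled)
    probe("http://model-runner.docker.internal/engines/llama.cpp/v1/")  # from a container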

Checking Logs

tail -f ~/Library/Containers/com.docker.docker/Data/log/host/inference-llama.cpp.log

Next Steps

  • Experiment with different models available in the Docker Hub ai/ namespace
  • Build more complex applications using conversation history (see the sketch after this list)
  • Integrate with other Python frameworks like Django or FastAPI
  • Explore embeddings and other OpenAI-compatible endpoints
  • Consider building production applications with proper error handling and monitoring
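
As a starting point for the conversation-history idea, here is a minimal multi-turn sketch: the full message list is resent on every request so the model sees all prior turns. It assumes the same model as the rest of this guide and the TCP endpoint from Step 7; adjust the base URL for your setup.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1/",  # TCP host access (Step 7)
    api_key="docker-model-runner",
)

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_message: str) -> str:
    """Append the user message, call the model with the full history, store the reply."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="ai/llama3.2:1B-Q8_0",
        messages=history,
        max_tokens=150,
        temperature=0.7,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

if __name__ == "__main__":
    print(chat_turn("My name is Sam."))
    print(chat_turn("What is my name?"))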

Additional Resources