Running Vision Models Locally with Docker Model Runner: A Complete Tutorial

Want to run AI vision models that can analyze images, extract text, and answer questions about photos, all on your own machine and without sending data to external APIs? Docker Model Runner makes it straightforward to run multimodal AI models locally, giving you complete control over your data while using the same OpenAI-compatible API format you already know.

Let me show you how to get started with vision-capable models using Docker Model Runner, from pulling your first multimodal model to making real API calls that analyze images.

What is Docker Model Runner?

Docker Model Runner is Docker's native solution for running AI models locally, integrated directly into Docker Desktop. It brings local AI inference into your Docker workflow, using the same familiar concepts—registries, tags, versioning—that you already use for containers.

Here's what makes it clever: Models don't run inside containers. Instead, Docker Model Runner:

  1. Exposes an inference server API endpoint through Docker Desktop
  2. Runs llama.cpp as a native host process for direct GPU access
  3. Loads models on-demand when you make API calls
  4. Automatically unloads models after 5 minutes of inactivity

No containers to manage, no docker run commands before making API calls. Models are stored as OCI artifacts in registries, giving you version control and distribution for AI models just like you have for container images.
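
One quick way to see this in action is to ask the API itself which models are available, with nothing started beforehand. Here's a minimal sketch, assuming the default host endpoint on localhost:12434 (covered below) and the OpenAI-style models route the runner exposes per engine:

import requests

# Ask the runner which models it currently has available. No container is
# started for this; the endpoint is there whenever Docker Desktop is running
# with Model Runner enabled.
response = requests.get(
    "http://localhost:12434/engines/llama.cpp/v1/models",
    timeout=5,
)
response.raise_for_status()

for model in response.json().get("data", []):
    print(model.get("id"))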

The architecture in a nutshell

Docker Model Runner has three main components:

  • model-runner - The backend that manages and runs models (native process for performance)
  • model-cli - Command-line tool for pulling and managing models
  • model-spec - Specification for packaging models as OCI artifacts

The inference engine is llama.cpp, which exposes an OpenAI-compatible API. This means if you've worked with OpenAI's API before, you already know most of what you need.

Getting started: Your first model

If you have Docker Desktop, you already have Model Runner. Let's verify:

docker model --help

Pull a text model first

Let's start with a simple text model to get comfortable:

# Pull a small, fast model
docker model pull ai/smollm2:360M-Q4_K_M

# List your models
docker model ls

The model name breaks down as:

  • ai/ - Docker's namespace for official models
  • smollm2 - Model family
  • 360M - 360 million parameters
  • Q4_K_M - 4-bit quantization (smaller, faster)

Make your first API call

The API is available at localhost:12434 (you may need to enable host-side TCP support in Docker Desktop's Model Runner settings):

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/smollm2:360M-Q4_K_M",
  "messages": [
    {
      "role": "user",
      "content": "Explain Docker in simple terms"
    }
  ]
}'

No API keys, no authentication—everything runs locally. The response follows OpenAI's familiar format:

{
  "id": "chatcmpl-...",
  "model": "ai/smollm2:360M-Q4_K_M",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Docker is a platform that lets you package..."
    }
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 87,
    "total_tokens": 102
  }
}
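
Because the endpoint speaks the OpenAI protocol, you don't have to use raw curl: you can point an existing OpenAI client library at it. Here's a minimal sketch with the official openai Python package; the API key is just a placeholder, since nothing checks it locally:

from openai import OpenAI

# Same local endpoint as the curl example, wrapped by the standard client
client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed",  # placeholder: Model Runner does no authentication
)

response = client.chat.completions.create(
    model="ai/smollm2:360M-Q4_K_M",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}],
)
print(response.choices[0].message.content)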

Moving to multimodal: Vision-capable models

Now let's explore Docker Model Runner's multimodal capabilities. Vision-capable models can analyze images, extract text, identify objects, and answer questions about visual content.

Pulling a vision model

Several models in the ai/ namespace support vision:

# Google's Gemma 3 (good balance of speed and quality)
docker model pull ai/gemma3:4B-Q4_K_M

# Meta's Llama 3.2 Vision (larger, more capable)
docker model pull ai/llama3.2-vision:11B-Q4_K_M

Identifying vision-capable models

Look for these model families that support multimodal input:

  • gemma3 - Google's Gemma 3 family (vision support in the 4B, 12B, and 27B variants)
  • llama3.2-vision - Meta's multimodal models (11B, 90B)
  • llava - Popular open-source vision models

You can also pull GGUF models directly from Hugging Face:

docker model pull hf.co/bartowski/llava-v1.6-mistral-7b-GGUF

Working with images: The base64 format

To send images to vision models, you'll encode them as base64 data URIs. This keeps everything in a single JSON payload and works seamlessly with the OpenAI-compatible API format.

Quick encoding methods

Command line (macOS):

echo "data:image/jpeg;base64,$(base64 -i photo.jpg)"

Command line (Linux, where GNU base64 wraps output lines by default):

echo "data:image/jpeg;base64,$(base64 -w 0 photo.jpg)"

Python:

import base64

def encode_image(image_path):
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    
    ext = image_path.lower().split('.')[-1]
    mime = f'image/{"jpeg" if ext == "jpg" else ext}'
    
    return f"data:{mime};base64,{encoded}"

data_uri = encode_image('photo.jpg')

JavaScript/Node.js:

const fs = require('fs');
const path = require('path');

function encodeImage(imagePath) {
    const buffer = fs.readFileSync(imagePath);
    const base64 = buffer.toString('base64');
    const ext = path.extname(imagePath).slice(1).toLowerCase();
    const mime = `image/${ext === 'jpg' ? 'jpeg' : ext}`;
    
    return `data:${mime};base64,${base64}`;
}

Your first vision API call

Here's a complete example that sends both text and an image:

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What do you see in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        }
      ]
    }
  ]
}'

The key difference from text-only requests is the content array format:

  • Text prompts use {"type": "text", "text": "..."}
  • Images use {"type": "image_url", "image_url": {"url": "data:..."}}

You can mix multiple text segments and images in the same request!
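
In Python, that just means building the content list programmatically. Here's a sketch that reuses the encode_image() helper from the encoding section above (the file name is a placeholder) to send one image plus two text segments:

import requests

payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is a screenshot of our dashboard."},
            {"type": "image_url", "image_url": {"url": encode_image("dashboard.png")}},
            {"type": "text", "text": "Which metric looks most unusual, and why?"}
        ]
    }]
}

response = requests.post(
    "http://localhost:12434/engines/llama.cpp/v1/chat/completions",
    json=payload,
    timeout=120
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])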

Practical use cases

1. Image description and analysis

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Describe this image in detail"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64]"}
      }
    ]
  }]
}'

2. Text extraction (OCR)

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Extract all text from this image"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/png;base64,[BASE64]"}
      }
    ]
  }]
}'

3. Document and chart analysis

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/llama3.2-vision:11B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Analyze this chart and summarize the key trends"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/png;base64,[BASE64]"}
      }
    ]
  }]
}'

4. Comparing multiple images

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Compare these two images and describe the differences"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64_1]"}
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64_2]"}
      }
    ]
  }]
}'

Building a Python wrapper

Let's create a reusable class for easier integration:

import base64
import requests
from pathlib import Path

class DockerModelRunner:
    def __init__(self, 
                 base_url="http://localhost:12434",
                 model="ai/gemma3:4B-Q4_K_M"):
        self.base_url = base_url
        self.model = model
        self.endpoint = f"{base_url}/engines/llama.cpp/v1/chat/completions"
    
    def encode_image(self, image_path):
        """Encode image to base64 data URI"""
        path = Path(image_path)
        with open(path, 'rb') as f:
            encoded = base64.b64encode(f.read()).decode('utf-8')
        
        ext = path.suffix.lower()[1:]
        mime = f'image/{"jpeg" if ext == "jpg" else ext}'
        
        return f"data:{mime};base64,{encoded}"
    
    def analyze_image(self, image_path, prompt="Describe this image"):
        """Analyze an image with a custom prompt"""
        data_uri = self.encode_image(image_path)
        
        payload = {
            "model": self.model,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}}
                ]
            }]
        }
        
        response = requests.post(
            self.endpoint,
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=60
        )
        
        response.raise_for_status()
        result = response.json()
        return result['choices'][0]['message']['content']
    
    def chat(self, message):
        """Simple text chat"""
        payload = {
            "model": self.model,
            "messages": [{
                "role": "user",
                "content": message
            }]
        }
        
        response = requests.post(
            self.endpoint,
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=60
        )
        
        response.raise_for_status()
        result = response.json()
        return result['choices'][0]['message']['content']

# Usage example
if __name__ == "__main__":
    runner = DockerModelRunner()
    
    # Text chat
    print("=== Text Chat ===")
    response = runner.chat("What is containerization?")
    print(response)
    
    # Vision analysis
    print("\n=== Vision Analysis ===")
    response = runner.analyze_image(
        "diagram.png",
        "Explain what this architecture diagram shows"
    )
    print(response)

Tips for best results

1. Choose the right model size

  • Prototyping and general use: ai/gemma3:4B-Q4_K_M - good balance of speed and quality, and the smallest vision-capable Gemma 3
  • Higher accuracy: ai/llama3.2-vision:11B-Q4_K_M - better understanding, slower
  • Production quality: ai/gemma3:27B-Q4_K_M - best results, highest resource requirements

2. Optimize your images

Before encoding, consider resizing large images. Using ImageMagick:

convert large-image.jpg -resize 1024x1024 optimized.jpg

Or with Python and Pillow:

from PIL import Image

img = Image.open('large-image.jpg')
img.thumbnail((1024, 1024))  # resizes in place, preserving aspect ratio
img.save('optimized.jpg')

Smaller images mean faster encoding, smaller payloads, and quicker inference.
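
You can also do both steps in memory, resizing and encoding without writing an optimized file to disk. Here's a small sketch with Pillow (the same approach the Flask example later in this post takes):

import base64
from io import BytesIO

from PIL import Image

def encode_image_resized(image_path, max_size=1024):
    """Resize an image in memory and return it as a base64 JPEG data URI"""
    image = Image.open(image_path)
    if image.mode != 'RGB':  # JPEG has no alpha channel
        image = image.convert('RGB')
    image.thumbnail((max_size, max_size))  # preserves aspect ratio

    buffer = BytesIO()
    image.save(buffer, format='JPEG', quality=85)
    encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
    return f"data:image/jpeg;base64,{encoded}"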

3. Understand the API URLs

The API endpoint changes based on context:

  • From the host machine: http://localhost:12434
  • From inside another container: http://model-runner.docker.internal:12434
  • With Docker Compose: declare the model under the top-level models element and reach the runner at the same internal hostname (see the Docker Compose section below)

4. Performance expectations

  • First API call loads the model (may take a few seconds)
  • Subsequent calls are fast
  • Models auto-unload after 5 minutes of inactivity
  • Vision requests are slower than text-only due to image processing
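
You can see the cold-start versus warm behavior for yourself by timing two identical calls back to back. A quick sketch (the absolute numbers depend entirely on your hardware and model size):

import time
import requests

ENDPOINT = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{"role": "user", "content": "Say hello in five words"}],
}

for label in ("cold call (model loads first)", "warm call (model in memory)"):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=300).raise_for_status()
    print(f"{label}: {time.perf_counter() - start:.1f}s")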

5. Supported image formats

Common formats work well:

  • JPEG/JPG: data:image/jpeg;base64,...
  • PNG: data:image/png;base64,...
  • WebP: data:image/webp;base64,...
  • GIF: data:image/gif;base64,...

Integrating with Docker Compose

You can declare model dependencies in your Compose file:

services:
  app:
    build: .
    ports:
      - "3000:3000"
    models:
      - vision-model
    environment:
      - MODEL_API_URL=http://model-runner.docker.internal:12434

models:
  vision-model:
    model: ai/gemma3:4B-Q4_K_M
    context_size: 4096

This ensures the model is available before your application starts.
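
Inside the app service, the MODEL_API_URL variable set in the Compose file above gives you the base endpoint. Here's a minimal sketch of how the containerized app might use it, falling back to the host endpoint when run outside Docker:

import os
import requests

# Base URL injected via the Compose file above; fall back to the host-side
# endpoint when running outside a container.
BASE_URL = os.environ.get("MODEL_API_URL", "http://localhost:12434")
ENDPOINT = f"{BASE_URL}/engines/llama.cpp/v1/chat/completions"

payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{"role": "user", "content": "Ping from inside the container"}],
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])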

Troubleshooting common issues

"Model not found"

# Verify the model is pulled
docker model ls

# Pull it if missing
docker model pull ai/gemma3:4B-Q4_K_M

Connection refused

  • Verify Docker Desktop is running
  • Check that Model Runner is enabled in Docker Desktop settings

Timeout errors

  • Try a smaller model
  • Reduce image size before encoding
  • Ensure sufficient system resources (RAM, GPU memory)

Unexpected responses

  • Verify the model supports vision (use gemma3 or llama3.2-vision)
  • Check that base64 encoding is complete and properly formatted
  • Ensure MIME type matches image format

Real-world example: Building an image analyzer

Here's a complete Flask app that uses Docker Model Runner for image analysis:

from flask import Flask, request, jsonify
import base64
import requests
from io import BytesIO
from PIL import Image

app = Flask(__name__)

MODEL_API = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
MODEL_NAME = "ai/gemma3:4B-Q4_K_M"

def encode_image(image_file):
    """Encode an uploaded image as a base64 JPEG data URI"""
    image = Image.open(image_file)
    
    # JPEG has no alpha channel, so normalize the mode first (e.g. RGBA PNGs)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    
    # Optimize size
    image.thumbnail((1024, 1024))
    
    # Convert to bytes
    buffer = BytesIO()
    image.save(buffer, format='JPEG')
    encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
    
    return f"data:image/jpeg;base64,{encoded}"

@app.route('/analyze', methods=['POST'])
def analyze():
    if 'image' not in request.files:
        return jsonify({"error": "No image provided"}), 400
    
    image_file = request.files['image']
    prompt = request.form.get('prompt', 'Describe this image')
    
    # Encode image
    data_uri = encode_image(image_file)
    
    # Call Docker Model Runner
    payload = {
        "model": MODEL_NAME,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}}
            ]
        }]
    }
    
    response = requests.post(MODEL_API, json=payload, timeout=60)
    response.raise_for_status()
    result = response.json()
    
    return jsonify({
        "analysis": result['choices'][0]['message']['content'],
        "model": MODEL_NAME,
        "tokens_used": result['usage']['total_tokens']
    })

if __name__ == '__main__':
    app.run(debug=True, port=5000)

Test it:

curl -X POST http://localhost:5000/analyze \
  -F "image=@photo.jpg" \
  -F "prompt=What objects do you see in this image?"

Why choose Docker Model Runner?

  • Complete privacy: Your images and data never leave your machine
  • No API costs: Run as many requests as you want, no usage fees
  • Docker integration: Models work seamlessly with your containerized apps
  • Familiar API: OpenAI-compatible format means minimal learning curve
  • Registry benefits: Version control, tagging, and distribution for AI models
  • Flexible deployment: Works on Docker Desktop, servers, and CI/CD pipelines

Next steps

Now that you understand the basics:

  1. Experiment with different models - Try various sizes and families to find the right balance
  2. Build practical applications - Image captioning, document analysis, accessibility tools
  3. Integrate with your workflow - Add vision capabilities to existing Docker applications
  4. Optimize performance - Tune image sizes and model selection for your use case
  5. Explore the ecosystem - Check out models on Docker Hub and Hugging Face

Have questions or want to share what you've built with Docker Model Runner? We'd love to hear from you!


Ready to run AI vision models locally? Get started with Docker Model Runner today at docs.docker.com/ai/model-runner/