Running Vision Models Locally with Docker Model Runner: A Complete Tutorial

Want to run AI vision models that can analyze images, extract text, and answer questions about photos, all on your own machine and without sending data to external APIs? Docker Model Runner makes it straightforward to run multimodal AI models locally, giving you complete control over your data while using the same OpenAI-compatible API format you already know.

Let me show you how to get started with vision-capable models using Docker Model Runner, from pulling your first multimodal model to making real API calls that analyze images.

What is Docker Model Runner?

Docker Model Runner is Docker's native solution for running AI models locally, integrated directly into Docker Desktop. It brings local AI inference into your Docker workflow, using the same familiar concepts—registries, tags, versioning—that you already use for containers.

Here's what makes it clever: Models don't run inside containers. Instead, Docker Model Runner:

  1. Exposes an inference server API endpoint through Docker Desktop
  2. Runs llama.cpp as a native host process for direct GPU access
  3. Loads models on-demand when you make API calls
  4. Automatically unloads models after 5 minutes of inactivity

No containers to manage, no docker run commands before making API calls. Models are stored as OCI artifacts in registries, giving you version control and distribution for AI models just like you have for container images.
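
One quick way to see this in action is to ask the API itself which models are available, with nothing started beforehand. Here's a minimal sketch, assuming the default host endpoint on localhost:12434 (covered below) and the OpenAI-style models route the runner exposes per engine:

import requests

# Ask the runner which models it currently has available. No container is
# started for this; the endpoint is there whenever Docker Desktop is running
# with Model Runner enabled.
response = requests.get(
    "http://localhost:12434/engines/llama.cpp/v1/models",
    timeout=5,
)
response.raise_for_status()

for model in response.json().get("data", []):
    print(model.get("id"))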

The architecture in a nutshell

Docker Model Runner has three main components:

  • model-runner - The backend that manages and runs models (native process for performance)
  • model-cli - Command-line tool for pulling and managing models
  • model-spec - Specification for packaging models as OCI artifacts

The inference engine is llama.cpp, which exposes an OpenAI-compatible API. This means if you've worked with OpenAI's API before, you already know most of what you need.

Getting started: Your first model

If you have Docker Desktop, you already have Model Runner. Let's verify:

docker model --help

Pull a text model first

Let's start with a simple text model to get comfortable:

# Pull a small, fast model
docker model pull ai/smollm2:360M-Q4_K_M

# List your models
docker model ls

The model name breaks down as:

  • ai/ - Docker's namespace for official models
  • smollm2 - Model family
  • 360M - 360 million parameters
  • Q4_K_M - 4-bit quantization (smaller, faster)

Make your first API call

The API is available at localhost:12434 (you may need to enable host-side TCP support in Docker Desktop's Model Runner settings):

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/smollm2:360M-Q4_K_M",
  "messages": [
    {
      "role": "user",
      "content": "Explain Docker in simple terms"
    }
  ]
}'

No API keys, no authentication—everything runs locally. The response follows OpenAI's familiar format:

{
  "id": "chatcmpl-...",
  "model": "ai/smollm2:360M-Q4_K_M",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Docker is a platform that lets you package..."
    }
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 87,
    "total_tokens": 102
  }
}
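
Because the endpoint speaks the OpenAI protocol, you don't have to use raw curl: you can point an existing OpenAI client library at it. Here's a minimal sketch with the official openai Python package; the API key is just a placeholder, since nothing checks it locally:

from openai import OpenAI

# Same local endpoint as the curl example, wrapped by the standard client
client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed",  # placeholder: Model Runner does no authentication
)

response = client.chat.completions.create(
    model="ai/smollm2:360M-Q4_K_M",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}],
)
print(response.choices[0].message.content)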

Moving to multimodal: Vision-capable models

Now let's explore Docker Model Runner's multimodal capabilities. Vision-capable models can analyze images, extract text, identify objects, and answer questions about visual content.

Pulling a vision model

Several models in the ai/ namespace support vision:

# Google's Gemma 3 (good balance of speed and quality)
docker model pull ai/gemma3:4B-Q4_K_M

# Meta's Llama 3.2 Vision (larger, more capable)
docker model pull ai/llama3.2-vision:11B-Q4_K_M

Identifying vision-capable models

Look for these model families that support multimodal input:

  • gemma3 - Google's Gemma 3 family (vision support in the 4B, 12B, and 27B variants)
  • llama3.2-vision - Meta's multimodal models (11B, 90B)
  • llava - Popular open-source vision models

You can also pull GGUF models directly from Hugging Face:

docker model pull hf.co/bartowski/llava-v1.6-mistral-7b-GGUF

Working with images: The base64 format

To send images to vision models, you'll encode them as base64 data URIs. This keeps everything in a single JSON payload and works seamlessly with the OpenAI-compatible API format.

Quick encoding methods

Command line (macOS):

echo "data:image/jpeg;base64,$(base64 -i photo.jpg)"

Command line (Linux, where GNU base64 wraps output lines by default):

echo "data:image/jpeg;base64,$(base64 -w 0 photo.jpg)"

Python:

import base64

def encode_image(image_path):
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    
    ext = image_path.lower().split('.')[-1]
    mime = f'image/{"jpeg" if ext == "jpg" else ext}'
    
    return f"data:{mime};base64,{encoded}"

data_uri = encode_image('photo.jpg')

JavaScript/Node.js:

const fs = require('fs');
const path = require('path');

function encodeImage(imagePath) {
    const buffer = fs.readFileSync(imagePath);
    const base64 = buffer.toString('base64');
    const ext = path.extname(imagePath).slice(1).toLowerCase();
    const mime = `image/${ext === 'jpg' ? 'jpeg' : ext}`;
    
    return `data:${mime};base64,${base64}`;
}

Your first vision API call

Here's a complete example that sends both text and an image:

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What do you see in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        }
      ]
    }
  ]
}'

The key difference from text-only requests is the content array format:

  • Text prompts use {"type": "text", "text": "..."}
  • Images use {"type": "image_url", "image_url": {"url": "data:..."}}

You can mix multiple text segments and images in the same request!
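
In Python, that just means building the content list programmatically. Here's a sketch that reuses the encode_image() helper from the encoding section above (the file name is a placeholder) to send one image plus two text segments:

import requests

payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is a screenshot of our dashboard."},
            {"type": "image_url", "image_url": {"url": encode_image("dashboard.png")}},
            {"type": "text", "text": "Which metric looks most unusual, and why?"}
        ]
    }]
}

response = requests.post(
    "http://localhost:12434/engines/llama.cpp/v1/chat/completions",
    json=payload,
    timeout=120
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])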

Practical use cases

1. Image description and analysis

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Describe this image in detail"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64]"}
      }
    ]
  }]
}'

2. Text extraction (OCR)

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Extract all text from this image"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/png;base64,[BASE64]"}
      }
    ]
  }]
}'

3. Document and chart analysis

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/llama3.2-vision:11B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Analyze this chart and summarize the key trends"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/png;base64,[BASE64]"}
      }
    ]
  }]
}'

4. Comparing multiple images

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "ai/gemma3:4B-Q4_K_M",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Compare these two images and describe the differences"
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64_1]"}
      },
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,[BASE64_2]"}
      }
    ]
  }]
}'

Building a Python wrapper

Let's create a reusable class for easier integration:

import base64
import requests
from pathlib import Path

class DockerModelRunner:
    def __init__(self, 
                 base_url="http://localhost:12434",
                 model="ai/gemma3:4B-Q4_K_M"):
        self.base_url = base_url
        self.model = model
        self.endpoint = f"{base_url}/engines/llama.cpp/v1/chat/completions"
    
    def encode_image(self, image_path):
        """Encode image to base64 data URI"""
        path = Path(image_path)
        with open(path, 'rb') as f:
            encoded = base64.b64encode(f.read()).decode('utf-8')
        
        ext = path.suffix.lower()[1:]
        mime = f'image/{"jpeg" if ext == "jpg" else ext}'
        
        return f"data:{mime};base64,{encoded}"
    
    def analyze_image(self, image_path, prompt="Describe this image"):
        """Analyze an image with a custom prompt"""
        data_uri = self.encode_image(image_path)
        
        payload = {
            "model": self.model,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}}
                ]
            }]
        }
        
        response = requests.post(
            self.endpoint,
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=60
        )
        
        response.raise_for_status()
        result = response.json()
        return result['choices'][0]['message']['content']
    
    def chat(self, message):
        """Simple text chat"""
        payload = {
            "model": self.model,
            "messages": [{
                "role": "user",
                "content": message
            }]
        }
        
        response = requests.post(
            self.endpoint,
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=60
        )
        
        response.raise_for_status()
        result = response.json()
        return result['choices'][0]['message']['content']

# Usage example
if __name__ == "__main__":
    runner = DockerModelRunner()
    
    # Text chat
    print("=== Text Chat ===")
    response = runner.chat("What is containerization?")
    print(response)
    
    # Vision analysis
    print("\n=== Vision Analysis ===")
    response = runner.analyze_image(
        "diagram.png",
        "Explain what this architecture diagram shows"
    )
    print(response)

Tips for best results

1. Choose the right model size

  • Prototyping and general use: ai/gemma3:4B-Q4_K_M - good balance of speed and quality, and the smallest vision-capable Gemma 3
  • Higher accuracy: ai/llama3.2-vision:11B-Q4_K_M - better understanding, slower
  • Production quality: ai/gemma3:27B-Q4_K_M - best results, highest resource requirements

2. Optimize your images

Before encoding, consider resizing large images. Using ImageMagick:

convert large-image.jpg -resize 1024x1024 optimized.jpg

Or with Python and Pillow:

from PIL import Image

img = Image.open('large-image.jpg')
img.thumbnail((1024, 1024))  # resizes in place, preserving aspect ratio
img.save('optimized.jpg')

Smaller images mean faster encoding, smaller payloads, and quicker inference.
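
You can also do both steps in memory, resizing and encoding without writing an optimized file to disk. Here's a small sketch with Pillow (the same approach the Flask example later in this post takes):

import base64
from io import BytesIO

from PIL import Image

def encode_image_resized(image_path, max_size=1024):
    """Resize an image in memory and return it as a base64 JPEG data URI"""
    image = Image.open(image_path)
    if image.mode != 'RGB':  # JPEG has no alpha channel
        image = image.convert('RGB')
    image.thumbnail((max_size, max_size))  # preserves aspect ratio

    buffer = BytesIO()
    image.save(buffer, format='JPEG', quality=85)
    encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
    return f"data:image/jpeg;base64,{encoded}"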

3. Understand the API URLs

The API endpoint changes based on context:

  • From the host machine: http://localhost:12434
  • From inside another container: http://model-runner.docker.internal:12434
  • With Docker Compose: declare the model under the top-level models element and reach the runner at the same internal hostname (see the Docker Compose section below)

4. Performance expectations

  • First API call loads the model (may take a few seconds)
  • Subsequent calls are fast
  • Models auto-unload after 5 minutes of inactivity
  • Vision requests are slower than text-only due to image processing
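
You can see the cold-start versus warm behavior for yourself by timing two identical calls back to back. A quick sketch (the absolute numbers depend entirely on your hardware and model size):

import time
import requests

ENDPOINT = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{"role": "user", "content": "Say hello in five words"}],
}

for label in ("cold call (model loads first)", "warm call (model in memory)"):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=300).raise_for_status()
    print(f"{label}: {time.perf_counter() - start:.1f}s")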

5. Supported image formats

Common formats work well:

  • JPEG/JPG: data:image/jpeg;base64,...
  • PNG: data:image/png;base64,...
  • WebP: data:image/webp;base64,...
  • GIF: data:image/gif;base64,...

Integrating with Docker Compose

You can declare model dependencies in your Compose file:

services:
  app:
    build: .
    ports:
      - "3000:3000"
    models:
      - vision-model
    environment:
      - MODEL_API_URL=http://model-runner.docker.internal:12434

models:
  vision-model:
    model: ai/gemma3:4B-Q4_K_M
    context_size: 4096

This ensures the model is available before your application starts.
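
Inside the app service, the MODEL_API_URL variable set in the Compose file above gives you the base endpoint. Here's a minimal sketch of how the containerized app might use it, falling back to the host endpoint when run outside Docker:

import os
import requests

# Base URL injected via the Compose file above; fall back to the host-side
# endpoint when running outside a container.
BASE_URL = os.environ.get("MODEL_API_URL", "http://localhost:12434")
ENDPOINT = f"{BASE_URL}/engines/llama.cpp/v1/chat/completions"

payload = {
    "model": "ai/gemma3:4B-Q4_K_M",
    "messages": [{"role": "user", "content": "Ping from inside the container"}],
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])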

Troubleshooting common issues

"Model not found"

# Verify the model is pulled
docker model ls

# Pull it if missing
docker model pull ai/gemma3:4B-Q4_K_M

Connection refused

  • Verify Docker Desktop is running
  • Check that Model Runner is enabled in Docker Desktop settings

Timeout errors

  • Try a smaller model
  • Reduce image size before encoding
  • Ensure sufficient system resources (RAM, GPU memory)

Unexpected responses

  • Verify the model supports vision (use gemma3 or llama3.2-vision)
  • Check that base64 encoding is complete and properly formatted
  • Ensure MIME type matches image format

Real-world example: Building an image analyzer

Here's a complete Flask app that uses Docker Model Runner for image analysis:

from flask import Flask, request, jsonify
import base64
import requests
from io import BytesIO
from PIL import Image

app = Flask(__name__)

MODEL_API = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
MODEL_NAME = "ai/gemma3:4B-Q4_K_M"

def encode_image(image_file):
    """Encode an uploaded image as a base64 JPEG data URI"""
    image = Image.open(image_file)
    
    # JPEG has no alpha channel, so normalize the mode first (e.g. RGBA PNGs)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    
    # Optimize size
    image.thumbnail((1024, 1024))
    
    # Convert to bytes
    buffer = BytesIO()
    image.save(buffer, format='JPEG')
    encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
    
    return f"data:image/jpeg;base64,{encoded}"

@app.route('/analyze', methods=['POST'])
def analyze():
    if 'image' not in request.files:
        return jsonify({"error": "No image provided"}), 400
    
    image_file = request.files['image']
    prompt = request.form.get('prompt', 'Describe this image')
    
    # Encode image
    data_uri = encode_image(image_file)
    
    # Call Docker Model Runner
    payload = {
        "model": MODEL_NAME,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}}
            ]
        }]
    }
    
    response = requests.post(MODEL_API, json=payload, timeout=60)
    response.raise_for_status()
    result = response.json()
    
    return jsonify({
        "analysis": result['choices'][0]['message']['content'],
        "model": MODEL_NAME,
        "tokens_used": result['usage']['total_tokens']
    })

if __name__ == '__main__':
    app.run(debug=True, port=5000)

Test it:

curl -X POST http://localhost:5000/analyze \
  -F "image=@photo.jpg" \
  -F "prompt=What objects do you see in this image?"

Why choose Docker Model Runner?

  • Complete privacy: Your images and data never leave your machine
  • No API costs: Run as many requests as you want, no usage fees
  • Docker integration: Models work seamlessly with your containerized apps
  • Familiar API: OpenAI-compatible format means minimal learning curve
  • Registry benefits: Version control, tagging, and distribution for AI models
  • Flexible deployment: Works on Docker Desktop, servers, and CI/CD pipelines

Next steps

Now that you understand the basics:

  1. Experiment with different models - Try various sizes and families to find the right balance
  2. Build practical applications - Image captioning, document analysis, accessibility tools
  3. Integrate with your workflow - Add vision capabilities to existing Docker applications
  4. Optimize performance - Tune image sizes and model selection for your use case
  5. Explore the ecosystem - Check out models on Docker Hub and Hugging Face

Have questions or want to share what you've built with Docker Model Runner? We'd love to hear from you!


Ready to run AI vision models locally? Get started with Docker Model Runner today at docs.docker.com/ai/model-runner/