Docker Desktop 4.42: llama.cpp Gets Streaming and Tool Calling Support

Docker Desktop 4.42 brings real-time streaming and tool calling to Model Runner, transforming how developers build AI applications locally. Instead of waiting for complete responses or relying on cloud services, you can watch a model generate output token by token, with full GPU acceleration, through a local endpoint on port 12434.

Docker Desktop 4.42 brings exciting new capabilities to AI developers with enhanced llama.cpp server support in Model Runner. Two powerful features are now available that significantly improve the local AI development experience.

What's New

  • Streaming Support: The llama.cpp server now delivers real-time token streaming through the OpenAI-compatible API. Instead of waiting for complete responses, you can now process AI output as it generates, creating more responsive and interactive applications.
  • Tool Calling: Perhaps even more significant is the addition of tool calling capabilities. Your locally-running models can now execute functions, access external APIs, and perform complex reasoning tasks that extend beyond simple text generation.
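
At the API level, both features are opt-in fields on the standard OpenAI-style chat completions request. The sketch below (Python, with placeholder values) shows the only two additions relative to a plain request; full runnable demos follow in Getting Started.

# The only request-level changes are the "stream" and "tools" fields on the
# standard chat completions body (placeholder values shown).
request_body = {
    "model": "ai/llama3.2:1B-Q8_0",
    "messages": [{"role": "user", "content": "..."}],
    "stream": True,  # opt in to token-by-token streaming
    "tools": [],     # optional list of function schemas for tool calling
}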

Models That Support Tool Calling

Tool calling availability is model-specific rather than a universal Docker Model Runner feature.

Based on Docker's recent testing:

  • Qwen 3 (14B) - High tool calling performance (F1 score: 0.971)
  • Llama 3.3 models - Variable performance
  • Some Gemma 3 variants - Limited support

Models That Don't Support Tool Calling

  • SmolLM2 - No tool calling capability
  • Many smaller quantized models - Often lack function calling training

Why This Matters

These additions transform Docker Desktop's Model Runner from a basic inference server into a comprehensive AI development platform. Streaming enables better user experiences in chat applications, while tool calling opens doors to AI agents that can interact with databases, APIs, and external services.

The combination means you can now build sophisticated AI applications entirely locally - no cloud dependencies, better privacy, and faster iteration cycles during development.

Getting Started

Prerequisites

  • Docker Desktop 4.42 or later
  • Terminal/Command Prompt access

Step 1: Enable Docker Model Runner

Method 1: Using CLI

# Enable Model Runner with TCP support on port 12434
docker desktop enable model-runner --tcp 12434

Method 2: Using Docker Dashboard

  1. Open Docker Desktop
  2. Go to Settings → Features in development
  3. Enable "Docker Model Runner"
  4. Enable "Enable host-side TCP support" and set port to 12434
  5. Click "Apply & Restart"

Verify Installation

# Check if Model Runner is available
docker model --help

# Check if Model Runner is running
docker model status
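
With TCP support enabled, the same OpenAI-compatible endpoints should answer on port 12434. As a quick sanity check, the following Python sketch lists the available models; it assumes the /models path follows the OpenAI convention used by the chat endpoints later in this guide.

import requests

# List models through the OpenAI-compatible endpoint on the TCP port.
# Assumes /engines/llama.cpp/v1/models follows the standard OpenAI layout.
resp = requests.get("http://localhost:12434/engines/llama.cpp/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))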

Step 2: Download and Manage Models

List Available Models

# List downloaded models (initially empty)
docker model ls

Pull a Model from Docker Hub

# Download Llama 3.2 1B model (recommended for testing)
docker model pull ai/llama3.2:1B-Q8_0

# Other available models:
# docker model pull ai/gemma3
# docker model pull ai/qwq
# docker model pull ai/mistral-nemo
# docker model pull ai/phi4
# docker model pull ai/qwen2.5

Verify Model Download

docker model ls
# Should show:
# MODEL                PARAMETERS  QUANTIZATION  ARCHITECTURE  MODEL ID      CREATED       SIZE
# ai/llama3.2:1B-Q8_0  1.24 B      Q8_0          llama         a15c3117eeeb  20 hours ago  1.22 GiB

Step 3: Basic Model Testing

Quick Single Query

docker model run ai/llama3.2:1B-Q8_0 "Tell me a fun fact about dolphins"

Interactive Chat Mode

docker model run ai/llama3.2:1B-Q8_0
# Interactive chat mode started. Type '/bye' to exit.
# > What is quantum computing?
# > /bye

Step 4: Monitor GPU Usage (Mac Only)

  1. Open Activity Monitor (⌘ + Space → "Activity Monitor")
  2. Click "Window""GPU History"
  3. Run a query and watch GPU usage spike in real-time
  4. Leave Activity Monitor open for the next steps

Step 5: Demo Streaming Responses

Method 1: Direct TCP Connection (Port 12434)

curl -N http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1B-Q8_0",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a short story about a robot learning to paint, but tell it slowly."}
    ]
  }'

Expected Output:

data: {"choices":[{"delta":{"role":"assistant"}}],...}
data: {"choices":[{"delta":{"content":"Once"}}],...}
data: {"choices":[{"delta":{"content":" upon"}}],...}
data: {"choices":[{"delta":{"content":" a"}}],...}
...

Method 2: Python Streaming Client

import requests
import json

def stream_response():
    url = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
    data = {
        "model": "ai/llama3.2:1B-Q8_0",
        "stream": True,
        "messages": [
            {"role": "user", "content": "Tell me about the history of computers"}
        ]
    }
    
    print("🤖 AI Response (streaming):")
    response = requests.post(url, json=data, stream=True)
    
    for line in response.iter_lines():
        if line and line.startswith(b'data: '):
            try:
                chunk = json.loads(line[6:])
                if 'choices' in chunk and chunk['choices']:
                    delta = chunk['choices'][0].get('delta', {})
                    if 'content' in delta:
                        print(delta['content'], end='', flush=True)
            except json.JSONDecodeError:
                # Skip non-JSON lines such as the final 'data: [DONE]' sentinel
                continue
    print("\n✅ Stream completed!")

# Run the demo
stream_response()
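
Because the endpoint is OpenAI-compatible, the official openai Python package also works with no changes beyond the base URL. A minimal streaming sketch follows; the api_key value is a placeholder, since the local server does not validate keys.

from openai import OpenAI

# Point the standard OpenAI client at the local Model Runner endpoint.
# The api_key is a placeholder; Model Runner does not check it.
client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed",
)

stream = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    stream=True,
    messages=[{"role": "user", "content": "Tell me about the history of computers"}],
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()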

Step 6: Demo Tool Calling

Simple Tool Calling Test

Note: as the compatibility list above shows, tool calling is model-dependent. A small quantized model like ai/llama3.2:1B-Q8_0 may ignore the tools and reply in plain text; if you see no tool calls, swap in a model from the supported list (for example, Qwen 3).

curl -N http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1B-Q8_0",
    "stream": true,
    "messages": [
      {"role": "user", "content": "What is the weather like in San Francisco? Use the weather tool."}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name"
              },
              "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  }'
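
Expected Output (illustrative):

If the model decides to call the tool, the stream carries tool_calls deltas instead of plain content. The exact JSON varies by model and llama.cpp version, but a chunk should look roughly like this:

data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_0","type":"function","function":{"name":"get_weather","arguments":"{\"location\": \"San Francisco\"}"}}]}}],...}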

Advanced Tool Calling with Python

import requests
import json

def demo_tool_calling():
    url = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
    
    tools = [
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Perform mathematical calculations",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string", 
                            "description": "Math expression to evaluate"
                        }
                    },
                    "required": ["expression"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "search_web",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }
    ]
    
    data = {
        "model": "ai/llama3.2:1B-Q8_0",
        "stream": True,
        "messages": [
            {
                "role": "user", 
                "content": "Calculate 25 * 17 + 89 and then search for information about Docker containers"
            }
        ],
        "tools": tools
    }
    
    print("🔧 Testing Tool Calling...")
    response = requests.post(url, json=data, stream=True)
    
    for line in response.iter_lines():
        if line and line.startswith(b'data: '):
            try:
                chunk = json.loads(line[6:])
                if 'choices' in chunk and chunk['choices']:
                    choice = chunk['choices'][0]
                    
                    # Handle regular content
                    if 'delta' in choice and 'content' in choice['delta']:
                        print(choice['delta']['content'], end='', flush=True)
                    
                    # Handle tool calls (note: in streaming mode, arguments can
                    # arrive fragmented across chunks; a real client should
                    # accumulate fragments by 'index' before executing anything)
                    if 'delta' in choice and 'tool_calls' in choice['delta']:
                        tool_calls = choice['delta']['tool_calls']
                        for tool_call in tool_calls:
                            print(f"\n🔧 Tool Call Detected: {tool_call}")
                            
            except json.JSONDecodeError:
                continue

demo_tool_calling()
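
Detecting a tool call is only half of the loop: your application must execute the function itself and send the result back so the model can produce a final answer. The sketch below shows that round trip without streaming, following the standard OpenAI tool-message convention; the eval-based calculator is a hypothetical stand-in for a real implementation (and eval should never see untrusted input).

import json
import requests

URL = "http://localhost:12434/engines/llama.cpp/v1/chat/completions"
MODEL = "ai/llama3.2:1B-Q8_0"

def tool_round_trip():
    messages = [{"role": "user", "content": "Calculate 25 * 17 + 89"}]
    tools = [{
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }]

    # First request: the model may reply with a tool call instead of text.
    reply = requests.post(URL, json={"model": MODEL, "messages": messages, "tools": tools}).json()
    message = reply["choices"][0]["message"]
    if not message.get("tool_calls"):
        print(message.get("content"))
        return

    # Execute each requested call locally and append the results as
    # role="tool" messages keyed by tool_call_id.
    messages.append(message)
    for call in message["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = str(eval(args["expression"]))  # demo only: 25 * 17 + 89 -> 514
        messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})

    # Second request: the model now sees the tool output and answers in text.
    final = requests.post(URL, json={"model": MODEL, "messages": messages, "tools": tools}).json()
    print(final["choices"][0]["message"]["content"])

tool_round_trip()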

With Docker Desktop 4.42, simply pull your favorite llama.cpp-compatible model and start experimenting with streaming responses and tool integrations. The familiar OpenAI API format means minimal code changes for existing applications.

Ready to upgrade your local AI development workflow? Docker Desktop 4.42 is available now.