How to Increase Context Window Size in Docker Model Runner with llama.cpp

Frustrated by tiny context windows when you know your model can handle so much more? If you're running llama.cpp through Docker Model Runner and hitting that annoying 4096-token wall, there's a simple fix. Your model isn't the problem; your configuration is.

The Problem: Limited Context Window Despite Model Capabilities

If you're running a large language model using Docker Model Runner (DMR) with llama.cpp, you might encounter a frustrating issue: your model supports a massive context window (like 131K tokens), but the API interface stubbornly limits you to just 4096 tokens. This artificial constraint can significantly impact your model's ability to handle long documents, extended conversations, or complex tasks requiring substantial context.

The good news? This is easily fixable with the right configuration.

Understanding the Context Window Limitation

The 4096-token limit is usually a default in the inference engine, not a limitation of the model itself. When Docker Model Runner starts your model with llama.cpp, it uses default parameters unless explicitly told otherwise, so even a model that can handle 131K tokens gets capped at a much lower value at runtime.
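
To make that concrete, here's a minimal sketch using the same example model identifier as the configurations below (substitute your own): a model defined without context_size simply inherits the engine's default window.

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    # No context_size here, so the llama.cpp backend falls back to its own
    # default window (commonly 4096 tokens), regardless of what the model
    # itself supports.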

Solution: Configure Context Size in Docker Compose

The primary way to increase your context window is by setting the context_size attribute in your compose.yaml file. This tells Docker Model Runner exactly how large a context window to allocate when starting your model.

Basic Configuration

Here's a simple example of how to set up your compose.yaml:

services:
  my-app:
    image: my-app-image
    models:
      - my_llm

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    context_size: 131000

Key points:

  • Replace ai/llama3.3:70B-Q4_K_M with your actual model identifier
  • Set context_size to match your model's maximum supported context (e.g., 131000 for models with 131K token windows)
  • Ensure the value doesn't exceed what your hardware can handle

Advanced Configuration with Runtime Flags

If you need more control or the basic configuration isn't working, you can explicitly pass the context size as a runtime flag to llama.cpp:

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    context_size: 131000
    runtime_flags:
      - "--ctx-size"
      - "131000"

This approach directly passes the --ctx-size parameter to the llama.cpp inference engine, giving you explicit control over the context window.
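
The same runtime_flags list can forward other llama.cpp server options alongside --ctx-size. Here's a hedged sketch: whether a given flag is available depends on the llama.cpp build your DMR version ships, --threads is a standard llama.cpp option, and the thread count is just an illustrative value.

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    context_size: 131000
    runtime_flags:
      # Context window, passed explicitly to llama.cpp
      - "--ctx-size"
      - "131000"
      # Illustrative extra flag: number of CPU threads to use
      - "--threads"
      - "8"

Keep the list in flag/value pairs, exactly as you would pass them on the llama.cpp command line.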

Prerequisites and Requirements

Before implementing this solution, ensure you have:

  1. Docker Compose v2.38.0 or later - The models top-level element in Compose files requires at least this version
  2. Sufficient VRAM - Larger context windows need more GPU memory; a 131K-token window can add 10 GB or more of VRAM on top of the model weights, depending on model size
  3. Compatible model - Verify your model actually supports the context size you're setting

Troubleshooting Common Issues

Still Seeing 4096 Token Limit?

If you're still constrained after updating your configuration:

  1. Verify Compose file is being used - Ensure Docker is actually reading your compose.yaml file
  2. Check Docker Compose version - Run docker compose version to confirm you're on v2.38.0+
  3. Restart the service - After changing configuration, rebuild and restart: docker compose up --build
  4. Check logs - Look for initialization messages that show the actual context size being used

Model Fails to Start

If your model won't start after increasing context size:

  1. Insufficient VRAM - The most common cause. Your GPU might not have enough memory for the larger context
  2. Reduce context size - Drop back toward the default and increase incrementally (e.g., 8192, 16384, 32768) to find what your hardware supports; see the sketch after this list
  3. Check system resources - Monitor GPU memory usage with tools like nvidia-smi
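
A simple way to find your ceiling is to step context_size up between restarts and check GPU memory after each change. A minimal sketch, using the same example model as above and purely illustrative values:

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    # Step this value up between restarts (e.g., 8192 -> 16384 -> 32768)
    # and stop at the largest size that still leaves VRAM headroom.
    context_size: 16384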

Performance Degradation

Larger context windows use more memory and can slow down inference:

  • Start with a moderate increase (e.g., 32K instead of 131K) and scale up as needed
  • Only use what you actually need for your use case
  • Consider the trade-off between context size and inference speed

Best Practices

  1. Match your use case - Don't always max out the context window. Use 32K for most conversations, 64K for document analysis, and 131K only when truly needed
  2. Monitor resources - Keep an eye on VRAM usage to avoid out-of-memory errors
  3. Test incrementally - Start with smaller increases and scale up to ensure stability
  4. Document your configuration - Note the chosen context size in comments in your compose file for future reference, as in the example after this list
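
For that last point, a comment next to the setting is usually enough. A sketch with illustrative numbers only (the GPU size and prompt-length rationale here are assumptions for the example, not measurements):

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    # 32768 chosen instead of the model's 131K maximum: fits our 24 GB GPU
    # with headroom and covers our longest prompts. Revisit if prompts grow.
    context_size: 32768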

Conclusion

Increasing the context window in Docker Model Runner is straightforward once you know where to configure it. By setting the context_size parameter in your compose.yaml file, you can unlock your model's full potential and handle much larger contexts than the default 4096 tokens.

Remember that hardware limitations, particularly VRAM, are the real bottleneck for large context windows. Start conservatively, test thoroughly, and scale up based on your actual needs and available resources.

Further Resources:

  • DMR examples - Example projects and CI/CD workflows for Docker Model Runner