How to Increase Context Window Size in Docker Model Runner with llama.cpp
Frustrated by tiny context windows when you know your model can handle so much more? If you're running llama.cpp through Docker Model Runner and hitting that annoying 4096 token wall, there's a simple fix you need to know about. Your model isn't the problem—your configuration is.

The Problem: Limited Context Window Despite Model Capabilities
If you're running a large language model using Docker Model Runner (DMR) with llama.cpp, you might encounter a frustrating issue: your model supports a massive context window (like 131K tokens), but the API interface stubbornly limits you to just 4096 tokens. This artificial constraint can significantly impact your model's ability to handle long documents, extended conversations, or complex tasks requiring substantial context.
The good news? This is easily fixable with the right configuration.
Understanding the Context Window Limitation
The 4096 token limit is often a default setting in the inference engine, not a limitation of your model itself. When Docker Model Runner starts your model with llama.cpp, it uses default parameters unless explicitly told otherwise. This means even though your model can handle 131K tokens, the runtime environment caps it at a much lower value.
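To see the cap for yourself before changing anything, you can send an oversized request to Docker Model Runner's OpenAI-compatible API. Here's a minimal sketch in Python; the base URL, port, and model identifier are assumptions about a typical host-side setup, so adjust them to match yours:

```python
# Probe the effective context limit by sending a deliberately long prompt.
# Assumptions: DMR's OpenAI-compatible API is reachable on the host at the
# URL below, and the model identifier matches the one you pulled.
import requests

BASE_URL = "http://localhost:12434/engines/v1"   # assumption: adjust to your setup
MODEL = "ai/llama3.3:70B-Q4_K_M"                 # assumption: your model identifier

# ~45,000 characters, roughly 10,000+ tokens -- well past a 4096-token window.
long_prompt = "The quick brown fox jumps over the lazy dog. " * 1000

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": long_prompt + "\n\nSummarize the text above in one sentence."}],
        "max_tokens": 64,
    },
    timeout=600,
)

# With the default 4096-token window the server typically rejects or truncates
# this request; after raising context_size it should complete normally.
print(resp.status_code)
print(resp.json())
```

Once you've raised the context size (next section) and restarted, the same request should go through cleanly.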
Solution: Configure Context Size in Docker Compose
The primary way to increase your context window is by setting the `context_size` attribute in your `compose.yaml` file. This tells Docker Model Runner exactly how large a context window to allocate when starting your model.

Basic Configuration
Here's a simple example of how to set up your `compose.yaml`:
```yaml
services:
  my-app:
    image: my-app-image
    models:
      - my_llm

models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    context_size: 131000
```
Key points:
- Replace `ai/llama3.3:70B-Q4_K_M` with your actual model identifier
- Set `context_size` to match your model's maximum supported context (e.g., 131000 for models with 131K token windows)
- Ensure the value doesn't exceed what your hardware can handle
Advanced Configuration with Runtime Flags
If you need more control or the basic configuration isn't working, you can explicitly pass the context size as a runtime flag to llama.cpp:
```yaml
models:
  my_llm:
    model: ai/llama3.3:70B-Q4_K_M
    context_size: 131000
    runtime_flags:
      - "--ctx-size"
      - "131000"
```
This approach passes the `--ctx-size` parameter directly to the llama.cpp inference engine, giving you explicit control over the context window.
Prerequisites and Requirements
Before implementing this solution, ensure you have:
- Docker Compose v2.38.0 or later - Model support in Docker Compose requires this version or newer
- Sufficient VRAM - Larger context windows require more GPU memory. A 131K context window can require 10GB+ of VRAM depending on your model size (see the estimate sketch after this list)
- Compatible model - Verify your model actually supports the context size you're setting
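The VRAM figure above is easier to reason about with a quick estimate of the KV cache, which is the part that grows with the context window. A rough sketch, assuming a Llama-3.x-70B-style architecture (80 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 KV cache; it ignores model weights and runtime overhead, and quantized KV caches need proportionally less:

```python
# Rough KV-cache memory estimate: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element. Architecture numbers below are
# assumptions for a Llama-3.x-70B-style model; check your model card.
def kv_cache_bytes(context_length, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_elem

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(ctx) / 1024**3
    print(f"context {ctx:>7}: ~{gib:.1f} GiB of KV cache")
# context    4096: ~1.2 GiB of KV cache
# context   32768: ~10.0 GiB of KV cache
# context  131072: ~40.0 GiB of KV cache
```

The exact numbers vary by model, but the linear growth with context length is the point: quadrupling the window quadruples the KV-cache memory.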
Troubleshooting Common Issues
Still Seeing 4096 Token Limit?
If you're still constrained after updating your configuration:
- Verify the Compose file is being used - Ensure Docker is actually reading your `compose.yaml` file
- Check your Docker Compose version - Run `docker compose version` to confirm you're on v2.38.0+
- Restart the service - After changing the configuration, rebuild and restart with `docker compose up --build`
- Check the logs - Look for initialization messages that show the actual context size being used; llama.cpp reports the active `n_ctx` value when it creates the context
Model Fails to Start
If your model won't start after increasing context size:
- Insufficient VRAM - The most common cause. Your GPU might not have enough memory for the larger context
- Reduce the context size - Step back down and increase incrementally from 4096 (e.g., 8192, 16384, 32768) to find the largest value your hardware supports
- Check system resources - Monitor GPU memory usage with tools like `nvidia-smi` (a small polling sketch follows below)
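For that last check, a small polling loop is often more convenient than re-running `nvidia-smi` by hand while the model loads. A minimal sketch, assuming an NVIDIA GPU with `nvidia-smi` on the PATH:

```python
# Poll GPU memory usage once per second while the model starts up.
# Assumes an NVIDIA GPU and that nvidia-smi is available on the PATH.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    for i, line in enumerate(out.splitlines()):
        used, total = (int(x) for x in line.split(","))
        print(f"GPU {i}: {used} / {total} MiB used")
    time.sleep(1)
```

If usage climbs to the card's limit and the model then exits, the context size is too large for your hardware.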
Performance Degradation
Larger context windows use more memory and can slow down inference:
- Start with a moderate increase (e.g., 32K instead of 131K) and scale up as needed
- Only use what you actually need for your use case
- Consider the trade-off between context size and inference speed; a rough timing sketch follows this list
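To put numbers on that trade-off for your own hardware, you can time a short completion at a few prompt lengths once the larger context is in place. A rough sketch against the same OpenAI-compatible endpoint as the earlier probe (URL and model identifier are again assumptions):

```python
# Time a short completion at increasing prompt lengths to see how prompt
# processing cost grows with context. Run this after raising context_size.
import time
import requests

BASE_URL = "http://localhost:12434/engines/v1"   # assumption: adjust to your setup
MODEL = "ai/llama3.3:70B-Q4_K_M"                 # assumption: your model identifier
filler = "The quick brown fox jumps over the lazy dog. "  # ~11 tokens per repeat

for repeats in (100, 1000, 2000):
    prompt = filler * repeats + "\n\nReply with the single word: done."
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": MODEL,
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": 8},
        timeout=1200,
    )
    print(f"~{repeats * 11:>6} prompt tokens: {time.time() - start:.1f}s "
          f"(HTTP {resp.status_code})")
```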
Best Practices
- Match your use case - Don't always max out the context window. Use 32K for most conversations, 64K for document analysis, and 131K only when truly needed (a rough way to estimate what your input needs is sketched after this list)
- Monitor resources - Keep an eye on VRAM usage to avoid out-of-memory errors
- Test incrementally - Start with smaller increases and scale up to ensure stability
- Document your configuration - Note the context size in your compose file comments for future reference
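For the first of those points, a character-based estimate is usually enough to pick a tier. A rough sketch using the common ~4 characters per token rule of thumb for English text (the real count depends on your model's tokenizer, and the file name here is just a placeholder):

```python
# Rough token estimate using the ~4 characters per token rule of thumb for
# English text; the real count depends on the model's tokenizer.
from pathlib import Path

def estimate_tokens(text: str) -> int:
    return len(text) // 4

doc = Path("my_document.txt").read_text(encoding="utf-8")  # hypothetical input file
tokens = estimate_tokens(doc)
print(f"~{tokens} tokens; leave headroom for the system prompt and the response.")
```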
Conclusion
Increasing the context window in Docker Model Runner is straightforward once you know where to configure it. By setting the `context_size` parameter in your `compose.yaml` file, you can unlock your model's full potential and handle much larger contexts than the default 4096 tokens.
Remember that hardware limitations, particularly VRAM, are the real bottleneck for large context windows. Start conservatively, test thoroughly, and scale up based on your actual needs and available resources.