A Common Ollama Mistake May Hamper AI Performance in Windows 11 – Here's the Fix
Ollama, a popular local AI deployment platform, offers users the flexibility to customise the context length of their AI models, providing a more tailored experience based on specific use cases and hardware capabilities.
Changing Context Length in Ollama
To adjust the context length in Ollama, you can utilize either the graphical user interface (GUI) or the command-line interface (CLI).
Via GUI
Within the Ollama settings, you'll find a slider that allows you to switch between preset values such as 4k, 8k, and up to 128k tokens. While this method is straightforward, it is limited to fixed intervals.
Via CLI
For precise control and the ability to save the context length setting permanently, you can use Ollama's command-line interface. To set the context length, follow these steps:
- Run the model in Ollama's command-line interface, for example with `ollama run llama3`.
- Set the context length with the `/set parameter num_ctx` command. For example, to set 8,192 tokens: `/set parameter num_ctx 8192`.
- After setting the context length, save this as a new model version with `/save <new-model-name>` so the setting persists across sessions.
This approach allows you to keep multiple versions with different context sizes for various tasks.
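The same persistent setup can also be scripted non-interactively with a Modelfile, Ollama's declarative model definition format. A minimal sketch, in which the base model `llama3` and the new name `llama3-8k` are placeholders you would replace with your own:

```
# Modelfile – derive a model variant with a larger context window
FROM llama3
PARAMETER num_ctx 8192
```

Build it with `ollama create llama3-8k -f Modelfile`; afterwards, `ollama run llama3-8k` starts with the 8,192-token context already applied, which is convenient for keeping several variants side by side.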
Considerations for Choosing Context Length
When deciding on the appropriate context length, consider the following factors:
- Longer contexts enable handling larger inputs or longer conversations but increase memory and computational costs.
- Extremely long contexts (100k+ tokens) require compatible models and potentially specialized hardware.
- Fine-tuning or model architecture may affect the maximum usable context length, as seen with GPT-4.1 fine-tuned models limiting context below base model capacities.
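The memory cost mentioned in the first point above can be made concrete with a back-of-the-envelope estimate of the key/value cache that grows with context length. This sketch assumes a Llama-3-8B-like architecture (32 layers, 8 KV heads, head dimension 128, fp16 values); the formula and defaults are illustrative assumptions, not Ollama internals:

```python
def kv_cache_bytes(ctx_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Rough size of the KV cache for a given context length.

    Each token stores one key and one value vector (factor 2) per
    layer, each of size n_kv_heads * head_dim * dtype_bytes.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * ctx_len

# With these assumed parameters, 8k of context costs about 1 GiB of
# cache on top of the model weights, while 128k costs about 16 GiB.
for ctx in (4096, 8192, 131072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

This is why a context slider jump from 8k to 128k can silently push a model that fit comfortably in VRAM onto the CPU.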
Performance and Context Length
The performance report generated after a response details a number of metrics, including the eval rate, which measures generation speed in tokens per second. Context length also affects GPU utilization in Ollama: reducing the context length shrinks the model's memory footprint, so more of it can stay in VRAM, which can increase GPU utilization and throughput.
Balancing Context Length and Performance
The optimal context length balances your workload requirements against resource constraints. For tasks involving processing very long documents, consider increasing the context length towards the higher end (e.g., 8k to 128k tokens) if supported by your model and hardware in Ollama. For smaller or latency-sensitive tasks, lower context lengths improve performance.
Monitoring CPU and GPU Usage
To see the split of CPU and GPU usage for a model in Ollama, exit the interactive session (with the `/bye` command) and run `ollama ps` in the terminal; the PROCESSOR column shows how the loaded model is divided between CPU and GPU. The performance report generated after a response also provides insights into GPU utilization.
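The CPU/GPU split reported by `ollama ps` can likewise be read by a script. The sample output below is an illustrative assumption about the command's tabular format; column spacing and values will differ on your machine:

```python
# Illustrative stand-in for captured `ollama ps` output.
SAMPLE_PS_OUTPUT = """\
NAME            ID              SIZE      PROCESSOR          UNTIL
llama3:latest   365c0bd3c000    6.7 GB    48%/52% CPU/GPU    4 minutes from now
"""

def processor_split(ps_output: str) -> dict[str, str]:
    """Map each loaded model name to its PROCESSOR column value.

    Uses the header's column positions to slice each data row, since
    the PROCESSOR value itself contains spaces.
    """
    lines = ps_output.strip().splitlines()
    header, rows = lines[0], lines[1:]
    start = header.index("PROCESSOR")
    end = header.index("UNTIL")
    return {row.split()[0]: row[start:end].strip() for row in rows}

print(processor_split(SAMPLE_PS_OUTPUT))
# -> {'llama3:latest': '48%/52% CPU/GPU'}
```

A split like 48%/52% means roughly half the model's layers spilled out of VRAM; lowering `num_ctx` is one way to pull more of them back onto the GPU.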
Conclusion
With Ollama's flexible options for adjusting the context length, using the GUI for presets and the CLI for fine control and persistent saving via the `num_ctx` parameter, users can optimise their AI models for various tasks. By carefully weighing context length against its impact on performance, users can make informed decisions and achieve the best results for their specific needs.