Ollama 0.2 has been released! This update brings a significant enhancement to the platform: concurrency is now enabled by default.
Unveiling Two Major Features: Parallel Requests and Multiple Model Support
Parallel requests: With concurrency enabled, Ollama can now serve multiple requests at the same time, at the cost of only a small amount of additional memory per request. This enables use cases such as the following (a short client sketch follows the list):
- Chat Sessions: Handle multiple chat sessions at once, keeping every conversation responsive.
- Code Completion LLMs: Host a code completion model for your team, so requests from different developers are served in parallel.
- Document Processing: Split large documents into parts and process them simultaneously.
- Multiple Agents: Run multiple agents at once for more complex and diverse AI behavior.
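To illustrate, here is a minimal sketch of a client issuing parallel requests against Ollama's REST API, using Python's standard thread pool and the `requests` package. It assumes a local server on the default port and that a `llama3` model has already been pulled; substitute whichever model you use.

```python
# A minimal sketch of parallel requests against a local Ollama server.
# Assumptions: server on the default port, `llama3` already pulled.
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str) -> str:
    """Send a single non-streaming generation request."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "Explain what a mutex is.",
    "Write a haiku about the ocean.",
]

# With concurrency enabled, the server can process these prompts
# together instead of strictly one after another.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(generate, prompts)):
        print(f"{prompt}\n-> {answer}\n")
```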
Run Multiple Models: Ollama 0.2 can also load several different models at the same time. This improves use cases such as the following (a RAG-style sketch follows the list):
- Retrieval Augmented Generation (RAG): Keep both the embedding model and the text completion model in memory, so alternating between retrieval and generation no longer forces model swaps.
- Agents: Run multiple agents concurrently, each backed by its own model, for more diverse behavior and better problem solving.
- Running Large and Small Models Side-by-Side: Pair a large model for hard tasks with a small, fast model for simple ones, balancing quality against latency and memory.
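As a concrete illustration of the RAG case, the sketch below keeps an embedding model and a text completion model loaded side by side. The model names (`nomic-embed-text`, `llama3`) and the inline context string are assumptions for the example; in a real pipeline the context would come from your vector search.

```python
# A minimal RAG-flavored sketch: with multiple loaded models, the
# embedding model and the text model stay resident side by side.
# Model names here are assumptions; use the ones you have pulled.
import requests

BASE = "http://localhost:11434"

def embed(text: str) -> list[float]:
    """Embed text with a dedicated embedding model."""
    resp = requests.post(
        f"{BASE}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def generate(prompt: str) -> str:
    """Generate a completion with a separate text model."""
    resp = requests.post(
        f"{BASE}/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Embed the query, then (after your vector search, elided here) ask
# the text model to answer using the retrieved context. Neither call
# evicts the other model from memory.
query_vector = embed("What were Q3 revenues?")
context = "Q3 revenue was $12.4M, up 8% quarter over quarter."  # stand-in for retrieved text
print(generate(f"Answer using this context:\n{context}\n\nQuestion: What were Q3 revenues?"))
```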
For added convenience, Ollama automatically loads and unloads models based on the requests received and the available GPU memory. This ensures optimal performance and resource utilization in any given scenario.
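If you want to observe or tune this behavior, the server exposes a `/api/ps` endpoint that lists the currently loaded models, and environment variables such as `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` can override the defaults. The sketch below (same assumptions as above: a local server with both models pulled) touches two models and then asks the server what is resident.

```python
# A small sketch for peeking at the scheduler: touch two models, then
# ask /api/ps which models are currently loaded. Model names are
# assumptions; use ones you have pulled.
import requests

BASE = "http://localhost:11434"

# Hitting each model makes the server load it on demand.
requests.post(f"{BASE}/api/generate",
              json={"model": "llama3", "prompt": "ping", "stream": False},
              timeout=120)
requests.post(f"{BASE}/api/embeddings",
              json={"model": "nomic-embed-text", "prompt": "ping"},
              timeout=60)

# /api/ps reports what is resident right now; models drop off the list
# when they are evicted to free memory or their keep-alive expires.
for entry in requests.get(f"{BASE}/api/ps", timeout=10).json().get("models", []):
    print(entry["name"])
```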
Experience the Future of AI with Ollama 0.2
Ollama 0.2 lets you tackle complex tasks more efficiently than before. Upgrade today and take advantage of parallel requests and multiple loaded models!