Llama 3.3 70B offers performance comparable to the much larger Llama 3.1 405B model while requiring far less VRAM; for example, it runs well on an M4 Mac with 64 GB of unified memory at around 10 tokens/s.
https://ollama.com/library/llama3.3
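To try it locally, here is a minimal sketch using the official `ollama` Python client (`pip install ollama`); it assumes the Ollama server is running and the model has been pulled with `ollama pull llama3.3`. The prompt and streaming setup are illustrative.

```python
from ollama import chat

# Stream the response token by token, which makes the
# generation speed (tokens/s) easy to observe directly.
stream = chat(
    model="llama3.3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
```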
Until now, our prompts had to explicitly ask the model to produce e.g. JSON, usually by including an example of the expected format in the prompt, and in some cases this still generated invalid output. With structured outputs, the expected format becomes part of the request payload itself, with much better support for constraining the output.
https://ollama.com/blog/structured-outputs
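A minimal sketch of structured outputs, along the lines of the linked post: a JSON schema travels in the `format` field of the request instead of being described in the prompt, so the model is constrained to emit matching JSON. The `Country` schema and prompt here are illustrative; it requires `pip install ollama pydantic`.

```python
from ollama import chat
from pydantic import BaseModel

# Pydantic model describing the shape of the output we want.
class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]

response = chat(
    model="llama3.3",
    messages=[{"role": "user", "content": "Tell me about Canada."}],
    # Pass the JSON schema in the request payload; no format
    # instructions or examples needed in the prompt itself.
    format=Country.model_json_schema(),
)

# Validate the returned JSON against the same schema.
country = Country.model_validate_json(response.message.content)
print(country)
```

Validating the response with the same Pydantic model closes the loop: if the model somehow deviates from the schema, parsing fails loudly instead of propagating malformed data.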