Llama 4: Did Meta just push the panic button?
Interconnects - A podcast by Nathan Lambert
 
https://www.interconnects.ai/p/llama-4

Where Llama 2’s and Llama 3’s releases were arguably some of the top few events in AI for their respective release years, Llama 4 feels entirely lost. Meta has attempted to reinvent their formula of models with substantial changes in size, architecture, and personality, but a coherent narrative is lacking. Meta has fallen into the trap of taking too long to ship, so the bar is impossible to cross successfully.

Looking back at the history of Meta’s major open models, the sequence is as follows:

* OPT – Released May 3, 2022 (ai.meta.com | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B): A foundational open model that is underrated in the arc of language modeling research.
* LLaMA – Released February 24, 2023 (ai.meta.com | 7B, 13B, 33B, 65B): The open weight model that powered the Alpaca age of early open chat models.
* Llama 2 – Released July 18, 2023 (our coverage | about.fb.com | 7B, 13B, 70B): The open standard for academic research for its time period. Chat version had some bumps, but overall a major win.
* Llama 3 – Released April 18, 2024 (our coverage | ai.meta.com | 8B, 70B): The open standard for its time. Again, fantastic base models.
* Llama 3.1 – Released July 23, 2024 (our coverage | ai.meta.com | 8B, 70B, 405B): Much improved post training and the 405B marked the first time an open weight model competed with GPT-4!
* Llama 3.2 – Released September 25, 2024 (our coverage | ai.meta.com | 1B, 3B, 11B, 90B): A weird, very underperforming vision release, outshined by Molmo on the same day.
* Llama 3.3 – Released December 6, 2024 (github.com | 70B): Much improved post-training of the smaller 3.1 models, likely in response to other open releases, but largely a minor update.
* Llama 4 – Released April 5, 2025 (ai.meta.com | 17A109B, 17A400B): What we got today.

The time between major versions is growing, and the number of releases seen as exceptional by the community is dropping.
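
As a quick check on the claim that the gap between major versions is growing, here is a minimal sketch that computes the days between releases from the dates listed above. Which releases count as "major" (LLaMA, Llama 2, Llama 3, Llama 4, skipping the point releases) is my assumption for illustration, and the helper name `gaps_in_days` is hypothetical.

```python
from datetime import date

# Release dates as listed above. Treating only these four as "major
# versions" is an assumption made for this illustration.
RELEASES = [
    ("LLaMA", date(2023, 2, 24)),
    ("Llama 2", date(2023, 7, 18)),
    ("Llama 3", date(2024, 4, 18)),
    ("Llama 4", date(2025, 4, 5)),
]


def gaps_in_days(releases):
    """Return (previous, next, days between) for each consecutive pair."""
    return [
        (prev_name, next_name, (next_date - prev_date).days)
        for (prev_name, prev_date), (next_name, next_date) in zip(releases, releases[1:])
    ]


if __name__ == "__main__":
    for prev_name, next_name, days in gaps_in_days(RELEASES):
        print(f"{prev_name} -> {next_name}: {days} days")
```

With these dates the gaps come out to 144, 275, and 352 days, so each major version took longer to arrive than the one before it.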
Llama 4 consists of three models. Quoting from the blog post, with my notes in brackets:

* Llama 4 Scout, a 17 billion active parameter model with 16 experts [and 109B total parameters, ~40T training tokens], is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU.
* Llama 4 Maverick, a 17 billion active parameter model with 128 experts [and 400B total parameters, ~22T training tokens].
* These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter [and 2T total parameters] model with 16 experts that is our most powerful yet and among the world’s smartest LLMs…. we’re excited to share more details about it even while it’s still in flight.

Here are the reported benchmark scores for the first two models, which are available on many APIs and to download on Hugging Face.

Where Llama models used to be scaled across different sizes with almost identical architectures, these new models are designed for very different classes of use-cases.

* Llama 4 Scout is similar to a Gemini Flash model or any ultra-efficient inference MoE.
* Llama 4 Maverick’s architecture is very similar to DeepSeek V3 with extreme sparsity and many active experts.
* Llama 4 Behemoth is likely similar to Claude Opus or Gemini Ultra, but we don’t have substantial information on these.

This release came on a Saturday, which is utterly bizarre for a major company launching one of its highest-profile products of the year. The consensus was that Llama 4 was going to come at Meta’s LlamaCon later this month.
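
As an aside on the sizes quoted above, here is a rough sketch of what the active and total parameter counts mean for memory. It only counts the weights and ignores the KV cache and activations. The 80 GB H100 capacity and the bit widths tried are my assumptions, not figures from Meta's post, and `MoEModel` and `fits_on_one_h100` are hypothetical names for illustration.

```python
from dataclasses import dataclass

# Assumption: an H100 with 80 GB of memory (the post does not state this).
H100_MEMORY_GB = 80.0


@dataclass(frozen=True)
class MoEModel:
    name: str
    active_params_b: float  # billions of parameters used per token
    total_params_b: float   # billions of parameters that must sit in memory

    def active_fraction(self) -> float:
        """Share of the weights that are used for any single token."""
        return self.active_params_b / self.total_params_b

    def weight_gb(self, bits_per_weight: int) -> float:
        """Weights-only footprint: billions of params * bytes per param = GB."""
        return self.total_params_b * bits_per_weight / 8


def fits_on_one_h100(model: MoEModel, bits_per_weight: int) -> bool:
    """Weights-only check; real deployments also need room for the KV cache."""
    return model.weight_gb(bits_per_weight) <= H100_MEMORY_GB


# Parameter counts as quoted above (Behemoth's 2T written as 2000B).
SCOUT = MoEModel("Llama 4 Scout", 17, 109)
MAVERICK = MoEModel("Llama 4 Maverick", 17, 400)
BEHEMOTH = MoEModel("Llama 4 Behemoth", 288, 2000)

if __name__ == "__main__":
    for model in (SCOUT, MAVERICK, BEHEMOTH):
        print(f"{model.name}: {model.active_fraction():.1%} of weights active per token")
        for bits in (16, 8, 4):
            verdict = "fits" if fits_on_one_h100(model, bits) else "does not fit"
            print(f"  {bits}-bit: {model.weight_gb(bits):.1f} GB, {verdict} on one H100")
```

Under these assumptions, Scout's weights take about 218 GB at 16-bit, 109 GB at 8-bit, and 54.5 GB at 4-bit, so only the 4-bit version fits on a single 80 GB card. The mixture of experts cuts compute per token (17B active parameters), but every one of the total parameters still has to be held in memory.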
The release even appears to have been pulled forward from today, the 7th, judging by a commit in the Meta Llama GitHub.

One of the flagship features is the 10M token context window on Scout, the smallest model (Maverick’s is 1M), but even that came with no released evaluations beyond Needle in a Haystack (NIAH), which is seen as a necessary but not sufficient condition for calling something a good long-context model. More modern long-context evaluations include RULER and NoLiMa.

Many, many people have commented on how Llama 4’s behavior is drastically different in LMArena (which was their flagship result of the release) than on other providers, even when following Meta’s recommended system prompt. Turns out, from the blog post, that it is just a different model:

“Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.”

Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models maximize their ChatBotArena scores while destroying the model’s performance on important skills like math or code. We’ll see where the released models land.

Regardless, here’s the plot Meta used. Look at the fine print at the bottom too.

This model is actually the one tanking the technical reputation of the release because its character is juvenile. The actual model on other hosting providers is quite smart and has a reasonable tone!

ArtificialAnalysis rated the models as “some of the best non-reasoning models,” beating leading frontier models. This is complicated because we shouldn’t separate reasoning from non-reasoning models; we should just evaluate on reasoning and non-reasoning domains separately, as discussed in the Gemini 2.5 post.
So-called “reasoning models” often top non-reasoning benchmarks, but the opposite is rarely true.

Other independent evaluation results range from medium to bad and confusing; I suspect the very weird results are hosting issues with the very long context models. At the same time, the Behemoth model is outclassed by Gemini 2.5 Pro. To list some of the major technical breakthroughs that Meta made (i.e. new to Llama, not new to the industry):

* Mixture of expert architectures, enabling Llama 4 to be trained with less compute than Llama 3 even though they have more total parameters (a lot more).
* Very long context up to 10M tokens.
* Solid multimodal input performance on release day (and not a later model).

Sadly this post is barely about the technical details. Meta nuked their release vibes with weird timing and by making an off-putting, chatty model the easiest one to find and talk to. The release process, timing, and big picture raise more questions for Meta. Did they panic and feel like this was their one shot at being state of the art?

The evaluation scores for the models are solid; they clear a fairly high bar. With these highly varied MoE architectures, it’s super hard to feel confident in an assessment of the model based on benchmarks, especially when compared to dense models or teacher-student distilled models. The very-long-context base models will be extremely useful for research.

The question here is: Why is Meta designing their models in the same way as other frontier labs when their audience is open-source AI communities and businesses, not an API serving business or ChatGPT competitor?

The model sizing for the likes of Gemini and ChatGPT is downstream of nuanced decisions based on a balance of training cluster size, inference needs, and performance trade-offs.
These trade-offs are very different for open models, where you don’t pay for inference and many users are not hyperscale companies.

The model that becomes the “open standard” doesn’t need to be the best overall model, but rather a family of models in many shapes and sizes that is solid in many different deployment settings. Qwen 2.5, with models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, is the closest to this right now. There’s actually far less competition in this space than in the space Meta chose to go into (and take on DeepSeek)!

One of these communities historically has been the LocalLlama subreddit, which named its entire community for running models at home after the Llama series of models; they’re not happy with Llama 4. Another community is academics, where the series of models across different size ranges is wonderful for understanding language models and improving methods. Both groups are GPU-poor, so memory-intensive models like these sparse mixtures of experts price out even more participants in the open community (who tend to be memory-limited).

This is all on top of an onerous license that requires every artifact that uses Llama in the process to carry the “Llama-” name, the Llama license, the “Built with Llama” branding if used commercially, and use-case restrictions. This comes at the same time that their competitors, namely DeepSeek, released their latest flagship model under an MIT license (which has no downstream restrictions).

A third group is potential businesses looking to use open models on-premises as open models close the gap to closed counterparts. These feel like groups that would be sensitive to the extra legal risk that Llama’s license exposes them to.

On top of all of this weirdness, many of Meta’s “open-source” efforts are restricted in the European Union.
Where the Llama 3.2 models blocked you if you tried to access them from Europe, Llama 4 is available for download but prohibits the use of vision capabilities in an acceptable use policy. This is not entirely Meta’s fault, as many companies are dealing with side effects of the EU AI Act, but regulatory exposure needs to be considered in Meta’s strategy.

Meta had a tight grasp on these communities and the Llama projects were rightfully loved, but now they feel lost. With Qwen 3 around the corner and countless other amazing open-weight models out now (and many more teased, such as from OpenAI), the competition is extreme.

The soul of the Llama series died by not releasing enough models frequently enough. Reclaiming that with GenAI’s constant organizational headaches looks like a Sisyphean task. What is Meta’s differentiation in the AI space? It still seems to be about enabling their own platforms to flourish, not about truly supporting open.

Meta’s GenAI organization has been showing major signs of cultural challenges throughout its entire existence, including their head of AI research leaving just a few days before this model was launched.

Sadly, the evaluations for this release aren’t even the central story. The vibes have been off since the beginning, starting with the choice of a weird release date. Over the coming weeks, more and more people will find reliable uses for Llama 4, but in a competitive landscape, that may not be good enough. Llama is no longer the open standard. Personally, this makes me sad. As an American, I want the default pieces of the open ecosystem to be run by American or American-friendly companies.

With the macro pressure coming to Meta’s business and the increasing commoditization of open models, how is Zuckerberg going to keep up in the face of shareholder pressure pushing back against the cost of the Llama project? This isn’t the first time he’s done so, but he needs to reevaluate the lowest-level principles of their approach to open AI.
