MENLO PARK, 10 MAY 2026 — Meta has put open-source AI on a collision course with the proprietary frontier, releasing Llama 4 Scout with a 10-million-token context window and Llama 4 Maverick with a multimodal benchmark sheet that beats GPT-4o and matches GPT-5.5 Instant on several head-to-head evaluations, while pricing the weights at zero.
Key Takeaways
- Llama 4 Scout is a 17-billion-active-parameter mixture-of-experts model with 16 experts and a context window of 10 million tokens — an order of magnitude longer than any other openly downloadable model.
- Llama 4 Maverick also has 17 billion active parameters but spreads them across 128 experts, trading a heavier memory footprint for stronger benchmark scores in multimodal reasoning, coding, and complex instruction following.
- Maverick beats GPT-4o and Gemini 2.0 Flash on multiple multimodal benchmarks and lands within 2% of Claude Opus 4 on retrieval-augmented generation evaluations.
- Llama 4 Behemoth — the larger frontier-class flagship — was previewed but not released; weights remain undisclosed.
- A controversy over Meta's initial LMArena scores has dented but not derailed the launch, with Meta acknowledging that the benchmarked variant was an "experimental chat version" not identical to the released weights.
The Facts
Meta released Llama 4 in early May 2026 at the company's LLM Summit, presenting it as the "Llama 4 herd" — a family of models built on a shared mixture-of-experts (MoE) architecture but tuned for different use profiles. Meta's own announcement post frames the release as the moment open-source weights catch up with closed frontier APIs on the metrics that matter to most builders.
The two models released with full weights are Scout and Maverick. Both activate roughly 17 billion parameters per token, but they differ in their total parameter pools and routing behaviour. Scout uses 16 experts; Maverick uses 128. In practical terms, Scout is the easier model to run on a single high-end GPU; Maverick demands more memory across the expert pool and is better suited to multi-GPU or hosted deployment.
The standout specification is Scout's context window. At 10 million tokens, Scout offers roughly 5 to 10 times the context length of any other openly downloadable model, and beats most closed APIs on the same axis. To put that in concrete terms: 10 million tokens is enough to hold a corpus of approximately 7.5 million words — the rough text content of 80 novels — in a single prompt. The technical achievement comes from architectural changes Meta refers to as iRoPE (interleaved Rotary Position Embeddings) combined with attention efficiency techniques that keep memory growth sub-quadratic at long contexts.
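The word and novel figures above follow from a simple ratio. A minimal sketch of the arithmetic, assuming roughly 0.75 English words per token (the ratio implied by the 10-million-token, 7.5-million-word figure; real tokenizers vary by language and content). The function names are illustrative.

```python
# Back-of-envelope conversion. The 0.75 words-per-token ratio is an assumption
# implied by the article's figures, not a property of any specific tokenizer.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate how many English words fit in a given token budget."""
    return int(tokens * words_per_token)

def words_per_document(total_words: int, documents: int) -> float:
    """Average words per document when a word budget is split evenly."""
    return total_words / documents

context_tokens = 10_000_000
words = tokens_to_words(context_tokens)
print(words)                          # 7500000
print(words_per_document(words, 80))  # 93750.0
```

At about 94,000 words each, the "80 novels" comparison corresponds to full-length novels rather than short ones.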
Maverick targets a different goal: peak benchmark performance per active parameter. Meta's benchmark sheet shows Maverick beating GPT-4o and Gemini 2.0 Flash across several widely-reported multimodal evaluations. On RAG benchmarks, third-party analysis places Maverick within 2% of Claude Opus 4 performance — a result that, if it holds up in independent reproduction, would be the first time an open-source model has landed within striking distance of Anthropic's flagship at the same task.
The third member of the family — Llama 4 Behemoth — was demoed but not released. Meta described Behemoth as a teacher model: larger, more capable, and used during training to distill knowledge into Scout and Maverick. Whether the Behemoth weights will ever be released is an open question. Meta's previous releases have followed a pattern of holding back the largest variants while releasing the more efficient distilled versions, citing safety review and operational concerns.
A controversy has accompanied the launch. As several independent reviews have pointed out, Meta's initial LMArena claims for Llama 4 were based on a model variant the company labelled "experimental chat" — a tuned-for-leaderboard version that differs from the publicly released weights. After the discrepancy was flagged by the AI community, LMArena adjusted its scoring policies and Meta acknowledged the difference. The released Scout and Maverick weights underperform the experimental variant on the LMArena leaderboard, although they remain competitive on most other benchmarks where the comparison is to a fixed test set rather than a human-preference arena.
Technical Deep-Dive
The mixture-of-experts architecture is the foundation of Llama 4's competitive position. Where a dense model must activate all of its parameters for every forward pass, an MoE model routes each token through a small subset of specialist sub-networks (the "experts") and uses a learned router to pick which experts handle which inputs. Scout's 17B-active, 16-expert configuration means that only a fraction of the total parameter pool is engaged for any one token. That fraction is larger than 1/16 because attention layers and shared components run for every token: against the roughly 109 billion total parameters Meta lists for Scout, 17 billion active works out to about one-sixth. The total parameter count is large; the per-token compute is small.
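The routing idea can be illustrated with a toy example. This is a minimal sketch of generic top-k expert routing in pure Python, not Meta's implementation: the scalar "experts", the random router scores, and the `top_k` parameter are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=1):
    """Pick the top_k experts for one token and renormalise their gate weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(x, experts, router_scores, top_k=1):
    """Run only the selected experts and blend their outputs by gate weight."""
    gates = route_token(router_scores, top_k)
    blended = sum(weight * experts[i](x) for i, weight in gates.items())
    return blended, sorted(gates)

# 16 toy "experts": each scalar function stands in for a feed-forward block.
experts = [lambda x, scale=i + 1: x * scale for i in range(16)]
random.seed(0)
# A real router computes these scores from the token's hidden state.
scores = [random.gauss(0.0, 1.0) for _ in experts]

output, used = moe_layer(2.0, experts, scores, top_k=1)
print(f"experts evaluated: {len(used)} of {len(experts)}")
```

The point of the sketch is the compute pattern: every expert exists in memory, but only the selected ones are evaluated for a given token.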
Maverick's 128-expert configuration is more aggressive. With 128 experts to choose from, the router has finer-grained specialisation available, and the model can in principle dedicate experts to narrow domains: code, mathematical reasoning, image understanding, multilingual text, and so on. The trade-off is that all 128 experts must sit somewhere in memory; you cannot prune the unused ones without capping the model's capability. This is the reason Maverick is harder to deploy on a single consumer GPU, or even a single workstation GPU.
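Because every expert has to be resident, weight memory scales with total parameters rather than active ones. A rough sketch of that estimate follows, using the total parameter counts Meta has published for the two models (about 109 billion for Scout and 400 billion for Maverick; these totals are not stated in this article). It counts weights only and ignores activations, KV cache, and runtime overhead.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params_billion: float, precision: str) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return total_params_billion * BYTES_PER_PARAM[precision]

# Total (not active) parameter counts in billions, per Meta's published figures.
MODELS = {"Scout": 109, "Maverick": 400}

for name, total in MODELS.items():
    for precision in ("bf16", "int8", "int4"):
        print(f"{name:9s} {precision:5s} ~{weight_memory_gb(total, precision):6.1f} GB")
```

On these assumptions Scout at 4-bit needs roughly 55 GB for weights, which fits an 80 GB accelerator, while Maverick at 4-bit needs about 200 GB and therefore several devices.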
The 10-million-token context window in Scout deserves a closer look because it changes what is possible at the prompt level. Most existing long-context techniques in production today rely on either chunked retrieval (RAG, where the prompt is assembled from short relevant snippets) or hierarchical summarisation (where a long document is recursively compressed). Neither approach is the same as actually feeding 10 million tokens into a single forward pass. With Scout, a user can drop an entire codebase, a complete year of legal filings, or every email in a Gmail account into the prompt and ask a question that requires cross-document reasoning. The latency and cost trade-offs are non-trivial — a 10M-token prompt is expensive — but the capability changes the architecture options for builders.
The architectural choices that made this possible include iRoPE, which interleaves attention layers that use rotary position embeddings with layers that use no positional embedding at all, a design intended to generalise better across context lengths than standard RoPE alone. The full architecture paper, released alongside the weights, details the design choices in depth.
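For readers unfamiliar with rotary embeddings, the sketch below shows standard RoPE only, in pure Python. It is not iRoPE and not Meta's code; it illustrates the property that makes rotary encodings attractive, namely that the attention score between a query and a key depends only on their relative offset. The vectors and positions are arbitrary illustrative values.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalised attention score."""
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (3) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))
print(math.isclose(near, far, abs_tol=1e-9))
```

An interleaved design would apply a rotation like this in some layers and skip it in others; that interleaving is not shown here.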
Multimodality is the other defining feature. Both Scout and Maverick accept text and image inputs natively — there is no separate vision encoder stage as in many earlier multimodal models. Image tokens flow into the same transformer stack as text tokens, with positional information adjusted to preserve 2D image structure. The benchmark results suggest this unified approach pays off: Maverick's multimodal scores are competitive with frontier closed models that use much larger dedicated visual processing pipelines.
For developers, the deployment story is straightforward. Both models are available for download under Meta's Llama Community License, which permits research and commercial use up to a 700-million-monthly-active-user threshold; above that, a separate Meta licence is required. Most companies will never hit that threshold. The weights are hosted on Meta's own AI download portal and mirrored on Hugging Face. Sample inference pipelines, fine-tuning scripts, and quantization recipes are provided by Meta and the open-source ecosystem.
ASEAN Perspective
Open-source frontier models are disproportionately important for Southeast Asia, where local-language fine-tuning, on-premise data residency, and cost-per-token sensitivity all push organisations toward open weights they can host themselves.
Indonesia is the largest ASEAN consumer market for Llama models, with multiple research teams at Universitas Indonesia, Institut Teknologi Bandung, and Telkom University running Llama-derived fine-tunes for Bahasa Indonesia processing. Llama 4's stronger multilingual baseline (Meta has stated that Llama 4 has materially improved coverage of major Southeast Asian languages) gives Indonesian fine-tuners a better starting point. Expect to see Bahasa-tuned Llama 4 Scout variants from Indonesian institutions within 60 days.
Singapore is the regional centre for AI infrastructure, and the country's IMDA-backed AI Singapore programme has been investing in local-language AI capability for years. The 10-million-token context window has specific appeal for the Singapore legal and financial-services sector, where document corpora are large and confidential — the kind of workload that is poorly served by sending data to an overseas API. Local hosting of Scout becomes a credible alternative to OpenAI's API for these workloads.
Vietnam and Thailand have growing developer communities and a strong preference for self-hosted AI. Vietnamese startups in particular have a track record of fine-tuning Llama models for Vietnamese-language consumer applications; Llama 4's improved multilingual baseline reduces the fine-tuning data and compute required to reach acceptable quality. Thai fintech firms — particularly those building chatbots that must operate offline or in low-bandwidth environments — benefit from the lower active parameter count.
Malaysia and the Philippines are smaller absolute markets for AI infrastructure but have emerged as adopters of Llama-derived models in banking, telecommunications, and government. The Llama 4 weights' permissive licence (subject to the 700M MAU clause) is well suited to government-sector deployment where API-based proprietary models face cross-border data flow restrictions.
For the wider ASEAN startup ecosystem, the practical story is that Llama 4 closes the open-source-versus-closed-API capability gap to the point where many use cases no longer require a paid frontier model. The cost equation is not a simple "free is cheaper" — running Llama 4 Scout at scale requires GPU infrastructure that is often more expensive per inference than the OpenAI or Anthropic equivalent — but the data-control and customisation benefits matter more in regions with strict data residency and language localisation needs.
What Organisations Should Do
For technical teams considering Llama 4 adoption, the work is concrete:
- Benchmark on your own workload. Public benchmarks are useful directional signals, but the gap between Meta's "experimental chat" LMArena claims and the released weights is a reminder that production performance depends on your task. Run Scout and Maverick against your existing evaluation suite before committing.
- Decide between Scout and Maverick on the right axis. Scout is the better choice when your workload involves very long context, document-heavy retrieval-light reasoning, or single-GPU deployment. Maverick is the better choice when you need peak multimodal reasoning, coding ability, or instruction-following and have multi-GPU infrastructure to run it on.
- Validate the licence terms. Most companies will not approach the 700-million-MAU threshold, but if you do or might, get legal review of the Llama Community License before building Llama 4 into a core product. The licence is liberal but not unlimited.
- Plan for the 10M-context cost reality. A 10-million-token prompt is expensive to run. Decide which workloads actually need that length and design fallback patterns (chunked retrieval, hierarchical summarisation) for the workloads that do not. Treat 10M-context as a capability to deploy deliberately, not a default.
- Establish a model-update cadence. Llama 4 is the current generation, but Llama 5 is in development. Build the infrastructure to swap weights cleanly when the next release lands: version pinning, evaluation suites, and rollback paths should all be in place before you go to production with Llama 4.
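The fallback pattern in the fourth item can be sketched as a simple planning step: send everything in one prompt when it fits the token budget you are willing to pay for, and fall back to chunks otherwise. This is a minimal illustration, not a production pipeline. The function names, the 0.75 words-per-token ratio, and the word-based chunking are all assumptions; a real system would count tokens with the model's tokenizer and rank chunks by relevance.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (assumed ratio)."""
    return int(len(text.split()) / words_per_token)

def chunk_words(text: str, chunk_size: int):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def plan_prompt(documents, token_budget: int, chunk_size: int = 200):
    """Return ('full_context', docs) if everything fits, else ('chunked', chunks)."""
    total = sum(estimate_tokens(d) for d in documents)
    if total <= token_budget:
        return "full_context", list(documents)
    chunks = [c for d in documents for c in chunk_words(d, chunk_size)]
    return "chunked", chunks

docs = ["word " * 300, "word " * 900]  # about 1,600 estimated tokens in total

mode, parts = plan_prompt(docs, token_budget=2_000)
print(mode, len(parts))

mode, parts = plan_prompt(docs, token_budget=1_000)
print(mode, len(parts))
```

Keeping the budget as an explicit parameter makes the long-context path a deliberate choice per workload rather than the default.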
RECATOOLS Verdict
Llama 4 is the most consequential open-source AI release of 2026 so far, and not for the reason most of the coverage has emphasised. The benchmark headlines about Maverick beating GPT-4o are interesting but slightly stale — GPT-4o is two generations behind the current OpenAI default, and Maverick is closer in capability to GPT-5.3 than to GPT-5.5 Instant.
The genuinely significant detail is the 10-million-token context window in Scout. This is a primitive that opens new product categories. Consider the architecture that has dominated AI-augmented document understanding for the past three years: chunk the document, embed the chunks, retrieve the relevant chunks at query time, stitch them into a constrained prompt. Every step in that pipeline has bugs, complexity, and quality losses. Scout's 10M-token window means that for a meaningful class of workloads (codebases up to a few hundred thousand lines, contract bundles, multi-year email archives) you can simply put the whole thing in the prompt and let the model attend across it natively.
That changes the build-versus-buy calculus for retrieval infrastructure. Vector databases, embedding services, chunking pipelines, re-ranking models: the whole RAG stack becomes optional for the workloads that fit in Scout's window. Many of them do. Our prediction: by the end of 2026, a meaningful share of internal enterprise document-search and code-search products will be rebuilt as "just put it in the prompt" architectures using Scout or its successors.
For ASEAN developers, our recommendation is concrete: if you have been holding back on a documentation-search, contract-analysis, or internal knowledge-base product because the retrieval engineering felt like too much work, Llama 4 Scout is the moment to revisit that decision. The model is free to download, runnable on infrastructure that exists in Singapore and Jakarta, and capable enough to make the "naive" architecture actually work.
Frequently Asked Questions
Can I run Llama 4 Scout on a single GPU? Scout activates 17 billion parameters per token, but all of its experts must sit in memory, so it is the full weight set that has to fit. At 4-bit quantization it can in principle run on a single high-end GPU such as an H100 80GB or an MI300X. Practical context length on a single GPU is much shorter than the 10-million-token maximum because KV-cache memory grows with context. For the full 10M-token capability you need multi-GPU deployment or a hosted endpoint such as Together AI or Replicate. Meta has published reference deployment configurations.
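The KV-cache point can be made concrete with back-of-envelope arithmetic. The sketch below assumes a hypothetical configuration (48 layers, 8 key-value heads, head dimension 128, 16-bit cache values) and assumes every layer caches the full context. Those numbers are illustrative assumptions, not published deployment guidance, and techniques such as chunked attention or cache quantization would reduce the totals.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape used only for illustration.
SHAPE = dict(layers=48, kv_heads=8, head_dim=128)

for tokens in (8_192, 131_072, 1_000_000, 10_000_000):
    gigabytes = kv_cache_bytes(tokens, **SHAPE) / 1e9
    print(f"{tokens:>10,} tokens -> ~{gigabytes:,.1f} GB of KV cache")
```

Under these assumptions the cache costs about 197 KB per token, so a 131,072-token context needs roughly 26 GB on top of the weights, and a full 10-million-token context needs on the order of 2 TB, far beyond any single GPU.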
How does Llama 4 Maverick compare to Claude Opus 4.7? On retrieval-augmented generation benchmarks, Maverick lands within 2% of Claude Opus 4 on most metrics. Note that this comparison is against Claude Opus 4, not the more recent Opus 4.7. On code generation and complex reasoning, Opus 4.7 still leads. On multimodal benchmarks, Maverick is competitive with both. Practical advice: pick Maverick when you can deploy it yourself for cost reasons, pick Opus 4.7 when you need peak performance per query and can pay the API premium.
What is the licence cost for Llama 4 commercial use? Llama 4 is free to download and use under the Llama Community License, including in commercial products, as long as your products and services had no more than 700 million monthly active users as of the Llama 4 release date. Above that threshold you must request a separate Meta licence. The licence also requires "Built with Llama" attribution on some surfaces, and any distributed model that is trained or improved using Llama outputs must carry "Llama" at the start of its name.
Will Meta release Llama 4 Behemoth weights? Meta has not committed to releasing Behemoth. The company's pattern with earlier releases (Llama 3 405B, for example) has been to delay or limit release of the largest variants while releasing the smaller distilled models. Independent analysts have estimated Behemoth at the 2-trillion-total-parameter scale, which would put it outside the range of most organisations' deployment infrastructure even if the weights were public.
How does Llama 4's 10M-token context window compare to Gemini Pro's long-context capability? Gemini 2.5 Pro and Gemini 3.5 Pro support context windows in the 1M-to-2M-token range via Google's API. Llama 4 Scout's 10-million-token window is the longest currently available across both open and closed models. However, "long context works" is not the same thing as "long context works well at every length"; the Gemini models and Scout alike suffer accuracy degradation near the high end of their advertised windows. Independent benchmarking on workloads similar to yours is the only reliable way to compare.