Llama
Quick Facts
- Vendor: Meta (Menlo Park)
- Released: LLaMA (February 2023); Llama 4 (2025)
- Current line: Llama 4 (Scout, Maverick, Behemoth) · Llama 3.3 · Llama Guard 4
- License: Llama Community License (bespoke; permissive for most commercial use, with scale-based restrictions)
- Hosting: Self-hosted (vLLM, llama.cpp, Ollama); hosted via Together, Groq, Fireworks, Bedrock, Vertex
- Context window: 128K to 10M tokens depending on variant (10M on Llama 4 Scout)
- Modalities: Text; native image in Llama 4
- Architecture: Dense and MoE variants
Summary
Llama is Meta's open-weights LLM family, first released in February 2023. The Llama 2 release in July 2023 made frontier-class open weights usable in commercial applications, and Llama 3 closed much of the gap with closed-weights competitors. Llama 4 (2025) moved to mixture-of-experts architectures and added native multimodality, with the Scout variant pushing the context window to 10M tokens.
For infrastructure teams, Llama's value is the combination of weights you can run anywhere and an ecosystem whose maturity is measured in years rather than months. Nearly every inference runtime, quantization tool, and fine-tuning framework supports Llama, usually as one of its first targets. Llama Guard, a small classifier model trained alongside the main family, is the de facto standard for open-source content moderation.
Model Lineup
- Llama 4 Scout: compact MoE with a 10M-token context window. Fits on a single H100-class GPU when quantized. Suited to long-document workloads on self-hosted hardware.
- Llama 4 Maverick: mid-size MoE. Frontier-competitive on general tasks and multimodal.
- Llama 4 Behemoth: flagship scale. Benchmark-competitive with top closed models; requires multi-GPU or cluster hosting. Meta previewed it at the Llama 4 launch while it was still in training, so confirm weight availability before planning around it.
- Llama 3.3 70B / Llama 3.1 8B: dense. Still widely deployed; the 8B model is a sweet spot for Mac Mini edge deployments. (Llama 3.3 itself shipped only at 70B; the 8B size comes from Llama 3.1.)
- Llama Guard: small moderation classifier. Screens inputs and outputs for policy violations.
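As a rough illustration of the screening step: Llama Guard models reply with a short text verdict, either `safe`, or `unsafe` followed by a line of comma-separated category codes such as `S1`. The sketch below parses that reply. The `Verdict` type and `parse_guard_output` function are hypothetical helpers, not part of any Llama library, and the set of category codes depends on the Llama Guard version in use.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Parsed moderation result (hypothetical helper type)."""
    safe: bool
    categories: list[str] = field(default_factory=list)


def parse_guard_output(text: str) -> Verdict:
    """Parse a Llama Guard style reply: 'safe', or 'unsafe' plus a line of codes."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        # Fail closed: an empty reply is treated as unsafe.
        return Verdict(safe=False)
    if lines[0].lower() == "safe":
        return Verdict(safe=True)
    # Anything other than an explicit 'safe' is treated as unsafe.
    codes: list[str] = []
    if len(lines) > 1:
        codes = [c.strip() for c in lines[1].split(",") if c.strip()]
    return Verdict(safe=False, categories=codes)
```

Treating every reply other than an explicit `safe` as unsafe is a design choice for this sketch; a production filter might log malformed replies separately.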
Where Llama Fits
Llama is the default when you need weights on your own hardware: on-premises regulated deployments, air-gapped environments, customer-owned infrastructure (Mac Mini edge compute), and any workload where sending data to a third-party API is a non-starter. Quantized 8B variants (Llama 3.1 8B, since Llama 3.3 shipped only at 70B) run comfortably on a Mac Mini and are the backbone of the ZeroClaw and PicoClaw runtimes in the Claw ecosystem.
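A back-of-the-envelope check for whether a quantized model fits on a given machine: weight memory is roughly the parameter count times bits per weight, divided by eight. The sketch below applies that arithmetic. The function name and the 20% allowance for KV cache and runtime buffers are assumptions for illustration, not measured figures.

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       overhead_fraction: float = 0.2) -> float:
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    overhead_fraction is an assumed allowance for KV cache and runtime
    buffers; real usage depends on context length and the runtime.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


# An 8B-parameter model at 4 bits per weight: about 4 GB of weights,
# about 4.8 GB with the assumed 20% overhead.
print(round(estimate_memory_gb(8, 4), 1))
# The same model at 16 bits per weight, for comparison.
print(round(estimate_memory_gb(8, 16), 1))
```

Practical 4-bit formats typically average somewhat more than 4 bits per weight, so treat the result as a lower bound rather than a sizing guarantee.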
Tradeoffs
- License is not pure open source. The Llama Community License is permissive for most uses, but organizations with more than 700M monthly active users must obtain a separate license from Meta, and the license carries naming and attribution requirements. Large enterprises need legal review.
- Tool use and structured output are less reliable than Claude or GPT at similar scale. Plan for more prompting, validation, and retry logic.
- Operational cost of self-hosting. GPUs, quantization, vLLM tuning, and the observability stack are all on you. See Edge Compute Economics for when it pays off.
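The validation and retry logic mentioned for structured output can be sketched as follows. This is a minimal illustration using only the standard library. `generate` is a placeholder for whatever client call returns model text, and `required_keys` and the default retry count are assumptions, not part of any Llama API.

```python
import json
from typing import Callable


def generate_json(generate: Callable[[str], str], prompt: str,
                  required_keys: set[str], max_attempts: int = 3) -> dict:
    """Call the model until it returns a JSON object with the required keys."""
    last_error = "no attempts made"
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(attempt_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                last_error = "top-level value is not an object"
            elif required_keys - set(data):
                missing = sorted(required_keys - set(data))
                last_error = f"missing keys: {missing}"
            else:
                return data
        # Feed the validation error back so the next attempt can correct it.
        attempt_prompt = (
            f"{prompt}\n\nPrevious reply was rejected ({last_error}). "
            "Reply with JSON only."
        )
    raise ValueError(f"no valid reply after {max_attempts} attempts: {last_error}")
```

Bounding the number of attempts keeps a misbehaving model from stalling the caller; the final `ValueError` carries the last validation failure for logging.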
Deployment Notes
Within the Claw ecosystem, Llama is the backbone of PicoClaw and ZeroClaw — the edge runtimes that live on customer Mac Minis. vLLM handles higher-end inference; llama.cpp and Ollama cover the laptop and Mac Mini tiers. Llama Guard sits in front of open-model endpoints as a lightweight moderation filter. For workloads that exceed the edge envelope, we route to hosted Llama endpoints (Together, Groq) via the provider arbitrage layer.
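The tiering described above can be sketched as a simple routing decision. Everything in this block is hypothetical: the tier names, the `Request` fields, and the token threshold are placeholders for illustration and do not describe the actual provider arbitrage layer.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    data_must_stay_local: bool  # e.g. regulated or air-gapped workloads


# Hypothetical limit for the edge tier; a real value depends on the
# model, quantization, and hardware in use.
EDGE_MAX_PROMPT_TOKENS = 8_000


def choose_tier(req: Request) -> str:
    """Return 'edge' or 'hosted' for a request (illustrative policy only)."""
    if req.data_must_stay_local:
        if req.prompt_tokens > EDGE_MAX_PROMPT_TOKENS:
            # Never silently send local-only data to a hosted endpoint.
            raise ValueError("request exceeds edge limits but must stay local")
        return "edge"
    if req.prompt_tokens <= EDGE_MAX_PROMPT_TOKENS:
        return "edge"
    return "hosted"
```

The one deliberate choice here is that a local-only request that is too large fails loudly instead of falling through to a hosted provider.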