Hermes

Summary

Hermes is the flagship fine-tune line from Nous Research, an independent research collective that has shipped more high-signal open-weights work than most well-funded labs. The design stance is "neutrally aligned" — Hermes respects the system prompt as the source of truth for persona and behavior rather than overriding it with safety layers baked deep into the weights. For agent infrastructure this is the point: if you need the model to adopt a role, use a tool, or produce structured output, Hermes is built for that.

Three technical contributions matter. (1) Function calling: Hermes 2 Pro introduced an open-source function-calling format with fine-tunes explicitly trained on it, and Hermes 4 continues that lineage with competitive tool-call reliability against frontier closed models. (2) YaRN context extension: the Nous team authored the YaRN paper and shipped extended-context variants that hold quality past 128K. (3) Agentic training data: DPO and SFT mixes include long-horizon reasoning traces, which materially improves multi-step tool loops over stock base-model fine-tunes.

Model Lineup

Hermes 4 (405B / 70B / 14B) — current flagship. Llama 3.1 and Qwen 2.5 bases. Strong function calling, long context, aggressive DPO. The default Nous fine-tune for new agent deployments.
Hermes 3 (405B / 70B / 8B) — prior flagship. Llama 3.1 base. Still widely deployed; the 8B variant is a sweet spot for edge hardware.
Hermes 2 Pro — first open model with a reference function-calling format. Llama 3 8B and Mistral 7B variants.
DeepHermes — reasoning-oriented variant with explicit chain-of-thought training. Optional "thinking mode" toggle via system prompt.
OpenHermes (legacy) — the 2023 Mistral 7B fine-tune that established the series.

Where Hermes Fits

Hermes is the default choice when you need open-weights function calling at the quality tier agent frameworks assume. Pick Hermes over a stock base model when the workload involves: structured JSON output under strict schemas, multi-step tool loops with retries, or personas driven by the system prompt rather than fine-tuned in. The 8B / 14B variants run comfortably on Mac Mini edge hardware, making Hermes a credible self-hosted alternative to hosted function-calling APIs.

Tradeoffs

Base-model inheritance. License and capability ceilings track the base model. Hermes 4 on Llama 3.1 carries the Llama Community License and its scale clause; Hermes on Qwen or Mistral bases can be Apache 2.0. Audit per variant.
Alignment posture. Hermes is deliberately less risk-averse than base Llama / Qwen instruct tunes. This is a feature for agent builders and a liability for consumer-facing products without a moderation layer (Llama Guard, FrawdBot) in front.
Benchmark gap at the top. Frontier closed models still win head-to-head on the hardest reasoning and coding benchmarks. Hermes closes enough of the gap for most agent work, but not all.
Ecosystem fragmentation. Multiple concurrent lines (Hermes 3, Hermes 4, DeepHermes) with different bases. Choose one and pin it for production.

Deployment Notes

Within the Claw ecosystem, Hermes variants are a first-class alternative to stock Qwen3-Coder for agent loops where system-prompt fidelity matters more than raw coding ability — ops agents, research agents, long-running orchestrators. Hermes 3 8B and Hermes 4 14B run well on Mac Mini edge nodes via vLLM or Ollama. Function-calling-heavy workloads that hit reliability ceilings on stock base models are usually fixed by switching to Hermes Pro or Hermes 4 with no other code changes.

FrawdBot still sits in front as a moderation layer — given Hermes's steerable alignment, the policy boundary lives in infrastructure, not in the weights.

References

[1] Nous Research

[2] Nous Research on Hugging Face

[3] YaRN: Efficient Context Window Extension of Large Language Models

[4] Llama — LLM Wiki

[5] Qwen — LLM Wiki