Hermes 4 and Nous Research

Updated 11 Jul 2026

In Short

Hermes 4 is a family of open-weight large language models from Nous Research, an American open-source AI lab. Its defining trait is "neutral alignment": the models are trained to follow instructions and refuse far less often than ChatGPT or Claude, while still holding a few hard lines. Each model can switch a step-by-step reasoning mode on or off with a `<think>` toggle. The family is post-trained on several different base models, so it splits into two licensing tiers: the restricted Llama license and the open Apache-2.0 license. The "outperforms ChatGPT" claim is true only on Nous's own refusal benchmark, not on raw capability. Hermes 4.3 (December 2025) is notable for being post-trained on a decentralized network of computers rather than in one data center.

Snapshot caveat:
Model versions and benchmark numbers churn fast. This reflects June 2026, and every benchmark figure here is Nous's own self-reported result. Re-check the model cards before quoting specifics.

01. What It Is

Hermes 4 is a family of free-to-download large language models made by Nous Research. "Open-weight" means anyone can download the trained numbers and run them. The family's distinguishing trait is what Nous calls "neutral alignment." The models are trained to follow the user's instructions and to refuse far less often than mainstream assistants like ChatGPT or Claude, and each one can switch a step-by-step reasoning mode on or off. Nous's technical report describes Hermes 4 as "a family of hybrid reasoning models that combine structured, multi-turn reasoning with broad instruction-following ability."

Nous Research is a US-based open-source AI group. In its own words it is "a leader in the American open source AI movement" that trains "world-class open source language models" and builds "infrastructure to coordinate distributed, unbiased training." Its stated mission is to advance human rights and freedoms by creating open models and supporting "their unrestricted availability and use." Beyond the Hermes models, Nous builds Atropos, an open reinforcement-learning environment manager, and Psyche, a system for training models across many computers over the open internet. Those two pieces matter later.

02. Why It Matters

Hermes is one of the clearest worked examples of a deliberate design choice: fewer refusals as a stated goal rather than an accident. Mainstream hosted assistants are tuned to decline a wide range of requests. Nous tunes in the other direction, toward steerability and answering. On its own model card it frames this as a model "that is aligned to you," with "extreme improvements on steerability, especially on reduced refusal rates."

"Fewer refusals" is not "anything goes," and the distinction is the point. The same reduction that helps legitimate work, such as research, fiction, security testing, and lawful but sensitive topics, also makes the model comply with requests a mainstream assistant would block. Nous keeps three categories where refusing is the intended behavior, described below. The tradeoff is real, and responsibility shifts to whoever deploys the model.
This is the same tension covered in uncensored-and-unrestricted-models, approached here as a stated design philosophy rather than an after-the-fact edit.
For how labs measure and stress-test these behaviors, see safety-alignment-red-teaming.

03. How It Works

The Hermes line and where Hermes 4 fits

Hermes is Nous's long-running series of instruction-tuned open models. Hermes 3 (2024) was built on Meta's Llama 3.1. Hermes 4 (August 2025) is the first Hermes built as a hybrid reasoning model. Hermes 4.3 (December 2025) is the latest update and the first post-trained on Nous's decentralized network.

Many bases, two licenses

A common error is that Hermes 4 is "a Llama model." It is not a single base. The family is post-trained on several different foundations, and that choice decides the license:

Hermes 4 405B and 70B start from Meta's Llama 3.1. They carry the Llama 3.1 Community License, which restricts use. They are open-weight but not OSI open-source.
Hermes 4 14B starts from Alibaba's Qwen3, and Hermes 4.3 36B starts from ByteDance's Seed-OSS. Both are Apache-2.0, an OSI-approved open-source license.

This makes Hermes a useful teaching case for open-weight-vs-open-source. One family, two tiers. And even the Apache-2.0 weights do not mean the full training set was published. Nous describes its Hermes 4 synthetic dataset in detail but did not release all of it.
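The base-to-license mapping above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary layout and the helper names `is_osi_open_source` and `license_tiers` are hypothetical, and the data is simply what this article lists.

```python
# Base model and license for each Hermes size, as listed in this article.
HERMES_MODELS = {
    "Hermes 4 405B": {"base": "Llama 3.1", "license": "Llama 3.1 Community License"},
    "Hermes 4 70B": {"base": "Llama 3.1", "license": "Llama 3.1 Community License"},
    "Hermes 4 14B": {"base": "Qwen3", "license": "Apache-2.0"},
    "Hermes 4.3 36B": {"base": "Seed-OSS", "license": "Apache-2.0"},
}

# Licenses this sketch treats as OSI-approved (only the one the article names).
OSI_APPROVED = {"Apache-2.0"}


def is_osi_open_source(model_name: str) -> bool:
    """True if the model's license is in the OSI-approved set above."""
    return HERMES_MODELS[model_name]["license"] in OSI_APPROVED


def license_tiers() -> dict[str, list[str]]:
    """Group the models into the two tiers described in the text."""
    tiers: dict[str, list[str]] = {"open-source": [], "open-weight only": []}
    for name in HERMES_MODELS:
        tier = "open-source" if is_osi_open_source(name) else "open-weight only"
        tiers[tier].append(name)
    return tiers


for tier, names in license_tiers().items():
    print(f"{tier}: {', '.join(names)}")
```

The point of the grouping is that the tier follows from the base model, not from the Hermes name.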

Hybrid reasoning and the `<think>` toggle

A single Hermes 4 model can answer immediately or first work through a visible chain of deliberation. When it reasons, it wraps that monologue in literal `<think>` and `</think>` tags before giving the final answer. The mode is switched with a chat-template flag or a system prompt, so the user or the model decides per request.
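Because the reasoning arrives inside literal tags, an application that only wants the final answer has to separate the two parts. A minimal sketch of that step follows, assuming the raw completion is already available as a string and contains at most one think block. The function name `split_reasoning` is hypothetical and is not part of any Nous tooling.

```python
import re

# Matches one <think>...</think> block; DOTALL lets the monologue span lines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    If there is no <think> block (reasoning mode off), reasoning is "".
    """
    match = THINK_BLOCK.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    reasoning = match.group(1).strip()
    # Everything outside the tags is treated as the final answer.
    answer = (raw_output[:match.start()] + raw_output[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4</think>\nThe answer is 4.")
print(reasoning)
print(answer)
```

With reasoning off, the same function returns an empty reasoning string and passes the text through unchanged.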

The gap between the two modes is large and concrete. Nous reports Hermes 4 405B scoring 81.9% on the AIME'24 math test with reasoning on, but only 11.4% with reasoning off. More thinking buys better answers at the cost of speed and length.
This is a clean illustration of test-time compute, covered in reasoning-models-test-time-compute.

How "neutral alignment" is measured

Nous measures refusals with its own benchmark, RefusalBench. It hand-built 166 prompts across 32 categories of requests that assistants typically decline, then used Claude Sonnet 4 as an automated judge. A higher score means the model answered more of them. Two things follow. The score measures willingness to answer, not whether the answer is correct or safe. And it is Nous's own benchmark, on which Nous leads by design.

Neutral alignment is not unconditional. Nous applies what it calls "conditional reward inversion" to three categories where refusing is the correct behavior, namely minor-specific harm, exploitation and human trafficking, and suicide and self-harm. On these, the training rewards a refusal rather than compliance.
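The idea of inverting the reward for a few categories can be shown with a toy scoring function. This is a sketch of the concept only, not Nous's benchmark or training code: the `JudgedResponse` type, the category labels, and the 0-to-100 scale are assumptions made for illustration.

```python
from dataclasses import dataclass

# The three hard-line categories named in the text; labels are illustrative.
INVERTED_CATEGORIES = {
    "minor-specific harm",
    "exploitation and human trafficking",
    "suicide and self-harm",
}


@dataclass
class JudgedResponse:
    category: str  # category of the prompt
    refused: bool  # the judge's verdict: did the model refuse?


def reward(item: JudgedResponse) -> int:
    """1 point for answering, except in inverted categories, where refusing scores."""
    if item.category in INVERTED_CATEGORIES:
        return 1 if item.refused else 0
    return 0 if item.refused else 1


def score(items: list[JudgedResponse]) -> float:
    """Percentage of responses that earned a point."""
    if not items:
        return 0.0
    return 100.0 * sum(reward(i) for i in items) / len(items)


sample = [
    JudgedResponse("security testing", refused=False),      # answered: point
    JudgedResponse("security testing", refused=True),       # refused: no point
    JudgedResponse("suicide and self-harm", refused=True),  # refused: point
    JudgedResponse("suicide and self-harm", refused=False), # answered: no point
]
print(score(sample))
```

Under this scheme a model that answers everything does not get a perfect score, which is the sense in which the alignment is conditional.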

How it is trained

Nous generates training data with DataForge, a graph-based synthetic-data generator that turns seed web text into instruction-answer pairs, each graded by an automated judge. Reasoning traces are filtered by rejection sampling against roughly a thousand task-specific verifiers run through Atropos. Training used 192 NVIDIA B200 GPUs.
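Rejection sampling against a verifier is simple to state in code: generate several candidates, check each one, and keep only those that pass. The sketch below uses toy stand-ins for the model and the verifier. It does not use or imitate the Atropos API, and every name in it is hypothetical.

```python
import random
from typing import Callable


def rejection_sample(
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    prompt: str,
    attempts: int = 8,
) -> list[str]:
    """Draw `attempts` candidates and keep only those the verifier accepts."""
    kept = []
    for _ in range(attempts):
        candidate = generate(prompt)
        if verify(prompt, candidate):
            kept.append(candidate)
    return kept


_rng = random.Random(0)  # fixed seed so the toy run is repeatable


def toy_generate(prompt: str) -> str:
    """Stand-in for a model: adds two numbers, sometimes off by one."""
    a, b = (int(x) for x in prompt.split("+"))
    return str(a + b + _rng.choice([-1, 0, 0, 1]))


def toy_verify(prompt: str, candidate: str) -> bool:
    """Stand-in for a task-specific verifier: checks the sum exactly."""
    a, b = (int(x) for x in prompt.split("+"))
    return int(candidate) == a + b


kept = rejection_sample(toy_generate, toy_verify, "2+3", attempts=20)
print(f"kept {len(kept)} of 20 candidates")
```

Only verified outputs survive into the training set, so the filter's quality is bounded by the verifier's.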

The size of the dataset is reported inconsistently across Nous's own sources, so it is worth stating plainly rather than picking one number. The corpus is roughly 5 million samples. The technical report says about 19 billion tokens, the Hugging Face cards say about 60 billion, and a summary table lists about 56 billion training tokens. The "50 times larger than Hermes 3" claim only holds at the 60-billion figure. Treat the token count as approximate.
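A quick calculation shows how far apart the three figures are. The sketch below uses only numbers from this article. The Hermes 3 baseline is not stated here, so it is derived from the "50 times larger" claim at the 60-billion figure and should be read as implied, not reported.

```python
SAMPLES = 5_000_000  # "roughly 5 million samples"

# Token totals as reported by each source named in the text.
TOKEN_COUNTS = {
    "technical report": 19e9,
    "Hugging Face cards": 60e9,
    "summary table": 56e9,
}

CLAIMED_RATIO = 50  # "50 times larger than Hermes 3"

# Assumption: the claim is anchored to the 60-billion figure, as the text says.
IMPLIED_HERMES3_TOKENS = TOKEN_COUNTS["Hugging Face cards"] / CLAIMED_RATIO


def tokens_per_sample(total_tokens: float) -> float:
    """Average tokens per sample for a given corpus total."""
    return total_tokens / SAMPLES


def ratio_to_hermes3(total_tokens: float) -> float:
    """How many times larger the corpus is than the implied Hermes 3 baseline."""
    return total_tokens / IMPLIED_HERMES3_TOKENS


for source, total in TOKEN_COUNTS.items():
    print(
        f"{source}: {tokens_per_sample(total):,.0f} tokens/sample, "
        f"{ratio_to_hermes3(total):.1f}x implied Hermes 3"
    )
```

At the 19-billion figure the corpus is closer to 16 times the implied baseline than 50, which is why the headline ratio depends on which total is used.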

Benchmarks, in context

Every Hermes benchmark number is Nous self-reported, and no independent re-evaluation was found, so they should be read as vendor claims. On raw capability Hermes 4 is competitive among open-weight models but not in front. By Nous's own comparison table, DeepSeek R1 beats Hermes 4 405B on most capability tests, including MATH-500, the AIME math contests, GPQA, MMLU, and LiveCodeBench. Where Hermes 4 leads by a wide margin is RefusalBench, where it scores 57.1 against 16.7 for DeepSeek R1, about 15 for Claude Opus 4.1, and about 11 for GPT-5.

That is why the launch headline that Hermes 4 "outperforms ChatGPT" is misleading. It is accurate only on RefusalBench, the refusal metric Nous built, not on general capability. Nous itself concedes that standard benchmarks "are easily gamed."

04. Decentralized Training (Psyche, DisTrO, Solana)

Most large models are trained in a single data center on tightly linked GPUs. Nous is building the opposite, a way to train across many separate computers over the open internet. The system is called Psyche, and it relies on an optimizer called DisTrO that cuts how much data the machines must exchange with each other. Coordination between the scattered participants is tracked on the Solana blockchain.

Hermes 4.3 is the first production Nous model post-trained entirely on the Psyche network. Nous released both a Psyche-trained version and a conventional centralized version, and reports that the Psyche version "outperformed the traditional centralized version on a suite of downstream tasks." The distinctive part is the method, a usable model produced without a single central training cluster.

05. How a Non-Coder Gets It

You do not need the weights to try Hermes. The easiest route is Nous's hosted chat at chat.nousresearch.com, or the Nous Portal, a paid API. Third-party hosts such as Chutes, Nebius, and Luminal also serve Hermes models.

To run one yourself, the smaller sizes ship GGUF builds that load in LM Studio or Ollama, the same tools covered in running-llms-locally. Rough hardware, using 4-bit builds:

14B: about 9 to 10 GB, a mainstream consumer GPU or a 16 GB+ Mac.
36B (Hermes 4.3): about 20 to 22 GB, a single 24 GB card such as an RTX 3090 or 4090. Nous sized this one for local use.
70B: about 40 GB, needing a 48 GB GPU, two 24 GB cards, or a 64 GB+ Mac.
405B: about 230 GB even at 4-bit, a multi-GPU server only. It is the clearest example of open weights you realistically cannot self-host.
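The sizes above follow roughly from parameter count times bits per weight. The sketch below assumes an effective 4.5 bits per weight for a 4-bit build and a flat 2 GB of headroom for the runtime. Both numbers are assumptions for illustration, not figures from Nous, and real usage also grows with context length.

```python
BITS_PER_WEIGHT = 4.5  # assumption: effective bits per weight in a 4-bit build
HEADROOM_GB = 2.0      # assumption: flat allowance for runtime overhead


def weight_gb(params_billion: float, bits: float = BITS_PER_WEIGHT) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


def fits(params_billion: float, memory_gb: float) -> bool:
    """True if the weights plus headroom fit in the given memory."""
    return weight_gb(params_billion) + HEADROOM_GB <= memory_gb


for size in (14, 36, 70, 405):
    print(f"{size}B: about {weight_gb(size):.1f} GB of weights")

print(fits(36, 24))  # 36B on a single 24 GB card
print(fits(70, 24))  # 70B on a single 24 GB card
```

The weights-only estimate for the 14B comes out below the 9 to 10 GB quoted above, presumably because the quoted figure includes runtime overhead that this sketch folds into a single flat allowance.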

For where Hermes sits among other open and closed models, see model-landscape-2026 and small-language-models.

06. Key Terms

Term	Plain-language meaning
Open-weight	A model whose numbers you can download and run, even if the license restricts use and the data is not released. See open-weight-vs-open-source.
Open-source (OSI)	A stricter status. The license must allow free use, study, and redistribution. Hermes 14B and 36B (Apache-2.0) qualify, the Llama-based 70B and 405B do not.
Neutral alignment	Nous's framing for a model tuned to follow instructions and refuse less often, while still declining a few defined categories.
Hybrid reasoning	One model that can answer instantly or after a visible `<think>` monologue, switched on or off per request.
RefusalBench	Nous's own benchmark scoring how often a model answers requests others decline. Higher means fewer refusals, not more correct or safer.
Conditional reward inversion	The training step where Nous rewards refusal for three off-limits categories instead of compliance.
Psyche / DisTrO	Nous's system and optimizer for training a model across many internet-connected computers, coordinated on the Solana blockchain.

07. Examples

These are facts about the models, not endorsements.

The reasoning toggle as test-time compute:
The same Hermes 4 405B scores 81.9% on AIME'24 with reasoning on and 11.4% with it off. Nothing changed but the amount of thinking, which is the whole idea behind reasoning modes.
Hermes 4.3 36B as a locally runnable open model:
It is Apache-2.0, sized to fit a single 24 GB consumer GPU, and the first Nous model post-trained on the decentralized Psyche network. A non-coder can download a GGUF build and run it in LM Studio.
Hermes 4 405B as "open weights you cannot self-host":
Anyone can download it and inspect it, but at roughly 230 GB in 4-bit it needs a multi-GPU server. Open to study is not the same as runnable at home.

08. Risks and Limits

Benchmarks are self-reported:
Every Hermes number here comes from Nous, with no independent re-evaluation. Read them as vendor claims.
Fewer refusals moves responsibility to you:
With the guardrails relaxed, the deployer owns what the model produces, including outputs a hosted assistant would have blocked.
The two largest models are not open-source:
The 70B and 405B carry Meta's restricted Llama license, so use has conditions. Only the 14B and 36B are Apache-2.0.
The dataset is not fully public:
Open weights do not mean open data. Nous describes the Hermes 4 corpus but did not release all of it.
Numbers and versions churn:
The token count already disagrees across Nous's own pages, and the model line moves quickly. Re-check the model card.

09. Common Misconceptions

"Hermes 4 is a Llama model."
Partly. The 70B and 405B are built on Llama 3.1, but the 14B is built on Qwen3 and the Hermes 4.3 36B on ByteDance's Seed-OSS. The base, and the license, depends on the size.

"Hermes 4 outperforms ChatGPT."
Only on one metric. The claim traces to Nous's RefusalBench, which measures how seldom a model refuses. On general capability tests Hermes 4 trails models like DeepSeek R1. It does not beat ChatGPT on broad capability.

"Neutral alignment means it answers anything."
No. It refuses less, not never. Nous deliberately keeps three hard lines where the training rewards a refusal: minor-specific harm, exploitation and human trafficking, and suicide and self-harm.

"Hermes Agent is the Hermes 4 model."
No. Hermes Agent is a separate, MIT-licensed agent application that runs in a terminal and across messaging apps. It is model-agnostic and is not the Hermes 4 weights.

"Open weights mean I can run any Hermes at home."
Not the largest. The 14B and 36B run on one consumer GPU, but the 405B needs a multi-GPU server. Downloadable does not mean laptop-runnable.

Verified against primary sources

Every claim traces to a cited source below.

Key terms

Open-weight model: Publicly downloadable weights you can self-host.
Neutral alignment: Nous Research's design goal of building models that follow instructions and refuse far less often than mainstream assistants, while still declining a few defined categories.
Hybrid reasoning: A single model that can answer instantly or first work through a visible step-by-step <think> monologue, toggled on or off by the user or the model.
RefusalBench: Nous's own benchmark that scores how often a model answers requests other assistants tend to decline. A higher score means fewer refusals, not better or safer answers.
Decentralized training: Fine-tuning a model across many separate computers over the open internet rather than in one data center, coordinated here by Nous's Psyche network and DisTrO optimizer.

Sources
