ZAYA1-8B: Zyphra's Open Source Reasoning Model Trained on AMD GPUs
A startup no one had heard of just released a free reasoning model that beats frontier models on math -- trained entirely without a single Nvidia GPU.
By Free AI News Editorial · · · 9 min read
The open source AI race has a new dark horse. On May 6, 2026, a relatively unknown Palo Alto startup called Zyphra released ZAYA1-8B -- a small mixture-of-experts reasoning model that, on paper, has no business competing with the models it is beating. With just 760 million active parameters, it outscored Claude 4.5 Sonnet and GPT-5-High on the notoriously difficult HMMT mathematics benchmark. It did this running on AMD hardware. And then Zyphra handed it to the world for free under the Apache 2.0 license. If you have been watching the open source AI space for a while, you have seen models punch above their weight. ZAYA1-8B punches so far above its weight class that it forces a genuine reconsideration of what efficiency in AI actually looks like in 2026.
What Is ZAYA1-8B and Who Made It?
ZAYA1-8B is a mixture-of-experts (MoE) language model developed by Zyphra, an open superintelligence research company based in Palo Alto, California. The model was released publicly on May 6, 2026 with weights available for free download on Hugging Face. It carries an Apache 2.0 license, which means you can use it commercially, modify it, redistribute it, and build products on top of it without any restriction or royalty payment to Zyphra.
The model has two parameter counts that matter: 8.4 billion total parameters and 760 million active parameters. The distinction is central to what makes MoE architectures compelling. In a standard dense language model, every parameter activates for every input token. In a MoE design, only a small subset of the network -- the "experts" -- process each token at inference time. This means ZAYA1-8B can draw on a large, rich set of learned representations while only spending compute equivalent to running a sub-1-billion-parameter model during actual use. The result is lower memory requirements, faster inference, and the ability to run on hardware that dense models of similar capability cannot touch.
Zyphra designed ZAYA1-8B specifically around what they call "intelligence density" -- maximizing the useful reasoning extracted per parameter and per floating-point operation. According to Codersera's deep-dive on ZAYA1-8B, this framing underpins every architectural choice the team made, from the attention mechanism to the training methodology to test-time compute.
How Does ZAYA1-8B's Architecture Work?
ZAYA1-8B is built on Zyphra's proprietary MoE++ architecture, which introduces three core innovations over standard MoE transformer designs. Each addresses a specific bottleneck that limits efficiency in conventional models.
- Compressed Convolutional Attention (CCA) -- Standard attention mechanisms store all past key-value states in a growing cache as context windows lengthen, consuming substantial memory at inference time. CCA operates in a compressed latent space instead, achieving an 8x reduction in KV-cache size compared to multi-head attention. This allows the model to reason over longer contexts without requiring proportionally more memory.
- MLP-Based Router with PID Balancing -- Most MoE models use a simple linear layer to route tokens to experts. ZAYA1-8B replaces this with a more expressive multi-layer perceptron router. To keep expert load balanced during training -- a notoriously difficult problem in MoE systems -- Zyphra applies a bias-balancing scheme inspired by PID (Proportional-Integral-Derivative) controllers from classical control theory. The result is stable training without load collapse across the model's 40 layers.
- Learned Residual Scaling -- Deep neural networks often suffer from residual norm growth as data flows through successive layers, which can destabilize training. ZAYA1-8B adds learned scaling factors to control residual stream growth at negligible parameter and compute cost, preventing gradient explosion without requiring architectural workarounds.
Beyond architecture, Zyphra integrated reasoning capabilities during pretraining rather than bolting them on afterward. They developed Answer-Preserving (AP) Trimming to handle long chain-of-thought training examples that exceed early context windows: instead of truncating from the end (losing the solution) or dropping the example entirely, the system trims the middle of the reasoning trace while preserving both the problem setup and the final answer. The model learns the relationship between complex problems and their solutions even when the full internal logic cannot fit in memory.
The most distinctive innovation is Markovian RSA, a test-time compute methodology that decouples thinking depth from context size. Traditional test-time compute lets a model "think longer" by generating longer chains of thought -- but as context grows, models lose focus. Markovian RSA generates multiple parallel reasoning traces, then extracts only the final few thousand tokens ("tails") from each, and presents those tails to the model in an aggregation prompt asking it to reconcile the different approaches. By carrying forward only tails, ZAYA1-8B can reason indefinitely without context overflow. The full technical details are available in the Zyphra technical report on arXiv.
What Are ZAYA1-8B's Benchmark Results?
The performance numbers are what made ZAYA1-8B a headline story across AI research coverage. Zyphra's evaluations place it ahead of or competitive with models that are 10 to 100 times larger by active parameter count. Here is how it compares against similar-class and larger open-weight models on key benchmarks:
| Model | Active / Total Params | AIME '26 | HMMT '26 | LiveCodeBench v6 | GPQA-Diamond |
|---|---|---|---|---|---|
| ZAYA1-8B | 0.7B / 8B | 89.1 | 71.6 | 63.8 | 71.0 |
| Qwen3-4B-Thinking | 4B / 4B | 77.5 | 60.8 | 54.2 | 66.5 |
| Gemma-4-E4B-it | 4B / 8B | 50.3 | 32.1 | 54.2 | 57.4 |
| OLMo-3.1-32B-Think | 32B / 32B | 78.9 | 50.6 | 58.3 | 59.6 |
| Mistral-Small-4-119B | 6B / 119B | 86.4 | 70.6 | 57.9 | 77.2 |
| N3-Nano-30B | 3B / 30B | 90.1 | 75.5 | 64.6 | 75.1 |
As MarkTechPost noted in their analysis, the HMMT February 2026 benchmark is particularly notable. HMMT is a high-school math competition benchmark of extreme difficulty that closely tracks graduate-level mathematical reasoning. ZAYA1-8B's score of 71.6 on HMMT '26 beats the Mistral-Small-4-119B model (70.6) which has 119 billion total parameters. When Markovian RSA is applied at test time, Zyphra reports the model achieves 89.6 on HMMT '25, surpassing both Claude 4.5 Sonnet and GPT-5-High on that particular evaluation. The gap between ZAYA1-8B's active parameter count (760M) and the models it is beating represents a fundamental efficiency advance, not just incremental progress.
It is worth noting that these numbers come from Zyphra's own evaluation harness. Independent third-party replications are ongoing in the community. Some task-specific results may vary depending on evaluation methodology. The Hugging Face model card contains full benchmark details and in-context comparison tables that community members are actively auditing.
Why Was ZAYA1-8B Trained Entirely on AMD Hardware?
The entire ZAYA1-8B training pipeline -- pretraining, midtraining, and supervised fine-tuning -- ran on a cluster of 1,024 AMD Instinct MI300x nodes connected via AMD Pensando Pollara interconnect, in a custom cluster built in partnership with IBM. Not a single Nvidia GPU was involved at any stage of training. This is arguably as significant as the benchmark numbers themselves.
Nvidia's dominance in AI training has been near-total since the deep learning era began. The H100 and H200 GPUs have become the de facto standard for frontier model training, and the implicit assumption across most of the industry is that serious AI research requires access to Nvidia silicon. ZAYA1-8B proves that assumption wrong. The AMD MI300x is a capable alternative training platform, and Zyphra's results demonstrate that high-quality reasoning models can be produced end-to-end on AMD infrastructure.
This matters for the open source community for a practical reason: AMD hardware is more accessible for many organizations than Nvidia hardware, which frequently faces allocation constraints and premium pricing due to overwhelming demand. A successful model trained on AMD infrastructure validates the ROCm software stack and AMD's AI compute ecosystem at a level no benchmark sheet alone could achieve. It also signals that the next wave of open source models may not be bottlenecked by Nvidia GPU availability in the way that frontier labs currently are.
For context on how ZAYA1-8B fits into the broader open source AI model landscape, see our Open Source AI hub tracking all major model releases throughout 2026.
How Can You Download and Run ZAYA1-8B?
Getting ZAYA1-8B running locally requires a few specific steps. Zyphra currently recommends using their custom fork of the vLLM inference library, since the MoE++ architecture with CCA and the MLP-based router requires modified inference code that has not yet been merged into the main vLLM release. The installation command is:
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr"
After installing Zyphra's vLLM fork, you can pull the model weights directly from Hugging Face using the standard transformers or vLLM model loading conventions. The model card at huggingface.co/Zyphra/ZAYA1-8B includes full quickstart code for both standard inference and test-time compute configurations.
If you want to test ZAYA1-8B without running any local infrastructure, Zyphra offers a free playground through their Zyphra Cloud product. You can access it directly from the Zyphra website and run queries against the full model without any setup. This is the fastest path to evaluating whether the model suits your specific use case before investing in local deployment.
The pretraining base version of the model -- the version before post-training and RLHF-style alignment -- is also available separately on Hugging Face as ZAYA1-reasoning-base, for researchers who want to study the pre-alignment behavior or conduct their own fine-tuning experiments from the base weights.
How Does ZAYA1-8B Compare to Other Small Open Source Models?
The 8B parameter class is currently the most competitive segment of the open source model market. Google's Gemma 4 E4B, Alibaba's Qwen3-4B and Qwen3.5-4B series, and various Mistral variants are all targeting the same use case: models that run affordably on consumer and small-cloud hardware while delivering useful performance on real-world tasks. ZAYA1-8B enters this space with a different architectural bet than all of them.
Where Gemma 4 and Qwen3 are primarily dense or near-dense models, ZAYA1-8B's MoE++ design means its active compute footprint is closer to a 700M parameter dense model. This makes it considerably cheaper per token at inference time when self-hosted. The tradeoff is that total model size is larger: storing 8.4B parameters requires more disk and initial RAM than a 4B dense model. But for inference-heavy applications where throughput and cost per query matter, the active parameter count is the figure that drives operating costs.
On reasoning-specific benchmarks (AIME, HMMT, LiveCodeBench), ZAYA1-8B's advantage is most pronounced. On instruction following and general capability benchmarks like MMLU-Pro and IFEval, the Qwen3.5-4B scores slightly higher. This suggests ZAYA1-8B is an especially strong choice for mathematical, scientific, and coding reasoning tasks, while general-purpose chatbot or instruction-following workloads may benefit from comparing it directly against the Qwen3 series before committing to deployment.
For developers choosing between the current generation of small open source models, it is worth reading our open source model comparison guide alongside Zyphra's published benchmarks. And if you want to track when major models release new free-tier access or open weight drops, the Free Tier Tracker keeps a running log of changes across all major providers.
🔑 Key Takeaways
- ZAYA1-8B is free to download and use commercially under the Apache 2.0 license, making it immediately accessible to developers, researchers, and enterprises without any cost or legal friction.
- Its MoE++ architecture activates only 760 million of its 8.4 billion parameters per inference pass, giving it the efficiency of a sub-1B model while maintaining the representational capacity of a much larger system.
- On the HMMT February 2026 math benchmark, ZAYA1-8B with Markovian RSA scored 89.6, outperforming Claude 4.5 Sonnet (88.3) and GPT-5-High -- a proprietary frontier model -- despite being a fraction of their size.
- The entire training run used AMD Instinct MI300x hardware exclusively, proving for the first time at this benchmark quality level that Nvidia GPUs are not a prerequisite for producing state-of-the-art open source reasoning models.
- Markovian RSA, Zyphra's novel test-time compute method, decouples reasoning depth from context length by aggregating only the tail of multiple parallel reasoning traces, allowing arbitrarily deep reasoning within a fixed context budget.
Frequently Asked Questions
What is ZAYA1-8B?
ZAYA1-8B is an open source mixture-of-experts language model developed by Zyphra, a Palo Alto AI startup. Released on May 6, 2026 under the Apache 2.0 license, it has 8.4 billion total parameters but only 760 million active parameters per inference pass, making it highly efficient for local deployment and competitive with models many times its size on math and coding benchmarks.
Is ZAYA1-8B free to use commercially?
Yes. ZAYA1-8B is released under the Apache 2.0 license, one of the most permissive open source licenses available. You can download the model weights from Hugging Face at no cost, modify the model, integrate it into commercial products, and deploy it in enterprise settings without paying royalties or licensing fees to Zyphra.
How many parameters does ZAYA1-8B have?
ZAYA1-8B has 8.4 billion total parameters, but because it uses a mixture-of-experts architecture, only 760 million parameters are active during any single inference pass. This MoE design gives the model the representational capacity of a much larger dense model while requiring far less compute and memory at runtime, making local deployment practical on consumer hardware.
What hardware do I need to run ZAYA1-8B locally?
Because only 760 million parameters are active at inference time, ZAYA1-8B can run on consumer-grade hardware that would struggle with larger models. Zyphra notes it is suitable for on-device local deployment. A GPU with 8-16 GB of VRAM should be sufficient for most configurations, though exact requirements depend on quantization settings and whether you are using test-time compute features like Markovian RSA.
How does ZAYA1-8B benchmark against GPT-5 and Claude?
On the HMMT February 2026 math competition benchmark, ZAYA1-8B using Markovian RSA scored 89.6, surpassing Claude 4.5 Sonnet at 88.3 and GPT-5-High on that evaluation. On AIME 2026, it scored 89.1, beating Qwen3-4B and Gemma-4-E4B models in its parameter class. These benchmark results come from Zyphra's published evaluation harness, with independent community verification ongoing.
What is Markovian RSA and why does it matter?
Markovian RSA is Zyphra's test-time compute method that lets ZAYA1-8B reason more deeply without running out of context window. The model generates multiple parallel reasoning traces, extracts only the final few thousand tokens from each, and then aggregates them into a better solution. By carrying forward only "tails" rather than full traces, the model can reason indefinitely without context overflow -- a key advantage for hard math and coding problems.