๐Ÿค– Open Source AI

Cohere Command A+: Free 218B MoE Model Running on 2 H100s

Cohere just made its most capable model fully open source under Apache 2.0. Here is what makes Command A+ different from every other big open-weight release this year.

By Free AI News Editorial ยท ยท ยท 9 min read

Quick Answer: Cohere Command A+ is a 218-billion-parameter MoE model released May 20, 2026 under Apache 2.0. Only 25B parameters activate at inference, letting it run on two H100 GPUs at 375 tokens per second. It unifies reasoning, vision, tool use, and multilingual support in one freely downloadable model.

Most "open" model releases come with caveats: non-commercial clauses, capability-limited checkpoints, or hardware requirements that push you toward a vendor's cloud. When Cohere dropped Command A+ on May 20, 2026, it did something notably different. The company founded by Aidan Gomez โ€” co-author of the landmark "Attention Is All You Need" paper โ€” published a 218-billion-parameter model under a full Apache 2.0 license, complete with near-lossless quantization that compresses it down to a two-GPU footprint. No commercial restrictions. No watered-down weights. Just a frontier-class agentic LLM you can download, modify, and run on your own hardware today.

Server infrastructure representing open-source AI model deployment

What exactly is Cohere Command A+?

Command A+ is the consolidation of four prior Cohere models โ€” Command A, Command A Reasoning, Command A Vision, and Command A Translate โ€” into a single, unified checkpoint. Rather than maintaining a separate model for each capability, Cohere trained a single Sparse Mixture-of-Experts architecture that handles all of them simultaneously. The model carries 218 billion total parameters but uses a router to activate only 25 billion of them per forward pass โ€” a design borrowed from Google's Gemini and Meta's mixture-of-experts work, but applied here to an enterprise-focused, fully open-weight release.

The official Cohere announcement describes Command A+ as the engine powering North, its enterprise workspace product, which has been running in production with large customers for over a year. That deployment history matters: the benchmarks on paper are backed by real-world usage data across complex enterprise workflows including RAG pipelines, MCP-connected file systems, and long-horizon agentic chains. According to the Cohere release blog, this is the first time the company has released a fully open-weight flagship model under Apache 2.0.

How does the architecture make it so efficient?

Command A+ is a decoder-only Sparse MoE Transformer. It has 128 expert sub-networks per layer, of which 8 are activated per token, plus a single shared expert that processes every token regardless of routing. This means that while the model "knows" a great deal โ€” embedded in all 218B parameters โ€” any given inference step touches only the 25B-parameter slice relevant to that query. The attention layers alternate between sliding-window attention (with Rotary Positional Embeddings) and global attention in a 3:1 ratio, keeping memory usage manageable even at 128K input context.

The architecture is trained fully dropless, meaning all expert capacity is always available rather than being dropped during training for speed. This tends to produce better expert specialization and more consistent inference quality at low batch sizes โ€” a meaningful advantage for teams self-hosting for internal tools rather than serving millions of requests.

What hardware options are available for running Command A+?

Cohere ships three quantization variants with different GPU requirements, all with negligible quality differences according to benchmark comparisons published by the company. W4A4 is their recommendation for most deployments:

Quantization Minimum GPU (NVIDIA) Notes
W4A4 (4-bit) 1x B200 or 2x H100 80GB 375 t/s output, 113ms TTFT โ€” recommended
FP8 (8-bit) 2x B200 or 4x H100 80GB Higher precision, doubles GPU count
BF16 (16-bit) 4x B200 or 8x H100 80GB Full precision, maximum GPU requirement

The W4A4 quantization is technically the most interesting part of this release. Reasoning models normally pay an outsized "quantization tax" โ€” compressing weights to 4-bit typically causes visible regressions on complex problem-solving because long decoding traces compound per-token errors. Cohere solved this by applying NVFP4 W4A4 only to the MoE expert layers while keeping the attention pathway (Q, K, V, O projections, KV cache, and attention compute) at full BF16 precision. They then closed the remaining quality gap using Quantization-Aware Distillation (QAD): the quantized student model was trained against the full-precision teacher using fake quantization operators in the forward pass and straight-through estimators during backpropagation. The result is a compression that MarkTechPost describes as "nearly lossless" across all tested benchmarks.

How does Command A+ perform compared to previous Cohere models?

The gains over the previous Command A Reasoning model are substantial on agentic tasks, which is the intended use case. On tau-squared-Bench Telecom โ€” a benchmark testing multi-step agentic reasoning through realistic enterprise scenarios โ€” scores improved from 37% to 85%. On Terminal-Bench Hard, which measures real coding agent performance in a sandboxed terminal, scores jumped from 3% to 25%. These are not marginal improvements; they indicate a qualitative step change in the model's ability to plan and execute multi-tool workflows autonomously.

Cohere also measured three internal benchmarks on the North platform, all scored using LLM-as-a-judge techniques. Agentic Question Answering accuracy โ€” answering enterprise questions via MCP-connected cloud file systems โ€” improved 20% over Command A Reasoning. Spreadsheet analysis quality jumped 32%. Memory Usage Quality, which tests how well the model carries context from previous agent sessions into new ones, scored 54% compared to 39% for Command A Reasoning. For multimodal tasks, MathVista improved from 73.5% to 80.6% and MMMU reached 75.1%. For the full benchmark detail, see this detailed technical breakdown.

Abstract visualization representing AI neural network architecture

What capabilities does Command A+ include out of the box?

Command A+ supports text, image, and tool-use inputs with text, reasoning chains, and tool-use outputs. The 128K input context and 64K max generation length are generous enough for most enterprise document workflows. Here is what comes with the model by default:

The native citation capability deserves particular mention. Most retrieval-augmented generation setups require a separate post-processing step to attribute generated text back to source chunks. Command A+ generates inline citations as part of its output, which simplifies RAG pipeline architecture and reduces the chance of unattributed hallucination slipping through quality gates. This was a core feature of the North enterprise product and is now available to any developer self-hosting the open weights.

How do you actually deploy Command A+ yourself?

The weights are available on Hugging Face at CohereLabs/command-a-plus-05-2026-w4a4. Cohere recommends vLLM for production inference. The minimum vLLM version is 0.21.0, and you also need the cohere_melody library (version 0.9.0 or later) for tool-call and reasoning span parsing. The basic serve command for a two-H100 deployment looks like this:

vllm serve CohereLabs/command-a-plus-05-2026-w4a4 \
  -tp 2 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice

Cohere recommends sampling parameters of temperature 0.9, top_p 0.95, and repetition_penalty 1.04 for generation quality. For teams that prefer managed infrastructure, the model is also available through Cohere's Model Vault dashboard with dedicated GPU capacity. The Hugging Face Inference Endpoints path is also viable โ€” select the w4a4 repo, attach two H100s, and you typically have a private endpoint running within about 30 minutes. Compare this to running other large open-source models and see what fits your workflow at our free vs paid AI compare guide.

Why does Apache 2.0 matter more than other open licenses?

The AI open-source ecosystem has a licensing problem. Many models marketed as "open" carry restrictions that prevent commercial use, require attribution in user interfaces, or limit modification and redistribution. Meta's Llama license, for example, restricts use by services with over 700 million monthly active users. Mistral models often carry community licenses with carve-outs. The Apache 2.0 license that Command A+ ships under has none of these restrictions: you can use it commercially, build products on top of it, fine-tune and redistribute modified versions, and run it in private environments with no obligation to Cohere whatsoever.

This is what Cohere's CEO Aidan Gomez described as "sovereign AI" โ€” the ability for enterprises, governments, and developers to run AI entirely within their own secure environments without dependence on any vendor's infrastructure, pricing, or uptime. For organizations in regulated industries (healthcare, finance, government) where data residency and audit trails are non-negotiable, this licensing posture is often the difference between being able to use a model at all and not. If you want to explore what other open models are freely available right now, our open-source AI tracker covers the full landscape.

๐Ÿ”‘ Key Takeaways

  • Command A+ packs 218B total parameters but activates only 25B per inference step, allowing it to run on two NVIDIA H100 GPUs at W4A4 quantization without meaningful quality loss.
  • Apache 2.0 licensing makes Command A+ genuinely free for commercial use, redistribution, and modification โ€” a first for any Cohere flagship model and a strong statement about the "sovereign AI" movement.
  • Cohere solved the quantization quality problem for reasoning models by applying 4-bit compression only to MoE expert layers while keeping attention at full precision, validated by lossless benchmark parity.
  • Agentic performance improved dramatically over the prior Command A Reasoning model, with tau-squared-Bench Telecom jumping from 37% to 85% โ€” reflecting real production hardening from running North with enterprise customers.
  • Command A+ ships with native citation generation built into the output pipeline, simplifying RAG architectures by eliminating the post-processing step needed to attribute claims back to source documents.

Frequently Asked Questions

What is Cohere Command A+?

Command A+ is a 218-billion-parameter Sparse Mixture-of-Experts language model released by Cohere on May 20, 2026. It is available under Apache 2.0, supports text, image, and tool-use inputs with a 128K context window, and runs on just two NVIDIA H100 GPUs at 4-bit quantization. It unifies four prior Cohere models into a single, freely downloadable checkpoint.

How many GPUs does Command A+ require?

At W4A4 (4-bit) quantization, Command A+ runs on a single NVIDIA Blackwell B200 GPU or two NVIDIA H100 80GB GPUs. The FP8 variant requires 2x B200 or 4x H100, while the full BF16 version needs 4x B200 or 8x H100. Cohere recommends W4A4 for most deployments because quality differences between quantization levels are negligible on standard benchmarks.

Is Command A+ truly free to use commercially?

Yes. Command A+ is released under Apache 2.0, which permits unrestricted commercial use, modification, and redistribution. Weights are freely available on Hugging Face. This is Cohere's first fully open-weight flagship model and a clear break from its historically proprietary approach to enterprise AI.

How does Command A+ compare to GPT-5.5 and Claude?

Command A+ is optimized for agentic and enterprise workflows rather than general-purpose frontier tasks. It excels on tau-squared-Bench Telecom (85%), enterprise RAG, and multilingual work across 48 languages. Closed frontier models from OpenAI and Anthropic still hold advantages in general reasoning and creative tasks, but for self-hosted enterprise workflows the efficiency and licensing advantages of Command A+ are compelling.

What inference frameworks support Command A+?

Command A+ has day-one support for vLLM (0.21.0+) and Hugging Face Transformers. The recommended vLLM configuration uses the cohere_command4 tool-call and reasoning parsers. Cohere also offers managed inference via its Model Vault dashboard. Hugging Face Inference Endpoints is another quick deployment path, with a private endpoint typically running within 30 minutes on a two-H100 setup.

Browse All Open Source Models โ†’ Compare Free vs Paid AI

๐Ÿ”” Get Free AI Alerts First

When a model goes free, a paywall drops, or a deal appears โ€” you'll know before everyone else. No spam, just signal.