Mistral Small 4: Free Apache 2.0 Model With Vision (2026)

Q: What is configurable reasoning in Mistral Small 4?

Configurable reasoning lets you set a reasoning_effort parameter per API call. Set it low for fast, instant replies on simple tasks, or high for slower, step-by-step chain-of-thought on hard problems. You can also toggle reasoning mode completely on or off, which is unique among open-source models of this size.

Q: Where can I download Mistral Small 4?

The official weights are on Hugging Face at mistralai/Mistral-Small-4-119B-2603. A GGUF-quantized version for local inference is available at lmstudio-community/Mistral-Small-4-119B-2603-GGUF. You can also access the model via the Mistral API at $0.15 per million input tokens without hosting anything yourself.

Q: How does Mistral Small 4 compare to Claude Haiku and GPT-4o?

On the AA LCR logical reasoning benchmark, Mistral Small 4 outperforms GPT-OSS 120B and matches Claude Haiku 3.5 on HumanEval coding tasks. Its API pricing of $0.15 per million input tokens undercuts Claude Haiku 4.5 at $1.00 per million input tokens by roughly 6x, making it a strong cost-efficiency choice for production deployments.

Q: Can I run Mistral Small 4 locally on a consumer GPU?

Running the full 119B model locally requires significant VRAM (roughly 60-80 GB in BF16). However, 4-bit GGUF quantized versions reduce that to approximately 60-70 GB, which fits on a dual-GPU workstation. For single-GPU consumers, Mistral's API is the practical free-to-access option.

Until March 2026, running a capable open-source AI model for a real product meant juggling at least three separate deployments: a reasoning model for hard problems, a vision model for images, and a coding-focused model for software tasks. Mistral AI changed that calculus on March 16, 2026, when it released Mistral Small 4 -- a single model that handles all three, ships under the Apache 2.0 license with zero usage restrictions, and achieves throughput roughly three times higher than its predecessor at 40 percent lower inference cost. For developers, startups, and researchers who cannot afford per-token fees on proprietary APIs, Mistral Small 4 is the most practically free frontier-class model released in the first half of 2026.

Abstract visualization of a neural network representing Mistral Small 4 open-source AI architecture

What Is Mistral Small 4 and Why Does Apache 2.0 Matter?

Mistral Small 4 is the first model from Mistral AI to combine the capabilities of three previously separate product lines -- Magistral (step-by-step reasoning), Pixtral (image understanding), and Devstral (agentic software development) -- into a single unified checkpoint. The official announcement on mistral.ai describes it as "a hybrid model optimized for general chat, coding, agentic tasks, and complex reasoning" with reasoning effort configurable on a per-request basis.

The Apache 2.0 license is what separates this release from many competing open-weight models. Apache 2.0 allows anyone to use, modify, distribute, and sublicense the model for commercial products without paying royalties or seeking permission. You can fine-tune Mistral Small 4 on your own data, build a SaaS product on top of it, and charge customers -- all without owing Mistral AI anything beyond basic attribution. By contrast, Meta's Llama models carry a custom license that adds restrictions once a deployment exceeds 700 million monthly active users, and many Chinese open-weight models ship under modified licenses that prohibit certain competitive applications. Mistral Small 4's clean Apache 2.0 status makes it uniquely attractive for commercial open-source projects.

This matters especially for teams building AI-powered products that need legal clarity. When a legal team reviews a model license, Apache 2.0 is the gold standard -- familiar, battle-tested, and unambiguous. The same license governs widely deployed infrastructure software like Apache HTTP Server and Kubernetes. Using Mistral Small 4 in production carries the same licensing risk profile as adopting any well-known open-source library, which is to say, essentially none. If you want to compare licensing restrictions across current open-source models, our open-source model tracker keeps a current list of licenses and commercial restrictions.

What Are Mistral Small 4's Technical Specs?

Mistral Small 4 uses a Mixture-of-Experts (MoE) architecture with 128 experts and roughly 119 billion total parameters, but only 6 billion parameters are active during any single forward pass. This means inference cost scales with the active parameter count -- similar to a 6B dense model -- rather than the full 119B, which is why Mistral reports 40 percent faster inference and three times higher throughput compared to its predecessor Mistral Small 3.

Specification	Mistral Small 4
Total parameters	119B
Active parameters per token	~6B
Expert count	128
Context window	256,000 tokens
License	Apache 2.0
Reasoning	Configurable (on/off per request)
Vision / image input	Yes (native)
API price (input)	$0.15 / 1M tokens
API price (output)	$0.60 / 1M tokens
Release date	March 16, 2026

The 256,000-token context window is notably generous for an open-source model at this price point. For reference, that is roughly 200,000 words -- enough to process an entire novel, a large codebase, or hours of meeting transcripts in a single pass. The model also natively accepts image inputs, making it usable for document processing, screenshot analysis, and visual QA without any additional pipeline overhead.

How Does Mistral Small 4's Configurable Reasoning Actually Work?

The standout technical feature of Mistral Small 4 is its configurable reasoning system. Most models either always reason (with a fixed chain-of-thought at inference time) or never reason. Mistral Small 4 exposes a reasoning_effort parameter in the API that lets you dial up or down how much compute the model spends on intermediate thinking before producing a response.

According to the official Mistral documentation, reasoning effort is configurable on the Mistral chat completions endpoint as well as the Agents and Conversations endpoints. In practical terms this means a single deployment can handle both a quick autocomplete call (low reasoning effort, fast and cheap) and a complex multi-step problem-solving task (high reasoning effort, slower but more accurate) without switching models or maintaining two separate inference endpoints. The documentation shows a Python example using mistral-small-latest as the model identifier with reasoning_effort set at the per-call level.

On the AA LCR (Logical Consistency Reasoning) benchmark, Mistral Small 4 with reasoning enabled scores 0.72 accuracy using just 1,600 characters of output. By comparison, competing models from Qwen require 5,800 to 6,100 characters to achieve similar accuracy -- three and a half to four times more tokens, which directly multiplies output cost. The efficiency gap is significant if you are running high-volume reasoning tasks at the API level.

Developer note: You can also toggle reasoning using prompt tokens. Adding /think to a prompt enables chain-of-thought; /nothink disables it. This works in local deployments where API parameters are not available, such as when running the GGUF version through LM Studio or Ollama.

How Does Mistral Small 4 Compare to GPT-4o and Claude Haiku?

A fair comparison requires separating benchmark performance from cost, because Mistral Small 4's competitive advantage is strongest on the cost axis. According to the Testing Catalog release summary and Mistral's official model card, Mistral Small 4 API access is priced at $0.15 per million input tokens and $0.60 per million output tokens. Claude Haiku 4.5 costs $1.00 per million input tokens -- approximately 6.7 times more expensive for input. GPT-4o sits at $2.50 per million input tokens -- about 16 times more expensive.

On raw benchmark performance, the picture is more nuanced. Mistral AI's own data shows Small 4 beating OpenAI's GPT-OSS 120B on the AA LCR reasoning benchmark, which is a meaningful result given both models are in the same parameter-count neighborhood. However, Qwen 3.5 122B and Claude Haiku outperform Small 4 on LiveCodeBench, which measures code generation quality on real-world coding challenges. MindStudio's independent analysis places Mistral Small 4 "in the same tier as Claude Haiku 3.5 and Qwen 2.5 14B" on HumanEval coding tasks -- competitive, but not class-leading.

Reasoning (AA LCR): Mistral Small 4 beats GPT-OSS 120B, matches Claude Haiku 3.5 tier with 3.5-4x fewer output tokens
Coding (LiveCodeBench): Below Qwen 3.5 122B and Claude Haiku; acceptable for general coding tasks, not top-tier for competitive benchmarks
Multimodal / vision: Native image input, no direct benchmark comparison available at publication; roughly equivalent to Pixtral-class performance based on shared architecture
Cost efficiency: 6.7x cheaper than Claude Haiku 4.5 on input tokens; roughly 16x cheaper than GPT-4o; free to self-host under Apache 2.0
Context window: 256k tokens, longer than Claude Haiku 4.5 (200k) and GPT-4o (128k)

The practical takeaway is that Mistral Small 4 is not the best model on any single benchmark, but it is arguably the best model when you weight price, license freedom, context length, and multi-capability breadth simultaneously. For production applications that handle high token volumes -- customer support, document parsing, coding assistance -- the cost differential alone often justifies choosing Small 4 over Claude or GPT-4o, even if you accept slightly lower accuracy on complex reasoning tasks. You can compare current free vs. paid AI model options on our model comparison page.

Developer working with open-source AI model code on multiple monitors

How Do You Download and Run Mistral Small 4 for Free?

There are three main ways to access Mistral Small 4 without paying Mistral AI for the model weights themselves.

Option 1 -- Hugging Face (full weights): The official model card is at mistralai/Mistral-Small-4-119B-2603 on Hugging Face. You will need to install the main branch of the Transformers library (uv pip install git+https://github.com/huggingface/transformers.git) as the model requires features not yet in the stable release at time of launch. Running the full BF16 checkpoint requires approximately 238 GB of disk storage and enough VRAM or CPU RAM to load the weights -- practically speaking, this requires a multi-GPU server or a machine with 256 GB or more of system RAM.

Option 2 -- GGUF quantized (local inference): The community repository lmstudio-community/Mistral-Small-4-119B-2603-GGUF on Hugging Face provides 4-bit and 8-bit quantized versions. A 4-bit Q4_K_M quantization reduces the model to roughly 60-70 GB, which can run on a dual-GPU workstation. LM Studio and Ollama both support this GGUF format, so setup is a matter of downloading the file and pointing your preferred inference frontend at it.

Option 3 -- Mistral API (pay-per-token, no hosting required): For developers who do not want to manage GPU infrastructure, the model is available as mistral-small-latest on the Mistral API at $0.15 per million input tokens and $0.60 per million output tokens. While this is not technically free, it is the most accessible entry point and costs less than the price of a single ChatGPT Plus message at scale. A developer running 10 million input tokens per month would pay $1.50, compared to $10.00 on Claude Haiku 4.5 or $25.00 on GPT-4o.

For teams evaluating whether self-hosting or API access makes more economic sense, check out our free tier tracker which monitors pricing changes across major AI providers in real time. There is also a growing list of free-to-use hosted endpoints on OpenRouter and similar aggregators, where Mistral Small 4 has appeared in preview tiers that allow limited free calls per day.

What Are the Real-World Use Cases Where Mistral Small 4 Shines?

The unified capability profile of Mistral Small 4 makes it particularly well-suited for agentic workflows -- automated pipelines where an AI model needs to handle diverse inputs and task types without human intervention to switch models. A document processing agent, for example, might need to read a scanned PDF (vision), summarize its contents (instruction-following), extract structured data (reasoning), and then write a follow-up email draft (general generation). Mistral Small 4 handles all four steps natively.

Customer support automation is another strong fit. The 256k context window means the model can ingest a full customer conversation history, a product manual, and previous ticket resolutions in a single prompt without truncation. The configurable reasoning parameter lets support pipelines skip heavy chain-of-thought for simple FAQ responses (fast and cheap) while enabling it for complex billing disputes or technical troubleshooting (slower but more accurate).

Fine-tuning for domain-specific applications is where the Apache 2.0 license creates the largest competitive advantage. Healthcare providers, legal tech companies, and financial services firms often need models fine-tuned on proprietary or regulated data. Using a model under a permissive license eliminates the legal review cycle around redistribution and commercial deployment that can delay projects by months. Mistral AI has published fine-tuning guides on their platform, and the Hugging Face model card includes compatible LoRA adapter configurations.

The open-source AI community has already built several notable deployment setups, including a guide for running Small 4 on NVIDIA's DGX Spark hardware with SGLang for high-throughput serving. For researchers and independent developers building AI tools without enterprise GPU budgets, the GGUF path combined with a mid-range workstation provides a credible self-hosted option that was not available for models of this capability level even six months ago.

🔑 Key Takeaways

Mistral Small 4 is the first open-source model to unify reasoning, vision, and agentic coding in a single Apache 2.0 checkpoint, eliminating the need to manage three separate model deployments.
The Apache 2.0 license allows completely unrestricted commercial use, fine-tuning, and redistribution -- the most permissive licensing in the frontier open-source AI field as of mid-2026.
Configurable reasoning effort (per API call) lets developers balance response speed against accuracy dynamically, making a single deployment viable for both high-volume simple tasks and low-volume complex ones.
At $0.15 per million input tokens, Mistral Small 4 is roughly 6.7x cheaper than Claude Haiku 4.5 and 16x cheaper than GPT-4o, with competitive performance on reasoning benchmarks and a longer 256k context window.
Full weights are freely downloadable on Hugging Face; GGUF-quantized versions support local inference on workstation-class hardware, giving developers a genuine zero-cost path to production-grade open-source AI.

Related Resources

In-depth reviews of AI tools See how the tools behind the headlines actually perform.
AI tools by profession and use case Find the right tool for what you actually do.
AI scam prevention and alerts Stay safe while exploring new AI tools.

Frequently Asked Questions

Is Mistral Small 4 really free to use commercially?

Yes. Mistral Small 4 is released under the Apache 2.0 license, which permits free commercial use, modification, and redistribution without royalties. You can fine-tune it, deploy it in a product, and charge customers without any licensing fees or restrictions from Mistral AI. Attribution in your documentation is the primary requirement.

How many parameters does Mistral Small 4 have?

Mistral Small 4 is a 119-billion-parameter Mixture-of-Experts model with 128 experts and 6 billion active parameters per forward pass. Because MoE only activates a fraction of weights per token, inference costs track the active 6B count rather than the full 119B, keeping latency and GPU memory usage much lower than the total parameter count suggests.

What is configurable reasoning in Mistral Small 4?

Configurable reasoning lets you set a reasoning_effort parameter per API call -- low for fast instant replies on simple tasks, high for step-by-step chain-of-thought on hard problems. You can also use /think and /nothink tokens in the prompt when using local GGUF inference. This is unique among open-source models of this size class in 2026.

Where can I download Mistral Small 4?

The official weights are on Hugging Face at mistralai/Mistral-Small-4-119B-2603. A GGUF-quantized version for local inference is at lmstudio-community/Mistral-Small-4-119B-2603-GGUF. You can also access the model via the Mistral API using the identifier mistral-small-latest at $0.15 per million input tokens without hosting anything yourself.

How does Mistral Small 4 compare to Claude Haiku and GPT-4o?

On reasoning benchmarks (AA LCR), Small 4 outperforms GPT-OSS 120B and produces results 3.5-4x more token-efficiently than comparable Qwen models. Its API pricing at $0.15/M input tokens undercuts Claude Haiku 4.5 ($1.00/M) by roughly 6.7x. It trails on LiveCodeBench versus Qwen 3.5 122B and Claude Haiku but compensates with its permissive license and lower total cost of ownership.

Can I run Mistral Small 4 locally on a consumer GPU?

The full 119B model in BF16 requires roughly 238 GB of storage and significant VRAM -- practical only on multi-GPU servers. However, 4-bit GGUF quantized versions reduce requirements to approximately 60-70 GB, fitting on a dual-GPU workstation. Single-GPU consumer users are better served by the Mistral API or by waiting for smaller distilled variants that typically follow major releases.

Mistral Small 4 represents the clearest example yet of open-source AI closing the capability gap with proprietary models on a per-dollar basis. It is not the single best model in any one category, but for production deployments that need multi-modal breadth, long context, and commercial licensing clarity without a per-token bill that scales dangerously with volume, it is the most compelling free option available as of mid-2026. For the latest model releases and free tier changes, follow our AI model news feed or sign up for alerts below.

Browse Open Source Models → Compare Free vs Paid

What Is Mistral Small 4 and Why Does Apache 2.0 Matter?

What Are Mistral Small 4's Technical Specs?

How Does Mistral Small 4's Configurable Reasoning Actually Work?

How Does Mistral Small 4 Compare to GPT-4o and Claude Haiku?

How Do You Download and Run Mistral Small 4 for Free?

What Are the Real-World Use Cases Where Mistral Small 4 Shines?

🔑 Key Takeaways

Related Resources

Frequently Asked Questions

Is Mistral Small 4 really free to use commercially?

How many parameters does Mistral Small 4 have?

What is configurable reasoning in Mistral Small 4?

Where can I download Mistral Small 4?

How does Mistral Small 4 compare to Claude Haiku and GPT-4o?

Can I run Mistral Small 4 locally on a consumer GPU?

🔔 Get Free AI Alerts First

Related Resources