Liquid AI LFM2.5: Free Open-Weight Model Runs on Any Device
Liquid AI dropped LFM2.5-8B-A1B on May 28, 2026 -- a compact Mixture-of-Experts model that delivers surprising benchmark results from a tiny hardware footprint. Here is everything you need to know before you download it.
By Free AI News Editorial · · · 9 min read
Every week, a new AI model claims to be "the most efficient edge model ever." Most of them require a high-end workstation or a paid API to get real results. Liquid AI's LFM2.5-8B-A1B is different. Released on May 28, 2026, it activates just 1.5 billion parameters per inference pass -- meaning your laptop's CPU handles it comfortably. The weights are free on Hugging Face, fine-tuning is supported, and there are no usage fees. This guide breaks down what you actually get.
What Is Liquid AI's LFM2.5, and Why Should You Care?
Liquid AI is an MIT spin-out founded to build AI models with architectures that differ fundamentally from standard transformers. Their LFM (Liquid Foundation Model) series combines Mixture-of-Experts (MoE) routing, grouped-query attention (GQA), and gated short convolution blocks -- a hybrid that prioritizes throughput on consumer hardware over raw parameter count.
LFM2.5-8B-A1B is the follow-up to LFM2-8B-A1B (released October 2025). The "8B" refers to total parameters; the "A1B" means only about 1.5 billion are activated per forward pass. That sparsity is the key to on-device performance: the GPU or CPU only has to process a fraction of the model's full weight for each token it generates.
According to Liquid AI's official release post, this version was specifically designed for tool-calling workloads -- the kind of agentic AI tasks where a model needs to chain multiple function calls to complete a real-world job, like booking a calendar event or querying a database. The model produces a chain-of-thought reasoning trace before answering, which reduces errors on complex instructions.
What Are LFM2.5's Key Specs and Benchmarks?
Here is the technical profile of LFM2.5-8B-A1B compared to its predecessor, sourced directly from Liquid AI's release notes and independently covered by MarkTechPost:
| Spec / Benchmark | LFM2-8B-A1B (Oct 2025) | LFM2.5-8B-A1B (May 2026) |
|---|---|---|
| Total parameters | 8.3B | 8.3B |
| Active parameters per token | ~1.5B | ~1.5B |
| Context window | 32,768 tokens | 128,000 tokens |
| Training tokens | 12 trillion | 38 trillion |
| Vocabulary size | 65,536 | 128,000 |
| IFEval (instruction following) | 79.44 | 91.84 (+12.4) |
| MATH500 | 74.80 | 88.76 (+13.96) |
| AIME 2025 math competition | 20.00 | 42.53 (+22.53) |
| Tau2 Telecom (tool-calling) | 13.60 | 88.07 (+74.47) |
| Non-hallucination rate (AA-Omniscience) | 7.46% | 63.47% (+56.01) |
| BFCLv3 (function-calling) | 45.07 | 64.36 (+19.29) |
The non-hallucination rate jump -- from 7.46% to 63.47% -- is the headline number. The previous version was essentially unreliable for factual queries. LFM2.5 addresses this through a targeted reinforcement learning stage that uses an avg@k-based reward to teach the model to abstain on questions outside its reliable knowledge rather than confabulating an answer. This matters enormously for agentic pipelines where a hallucinated API call can corrupt downstream data.
How Does LFM2.5 Stack Up Against Gemma 4, Qwen3, and Mistral?
The open-weight small-model landscape in mid-2026 is crowded. Google's Gemma 4 series and Alibaba's Qwen3.6 family are the most commonly cited competitors. Here is how LFM2.5 compares:
- vs. Google Gemma 4-26B -- LFM2.5-8B-A1B matches Gemma 4-26B on IFEval instruction-following benchmarks, according to Liquid AI's data. Gemma 4-26B has more than three times LFM2.5's active parameters. The efficiency ratio here is notable: the same instruction-following quality at a fraction of the compute cost.
- vs. Alibaba Qwen3-1.7B -- Despite similar active parameter counts, LFM2.5-8B-A1B outperforms Qwen3-1.7B on most benchmarks, according to independent analysis from byteiota.com. Qwen3 has the advantage of Apache-2.0 licensing (fully open source), while LFM2.5 uses the more restrictive LFM1.0 license.
- vs. Mistral 7B class -- Liquid AI claims LFM2.5 is the fastest throughput in its size class on both CPU and GPU inference. The architecture's convolution-heavy design reduces memory bandwidth requirements compared to standard attention-only transformers, which helps on CPU-only machines.
- vs. Gemma 4-E2B -- LFM2.5 competes with Gemma 4-E2B (Google's 2B active-parameter MoE variant) on instruction following while offering a substantially wider 128K context window -- important for processing long documents or extended conversations without truncation.
Community reception on Reddit's r/LocalLLaMA has been mixed. Early testers found the model impressive on structured tool-calling tasks but noted some issues with tool use hallucinations in real-world, loosely-specified prompts. The comparison base also matters: several comments pointed out that LFM2.5 should be compared against the newest Qwen3.5 and Gemma 4 releases rather than older versions used in Liquid AI's own benchmark tables. Independent third-party benchmarks are still being published as of this writing.
How Can You Download and Run LFM2.5 for Free Right Now?
Liquid AI supports four inference backends from day one, covering the most common ways developers and hobbyists run local models. Here is the fastest path for each setup:
- llama.cpp (CPU + GPU) -- Download the GGUF quantized weights from Hugging Face under LiquidAI. The Q4_K_M quantization is the recommended balance of quality and size. Load with
./llama-cli -m LFM2.5-8B-A1B-Q4_K_M.gguf. Works on Mac, Windows, and Linux. - MLX (Apple Silicon) -- The
LFM2.5-8B-A1B-MLX-bf16model is available directly on Hugging Face and has been updated actively (as of June 2026). Run with the standardmlx_lm.generatecommand for fast inference on M-series Macs. - vLLM or SGLang (production server) -- Both frameworks have day-one support. Use for high-throughput API serving on Linux with NVIDIA or AMD GPUs. SGLang is recommended for multi-turn agentic workflows due to its KV-cache prefix caching.
- Unsloth (fine-tuning) -- Unsloth's LFM2.5 tutorial covers supervised fine-tuning with LoRA adapters. You can fine-tune on consumer hardware (a single 8GB VRAM GPU works for the 4-bit quantized version).
Liquid AI recommends these inference hyperparameters as a starting point: temperature 0.2, top_k 80, and repetition_penalty 1.05. The lower temperature reflects the model's reasoning-first design -- it works best with focused, deterministic outputs rather than high-temperature creative generation. You can also test the model directly in Liquid AI's online playground before downloading anything.
For tool-calling, LFM2.5 uses dedicated special tokens: <|tool_call_start|> and <|tool_call_end|>. Tools are provided as a JSON schema in the system prompt. Consult the Liquid AI docs for the exact prompt format -- getting the tool-call template right is critical for reliable function-calling performance.
What Use Cases Is LFM2.5 Best Suited For?
LFM2.5 was engineered for specific deployment scenarios. It is not a general-purpose replacement for GPT-4o or Claude Sonnet for every task. Where it excels:
- Offline personal AI assistant -- If you want a private assistant that runs entirely on your laptop with no internet connection and no cloud fees, LFM2.5 is now one of the strongest options in the 8B class. It handles calendar queries, document summarization, and step-by-step reasoning reliably.
- Multilingual edge applications -- The doubled vocabulary particularly benefits Hindi, Thai, Vietnamese, Indonesian, and Arabic applications. Tokenization efficiency gains of 100-238% in those languages mean cheaper inference costs when running on metered hardware.
- Agentic pipelines with structured tool use -- The Tau2 Telecom score of 88.07 (up from 13.60) signals a genuine leap in the model's ability to chain tool calls in production environments. Developers building AI agents that interact with APIs, databases, or file systems should test this model first.
- Long-document processing -- The 128K context window (4x larger than the previous version) enables processing full research papers, legal contracts, or long code repositories in a single context window without the chunking workarounds that smaller context models require.
- Edge and IoT deployment -- Liquid AI's broader product line targets wearables, robotics, and automotive use cases. LFM2.5 sits at the large end of their on-device stack but is still designed to run on NPUs and specialized silicon where full transformer models would be too slow.
For benchmarking and model comparisons against other open-weight models, the Artificial Analysis Omniscience Index provides ongoing third-party evaluations. LFM2.5 scored -24.70 on that index (higher is better, range -100 to 100) -- still negative, but a 53-point improvement over its predecessor. The index penalizes hallucinations heavily, so continued improvement in the non-hallucination rate should move this score into positive territory over time.
What Are LFM2.5's Limitations Before You Commit?
LFM2.5 is a genuinely strong model for its resource footprint, but there are real limitations worth understanding before you build on it:
- LFM1.0 license, not Apache or MIT -- "Open-weight" does not mean "open-source" here. The LFM1.0 license allows download, fine-tuning, and deployment, but Liquid AI does not release training code or data. Commercial restrictions may apply at scale. Review the full license terms before deploying in a product.
- Reasoning-only mode is always on -- Unlike models such as Qwen3 where reasoning can be toggled off, LFM2.5 always generates a chain-of-thought trace before its final answer. This adds latency and token cost for simple tasks that do not require deep reasoning.
- No larger variant currently available -- LFM2.5 is only offered in the 8B total / 1.5B active size as of June 2026. There is no 70B+ option for tasks requiring more capability. Users who need bigger model performance should compare against DeepSeek V4 or other larger open-weight models.
- Community tool-use reliability still being validated -- Early Reddit testers reported real-world tool-calling reliability varying significantly by prompt format. The benchmark numbers are strong, but community testing under diverse real-world conditions is still ongoing. Run your own evals before relying on it in production.
- Playground still cloud-based -- Testing in the official Liquid AI playground does not require a download, but it runs on Liquid AI's servers. Only the downloaded weights provide true privacy and offline capability.
🔑 Key Takeaways
- LFM2.5-8B-A1B is free to download from Hugging Face under the LFM1.0 license, with no API costs and no usage caps for personal or research use.
- Its 1.5B active parameter design makes it the most CPU-efficient 8B-class model available in June 2026, running comfortably on entry-level laptops without a dedicated GPU.
- The non-hallucination rate jumped from 7.46% to 63.47% versus the previous version, making LFM2.5 substantially more reliable for agentic tool-calling workflows.
- LFM2.5 matches Google's Gemma 4-26B on IFEval instruction-following benchmarks while activating roughly three times fewer parameters per inference pass.
- The LFM1.0 license is not Apache or MIT -- commercial use at scale requires reviewing license terms, and training code is not publicly released, so this is open-weight but not fully open-source.
Frequently Asked Questions
Is Liquid AI LFM2.5 truly free to use?
LFM2.5 weights are free to download from Hugging Face under Liquid AI's LFM1.0 license, which permits personal use, fine-tuning, and deployment without API costs. It is open-weight, not open-source -- the training code and data are not publicly released -- but there are no usage fees for the model weights themselves. Review the license before large-scale commercial deployment.
Can LFM2.5 run on a regular laptop without a GPU?
Yes. LFM2.5-8B-A1B activates only 1.5 billion parameters per token, making it one of the most CPU-efficient 8B-class models available. Liquid AI lists it as designed for entry-level laptops. The GGUF quantized versions via llama.cpp run on most modern consumer machines, including those without a dedicated GPU, though inference speed will be slower than GPU-accelerated setups.
What languages does LFM2.5 support?
LFM2.5 officially supports nine languages: Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, and Spanish. The new 128K vocabulary also delivers strong tokenization efficiency for Thai, Vietnamese, and Indonesian -- languages where the older tokenizer was notably weak -- with efficiency gains exceeding 100% in those languages.
How do I download and run LFM2.5 locally?
Download the GGUF version from the LiquidAI organization page on Hugging Face and load it with llama.cpp, Ollama, or LM Studio. For Python inference use vLLM or SGLang. Unsloth provides a fine-tuning tutorial at unsloth.ai/docs. Recommended inference settings are temperature 0.2, top_k 80, and repetition_penalty 1.05. You can also try it without downloading via the Liquid AI playground.
How does LFM2.5 compare to Gemma 4 and Qwen3?
LFM2.5-8B-A1B matches Google's Gemma 4-26B on IFEval instruction-following benchmarks while using roughly three times fewer active parameters. It outperforms Qwen3-1.7B on most benchmarks despite similar active parameter counts. Its primary advantage is throughput speed on CPU and GPU, which Liquid AI claims is fastest in the 8B active-parameter class as of June 2026.
LFM2.5 joins an increasingly crowded open-weight small model category, but its combination of efficient MoE routing, long context window, and tool-calling reliability make it worth testing for any developer building local AI applications. The non-hallucination improvements alone put it in a different tier from its predecessor. Download the GGUF file, run a few test prompts against your actual use case, and compare against the other leading open-source models we track before making it your primary model.