OpenAI Releases Open-Weight Models: gpt-oss-120b and gpt-oss-20b Explained
For the first time in five years, OpenAI has released model weights the public can download and run. Here is everything you need to know about gpt-oss-20b and gpt-oss-120b -- what they are, how they work, and how to run them locally today.
By Free AI News Editorial · · · 9 min read
In May 2026, OpenAI did something it had not done in approximately five years: it released model weights. The gpt-oss family -- short for GPT Open Source -- consists of two Mixture-of-Experts reasoning models that anyone can download, modify, and deploy. The announcement marks a significant shift for a company that has been criticized for moving away from its original open-research mission. Whether this is genuine openness or strategic positioning, the practical result is the same: you can now run a frontier-class OpenAI model on your own hardware, free of charge, without rate limits or API costs.
The gpt-oss models land in an open-weight landscape that has never been more competitive. Meta, Mistral, Alibaba's Qwen team, DeepSeek, Moonshot (Kimi), and MiniMax have all shipped powerful open-weight models in 2026. OpenAI's entry changes the competitive calculus: it brings frontier training expertise and a massive RLHF pipeline to the open-weight tier for the first time.
What Are the OpenAI gpt-oss Models?
According to OpenAI's official GitHub repository, gpt-oss is a family of two open-weight reasoning models designed for different deployment contexts. The larger model, gpt-oss-120b, targets production and cloud workloads where a single high-end GPU is available. The smaller model, gpt-oss-20b, is built for local deployment, edge inference, and lower-latency use cases where consumer hardware is the constraint.
OpenAI describes both as "reasoning models" -- meaning they are optimized for multi-step problem-solving, coding tasks, and analytical work rather than pure conversational fluency. The model card notes they do not use explicit chain-of-thought reasoning at inference time, which keeps response speed high compared to o-series thinking models.
| Specification | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Total Parameters | 117B | 21B |
| Active Parameters (per token) | 5.1B | 3.6B |
| Architecture | Mixture-of-Experts (MoE) | Mixture-of-Experts (MoE) |
| Context Window | 128K tokens (131,072) | 128K tokens |
| Min. VRAM (MXFP4) | ~80 GB (single H100) | ~16 GB |
| Quantization Format | MXFP4 (native) | MXFP4 (native) |
| Available On | HuggingFace, Azure, AWS Bedrock, NVIDIA | HuggingFace, Ollama, Jan AI, vLLM |
| Release Date | May 12, 2026 | May 12, 2026 |
How Does the gpt-oss Architecture Work?
Both models use a Mixture-of-Experts (MoE) architecture, which explains how gpt-oss-120b can have 117 billion total parameters while only activating 5.1 billion of them for any given token. A gating router examines each token and directs it to the top-4 most relevant "expert" feedforward layers out of the full pool. This means the model has the knowledge capacity of a 120B dense model but the compute cost of roughly a 5B dense model per forward pass -- a major efficiency win.
According to Progressive Robot's detailed architectural analysis, the MoE blocks use a Gated SwiGLU activation function with an unusual modification: OpenAI added both clamping and a residual connection to the standard SwiGLU implementation. This likely smooths optimization during training at scale and accelerates convergence in large transformer architectures.
The attention mechanism alternates between two modes: Grouped Query Attention (GQA), which reduces the key-value cache memory footprint, and Sliding Window Attention (SWA), which handles long-range context efficiently. The model uses 8 key-value heads and includes a learned bias in the softmax denominator -- a design choice similar to off-by-one attention that helps with numerical stability.
Context length is extended to 131,072 tokens (128K) using YaRN (Yet Another RoPE-scaling method), which extends the position encoding range of the base model without full retraining. Rotary Position Embeddings (RoPE) encode each token's absolute position as a rotation of the query and key vectors, giving the model order-awareness. Attention Sinks -- special tokens at the start of each sequence -- stabilize attention across very long contexts.
One of the most significant architectural choices is native MXFP4 quantization. MXFP4 (Microscaling FP4) is a new numeric format that compresses model weights to 4-bit precision using shared microscale factors, preserving more accuracy than older INT4 quantization. The gpt-oss MoE layers were trained natively in MXFP4 -- not post-quantized -- which means there is minimal quality degradation compared to full-precision inference. This is the reason gpt-oss-120b fits on a single H100 GPU and gpt-oss-20b runs in 16 GB of VRAM. Note that MXFP4 requires Ada Lovelace GPU architecture or newer (RTX 4000 series / H100 / H200 / B200); older Ampere and Turing cards are not natively supported, though GGUF versions via Unsloth work on a broader range of hardware through software emulation.
What Hardware Do You Need to Run gpt-oss Locally?
Hardware requirements depend heavily on which model and which runtime you use. The open-source AI guide on this site covers general local LLM hardware guidance, but here are the specifics for gpt-oss:
- gpt-oss-20b on Ollama (native MXFP4): approximately 14 GB of VRAM. An NVIDIA RTX 4090 (24 GB) or RTX 5090 runs it comfortably. AMD RDNA 3/4 cards via ROCm 6.x also work. This is currently the most accessible path for consumer hardware users.
- gpt-oss-20b via GGUF (Unsloth): runs on 16 GB VRAM or can be partially offloaded to system RAM. Works on older Ampere cards (RTX 3090, A6000) and even Mac M-series with Metal via llama.cpp. Expect lower tokens per second than the native MXFP4 path.
- gpt-oss-120b on a single H100 (MXFP4 + vLLM): requires ~80 GB HBM. As documented by Spheron Network's inference guide, you can run it with vLLM using the command:
--model openai/gpt-oss-120b --quantization mxfp4 --gpu-memory-utilization 0.92. Renting an H100 SXM5 on Lambda Labs or Vast.ai costs roughly $2 to $3 per hour in June 2026. - gpt-oss-120b via cloud API: available on Azure OpenAI Service and AWS Bedrock. No local hardware required. Microsoft describes it as combining "frontier reasoning skills with enterprise-grade flexibility."
For most individual developers and hobbyists, gpt-oss-20b is the practical choice. The 16 GB VRAM threshold is attainable on a midrange professional workstation today, and the model delivers reasoning quality that substantially exceeds older 7B to 13B open-weight models.
How Does gpt-oss Compare to Other Open-Weight Models in 2026?
The open-weight landscape in 2026 is genuinely crowded with frontier-quality models. You can read our full guide to the top open-weight LLMs, but here is how gpt-oss stacks up against the key competitors:
- vs. Kimi K2.6 (Moonshot AI): Kimi K2.6 tops the Artificial Analysis open-weight index at 54, ranking 4th overall including closed models. It uses a 1 trillion total parameter MoE with 32B active parameters under a Modified MIT License. gpt-oss-120b has a much lower active-parameter count (5.1B), prioritizing inference efficiency over raw benchmark throughput.
- vs. DeepSeek V4 Pro: DeepSeek V4 Pro leads on agentic coding benchmarks like SWE-Bench and competes with closed frontier models. DeepSeek made its standard pricing permanent at $0.435 per million input tokens. gpt-oss is free to run locally, eliminating per-token costs entirely for self-hosted deployments.
- vs. Llama 4 Scout / Maverick (Meta): Llama 4 Scout runs on consumer hardware but at smaller scale. Meta's Llama license permits commercial use. gpt-oss does not yet carry an explicit Apache or MIT tag, but the weights are downloadable for research and personal use. OpenAI's commercial terms for gpt-oss usage-at-scale should be reviewed at the GitHub repository before deploying in production.
- vs. Mistral Small 4 (Apache 2.0): Mistral's models carry a clean Apache 2.0 license, the gold standard for commercial open-weight use. If license clarity is your primary concern, Mistral remains the safer choice. If performance-per-GPU-hour matters more, gpt-oss-20b is a serious contender in the same VRAM range.
- vs. Gemma 4 (Google): Gemma 4 is optimized for Google hardware and Vertex AI. gpt-oss runs better on NVIDIA hardware given the MXFP4 training format and NVIDIA's optimizations for it via TensorRT-LLM and Eagle speculative decoding.
NVIDIA has published optimized inference support for gpt-oss-120b, including an Eagle3 long-context variant on NVIDIA's developer platform that uses speculative decoding to improve throughput substantially on Blackwell architecture GPUs.
Where Can You Download and Run gpt-oss?
The model weights are available across multiple channels. The fastest way to get started depends on your hardware and preferred runtime:
- Hugging Face (official): The OpenAI organization page hosts both models. Community quantizations from Unsloth are at unsloth/gpt-oss-120b-GGUF and unsloth/gpt-oss-20b-GGUF, covering Q4_K_M through 1-bit quantizations. Download via
huggingface-cli download openai/gpt-oss-20b. - Ollama: Ollama natively supports the MXFP4 format with new kernels co-developed with OpenAI. Pull with
ollama pull gpt-oss:20b. Inference speed benchmarks show 6+ tokens per second on RTX 4090 in MXFP4 mode. An older GPU falls back to GGUF automatically. - Jan AI: Jan AI provides a GUI-based beginner path. Their 5-minute setup guide covers downloading and running gpt-oss-20b locally with no command line required.
- vLLM (production servers): The recommended path for multi-user deployments. The Unsloth documentation provides detailed vLLM configuration flags, including memory utilization settings and MXFP4 activation flags.
- Cloud APIs: gpt-oss-120b is available on Azure OpenAI Service and AWS Bedrock (model ID: openai.gpt-oss-120b). This path requires no local hardware and gives you managed scalability with pay-per-token pricing.
- GitHub inference code: The openai/gpt-oss GitHub repository contains the model architecture code in Python/Triton, including the MoE routing logic, attention implementation, and reference benchmarks. Useful for researchers wanting to inspect or modify the inference code.
Why Did OpenAI Release Open-Weight Models After Five Years?
OpenAI last released public model weights with GPT-2 in 2019. In the intervening years, the company shifted to a closed API-only model for GPT-3, GPT-4, and the GPT-5 family. The gpt-oss release breaks that pattern, and the timing suggests several converging pressures.
First, the competitive context has changed dramatically. Meta's Llama models have proven that open-weight releases drive developer ecosystem adoption at scale. Developers who build on Llama or Mistral build tooling, integrations, and fine-tunes around those models -- creating a gravitational pull toward those ecosystems. OpenAI's absence from the open-weight tier has cost it developer mindshare even as its API revenue has grown.
Second, NVIDIA's investment in gpt-oss optimization -- including 10x inference performance improvements on Blackwell GPUs using TensorRT-LLM and Eagle speculative decoding -- suggests a coordinated go-to-market push. When a hardware partner optimizes specifically for your model at launch, it is rarely spontaneous. The gpt-oss release appears to be part of a broader NVIDIA-OpenAI alignment around on-premise enterprise AI deployment, where enterprises want frontier-quality models they can run in their own data centers without API dependencies.
Third, the specific models released -- a 20B model that runs on consumer hardware -- signal a deliberate effort to reach the free-tier developer audience. A developer who fine-tunes gpt-oss-20b on their local machine and builds a product around it is more likely to graduate to OpenAI's paid API tier at scale than a developer who has spent two years in the Llama ecosystem. The open-weight release is arguably a top-of-funnel strategy as much as an openness gesture.
Regardless of the motivation, the practical effect is substantial. Developers with a 16 GB GPU can now run an OpenAI-trained model locally, fine-tune it with Unsloth for specific domains, and deploy it without paying per-token API costs. That is a genuine and meaningful change in what is accessible for free.
🔑 Key Takeaways
- OpenAI released gpt-oss-120b and gpt-oss-20b on May 12, 2026 -- its first public model weights since GPT-2 in 2019, marking a major shift in the company's openness stance.
- Both models use Mixture-of-Experts (MoE) architecture with native MXFP4 quantization, allowing gpt-oss-20b to run in just 16 GB of VRAM and gpt-oss-120b to fit on a single H100 GPU.
- The weights are free to download from Hugging Face and can be run via Ollama, vLLM, Jan AI, or llama.cpp, with no API costs for self-hosted deployments.
- NVIDIA has optimized both models for Blackwell GPU architecture, delivering up to 10x inference performance improvement over baseline, and offers an Eagle3 long-context variant of gpt-oss-120b.
- The release appears strategically timed to compete with Llama 4, Kimi K2.6, and DeepSeek V4 for developer ecosystem mindshare, with the 20B model specifically targeting developers who build locally before scaling to cloud APIs.
Frequently Asked Questions
What is gpt-oss from OpenAI?
gpt-oss is a family of two open-weight reasoning models released by OpenAI in May 2026: gpt-oss-120b for production workloads and gpt-oss-20b for local and edge use. Both use a Mixture-of-Experts (MoE) architecture, can be downloaded from Hugging Face, and are available through Ollama, vLLM, and major cloud APIs including Azure OpenAI and AWS Bedrock.
Can you run gpt-oss on a consumer GPU?
Yes. The gpt-oss-20b model requires approximately 16 GB of VRAM using native MXFP4 quantization, putting it within reach of the NVIDIA RTX 4090 or RTX 5090. The larger gpt-oss-120b needs an H100. For older hardware without MXFP4 support, quantized GGUF versions from Unsloth run on RTX 3090-class cards and Mac M-series chips via llama.cpp, though at reduced speed.
How does gpt-oss compare to Llama 4 and DeepSeek V4?
All three use MoE architecture. Kimi K2.6 currently tops the open-weight benchmark index, while DeepSeek V4 Pro leads on agentic coding tasks. gpt-oss-120b is optimized for single-GPU inference using MXFP4, giving it a practical deployment edge over models that require multi-GPU setups. The gpt-oss-20b runs in a similar VRAM footprint to Llama 4 Scout but benefits from OpenAI's RLHF training pipeline.
Is gpt-oss truly open source?
gpt-oss is open-weight, not fully open source. The model weights are publicly available for download and local deployment. OpenAI has released the inference architecture code on GitHub. However, the full training dataset, data curation process, and RLHF details have not been disclosed. For a fully open license like Apache 2.0, Mistral Small 4 remains the cleaner choice for commercial applications.
Where can I download the gpt-oss model weights?
Weights are available from the OpenAI organization on Hugging Face, and community GGUF quantizations are at unsloth/gpt-oss-120b-GGUF and unsloth/gpt-oss-20b-GGUF. You can also pull via Ollama using "ollama pull gpt-oss:20b". The GitHub repository at github.com/openai/gpt-oss contains the inference code. Cloud access via Azure OpenAI and AWS Bedrock requires no local download.
When was gpt-oss released?
OpenAI released both gpt-oss-120b and gpt-oss-20b on May 12, 2026. The release was announced via OpenAI's model release notes and the openai/gpt-oss GitHub repository. NVIDIA published simultaneous optimizations for Blackwell GPU architecture, suggesting the launch was coordinated with hardware partners.