
OpenAI Releases Open-Weight Models: gpt-oss-120b and gpt-oss-20b Explained

For the first time since GPT-2 in 2019, OpenAI has released model weights the public can download and run. Here is everything you need to know about gpt-oss-20b and gpt-oss-120b -- what they are, how they work, and how to run them locally today.

By Free AI News Editorial · 9 min read

Quick Answer: OpenAI released gpt-oss-120b and gpt-oss-20b on May 12, 2026 -- its first open-weight models since GPT-2 in 2019. Both use Mixture-of-Experts architecture, weights are free to download on Hugging Face, and the smaller 20b model runs in just 16 GB of VRAM on a consumer GPU.

In May 2026, OpenAI did something it had not done since GPT-2 in 2019: it released model weights. The gpt-oss family -- short for GPT Open Source -- consists of two Mixture-of-Experts reasoning models that anyone can download, modify, and deploy. The announcement marks a significant shift for a company that has been criticized for moving away from its original open-research mission. Whether this is genuine openness or strategic positioning, the practical result is the same: you can now run a frontier-class OpenAI model on your own hardware, free of charge, without rate limits or API costs.

The gpt-oss models land in an open-weight landscape that has never been more competitive. Meta, Mistral, Alibaba's Qwen team, DeepSeek, Moonshot (Kimi), and MiniMax have all shipped powerful open-weight models in 2026. OpenAI's entry changes the competitive calculus: it brings frontier training expertise and a massive RLHF pipeline to the open-weight tier for the first time.

What Are the OpenAI gpt-oss Models?

According to OpenAI's official GitHub repository, gpt-oss is a family of two open-weight reasoning models designed for different deployment contexts. The larger model, gpt-oss-120b, targets production and cloud workloads where a single high-end GPU is available. The smaller model, gpt-oss-20b, is built for local deployment, edge inference, and lower-latency use cases where consumer hardware is the constraint.

OpenAI describes both as "reasoning models" -- meaning they are optimized for multi-step problem-solving, coding tasks, and analytical work rather than pure conversational fluency. Both models reason step by step before answering, and OpenAI's documentation describes a configurable reasoning effort (low, medium, or high), so shorter reasoning can be traded for faster responses.

Specification                 | gpt-oss-120b                             | gpt-oss-20b
------------------------------|------------------------------------------|-------------------------------------
Total Parameters              | 117B                                     | 21B
Active Parameters (per token) | 5.1B                                     | 3.6B
Architecture                  | Mixture-of-Experts (MoE)                 | Mixture-of-Experts (MoE)
Context Window                | 128K tokens (131,072)                    | 128K tokens
Min. VRAM (MXFP4)             | ~80 GB (single H100)                     | ~16 GB
Quantization Format           | MXFP4 (native)                           | MXFP4 (native)
Available On                  | Hugging Face, Azure, AWS Bedrock, NVIDIA | Hugging Face, Ollama, Jan AI, vLLM
Release Date                  | May 12, 2026                             | May 12, 2026

How Does the gpt-oss Architecture Work?

Both models use a Mixture-of-Experts (MoE) architecture, which explains how gpt-oss-120b can have 117 billion total parameters while only activating 5.1 billion of them for any given token. A gating router examines each token and directs it to the top-4 most relevant "expert" feedforward layers out of the full pool. Only about 4% of the weights (5.1B of 117B) do work on any one token, so the model keeps the parameter capacity of a much larger network at roughly the per-token compute cost of a 5B dense model -- a major efficiency win.
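
To make the routing step concrete, here is a minimal sketch of top-4 gating in plain Python. It is an illustration under assumptions, not OpenAI's implementation: the eight toy experts, the router logits, and the function names `top_k_route` and `moe_layer` are all hypothetical, and real routers operate on learned tensors rather than hand-written lists.

```python
import math

def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts, k=4):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy setup (assumption): 8 experts, each of which just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]

routes = top_k_route(router_logits, k=4)
print([i for i, _ in routes])  # [4, 1, 6, 3]
print(moe_layer([1.0, 2.0], router_logits, experts))
```

The saving comes from the loop in `moe_layer`: only four expert functions run per token, no matter how many experts exist in the pool.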

According to Progressive Robot's detailed architectural analysis, the MoE blocks use a Gated SwiGLU activation function with an unusual modification: OpenAI added both clamping and a residual connection to the standard SwiGLU implementation. This likely smooths optimization during training at scale and accelerates convergence in large transformer architectures.

The attention layers alternate between two patterns: full (dense) attention and Sliding Window Attention (SWA), which restricts each token to a fixed window of recent tokens to keep long-context cost down. Grouped Query Attention (GQA) with 8 key-value heads reduces the key-value cache memory footprint. The softmax denominator also includes a learned bias -- a design choice similar to off-by-one attention that lets a head assign little weight to every token and helps with numerical stability.
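
Two of these ideas are small enough to sketch directly. The snippet below is a simplified illustration, assuming a single attention head and plain Python lists; the names `softmax_with_sink` and `sliding_window_mask` are hypothetical, and the sink value is passed in by hand rather than learned.

```python
import math

def softmax_with_sink(scores, sink_logit):
    """Softmax whose denominator includes one extra logit (the bias).

    The extra term absorbs probability mass, so the returned weights sum
    to less than 1 and a head can effectively attend to "nothing".
    """
    peak = max(max(scores), sink_logit)
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]

def sliding_window_mask(seq_len, window):
    """Causal mask: position i may see itself and the previous window-1 positions."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
print(weights)       # each weight is 1/3, so the total is 2/3, not 1
print(sum(weights))

for row in sliding_window_mask(seq_len=4, window=2):
    print(row)
```

With a window of 2, the last row of the mask allows only positions 2 and 3, which is why sliding-window layers keep attention cost roughly constant per token as the context grows.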

Context length is extended to 131,072 tokens (128K) using YaRN (Yet another RoPE extensioN), which extends the position encoding range of the base model without full retraining. Rotary Position Embeddings (RoPE) encode each token's position as a rotation of the query and key vectors, so the attention score between two tokens depends on their relative distance. Attention Sinks -- special tokens at the start of each sequence -- stabilize attention across very long contexts.
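
The rotation idea behind RoPE can be shown in a few lines. This is a minimal sketch of standard RoPE only, assuming the common base of 10000 and an interleaved pairing of dimensions; it does not implement YaRN's rescaling, and the `rope` helper and the example vectors are made up for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

# Same relative offset (3 tokens apart) at two different absolute positions.
score_a = dot(rope(q, 10), rope(k, 7))
score_b = dot(rope(q, 103), rope(k, 100))
print(score_a, score_b)  # equal up to floating-point rounding
```

Because the two scores match, the attention score depends only on how far apart the tokens are. Context-extension methods such as YaRN work by adjusting the rotation frequencies so that longer distances stay within the range the model saw during training.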

One of the most significant architectural choices is native MXFP4 quantization. MXFP4 (Microscaling FP4) is a new numeric format that compresses model weights to 4-bit precision using shared microscale factors, preserving more accuracy than older INT4 quantization. The gpt-oss MoE layers were trained natively in MXFP4 -- not post-quantized -- which means there is minimal quality degradation compared to full-precision inference. This is the reason gpt-oss-120b fits on a single H100 GPU and gpt-oss-20b runs in 16 GB of VRAM. Note that MXFP4 requires Ada Lovelace GPU architecture or newer (RTX 4000 series / H100 / H200 / B200); older Ampere and Turing cards are not natively supported, though GGUF versions via Unsloth work on a broader range of hardware through software emulation.
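
As a rough illustration of what "shared microscale factors" means, the sketch below quantizes one block of 32 values to a 4-bit grid with a single power-of-two scale. It assumes the published Microscaling layout (FP4 E2M1 elements, one 8-bit shared exponent per 32-element block) and a simple nearest-value rounding rule; the function names are hypothetical and this is not the kernel any runtime actually uses.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # elements that share one scale

def quantize_block(block):
    """Return (shared_exponent, codes); each code is a signed FP4 grid value."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude lands near the top
    # of the grid (the largest power of two inside the grid is 4 = 2**2).
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])  # clamp to largest grid value
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return exp, codes

def dequantize_block(exp, codes):
    return [c * 2.0 ** exp for c in codes]

block = [0.01 * i for i in range(-16, 16)]  # 32 toy weights
exp, codes = quantize_block(block)
restored = dequantize_block(exp, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(exp, worst)
```

Storage works out to 4 bits per element plus 8 bits of scale per 32 elements, or 4.25 bits per weight. Because each block has its own scale, a block of small weights keeps fine resolution even when other blocks hold much larger values.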

What Hardware Do You Need to Run gpt-oss Locally?

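Hardware requirements depend heavily on which model and which runtime you use. The open-source AI guide on this site covers general local LLM hardware guidance; for gpt-oss specifically, the native MXFP4 weights need roughly 16 GB of VRAM for gpt-oss-20b and roughly 80 GB (a single H100) for gpt-oss-120b.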

For most individual developers and hobbyists, gpt-oss-20b is the practical choice. The 16 GB VRAM threshold is attainable on a midrange professional workstation today, and the model delivers reasoning quality that substantially exceeds older 7B to 13B open-weight models.
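
A quick back-of-envelope calculation shows why those VRAM thresholds are plausible. The sketch below assumes every weight is stored at 4.25 bits (4-bit values plus one 8-bit scale per 32 weights), which is a simplification: layers kept at higher precision, the KV cache, and activations all add to the real footprint, so treat the result as a floor rather than a measurement.

```python
def mxfp4_weight_gb(total_params, bits_per_weight=4.25):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

small = mxfp4_weight_gb(21e9)    # gpt-oss-20b, 21B total parameters
large = mxfp4_weight_gb(117e9)   # gpt-oss-120b, 117B total parameters

print(f"gpt-oss-20b:  ~{small:.1f} GB of weights")
print(f"gpt-oss-120b: ~{large:.1f} GB of weights")
```

Under these assumptions the weights alone come to about 11 GB and 62 GB, which leaves headroom for cache and activations inside the 16 GB and 80 GB figures quoted above.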


How Does gpt-oss Compare to Other Open-Weight Models in 2026?

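The open-weight landscape in 2026 is genuinely crowded with frontier-quality models. You can read our full guide to the top open-weight LLMs, but in short: Kimi K2.6 currently tops the open-weight benchmark index, DeepSeek V4 Pro leads on agentic coding tasks, and gpt-oss-120b's practical edge is single-GPU inference in MXFP4.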

NVIDIA has published optimized inference support for gpt-oss-120b, including an Eagle3 long-context variant on NVIDIA's developer platform that uses speculative decoding to improve throughput substantially on Blackwell architecture GPUs.
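
For readers unfamiliar with speculative decoding, here is a minimal greedy sketch of the general idea: a cheap draft model proposes several tokens and the large target model checks them. This is not Eagle3 or NVIDIA's implementation; the function `speculative_step` and the two toy "models" are hypothetical, and real systems verify all draft tokens in one batched forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding.

    draft_next and target_next map a token list to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. The cheap draft model proposes k tokens.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The target model checks them; stop at the first disagreement.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the bad draft
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models (assumption): the target counts upward mod 10,
# and the draft agrees except right after a 3.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

print(speculative_step([0], draft, target))   # [1, 2, 3, 4]
print(speculative_step([0], target, target))  # [1, 2, 3, 4, 5]
```

The output always matches what the target model would have produced on its own, which is why the technique improves throughput without changing results in the greedy case.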

Where Can You Download and Run gpt-oss?

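The model weights are available across multiple channels, and the fastest way to get started depends on your hardware and preferred runtime. Official weights are hosted under the OpenAI organization on Hugging Face, Ollama users can run "ollama pull gpt-oss:20b", and community GGUF quantizations from Unsloth cover llama.cpp.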
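
Once the model has been pulled with Ollama, one way to query it from code is Ollama's local REST API. The sketch below uses only the Python standard library and assumes Ollama is running on its default port (11434) with the gpt-oss:20b tag already downloaded; the helper names `build_request` and `generate` are our own.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="gpt-oss:20b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="gpt-oss:20b", timeout=300):
    """Send the prompt and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server with the model pulled):
# print(generate("Explain Mixture-of-Experts in two sentences."))
```

Nothing leaves your machine in this setup: the request goes to localhost, so there are no per-token charges and no external rate limits.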

Why Did OpenAI Release Open-Weight Models Again After GPT-2?

OpenAI last released public model weights with GPT-2 in 2019. In the intervening years, the company shifted to a closed API-only model for GPT-3, GPT-4, and the GPT-5 family. The gpt-oss release breaks that pattern, and the timing suggests several converging pressures.

First, the competitive context has changed dramatically. Meta's Llama models have proven that open-weight releases drive developer ecosystem adoption at scale. Developers who build on Llama or Mistral build tooling, integrations, and fine-tunes around those models -- creating a gravitational pull toward those ecosystems. OpenAI's absence from the open-weight tier has cost it developer mindshare even as its API revenue has grown.

Second, NVIDIA's investment in gpt-oss optimization -- including inference performance improvements of up to 10x on Blackwell GPUs using TensorRT-LLM and Eagle speculative decoding -- suggests a coordinated go-to-market push. When a hardware partner optimizes specifically for your model at launch, it is rarely spontaneous. The gpt-oss release appears to be part of a broader NVIDIA-OpenAI alignment around on-premise enterprise AI deployment, where enterprises want frontier-quality models they can run in their own data centers without API dependencies.

Third, the specific models released -- a 20B model that runs on consumer hardware -- signal a deliberate effort to reach the free-tier developer audience. A developer who fine-tunes gpt-oss-20b on their local machine and builds a product around it is more likely to graduate to OpenAI's paid API tier at scale than a developer who has spent two years in the Llama ecosystem. The open-weight release is arguably a top-of-funnel strategy as much as an openness gesture.

Regardless of the motivation, the practical effect is substantial. Developers with a 16 GB GPU can now run an OpenAI-trained model locally, fine-tune it with Unsloth for specific domains, and deploy it without paying per-token API costs. That is a genuine and meaningful change in what is accessible for free.

🔑 Key Takeaways

  • OpenAI released gpt-oss-120b and gpt-oss-20b on May 12, 2026 -- its first public model weights since GPT-2 in 2019, marking a major shift in the company's openness stance.
  • Both models use Mixture-of-Experts (MoE) architecture with native MXFP4 quantization, allowing gpt-oss-20b to run in just 16 GB of VRAM and gpt-oss-120b to fit on a single H100 GPU.
  • The weights are free to download from Hugging Face and can be run via Ollama, vLLM, Jan AI, or llama.cpp, with no API costs for self-hosted deployments.
  • NVIDIA has optimized both models for Blackwell GPU architecture, delivering up to 10x inference performance improvement over baseline, and offers an Eagle3 long-context variant of gpt-oss-120b.
  • The release appears strategically timed to compete with Llama 4, Kimi K2.6, and DeepSeek V4 for developer ecosystem mindshare, with the 20B model specifically targeting developers who build locally before scaling to cloud APIs.

Frequently Asked Questions

What is gpt-oss from OpenAI?

gpt-oss is a family of two open-weight reasoning models released by OpenAI in May 2026: gpt-oss-120b for production workloads and gpt-oss-20b for local and edge use. Both use a Mixture-of-Experts (MoE) architecture, can be downloaded from Hugging Face, and are available through Ollama, vLLM, and major cloud APIs including Azure OpenAI and AWS Bedrock.

Can you run gpt-oss on a consumer GPU?

Yes. The gpt-oss-20b model requires approximately 16 GB of VRAM using native MXFP4 quantization, putting it within reach of the NVIDIA RTX 4090 or RTX 5090. The larger gpt-oss-120b needs an H100. For older hardware without MXFP4 support, quantized GGUF versions from Unsloth run on RTX 3090-class cards and Mac M-series chips via llama.cpp, though at reduced speed.

How does gpt-oss compare to Llama 4 and DeepSeek V4?

All three use MoE architecture. Kimi K2.6 currently tops the open-weight benchmark index, while DeepSeek V4 Pro leads on agentic coding tasks. gpt-oss-120b is optimized for single-GPU inference using MXFP4, giving it a practical deployment edge over models that require multi-GPU setups. The gpt-oss-20b runs in a similar VRAM footprint to Llama 4 Scout but benefits from OpenAI's RLHF training pipeline.

Is gpt-oss truly open source?

gpt-oss is open-weight, not fully open source. The model weights are publicly available for download and local deployment. OpenAI has released the inference architecture code on GitHub. However, the full training dataset, data curation process, and RLHF details have not been disclosed. For a fully open license like Apache 2.0, Mistral Small 4 remains the cleaner choice for commercial applications.

Where can I download the gpt-oss model weights?

Weights are available from the OpenAI organization on Hugging Face, and community GGUF quantizations are at unsloth/gpt-oss-120b-GGUF and unsloth/gpt-oss-20b-GGUF. You can also pull via Ollama using "ollama pull gpt-oss:20b". The GitHub repository at github.com/openai/gpt-oss contains the inference code. Cloud access via Azure OpenAI and AWS Bedrock requires no local download.

When was gpt-oss released?

OpenAI released both gpt-oss-120b and gpt-oss-20b on May 12, 2026. The release was announced via OpenAI's model release notes and the openai/gpt-oss GitHub repository. NVIDIA published simultaneous optimizations for Blackwell GPU architecture, suggesting the launch was coordinated with hardware partners.
