
Updated model from NVIDIA and OpenAI achieves milestone speed of 1.5 million tokens per second

AI models fine-tuned for logical thinking and reasoning, developed by OpenAI and NVIDIA, are now fully available to developers worldwide.

NVIDIA and OpenAI's latest gpt-oss model reaches a processing speed of 1.5 million tokens per second.


In a groundbreaking collaboration, OpenAI and NVIDIA have released two new state-of-the-art open-source large language models under the Apache 2.0 license. The models, named gpt-oss-120b and gpt-oss-20b, are designed to be both efficient and powerful in real-world use.

Key Features and Capabilities

The gpt-oss-120b and gpt-oss-20b share several key features and capabilities. Both models employ a Mixture-of-Experts (MoE) architecture and 4-bit quantization for efficient inference, support long context lengths of 128K tokens, and are integrated into popular open-source AI frameworks, enabling broad customization and deployment flexibility.

| Feature / Capability | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Total parameters | 117 billion | 21 billion |
| Active parameters per token | ~5.1 billion | ~3.6 billion |
| Number of experts (MoE) | 128 total, 4 active per token | 32 total, 4 active per token |
| Model architecture | Mixture-of-Experts (MoE) with SwiGLU activations | Mixture-of-Experts (MoE) with SwiGLU activations |
| Precision and optimization | FP4 (4-bit floating point), optimized for NVIDIA Blackwell GPUs | FP4 (4-bit floating point), optimized for edge devices |
| Context window | 128K tokens | 128K tokens |
| Hardware requirements | Runs efficiently on a single 80 GB GPU such as the NVIDIA H100 or GB200; requires a datacenter-class GPU for best performance | Fits on a single 16 GB GPU, enabling on-device or edge deployment, including Mac laptops with 32 GB RAM |
| Performance benchmarks | Near-parity with OpenAI's proprietary o4-mini model on core reasoning tasks; excels at math, code, and domain-specific Q&A | Similar performance to OpenAI o3-mini on common reasoning benchmarks |
| Capabilities | Strong reasoning, few-shot function calling, chain-of-thought (CoT) reasoning, tool use, autonomous agent tasks, and structured chat | Instruction following, chain-of-thought reasoning, and tool use in structured chat formats |
| Software ecosystem compatibility | Supported by Hugging Face Transformers (v4.55+), vLLM, Ollama, TensorRT-LLM, llama.cpp, and the OpenAI-compatible Responses API | Same as gpt-oss-120b, with a focus on fast, low-latency inference for accessible deployment |
| Training | Trained with reinforcement learning and techniques based on OpenAI's advanced internal models (o3 and other frontier systems); required over 2.1 million GPU hours on NVIDIA H100 | About 10x less training time than the 120b model |
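
To make the parameter counts in the table concrete: in an MoE layer, a small router selects a handful of experts per token, so only a fraction of the total weights is active at any step. The sketch below illustrates top-k expert routing in PyTorch; the per-expert loop and layer shapes are illustrative, not the actual gpt-oss implementation.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=4):
    """Route each token to its top-k experts and mix their outputs.

    x:       (num_tokens, d_model) token activations
    router:  linear layer producing one logit per expert
    experts: list of small feed-forward networks
    k:       experts activated per token (4 in both gpt-oss models)
    """
    logits = router(x)                           # (num_tokens, num_experts)
    weights, idx = torch.topk(logits, k, dim=-1) # keep only the k best experts
    weights = F.softmax(weights, dim=-1)         # normalize over the k winners
    out = torch.zeros_like(x)
    for slot in range(k):                        # simple loop for clarity
        for e in range(len(experts)):
            mask = idx[:, slot] == e             # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out
```

With 128 experts and only 4 active per token, this style of routing is what lets gpt-oss-120b hold 117 billion parameters while activating only ~5.1 billion per token.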

Summary

The gpt-oss-120b is a high-capability model optimized for robust reasoning tasks, capable of matching OpenAI’s proprietary models like o4-mini while running on a single powerful GPU. Its large expert pool and deep attention layers support complex, agentic AI workflows such as code execution and tool calls, ideal for data center or server deployments where latency and cost are critical.

The gpt-oss-20b model prioritizes accessibility and speed, enabling strong performance similar to o3-mini on common reasoning benchmarks but with much lighter hardware requirements (16 GB GPU). This allows efficient on-device or edge use, suitable for local inference, development, or bandwidth-constrained environments.
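
For local experimentation, the Transformers support listed in the table above means the 20b model can be loaded like any other Hub checkpoint. A minimal sketch, assuming the weights are published under the id `openai/gpt-oss-20b` and Transformers v4.55+ is installed:

```python
# Minimal local-inference sketch with Hugging Face Transformers.
# The checkpoint id below matches the announced naming; verify it on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick up the shipped (quantized) weight dtype
    device_map="auto",    # place layers on the available GPU automatically
)

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```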

Both models were trained on NVIDIA's H100 GPUs and run best on Blackwell-powered GB200 NVL72 systems, achieving inference speeds of 1.5 million tokens per second. Jensen Huang, founder and CEO of NVIDIA, stated that these models let developers everywhere build on an open-source foundation, strengthening U.S. technology leadership in AI.
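
For scale, the 1.5 million tokens per second figure is aggregate throughput for a full rack. A back-of-the-envelope split, assuming the 72 Blackwell GPUs that give the NVL72 its name, is purely illustrative:

```python
# Back-of-the-envelope: 1.5M tokens/s is rack-level throughput for a
# GB200 NVL72 system, which combines 36 Grace CPUs with 72 Blackwell GPUs.
rack_tokens_per_s = 1_500_000
gpus_per_rack = 72

per_gpu = rack_tokens_per_s / gpus_per_rack
print(f"~{per_gpu:,.0f} tokens/s per GPU")  # ~20,833 tokens/s per GPU
```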

To ensure safety, the models were evaluated using OpenAI's Preparedness Framework and adversarial fine-tuning tests. Independent experts reviewed the methodology, helping establish safety standards comparable to the company's closed frontier models.

In benchmark testing, gpt-oss-120b outperformed several proprietary models, including OpenAI's o1 and o4-mini, on tasks related to healthcare (HealthBench), mathematics (AIME 2024 and 2025), and coding (Codeforces).

The models are designed for advanced reasoning tasks and are optimized for deployment across a wide range of environments. Both models support 128K-token context lengths, employ Rotary Positional Embeddings, and feature advanced attention techniques.
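
Rotary Positional Embeddings encode position by rotating pairs of channels in the query and key vectors, so attention scores depend on relative rather than absolute position. A minimal NumPy sketch of the standard rotate-half formulation follows; the base frequency is the common default, not a confirmed gpt-oss value.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply Rotary Positional Embeddings to a (seq_len, head_dim) array.

    Each pair of channels is rotated by an angle that grows with the token
    position, so query-key dot products depend only on relative distance.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair rotation rates
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # split channels into two halves
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```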

OpenAI and NVIDIA have partnered with major deployment platforms like Azure, AWS, Vercel, and Databricks, and hardware leaders including AMD, Cerebras, and Groq. Microsoft is enabling local inference of gpt-oss-20b on Windows devices via ONNX Runtime.

Both models perform strongly in chain-of-thought (CoT) reasoning, tool use, and structured outputs, making them ideal for low-latency, real-time tasks. The models are released under the Apache 2.0 license, allowing full commercial and research use.
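
Because both models are served through OpenAI-compatible endpoints (for example, via vLLM), tool use can be exercised with the standard `openai` Python client. A minimal sketch; the local endpoint URL and the `get_weather` function are illustrative placeholders:

```python
# Tool-use sketch against a local OpenAI-compatible server (e.g. vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# If the model decides to call the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```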

The innovation in gpt-oss-120b and gpt-oss-20b lies in combining a Mixture-of-Experts (MoE) architecture with 4-bit quantization for efficient inference. The result is strong reasoning, few-shot function calling, and chain-of-thought (CoT) reasoning delivered across diverse environments, from data centers to edge devices.
