Nvidia launches Nemotron 3 Nano Omni: an open multimodal model with 30 billion parameters (3 billion active), built for edge AI agents.
**TL;DR**
Nvidia introduced Nemotron 3 Nano Omni, an open-weight multimodal model that integrates vision, audio, and language in a single framework. It carries 30 billion parameters but activates only 3 billion during inference, delivers nine times the throughput of comparable open models, and leads on six benchmarks. Released under Nvidia's Open Model Agreement with full commercial rights, it targets edge AI agent deployment on single GPUs and positions Nvidia as a competitor in both AI infrastructure and the models that run on it.
On Tuesday, Nvidia launched Nemotron 3 Nano Omni, an open-weight multimodal AI model that unifies vision, audio, and language understanding in a single architecture aimed at powering autonomous AI agents on edge devices. The model carries 30 billion parameters but activates only 3 billion per forward pass through a mixture-of-experts design, letting it run on a single GPU while matching or surpassing the multimodal capabilities of much larger models. Nvidia claims nine times the throughput of comparable open multimodal models at similar interactivity, 2.9 times faster single-stream reasoning on multimodal tasks, and nearly nine times greater effective system capacity for video reasoning. The model leads on six benchmarks spanning document intelligence, video analysis, and audio understanding. It accepts text, images, audio, video, documents, charts, and graphical interfaces as input and produces text as output, potentially replacing the patchwork of specialized models common in enterprise AI stacks. The release, available on Hugging Face under Nvidia's Open Model Agreement with full commercial usage rights, marks Nvidia's most assertive move yet toward supplying not just the infrastructure for AI but also the models that run on it.
**The architecture**
Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture: 23 Mamba-2 selective state-space layers, 23 mixture-of-experts layers with 128 experts routing six to each token plus a shared expert, and six grouped-query attention layers. Its vision encoder, C-RADIOv4-H, handles variable-resolution images in 16×16 patches, scaling from 1,024 to 13,312 visual patches per image. The audio encoder, Parakeet-TDT-0.6B-v2, processes both speech and environmental sounds. Video processing uses three-dimensional convolutions to capture motion across frames rather than treating video as a sequence of still images. The underlying text model was pretrained on 25 trillion tokens and supports a 256,000-token context window. The design philosophy is to maximize capability per active parameter rather than per total parameter, recognizing that edge deployment is constrained by compute per inference, not by model size at rest. With only 3 billion parameters active during inference, the model can run on hardware announced at Nvidia's GTC 2026 developer conference, such as the DGX Spark and DGX Station workstations, without the multi-GPU clusters that larger data-center models require.
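To make the patch arithmetic concrete, here is a minimal sketch. It assumes the encoder tiles each image into non-overlapping 16×16 patches at its working resolution, one visual token per patch; the exact tiling and resizing policy is an assumption on our part, not something Nvidia's release spells out.

```python
def visual_patches(height: int, width: int, patch: int = 16) -> int:
    """Rough patch count for a variable-resolution ViT-style encoder:
    one token per patch x patch tile (assumes dimensions divisible by patch)."""
    return (height // patch) * (width // patch)

# Under this assumption, the stated 1,024-patch floor matches a 512x512 input;
# larger inputs scale the token count up toward the 13,312-patch ceiling.
assert visual_patches(512, 512) == 1024
print(visual_patches(1024, 1024))  # 4096 tokens for a 1024x1024 image
```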
The mixture-of-experts approach is not novel, but its integration into a multimodal model at this scale is. Most open multimodal models either use a dense architecture, in which every parameter is active at each inference step, or chain separate specialist models into a pipeline, adding latency at every handoff. Nemotron 3 Nano Omni avoids both by routing each token to six of 128 experts within a unified structure, so vision tokens, audio tokens, and text tokens traverse the same model while activating different expertise depending on the modality. It can therefore process a video feed, a spoken command, and a document concurrently, without the inter-model latency that makes pipeline architectures impractical for real-time applications. For enterprises, this collapses the operational burden of managing separate vision, speech, and language models into a single model behind a unified endpoint.
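The routing mechanism described above can be illustrated with a short, hedged sketch. This is not Nvidia's implementation: it is a generic top-k mixture-of-experts layer in PyTorch with an always-on shared expert, using made-up dimensions, to show how each token can be sent to six of 128 experts while the remaining experts stay idle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k MoE layer: each token is routed to k of num_experts
    feed-forward experts, and a shared expert processes every token.
    Dimensions are illustrative, not Nemotron's actual sizes."""

    def __init__(self, d_model=256, d_ff=512, num_experts=128, k=6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Shared expert: always active, regardless of routing.
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)    # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # one routed expert per slot
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e       # tokens routed to expert e
                out[mask] = out[mask] + (weights[mask, slot].unsqueeze(-1)
                                         * self.experts[e](x[mask]))
        return out + self.shared(x)                # add the always-on shared path

# Tokens from any modality (vision, audio, text) pass through the same layer;
# only the experts selected per token differ.
layer = TopKMoELayer()
tokens = torch.randn(10, 256)
print(layer(tokens).shape)  # torch.Size([10, 256])
```

In a deployed system the per-expert loop would be replaced by batched expert dispatch, but the routing logic, score the token, pick the top six experts, and blend their outputs with a shared path, is the part that lets one model hold 30 billion parameters while activating only a fraction of them per token.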
**The strategy**
Nvidia has profited from the AI boom by selling infrastructure: GPUs, networking, and the CUDA software ecosystem that ties developers to its hardware. The Nemotron model family, downloaded more than 50 million times in the past year, represents a second front, with Nvidia also supplying the models that run on that infrastructure. The logic is circular by design: Nvidia's models are optimized for its hardware and its hardware for its models, producing a full-stack ecosystem that competes with the model-plus-cloud offerings of Google, Amazon, and Microsoft. The case for compact, domain-specific language models has already been made in education, healthcare, and enterprise software; Nemotron 3 Nano Omni extends that argument to multimodal workloads. Instead of calling a large cloud model for every vision or audio task, enterprises can deploy one compact model locally to cover the entire perceptual stack.
Early enterprise adopters include Foxconn, Palantir, Aible, ASI, Eka Care, and H Company, while Dell, DocuSign, Infosys, Oracle, and Zefr are evaluating the model.
