Google introduces Gemini Omni Flash, a video generation model designed for conversation, although the avatar mode has been postponed.
The initial model in DeepMind's new Omni family will create and modify video from any combination of image, audio, video, and text inputs. Speech editing will not be available initially, and SynthID watermarking will be active by default.
On Tuesday at the I/O 2026 developer conference, Google unveiled Gemini Omni, a new multimodal model family from Google DeepMind designed for generating and editing video using various inputs, including images, audio, and text.
The first model, Gemini Omni Flash, began its rollout on the same day to the Gemini app and Google Flow for Google AI Plus, Pro, and Ultra subscribers, in addition to being accessible without charge on YouTube Shorts and the YouTube Create app. Access to the API for developers and enterprise clients will be available in the coming weeks.
Koray Kavukcuoglu, CTO of Google DeepMind and Chief AI Architect at Google, described Omni as a model that “combines images, audio, video, and text as input to produce high-quality videos based on Gemini’s real-world knowledge.” Users can integrate inputs within a single prompt.
Edits are made conversationally, with each instruction building on the previous one, so that character consistency, physics, and scene context carry over across multiple exchanges. Outputs beyond video, such as image and audio generation, are “coming in time,” Kavukcuoglu stated on the company’s blog.
The marketing positioning of Omni rests on three key assertions. First, the model exhibits a superior intuitive understanding of physical forces such as gravity, kinetic energy, and fluid dynamics, enabling it to generate scenes with more precise physics.
Second, it uses Gemini’s existing world knowledge to link language, imagery, and meaning beyond simple pattern recognition, as illustrated by prompts that range from claymation protein-folding explanations to chain-reaction physics scenarios. Third, the conversational editing feature maintains consistency throughout multi-turn revisions, addressing the difficulty previous video models have had with character identity and scene continuity.
Additionally, the Omni family now includes digital avatar generation. These avatars allow users to record their own voice and likeness to create videos that resemble them, with the onboarding process requiring users to record themselves and recite a sequence of numbers.
Aside from avatars, Google is currently refraining from offering general-purpose audio and speech editing within Omni. “We are still working to test this and better understand how we can responsibly provide this capability to users,” Kavukcuoglu noted, which has been interpreted by third-party coverage as a cautious move away from the territory of consent-free voice editing associated with deepfakes.
All videos produced by Omni will include Google’s SynthID imperceptible digital watermark by default. Users can check if a clip was created by Omni through the Gemini app, Gemini in Chrome, and Google Search, according to the company.
The SynthID watermarking system aligns with the infrastructure adopted by OpenAI earlier this year under the C2PA open standard, and it is now marketed as the standard for AI-generated visual provenance across the industry.
As for initial limitations, Flash-tier clips are capped at 10 seconds at launch, a deployment decision rather than a constraint of the model itself. That is shorter than the 60-second maximum of OpenAI’s Sora, whose tokenization of spatiotemporal patches makes it the closest point of comparison among published frontier models.
Google has not disclosed the cost structure per clip, the computational resources required per generation, or the benchmarks used to assess Omni against Veo 3 or third-party models like ByteDance’s Seedance.
Omni headlines a broader I/O 2026 announcement, which also covered Gemini 3.5 and marked what Sundar Pichai referred to as the “agentic Gemini era” during his keynote. The strategic question regarding the model is whether the multi-input conversational editing flow represents a genuinely new product category or merely a more integrated version of capabilities demonstrated in the wider frontier video domain.
The API rollout for developers and enterprise customers, due in the coming weeks, should reveal the cost structure and maximum clip length for paid tiers.
Google has yet to disclose the underlying Omni model architecture and how it relates to Veo 3, the compute required per generation, pricing for clips beyond the Flash tier, benchmark results against DeepMind’s previous video models and competing frontier offerings, and a timeline for audio and speech editing within the Omni family.
The onboarding process for avatars and the enforcement of SynthID are the company’s official approach to addressing the consent and provenance concerns raised by the launch.
