OpenAI introduces GPT-Realtime-2 along with two new voice API models.

      GPT-Realtime-2 brings GPT-5-level reasoning to live voice interactions. Alongside it ship a dedicated translation model supporting more than 70 input languages and a streaming version of Whisper for transcription. The pricing is aggressive enough to make comparisons with incumbent voice vendors inevitable.

      OpenAI has launched three new voice models in its API, expanding the capabilities for developers to integrate GPT-level reasoning into live audio. These models include GPT-Realtime-2, which is an advancement over the existing real-time voice model and incorporates GPT-5-like reasoning; GPT-Realtime-Translate, a live translation model with support for more than 70 input languages and 13 output languages; and GPT-Realtime-Whisper, a streaming speech-to-text model designed for low-latency transcription.

      This release comes at a time of rapid development in the voice-AI sector. Businesses that have shipped voice agents typically rely on a stack of separate components: Whisper or Deepgram for transcription, ElevenLabs or Cartesia for text-to-speech, GPT-4 or Claude for reasoning, plus custom turn-taking and barge-in mechanisms.

      What OpenAI offers with GPT-Realtime-2 is a unified model that handles both audio input and output, with reasoning happening directly in the audio loop rather than in a separate transcription-and-synthesis pipeline.

      So, what’s new?

      GPT-Realtime-2 incorporates multiple capabilities that production voice teams have been simulating through prompt scaffolding. For instance, preambles allow an agent to reassure the user with phrases like “let me check that” while accessing tools, preventing awkward silences. The model can initiate tool calls in parallel, enabling it to make multiple backend requests at once while informing users about the ongoing processes. Moreover, if something goes wrong, the model can recognize failures and address them instead of freezing the conversation. Additionally, it can intentionally modulate tone; for example, it may adopt a calmer tone in support scenarios and a more cheerful tone for confirmations.
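The parallel tool-call behavior described above can be sketched with plain asyncio: the agent speaks a preamble while two backend lookups run concurrently instead of sequentially. The tool functions, their names, and the timings are invented for illustration; they are not part of OpenAI's API.

```python
import asyncio

async def lookup_listing(listing_id: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a backend request
    return f"listing:{listing_id}"

async def lookup_pricing(listing_id: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a second, independent request
    return f"pricing:{listing_id}"

async def handle_turn(listing_id: str) -> list[str]:
    # Spoken preamble covers the wait instead of dead air.
    print("Agent: let me check that for you...")
    # Both tool calls are issued at once; total wall time is ~one request.
    return list(await asyncio.gather(
        lookup_listing(listing_id), lookup_pricing(listing_id)
    ))

results = asyncio.run(handle_turn("z123"))
```

With sequential calls the two 50 ms lookups would cost 100 ms of silence; gathered, they overlap, which is the latency win the model's built-in parallelism is after.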

      Two metrics stand out: the context window has grown from 32K to 128K tokens, allowing longer conversations and more complex interactions without external state management, and reasoning effort is adjustable from minimal up to xhigh, with low as the default to keep latency down.
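In practice the effort setting would be one field in a session configuration. The sketch below shows what such a payload might look like; the exact field names, the `session.update` event shape, and the `gpt-realtime-2` model id are assumptions based on this article and OpenAI's existing Realtime API conventions, not a confirmed schema.

```python
# Valid effort levels per the article, ordered from cheapest to most thorough.
VALID_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")

def make_session_config(effort: str = "low") -> dict:
    """Build a hypothetical session.update payload.

    'low' is the stated default, chosen to keep response latency down.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    return {
        "type": "session.update",          # assumed event name
        "session": {
            "model": "gpt-realtime-2",     # assumed model id
            "reasoning": {"effort": effort},
            "modalities": ["audio", "text"],
        },
    }
```

A support bot that rarely needs deep reasoning would leave the default; a workflow that tolerates extra latency could request `"xhigh"` per session.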

      On OpenAI’s own benchmarks, GPT-Realtime-2 at high effort scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, the company’s audio-reasoning measure, and 13.8% higher on Audio MultiChallenge, its instruction-following test, at xhigh effort. Customer evaluations show even larger improvements.

      Zillow has reported a 26-point increase in call success rates on its most challenging benchmark, achieving 95% success with GPT-Realtime-2 compared to 69% with the previous model. BolnaAI, which is developing voice AI for Indian languages, has noted a 12.5% reduction in word error rates for Hindi, Tamil, and Telugu using the translation model.

      GPT-Realtime-2 is priced at $32 per million audio input tokens, $0.40 per million cached input tokens, and $64 per million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute; GPT-Realtime-Whisper costs $0.017 per minute.
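A back-of-the-envelope calculator makes those token prices concrete. The rates come from the figures quoted above; the token counts in the example are illustrative assumptions, since how many tokens a minute of audio consumes is not stated here.

```python
# Dollars per million tokens, as quoted for GPT-Realtime-2.
PRICE_PER_M = {"audio_in": 32.00, "cached_in": 0.40, "audio_out": 64.00}

def realtime_cost(in_tokens: int, cached_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one GPT-Realtime-2 exchange at the quoted rates."""
    return round(
        in_tokens / 1e6 * PRICE_PER_M["audio_in"]
        + cached_tokens / 1e6 * PRICE_PER_M["cached_in"]
        + out_tokens / 1e6 * PRICE_PER_M["audio_out"],
        6,
    )

# Hypothetical call: 10k input, 5k cached, 8k output tokens
#   0.01 * 32 + 0.005 * 0.40 + 0.008 * 64 = 0.32 + 0.002 + 0.512 = $0.834
cost = realtime_cost(10_000, 5_000, 8_000)
```

The asymmetry matters: output tokens cost twice what input tokens do, so chatty agents are disproportionately expensive, while cached input is nearly free.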

      At $0.034 per minute, roughly $2 per hour of audio, GPT-Realtime-Translate undercuts existing enterprise translation systems on both cost and bundling, where incumbents typically force compromises on latency and language coverage. Whisper streaming, at half that rate, is priced just as aggressively.

      Companies like ElevenLabs, the most heavily funded voice company on the market, use per-minute pricing that bundles synthesis and model inference.

      With OpenAI’s integrated model also handling the reasoning, those buying decisions become more complicated. Deepgram, whose core business is streaming transcription, faces similar pressure from Whisper streaming.

      OpenAI's customer launch list includes notable names: Zillow, Glean, Genspark, Bluejay, Intercom, Priceline, and Foundation Health for the real-time model; BolnaAI, Vimeo, and Deutsche Telekom for translation services.

      However, none of the three models eliminates the development work required for establishing guardrails, evaluation, escalation, and analytics necessary for voice agents before deployment. OpenAI offers active classifiers and EU data residency, but compliance, brand voice, and tool-call observability remain the developer's responsibility.

      The competitive landscape hinges on which platform can simplify these burdens most efficiently, and OpenAI’s strategy is to offer a cohesive audio reasoning model as a more sustainable solution than integrating multiple vendors. The ability of ElevenLabs, Deepgram, and others to maintain their market position will depend on how swiftly they advance their own integrated solutions.

      The upcoming quarter will provide the first opportunity to evaluate these products in production.
