DeepL introduces real-time voice-to-voice translation in more than 40 languages.
The Cologne-based translation company, known for its text tools, has introduced a voice product suite covering meetings, conversations, and group interactions, plus an API for enterprise integration. A live demonstration in Seoul showed a delay of one to two sentences, and DeepL’s Chief Product Officer acknowledged that differences in word order among languages pose a significant challenge.
DeepL, the Cologne-based AI language firm recognized for its high-quality text translation, has launched DeepL Voice-to-Voice: a real-time spoken translation suite tailored for live business communications.
The product encompasses four specific use cases: virtual meetings, mobile and web conversations, group settings for frontline employees, and enterprise applications via an API. It supports over 40 languages, including all 24 official languages of the EU, along with Vietnamese, Thai, Arabic, Norwegian, Hebrew, Bengali, and Tagalog.
The suite consists of four components at different stages of availability. Voice for Conversations, which provides real-time translation on mobile and web without requiring an app installation, is now generally available.
Voice for Meetings, which integrates with Microsoft Teams and Zoom so that each participant can speak in their native language and hear everyone else translated into it, will begin an early access program in June. The Voice-to-Voice API, which lets businesses build DeepL’s translation engine into customer-facing applications such as call centers, is currently in early access. A customization feature called Spoken Terms, which teaches the system industry-specific vocabulary, company names, and personal names, is set to be generally available on May 7.
Jarek Kutylowski, DeepL’s founder and CEO, characterized the launch as an achievement in “exploring new frontiers in translation.”
“DeepL Voice-to-Voice enables natural dialogue in one’s own language without the barriers or costs associated with interpreters,” he stated.
DeepL is positioning this product as an enterprise solution rather than a consumer offering: the company has emphasized that its voice technology does not use customer data to train its models and does not permanently retain transcription or translation data after a call ends. This data-handling stance differentiates it from consumer AI voice products and targets regulated sectors.
The current system works as a three-step cascade: speech is converted to text, the text is translated by DeepL’s existing text translation engine, and the translated text is converted back into speech.
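For readers who want a concrete picture, a three-step cascade of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepL's implementation or API: the `CascadePipeline` class, the three stage signatures, and the toy stand-in functions are all hypothetical and exist only to show how the stages chain together.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Transcriber = Callable[[bytes, str], str]    # (audio, source_lang) -> text
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Synthesizer = Callable[[str, str], bytes]    # (text, target_lang) -> audio


@dataclass
class CascadePipeline:
    """Chains speech-to-text, text translation, and text-to-speech."""
    transcribe: Transcriber
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes, source_lang: str, target_lang: str) -> bytes:
        text = self.transcribe(audio, source_lang)                    # step 1
        translated = self.translate(text, source_lang, target_lang)   # step 2
        return self.synthesize(translated, target_lang)               # step 3


# Toy stand-ins so the sketch runs without any speech or translation service.
def fake_transcribe(audio: bytes, lang: str) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


TOY_GLOSSARY = {("en", "de"): {"hello": "hallo"}}


def fake_translate(text: str, src: str, tgt: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    table = TOY_GLOSSARY.get((src, tgt), {})
    return " ".join(table.get(word, word) for word in text.split())


def fake_synthesize(text: str, lang: str) -> bytes:
    # Pretend the synthesized "audio" is the UTF-8 encoding of the text.
    return text.encode("utf-8")


pipeline = CascadePipeline(fake_transcribe, fake_translate, fake_synthesize)
result = pipeline.run(b"hello world", "en", "de")
```

Because each stage only consumes the previous stage's output, the quality of the middle translation step carries straight through to what the listener hears, and the time spent in each stage adds up.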
DeepL's competitive edge hinges on the quality of this translation stage: the company claims its text translation models are superior to alternatives, an advantage that extends to the voice output.
In blind evaluations commissioned by DeepL and conducted independently by Slator, a language industry research firm, 96% of professional linguists preferred DeepL Voice over the native translation tools in Google Meet, Microsoft Teams, and Zoom, citing its fluency and contextual accuracy. DeepL Voice received scores of 96.4 out of 100 for Zoom and 96.3 for Microsoft Teams.
However, a live demonstration by Chief Product Officer Gonzalo Gaiolas during the DeepL Connect Seoul event on April 15 revealed a current limitation: a noticeable delay of one to two sentences between the conclusion of the speaker’s input and the delivery of the translation.
Gaiolas acknowledged the delay directly: “Different languages have different word orders and sentence structures, which results in delays in real-time interpretation,” he said, as reported by Seoul Economic Daily.
The company aims to reduce latency through ongoing model improvements. As for voice quality, the current system uses a fixed synthetic voice; DeepL has said it plans to introduce, by late 2026, a voice-preservation feature that keeps the speaker's original voice characteristics in the translated output.
DeepL enters a market with several well-funded competitors. Sanas, which uses AI to adjust speakers' accents in real time for call centers, recently raised $65 million in a round led by Quadrille Capital.
Camb.AI, based in Dubai, focuses on speech synthesis and translation for media dubbing, while Palabra, supported by Reddit co-founder Alexis Ohanian's Seven Seven Six, is developing a real-time speech translation engine aimed at preserving the speaker's voice characteristics.
Google, Microsoft, and Zoom each provide their own meeting translation features, leaving DeepL both competing with these platforms and integrating into them. DeepL’s strategic bet is that translation quality, its most established differentiator, will offset the structural advantage its competitors hold in platform distribution.