DeepL introduces real-time voice-to-voice translation in over 40 languages.
The translation company based in Cologne, known for its text tools, has introduced a comprehensive voice product suite that encompasses meetings, conversations, group environments, and an API for enterprise integration. A live demonstration in Seoul revealed delays of one to two sentences, and DeepL’s Chief Product Officer acknowledged that variations in word order between languages present a significant challenge.
DeepL, the Cologne-based AI language firm recognized for its high-quality text translation, has launched DeepL Voice-to-Voice, a real-time spoken translation suite tailored for immediate business communication.
This product caters to four specific scenarios: virtual meetings, mobile and web conversations, group settings for frontline workers, and enterprise applications via an API. It supports more than 40 languages, including all 24 official EU languages, as well as others like Vietnamese, Thai, Arabic, Norwegian, Hebrew, Bengali, and Tagalog.
The suite comprises four components, each at a different stage of readiness. Voice for Conversations, which provides real-time translation on mobile and web platforms without requiring an app installation, is now fully available.
Voice for Meetings, which integrates with Microsoft Teams and Zoom so that participants can speak in their native language while others receive simultaneous translation in theirs, will begin an early access program in June.
The Voice-to-Voice API, which lets businesses build DeepL’s translation capabilities into customer-facing applications such as call centers, is currently in early access. A customization feature, Spoken Terms, which lets the system learn specific industry vocabulary, company names, and personal names, is set to be generally available on May 7.
Jarek Kutylowski, founder and CEO of DeepL, characterized the launch as reaching "another frontier in translation." He stated, "DeepL Voice-to-Voice allows everyone to speak naturally in their own language without the friction or cost of interpreters."
DeepL has positioned this product as an enterprise solution rather than a consumer one, emphasizing that its voice technology does not use customer data for model training and does not permanently retain transcription or translation data after a call ends. This security posture differentiates it from consumer AI voice products and is aimed at regulated industries.
The current system operates as a three-step pipeline: speech is converted to text, that text is translated using DeepL’s established translation engine, and the translated text is converted back into speech.
DeepL's competitive edge is based on the quality of the middle step: the company asserts that its text translation models surpass those of competitors, and that benefit carries through to the voice output.
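The three-step process described above can be sketched as a simple cascade. This is only an illustrative outline, not DeepL's implementation or API: the class name `CascadePipeline` and the three toy stage functions are hypothetical stand-ins that pass text through as bytes so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadePipeline:
    """Three-stage cascade: speech -> text -> translated text -> speech."""
    recognize: Callable[[bytes], str]    # stage 1: speech to source text
    translate: Callable[[str], str]      # stage 2: source text to target text
    synthesize: Callable[[str], bytes]   # stage 3: target text to speech

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Hypothetical toy stages (assumptions for illustration, not real DeepL calls).
def fake_recognize(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


TOY_GLOSSARY = {"guten morgen": "good morning"}


def fake_translate(text: str) -> str:
    # Look up a whole phrase; fall back to the input when it is unknown.
    return TOY_GLOSSARY.get(text.lower(), text)


def fake_synthesize(text: str) -> bytes:
    # Pretend that encoding text is speech synthesis.
    return text.encode("utf-8")


pipeline = CascadePipeline(fake_recognize, fake_translate, fake_synthesize)
result = pipeline.run(b"Guten Morgen")
print(result)
```

Because the stages are chained, the quality of the final speech depends on the text produced by the middle stage, which is the step DeepL points to as its advantage.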
In blind evaluations commissioned by DeepL and carried out independently by Slator, a language industry research firm, 96% of professional linguists preferred DeepL Voice over the native translation services in Google Meet, Microsoft Teams, and Zoom, highlighting its superior fluency and contextual accuracy. DeepL Voice achieved scores of 96.4 out of 100 for Zoom and 96.3 for Microsoft Teams.
However, a live demo presented by Chief Product Officer Gonzalo Gaiolas during the DeepL Connect Seoul event on April 15 revealed a current limitation: a noticeable delay of one to two sentences between when the speaker finished and when the translation was delivered. Gaiolas directly acknowledged this delay, noting, “Different languages have different word orders and sentence structures, which causes delays in real-time interpretation,” as reported by Seoul Economic Daily.
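A toy sketch can show why word order forces a translator to wait. It assumes, purely for illustration, a system that buffers incoming words and only releases a sentence for translation once it is complete; the article does not say how DeepL segments speech, and the function `sentence_buffered` is hypothetical. In the German example, the main verb "gelesen" arrives last, while English needs the verb near the start ("I read the book yesterday"), so nothing useful can be emitted until the sentence ends.

```python
def sentence_buffered(words):
    """Yield (index of last word heard, full sentence) once a sentence is complete."""
    buffer = []
    for i, word in enumerate(words):
        buffer.append(word)
        # Treat terminal punctuation as the sentence boundary.
        if word.endswith((".", "?", "!")):
            yield i, " ".join(buffer)
            buffer = []


# German places the participle "gelesen" at the end of the clause.
stream = "Ich habe das Buch gestern gelesen. Es war gut.".split()
emitted = list(sentence_buffered(stream))

for last_index, sentence in emitted:
    print(f"after word {last_index + 1}: {sentence}")
```

In this sketch the first sentence only becomes available after its sixth word, so the translated output necessarily trails the speaker by at least one sentence.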
The company aims to reduce latency through ongoing model improvements. As for voice quality, the current system uses a fixed synthetic voice; DeepL said it plans to introduce, by the end of 2026, a voice-preservation feature that keeps the speaker’s original voice traits in the translated output.
DeepL is entering a competitive market filled with well-capitalized rivals. Sanas, which leverages AI to adjust speakers’ accents in real time for call center purposes, recently raised $65 million in a funding round led by Quadrille Capital.
Based in Dubai, Camb.AI specializes in speech synthesis and translation for media dubbing. Palabra, supported by Alexis Ohanian’s Seven Seven Six, is working on a real-time speech translation engine focused on preserving the speaker's voice characteristics.
Google, Microsoft, and Zoom all provide their own translation features for meetings, which represent both competition and potential integration channels for DeepL. The company's strategic gamble is that the quality of its translations, its most established differentiator, can outweigh the inherent advantages existing platforms have in distribution.
