Imagine learning chess from a grandmaster, or negotiating tactics from an expert FBI hostage negotiator. ElevenLabs’ voice AI technology is making that possible. Sarah Guo sits down with Mati ...
Mati Staniszewski, co-founder of ElevenLabs, discusses how his company reached $300M ARR in three years by building foundational audio models and products for both creators and enterprises. The conversation covers ElevenLabs' dual-platform strategy (creative tools and conversational agents), their approach to combining research with rapid product deployment, and why voice will become the primary interface for technology. Key insights include their 50/50 revenue split between self-serve and enterprise, the importance of controllable voice models over raw quality, and emerging use cases from AI tutors to agentic government.
Mati reveals ElevenLabs has reached $300M ARR with 350 employees globally, with revenue split 50/50 between its self-serve creative platform (5M monthly actives) and its enterprise agents platform (thousands of customers, including Fortune 500s). He addresses the counterintuitive choice to build both research and multi-segment products simultaneously, and explains how the initial insight came from the poor dubbing experience in Poland.
Mati explains ElevenLabs' organizational structure of creating specialized 'labs' (voice lab, agent lab, music lab) that combine researchers, engineers, and operators around specific problems. He discusses how they sequence research first, then build product layers, and the critical decision framework of whether to wait for research breakthroughs or ship product improvements (3-month rule).
Mati discusses how voice selection is highly subjective and context-dependent, which is why ElevenLabs employs 'voice sommeliers' who help enterprises choose appropriate voices for their brand and use cases. He reveals that benchmarking audio quality is fundamentally harder than benchmarking text or images because voice preference varies dramatically by use case, language, and even customer demographics.
Mati outlines the evolution of voice agents from reactive customer support to proactive assistants that enhance entire customer journeys. Examples include Meesho (India's largest e-commerce company) using agents for product discovery and checkout, Square enabling voice ordering, and Epic Games bringing Darth Vader's voice to Fortnite for interactive gameplay.
Mati identifies education as the most exciting future application, with examples of Chess.com offering lessons from Hikaru Nakamura and Magnus Carlsen, and Masterclass enabling practice negotiations with FBI negotiator Chris Voss. He envisions a future where students have personalized AI tutors on-demand while maintaining dedicated human interaction time.
Mati describes Ukraine's ambitious project to transform all government ministries using AI agents, combining customer support (benefits, employment, travel processes), proactive citizen communication, and education through a digital app. Engineering leaders in each ministry coordinate with a central digital transformation team.
Mati clarifies ElevenLabs' positioning against consulting firms (Palantir), point solution providers (Sierra), and foundation model companies (OpenAI). He argues that foundation labs won't focus on product layers or audio-specific research, while ElevenLabs excels at multi-use-case deployments with international support and forward-deployed engineering.
Mati predicts base models will commoditize within 2-4 years, making the product layer and fine-tuning critical. He distinguishes between narration (where open source quality is already good), real-time interaction (still 1+ year from passing a Turing test), and real-time dubbing (2 years away). The key differentiator is controllability, not raw quality.
Mati reveals ElevenLabs just released its Scribe 2 speech-to-text model, with sub-150ms latency and 93.5% accuracy across 30 languages. Upcoming releases include new orchestration mechanisms that add emotional context to conversations, and investments in parallelized speech-to-speech approaches for more natural interactions.
Mati shares his vision for voice as the primary interface, preferring 'Jarvis-style' super assistants over social companions. He predicts education will be transformed by on-demand AI tutors (potentially voiced by Richard Feynman or Einstein), while maintaining explicit human-to-human learning time. Voice will be critical for the coming 'decade of robots' following the 'decade of agents.'
The Future of Voice AI: Agents, Dubbing, and Real-Time Translation with ElevenLabs Co-Founder Mati Staniszewski