fal is building the infrastructure layer for the generative media boom. In this episode, founders Gorkem Yurtseven, Burkay Gur, and Head of Engineering Batuhan Taskaya explain why video models present fundamentally different optimization challenges than LLMs.
fal's founders discuss building infrastructure for generative media, explaining why video models present fundamentally different optimization challenges than LLMs—they're compute-bound rather than memory-bound, require custom kernels and tracing compilers, and have a 30-day half-life at the top of performance benchmarks. The team runs 600+ models simultaneously across 35 data centers, serving use cases from AI-native studios to personalized education and programmatic advertising. They argue the open-source video ecosystem thrives because visual outputs benefit from fine-tuning and customization in ways text models don't, and predict Hollywood will adapt to AI tools similar to how animation studios adopted CGI.
The team explains why generative video was initially overlooked compared to LLMs: lack of clear industry use cases and slower research investment. They made an early bet on image/video infrastructure when everyone else was focused on language models, recognizing the space was growing fast with less competition. Within months of positioning as a 'generative media platform,' Sora was announced, validating their thesis.
Batuhan (a 22-year-old Python core maintainer) explains fal's technical approach: a tracing compiler that finds common patterns across models and applies semi-generalized template kernels at runtime. This lets them optimize 600+ models simultaneously while maintaining mathematical correctness. The key difference from LLMs: video models are compute-bound (they saturate GPU compute), whereas LLMs are memory-bound (bottlenecked on moving large weights through memory).
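The compute-bound vs. memory-bound distinction can be made concrete with a back-of-envelope roofline check: a kernel is compute-bound when its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the GPU's compute-to-bandwidth ratio. The sketch below uses illustrative numbers (a hypothetical H100-class card and rough workload figures), not fal's actual profiles.

```python
# Roofline back-of-envelope: compute-bound when FLOPs-per-byte exceeds
# the hardware's balance point. All numbers are illustrative assumptions.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_compute_bound(intensity: float, peak_tflops: float, bandwidth_tbps: float) -> bool:
    """Compare kernel intensity against the hardware balance point."""
    balance = (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)  # FLOPs per byte
    return intensity > balance

# Hypothetical H100-class card: ~1000 TFLOPs low-precision, ~3.35 TB/s HBM.
BALANCE_GPU = dict(peak_tflops=1000, bandwidth_tbps=3.35)

# LLM decode: every weight byte is read once per generated token, so
# intensity is low (~2 FLOPs per parameter byte at batch size 1).
llm = arithmetic_intensity(flops=2 * 70e9, bytes_moved=70e9)

# Video diffusion: weights are reused across a huge activation volume
# (many frames denoised at once), so intensity is high.
video = arithmetic_intensity(flops=5e15, bytes_moved=2e12)

print(is_compute_bound(llm, **BALANCE_GPU))    # False -> memory-bound
print(is_compute_bound(video, **BALANCE_GPU))  # True  -> compute-bound
```

This is why the optimization playbooks diverge: LLM serving fights memory traffic (quantization, KV-cache tricks), while video serving fights raw FLOPs (custom kernels, compiler-generated fusions).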
The infrastructure challenge goes beyond GPU optimization: fal runs 600 models simultaneously across 35 heterogeneous data centers, routing traffic efficiently while loading/unloading models and managing warm caches. They built custom orchestration and CDN services to treat distributed compute as a homogeneous cluster, tapping into scarce GPU capacity wherever available.
Video generation requires dramatically more compute than text or images. Using a 200-token LLM prompt as 1x baseline: a single image is ~100x, and a 5-second 24fps video is ~10,000x (100 frames × 100x per frame). 4K video adds another 10x. Real-time video streaming at 24fps presents new infrastructure challenges around latency and network optimization.
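The episode's scaling arithmetic works out as follows (note that 5 seconds at 24 fps is actually 120 frames; the episode rounds to ~100):

```python
# Back-of-envelope compute scaling from the episode,
# with a 200-token LLM response as the 1x baseline.
LLM_BASELINE = 1              # 200-token text response
IMAGE = 100 * LLM_BASELINE    # one image ~ 100x

FPS, SECONDS = 24, 5
frames = FPS * SECONDS        # 120 frames; the episode rounds to ~100
video = 100 * IMAGE           # ~100 frames x 100x per frame = 10,000x
video_4k = 10 * video         # 4K adds another ~10x

print(video, video_4k)        # 10000 100000
```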
The 'omni-model' prediction hasn't materialized—specialized models outperform generalists for specific outputs. Visual domain benefits from fine-tuning in ways text doesn't: different aesthetics, styles, and use cases create 50+ active models with distinct 'personas.' Top model half-life is just 30 days, with expensive high-quality models and cheaper 'workhorse' models serving different use cases.
fal operates as a marketplace aggregating developers (demand) and model vendors (supply), including both proprietary APIs (OpenAI, Google) and open models they host. Model labs use fal for distribution to reach developers not locked into any single model. The platform's marketing machine and developer base enables exclusive/early launch access for models like Kling and MiniMax.
Professional workflows mirror traditional filmmaking: iterate on aesthetics with image models, create storyboards, then generate video by interpolating between keyframes. Top customers chain as many as 14 models (text-to-image → upscaler → image-to-video). fal built a no-code workflow builder with Shopify for non-technical teams, also accessible via API for production use.
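The model-chaining pattern above is essentially function composition, where each stage's output feeds the next. The sketch below is illustrative: the stage functions are stand-ins, not fal's real API.

```python
# Sketch of a text-to-image -> upscaler -> image-to-video chain.
# Stage names and signatures are illustrative assumptions.

def text_to_image(prompt: str) -> str:
    return f"image({prompt})"

def upscale(image: str) -> str:
    return f"upscaled({image})"

def image_to_video(image: str) -> str:
    return f"video({image}, 5s)"

def run_chain(prompt: str, stages) -> str:
    """Feed each stage's output into the next, like a workflow-graph edge."""
    result = prompt
    for stage in stages:
        result = stage(result)
    return result

clip = run_chain("a lighthouse at dawn", [text_to_image, upscale, image_to_video])
print(clip)  # video(upscaled(image(a lighthouse at dawn)), 5s)
```

A no-code workflow builder exposes the same composition visually; exporting the graph behind an API endpoint is what makes it usable in production.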
Emerging use cases span security training with dynamic content (Adaptive Security), AI-native studios creating apps like Faith (Bible stories), design tools (Canva, Adobe), and programmatic advertising. Education represents huge untapped potential—video can compress concepts better than text. Ads range from UGC-style to high-production (Coca-Cola) to personalized individual-level content.
Hollywood initially seemed too slow to adapt, but shifted in summer 2024. Jeffrey Katzenberg compared AI video to computer animation's arrival—initial rebellion, then inevitable adoption. Existing IP holders are well-positioned medium-term due to storytelling expertise, technical talent, and IP ownership. Both IP value increase and democratization of new IP creation (e.g., Italian Brainrot characters) are happening simultaneously.
Team predicts feature-grade short films (<20 min) in under a year, likely animation/fantasy rather than photorealistic due to cost dynamics. Text-to-game as natural evolution of text-to-video, with discardable hyper-casual games coming soon. Major R&D needs: better compression (24x on time dimension vs. current 4x), architectural improvements for 100x efficiency gains to reach real-time 4K generation.
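The temporal-compression figures above translate directly into sequence length: going from ~4x to ~24x compression on the time axis shrinks the number of latent frames the model must process by 6x for the same clip (the compression factors are from the episode; the clip parameters are illustrative).

```python
# Latent frames per clip under current ~4x vs. hoped-for ~24x temporal
# compression (episode figures; clip length is an illustrative choice).
def latent_frames(fps: int, seconds: int, time_compression: int) -> int:
    return (fps * seconds) // time_compression

current = latent_frames(fps=24, seconds=5, time_compression=4)   # 30
target  = latent_frames(fps=24, seconds=5, time_compression=24)  # 5

print(current, target, current / target)  # 30 5 6.0
```

Since attention cost grows superlinearly in sequence length, a 6x reduction in latent frames compounds with kernel and architecture gains toward the 100x efficiency target.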
# The Rise of Generative Media: fal's Bet on Video, Infrastructure, and Speed