fal is building the infrastructure layer for the generative media boom. In this episode, founders Gorkem Yurtseven, Burkay Gur, and Head of Engineering Batuhan Taskaya explain why video models present fundamentally different optimization challenges than LLMs.
fal's founders discuss building infrastructure for generative media, explaining why video models present fundamentally different optimization challenges than LLMs—they're compute-bound rather than memory-bound, require custom kernels and tracing compilers, and have a 30-day half-life at the top of performance benchmarks. The team runs 600+ models simultaneously across 35 data centers, serving use cases from AI-native studios to personalized education and programmatic advertising. They argue the open-source video ecosystem thrives because visual outputs benefit from fine-tuning and customization in ways text models don't, and predict Hollywood will adapt to AI tools similar to how animation studios adopted CGI.
The team explains why generative video was initially overlooked compared to LLMs: lack of clear industry use cases and slower research investment. They made an early bet on image/video infrastructure when everyone else was focused on language models, recognizing the space was growing fast with less competition. Within months of positioning as a 'generative media platform,' Sora was announced, validating their thesis.
Batuhan (a 22-year-old Python core maintainer) explains fal's technical approach: a tracing compiler that finds common patterns across models and applies semi-generalized template kernels at runtime. This lets them optimize 600+ models simultaneously while maintaining mathematical correctness. The key difference from LLMs: video models are compute-bound (they saturate GPU compute), whereas LLMs are memory-bound (bottlenecked on moving large weights through memory).
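The compute-bound vs. memory-bound distinction can be made concrete with a back-of-envelope roofline check: a kernel is compute-bound when its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the GPU's compute-to-bandwidth ratio. The sketch below uses illustrative numbers (a hypothetical H100-class card and rough workload figures), not fal's actual profiles.

```python
# Roofline back-of-envelope: compute-bound when FLOPs-per-byte exceeds
# the hardware's balance point. All numbers are illustrative assumptions.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_compute_bound(intensity: float, peak_tflops: float, bandwidth_tbps: float) -> bool:
    """Compare kernel intensity against the hardware balance point."""
    balance = (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)  # FLOPs per byte
    return intensity > balance

# Hypothetical H100-class card: ~1000 TFLOPs low-precision, ~3.35 TB/s HBM.
BALANCE_GPU = dict(peak_tflops=1000, bandwidth_tbps=3.35)

# LLM decode: every weight byte is read once per generated token, so
# intensity is low (~2 FLOPs per parameter byte at batch size 1).
llm = arithmetic_intensity(flops=2 * 70e9, bytes_moved=70e9)

# Video diffusion: weights are reused across a huge activation volume
# (many frames denoised at once), so intensity is high.
video = arithmetic_intensity(flops=5e15, bytes_moved=2e12)

print(is_compute_bound(llm, **BALANCE_GPU))    # False -> memory-bound
print(is_compute_bound(video, **BALANCE_GPU))  # True  -> compute-bound
```

This is why the optimization playbooks diverge: LLM serving fights memory traffic (quantization, KV-cache tricks), while video serving fights raw FLOPs (custom kernels, compiler-generated fusions).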
The infrastructure challenge goes beyond GPU optimization: fal runs 600 models simultaneously across 35 heterogeneous data centers, routing traffic efficiently while loading/unloading models and managing warm caches. They built custom orchestration and CDN services to treat distributed compute as a homogeneous cluster, tapping into scarce GPU capacity wherever available.
Video generation requires dramatically more compute than text or images. Using a 200-token LLM prompt as 1x baseline: a single image is ~100x, and a 5-second 24fps video is ~10,000x (100 frames × 100x per frame). 4K video adds another 10x. Real-time video streaming at 24fps presents new infrastructure challenges around latency and network optimization.
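The episode's scaling arithmetic works out as follows (note that 5 seconds at 24 fps is actually 120 frames; the episode rounds to ~100):

```python
# Back-of-envelope compute scaling from the episode,
# with a 200-token LLM response as the 1x baseline.
LLM_BASELINE = 1              # 200-token text response
IMAGE = 100 * LLM_BASELINE    # one image ~ 100x

FPS, SECONDS = 24, 5
frames = FPS * SECONDS        # 120 frames; the episode rounds to ~100
video = 100 * IMAGE           # ~100 frames x 100x per frame = 10,000x
video_4k = 10 * video         # 4K adds another ~10x

print(video, video_4k)        # 10000 100000
```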
The 'omni-model' prediction hasn't materialized—specialized models outperform generalists for specific outputs. Visual domain benefits from fine-tuning in ways text doesn't: different aesthetics, styles, and use cases create 50+ active models with distinct 'personas.' Top model half-life is just 30 days, with expensive high-quality models and cheaper 'workhorse' models serving different use cases.
fal operates as a marketplace aggregating developers (demand) and model vendors (supply), including both proprietary APIs (OpenAI, Google) and open models they host. Model labs use fal for distribution to reach developers not locked into any single model. The platform's marketing machine and developer base enables exclusive/early launch access for models like Kling and MiniMax.
Professional workflows mirror traditional filmmaking: iterate on aesthetics with image models, create storyboards, then generate video by interpolating between keyframes. Top customers chain as many as 14 models (text-to-image → upscaler → image-to-video). fal built a no-code workflow builder with Shopify for non-technical teams, also accessible via API for production use.
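The model-chaining pattern above is essentially function composition, where each stage's output feeds the next. The sketch below is illustrative: the stage functions are stand-ins, not fal's real API.

```python
# Sketch of a text-to-image -> upscaler -> image-to-video chain.
# Stage names and signatures are illustrative assumptions.

def text_to_image(prompt: str) -> str:
    return f"image({prompt})"

def upscale(image: str) -> str:
    return f"upscaled({image})"

def image_to_video(image: str) -> str:
    return f"video({image}, 5s)"

def run_chain(prompt: str, stages) -> str:
    """Feed each stage's output into the next, like a workflow-graph edge."""
    result = prompt
    for stage in stages:
        result = stage(result)
    return result

clip = run_chain("a lighthouse at dawn", [text_to_image, upscale, image_to_video])
print(clip)  # video(upscaled(image(a lighthouse at dawn)), 5s)
```

A no-code workflow builder exposes the same composition visually; exporting the graph behind an API endpoint is what makes it usable in production.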
Emerging use cases span security training with dynamic content (Adaptive Security), AI-native studios creating apps like Faith (Bible stories), design tools (Canva, Adobe), and programmatic advertising. Education represents huge untapped potential—video can compress concepts better than text. Ads range from UGC-style to high-production (Coca-Cola) to personalized individual-level content.
Hollywood initially seemed too slow to adapt, but shifted in summer 2024. Jeffrey Katzenberg compared AI video to computer animation's arrival—initial rebellion, then inevitable adoption. Existing IP holders are well-positioned medium-term due to storytelling expertise, technical talent, and IP ownership. Both IP value increase and democratization of new IP creation (e.g., Italian Brainrot characters) are happening simultaneously.
Team predicts feature-grade short films (<20 min) in under a year, likely animation/fantasy rather than photorealistic due to cost dynamics. Text-to-game as natural evolution of text-to-video, with discardable hyper-casual games coming soon. Major R&D needs: better compression (24x on time dimension vs. current 4x), architectural improvements for 100x efficiency gains to reach real-time 4K generation.
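The temporal-compression figures above translate directly into sequence length: going from ~4x to ~24x compression on the time axis shrinks the number of latent frames the model must process by 6x for the same clip (the compression factors are from the episode; the clip parameters are illustrative).

```python
# Latent frames per clip under current ~4x vs. hoped-for ~24x temporal
# compression (episode figures; clip length is an illustrative choice).
def latent_frames(fps: int, seconds: int, time_compression: int) -> int:
    return (fps * seconds) // time_compression

current = latent_frames(fps=24, seconds=5, time_compression=4)   # 30
target  = latent_frames(fps=24, seconds=5, time_compression=24)  # 5

print(current, target, current / target)  # 30 5 6.0
```

Since attention cost grows superlinearly in sequence length, a 6x reduction in latent frames compounds with kernel and architecture gains toward the 100x efficiency target.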
# The Rise of Generative Media: fal's Bet on Video, Infrastructure, and Speed