Workshop "From Noise to Narrative: The Evolution of Visual Generative Models"
Sergey Kastryulin
Yandex Research
Visual generative models have rapidly evolved from producing stochastic, uncontrolled outputs to synthesizing highly specific, prompt-aligned imagery. This talk provides a condensed overview of this progression, tracing the development of the key conditioning mechanisms that enabled the shift from simple class labels to the nuanced, free-form text control of today's powerful text-to-image systems. We will establish a clear conceptual account of how the field has systematically gained greater control over the generative process, setting the stage for the next wave of multi-modal innovation.
The current research frontier is now pushing beyond static generation toward the unification of vision and language within single, cohesive architectures. We will explore how these emerging multi-modal models are being designed not only to generate images but also to engage in dialogue, follow complex editing instructions, and produce interleaved text-and-image outputs. Drawing on insights from our group's work, the talk will address critical practical challenges, including model distillation, inference scaling, and fine-tuning for real-world applications. We will conclude by discussing the obstacles of large-scale data curation and high computational demands that must be overcome to realize the full potential of next-generation conversational visual systems.