Workshop "Understanding How LLMs Work: An Introduction to Interpretability", AIRI
Alexey Dontsov
AIRI
Elena Tutubalina
AIRI
We will cover the foundations of neural network interpretability, focusing on Sparse Autoencoders (SAEs), a simple but powerful technique that has brought us closer to understanding the inner workings of large language models. We will explore why self-attention works when viewed through the lens of information flow, what circuits are and how models leverage them to solve problems, and how these insights are advancing our understanding of LLM internals.
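For intuition, the core idea behind an SAE can be sketched in a few lines of PyTorch: learn an overcomplete, sparse code for a model's activations and reconstruct them from it. The sketch below is illustrative only; the dimensions (d_model=512, n_features=2048), the L1 coefficient, and the random stand-in activations are hypothetical placeholders, not the workshop's actual materials.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over LLM activations.

    Encodes d_model-dim activations into an overcomplete n_features-dim
    code with a ReLU nonlinearity, then reconstructs the input. An L1
    penalty on the code pushes most features to zero, which is what
    makes the learned features candidates for interpretation.
    """

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)       # reconstruction of the input
        return x_hat, f

# One toy training step: reconstruction loss plus L1 sparsity penalty.
sae = SparseAutoencoder(d_model=512, n_features=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

acts = torch.randn(64, 512)           # stand-in for captured LLM activations
x_hat, f = sae(acts)
loss = F.mse_loss(x_hat, acts) + l1_coeff * f.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the input would be activations captured from a fixed layer of a trained language model rather than random tensors, and individual dimensions of the sparse code f are then inspected as candidate features.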