
24-25 October

Moscow, HSE University

Call for posters
Registration

Tutorial "Matrix Optimization and Muon: A Natural Perspective on Neural Network Training"

Maxim Rakhuba
HSE University

How does a state-of-the-art optimizer like Muon actually work under the hood? In this first lecture, we will dive into the numerical core that powers its efficiency and explore the essential numerical linear algebra concepts that make Muon so effective. This involves solving matrix optimization problems, including the famous matrix Procrustes problem. This first part of the course will give you a smoother entry into the second part, where the optimization theory is discussed.

Slides
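To give a concrete flavour of the linear-algebra core mentioned in this first lecture, here is a minimal NumPy sketch (an illustration only, not the tutorial's material) of the Procrustes-type subproblem of finding the orthogonal matrix nearest to a given matrix; its solution is the orthogonal factor U V^T of the SVD, which is exactly the kind of computation at the heart of Muon.

```python
import numpy as np

def nearest_orthogonal(G: np.ndarray) -> np.ndarray:
    """Find the matrix with orthonormal columns closest to G in Frobenius norm.

    This Procrustes-type problem is solved by the SVD G = U diag(s) V^T:
    the minimizer is Q = U V^T (all singular values replaced by 1),
    i.e. the orthogonal factor of the polar decomposition of G.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Toy example: project a random "gradient" matrix onto the set of
# matrices with orthonormal columns.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
Q = nearest_orthogonal(G)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: columns are orthonormal
```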

Aleksandr Beznosikov
MIPT, Innopolis, ISP RAS

It is no secret that the best-known optimization methods are obtained from the simplest approximations of the target loss function by its Taylor expansion. This is how Newton's method arises from the second-order expansion, and if we coarsen this approximation, we get gradient descent. A rather natural question arises: what happens if this approximation is improved further? In this way, other methods similar to gradient descent can be obtained. The most interesting part begins when we note that it is more natural to view neural network optimization as a problem with matrix variables. This brings us to the Muon method, probably the most famous optimizer of the past year. In fact, other well-known methods such as Shampoo and SOAP live in the same neighborhood of ideas. During the tutorial, we will look at this story in detail.

Slides
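As a rough illustration of the matrix-variable viewpoint described above (a minimal sketch under simple assumptions, not the speaker's reference implementation), a Muon-style step keeps a momentum matrix for each weight matrix and replaces it with its nearest semi-orthogonal matrix before taking the step. For clarity the orthogonalization below uses an SVD; practical Muon implementations approximate it with a Newton-Schulz iteration instead. The names muon_style_step, lr and beta are illustrative choices, not fixed by the tutorial.

```python
import numpy as np

def muon_style_step(W, G, M, lr=0.02, beta=0.95):
    """One Muon-style update for a weight matrix W with gradient G.

    Sketch: accumulate heavy-ball momentum M, orthogonalize it by
    setting all its singular values to 1 (the U V^T factor of its SVD),
    and step along that direction.
    """
    M = beta * M + G                          # momentum on the matrix variable
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    W = W - lr * (U @ Vt)                     # step along the orthogonalized momentum
    return W, M

# Toy usage: minimize 0.5 * ||W||_F^2, whose gradient is simply W.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
M = np.zeros_like(W)
print(np.linalg.norm(W))                      # norm before the updates
for _ in range(10):
    G = W                                     # gradient of 0.5 * ||W||_F^2
    W, M = muon_style_step(W, G, M)
print(np.linalg.norm(W))                      # norm shrinks over the iterations
```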