• A
  • A
  • A
  • ABC
  • ABC
  • ABC
  • А
  • А
  • А
  • А
  • А
Regular version of the site

Scientific seminar

Structural Learning Seminar is a seminar for researchers, Master’s and Ph.D. students of HSE, Skoltech, IITP. This seminar is devoted to the problems of Mathematics and Computer Science and covers areas such as Mathematical Statistics, Machine Learning, Random Matrix Theory, Optimization, Information Theory, etc. 

Seminar organizers: Sergey Samsonov,  Nikita Puchkin .

Follow the updates of the seminars in our telegram channel: HDI Lab junior team seminar

Those who wish to make a presentation at the Seminar are asked to e-mail: npuchkin@hse.ru or svsamsonov@hse.ru with an indication of the topic of the presentation, an abstract in English (not more than 1/2 page) and a suitable date of presentation. The speaker should inform in advance about the chosen form of the report - on the chalk board or computer presentation (the preferred form of the report is on the chalk board). The length of the presentation is 1 hour (if necessary, the length of the presentation may be extended to 1.5-2 hours)

For the winter 2024 semester, the seminar takes place on Thursdays at 2:40 p.m.


  Speaker Title Abstract


Marina Sheshukova  (HSE University) and Arthur Goldman  (HSE University)

Part l: Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements

Part ll: Reflection Coupling for Wasserstein geometric convergence

At the seminar, we will discuss the general formulation of the problem of variational inequalities. We will analyze two algorithms Stochastic Gradient Descent Ascent and Double Step-size Stochastic Extra Gradient as well as their theoretical properties.


In Part ll, we will look at how it is possible to prove the convergence of measures according to the Waserstein metric through the construction of caplings.

25.01.2024 Alexander Beznosichov (MIPT) On Distributed Optimization Problems under Similarity: Optimal Algorithms and Fast Extensions In this presentation, we consider distributed methods for solving optimization problems, namely variational inequalities, which are interesting both in themselves and because they include both classical minimization problems and minimax problems. In the distributed formulation, the target operator of a variational inequality is divided into parts, and each of these parts can be accessed only by a local agent/worker. We deal with the case where the local operators are "similar" to each other in some sense. Due to the "similarity" it is possible to achieve a significant acceleration of the theoretical guarantees of convergence of methods in terms of estimates on communication complexity. Besides the issue of convergence of algorithms and obtaining upper bounds, we touch upon lower complexity bounds and verify the optimality of the proposed methods. In the remaining time, we try to discuss the question of how we can "break through" the lower estimates and construct an even faster method, in particular, we additionally introduce the possibility of compressing the transmitted information, modify the proposed algorithms and obtain upper and lower bounds in a new formulation.
15.02.2024 Ilya Levin (HSE University) Central Limit Theorems for Stochastic Approximation
22.02.2024 Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin) Gaussian Variational Inference Variational inference (VI) considers the problem of approximating a given measure (posterior) by a member of a given parametric family by minimizing the related KL-divergence. Gaussian VI focuses on the Gaussian approximation. We revisit the recent results by Katsevich and Rigollet (2023) and demonstrate how this problem can be solved using some advanced tools from optimization (the basic lemma).
29.02.2024 Yuri Svirshchevsky (HSE University) Monte Carlo Variational Auto-Encoders Variational autoencoders are generative models that are trained by maximizing some lower estimate of the validity of the observed data (evidence lower bound or ELBO). The quality of the trained model depends on the accuracy of the assessment. The seminar will explore ways to obtain two ELBO based on Annealed Importance Sampling and Sequential Importance Sampling, their differentiation and application in VAE.
07.03.2024 Nikita Puchkin (HSE University) Dimension-free Structured Covariance Estimation Given a sample of i.i.d. high-dimensional centered random vectors, we consider a problem of estimation of their covariance matrix $\Sigma$ with an additional assumption that $\Sigma$ can be represented as a sum of a few Kronecker products of smaller matrices. Under mild conditions, we derive the first non-asymptotic dimension-free high-probability bound on the Frobenius distance between $\Sigma$ and a widely used penalized permuted least squares estimate. Because of the hidden structure, the established rate of convergence is faster than in the standard covariance estimation problem.
14.03.2024 Alexander Osinsky (Skoltech, IVM RAS) Existence and construction of column and cross approximations of matrices close to optimal: probabilistic estimates -
28.03.2024 Zhuo-Song Zhang (Southern University of Science and Technology) Berry-Esseen bounds for generalized U-statistics We establish optimal Berry-Esseen bounds for the generalized U- statistics. The proof is based on a new Berry-Esseen theorem for exchangeable pair approach by Stein’s method under a general linearity condition setting. As applications, an optimal convergence rate of the normal approximation for subgraph counts in Erdös-Rényi graphs and graphon-random graph is obtained.
04.04.2024 Vadim Popov (Huawei) Advantages of additional training of diffusion probabilistic models using the Huber loss function -



Fall 2023 

Date Speaker Title Abstract


Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)

Deviation bounds for the norm of a random vector under exponential moment conditions with applications

Hanson-Wright inequality provides a powerful tool for bounding the norm \| \xi \| of a centered stochastic vector \( \xi \) with sub-gaussian behavior. This paper extends the bounds to the case when \( \xi \) only has bounded exponential moments of the form \( \log E \exp \langle V^{-1} \xi,u \rangle \leq \| u \|^{2}/2 \), where \( V^{2} \geq \mathrm{Var}(\xi) \) and \( \| u \| \leq g \) for some fixed \( g \). For a linear mapping \( Q \), we present an upper quantile function \( z_{c}(B,x) \) ensuring \( P(\| Q \xi \| > z_{c}(B,x)) \leq 3 e^{-x} \) with \( B = Q \, V^{2} Q^{T} \). The obtained results exhibit a phase transition effect: with a value \( x_{c} \) depending on \( g \) and \( B \), for \( x \leq x_{c} \), the function \( z_{c}(B,x) \) replicates the case of a Gaussian vector \( \xi \), that is, \( z_{c}^{2}(B,x) = \tr(B) + 2 \sqrt{x \tr(B^{2})} + 2 x \| B \| \). For \( x > x_{c} \), the function \( z_{c}(B,x) \) grows linearly in \( x \). The results are specified to the case of Bernoulli vector sums and to covariance estimation in Frobenius norm.


Arthur Goldman (HSE University) RL with friends: how many agents does it take to solve an environment? As part of the report, students will get acquainted with the main definitions and results of the field related to the study multi-agent reinforcement learning и mean-field games.


Nikita Puchkin (HSE University) Sharper dimension-free bounds on the Frobenius distance between sample covariance and its expectation We consider the problem of covariance estimation from a sample of high-dimensional random vectors. We derive dimension-free bounds on the squared Frobenius distance between sample covariance and its expectation, which significantly improve over the existing results.


Fedor Noskov (HSE University, MIPT, Skoltech) Analysis of Boolean functions At the seminar, we will discuss general techniques of Boolean analysis, as well as the consequences that these techniques lead to in learning theory, approximation theory, and combinatorics. The main subject of our discussion will be the Fourier analysis of Boolean functions. Also, we will touch a little on the phenomenon of hypercontractivity and its various applications.
19.10.2023  Denis Rakitin (HSE University)  Generative models based on ODE/DE  -
09.11.2023 Evgeny Frolov (HSE University, AIRI) Learning from sequences using tensor decompositions The report is devoted to the application of tensor decompositions and effective computational methods of linear algebra to create an analog of the self-attention model on sequences of user actions. This approach can be applied both in the standard task of predicting the next user actions, and in a wider range of tasks related to the analysis of incomplete multidimensional time series.
14.12.2023 Alexander Tarakanov (HSE University, Odnoklassniki). Application of machine learning methods for computing eigenfunctions of differential operators -

Spring 2023
Date Speaker Title Abstract


Alexandra Senderovich (HSE University)

Norm and trace estimation with random rank-one vectors (paper overview)

To estimate the norm of an arbitrary matrix or the trace of a symmetric nonnegatively defined matrix, several multiplications of the matrix by random vectors are often enough. Using rank 1 vectors, i.e. Kronecker products of smaller random vectors, it is possible to speed up the calculation of such an estimate by speeding up sampling and matrix multiplication. The article shows that with random vectors of rank 1 sampled from a normal distribution or a Rademacher distribution, the probabilities of getting good estimates for the norm deteriorate slightly compared to using ordinary vectors from the same distributions. The authors also conducted numerical experiments confirming the derived probabilities and showing the advantages of the approach with rank 1 vectors.

31.01.2023 Dmitry Lishudi (HSE University) "How benign is benign overfitting?" On many datasets, neural networks when trained with SGD achieve almost zero error on training data and good generalizability on test data, even when the training partitioning is noisy. This effect is called benign overfitting. Nevertheless, such models are vulnerable to adversarial attacks. The first reason for this vulnerability is poor sampling. Thus, theoretical and empirical analysis shows that noise in data partitioning is one of the causes of adversarial vulnerability and robust models fail to achieve zero training error. However, removing incorrect labels does not achieve resilience to adversarial attacks. The paper suggests that it is also a matter of suboptimal representation learning. Using a simple theoretical statement, it is shown that the choice of representations can significantly affect adversarial stability.

7.02.2023 - 12.02.2023


Сonference and Autumn School Fall into ML



Nikita Puchkin (HSE University)

 Exploring Local Norms in Exp-concave Statistical Learning

 We consider the standard problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O ( d/n + 1/n log ( 1/ \delta ) )$  excess risk bound valid for a wide class of bounded exp-concave losses, where d is the dimension of the convex reference set, n is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.


Denis Belomestny (HSE University, Duisburg Essen University)

A Reproducing Kernel Hilbert Space approach to singular local stochastic volatility McKean-Vlasov models

Motivated by the challenges related to the calibration of financial models, we consider the problem of numerically solving a singular McKean-Vlasov equation
This equation can be considered as a singular local stochastic volatility model.
Whilst such models are quite
popular among practitioners, unfortunately, its well-posedness has not been fully understood yet and, in general, is possibly not guaranteed at all.
We develop a novel regularization approach based on the reproducing kernel Hilbert space (RKHS) technique and show that the regularized model is well-posed. Furthermore, we prove propagation of chaos. We demonstrate numerically that a thus regularized model is able to perfectly replicate option prices due to typical local volatility models. Our results are also applicable to more general McKean--Vlasov equations.


Sergey Samsonov (HSE University)

Finite-time high-probability bounds for linear stochastic approximation

We discuss a finite-time analysis of linear stochastic approximation (LSA) algorithms, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a d-dimensional linear system, where both the system matrix and the right-hand side can only be estimated through (asymptotically) unbiased observations. We consider here the case where the observations are driven either by sequence of i.i.d. noise variables or by uniformly geometrically ergodic Markov chain. We discuss the p-moments inequality and high probability bounds for the LSA iterates and its Polyak-Ruppert averaged version focusing on tight dependence upon the mixing time of the chain. We then prove finite-time instance-dependent bounds on the Polyak-Ruppert averaged sequence of iterates. These results are sharp in the sense that the leading term we obtain matches the local asymptotic minimax limit. We also cover the applications to the temporal-difference methods in RL


Maxim Vasiliev (HSE University)

Approximation and sampling of multivariate probability distributions in the tensor train decomposition (papers overview)



Fedor Noskov (HSE University, MIPT, Skoltech)

Optimal Estimation in Mixed-Membership Stochastic Block Model

Community detection is one of the central problems in modern network science. It has numerous applications in the analysis of social and biological networks, designing network protocols and many other areas. Recently, much attention has been paid to the detection of overlapping communities, where each node in a network may belong to multiple groups. We consider the parameter estimation problem in Mixed Membership Stochastic Block Model (MMSB), which is a quite general instance of a random graph model allowing for overlapping community structure. We discuss different approaches to parameter estimation in this model, their theoretical properties and practical performance. Finally, we propose an approach to improve over existing estimates that allows obtaining minimax optimal rates of estimation for all the model parameters.


Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)

Inference in Error-in-Operator model

We consider the problem of recovering the source signal \( x \) from the noise observation \( Y \) given by the equation \( Y = A x + \epsilon \) in the situation when the operator \( A \) is not precisely known. Instead, a pilot estimate \( \hat{A} \} is available. New approach is suggested which allows to obtain finite sample nearly sharp accuracy guarantees and incorporate the smoothness of the signal \( x \) and the operator \( A \) in a rather general form.


Evgeny Lagutin (HSE University)

Stable Rank Normalization for Improved Generalization in Neural Networks and GANs (paper overview)

Exciting new work on generalization bounds for neural networks (NN) given by Bartlett et al. (2017); Neyshabur et al. (2018) closely depend on two parameter- dependent quantities: the Lipschitz constant upper bound and the stable rank (a softer version of rank). Even though these bounds typically have minimal practical utility, they facilitate questions on whether controlling such quantities together could improve the generalization behaviour of NNs in practice. Spectral Normalization (Miyato et al., 2018) is a technique making it easy to restrict the Lipschitz constant upper bound and thus is a popular way to stabilize training of generative adversarial networks and is to make a general discriminative neural network robust. On the other hand Spectral Rank Normalization (Sanyal et al., 2020) tackles the problem of restricting a stable rank of neural network's layers. Motivated by the generalization bound view this technique in combination with Spectral Normalization further improve the robustness properties on classification tasks and the quality on image generation tasks.


Evgeny Nikishin (Mila, University of Montreal)

Deep Reinforcement Learning with Plasticity Injection

A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. We introduce plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool — if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent exhausted its plasticity and has to re-learn from scratch or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.


Daniil Tyapkin (HSE University)

Fast Rates for Maximum Entropy Exploration

We consider the reinforcement learning (RL) setting, in which the agent has to act in unknown environment driven by a Markov Decision Process (MDP) with sparse or even reward free signals. In this situation, exploration becomes the main challenge. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization that was previously considered by Hazan et al. 2019 in the discounted setting. For this type of exploration, we propose an algorithm based on a game theoretic representation that has O(HSA/eps^2) sample complexity thus improving the eps- and S-dependence of Hazan, where S is a number of states, A is a number of actions, H is an episode length, and eps is a desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to the entropy-regularized MDPs, and we propose a simple modification of the UCBVI algorithm that has a sample complexity of order 1/eps ignoring dependence in S,A,H. Interestingly enough, it is the first theoretical result in RL literature establishing that the exploration problem for the regularized MDPs can be statistically strictly easier (in terms of sample complexity) than for the ordinary MDPs.


Arthur Goldman (HSE University)

Theoretical guarantees for neural control variates in MCMC

In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory.


Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)

Sharp deviation bounds for quadratic forms and their use in machine learning

Under a mild assumption of linearity on the log-likelihood function, the tools of empirical processes are not necessary, the analysis of a general MLE can be reduced to a deviation bound for a quadratic form. This paper explains how the recent advances in Laplace approximation from [Spokoiny, 2022, 2023] can be used for obtaining sharp Gaussian-like finite sample bounds and for stating the prominent concentration phenomenon for the squared norm of a sub-gaussian vector. Some extensions and open problems will be discussed as well.


Yana Hassan (HSE University)

Fast Rates for Maximum Entropy Exploration

Nowadays it is becoming extremely popular to tune the large pertained language model with experts comments using reinforcement learning with human feedback methodology. In my talk I would focus on a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). The analysis from the paper shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. The results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. In my talk I am going to dicsuss the unified analysis the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and illustrate the first sample complexity bound for max-entropy IRL.


Alexander Podkopaev (Carnegie Mellon University)

Nonparametric two-sample and independence testing by betting

Two-sample and independence testing are two fundamental problems of statistical inference. In the former, given observations from two distributions, the goal is to test the null hypothesis that the distributions are equal against the alternative that they are not. In the latter, given observations drawn from an unknown joint distribution, the goal is to test the null that the underlying random variables are independent against the alternative that they are not. Both problems have been extensively studied in the batch setting when the sample size is specified before collecting data. The main issue is that in general composite nonparametric settings, even if the null is false, it is unknown beforehand collecting how much data will be ''enough'' to reject. Sequential testing is a complementary approach when an incoming data stream is analyzed online. In this talk, I will discuss our recent sequential two-sample and independence tests that (a) continuously monitor the data while controlling the false alarm rate, (b) are consistent, meaning that they are guaranteed to stop and reject if the null is false, and (c) provably adapt to the complexity of a problem and stop earlier on easy tasks / later on harder ones.


Ulyana Vinogradova (HSE University)

Leveraging covariate adjustments at scale in online A/B testing

Various companies run randomized online experiments to estimate the “causal impact” of new adopted features on performance of metrics. Difference in means estimator is still one of the most commonly used method in many online A/B testing platforms due to its simplicity. However, this estimator often fails to detect small effects of the intervention despite the fact that the experiment contains thousands or millions of customers. A potential solution to this issue is to develop an alternative estimator, that leverages the abundance of additional data (covariates) collected during the experiments. In this talk, I will discuss recently proposed general-purpose algorithm for the estimation of causal effects with covariates to the setting of online A/B testing. Also I will talk about benefits, costs and risks of allowing experimenters to leverage more complicated estimators that use covariates to estimate causal effects.


Ivan Safonov (HSE University)

Tensorization of flows: a tool for variational inference

Consider the classical problem of learning the approximation from a given class of distributions to a given distribution up to a normalizing constant. To solve this problem, you can apply flow normalization. The unimodality of the reference Gaussian distribution used in the normalizing flow may cause difficulties in studying multimodal distributions. We introduce an extension of normalizing flows, in which the Gaussian reference value is replaced by a reference distribution (tensor flow), which is constructed using a tensor network. TF combined with NF gives better results than each of them individually. In this talk, I will talk about the recently proposed tensorizing flow, how to extract a sample from it, how to study it, and how to apply it to VI together with NF. The tensor flow construction is similar to the construction described in Sergey Dolgov's articles, but it has some interesting differences that we will discuss.


Ivan Peshekhonov and Alexey Arzhantsev (HSE University)

Exploring the Benefits of Riemannian Optimization on Shared Factor Tensor Manifolds

Factorization of a matrix into a product of two rectangular ones, called factors, is a classic tool in various machine learning applications. Tensor factorizations generalize this concept to more than two dimensions. In certain applications, where some of the tensor dimensions encode the same objects (e.g., knowledge base graphs with entity-relation-entity 3D tensors), it may also be beneficial for the respective factors to be shared. In this paper, we consider a well-known Tucker tensor factorization and study its properties under the shared factor constraint. We call it a shared-factor Tucker factorization (SF-Tucker). We prove that a manifold of fixed rank SF-Tucker tensors forms a smooth manifold and develop efficient algorithms for the main ingredients of Riemannian optimization: projection to a tangent plane for general-purpose loss functions and retraction. Having these tools at hand, we also implement a Riemannian optimization method with momentum and showcase the benefits of our approach on several machine learning tasks.


Irina Goloborodko (HSE University)

Jacobian regularization of tensor decompositions in neural networks

Tensor decompositions can be used as a compression tool for neural networks. For this purpose the tensor of weights of a full-connected or convolutional layer is replaced in the architecture by its own tensor decomposition. The disadvantage of this approach is the instability of the compressed neural network for some types tensor decompositions. In order to improve the stability of the compressed neural network, randomized Jacobian spectral regularizer can be applied. The regularizer relies on matvec functions of the Jacobian and its transposed and can be efficiently computed using automatic differentiation tools. This results in great versatility of this method and makes it possible to apply proposed technique in different scenarios.


Nikita Morozov (HSE University)

Introduction to Generative Flow Networks (GFlowNets)


Fall 2022
Date Speaker Title Abstract


Denis Belomestny (HSE University, Duisburg Essen University)

Statistics of McKean-Vlasov equations

In this talk we study the problem of semiparametric estimation for a class of McKean-Vlasov stochastic differential equations. Our aim is to estimate the drift coefficient of a MV-SDE based on observations of the corresponding particle system. We propose a semiparametric estimation procedure and derive the rates of convergence for the resulting estimator. We further prove that the obtained rates are essentially optimal in the minimax sense.


Alexander Tarakanov (HSE University)

Multi-Level Unbiased MCMC Estimators for Discretized Models

ODE and PDE models routinely in different applications. Significant part of practical questions can be reduced to integration over the distribution of ODE/ODE parameters spaces. The practical way for solving such problem is MCMC integration coupled with spatial/temporal discretization of ODE/PDE model. Such discretization introduces bias to MCMC integration. In the present talk the novel scheme for Multi-Level Unbiased MCMC method is discussed. The technique concerned is provides unbiased estimator with finite variance and finite expected computational cost for a given Quantity of Interest under reasonable smoothness constraints. Application examples are provided.


Yana Khassan (HSE University)

Is Q-learning Provably Efficient? (Article overview)

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn. The theoretical question of “whether model-free algorithms can be made sample efficient” is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. The paper proves that, in an episodic MDP setting, Q-learning with UCB exploration achieves almost optimal regret. This is the first analysis in the model-free setting that establishes O(√T) regret without requiring access to a “simulator.”


Dmitry Molchanov (HSE University)

Twice stochastic variational inference with semi-implicit and non-invariant distributions

Doubly stochastic variational inference (DSVI) is one of the main methods of approximate inference and learning in modern deep probabilistic models. However, this method imposes significant restrictions on the distributions used and has a limited area of applicability. The main goal of my work is to relax these limitations and expand the tools for working with deep probabilistic models.
Using a variation dropout model as an example, we show how removing restrictions on variation parameters allows us to open up new modes of model operation - the rarefaction mode and the "variance network" mode.
Then, we generalize DSVI to the case of models with so-called semi-implicit distributions. The proposed approach, Doubly Semi-Implicit Variational Inference, allows inference in new classes of models, refine inference in existing models, refine marginal likelihood estimation, and accelerate inference in models with mixture-like distributions.
Translated with www.DeepL.com/Translator (free version)


Evgenii Nikishin (Mila, Université de Montréal)

The Primacy Bias in Deep Reinforcement Learning

This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Because of training on progressively growing datasets, deep RL agents incur a risk of overfitting to earlier experiences, negatively affecting the rest of the learning process. Inspired by cognitive science, we refer to this effect as the primacy bias. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generally-applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent. We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance.


Valeriia Shcherbakova (HSE University)

A Contrastive Approach to Online Change Point Detection

We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.


Autumn School

Fall into ML 2022



Artur Goldman (HSE University)

Theoretical guarantees for approximate sampling from smooth and log-concave densities

Sampling from various kinds of distributions is an issue of paramount importance in statistics. In many situations, the exact sampling from a given distribution is impossible or computationally expensive and, therefore, one needs to resort to approximate sampling strategies. However, there is no well-developed theory providing meaningful nonasymptotic guarantees for the approximate sampling procedures, especially in the high-dimensional problems. This paper makes some progress in this direction by considering the problem of sampling from a distribution having a smooth and log-concave density. Authors establish nonasymptotic bounds for the error of approximating the target distribution by the one obtained by the Langevin Monte Carlo method and its variants, and illustrate the effectiveness of the established guarantees with various experiments.


Daria Demidova (HSE University)

Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

This paper studies the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution V = e-f on ℝn. Authors prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming that V satisfies a log-Sobolev inequality and the Hessian of f is bounded. Notably, convexity or bounds on higher derivatives are not assumed. Authors also prove convergence guarantees in Rényi divergence of order q > 1 assuming the limit of ULA satisfies either the log-Sobolev or Poincaré inequality. Authors also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of f, without requiring isoperimetry.

21.11.2022 and 28.11.2022

Anton Zubekhin (HSE University)

Diffusion Probabilistic Models explosion

Diffusion probabilistic models are state-of-the-art method for image generation. More than that, diffusion models also perform good working with data of different modalities, such as speech and text. Text-to-image diffusions are the most well-known types of such models with Dalle-2, Imagen and Stable Diffusion being famous even outside of ML community. Recent advances in machine learning show even wider applicability of diffusion models - they can be used for video generation, 3D reconstructions, meta-learning and even direct optimization of other neural networks.


Varvara Rudenko (HSE University)

Log-concave sampling: Metropolis-Hastings algorithms are fast

The authors study the problem of sampling from a strongly log-concave density supported on R^d, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), their bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. And also demonstrate the gains of a modified version of MALA over ULA for weakly log-concave densities. Furthermore, they derive mixing time bounds for the Metropolized random walk (MRW) and obtain O(κ) mixing time slower than MALA. They provide numerical examples that support their theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin- type sampling algorithms.


Marina Sheshukova (HSE University)

Exponential Concentration Inequalities for Additive Functionals of Markov Chains

Using the renewal approach we prove exponential inequalities for additive functionals and empirical processes of ergodic Markov chains, thus obtaining counterparts of inequalities for sums of independent random variables. The inequalities do not require functions of the chain to be bounded and moreover all the involved constants are given by explicit formulas whenever the usual drift condition holds, which may be of interest in practical applications e.g. to MCMC algorithms.


Timofey Grinev (HSE University)

Sharper Bounds for Uniformly Stable Algorithms

Deriving generalization bounds for stable algorithms is a classical question in learning theory. This paper is devoted to two questions: firstly it provides a short proof of the moment bound that implies the generalization bound stronger than recent results. Secondly, it proves general lower bounds, showing that this moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. The main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.

Fall 2021
Date Speaker Title Abstract


Vladimir Spokoiny (WIAS, IITP RAS, HSE University)

Bayesian inference in machine learning

Motivated by examples from nonparametric binary classification, ranking problems, analysis of random graphs, we consider the problem of statistical inference for binary data within a nonparametric Bernoulli model. In many situations the underlying Bernoulli parameter may approach the edges zero or one that leads to inference for non-regular models. Properties like asymptotic normality of the MLE or of the posterior distribution are not valid in such irregular cases. The approach of this paper suggests to impose a Gaussian prior on the parameter of logistic regression function, or, analogously, to consider a penalized MLE of the canonical link with a quadratic penalty. A Gaussian prior can be viewed as a kind of a barrier preventing the canonical parameter to approach infinity and forcing it to stay within the region of regularity. Our main results state finite sample analogs of the classical statistical results like asymptotic normality of the posterior distribution or asymptotic normality of the penalized MLE for any configuration of Bernoulli parameters under the assumption that the prior is strong enough in direction. We also demonstrate how the results can be applied to ranking problems and nonparametric binary classification.


Nikita Puchkin (HSE University, IITP)

Rates of convergence for density estimation with GANs

We undertake a precise study of the non-asymptotic properties of vanilla generative adversarial networks (GANs) and derive theoretical guarantees in the problem of estimating an unknown $d$-dimensional density $p$ under a proper choice of the class of generators and discriminators. We prove that the resulting density estimate converges to $p$ in terms of Jensen-Shannon (JS) divergence at the rate $n^{-2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $p$. This is the first result in the literature on density estimation using vanilla GANs with JS rates faster than $n^{-1/2}$ in the regime $\beta>d/2$.


Sergey Samsonov (HSE University)

On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning

This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online-learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions and uniform ergodicity of the Markov chains. Our main contribution is an exponential stability result for the p-th moment of random matrix product, provided that (i) the underlying Markov chain satisfies a super-Lyapunov drift condition, (ii) the growth of the matrix-valued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finite-time p-th moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear value-function estimation in reinforcement learning. We provide finite-time p-th moment bound for various members of temporal difference (TD) family of algorithms.


Dmitry Yarotsky (Skoltech)

Approximation with deep neural networks

In recent years, there has been quite a number of new results on approximation properties of deep neural networks. This emergent theory is rather different from the classical approximation theory of linear or shallow models (such as splines or Fourier expansion). In this talk, I will try to highlight some interesting ideas of this theory, and will also briefly indicate its connections to other topics such as information theory, dynamical systems, fewnomials and VC dimension.


Nikita Puchkin (HSE, IITP)

Exponential Savings in Agnostic Active Learning through Abstention

We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss 1/2 of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under the model misspecification.


Nikita Zhivotovskiy (ETH, Zurich)

Distribution-Free Robust Linear Regression

We consider the problem of random-design linear regression, in a distribution-free setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability. Joint work with Jaouad Mourtada (CREST-ENSAE Paris ) and Tomas Vaškevičius (Oxford).


Maxim Panov (Skoltech)

Stochastic normalizing flows

Normalizing flows is one of the central concepts in modern Bayesian machine learning. They allow constructing approximations of complex distributions starting from simple ones (e.g., Gaussian) and transforming it by a series of deterministic mappings. Normalizing flows have found numerous applications being especially useful in variational approach to Bayesian inferences. In this talk, we will show that classical deterministic normalizing flows have severe limitations which can be overcome by adding stochasticity to the procedure. It can significantly improve the power of the method but requires much more complicated training procedures. We will discuss how one can construct variational learning procedures based on the coupling of normalizing flows with the Metropolis-Hastings accept/reject algorithm. We will also see numerous connections with other classical sampling approaches such as Annealed Importance Sampling (AIS) and Sequential Importance Sampling (SIS). We will illustrate the theory with applications to sampling and the construction of variational autoencoders.(Joint work with Eric Moulines and Achille Thin (Ecole Polytechnique), Alain Durmus (ENS Paris-Saclay), Arnaud Doucet (Uni Oxford) and Nikita Kotelevskii (Skoltech)).


Dmytro Perekrestenko (ETH Zurich)

Constructive Universal High-Dimensional Distribution Generation through Deep ReLU

We present an explicit deep neural network construction that transforms uniformly distributed one-dimensional noise into an arbitrarily close approximation of any two-dimensional Lipschitz-continuous target distribution. The key ingredient of our design is a generalization of the “space-filling” property of sawtooth functions discovered in (Bailey & Telgarsky, 2018). We elicit the importance of depth—in our neural network construction—in driving the Wasserstein distance between the target distribution and the approximation realized by the network to zero. We also will outline an extension of our method to output distributions of arbitrary dimension. Finally, we show that the proposed construction does not incur a cost—in terms of error measured in Wasserstein-distance—relative to generating d-dimensional target distributions from d independent random variables.


Daniil Tiapkin (HSE University)

Improved Complexity Bounds in the Wasserstein Barycenter Problem

In this paper, we focus on computational aspects of the Wasserstein barycenter problem. We provide two algorithms to compute Wasserstein barycenter of m discrete measures of size n with accuracy $\e$. The first algorithm, based on mirror prox with some specific norm, meets the complexity of celebrated accelerated iterative Bregman projections (IBP), that is $\widetilde O(mn^2\sqrt n/\e)$, however, with no limitations unlike (accelerated) IBP, that is numerically unstable when regularization parameter is small. The second algorithm, based on area-convexity and dual extrapolation, improves the previously best-known convergence rates for Wasserstein barycenter problem enjoying $\widetilde O(mn^2/\e)$ complexity.


Alexey Kroshnin (IITP RAS, HSE University)

Robust k-means clustering in metric spaces

In this talk we consider robust median of means based algorithms for the k-means clustering problem in a metric space. The main results are non-asymptotic excess distortion bounds that hold under the two bounded moments assumption in a general separable metric space. In the case of Hilbert space our bounds have the sub-Gaussian form depending on the probability mass of the lightest cluster of an optimal quantizer. In a special case of clustering in R^d we prove matching (up to constant factors) non-asymptotic upper and lower bounds on the excess distortion, which depend on the probability mass of the lightest cluster and on the variance. The talk is based on a joint work with Yegor Klochkov and Nikita Zhivotovskiy


Quentin Paris (HSE University)

Online learning with exponential weights in metric spaces

This paper addresses the problem of online learning in metric spaces using exponential weights. We extend the analysis of the exponentially weighted average forecaster, traditionally studied in a Euclidean settings, to a more abstract framework. Our results rely on the notion of barycenters, a suitable version of Jensen's inequality and a synthetic notion of lower curvature bound in metric spaces known as the measure contraction property. We also adapt the online-to-batch conversion principle to apply our results to a statistical learning framework.


Иван Оселедец (Сколтех)

Геометрия в моделях машинного обучения

Об использовании геометрического подхода к моделям машинного обучения, а также приведу обзор наших результатов последних лет, включая: а) применение гиперболических пространств в машинном обучении и создание рекомендательных моделей; б) построение интерпретируемых направлений в латентном пространстве генеративных сетей.



New Frontiers in High-dimensional Probability and Applications to Machine Learning



Egor Klochkov (Cambridge University)

Fast rates for strongly convex optimization via stability

The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondrák, 2018, 2019), (Bousquet, Klochkov, Zhivotovskiy, 2020) contain a generally inevitable sampling error term of order 1/sqrt(n). When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the so-called Bernstein condition is satisfied, the term O(1/sqrt(n)) can be avoided, and high probability excess risk bounds of order up to O(1/n) are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate O(log n / n) for strongly convex and Lipschitz losses valid for any empirical risk minimization method. This resolves a question of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2009). We discuss how O(logn/n) high probability excess risk bounds are possible for projected gradient descent in the case of strongly convex and Lipschitz losses without the usual smoothness assumption. This is a joint work with Nikita Zhivotovskiy and Olivier Bousquet.


Sergey Samsonov (HSE University)

UVIP: Model-Free Approach to Evaluate Reinforcement Learning Algorithms

Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL). Yet even a precise knowledge of the value function V^{\pi} corresponding to a policy \pi does not provide reliable information on how far is the policy \pi from the optimal one. We present a novel model-free upper value iteration procedure (UVIP) that allows us to estimate the suboptimality gap V*(x) - V^{\pi}(x) from above and to construct confidence intervals for V*. Our approach relies on upper bounds to the solution of the Bellman optimality equation via martingale approach. We provide theoretical guarantees for UVIP under general assumptions and illustrate its performance on a number of benchmark RL problems. The talk is based on our recent work with Denis Belomestny, Ilya Levin, Eric Moulines, Alexey Naumov and Veronika Zorina.


Dmitrii Ostrovskii (University of Southern California)

Near-Optimal Model Discrimination with Non-Disclosure

In the standard setup of parametric M-estimation with convex loss, we consider two population risk minimizers associated with two possible distributions of a random observation, and then ask the following question: Given the value of one of the two minimizers and access to i.i.d. sampling from both distributions, what sample sizes allow to identify the “right” distribution, i.e., the one corresponding to the given value?Making the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide nearly matching upper and lower bounds on the sample complexity, showing it to be min{1/Δ^2,\sqrt{r}/Δ} up to a constant factor, where Δ is a measure of separation between the two distributions and r is the maximum (resp., minimum) of the ranks of the design covariance matrices in the case of the upper (resp., lower) bound. This bound is dimension-independent and rank-independent for large enough separation. We then extend this result in two directions: (i) for the general parametric setup in asymptotic regime; (ii) for generalized linear models in the small-sample regime n≤r and under weak moment assumptions. In both cases, we derive sample complexity bounds of a similar form, even under misspecification. Our testing procedures only access the known'' value through a certain functional of empirical risk. In addition, the number of observations that allows to reach statistical confidence for testing does not allow to "resolve" the two models -- that is, recover both minimizers up to O(Δ) prediction accuracy. These two properties open the prospect of applying our testing framework in practical tasks, where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually inferred by the identifying agent. The talk will be following the recent paper arxiv:2012.02901.


Evgeny Spodarev (Ulm University, Germany)

Spectral classification of fullerenes

Joint work with A. Bille (Skoltech, Moscow/Ulm University), V. Buchstaber (Steklov Mathematical Institute, Moscow) and S. Coste (INRIA, Paris).

Fullerenes $C_n$ are 3-simple convex compact polytopes with $n$ vertices, $n\ge 20$ even, $n\neq 22$, 12 pentagons and $n/2-10$ hexagons as facets. They are mathematical models of famous carbon molecules which find wide applications in modern nanotechnologies. For the synthesis of $C_60$, R. Curl, H. Kroto and R. Smalley got a Nobel prize in chemistry (1996). Since the number of distinct fullerene isomers (which are combinatorially non—equivalent versions of $C_n$) grows as $O(n^9)$ for large $n$, their enumeration and classification is a non-trivial mathematical problem. We propose a solution to this problem using the Newton polynomials of the spectra of their dual graphs of hexagons. It appears that the degree of such polynomials has a logarithmic growth order with respect to $n$. If the time allows, the spectral linear ordering of characters of $C_n$ for different $n$ as well as the distribution of a random eigenvalue of Laplace-type matrices for the graphene, regular triangulation of the plane and infinite nanotubes with chiral vector $(p,q)$ will be touched upon.


Dmitry Vetrov (HSE University); Alexey Naumov (HSE University); Denis Derkach (HSE University); Mikhail Lazarev (HSE University); Nikita Kazeev (HSE University); Sergey Shirobokov (LCL, Twitter); Maxim Panov (Scoltech); Mikhail Burtsev (MIPT)

Third HSE-Yandex autumn school on generative models


Spring 2020






Nazar Buzun (Skoltech)

Gaussian approximation of the maximum of the sum of random vectors.

In this paper we consider a Gaussian approximation for the maximum sum of independent random vectors in high dimensionality. Our approach is based on approximation by probability. In this case, the random variables depend on each other. From approximation by probability, approximation by distribution follows directly. We obtained a theoretical estimate of the rate of convergence to the maximum normal vector, which is a significant improvement over those obtained earlier.


Alexandra Suvorikova (WIAS Berlin)

Shape-based domain adaptation via optimal transportation

Domain adaptation problem aims at learning a well performing model, trained on a source data S (images, vectors, e.t.c), applied then to different (but related) target sample T. Aside from being attractive due to obvious practical utility, the setting is challenging from theoretical point of view. In this work we introduce a novel approach to supervised domain adaptation consisting in a class-dependent fitting based on ideas from optimal transportation (OT) theory which considers S and T as two mixtures of distributions. A parametrized OT distance is used as a fidelity measure between S and T, providing a toolbox for modelling of possibly independent perturbations of mixture components. The method is than used for describing the adaptation of immune system in humans after moving to another climatic zone.


Sergey Bobkov (University of Minnesota, HSE University)

Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem

We plan to discuss the so-called concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered


Sergey Bobkov (University of Minnesota, HSE University)

Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem

We plan to discuss the so-called concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered

24.03.2020 (Zoom)

Maxim Kaledin (HSE University)

Risk-Sensitive Reinforcement Learning a review of PhD thesis by Aviv Tamar

The main issue which keeps the applications of reinforcement learning limited is the fact that robustness, stability and durability of the learned decision policies are hard to prove. This happens due to usage of expected utility criterion which is optimized with respect to policy in all famous algorithms of RL. The criterium arised from economics and game theory where expected utility theory was historically the first well-established concept of rationality. Being very popular and well-studied in both mathematics and applications, the concept itself is not the only possible way to define rationality. In particular, there is more general class of functionals called coherent risk measures introduced in Delbaen,Artzner1998. Coherent risk measures are widely known in finance for they address exactly the robustness problem and try to quantify risk differently; despite that, in RL coherent risk measures are not so known and there are not so many papers on its usage. My talk will be devoted to the review of PhD thesis by Aviv Tamar on exploiting risk measures in RL which represents in itself a very comprehensive study on how to construct learning algorithms on popular risk measures like CVAR and semi-deviation. 

3.04.2020 (Zoom)

Maxim Panov and Nikita Kotelevskiy (Skoltech)

MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference

In this contribution, we propose a new computa- tionally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained us- ing Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically in- creasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expen- sive reverse kernels. Extensive numerical experi- ments show clear computational and performance improvements over state-of-the art methods. 

21.04.2020 at 16 00 (Zoom)

Nikita Zhivotovskiy (Google Research, HSE University)

Fast classification rates in online and statistical learning with abstention.

This talk is devoted to a simple extension of statistical and online classification setups where the learner is allowed to abstain from a prediction by paying a cost (say, 0.49) only marginally smaller than the average cost 0.5 of a random bit prediction. We discuss how this model implies fast rates for the excess risk and the expected regret without any additional assumptions on the data generating mechanism and the loss function. Finally, we show the relations with recent results in aggregation theory and with the analysis of strongly convex losses. Based on recent works with Olivier Bousquet and Gergely Neu arXiv:1910.12756 and arxiv:2001.10623.

23.04.2020 at 17 00 (Zoom)

Anastasia Ivanova (MIPT, HSE University)


Fall 2019
Date Speaker Title Abstract


Michel Ledoux (Université de Toulouse)

On optimal matching of random samples

Optimal matching problems are random variational problems widely investigated in the mathematics and physics literature, and in probability theory and statistics in the study of rates of convergence of empirical measures. Two-dimensional matching of uniform samples gave in particular rise to deep results investigated from various view points (combinatorial, generic chaining, Fourier analysis). After a review of known results, we investigate more precisely the case of Gaussian samples, first in dimension one, on the basis of explicit representations of Kantorovich metrics and a sharp analysis of more general log-concave distributions in terms of their isoperimetric profile (joint work with S. Bobkov), and then in dimension two and higher following the PDE and transportation approach recently put forward by L. Ambrosio, F. Stra and D. Trevisan.



Recent advances in mass transportation



Alexander Goldenshluger (University of Haifa, HSE University)

Density deconvolution under general assumptions

In this paper we study the problem of density deconvolution under general assumptions on the measurement error distribution. Typically deconvolution estimators are constructed using Fourier transform techniques, and it is assumed that the characteristic function of the measurement errors does not have zeros on the real line. This assumption is rather strong and is not fulfilled in many cases of interest. In this paper we develop a methodology for constructing optimal density deconvolution estimators in the general setting that covers vanishing and non--vanishing characteristic functions of the measurement errors. We derive upper bounds on the risk of the proposed estimators and provide sufficient conditions under which zeros of the corresponding characteristic function have no effect on estimation accuracy. Moreover, we show that the derived conditions are also necessary in some specific problem instances.


Sergey Samsonov (HSE University)

Variance reduction for dependent sequences with applications to Stochastic Gradient MCMC

This talk will be devoted to a novel variance reduction approach for additive functionals of general dependent sequences. Our approach combines the use of control functions with the minimisation of an empirical asymptotic variance estimate. We analyse finite sample properties of the proposed method and derive convergence rates of the excess asymptotic variance to zero. We apply our methodology to Stochastic Gradient MCMC (SGMCMC) and discuss possible extensions for non-Markovian dependent sequences (SVRG-LD, SAGA-LD). Numerical evaluation of the algorithm will also be presented. The talk is based on a joint work with D. Belomestny, L. Iosipoy, E. Moulines and A. Naumov.


Quentin Paris (HSE University)

Some open questions at the "intersection" between optimal transport and statistical learning.

This talk will breafly describe the field of optimal transport and show that statistical learning can be represented as an optimal transport problem. We will investigate a few questions related to this representation and describe some potential research perspectives.

26 - 29.11.2019

Autumn School

Autumn School on Generative models



Nikita Puchkin (HSE University)

Sample complexity of learning a manifold with an unknown dimension

I will discuss the problem of learning a manifold from noisy observations in the case when the dimension of the true manifold is unknown. Many manifold learning procedures require the intrinsic dimension of the data as an input parameter. However, the existing literature provides theoretical guarantees on the dimension estimation only in the case, when the data lies exactly on the manifold. We study an algorithm, which adaptively chooses the manifold’s dimension and has an optimal sample complexity in terms of the Hausdorff loss. Besides, the procedure produces a manifold, such that its curvature does not exceed the curvature of the true manifold too much.

Spring 2019
Date Speaker Title Abstract

29.01.2019 at CS HSE

Ekaterina Krymova (Duisburg Essen University)

On estimation of the noise variance in high dimensional linear regression model

We consider a problem of unknown noise variance estimation in high-dimensional linear regression model. To estimate a nuisance vector of regression coefficients we use a family of spectral regularisers of the maximum likelihood estimator. Noise variance estimation is based on an adaptive normalisation of the prediction error. We derive an upper bound for concentration of the proposed method around an ideal estimator (in case of zero nuisance).


Vladimir Spokoiny (WIAS, HSE University)

Bayesian model selection

We consider different approaches to select a prior from an ordered family of Gaussian process priors (GPP) for a linear regression model . The empirical and full (hierarchic) Bayesian methods are studied with their benefits and drawbacks. A new method for the choice of a hyper prior is suggested. The method is based on the so called “propagation” condition and ensures a one-sided contraction and adaptivity of the procedure.


Alexander Goldenshluger (University of Haifa, HSE University)

Optimal stopping of a random sequence with unknown distribution

The subject of this talk is the problem of optimal stopping of a sequence of i.i.d. random variables. When the distribution governing the random sequence is known, the optimal solution is given by backward induction. In this talk we will focus on the case when the distribution of observations is unknown. We propose a stopping rule that is based on relative ranks and study its performance as measured by the maximal relative regret over suitable nonparametric classes of distributions. It is shown that the proposed rule is first order asymptotically optimal and nearly rate–optimal in terms of the rate at which the relative regret converges to zero. We also develop a general method for numerical solution of sequential stopping problems with no distributional information, and use it in order to implement the proposed stopping rule.

(Joint work with A. Zeevi)


Alexey Ustimenko (HSE University)

Direct Optimization of Ranking Quality Metrics via Stochastic Approximation

In this paper, we present a novel approach for consistent direct optimization of any ranking quality metric that significantly outperforms the state-of-the-art methods on popular learning-to-rank datasets. The framework is based on a smoothing procedure and a method of local stochastic approximation of the ranking function inspired by gradient-free methods for blackbox optimization. We also present novel unbiased and effective estimates for gradients of the smoothed ranking quality metrics that theoretically outperform likelihood ratio method while their time complexity remains the same due to combinatorial structure of the problem.


Denis Belomestny (Duisburg Essen University, HSE University)

MacKeen-Vlasov stochastic differential equations: simulation and estimation

In this talk we propose a novel projection-based particle method for solving McKean--Vlasov stochastic differential equations. This approach is based on a projection-type estimation of the marginal density of the solution in each time step. The projection-based particle method leads in many situations to a significant reduction of numerical complexity compared to the widely used kernel density estimation algorithms. We derive strong convergence rates and rates of density estimation. Furthermore we give explicit solutions for a class of McKean--Vlasov equations with affine drift. The performance of the proposed algorithm is illustrated by several numerical examples.


Ivan Panin (Skoltech, IITP RAS)

Least squares in noiseless setting and Sobol indices convergence for random design

We briefly review recent progress in noiseless nonparametric regression and consider its application to global sensitivity analysis and particularly for estimation of Sobol indicies. Sobol (sensitivity) indices allow to quantify respective effects of input random variables (or combinations thereof) onto variance of a physical or mathematical model response. We focus on the problem of Sobol indices estimation via metamodel approach in the case of random design of experiments; on both nonasymptotic and asymptotic behavior of Sobol indices estimates and the model selection for increasing computational budget. We also consider the question of an optimal rate of Sobol indices convergence.


Dmitry Yarotsky (Skoltech, IITP RAS)

Gradient descent in neural networks: mean field theory and optimal transportation

I will describe recent results in which gradient descent in shallow neural networks is approximated by an integro-differential equation appearing in the limit of infinite network width. In particular, I will discuss the connection to the optimal transport theory and will mention several modes in which the dynamics admits an approximate analytic solution.


Egor Kosov (MSU, HSE University)

Lower bounds for measures of deviation of polynomials from their expectations

In the talk we will discuss the recent results on lower estimates of small ball probabilities for random variables that are polynomials in Gaussian and general logarithmically concave random variables.


Nurlan Shagadatov (HSE University)

Martingale representations for Markov Chains with application to MCMC.

In this talk I will describe recent results of joint work with D. Belomestny and E. Moulines. I consider efficient variance reduction approach for Markov Сhain Monte Carlo algorithms relying on a new discrete time martingale representations. In particular, I will consider theoretical guarantees and generic method for Unadjusted Langevin Algorithm.


Evgenii Chzhen ( Laboratoire d'Analyse et de Mathématiques Appliquées)

Semi-supervised confidence set classification

An image annotation can be seen as a multi-class classification problem with a large number of classes. In this context, confusion between classes can occur, and single label classification may be misleading. We will talk about the multi-class classification with confidence sets of a controlled expected cardinal in non-parametric settings. We will propose a semi-supervised estimator which is either minimax optimal or optimal up to a log factor depending on the way we measure the risk. We will describe the regimes when there is no supervised estimator that can improve the proposed semi-supervised estimator and when the semi-supervised estimation does not help from the minimax point of view. (joint work with C. Denis and M. Hebiri)

16.04.2019, CS HSE, 205 at 17:00

Eric Moulines (Ecole Polytechnique, HSE University)

Mini course Introduction to reinforcement learning 1

The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver.

23.04.2019, CS HSE, 503 at 17:00

Eric Moulines (Ecole Polytechnique)

Mini course Introduction to reinforcement learning 1

The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver.


Nikolai Krylov (University of Minnesota)

Mini course (3 lectures)


  • Basics of harmonic polynomials and spherical functions

  • Variations on a theme of Jacobians: Brouwer’s fixed point theorem, Alexandrov’s maximum principle, solvability of Laplace’s equation, etc.

Schedule: 7.06, 16:00 - 18:00 14.06, 16:00 - 18:00 28.06, 16:00 - 18:00

Venue: Faculty of Mathematics HSE, Usacheva str, 5, room 108

Fall 2018






Qeuntin Paris (HSE University)

Learning functional minimizers on metric spaces.

In this talk, we will discuss recent developments motivated by statistical problems that arise in optimal transport. In particular we will discuss new results concerning the estimation of so called barycenters (or Frechet means) in the Wasserstein space. Before presenting our results, the talk will review some basic material from geometry and optimal transport. The end of the talk will discuss some open problems.


Sergey Samsonov (HSE University, Skoltech)

Tansportation and concentration of measure inequalities for Markov Chains

In this talk I am going to observe some concentration of measure inequalities, starting from rather classical ones by Bakri, Ledoux. Then I will cover some interesting concentration results for Markov Chains, related with transportation inequalities following papers by Djellot, Guillin, Wu 2004 and Joulin, Olivier 2010, and concentration results for quadratic forms of vectors with dependent entries following Adamczak 2015 with particular applications to Unadjusted Langevin Dynamics.


Sergey Samsonov (HSE University, Skoltech)

Concentration inequalities and their applications

In this talk I am going to cover some functional inequalities for dependent sequences of random variables, especially focusing on the results for Markov chains based on the work by Djellout, Guilin and Wu. As one of the applications we will consider Unadjusted Langevin Algorithm, following paper by Durmus, Moulines, and state some concentration results for it.


Maxim Kaledin (HSE University, Skoltech)

Dynamic Programming in Stochastic Optimal Control Problems.

Stochastic control problems are well-known in financial mathematics (option pricing, portfolio management) and in many related areas. These problems even in case of discrete state space and known transition density are subject to curse of dimensionality when using deterministic algorithms. Kean&Wolpin(1994) and Rust (1997) were the first who proved that this can be avoided in case of discrete Markov Decision Processes with stochastic algorithms. The modern challenge is to figure out what happens with complexity when the transition density can only be approximated (RL,ML, Finance applications). During my talk I will cover known and new complexity estimates for stochastic control problems as well as several algorithms to solve MDP.


Vladimir Ulyanov (HSE University, MSU)

Non-linear expectations: notion, facts, applications.

We give a short review on recent developments on problems of probability model under uncertainty by using the notion of nonlinear expectations and, in particular, sublinear expectations. We discuss as well a new law of large numbers (LLN) and central limit theorem (CLT) and provide non-asymptotic results. Classical LLN and CLT have been widely used in probability theory, statistics, data analysis as well as in many practical situations such as financial pricing and risk management. They provide a strong and convincing way to explain why in practice normal distributions are so widely utilized. But often a serious problem is that, in general, the “i.i.d.” condition is difficult to be satisfied. In practice, for the most real-time processes and data for which the classical trials and samplings become impossible, the uncertainty of probabilities and distributions can not be neglected. In fact the abuse of normal distributions in finance and many other industrial or commercial domains has been criticized. The new CLT does not need this strong “i.i.d.” assumption. Instead of fixing a probability measure P, it is introduced an uncertain subset of probability measures {P_a : a in A } and considered the corresponding sublinear expectation as sup{a in A} E_{a}X.



Nikita Zhivotovskiy (HSE University)

Covariance estimation: missing observations and heavy tails.

In this talk, I will discuss some recent results on covariance estimation together with some work in progress. I will start with the subgaussian case when some of the observations are missing: we observe N i.i.d. vectors and want to estimate the covariance matrix. The problem is that some of the components of each vector are missing, that is these components are set to be equal to zero with some probability. We show how to improve the state of the art results in this scenario. Next, we discuss how to apply dimension-free Bernstein type inequalities for vectors with heavy tails under mild moment assumptions.



Eric Moulines (Ecole Polytechnique, HSE University)

Low-rank Interaction with Sparse Additive Effects Model for Large Data Frames (NIPS paper)

Many applications of machine learning involve the analysis of large data frames — matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples — with missing values. Low-rank models, as studied by Udell et al. (2016), are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row and column, or covariate effects. In this paper, we introduce a low-rank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and low-rank design, to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sub-linearly to an optimal solution and is computationally efficient for large scale data sets. We show on simulated and survey data that the method has a clear advantage over current practices.


Błażej Miasojedow (University of Warsaw)

Statistical inference for Continous Time Bayesian Networks.

Continuous Time Bayesian Networks (CTBN) are a class of multivariate Markov Jump Processes (MJP), where dependence between coordinates is described by a directed graph with possible cycles. CTBNs are a flexible tool for modelling phenomena from different areas such as chemistry, biology, social sciences etc. In the talk, I will discuss the problems of parameter learning and structure learning for CTBNs with a special attention to computational methods.


Alexey Naumov (HSE University)

Gaussian approximations for maxima of large number of quadratic forms of high-dimensional random vectors

Let X_1, …, X_n be i.i.d. random vectors taking values in R^d, d \geq 1, and Q_1, …, Q_p, p \geq 1, be symmetric positive definite matrices. We consider the distribution function of vector (Q_j S_n, S_n), j = 1, …, p, where S_n = n^{-1/2}(X_1 + … + X_n), and prove the rate of Gaussian approximation with explicit dependence on n, p and d. We also compare this result with results of Bentkus (2003) and Chernozhukov, Chetverikov, Kato (2016). Applications to change point detection and model selection will be discussed as well. The talk is based on the joint project with F. Goetze, V. Spokoiny and A. Tikhomirov.



Andzhey Koziuk (WIAS, Berlin)

Exact smoothning of a probability measure

In the talk an exact smoothing of indicator function is introduced and developed to allow differential calculus access the internal machinery behind a problem of measures comparison. Specifically, the problem of accuracy of non-classical analogue of Berry-Esseen inequality for a radnom elemets in Hilbert space is addressed with the help of the tool and aided with the classical Lindenberg chained sum construction.


Nikita Zhivotoskiy (Technion, HSE University)

Robust estimation of the mean of a random vector

n this talk, we discuss the estimation of the mean random of a vector from an i.i.d. sample of its observations. The latest results show that using a special procedure, the mean of a random vector, for which only the existence of a covariance matrix is ​​known, can be estimated as precisely as if the vector had a multidimensional normal distribution with the same covariance structure. The estimation of the covariance matrix with minimal constraints on the underlying distribution will be briefly discussed. The talk is based on the works of arxiv: 1702.00482 and arxiv: 1809.10462


Alexey Ustimenko (HSE University, Skoltech)

Convergence of Laplacian Spectra from Point Clouds

The Laplacian-Beltrami operator (LBO) on manifolds and its spectral structure has been widely used in many applications such as manifold learning, spectral clustering, dimensionality reduction and so on. Recently, Point Integral Method (PIM) was proposed to estimate the LBO by the discrete Laplace operator constructed from a set of sample points. We would discuss a paper “Convergence of Laplacian Spectra from Point Clouds” by Zuoqiang Shi, Jian Sun and their’s proof of the convergence of the eigenvalues and eigenvectors obtained by PIM to the eigenvalues and eigenfunctions of the LBO with the Neumann boundary.


Nikita Puchkin (HSE University)

Manifold Learning

I will discuss a problem of a manifold reconstruction from a noisy point cloud. There are plenty methods of manifold learning (ISOMAP, LLE, etc.) but only a few papers provide a theoretical justification of the proposed algorithms (Genovese et. al. (2012), Maggioni et. al. (2016), Aamari and Levrard (2017)). An approach based on a local polynomial estimation (Aamari, Levrard (2017)) reconstructs distinct parts of the manifold and the reconstructed manifold is often disconnected. Another approach (Genovese et. al. (2012)) tries to estimate a density with a support on the manifold. This approach requires a specific distribution of the noise, which is unlikely to hold in practice. A recently proposed method (Osher et. al. (2017)), based on an approximation of the Laplace-Beltrami operator, is aimed at a global reconstruction of the manifold and does not have the mentioned drawbacks. Unfortunately, the statistical properties of this method are not studied so far. In my talk, I will discuss some details concerning a theoretical justification of the procedure.


Valentina Shumovskaya (HSE University, Skoltech)

Towards Hypothesis Testing for Random Graphs with Community Structure

The analysis of random graphs and network analysis recently become an active area of research. We will dicuss the problem of testing between two populations of inhomogeneous random graphs defined on the same set of vertices, where the null hypothesis means that the underlying edge probability matrices coincide. We propose a new approach of random graphs testing based on community detection procedure: we introduce a structural assumption, which is that our graphs have community structures with k communities.


Alexander Goldenshluger (University of Haifa, HSE University)

Nonparamertic estimation in non-regular deconvolution problems.

I will discuss problems of nonparametric estimation in non-regular deconvolution models. In such settings standard estimation methods based on the Fourier transform are not applicable. I will review the existing literature on the topic, discuss general ideas for constructing optimal estimators, and present some preliminary results. Based on joint ongoing work with D. Belomestny.


Alexander Timofeev (MIPT)

Density-sensitive semisupervised inference

In the talk, we will review the article written by Martin Azizyan, Aarti Singh and Larry Wasserman. We will consider semisupervised inference for the regression problem, based on a density-sensitive metric and show that the proposed semisupervised method outperforms any supervised one for certain sets. At the end of the talk, We will demonstrate how to calculate the density-sensitive metric.


Katya Krymova (University of Duisburg-Essen)

On a sparse modification of projection approximation subspace tracking method

In this talk we revisit the well-known constrained projection approximation subspace tracking algorithm (CPAST) and derive non-asymptotic bounds on its error. Furthermore we introduce a novel sparse modification of CPAST which is able to exploit sparsity of the underlying covariance structure. We present a non-asymptotic analysis of the proposed algorithm and study its empirical performance on simulated and real data.


Steve Oudot (Ecole Polytechnique)

Statistics and Learning with topological descriptors

I this survey talk I will review some of the recent advances in Topological Data Analysis to define stable descriptors for data, to use these descriptors for learning tasks, and to do statistics on them. Among other topics, I will address finding suitable metrics, defining and computing means and barycenters, deriving confidence intervals, laws of large numbers, central limit theorems, and kernel mean embeddings.


Have you spotted a typo?
Highlight it, click Ctrl+Enter and send us a message. Thank you for your help!
To be used only for spelling or punctuation mistakes.