Scientific seminar
Structural Learning Seminar is a seminar for researchers, Master’s and Ph.D. students of HSE, Skoltech, IITP. This seminar is devoted to the problems of Mathematics and Computer Science and covers areas such as Mathematical Statistics, Machine Learning, Random Matrix Theory, Optimization, Information Theory, etc.
Seminar organizers: Sergey Samsonov, Nikita Puchkin .
Follow the updates of the seminars in our telegram channel: HDI Lab junior team seminar
Those who wish to make a presentation at the Seminar are asked to email: npuchkin@hse.ru or svsamsonov@hse.ru with an indication of the topic of the presentation, an abstract in English (not more than 1/2 page) and a suitable date of presentation. The speaker should inform in advance about the chosen form of the report  on the chalk board or computer presentation (the preferred form of the report is on the chalk board). The length of the presentation is 1 hour (if necessary, the length of the presentation may be extended to 1.52 hours)
For the winter 2024 semester, the seminar takes place on Thursdays at 2:40 p.m.
2024
Speaker  Title  Abstract  
18.01.2024  Marina Sheshukova (HSE University) and Arthur Goldman (HSE University)  Part l: Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements  At the seminar, we will discuss the general formulation of the problem of variational inequalities. We will analyze two algorithms Stochastic Gradient Descent Ascent and Double Stepsize Stochastic Extra Gradient as well as their theoretical properties.
In Part ll, we will look at how it is possible to prove the convergence of measures according to the Waserstein metric through the construction of caplings. 
25.01.2024  Alexander Beznosichov (MIPT)  On Distributed Optimization Problems under Similarity: Optimal Algorithms and Fast Extensions  In this presentation, we consider distributed methods for solving optimization problems, namely variational inequalities, which are interesting both in themselves and because they include both classical minimization problems and minimax problems. In the distributed formulation, the target operator of a variational inequality is divided into parts, and each of these parts can be accessed only by a local agent/worker. We deal with the case where the local operators are "similar" to each other in some sense. Due to the "similarity" it is possible to achieve a significant acceleration of the theoretical guarantees of convergence of methods in terms of estimates on communication complexity. Besides the issue of convergence of algorithms and obtaining upper bounds, we touch upon lower complexity bounds and verify the optimality of the proposed methods. In the remaining time, we try to discuss the question of how we can "break through" the lower estimates and construct an even faster method, in particular, we additionally introduce the possibility of compressing the transmitted information, modify the proposed algorithms and obtain upper and lower bounds in a new formulation. 
15.02.2024  Ilya Levin (HSE University)  Central Limit Theorems for Stochastic Approximation   
22.02.2024  Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)  Gaussian Variational Inference  Variational inference (VI) considers the problem of approximating a given measure (posterior) by a member of a given parametric family by minimizing the related KLdivergence. Gaussian VI focuses on the Gaussian approximation. We revisit the recent results by Katsevich and Rigollet (2023) and demonstrate how this problem can be solved using some advanced tools from optimization (the basic lemma). 
29.02.2024  Yuri Svirshchevsky (HSE University)  Monte Carlo Variational AutoEncoders  Variational autoencoders are generative models that are trained by maximizing some lower estimate of the validity of the observed data (evidence lower bound or ELBO). The quality of the trained model depends on the accuracy of the assessment. The seminar will explore ways to obtain two ELBO based on Annealed Importance Sampling and Sequential Importance Sampling, their differentiation and application in VAE. 
07.03.2024  Nikita Puchkin (HSE University)  Dimensionfree Structured Covariance Estimation  Given a sample of i.i.d. highdimensional centered random vectors, we consider a problem of estimation of their covariance matrix $\Sigma$ with an additional assumption that $\Sigma$ can be represented as a sum of a few Kronecker products of smaller matrices. Under mild conditions, we derive the first nonasymptotic dimensionfree highprobability bound on the Frobenius distance between $\Sigma$ and a widely used penalized permuted least squares estimate. Because of the hidden structure, the established rate of convergence is faster than in the standard covariance estimation problem. 
14.03.2024  Alexander Osinsky (Skoltech, IVM RAS)  Existence and construction of column and cross approximations of matrices close to optimal: probabilistic estimates   
28.03.2024  ZhuoSong Zhang (Southern University of Science and Technology)  BerryEsseen bounds for generalized Ustatistics  We establish optimal BerryEsseen bounds for the generalized U statistics. The proof is based on a new BerryEsseen theorem for exchangeable pair approach by Stein’s method under a general linearity condition setting. As applications, an optimal convergence rate of the normal approximation for subgraph counts in ErdösRényi graphs and graphonrandom graph is obtained. 
04.04.2024  Vadim Popov (Huawei)  Advantages of additional training of diffusion probabilistic models using the Huber loss function   
16.05.2024  Andrey Kupavsky (MIPT)  Concentration inequalities for finding multicolored combinations   
23.05.2024  Konstantin Yakovlev (MIPT)  Optimal score estimation via empirical Bayes smoothing  The problem of estimating the score function of an unknown probability distribution from n independent and identically distributed observations in d dimensions is studied. Assuming that the true density is sub gaussian and has a Lipschitzcontinuous score function, the optimal rate for the estimation problem under the loss function commonly used in the score matching literature was established. This reveals the curse of dimensionality where sample complexity for accurate score estimation grows exponentially with the dimension. Leveraging key insights in empirical Bayes theory as well as a new convergence rate of smoothed empirical distribution in Hellinger distance, it was shown that a regularized score estimator based on a Gaussian kernel attains this rate, shown optimal by a matching minimax lower bound. The implications of these findings for the sample complexity of scorebased generative models are discussed. 
06.06.2024  Alexander Korotin (Skoltech), Nikita Gushchin (Skoltech), Sergey Kholkin (Skoltech)  Building light Schrödinger bridges  Schrödinger Bridges (SB) have recently gained attention of the ML community as a promising extension of classic diffusion models, which are also interconnected to the Entropic Optimal Transport (EOT). Despite the recent advances in the field of computational Schrödinger Bridges (SB), most existing SB solvers are still heavyweighted and require complex optimization of several neural networks. We address this issue and propose two novel light solvers for this problem: LightSB and LightSBM. Both utilize the optimal structure of Schrödinger Bridges but in different ways. The LightSB solver allows direct minimized KLdivergence with the groundtruth solution, knowing only start and end marginals. The LightSBM solver is based on bridge matching and introduces the new concept of "optimal projection," allowing the Schrödinger Bridge to be solved in one bridgematching iteration. 
13.06.2024  Maria Rozanova (HSE University)  Lowrank matrix methods for incremental learning in recommendation systems   
02.08.2024  Evgeny Golikov (EPFL)  A generalisation bound for nearly linear networks  We derive a novel generalisation bound for neural networks that becomes nonvacuous when the activation functions get close to linearity. To the best of our knowledge, our bound is the first nonvacuous generalisation bound which is computable prior to training. Our analysis is based on a novel concept of "proxy models" who exploit the weights of trained models with activation functions removed. 
15.08.2024  Nikolai Malkin (University of Edinburgh)  Amortising intractable inference with diffusion models and offpolicy RL  I will present recent and ongoing work relating diffusion models, variational inference, and reinforcement learning. Efficient inference of highdimensional variables under a diffusion model prior enables solutions to problems such as conditional generation, semantic segmentation, and combinatorial optimisation (https://arxiv.org/abs/2206.09012). A new family of amortised methods (https://arxiv.org/abs/2402.05098, https://arxiv.org/abs/2405.20971) places the problem of stochastic continuous control  including sampling of posterior distributions under neural SDE priors  in the theoretical framework of offpolicy entropyregularised reinforcement learning, via the formalism of continuous generative flow networks (https://arxiv.org/abs/2301.12594). This connection allows us to train diffusion models to sample a target density or energy function, i.e., to perform blackbox variational inference. It further lets us finetune pretrained diffusion models in a datafree manner for asymptotically unbiased posterior sampling, with applications to vision (classconditional generation, texttoimage generation), language (constrained generation in diffusion language models), and control/planning (KLconstrained policy extraction with a diffusion behaviour policy). These approaches are applicable in Bayesian machine learning and various scientific domains. 
05.09.2024  Paul Mangold (École Polytechnique)  Taming Heterogeneity in Federated Linear Stochastic Approximation  In federated learning, multiple agents collaboratively train a machine learning model without exchanging local data. To achieve this, each agent locally updates a global model, and the updated models are periodically aggregated. In this talk, I will focus on federated linear stochastic approximation (FedLSA), with a strong focus on agents heterogeneity. I will derive upper bounds on the sample and communication complexity of FedLSA, and present a new method to reduce communication cost using control variates. Particular attention will be put on the "linear speedup" phenomenon, showing that the sample complexity scales with the inverse of the number of agents in both methods. 
12.09.2024  Evgenу Stepanov (POMI RAS, Universita di Pisa, HSE University)  On some spectral methods of searching for hidden geometric structures in data   
Fall 2023
Date  Speaker  Title  Abstract 
14.09.2023  Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)  Deviation bounds for the norm of a random vector under exponential moment conditions with applications  HansonWright inequality provides a powerful tool for bounding the norm \ \xi \ of a centered stochastic vector \( \xi \) with subgaussian behavior. This paper extends the bounds to the case when \( \xi \) only has bounded exponential moments of the form \( \log E \exp \langle V^{1} \xi,u \rangle \leq \ u \^{2}/2 \), where \( V^{2} \geq \mathrm{Var}(\xi) \) and \( \ u \ \leq g \) for some fixed \( g \). For a linear mapping \( Q \), we present an upper quantile function \( z_{c}(B,x) \) ensuring \( P(\ Q \xi \ > z_{c}(B,x)) \leq 3 e^{x} \) with \( B = Q \, V^{2} Q^{T} \). The obtained results exhibit a phase transition effect: with a value \( x_{c} \) depending on \( g \) and \( B \), for \( x \leq x_{c} \), the function \( z_{c}(B,x) \) replicates the case of a Gaussian vector \( \xi \), that is, \( z_{c}^{2}(B,x) = \tr(B) + 2 \sqrt{x \tr(B^{2})} + 2 x \ B \ \). For \( x > x_{c} \), the function \( z_{c}(B,x) \) grows linearly in \( x \). The results are specified to the case of Bernoulli vector sums and to covariance estimation in Frobenius norm. 
21.09.2023  Arthur Goldman (HSE University)  RL with friends: how many agents does it take to solve an environment?  As part of the report, students will get acquainted with the main definitions and results of the field related to the study multiagent reinforcement learning и meanfield games. 
05.10.2023  Nikita Puchkin (HSE University)  Sharper dimensionfree bounds on the Frobenius distance between sample covariance and its expectation  We consider the problem of covariance estimation from a sample of highdimensional random vectors. We derive dimensionfree bounds on the squared Frobenius distance between sample covariance and its expectation, which significantly improve over the existing results. 
12.10.2023  Fedor Noskov (HSE University, MIPT, Skoltech)  Analysis of Boolean functions  At the seminar, we will discuss general techniques of Boolean analysis, as well as the consequences that these techniques lead to in learning theory, approximation theory, and combinatorics. The main subject of our discussion will be the Fourier analysis of Boolean functions. Also, we will touch a little on the phenomenon of hypercontractivity and its various applications. 
19.10.2023  Denis Rakitin (HSE University)  Generative models based on ODE/DE   
09.11.2023  Evgeny Frolov (HSE University, AIRI)  Learning from sequences using tensor decompositions  The report is devoted to the application of tensor decompositions and effective computational methods of linear algebra to create an analog of the selfattention model on sequences of user actions. This approach can be applied both in the standard task of predicting the next user actions, and in a wider range of tasks related to the analysis of incomplete multidimensional time series. 
14.12.2023  Alexander Tarakanov (HSE University, Odnoklassniki).  Application of machine learning methods for computing eigenfunctions of differential operators   
Spring 2023
Date  Speaker  Title  Abstract 
17.01.2023  Alexandra Senderovich (HSE University)  Norm and trace estimation with random rankone vectors (paper overview)  To estimate the norm of an arbitrary matrix or the trace of a symmetric nonnegatively defined matrix, several multiplications of the matrix by random vectors are often enough. Using rank 1 vectors, i.e. Kronecker products of smaller random vectors, it is possible to speed up the calculation of such an estimate by speeding up sampling and matrix multiplication. The article shows that with random vectors of rank 1 sampled from a normal distribution or a Rademacher distribution, the probabilities of getting good estimates for the norm deteriorate slightly compared to using ordinary vectors from the same distributions. The authors also conducted numerical experiments confirming the derived probabilities and showing the advantages of the approach with rank 1 vectors. 
31.01.2023  Dmitry Lishudi (HSE University)  "How benign is benign overfitting?"  On many datasets, neural networks when trained with SGD achieve almost zero error on training data and good generalizability on test data, even when the training partitioning is noisy. This effect is called benign overfitting. Nevertheless, such models are vulnerable to adversarial attacks. The first reason for this vulnerability is poor sampling. Thus, theoretical and empirical analysis shows that noise in data partitioning is one of the causes of adversarial vulnerability and robust models fail to achieve zero training error. However, removing incorrect labels does not achieve resilience to adversarial attacks. The paper suggests that it is also a matter of suboptimal representation learning. Using a simple theoretical statement, it is shown that the choice of representations can significantly affect adversarial stability. 
7.02.2023  12.02.2023 
 Сonference and Autumn School Fall into ML  https://cs.hse.ru/iai/hdilab/mml2023/ 
14.02.2023  Nikita Puchkin (HSE University)  Exploring Local Norms in Expconcave Statistical Learning  We consider the standard problem of stochastic convex optimization with expconcave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O ( d/n + 1/n log ( 1/ \delta ) )$ excess risk bound valid for a wide class of bounded expconcave losses, where d is the dimension of the convex reference set, n is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms. 
21.02.2023  Denis Belomestny (HSE University, Duisburg Essen University)  A Reproducing Kernel Hilbert Space approach to singular local stochastic volatility McKeanVlasov models  Motivated by the challenges related to the calibration of financial models, we consider the problem of numerically solving a singular McKeanVlasov equation 
28.02.2023  Sergey Samsonov (HSE University)  Finitetime highprobability bounds for linear stochastic approximation  We discuss a finitetime analysis of linear stochastic approximation (LSA) algorithms, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a ddimensional linear system, where both the system matrix and the righthand side can only be estimated through (asymptotically) unbiased observations. We consider here the case where the observations are driven either by sequence of i.i.d. noise variables or by uniformly geometrically ergodic Markov chain. We discuss the pmoments inequality and high probability bounds for the LSA iterates and its PolyakRuppert averaged version focusing on tight dependence upon the mixing time of the chain. We then prove finitetime instancedependent bounds on the PolyakRuppert averaged sequence of iterates. These results are sharp in the sense that the leading term we obtain matches the local asymptotic minimax limit. We also cover the applications to the temporaldifference methods in RL 
07.03.2023  Maxim Vasiliev (HSE University)  Approximation and sampling of multivariate probability distributions in the tensor train decomposition (papers overview)   
14.03.2023  Fedor Noskov (HSE University, MIPT, Skoltech)  Optimal Estimation in MixedMembership Stochastic Block Model  Community detection is one of the central problems in modern network science. It has numerous applications in the analysis of social and biological networks, designing network protocols and many other areas. Recently, much attention has been paid to the detection of overlapping communities, where each node in a network may belong to multiple groups. We consider the parameter estimation problem in Mixed Membership Stochastic Block Model (MMSB), which is a quite general instance of a random graph model allowing for overlapping community structure. We discuss different approaches to parameter estimation in this model, their theoretical properties and practical performance. Finally, we propose an approach to improve over existing estimates that allows obtaining minimax optimal rates of estimation for all the model parameters. 
23.03.2023  Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)  Inference in ErrorinOperator model  We consider the problem of recovering the source signal \( x \) from the noise observation \( Y \) given by the equation \( Y = A x + \epsilon \) in the situation when the operator \( A \) is not precisely known. Instead, a pilot estimate \( \hat{A} \} is available. New approach is suggested which allows to obtain finite sample nearly sharp accuracy guarantees and incorporate the smoothness of the signal \( x \) and the operator \( A \) in a rather general form. 
28.03.2023  Evgeny Lagutin (HSE University)  Stable Rank Normalization for Improved Generalization in Neural Networks and GANs (paper overview)  Exciting new work on generalization bounds for neural networks (NN) given by Bartlett et al. (2017); Neyshabur et al. (2018) closely depend on two parameter dependent quantities: the Lipschitz constant upper bound and the stable rank (a softer version of rank). Even though these bounds typically have minimal practical utility, they facilitate questions on whether controlling such quantities together could improve the generalization behaviour of NNs in practice. Spectral Normalization (Miyato et al., 2018) is a technique making it easy to restrict the Lipschitz constant upper bound and thus is a popular way to stabilize training of generative adversarial networks and is to make a general discriminative neural network robust. On the other hand Spectral Rank Normalization (Sanyal et al., 2020) tackles the problem of restricting a stable rank of neural network's layers. Motivated by the generalization bound view this technique in combination with Spectral Normalization further improve the robustness properties on classification tasks and the quality on image generation tasks. 
04.04.2023  Evgeny Nikishin (Mila, University of Montreal)  Deep Reinforcement Learning with Plasticity Injection  A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. We introduce plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are twofold: first, as a diagnostic tool — if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent exhausted its plasticity and has to relearn from scratch or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient. 
20.04.2023  Daniil Tyapkin (HSE University)  Fast Rates for Maximum Entropy Exploration  We consider the reinforcement learning (RL) setting, in which the agent has to act in unknown environment driven by a Markov Decision Process (MDP) with sparse or even reward free signals. In this situation, exploration becomes the main challenge. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization that was previously considered by Hazan et al. 2019 in the discounted setting. For this type of exploration, we propose an algorithm based on a game theoretic representation that has O(HSA/eps^2) sample complexity thus improving the eps and Sdependence of Hazan, where S is a number of states, A is a number of actions, H is an episode length, and eps is a desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to the entropyregularized MDPs, and we propose a simple modification of the UCBVI algorithm that has a sample complexity of order 1/eps ignoring dependence in S,A,H. Interestingly enough, it is the first theoretical result in RL literature establishing that the exploration problem for the regularized MDPs can be statistically strictly easier (in terms of sample complexity) than for the ordinary MDPs. 
27.04.2023  Arthur Goldman (HSE University)  Theoretical guarantees for neural control variates in MCMC  In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory. 
01.05.2023  Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)  Sharp deviation bounds for quadratic forms and their use in machine learning  Under a mild assumption of linearity on the loglikelihood function, the tools of empirical processes are not necessary, the analysis of a general MLE can be reduced to a deviation bound for a quadratic form. This paper explains how the recent advances in Laplace approximation from [Spokoiny, 2022, 2023] can be used for obtaining sharp Gaussianlike finite sample bounds and for stating the prominent concentration phenomenon for the squared norm of a subgaussian vector. Some extensions and open problems will be discussed as well. 
04.05.2023  Yana Hassan (HSE University)  Fast Rates for Maximum Entropy Exploration  Nowadays it is becoming extremely popular to tune the large pertained language model with experts comments using reinforcement learning with human feedback methodology. In my talk I would focus on a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). The analysis from the paper shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the BradleyTerryLuce (BTL) model and the PlackettLuce (PL) model. The results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. In my talk I am going to dicsuss the unified analysis the problem of RLHF and maxentropy Inverse Reinforcement Learning (IRL), and illustrate the first sample complexity bound for maxentropy IRL. 
11.05.2023  Alexander Podkopaev (Carnegie Mellon University)  Nonparametric twosample and independence testing by betting  Twosample and independence testing are two fundamental problems of statistical inference. In the former, given observations from two distributions, the goal is to test the null hypothesis that the distributions are equal against the alternative that they are not. In the latter, given observations drawn from an unknown joint distribution, the goal is to test the null that the underlying random variables are independent against the alternative that they are not. Both problems have been extensively studied in the batch setting when the sample size is specified before collecting data. The main issue is that in general composite nonparametric settings, even if the null is false, it is unknown beforehand collecting how much data will be ''enough'' to reject. Sequential testing is a complementary approach when an incoming data stream is analyzed online. In this talk, I will discuss our recent sequential twosample and independence tests that (a) continuously monitor the data while controlling the false alarm rate, (b) are consistent, meaning that they are guaranteed to stop and reject if the null is false, and (c) provably adapt to the complexity of a problem and stop earlier on easy tasks / later on harder ones. 
18.05.2023  Ulyana Vinogradova (HSE University)  Leveraging covariate adjustments at scale in online A/B testing  Various companies run randomized online experiments to estimate the “causal impact” of new adopted features on performance of metrics. Difference in means estimator is still one of the most commonly used method in many online A/B testing platforms due to its simplicity. However, this estimator often fails to detect small effects of the intervention despite the fact that the experiment contains thousands or millions of customers. A potential solution to this issue is to develop an alternative estimator, that leverages the abundance of additional data (covariates) collected during the experiments. In this talk, I will discuss recently proposed generalpurpose algorithm for the estimation of causal effects with covariates to the setting of online A/B testing. Also I will talk about benefits, costs and risks of allowing experimenters to leverage more complicated estimators that use covariates to estimate causal effects. 
25.05.2023  Ivan Safonov (HSE University)  Tensorization of flows: a tool for variational inference  Consider the classical problem of learning the approximation from a given class of distributions to a given distribution up to a normalizing constant. To solve this problem, you can apply flow normalization. The unimodality of the reference Gaussian distribution used in the normalizing flow may cause difficulties in studying multimodal distributions. We introduce an extension of normalizing flows, in which the Gaussian reference value is replaced by a reference distribution (tensor flow), which is constructed using a tensor network. TF combined with NF gives better results than each of them individually. In this talk, I will talk about the recently proposed tensorizing flow, how to extract a sample from it, how to study it, and how to apply it to VI together with NF. The tensor flow construction is similar to the construction described in Sergey Dolgov's articles, but it has some interesting differences that we will discuss. 
08.06.2023  Ivan Peshekhonov and Alexey Arzhantsev (HSE University)  Exploring the Benefits of Riemannian Optimization on Shared Factor Tensor Manifolds  Factorization of a matrix into a product of two rectangular ones, called factors, is a classic tool in various machine learning applications. Tensor factorizations generalize this concept to more than two dimensions. In certain applications, where some of the tensor dimensions encode the same objects (e.g., knowledge base graphs with entityrelationentity 3D tensors), it may also be beneficial for the respective factors to be shared. In this paper, we consider a wellknown Tucker tensor factorization and study its properties under the shared factor constraint. We call it a sharedfactor Tucker factorization (SFTucker). We prove that a manifold of fixed rank SFTucker tensors forms a smooth manifold and develop efficient algorithms for the main ingredients of Riemannian optimization: projection to a tangent plane for generalpurpose loss functions and retraction. Having these tools at hand, we also implement a Riemannian optimization method with momentum and showcase the benefits of our approach on several machine learning tasks. 
15.06.2023  Irina Goloborodko (HSE University)  Jacobian regularization of tensor decompositions in neural networks  Tensor decompositions can be used as a compression tool for neural networks. For this purpose the tensor of weights of a fullconnected or convolutional layer is replaced in the architecture by its own tensor decomposition. The disadvantage of this approach is the instability of the compressed neural network for some types tensor decompositions. In order to improve the stability of the compressed neural network, randomized Jacobian spectral regularizer can be applied. The regularizer relies on matvec functions of the Jacobian and its transposed and can be efficiently computed using automatic differentiation tools. This results in great versatility of this method and makes it possible to apply proposed technique in different scenarios. 
29.06.2023  Nikita Morozov (HSE University)  Introduction to Generative Flow Networks (GFlowNets)   
Fall 2022
Date  Speaker  Title  Abstract 
19.09.2022  Denis Belomestny (HSE University, Duisburg Essen University)  Statistics of McKeanVlasov equations  In this talk we study the problem of semiparametric estimation for a class of McKeanVlasov stochastic differential equations. Our aim is to estimate the drift coefficient of a MVSDE based on observations of the corresponding particle system. We propose a semiparametric estimation procedure and derive the rates of convergence for the resulting estimator. We further prove that the obtained rates are essentially optimal in the minimax sense. 
26.09.2022  Alexander Tarakanov (HSE University)  MultiLevel Unbiased MCMC Estimators for Discretized Models  ODE and PDE models routinely in different applications. Significant part of practical questions can be reduced to integration over the distribution of ODE/ODE parameters spaces. The practical way for solving such problem is MCMC integration coupled with spatial/temporal discretization of ODE/PDE model. Such discretization introduces bias to MCMC integration. In the present talk the novel scheme for MultiLevel Unbiased MCMC method is discussed. The technique concerned is provides unbiased estimator with finite variance and finite expected computational cost for a given Quantity of Interest under reasonable smoothness constraints. Application examples are provided. 
03.10.2022  Yana Khassan (HSE University)  Is Qlearning Provably Efficient? (Article overview)  Modelfree reinforcement learning (RL) algorithms, such as Qlearning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than modelbased approaches. However, empirical work has suggested that modelfree algorithms may require more samples to learn. The theoretical question of “whether modelfree algorithms can be made sample efficient” is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. The paper proves that, in an episodic MDP setting, Qlearning with UCB exploration achieves almost optimal regret. This is the first analysis in the modelfree setting that establishes O(√T) regret without requiring access to a “simulator.” 
10.10.2022  Dmitry Molchanov (HSE University)  Twice stochastic variational inference with semiimplicit and noninvariant distributions  Doubly stochastic variational inference (DSVI) is one of the main methods of approximate inference and learning in modern deep probabilistic models. However, this method imposes significant restrictions on the distributions used and has a limited area of applicability. The main goal of my work is to relax these limitations and expand the tools for working with deep probabilistic models. 
24.10.2022  Evgenii Nikishin (Mila, Université de Montréal)  The Primacy Bias in Deep Reinforcement Learning  This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Because of training on progressively growing datasets, deep RL agents incur a risk of overfitting to earlier experiences, negatively affecting the rest of the learning process. Inspired by cognitive science, we refer to this effect as the primacy bias. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generallyapplicable mechanism that tackles the primacy bias by periodically resetting a part of the agent. We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance. 
31.10.2022  Valeriia Shcherbakova (HSE University)  A Contrastive Approach to Online Change Point Detection  We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from prechange and postchange distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove nonasymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and realworld data sets. 
1.11.20223.11.2022  Autumn School  Fall into ML 2022  
07.11.2022  Artur Goldman (HSE University)  Theoretical guarantees for approximate sampling from smooth and logconcave densities  Sampling from various kinds of distributions is an issue of paramount importance in statistics. In many situations, the exact sampling from a given distribution is impossible or computationally expensive and, therefore, one needs to resort to approximate sampling strategies. However, there is no welldeveloped theory providing meaningful nonasymptotic guarantees for the approximate sampling procedures, especially in the highdimensional problems. This paper makes some progress in this direction by considering the problem of sampling from a distribution having a smooth and logconcave density. Authors establish nonasymptotic bounds for the error of approximating the target distribution by the one obtained by the Langevin Monte Carlo method and its variants, and illustrate the effectiveness of the established guarantees with various experiments. 
14.11.2022  Daria Demidova (HSE University)  Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices  This paper studies the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution V = ef on ℝn. Authors prove a convergence guarantee in KullbackLeibler (KL) divergence assuming that V satisfies a logSobolev inequality and the Hessian of f is bounded. Notably, convexity or bounds on higher derivatives are not assumed. Authors also prove convergence guarantees in Rényi divergence of order q > 1 assuming the limit of ULA satisfies either the logSobolev or Poincaré inequality. Authors also prove a bound on the bias of the limiting distribution of ULA assuming thirdorder smoothness of f, without requiring isoperimetry. 
21.11.2022 and 28.11.2022  Anton Zubekhin (HSE University)  Diffusion Probabilistic Models explosion  Diffusion probabilistic models are stateoftheart method for image generation. More than that, diffusion models also perform good working with data of different modalities, such as speech and text. Texttoimage diffusions are the most wellknown types of such models with Dalle2, Imagen and Stable Diffusion being famous even outside of ML community. Recent advances in machine learning show even wider applicability of diffusion models  they can be used for video generation, 3D reconstructions, metalearning and even direct optimization of other neural networks. 
05.12.2022  Varvara Rudenko (HSE University)  Logconcave sampling: MetropolisHastings algorithms are fast  The authors study the problem of sampling from a strongly logconcave density supported on R^d, and prove a nonasymptotic upper bound on the mixing time of the Metropolisadjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an acceptreject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), their bounds show that the use of an acceptreject step in MALA leads to an exponentially improved dependence on the errortolerance. And also demonstrate the gains of a modified version of MALA over ULA for weakly logconcave densities. Furthermore, they derive mixing time bounds for the Metropolized random walk (MRW) and obtain O(κ) mixing time slower than MALA. They provide numerical examples that support their theoretical findings, and demonstrate the benefits of MetropolisHastings adjustment for Langevin type sampling algorithms. 
12.12.2022  Marina Sheshukova (HSE University)  Exponential Concentration Inequalities for Additive Functionals of Markov Chains  Using the renewal approach we prove exponential inequalities for additive functionals and empirical processes of ergodic Markov chains, thus obtaining counterparts of inequalities for sums of independent random variables. The inequalities do not require functions of the chain to be bounded and moreover all the involved constants are given by explicit formulas whenever the usual drift condition holds, which may be of interest in practical applications e.g. to MCMC algorithms. 
19.12.2022  Timofey Grinev (HSE University)  Sharper Bounds for Uniformly Stable Algorithms  Deriving generalization bounds for stable algorithms is a classical question in learning theory. This paper is devoted to two questions: firstly it provides a short proof of the moment bound that implies the generalization bound stronger than recent results. Secondly, it proves general lower bounds, showing that this moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. The main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest. 
Date  Speaker  Title  Abstract 
13.01.2021  Vladimir Spokoiny (WIAS, IITP RAS, HSE University)  Bayesian inference in machine learning  Motivated by examples from nonparametric binary classification, ranking problems, analysis of random graphs, we consider the problem of statistical inference for binary data within a nonparametric Bernoulli model. In many situations the underlying Bernoulli parameter may approach the edges zero or one that leads to inference for nonregular models. Properties like asymptotic normality of the MLE or of the posterior distribution are not valid in such irregular cases. The approach of this paper suggests to impose a Gaussian prior on the parameter of logistic regression function, or, analogously, to consider a penalized MLE of the canonical link with a quadratic penalty. A Gaussian prior can be viewed as a kind of a barrier preventing the canonical parameter to approach infinity and forcing it to stay within the region of regularity. Our main results state finite sample analogs of the classical statistical results like asymptotic normality of the posterior distribution or asymptotic normality of the penalized MLE for any configuration of Bernoulli parameters under the assumption that the prior is strong enough in direction. We also demonstrate how the results can be applied to ranking problems and nonparametric binary classification. 
09.02.2021  Nikita Puchkin (HSE University, IITP)  Rates of convergence for density estimation with GANs  We undertake a precise study of the nonasymptotic properties of vanilla generative adversarial networks (GANs) and derive theoretical guarantees in the problem of estimating an unknown $d$dimensional density $p$ under a proper choice of the class of generators and discriminators. We prove that the resulting density estimate converges to $p$ in terms of JensenShannon (JS) divergence at the rate $n^{2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $p$. This is the first result in the literature on density estimation using vanilla GANs with JS rates faster than $n^{1/2}$ in the regime $\beta>d/2$. 
16.02.2021  Sergey Samsonov (HSE University)  On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning  This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in onlinelearning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrixvalued functions and uniform ergodicity of the Markov chains. Our main contribution is an exponential stability result for the pth moment of random matrix product, provided that (i) the underlying Markov chain satisfies a superLyapunov drift condition, (ii) the growth of the matrixvalued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finitetime pth moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear valuefunction estimation in reinforcement learning. We provide finitetime pth moment bound for various members of temporal difference (TD) family of algorithms. 
02.03.2021  Dmitry Yarotsky (Skoltech)  Approximation with deep neural networks  In recent years, there has been quite a number of new results on approximation properties of deep neural networks. This emergent theory is rather different from the classical approximation theory of linear or shallow models (such as splines or Fourier expansion). In this talk, I will try to highlight some interesting ideas of this theory, and will also briefly indicate its connections to other topics such as information theory, dynamical systems, fewnomials and VC dimension. 
09.03.2021  Nikita Puchkin (HSE, IITP)  Exponential Savings in Agnostic Active Learning through Abstention  We show that in poolbased active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss 1/2 of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in poolbased active classification under the model misspecification. 
16.03.2021  Nikita Zhivotovskiy (ETH, Zurich)  DistributionFree Robust Linear Regression  We consider the problem of randomdesign linear regression, in a distributionfree setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability. Joint work with Jaouad Mourtada (CRESTENSAE Paris ) and Tomas Vaškevičius (Oxford). 
23.03.2021  Maxim Panov (Skoltech)  Stochastic normalizing flows  Normalizing flows is one of the central concepts in modern Bayesian machine learning. They allow constructing approximations of complex distributions starting from simple ones (e.g., Gaussian) and transforming it by a series of deterministic mappings. Normalizing flows have found numerous applications being especially useful in variational approach to Bayesian inferences. In this talk, we will show that classical deterministic normalizing flows have severe limitations which can be overcome by adding stochasticity to the procedure. It can significantly improve the power of the method but requires much more complicated training procedures. We will discuss how one can construct variational learning procedures based on the coupling of normalizing flows with the MetropolisHastings accept/reject algorithm. We will also see numerous connections with other classical sampling approaches such as Annealed Importance Sampling (AIS) and Sequential Importance Sampling (SIS). We will illustrate the theory with applications to sampling and the construction of variational autoencoders.(Joint work with Eric Moulines and Achille Thin (Ecole Polytechnique), Alain Durmus (ENS ParisSaclay), Arnaud Doucet (Uni Oxford) and Nikita Kotelevskii (Skoltech)). 
30.03.2021  Dmytro Perekrestenko (ETH Zurich)  Constructive Universal HighDimensional Distribution Generation through Deep ReLU  We present an explicit deep neural network construction that transforms uniformly distributed onedimensional noise into an arbitrarily close approximation of any twodimensional Lipschitzcontinuous target distribution. The key ingredient of our design is a generalization of the “spacefilling” property of sawtooth functions discovered in (Bailey & Telgarsky, 2018). We elicit the importance of depth—in our neural network construction—in driving the Wasserstein distance between the target distribution and the approximation realized by the network to zero. We also will outline an extension of our method to output distributions of arbitrary dimension. Finally, we show that the proposed construction does not incur a cost—in terms of error measured in Wassersteindistance—relative to generating ddimensional target distributions from d independent random variables. 
06.04.2021  Daniil Tiapkin (HSE University)  Improved Complexity Bounds in the Wasserstein Barycenter Problem  In this paper, we focus on computational aspects of the Wasserstein barycenter problem. We provide two algorithms to compute Wasserstein barycenter of m discrete measures of size n with accuracy $\e$. The first algorithm, based on mirror prox with some specific norm, meets the complexity of celebrated accelerated iterative Bregman projections (IBP), that is $\widetilde O(mn^2\sqrt n/\e)$, however, with no limitations unlike (accelerated) IBP, that is numerically unstable when regularization parameter is small. The second algorithm, based on areaconvexity and dual extrapolation, improves the previously bestknown convergence rates for Wasserstein barycenter problem enjoying $\widetilde O(mn^2/\e)$ complexity. 
13.04.2021  Alexey Kroshnin (IITP RAS, HSE University)  Robust kmeans clustering in metric spaces  In this talk we consider robust median of means based algorithms for the kmeans clustering problem in a metric space. The main results are nonasymptotic excess distortion bounds that hold under the two bounded moments assumption in a general separable metric space. In the case of Hilbert space our bounds have the subGaussian form depending on the probability mass of the lightest cluster of an optimal quantizer. In a special case of clustering in R^d we prove matching (up to constant factors) nonasymptotic upper and lower bounds on the excess distortion, which depend on the probability mass of the lightest cluster and on the variance. The talk is based on a joint work with Yegor Klochkov and Nikita Zhivotovskiy 
20.04.2021  Quentin Paris (HSE University)  Online learning with exponential weights in metric spaces  This paper addresses the problem of online learning in metric spaces using exponential weights. We extend the analysis of the exponentially weighted average forecaster, traditionally studied in a Euclidean settings, to a more abstract framework. Our results rely on the notion of barycenters, a suitable version of Jensen's inequality and a synthetic notion of lower curvature bound in metric spaces known as the measure contraction property. We also adapt the onlinetobatch conversion principle to apply our results to a statistical learning framework. 
27.04.2021  Иван Оселедец (Сколтех)  Геометрия в моделях машинного обучения  Об использовании геометрического подхода к моделям машинного обучения, а также приведу обзор наших результатов последних лет, включая: а) применение гиперболических пространств в машинном обучении и создание рекомендательных моделей; б) построение интерпретируемых направлений в латентном пространстве генеративных сетей. 
12.05.202115.05.2021  Conference  New Frontiers in Highdimensional Probability and Applications to Machine Learning  
18.05.2021  Egor Klochkov (Cambridge University)  Fast rates for strongly convex optimization via stability  The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondrák, 2018, 2019), (Bousquet, Klochkov, Zhivotovskiy, 2020) contain a generally inevitable sampling error term of order 1/sqrt(n). When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the socalled Bernstein condition is satisfied, the term O(1/sqrt(n)) can be avoided, and high probability excess risk bounds of order up to O(1/n) are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate O(log n / n) for strongly convex and Lipschitz losses valid for any empirical risk minimization method. This resolves a question of ShalevShwartz, Shamir, Srebro, and Sridharan (2009). We discuss how O(logn/n) high probability excess risk bounds are possible for projected gradient descent in the case of strongly convex and Lipschitz losses without the usual smoothness assumption. This is a joint work with Nikita Zhivotovskiy and Olivier Bousquet. 
01.06.2021  Sergey Samsonov (HSE University)  UVIP: ModelFree Approach to Evaluate Reinforcement Learning Algorithms  Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL). Yet even a precise knowledge of the value function V^{\pi} corresponding to a policy \pi does not provide reliable information on how far is the policy \pi from the optimal one. We present a novel modelfree upper value iteration procedure (UVIP) that allows us to estimate the suboptimality gap V*(x)  V^{\pi}(x) from above and to construct confidence intervals for V*. Our approach relies on upper bounds to the solution of the Bellman optimality equation via martingale approach. We provide theoretical guarantees for UVIP under general assumptions and illustrate its performance on a number of benchmark RL problems. The talk is based on our recent work with Denis Belomestny, Ilya Levin, Eric Moulines, Alexey Naumov and Veronika Zorina. 
22.06.2021  Dmitrii Ostrovskii (University of Southern California)  NearOptimal Model Discrimination with NonDisclosure  In the standard setup of parametric Mestimation with convex loss, we consider two population risk minimizers associated with two possible distributions of a random observation, and then ask the following question: Given the value of one of the two minimizers and access to i.i.d. sampling from both distributions, what sample sizes allow to identify the “right” distribution, i.e., the one corresponding to the given value?Making the first steps towards answering this question in full generality, we first consider the case of a wellspecified linear model with squared loss. Here we provide nearly matching upper and lower bounds on the sample complexity, showing it to be min{1/Δ^2,\sqrt{r}/Δ} up to a constant factor, where Δ is a measure of separation between the two distributions and r is the maximum (resp., minimum) of the ranks of the design covariance matrices in the case of the upper (resp., lower) bound. This bound is dimensionindependent and rankindependent for large enough separation. We then extend this result in two directions: (i) for the general parametric setup in asymptotic regime; (ii) for generalized linear models in the smallsample regime n≤r and under weak moment assumptions. In both cases, we derive sample complexity bounds of a similar form, even under misspecification. Our testing procedures only access the known'' value through a certain functional of empirical risk. In addition, the number of observations that allows to reach statistical confidence for testing does not allow to "resolve" the two models  that is, recover both minimizers up to O(Δ) prediction accuracy. These two properties open the prospect of applying our testing framework in practical tasks, where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually inferred by the identifying agent. The talk will be following the recent paper arxiv:2012.02901. 
12.10.2021  Evgeny Spodarev (Ulm University, Germany)  Spectral classification of fullerenes  Joint work with A. Bille (Skoltech, Moscow/Ulm University), V. Buchstaber (Steklov Mathematical Institute, Moscow) and S. Coste (INRIA, Paris). Fullerenes $C_n$ are 3simple convex compact polytopes with $n$ vertices, $n\ge 20$ even, $n\neq 22$, 12 pentagons and $n/210$ hexagons as facets. They are mathematical models of famous carbon molecules which find wide applications in modern nanotechnologies. For the synthesis of $C_60$, R. Curl, H. Kroto and R. Smalley got a Nobel prize in chemistry (1996). Since the number of distinct fullerene isomers (which are combinatorially non—equivalent versions of $C_n$) grows as $O(n^9)$ for large $n$, their enumeration and classification is a nontrivial mathematical problem. We propose a solution to this problem using the Newton polynomials of the spectra of their dual graphs of hexagons. It appears that the degree of such polynomials has a logarithmic growth order with respect to $n$. If the time allows, the spectral linear ordering of characters of $C_n$ for different $n$ as well as the distribution of a random eigenvalue of Laplacetype matrices for the graphene, regular triangulation of the plane and infinite nanotubes with chiral vector $(p,q)$ will be touched upon. 
23.11.202126.11.2021  Dmitry Vetrov (HSE University); Alexey Naumov (HSE University); Denis Derkach (HSE University); Mikhail Lazarev (HSE University); Nikita Kazeev (HSE University); Sergey Shirobokov (LCL, Twitter); Maxim Panov (Scoltech); Mikhail Burtsev (MIPT)  Third HSEYandex autumn school on generative models 
Spring 2020
Date  Speaker  Title  Abstract 
3.03.2020  Nazar Buzun (Skoltech)  Gaussian approximation of the maximum of the sum of random vectors.  In this paper we consider a Gaussian approximation for the maximum sum of independent random vectors in high dimensionality. Our approach is based on approximation by probability. In this case, the random variables depend on each other. From approximation by probability, approximation by distribution follows directly. We obtained a theoretical estimate of the rate of convergence to the maximum normal vector, which is a significant improvement over those obtained earlier. 
6.03.2020  Alexandra Suvorikova (WIAS Berlin)  Shapebased domain adaptation via optimal transportation  Domain adaptation problem aims at learning a well performing model, trained on a source data S (images, vectors, e.t.c), applied then to different (but related) target sample T. Aside from being attractive due to obvious practical utility, the setting is challenging from theoretical point of view. In this work we introduce a novel approach to supervised domain adaptation consisting in a classdependent fitting based on ideas from optimal transportation (OT) theory which considers S and T as two mixtures of distributions. A parametrized OT distance is used as a fidelity measure between S and T, providing a toolbox for modelling of possibly independent perturbations of mixture components. The method is than used for describing the adaptation of immune system in humans after moving to another climatic zone. 
11.03.2020  Sergey Bobkov (University of Minnesota, HSE University)  Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem  We plan to discuss the socalled concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered 
18.03.2020  Sergey Bobkov (University of Minnesota, HSE University)  Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem  We plan to discuss the socalled concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered 
24.03.2020 (Zoom)  Maxim Kaledin (HSE University)  RiskSensitive Reinforcement Learning a review of PhD thesis by Aviv Tamar  The main issue which keeps the applications of reinforcement learning limited is the fact that robustness, stability and durability of the learned decision policies are hard to prove. This happens due to usage of expected utility criterion which is optimized with respect to policy in all famous algorithms of RL. The criterium arised from economics and game theory where expected utility theory was historically the first wellestablished concept of rationality. Being very popular and wellstudied in both mathematics and applications, the concept itself is not the only possible way to define rationality. In particular, there is more general class of functionals called coherent risk measures introduced in Delbaen,Artzner1998. Coherent risk measures are widely known in finance for they address exactly the robustness problem and try to quantify risk differently; despite that, in RL coherent risk measures are not so known and there are not so many papers on its usage. My talk will be devoted to the review of PhD thesis by Aviv Tamar on exploiting risk measures in RL which represents in itself a very comprehensive study on how to construct learning algorithms on popular risk measures like CVAR and semideviation. 
3.04.2020 (Zoom)  Maxim Panov and Nikita Kotelevskiy (Skoltech)  MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference  In this contribution, we propose a new computa tionally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained us ing Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flowbased distributions, thus drastically in creasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expen sive reverse kernels. Extensive numerical experi ments show clear computational and performance improvements over stateofthe art methods. 
21.04.2020 at 16 00 (Zoom)  Nikita Zhivotovskiy (Google Research, HSE University)  Fast classification rates in online and statistical learning with abstention.  This talk is devoted to a simple extension of statistical and online classification setups where the learner is allowed to abstain from a prediction by paying a cost (say, 0.49) only marginally smaller than the average cost 0.5 of a random bit prediction. We discuss how this model implies fast rates for the excess risk and the expected regret without any additional assumptions on the data generating mechanism and the loss function. Finally, we show the relations with recent results in aggregation theory and with the analysis of strongly convex losses. Based on recent works with Olivier Bousquet and Gergely Neu arXiv:1910.12756 and arxiv:2001.10623. 
23.04.2020 at 17 00 (Zoom)  Anastasia Ivanova (MIPT, HSE University) 
Fall 2019
Date  Speaker  Title  Abstract 
17.09.2019  Michel Ledoux (Université de Toulouse)  On optimal matching of random samples  Optimal matching problems are random variational problems widely investigated in the mathematics and physics literature, and in probability theory and statistics in the study of rates of convergence of empirical measures. Twodimensional matching of uniform samples gave in particular rise to deep results investigated from various view points (combinatorial, generic chaining, Fourier analysis). After a review of known results, we investigate more precisely the case of Gaussian samples, first in dimension one, on the basis of explicit representations of Kantorovich metrics and a sharp analysis of more general logconcave distributions in terms of their isoperimetric profile (joint work with S. Bobkov), and then in dimension two and higher following the PDE and transportation approach recently put forward by L. Ambrosio, F. Stra and D. Trevisan. 
2327.09.2019  Workshop  
01.10.2019  Alexander Goldenshluger (University of Haifa, HSE University)  Density deconvolution under general assumptions  In this paper we study the problem of density deconvolution under general assumptions on the measurement error distribution. Typically deconvolution estimators are constructed using Fourier transform techniques, and it is assumed that the characteristic function of the measurement errors does not have zeros on the real line. This assumption is rather strong and is not fulfilled in many cases of interest. In this paper we develop a methodology for constructing optimal density deconvolution estimators in the general setting that covers vanishing and nonvanishing characteristic functions of the measurement errors. We derive upper bounds on the risk of the proposed estimators and provide sufficient conditions under which zeros of the corresponding characteristic function have no effect on estimation accuracy. Moreover, we show that the derived conditions are also necessary in some specific problem instances. 
15.10.2019  Sergey Samsonov (HSE University)  Variance reduction for dependent sequences with applications to Stochastic Gradient MCMC  This talk will be devoted to a novel variance reduction approach for additive functionals of general dependent sequences. Our approach combines the use of control functions with the minimisation of an empirical asymptotic variance estimate. We analyse finite sample properties of the proposed method and derive convergence rates of the excess asymptotic variance to zero. We apply our methodology to Stochastic Gradient MCMC (SGMCMC) and discuss possible extensions for nonMarkovian dependent sequences (SVRGLD, SAGALD). Numerical evaluation of the algorithm will also be presented. The talk is based on a joint work with D. Belomestny, L. Iosipoy, E. Moulines and A. Naumov. 
22.10.2019  Quentin Paris (HSE University)  Some open questions at the "intersection" between optimal transport and statistical learning.  This talk will breafly describe the field of optimal transport and show that statistical learning can be represented as an optimal transport problem. We will investigate a few questions related to this representation and describe some potential research perspectives. 
26  29.11.2019  Autumn School  Autumn School on Generative models  
3.12.2019  Nikita Puchkin (HSE University)  Sample complexity of learning a manifold with an unknown dimension  I will discuss the problem of learning a manifold from noisy observations in the case when the dimension of the true manifold is unknown. Many manifold learning procedures require the intrinsic dimension of the data as an input parameter. However, the existing literature provides theoretical guarantees on the dimension estimation only in the case, when the data lies exactly on the manifold. We study an algorithm, which adaptively chooses the manifold’s dimension and has an optimal sample complexity in terms of the Hausdorff loss. Besides, the procedure produces a manifold, such that its curvature does not exceed the curvature of the true manifold too much. 
Date  Speaker  Title  Abstract 
29.01.2019 at CS HSE  Ekaterina Krymova (Duisburg Essen University)  On estimation of the noise variance in high dimensional linear regression model  We consider a problem of unknown noise variance estimation in highdimensional linear regression model. To estimate a nuisance vector of regression coefficients we use a family of spectral regularisers of the maximum likelihood estimator. Noise variance estimation is based on an adaptive normalisation of the prediction error. We derive an upper bound for concentration of the proposed method around an ideal estimator (in case of zero nuisance). 
05.02.2019  Vladimir Spokoiny (WIAS, HSE University)  Bayesian model selection  We consider different approaches to select a prior from an ordered family of Gaussian process priors (GPP) for a linear regression model . The empirical and full (hierarchic) Bayesian methods are studied with their benefits and drawbacks. A new method for the choice of a hyper prior is suggested. The method is based on the so called “propagation” condition and ensures a onesided contraction and adaptivity of the procedure. 
12.02.2019  Alexander Goldenshluger (University of Haifa, HSE University)  Optimal stopping of a random sequence with unknown distribution  The subject of this talk is the problem of optimal stopping of a sequence of i.i.d. random variables. When the distribution governing the random sequence is known, the optimal solution is given by backward induction. In this talk we will focus on the case when the distribution of observations is unknown. We propose a stopping rule that is based on relative ranks and study its performance as measured by the maximal relative regret over suitable nonparametric classes of distributions. It is shown that the proposed rule is first order asymptotically optimal and nearly rate–optimal in terms of the rate at which the relative regret converges to zero. We also develop a general method for numerical solution of sequential stopping problems with no distributional information, and use it in order to implement the proposed stopping rule. (Joint work with A. Zeevi) 
19.02.2019  Alexey Ustimenko (HSE University)  Direct Optimization of Ranking Quality Metrics via Stochastic Approximation  In this paper, we present a novel approach for consistent direct optimization of any ranking quality metric that significantly outperforms the stateoftheart methods on popular learningtorank datasets. The framework is based on a smoothing procedure and a method of local stochastic approximation of the ranking function inspired by gradientfree methods for blackbox optimization. We also present novel unbiased and effective estimates for gradients of the smoothed ranking quality metrics that theoretically outperform likelihood ratio method while their time complexity remains the same due to combinatorial structure of the problem. 
26.02.2019  Denis Belomestny (Duisburg Essen University, HSE University)  MacKeenVlasov stochastic differential equations: simulation and estimation  In this talk we propose a novel projectionbased particle method for solving McKeanVlasov stochastic differential equations. This approach is based on a projectiontype estimation of the marginal density of the solution in each time step. The projectionbased particle method leads in many situations to a significant reduction of numerical complexity compared to the widely used kernel density estimation algorithms. We derive strong convergence rates and rates of density estimation. Furthermore we give explicit solutions for a class of McKeanVlasov equations with affine drift. The performance of the proposed algorithm is illustrated by several numerical examples. 
5.03.2019  Ivan Panin (Skoltech, IITP RAS)  Least squares in noiseless setting and Sobol indices convergence for random design  We briefly review recent progress in noiseless nonparametric regression and consider its application to global sensitivity analysis and particularly for estimation of Sobol indicies. Sobol (sensitivity) indices allow to quantify respective effects of input random variables (or combinations thereof) onto variance of a physical or mathematical model response. We focus on the problem of Sobol indices estimation via metamodel approach in the case of random design of experiments; on both nonasymptotic and asymptotic behavior of Sobol indices estimates and the model selection for increasing computational budget. We also consider the question of an optimal rate of Sobol indices convergence. 
12.03.2019  Dmitry Yarotsky (Skoltech, IITP RAS)  Gradient descent in neural networks: mean field theory and optimal transportation  I will describe recent results in which gradient descent in shallow neural networks is approximated by an integrodifferential equation appearing in the limit of infinite network width. In particular, I will discuss the connection to the optimal transport theory and will mention several modes in which the dynamics admits an approximate analytic solution. 
19.03.2019  Egor Kosov (MSU, HSE University)  Lower bounds for measures of deviation of polynomials from their expectations  In the talk we will discuss the recent results on lower estimates of small ball probabilities for random variables that are polynomials in Gaussian and general logarithmically concave random variables. 
26.03.2019  Nurlan Shagadatov (HSE University)  Martingale representations for Markov Chains with application to MCMC.  In this talk I will describe recent results of joint work with D. Belomestny and E. Moulines. I consider efficient variance reduction approach for Markov Сhain Monte Carlo algorithms relying on a new discrete time martingale representations. In particular, I will consider theoretical guarantees and generic method for Unadjusted Langevin Algorithm. 
09.04.2019  Evgenii Chzhen ( Laboratoire d'Analyse et de Mathématiques Appliquées)  Semisupervised confidence set classification  An image annotation can be seen as a multiclass classification problem with a large number of classes. In this context, confusion between classes can occur, and single label classification may be misleading. We will talk about the multiclass classification with confidence sets of a controlled expected cardinal in nonparametric settings. We will propose a semisupervised estimator which is either minimax optimal or optimal up to a log factor depending on the way we measure the risk. We will describe the regimes when there is no supervised estimator that can improve the proposed semisupervised estimator and when the semisupervised estimation does not help from the minimax point of view. (joint work with C. Denis and M. Hebiri) 
16.04.2019, CS HSE, 205 at 17:00  Eric Moulines (Ecole Polytechnique, HSE University)  Mini course Introduction to reinforcement learning 1  The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletelyknown Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver. 
23.04.2019, CS HSE, 503 at 17:00  Eric Moulines (Ecole Polytechnique)  Mini course Introduction to reinforcement learning 1  The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletelyknown Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver. 
3.06.20197.06.2019  Nikolai Krylov (University of Minnesota)  Mini course (3 lectures)  Topics:
Schedule: 7.06, 16:00  18:00 14.06, 16:00  18:00 28.06, 16:00  18:00 Venue: Faculty of Mathematics HSE, Usacheva str, 5, room 108 
Date  Speaker  Title  Abstract 
26.12.2018  Qeuntin Paris (HSE University)  Learning functional minimizers on metric spaces.  In this talk, we will discuss recent developments motivated by statistical problems that arise in optimal transport. In particular we will discuss new results concerning the estimation of so called barycenters (or Frechet means) in the Wasserstein space. Before presenting our results, the talk will review some basic material from geometry and optimal transport. The end of the talk will discuss some open problems. 
26.12.2018  Sergey Samsonov (HSE University, Skoltech)  Tansportation and concentration of measure inequalities for Markov Chains  In this talk I am going to observe some concentration of measure inequalities, starting from rather classical ones by Bakri, Ledoux. Then I will cover some interesting concentration results for Markov Chains, related with transportation inequalities following papers by Djellot, Guillin, Wu 2004 and Joulin, Olivier 2010, and concentration results for quadratic forms of vectors with dependent entries following Adamczak 2015 with particular applications to Unadjusted Langevin Dynamics. 
13.12.2018  Sergey Samsonov (HSE University, Skoltech)  Concentration inequalities and their applications  In this talk I am going to cover some functional inequalities for dependent sequences of random variables, especially focusing on the results for Markov chains based on the work by Djellout, Guilin and Wu. As one of the applications we will consider Unadjusted Langevin Algorithm, following paper by Durmus, Moulines, and state some concentration results for it. 
16.12.2018  Maxim Kaledin (HSE University, Skoltech)  Dynamic Programming in Stochastic Optimal Control Problems.  Stochastic control problems are wellknown in financial mathematics (option pricing, portfolio management) and in many related areas. These problems even in case of discrete state space and known transition density are subject to curse of dimensionality when using deterministic algorithms. Kean&Wolpin(1994) and Rust (1997) were the first who proved that this can be avoided in case of discrete Markov Decision Processes with stochastic algorithms. The modern challenge is to figure out what happens with complexity when the transition density can only be approximated (RL,ML, Finance applications). During my talk I will cover known and new complexity estimates for stochastic control problems as well as several algorithms to solve MDP. 
29.11.2018  Vladimir Ulyanov (HSE University, MSU)  Nonlinear expectations: notion, facts, applications.  We give a short review on recent developments on problems of probability model under uncertainty by using the notion of nonlinear expectations and, in particular, sublinear expectations. We discuss as well a new law of large numbers (LLN) and central limit theorem (CLT) and provide nonasymptotic results. Classical LLN and CLT have been widely used in probability theory, statistics, data analysis as well as in many practical situations such as financial pricing and risk management. They provide a strong and convincing way to explain why in practice normal distributions are so widely utilized. But often a serious problem is that, in general, the “i.i.d.” condition is difficult to be satisfied. In practice, for the most realtime processes and data for which the classical trials and samplings become impossible, the uncertainty of probabilities and distributions can not be neglected. In fact the abuse of normal distributions in finance and many other industrial or commercial domains has been criticized. The new CLT does not need this strong “i.i.d.” assumption. Instead of fixing a probability measure P, it is introduced an uncertain subset of probability measures {P_a : a in A } and considered the corresponding sublinear expectation as sup{a in A} E_{a}X.

22.11.2018  Nikita Zhivotovskiy (HSE University)  Covariance estimation: missing observations and heavy tails.  In this talk, I will discuss some recent results on covariance estimation together with some work in progress. I will start with the subgaussian case when some of the observations are missing: we observe N i.i.d. vectors and want to estimate the covariance matrix. The problem is that some of the components of each vector are missing, that is these components are set to be equal to zero with some probability. We show how to improve the state of the art results in this scenario. Next, we discuss how to apply dimensionfree Bernstein type inequalities for vectors with heavy tails under mild moment assumptions.

14.11.2018  Eric Moulines (Ecole Polytechnique, HSE University)  Lowrank Interaction with Sparse Additive Effects Model for Large Data Frames (NIPS paper)  Many applications of machine learning involve the analysis of large data frames — matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples — with missing values. Lowrank models, as studied by Udell et al. (2016), are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row and column, or covariate effects. In this paper, we introduce a lowrank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and lowrank design, to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sublinearly to an optimal solution and is computationally efficient for large scale data sets. We show on simulated and survey data that the method has a clear advantage over current practices. 
14.11.2018  Błażej Miasojedow (University of Warsaw)  Statistical inference for Continous Time Bayesian Networks.  Continuous Time Bayesian Networks (CTBN) are a class of multivariate Markov Jump Processes (MJP), where dependence between coordinates is described by a directed graph with possible cycles. CTBNs are a flexible tool for modelling phenomena from different areas such as chemistry, biology, social sciences etc. In the talk, I will discuss the problems of parameter learning and structure learning for CTBNs with a special attention to computational methods. 
8.11.2018  Alexey Naumov (HSE University)  Gaussian approximations for maxima of large number of quadratic forms of highdimensional random vectors  Let X_1, …, X_n be i.i.d. random vectors taking values in R^d, d \geq 1, and Q_1, …, Q_p, p \geq 1, be symmetric positive definite matrices. We consider the distribution function of vector (Q_j S_n, S_n), j = 1, …, p, where S_n = n^{1/2}(X_1 + … + X_n), and prove the rate of Gaussian approximation with explicit dependence on n, p and d. We also compare this result with results of Bentkus (2003) and Chernozhukov, Chetverikov, Kato (2016). Applications to change point detection and model selection will be discussed as well. The talk is based on the joint project with F. Goetze, V. Spokoiny and A. Tikhomirov.

1.11.2018  Andzhey Koziuk (WIAS, Berlin)  Exact smoothning of a probability measure  In the talk an exact smoothing of indicator function is introduced and developed to allow differential calculus access the internal machinery behind a problem of measures comparison. Specifically, the problem of accuracy of nonclassical analogue of BerryEsseen inequality for a radnom elemets in Hilbert space is addressed with the help of the tool and aided with the classical Lindenberg chained sum construction. 
25.10.2018  Nikita Zhivotoskiy (Technion, HSE University)  Robust estimation of the mean of a random vector  n this talk, we discuss the estimation of the mean random of a vector from an i.i.d. sample of its observations. The latest results show that using a special procedure, the mean of a random vector, for which only the existence of a covariance matrix is known, can be estimated as precisely as if the vector had a multidimensional normal distribution with the same covariance structure. The estimation of the covariance matrix with minimal constraints on the underlying distribution will be briefly discussed. The talk is based on the works of arxiv: 1702.00482 and arxiv: 1809.10462 
18.10.2018  Alexey Ustimenko (HSE University, Skoltech)  Convergence of Laplacian Spectra from Point Clouds  The LaplacianBeltrami operator (LBO) on manifolds and its spectral structure has been widely used in many applications such as manifold learning, spectral clustering, dimensionality reduction and so on. Recently, Point Integral Method (PIM) was proposed to estimate the LBO by the discrete Laplace operator constructed from a set of sample points. We would discuss a paper “Convergence of Laplacian Spectra from Point Clouds” by Zuoqiang Shi, Jian Sun and their’s proof of the convergence of the eigenvalues and eigenvectors obtained by PIM to the eigenvalues and eigenfunctions of the LBO with the Neumann boundary. 
11.10.2018  Nikita Puchkin (HSE University)  Manifold Learning  I will discuss a problem of a manifold reconstruction from a noisy point cloud. There are plenty methods of manifold learning (ISOMAP, LLE, etc.) but only a few papers provide a theoretical justification of the proposed algorithms (Genovese et. al. (2012), Maggioni et. al. (2016), Aamari and Levrard (2017)). An approach based on a local polynomial estimation (Aamari, Levrard (2017)) reconstructs distinct parts of the manifold and the reconstructed manifold is often disconnected. Another approach (Genovese et. al. (2012)) tries to estimate a density with a support on the manifold. This approach requires a specific distribution of the noise, which is unlikely to hold in practice. A recently proposed method (Osher et. al. (2017)), based on an approximation of the LaplaceBeltrami operator, is aimed at a global reconstruction of the manifold and does not have the mentioned drawbacks. Unfortunately, the statistical properties of this method are not studied so far. In my talk, I will discuss some details concerning a theoretical justification of the procedure. 
4.10.2018  Valentina Shumovskaya (HSE University, Skoltech)  Towards Hypothesis Testing for Random Graphs with Community Structure  The analysis of random graphs and network analysis recently become an active area of research. We will dicuss the problem of testing between two populations of inhomogeneous random graphs defined on the same set of vertices, where the null hypothesis means that the underlying edge probability matrices coincide. We propose a new approach of random graphs testing based on community detection procedure: we introduce a structural assumption, which is that our graphs have community structures with k communities. 
27.09.2018  Alexander Goldenshluger (University of Haifa, HSE University)  Nonparamertic estimation in nonregular deconvolution problems.  I will discuss problems of nonparametric estimation in nonregular deconvolution models. In such settings standard estimation methods based on the Fourier transform are not applicable. I will review the existing literature on the topic, discuss general ideas for constructing optimal estimators, and present some preliminary results. Based on joint ongoing work with D. Belomestny. 
20.09.2018  Alexander Timofeev (MIPT)  Densitysensitive semisupervised inference  In the talk, we will review the article written by Martin Azizyan, Aarti Singh and Larry Wasserman. We will consider semisupervised inference for the regression problem, based on a densitysensitive metric and show that the proposed semisupervised method outperforms any supervised one for certain sets. At the end of the talk, We will demonstrate how to calculate the densitysensitive metric. 
13.09.2018  Katya Krymova (University of DuisburgEssen)  On a sparse modification of projection approximation subspace tracking method  In this talk we revisit the wellknown constrained projection approximation subspace tracking algorithm (CPAST) and derive nonasymptotic bounds on its error. Furthermore we introduce a novel sparse modification of CPAST which is able to exploit sparsity of the underlying covariance structure. We present a nonasymptotic analysis of the proposed algorithm and study its empirical performance on simulated and real data. 
10.09.2018  Steve Oudot (Ecole Polytechnique)  Statistics and Learning with topological descriptors  I this survey talk I will review some of the recent advances in Topological Data Analysis to define stable descriptors for data, to use these descriptors for learning tasks, and to do statistics on them. Among other topics, I will address finding suitable metrics, defining and computing means and barycenters, deriving confidence intervals, laws of large numbers, central limit theorems, and kernel mean embeddings. 
Have you spotted a typo?
Highlight it, click Ctrl+Enter and send us a message. Thank you for your help!
To be used only for spelling or punctuation mistakes.