Scientific seminar

Structural Learning Seminar is a seminar for researchers, Master’s and Ph.D. students of HSE, Skoltech, IITP. This seminar is devoted to the problems of Mathematics and Computer Science and covers areas such as Mathematical Statistics, Machine Learning, Random Matrix Theory, Optimization, Information Theory, etc.

Seminar organizers: Sergey Samsonov, Nikita Puchkin

Follow the updates of the seminars in our telegram channel: HDI Lab junior team seminar

Those who wish to make a presentation at the Seminar are asked to e-mail: npuchkin@hse.ru or svsamsonov@hse.ru with an indication of the topic of the presentation, an abstract in English (not more than 1/2 page) and a suitable date of presentation. The speaker should inform in advance about the chosen form of the report - on the chalk board or computer presentation (the preferred form of the report is on the chalk board). The length of the presentation is 1 hour (if necessary, the length of the presentation may be extended to 1.5-2 hours)

For the winter 2024 semester, the seminar takes place on Thursdays at 4:20 p.m.

2025

	Speaker	Title	Abstract
06.02.2025	Timofey Gritsaev (Constructor University, Bremen and HSE University)	Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization	Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.
13.02.2025	Konstantin Yakovlev (HSE University)	Generalization error bound for denoising score matching under relaxed manifold assumption	We investigate theoretical aspects of denoising score matching estimation within a nonparametric Gaussian mixture framework. We significantly relax the standard manifold assumption allowing the samples step away from the manifold. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate, which is chosen among the class of ReLU-networks with a tailored architecture. The rates of convergence are determined by the intrinsic dimension. Notably, our bounds remain valid even under conditions where the ambient dimension increases polynomially with the sample size, thereby extending the applicability of denoising score matching to a broader range of high-dimensional scenarios
20.02.2025	Marina Sheshukova (HSE University and Skoltech)	Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson–Romberg Extrapolation	We address the problem of solving strongly convex and smooth minimization problems using stochastic gradient descent (SGD) algorithm with a constant step size. Previous works suggested to combine the Polyak-Ruppert averaging procedure with the Richardson-Romberg extrapolation technique to reduce the asymptotic bias of SGD at the expense of a mild increase of the variance. We significantly extend previous results by providing an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations n. More precisely, we show that the mean-squared error can be decomposed into the sum of two terms: a leading one of order O(n^{−1/2}) with explicit dependence on a minimax-optimal asymptotic covariance matrix, and a second-order term of order O(n^{−3/4}) where the power 3/4 is best known. We also extend this result to the p-th moment bound keeping optimal scaling of the remainders with respect to n. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain. In particular, we establish that this chain is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric
27.02.2025	Egor Gladin (HSE University)	Accuracy Certificates for Convex Minimization with Inexact Oracle	Accuracy certificates for convex minimization problems allow for online verification of the accuracy of approximate solutions and provide a theoretically valid online stopping criterion. When solving the Lagrange dual problem, accuracy certificates produce a simple way to recover an approximate primal solution and estimate its accuracy. In this talk, we generalize accuracy certificates for the setting of an inexact first-order oracle, including the setting of primal and Lagrange dual pair of problems. We further propose an explicit way to construct accuracy certificates for a large class of cutting plane methods based on polytopes. As a by-product, we show that the considered cutting plane methods can be efficiently used with a noisy oracle even though they were originally designed to be equipped with an exact oracle. Finally, we illustrate the work of the proposed certificates in the numerical experiments highlighting that our certificates provide a tight upper bound on the objective residual
13.03.2025	Daniil Tyapkin (Ecole Polytechnique and HSE University)	On Teacher Hacking in Language Model Distillation	Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs
27.03.2025	Fyodor Noskov (HSE University)	Dimension-free bounds in high-dimensional linear regression via error-in-operator approach	We consider a problem of high-dimensional linear regression with random design. We suggest a novel approach referred to as error-in-operator which does not estimate the design covariance $\Sigma$ directly but incorporates it into empirical risk minimization. We provide an expansion of the excess prediction risk and derive non-asymptotic dimension-free bounds on the leading term and the remainder. This helps us to show that auxiliary variables do not increase the effective dimension of the problem, provided that parameters of the procedure are tuned properly. We also discuss computational aspects of our method and illustrate its performance with numerical experiments
03.04.2025	Denis Belomestny (HSE University, Duisburg Essen University)	On statistical complexity of the Inverse Optimal Control	Inverse optimal control, also known as inverse reinforcement learning (IRL), aims to deduce the cost function that an expert optimizes based on their observed behavior. In this talk, we rigorously examine the complexity of this inverse problem for linearly-solvable Markov decision processes (LMDPs) connected to Linear Quadratic Stochastic Control problems. We assume a discretized trajectory of an optimally controlled process is observed over an extended period and analyze the error in reconstructing the terminal cost functional. Our findings demonstrate that the rate of convergence, in terms of the number of time steps, is logarithmic and cannot be generally improved
10.04.2025	Artyom Logachev (S.L. Sobolev Institute of Mathematics SB RAS, HSE University)	Stochastic dynamics near critical points in additive stochastic gradient descent	-
17.04.2025	Askar Tsyganov (HSE University)	The Two-To-Infinity Norm And Singular Subspace Geometry With Applications To High-Dimensional Statistics	-
24.04.2025	Alain Durmus (Ecole Polytechnique)	Piecewise deterministic generative models	We will introduce in this talk a novel class of generative models based on piecewise deterministic Markov processes (PDMPs), a family of non-diffusive stochastic processes consisting of deterministic motion and random jumps at random times. Similarly to diffusions, such Markov processes admit time reversals that turn out to be PDMPs as well. We apply this observation to three PDMPs considered in the literature: the Zig-Zag process, Bouncy Particle Sampler, and Randomised Hamiltonian Monte Carlo. For these three particular instances, we show that the jump rates and kernels of the corresponding time reversals admit explicit expressions depending on some conditional densities of the PDMP under consideration before and after a jump. Based on these results, we propose efficient training procedures to learn these characteristics and consider methods to approximately simulate the reverse process. Finally, we provide bounds in the total variation distance between the data distribution and the resulting distribution of our model in the case where the base distribution is the standard d-dimensional Gaussian distribution. Promising numerical simulations support further investigations into this class of models.
22.05.2025	Maxim Shuklin (HSE University)	Streaming PCA: Independent and Markovian Data	This talk will focus on the theoretical analysis of Streaming Principal Component Analysis (Streaming PCA), an online method for estimating the top eigenvectors of a covariance matrix without storing the full dataset. We will first review classical convergence guarantees for batch PCA and then discuss finite-sample convergence bounds for Oja’s algorithm in the streaming setting. The analysis will cover both the standard case of independent observations and the more general setting where data points are sampled from an irreducible, aperiodic, and reversible Markov chain
29.05.2025	Evgenia Romanenkova (Skoltech)	Representation learning methods for optimal change point detection procedures	In sequential data analysis, a change point is a moment of abrupt regime switch in data streams. Detecting these changes quickly and accurately is crucial in various scenarios, from industrial sensor monitoring to challenging video surveillance. However, classic approaches for change point detection (CPD) often struggle with high-dimensional and semi-structured data due to their intricate nature, as they cannot process such data without a proper representation. Moreover, they often work only under strong assumptions and fail to adapt to specific domains with peculiarities. In this talk, we propose a comprehensive framework for CPD via representation learning suitable for a diverse range of scenarios. For low-dimensional industrial data, an introduced two-stage system for effective CPD can work given expert annotations. Alternatively, we have a model that provides universal representations for unsupervised settings. To handle complex high-dimensional data, we propose a novel theoretical-grounded loss function that allows representation learning for deep models. Our loss function balances the trade-off between detection delay and time to false alarm, providing an asymptotic lower bound for established rigorous solutions while remaining differentiable. Under realistic assumptions, it serves as a tight approximation with a power-law of convergence rate. Through comprehensive experiments on synthetic sequences, real-world sensor data, and high-dimensional video datasets, we demonstrate the importance of meaningful representations tailored to the CPD task.
12.09.2025	Denis Belomestny (HSE University)	Inverse Entropy-regularized RL	In this talk we consider entropy regularised RL and aim to solve the inverse statistical problem of recovering the rewards from a sample from the expert based on the optimal policy. We propose an estimator and study its convergence.
19.09.2025	Yuriy Dorn (MSU)	Multi-armed bandit	-
26.09.2025	Vladimir Spokoiny (HSE University, WIAS, Humboldt University)	Ranking from pairwise comparison	The recent paper Gao et al. (2023) on estimation and inference for top-ranking problem in Bradley-Terry-Luce (BTL) model presented a surprising result: component-wise estimation and inference can be done under much weaker conditions on the number of comparison then it is required for the full dimensional estimation. The present paper revisits this finding from completely different viewpoint. Namely, we show how a theoretical study of estimation in sup-norm can be reduced to the analysis of plug-in semiparametric estimation. For the latter, we adopt and extend the general approach from Spokoiny (2025b) to high-dimensional estimation and inference. The main tool of the analysis is a theory of perturbed marginal optimization when an objective function depends on a low-dimensional target parameter along with a high-dimensional nuisance parameter. A particular focus of the study is the critical dimension condition. Full-dimensional estimation requires in general the condition n ≫ 𝕡 between the effective parameter dimension 𝕡 and the effective sample size corresponding to the smallest eigenvalue of the Fisher information matrix . Inference on the estimated parameter is even more demanding: the condition n ≫ 𝕡^2 cannot be generally avoided; see Spokoiny (2025b). However, for the sup-norm estimation, the critical dimension condition can be reduced to n ≥ C log(p).

2024

	Speaker	Title	Abstract
18.01.2024	Marina Sheshukova (HSE University) and Arthur Goldman (HSE University)	Part l: Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements Part ll: Reflection Coupling for Wasserstein geometric convergence	At the seminar, we will discuss the general formulation of the problem of variational inequalities. We will analyze two algorithms Stochastic Gradient Descent Ascent and Double Step-size Stochastic Extra Gradient as well as their theoretical properties. In Part ll, we will look at how it is possible to prove the convergence of measures according to the Waserstein metric through the construction of caplings.
25.01.2024	Alexander Beznosichov (MIPT)	On Distributed Optimization Problems under Similarity: Optimal Algorithms and Fast Extensions	In this presentation, we consider distributed methods for solving optimization problems, namely variational inequalities, which are interesting both in themselves and because they include both classical minimization problems and minimax problems. In the distributed formulation, the target operator of a variational inequality is divided into parts, and each of these parts can be accessed only by a local agent/worker. We deal with the case where the local operators are "similar" to each other in some sense. Due to the "similarity" it is possible to achieve a significant acceleration of the theoretical guarantees of convergence of methods in terms of estimates on communication complexity. Besides the issue of convergence of algorithms and obtaining upper bounds, we touch upon lower complexity bounds and verify the optimality of the proposed methods. In the remaining time, we try to discuss the question of how we can "break through" the lower estimates and construct an even faster method, in particular, we additionally introduce the possibility of compressing the transmitted information, modify the proposed algorithms and obtain upper and lower bounds in a new formulation.
15.02.2024	Ilya Levin (HSE University)	Central Limit Theorems for Stochastic Approximation	-
22.02.2024	Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)	Gaussian Variational Inference	Variational inference (VI) considers the problem of approximating a given measure (posterior) by a member of a given parametric family by minimizing the related KL-divergence. Gaussian VI focuses on the Gaussian approximation. We revisit the recent results by Katsevich and Rigollet (2023) and demonstrate how this problem can be solved using some advanced tools from optimization (the basic lemma).
29.02.2024	Yuri Svirshchevsky (HSE University)	Monte Carlo Variational Auto-Encoders	Variational autoencoders are generative models that are trained by maximizing some lower estimate of the validity of the observed data (evidence lower bound or ELBO). The quality of the trained model depends on the accuracy of the assessment. The seminar will explore ways to obtain two ELBO based on Annealed Importance Sampling and Sequential Importance Sampling, their differentiation and application in VAE.
07.03.2024	Nikita Puchkin (HSE University)	Dimension-free Structured Covariance Estimation	Given a sample of i.i.d. high-dimensional centered random vectors, we consider a problem of estimation of their covariance matrix $\Sigma$ with an additional assumption that $\Sigma$ can be represented as a sum of a few Kronecker products of smaller matrices. Under mild conditions, we derive the first non-asymptotic dimension-free high-probability bound on the Frobenius distance between $\Sigma$ and a widely used penalized permuted least squares estimate. Because of the hidden structure, the established rate of convergence is faster than in the standard covariance estimation problem.
14.03.2024	Alexander Osinsky (Skoltech, IVM RAS)	Existence and construction of column and cross approximations of matrices close to optimal: probabilistic estimates	-
28.03.2024	Zhuo-Song Zhang (Southern University of Science and Technology)	Berry-Esseen bounds for generalized U-statistics	We establish optimal Berry-Esseen bounds for the generalized U- statistics. The proof is based on a new Berry-Esseen theorem for exchangeable pair approach by Stein’s method under a general linearity condition setting. As applications, an optimal convergence rate of the normal approximation for subgraph counts in Erdös-Rényi graphs and graphon-random graph is obtained.
04.04.2024	Vadim Popov (Huawei)	Advantages of additional training of diffusion probabilistic models using the Huber loss function	-
16.05.2024	Andrey Kupavsky (MIPT)	Concentration inequalities for finding multi-colored combinations	-
23.05.2024	Konstantin Yakovlev (MIPT)	Optimal score estimation via empirical Bayes smoothing	The problem of estimating the score function of an unknown probability distribution from n independent and identically distributed observations in d dimensions is studied. Assuming that the true density is sub gaussian and has a Lipschitz-continuous score function, the optimal rate for the estimation problem under the loss function commonly used in the score matching literature was established. This reveals the curse of dimensionality where sample complexity for accurate score estimation grows exponentially with the dimension. Leveraging key insights in empirical Bayes theory as well as a new convergence rate of smoothed empirical distribution in Hellinger distance, it was shown that a regularized score estimator based on a Gaussian kernel attains this rate, shown optimal by a matching minimax lower bound. The implications of these findings for the sample complexity of score-based generative models are discussed.
06.06.2024	Alexander Korotin (Skoltech), Nikita Gushchin (Skoltech), Sergey Kholkin (Skoltech)	Building light Schrödinger bridges	Schrödinger Bridges (SB) have recently gained attention of the ML community as a promising extension of classic diffusion models, which are also interconnected to the Entropic Optimal Transport (EOT). Despite the recent advances in the field of computational Schrödinger Bridges (SB), most existing SB solvers are still heavy-weighted and require complex optimization of several neural networks. We address this issue and propose two novel light solvers for this problem: LightSB and LightSBM. Both utilize the optimal structure of Schrödinger Bridges but in different ways. The LightSB solver allows direct minimized KL-divergence with the ground-truth solution, knowing only start and end marginals. The LightSBM solver is based on bridge matching and introduces the new concept of "optimal projection," allowing the Schrödinger Bridge to be solved in one bridge-matching iteration.
13.06.2024	Maria Rozanova (HSE University)	Low-rank matrix methods for incremental learning in recommendation systems	-
02.08.2024	Evgeny Golikov (EPFL)	A generalisation bound for nearly linear networks	We derive a novel generalisation bound for neural networks that becomes nonvacuous when the activation functions get close to linearity. To the best of our knowledge, our bound is the first nonvacuous generalisation bound which is computable prior to training. Our analysis is based on a novel concept of "proxy models" who exploit the weights of trained models with activation functions removed.
15.08.2024	Nikolai Malkin (University of Edinburgh)	Amortising intractable inference with diffusion models and off-policy RL	I will present recent and ongoing work relating diffusion models, variational inference, and reinforcement learning. Efficient inference of high-dimensional variables under a diffusion model prior enables solutions to problems such as conditional generation, semantic segmentation, and combinatorial optimisation (https://arxiv.org/abs/2206.09012). A new family of amortised methods (https://arxiv.org/abs/2402.05098, https://arxiv.org/abs/2405.20971) places the problem of stochastic continuous control -- including sampling of posterior distributions under neural SDE priors -- in the theoretical framework of off-policy entropy-regularised reinforcement learning, via the formalism of continuous generative flow networks (https://arxiv.org/abs/2301.12594). This connection allows us to train diffusion models to sample a target density or energy function, i.e., to perform black-box variational inference. It further lets us fine-tune pretrained diffusion models in a data-free manner for asymptotically unbiased posterior sampling, with applications to vision (class-conditional generation, text-to-image generation), language (constrained generation in diffusion language models), and control/planning (KL-constrained policy extraction with a diffusion behaviour policy). These approaches are applicable in Bayesian machine learning and various scientific domains.
05.09.2024	Paul Mangold (École Polytechnique)	Taming Heterogeneity in Federated Linear Stochastic Approximation	In federated learning, multiple agents collaboratively train a machine learning model without exchanging local data. To achieve this, each agent locally updates a global model, and the updated models are periodically aggregated. In this talk, I will focus on federated linear stochastic approximation (FedLSA), with a strong focus on agents heterogeneity. I will derive upper bounds on the sample and communication complexity of FedLSA, and present a new method to reduce communication cost using control variates. Particular attention will be put on the "linear speed-up" phenomenon, showing that the sample complexity scales with the inverse of the number of agents in both methods.
12.09.2024	Evgenу Stepanov (POMI RAS, Universita di Pisa, HSE University)	On some spectral methods of searching for hidden geometric structures in data	-
19.09.2024	Ekaterina Grishina (HSE University) and Alexandra Ivanova (HSE University, AIRI)	Part I: Accurate and effective estimation of the spectral norm of convolutional layers of neural networks Part II: Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing	-
26.09.2024	Hoi To Wai (Chinese University of Hong Kong, CUHK)	Stochastic Approximation Algorithms with Decision-Dependent Data: The Case of Performative Prediction	Stochastic approximation (SA) is the working horse behind many online algorithms for decision making under uncertainty. Especially, it has sparked renewed interests in recent years in the dynamical environment setting where the streaming data is not i.i.d., but is correlated and/or decision dependent. This reignited interests are due to its use in many contemporary applications, e.g., reinforcement learning, performative prediction, fine-tuning of large language models (LLMs). This talk will concentrate on the SA algorithms applied on stochastic optimization problems with decision-dependent distribution(s), also known as the performative prediction problem(s). We will first motivate the problem using an example of strategic classification and demonstrate that a natural implementation of the "stochastic gradient" algorithm with "greedy deployment" leads to an SA scheme with non-stochastic-gradient updates. Then, we present recent results on the convergence of such algorithm under the convex and non-convex settings, as well as stateful and non-stateful agent environments. Finally, we demonstrate a few interesting extensions of the model towards the case of multi-agent learning, non-cooperative network games, fine-tuning of LLMs, etc.
03.10.2024	Vladimir Spokoiny (HSE University, WIAS Berlin)	Inference for nonlinear inverse problems	New approach to statistical inference is introduced and illustrated by the examples of nonlinear regression, error-in-operator problem and nonlinear elliptic PDE.
10.10.2024	Anna Markovich (HSE University)	Real-time detection of time series irregularities based on prediction using expert strategies	-
17.10.2024	Denis Rakitin (HSE University)	Regularized Distribution Matching Distillation for One-step Unpaired Image-to-image Translation	-
07.11.2024	Evgeny Zheng (CNRS, Université Paris-Saclay)	Small Total-Cost Constraints in Contextual Bandits with Knapsacks	I will talk about some recent developments in the literature of contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The goal is to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. In this setting, total cost constraints had so far to be at least of order T^{3/4} where T is the number of rounds, and were even typically assumed to depend linearly on T. Elaborating on the main technical challenge and drawback of the previous approaches, I will present a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of T^{1/2} up to poly-logarithmic terms. This strategy is direct, and it relies on a careful, adaptive, tuning of the step size. The approach is inspired by a parameter-free-type algorithms arising from convex (online) optimization literature, which I also briefly review. The talk is based on joint works with C. Giraud, Z. Li, and G. Stoltz.
14.11.2024	Askar Tsyganov (HSE University) and Artemy Rubtsov (HSE University)	Part I: synchronous Stochastic Approximation and Q-Learning Part lI: Optimal Query Complexities for Dynamic Trace Estimation	-
28.11.2024	Yazid Janati (École Polytechnique)	Variational Diffusion Posterior Sampling with Midpoint Guidance	Diffusion models have recently shown considerable potential in solving Bayesian inverse problems when used as priors. However, sampling from the resulting denoising posterior distributions remains a challenge as it involves intractable terms. To tackle this issue, state-of-the-art approaches formulate the problem as that of sampling from a surrogate diffusion model targeting the posterior and decompose its scores into two terms: the prior score and an intractable guidance term. While the former is replaced by the pre-trained score of the considered diffusion model, the guidance term has to be estimated. In this paper, we propose a novel approach that utilises a decomposition of the transitions which, in contrast to previous methods, allows a trade-off between the complexity of the intractable guidance term and that of the prior transitions. We also show how the proposed algorithm can be extended to handle the sampling of arbitrary unnormalised densities. We validate the proposed approach through extensive experiments on linear and nonlinear inverse problems, including challenging cases with latent diffusion models as priors

Fall 2023

Date	Speaker	Title	Abstract
14.09.2023	Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)	Deviation bounds for the norm of a random vector under exponential moment conditions with applications	Hanson-Wright inequality provides a powerful tool for bounding the norm \\| \xi \\| of a centered stochastic vector $ \xi $ with sub-gaussian behavior. This paper extends the bounds to the case when $ \xi $ only has bounded exponential moments of the form $ \log E \exp \langle V^{-1} \xi,u \rangle \leq \\| u \\|^{2}/2 $, where $ V^{2} \geq \mathrm{Var}(\xi) $ and $ \\| u \\| \leq g $ for some fixed $ g $. For a linear mapping $ Q $, we present an upper quantile function $ z_{c}(B,x) $ ensuring $ P(\\| Q \xi \\| > z_{c}(B,x)) \leq 3 e^{-x} $ with $ B = Q \, V^{2} Q^{T} $. The obtained results exhibit a phase transition effect: with a value $ x_{c} $ depending on $ g $ and $ B $, for $ x \leq x_{c} $, the function $ z_{c}(B,x) $ replicates the case of a Gaussian vector $ \xi $, that is, $ z_{c}^{2}(B,x) = \tr(B) + 2 \sqrt{x \tr(B^{2})} + 2 x \\| B \\| $. For $ x > x_{c} $, the function $ z_{c}(B,x) $ grows linearly in $ x $. The results are specified to the case of Bernoulli vector sums and to covariance estimation in Frobenius norm.
21.09.2023	Arthur Goldman (HSE University)	RL with friends: how many agents does it take to solve an environment?	As part of the report, students will get acquainted with the main definitions and results of the field related to the study multi-agent reinforcement learning и mean-field games.
05.10.2023	Nikita Puchkin (HSE University)	Sharper dimension-free bounds on the Frobenius distance between sample covariance and its expectation	We consider the problem of covariance estimation from a sample of high-dimensional random vectors. We derive dimension-free bounds on the squared Frobenius distance between sample covariance and its expectation, which significantly improve over the existing results.
12.10.2023	Fedor Noskov (HSE University, MIPT, Skoltech)	Analysis of Boolean functions	At the seminar, we will discuss general techniques of Boolean analysis, as well as the consequences that these techniques lead to in learning theory, approximation theory, and combinatorics. The main subject of our discussion will be the Fourier analysis of Boolean functions. Also, we will touch a little on the phenomenon of hypercontractivity and its various applications.
19.10.2023	Denis Rakitin (HSE University)	Generative models based on ODE/DE	-
09.11.2023	Evgeny Frolov (HSE University, AIRI)	Learning from sequences using tensor decompositions	The report is devoted to the application of tensor decompositions and effective computational methods of linear algebra to create an analog of the self-attention model on sequences of user actions. This approach can be applied both in the standard task of predicting the next user actions, and in a wider range of tasks related to the analysis of incomplete multidimensional time series.
14.12.2023	Alexander Tarakanov (HSE University, Odnoklassniki).	Application of machine learning methods for computing eigenfunctions of differential operators	-

Spring 2023

Date	Speaker	Title	Abstract
17.01.2023	Alexandra Senderovich (HSE University)	Norm and trace estimation with random rank-one vectors (paper overview)	To estimate the norm of an arbitrary matrix or the trace of a symmetric nonnegatively defined matrix, several multiplications of the matrix by random vectors are often enough. Using rank 1 vectors, i.e. Kronecker products of smaller random vectors, it is possible to speed up the calculation of such an estimate by speeding up sampling and matrix multiplication. The article shows that with random vectors of rank 1 sampled from a normal distribution or a Rademacher distribution, the probabilities of getting good estimates for the norm deteriorate slightly compared to using ordinary vectors from the same distributions. The authors also conducted numerical experiments confirming the derived probabilities and showing the advantages of the approach with rank 1 vectors.
31.01.2023	Dmitry Lishudi (HSE University)	"How benign is benign overfitting?"	On many datasets, neural networks when trained with SGD achieve almost zero error on training data and good generalizability on test data, even when the training partitioning is noisy. This effect is called benign overfitting. Nevertheless, such models are vulnerable to adversarial attacks. The first reason for this vulnerability is poor sampling. Thus, theoretical and empirical analysis shows that noise in data partitioning is one of the causes of adversarial vulnerability and robust models fail to achieve zero training error. However, removing incorrect labels does not achieve resilience to adversarial attacks. The paper suggests that it is also a matter of suboptimal representation learning. Using a simple theoretical statement, it is shown that the choice of representations can significantly affect adversarial stability.
7.02.2023 - 12.02.2023		Сonference and Autumn School Fall into ML	https://cs.hse.ru/iai/hdilab/mml2023/
14.02.2023	Nikita Puchkin (HSE University)	Exploring Local Norms in Exp-concave Statistical Learning	We consider the standard problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O ( d/n + 1/n log ( 1/ \delta ) )$ excess risk bound valid for a wide class of bounded exp-concave losses, where d is the dimension of the convex reference set, n is the sample size, and $\delta$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.
21.02.2023	Denis Belomestny (HSE University, Duisburg Essen University)	A Reproducing Kernel Hilbert Space approach to singular local stochastic volatility McKean-Vlasov models	Motivated by the challenges related to the calibration of financial models, we consider the problem of numerically solving a singular McKean-Vlasov equation This equation can be considered as a singular local stochastic volatility model. Whilst such models are quite popular among practitioners, unfortunately, its well-posedness has not been fully understood yet and, in general, is possibly not guaranteed at all. We develop a novel regularization approach based on the reproducing kernel Hilbert space (RKHS) technique and show that the regularized model is well-posed. Furthermore, we prove propagation of chaos. We demonstrate numerically that a thus regularized model is able to perfectly replicate option prices due to typical local volatility models. Our results are also applicable to more general McKean--Vlasov equations.
28.02.2023	Sergey Samsonov (HSE University)	Finite-time high-probability bounds for linear stochastic approximation	We discuss a finite-time analysis of linear stochastic approximation (LSA) algorithms, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a d-dimensional linear system, where both the system matrix and the right-hand side can only be estimated through (asymptotically) unbiased observations. We consider here the case where the observations are driven either by sequence of i.i.d. noise variables or by uniformly geometrically ergodic Markov chain. We discuss the p-moments inequality and high probability bounds for the LSA iterates and its Polyak-Ruppert averaged version focusing on tight dependence upon the mixing time of the chain. We then prove finite-time instance-dependent bounds on the Polyak-Ruppert averaged sequence of iterates. These results are sharp in the sense that the leading term we obtain matches the local asymptotic minimax limit. We also cover the applications to the temporal-difference methods in RL
07.03.2023	Maxim Vasiliev (HSE University)	Approximation and sampling of multivariate probability distributions in the tensor train decomposition (papers overview)	-
14.03.2023	Fedor Noskov (HSE University, MIPT, Skoltech)	Optimal Estimation in Mixed-Membership Stochastic Block Model	Community detection is one of the central problems in modern network science. It has numerous applications in the analysis of social and biological networks, designing network protocols and many other areas. Recently, much attention has been paid to the detection of overlapping communities, where each node in a network may belong to multiple groups. We consider the parameter estimation problem in Mixed Membership Stochastic Block Model (MMSB), which is a quite general instance of a random graph model allowing for overlapping community structure. We discuss different approaches to parameter estimation in this model, their theoretical properties and practical performance. Finally, we propose an approach to improve over existing estimates that allows obtaining minimax optimal rates of estimation for all the model parameters.
23.03.2023	Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)	Inference in Error-in-Operator model	We consider the problem of recovering the source signal $ x $ from the noise observation $ Y $ given by the equation $ Y = A x + \epsilon $ in the situation when the operator $ A $ is not precisely known. Instead, a pilot estimate $ \hat{A} \} is available. New approach is suggested which allows to obtain finite sample nearly sharp accuracy guarantees and incorporate the smoothness of the signal \( x $ and the operator $ A $ in a rather general form.
28.03.2023	Evgeny Lagutin (HSE University)	Stable Rank Normalization for Improved Generalization in Neural Networks and GANs (paper overview)	Exciting new work on generalization bounds for neural networks (NN) given by Bartlett et al. (2017); Neyshabur et al. (2018) closely depend on two parameter- dependent quantities: the Lipschitz constant upper bound and the stable rank (a softer version of rank). Even though these bounds typically have minimal practical utility, they facilitate questions on whether controlling such quantities together could improve the generalization behaviour of NNs in practice. Spectral Normalization (Miyato et al., 2018) is a technique making it easy to restrict the Lipschitz constant upper bound and thus is a popular way to stabilize training of generative adversarial networks and is to make a general discriminative neural network robust. On the other hand Spectral Rank Normalization (Sanyal et al., 2020) tackles the problem of restricting a stable rank of neural network's layers. Motivated by the generalization bound view this technique in combination with Spectral Normalization further improve the robustness properties on classification tasks and the quality on image generation tasks.
04.04.2023	Evgeny Nikishin (Mila, University of Montreal)	Deep Reinforcement Learning with Plasticity Injection	A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. We introduce plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool — if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent exhausted its plasticity and has to re-learn from scratch or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.
20.04.2023	Daniil Tyapkin (HSE University)	Fast Rates for Maximum Entropy Exploration	We consider the reinforcement learning (RL) setting, in which the agent has to act in unknown environment driven by a Markov Decision Process (MDP) with sparse or even reward free signals. In this situation, exploration becomes the main challenge. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization that was previously considered by Hazan et al. 2019 in the discounted setting. For this type of exploration, we propose an algorithm based on a game theoretic representation that has O(HSA/eps^2) sample complexity thus improving the eps- and S-dependence of Hazan, where S is a number of states, A is a number of actions, H is an episode length, and eps is a desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to the entropy-regularized MDPs, and we propose a simple modification of the UCBVI algorithm that has a sample complexity of order 1/eps ignoring dependence in S,A,H. Interestingly enough, it is the first theoretical result in RL literature establishing that the exploration problem for the regularized MDPs can be statistically strictly easier (in terms of sample complexity) than for the ordinary MDPs.
27.04.2023	Arthur Goldman (HSE University)	Theoretical guarantees for neural control variates in MCMC	In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory.
01.05.2023	Vladimir Spokoiny (HSE University, IITP RAS, Humboldt University and WIAS Berlin)	Sharp deviation bounds for quadratic forms and their use in machine learning	Under a mild assumption of linearity on the log-likelihood function, the tools of empirical processes are not necessary, the analysis of a general MLE can be reduced to a deviation bound for a quadratic form. This paper explains how the recent advances in Laplace approximation from [Spokoiny, 2022, 2023] can be used for obtaining sharp Gaussian-like finite sample bounds and for stating the prominent concentration phenomenon for the squared norm of a sub-gaussian vector. Some extensions and open problems will be discussed as well.
04.05.2023	Yana Hassan (HSE University)	Fast Rates for Maximum Entropy Exploration	Nowadays it is becoming extremely popular to tune the large pertained language model with experts comments using reinforcement learning with human feedback methodology. In my talk I would focus on a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). The analysis from the paper shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. The results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. In my talk I am going to dicsuss the unified analysis the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and illustrate the first sample complexity bound for max-entropy IRL.
11.05.2023	Alexander Podkopaev (Carnegie Mellon University)	Nonparametric two-sample and independence testing by betting	Two-sample and independence testing are two fundamental problems of statistical inference. In the former, given observations from two distributions, the goal is to test the null hypothesis that the distributions are equal against the alternative that they are not. In the latter, given observations drawn from an unknown joint distribution, the goal is to test the null that the underlying random variables are independent against the alternative that they are not. Both problems have been extensively studied in the batch setting when the sample size is specified before collecting data. The main issue is that in general composite nonparametric settings, even if the null is false, it is unknown beforehand collecting how much data will be ''enough'' to reject. Sequential testing is a complementary approach when an incoming data stream is analyzed online. In this talk, I will discuss our recent sequential two-sample and independence tests that (a) continuously monitor the data while controlling the false alarm rate, (b) are consistent, meaning that they are guaranteed to stop and reject if the null is false, and (c) provably adapt to the complexity of a problem and stop earlier on easy tasks / later on harder ones.
18.05.2023	Ulyana Vinogradova (HSE University)	Leveraging covariate adjustments at scale in online A/B testing	Various companies run randomized online experiments to estimate the “causal impact” of new adopted features on performance of metrics. Difference in means estimator is still one of the most commonly used method in many online A/B testing platforms due to its simplicity. However, this estimator often fails to detect small effects of the intervention despite the fact that the experiment contains thousands or millions of customers. A potential solution to this issue is to develop an alternative estimator, that leverages the abundance of additional data (covariates) collected during the experiments. In this talk, I will discuss recently proposed general-purpose algorithm for the estimation of causal effects with covariates to the setting of online A/B testing. Also I will talk about benefits, costs and risks of allowing experimenters to leverage more complicated estimators that use covariates to estimate causal effects.
25.05.2023	Ivan Safonov (HSE University)	Tensorization of flows: a tool for variational inference	Consider the classical problem of learning the approximation from a given class of distributions to a given distribution up to a normalizing constant. To solve this problem, you can apply flow normalization. The unimodality of the reference Gaussian distribution used in the normalizing flow may cause difficulties in studying multimodal distributions. We introduce an extension of normalizing flows, in which the Gaussian reference value is replaced by a reference distribution (tensor flow), which is constructed using a tensor network. TF combined with NF gives better results than each of them individually. In this talk, I will talk about the recently proposed tensorizing flow, how to extract a sample from it, how to study it, and how to apply it to VI together with NF. The tensor flow construction is similar to the construction described in Sergey Dolgov's articles, but it has some interesting differences that we will discuss.
08.06.2023	Ivan Peshekhonov and Alexey Arzhantsev (HSE University)	Exploring the Benefits of Riemannian Optimization on Shared Factor Tensor Manifolds	Factorization of a matrix into a product of two rectangular ones, called factors, is a classic tool in various machine learning applications. Tensor factorizations generalize this concept to more than two dimensions. In certain applications, where some of the tensor dimensions encode the same objects (e.g., knowledge base graphs with entity-relation-entity 3D tensors), it may also be beneficial for the respective factors to be shared. In this paper, we consider a well-known Tucker tensor factorization and study its properties under the shared factor constraint. We call it a shared-factor Tucker factorization (SF-Tucker). We prove that a manifold of fixed rank SF-Tucker tensors forms a smooth manifold and develop efficient algorithms for the main ingredients of Riemannian optimization: projection to a tangent plane for general-purpose loss functions and retraction. Having these tools at hand, we also implement a Riemannian optimization method with momentum and showcase the benefits of our approach on several machine learning tasks.
15.06.2023	Irina Goloborodko (HSE University)	Jacobian regularization of tensor decompositions in neural networks	Tensor decompositions can be used as a compression tool for neural networks. For this purpose the tensor of weights of a full-connected or convolutional layer is replaced in the architecture by its own tensor decomposition. The disadvantage of this approach is the instability of the compressed neural network for some types tensor decompositions. In order to improve the stability of the compressed neural network, randomized Jacobian spectral regularizer can be applied. The regularizer relies on matvec functions of the Jacobian and its transposed and can be efficiently computed using automatic differentiation tools. This results in great versatility of this method and makes it possible to apply proposed technique in different scenarios.
29.06.2023	Nikita Morozov (HSE University)	Introduction to Generative Flow Networks (GFlowNets)	-

Fall 2022

Date	Speaker	Title	Abstract
19.09.2022	Denis Belomestny (HSE University, Duisburg Essen University)	Statistics of McKean-Vlasov equations	In this talk we study the problem of semiparametric estimation for a class of McKean-Vlasov stochastic differential equations. Our aim is to estimate the drift coefficient of a MV-SDE based on observations of the corresponding particle system. We propose a semiparametric estimation procedure and derive the rates of convergence for the resulting estimator. We further prove that the obtained rates are essentially optimal in the minimax sense.
26.09.2022	Alexander Tarakanov (HSE University)	Multi-Level Unbiased MCMC Estimators for Discretized Models	ODE and PDE models routinely in different applications. Significant part of practical questions can be reduced to integration over the distribution of ODE/ODE parameters spaces. The practical way for solving such problem is MCMC integration coupled with spatial/temporal discretization of ODE/PDE model. Such discretization introduces bias to MCMC integration. In the present talk the novel scheme for Multi-Level Unbiased MCMC method is discussed. The technique concerned is provides unbiased estimator with finite variance and finite expected computational cost for a given Quantity of Interest under reasonable smoothness constraints. Application examples are provided.
03.10.2022	Yana Khassan (HSE University)	Is Q-learning Provably Efficient? (Article overview)	Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn. The theoretical question of “whether model-free algorithms can be made sample efficient” is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. The paper proves that, in an episodic MDP setting, Q-learning with UCB exploration achieves almost optimal regret. This is the first analysis in the model-free setting that establishes O(√T) regret without requiring access to a “simulator.”
10.10.2022	Dmitry Molchanov (HSE University)	Twice stochastic variational inference with semi-implicit and non-invariant distributions	Doubly stochastic variational inference (DSVI) is one of the main methods of approximate inference and learning in modern deep probabilistic models. However, this method imposes significant restrictions on the distributions used and has a limited area of applicability. The main goal of my work is to relax these limitations and expand the tools for working with deep probabilistic models. Using a variation dropout model as an example, we show how removing restrictions on variation parameters allows us to open up new modes of model operation - the rarefaction mode and the "variance network" mode. Then, we generalize DSVI to the case of models with so-called semi-implicit distributions. The proposed approach, Doubly Semi-Implicit Variational Inference, allows inference in new classes of models, refine inference in existing models, refine marginal likelihood estimation, and accelerate inference in models with mixture-like distributions. Translated with www.DeepL.com/Translator (free version)
24.10.2022	Evgenii Nikishin (Mila, Université de Montréal)	The Primacy Bias in Deep Reinforcement Learning	This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Because of training on progressively growing datasets, deep RL agents incur a risk of overfitting to earlier experiences, negatively affecting the rest of the learning process. Inspired by cognitive science, we refer to this effect as the primacy bias. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generally-applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent. We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance.
31.10.2022	Valeriia Shcherbakova (HSE University)	A Contrastive Approach to Online Change Point Detection	We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
1.11.2022-3.11.2022	Autumn School	Fall into ML 2022	https://cs.hse.ru/ml2022
07.11.2022	Artur Goldman (HSE University)	Theoretical guarantees for approximate sampling from smooth and log-concave densities	Sampling from various kinds of distributions is an issue of paramount importance in statistics. In many situations, the exact sampling from a given distribution is impossible or computationally expensive and, therefore, one needs to resort to approximate sampling strategies. However, there is no well-developed theory providing meaningful nonasymptotic guarantees for the approximate sampling procedures, especially in the high-dimensional problems. This paper makes some progress in this direction by considering the problem of sampling from a distribution having a smooth and log-concave density. Authors establish nonasymptotic bounds for the error of approximating the target distribution by the one obtained by the Langevin Monte Carlo method and its variants, and illustrate the effectiveness of the established guarantees with various experiments.
14.11.2022	Daria Demidova (HSE University)	Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices	This paper studies the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution V = e-f on ℝn. Authors prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming that V satisfies a log-Sobolev inequality and the Hessian of f is bounded. Notably, convexity or bounds on higher derivatives are not assumed. Authors also prove convergence guarantees in Rényi divergence of order q > 1 assuming the limit of ULA satisfies either the log-Sobolev or Poincaré inequality. Authors also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of f, without requiring isoperimetry.
21.11.2022 and 28.11.2022	Anton Zubekhin (HSE University)	Diffusion Probabilistic Models explosion	Diffusion probabilistic models are state-of-the-art method for image generation. More than that, diffusion models also perform good working with data of different modalities, such as speech and text. Text-to-image diffusions are the most well-known types of such models with Dalle-2, Imagen and Stable Diffusion being famous even outside of ML community. Recent advances in machine learning show even wider applicability of diffusion models - they can be used for video generation, 3D reconstructions, meta-learning and even direct optimization of other neural networks.
05.12.2022	Varvara Rudenko (HSE University)	Log-concave sampling: Metropolis-Hastings algorithms are fast	The authors study the problem of sampling from a strongly log-concave density supported on R^d, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), their bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. And also demonstrate the gains of a modified version of MALA over ULA for weakly log-concave densities. Furthermore, they derive mixing time bounds for the Metropolized random walk (MRW) and obtain O(κ) mixing time slower than MALA. They provide numerical examples that support their theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin- type sampling algorithms.
12.12.2022	Marina Sheshukova (HSE University)	Exponential Concentration Inequalities for Additive Functionals of Markov Chains	Using the renewal approach we prove exponential inequalities for additive functionals and empirical processes of ergodic Markov chains, thus obtaining counterparts of inequalities for sums of independent random variables. The inequalities do not require functions of the chain to be bounded and moreover all the involved constants are given by explicit formulas whenever the usual drift condition holds, which may be of interest in practical applications e.g. to MCMC algorithms.
19.12.2022	Timofey Grinev (HSE University)	Sharper Bounds for Uniformly Stable Algorithms	Deriving generalization bounds for stable algorithms is a classical question in learning theory. This paper is devoted to two questions: firstly it provides a short proof of the moment bound that implies the generalization bound stronger than recent results. Secondly, it proves general lower bounds, showing that this moment bound is sharp (up to a logarithmic factor) unless some additional properties of the corresponding random variables are used. The main probabilistic result is a general concentration inequality for weakly correlated random variables, which may be of independent interest.

Fall 2021
Date	Speaker	Title	Abstract
13.01.2021	Vladimir Spokoiny (WIAS, IITP RAS, HSE University)	Bayesian inference in machine learning	Motivated by examples from nonparametric binary classification, ranking problems, analysis of random graphs, we consider the problem of statistical inference for binary data within a nonparametric Bernoulli model. In many situations the underlying Bernoulli parameter may approach the edges zero or one that leads to inference for non-regular models. Properties like asymptotic normality of the MLE or of the posterior distribution are not valid in such irregular cases. The approach of this paper suggests to impose a Gaussian prior on the parameter of logistic regression function, or, analogously, to consider a penalized MLE of the canonical link with a quadratic penalty. A Gaussian prior can be viewed as a kind of a barrier preventing the canonical parameter to approach infinity and forcing it to stay within the region of regularity. Our main results state finite sample analogs of the classical statistical results like asymptotic normality of the posterior distribution or asymptotic normality of the penalized MLE for any configuration of Bernoulli parameters under the assumption that the prior is strong enough in direction. We also demonstrate how the results can be applied to ranking problems and nonparametric binary classification.
09.02.2021	Nikita Puchkin (HSE University, IITP)	Rates of convergence for density estimation with GANs	We undertake a precise study of the non-asymptotic properties of vanilla generative adversarial networks (GANs) and derive theoretical guarantees in the problem of estimating an unknown $d$-dimensional density $p$ under a proper choice of the class of generators and discriminators. We prove that the resulting density estimate converges to $p$ in terms of Jensen-Shannon (JS) divergence at the rate $n^{-2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $p$. This is the first result in the literature on density estimation using vanilla GANs with JS rates faster than $n^{-1/2}$ in the regime $\beta>d/2$.
16.02.2021	Sergey Samsonov (HSE University)	On the Stability of Random Matrix Product with Markovian Noise: Application to Linear Stochastic Approximation and TD Learning	This paper studies the exponential stability of random matrix products driven by a general (possibly unbounded) state space Markov chain. It is a cornerstone in the analysis of stochastic algorithms in machine learning (e.g. for parameter tracking in online-learning or reinforcement learning). The existing results impose strong conditions such as uniform boundedness of the matrix-valued functions and uniform ergodicity of the Markov chains. Our main contribution is an exponential stability result for the p-th moment of random matrix product, provided that (i) the underlying Markov chain satisfies a super-Lyapunov drift condition, (ii) the growth of the matrix-valued functions is controlled by an appropriately defined function (related to the drift condition). Using this result, we give finite-time p-th moment bounds for constant and decreasing stepsize linear stochastic approximation schemes with Markovian noise on general state space. We illustrate these findings for linear value-function estimation in reinforcement learning. We provide finite-time p-th moment bound for various members of temporal difference (TD) family of algorithms.
02.03.2021	Dmitry Yarotsky (Skoltech)	Approximation with deep neural networks	In recent years, there has been quite a number of new results on approximation properties of deep neural networks. This emergent theory is rather different from the classical approximation theory of linear or shallow models (such as splines or Fourier expansion). In this talk, I will try to highlight some interesting ideas of this theory, and will also briefly indicate its connections to other topics such as information theory, dynamical systems, fewnomials and VC dimension.
09.03.2021	Nikita Puchkin (HSE, IITP)	Exponential Savings in Agnostic Active Learning through Abstention	We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss 1/2 of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under the model misspecification.
16.03.2021	Nikita Zhivotovskiy (ETH, Zurich)	Distribution-Free Robust Linear Regression	We consider the problem of random-design linear regression, in a distribution-free setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability. Joint work with Jaouad Mourtada (CREST-ENSAE Paris ) and Tomas Vaškevičius (Oxford).
23.03.2021	Maxim Panov (Skoltech)	Stochastic normalizing flows	Normalizing flows is one of the central concepts in modern Bayesian machine learning. They allow constructing approximations of complex distributions starting from simple ones (e.g., Gaussian) and transforming it by a series of deterministic mappings. Normalizing flows have found numerous applications being especially useful in variational approach to Bayesian inferences. In this talk, we will show that classical deterministic normalizing flows have severe limitations which can be overcome by adding stochasticity to the procedure. It can significantly improve the power of the method but requires much more complicated training procedures. We will discuss how one can construct variational learning procedures based on the coupling of normalizing flows with the Metropolis-Hastings accept/reject algorithm. We will also see numerous connections with other classical sampling approaches such as Annealed Importance Sampling (AIS) and Sequential Importance Sampling (SIS). We will illustrate the theory with applications to sampling and the construction of variational autoencoders.(Joint work with Eric Moulines and Achille Thin (Ecole Polytechnique), Alain Durmus (ENS Paris-Saclay), Arnaud Doucet (Uni Oxford) and Nikita Kotelevskii (Skoltech)).
30.03.2021	Dmytro Perekrestenko (ETH Zurich)	Constructive Universal High-Dimensional Distribution Generation through Deep ReLU	We present an explicit deep neural network construction that transforms uniformly distributed one-dimensional noise into an arbitrarily close approximation of any two-dimensional Lipschitz-continuous target distribution. The key ingredient of our design is a generalization of the “space-filling” property of sawtooth functions discovered in (Bailey & Telgarsky, 2018). We elicit the importance of depth—in our neural network construction—in driving the Wasserstein distance between the target distribution and the approximation realized by the network to zero. We also will outline an extension of our method to output distributions of arbitrary dimension. Finally, we show that the proposed construction does not incur a cost—in terms of error measured in Wasserstein-distance—relative to generating d-dimensional target distributions from d independent random variables.
06.04.2021	Daniil Tiapkin (HSE University)	Improved Complexity Bounds in the Wasserstein Barycenter Problem	In this paper, we focus on computational aspects of the Wasserstein barycenter problem. We provide two algorithms to compute Wasserstein barycenter of m discrete measures of size n with accuracy $\e$. The first algorithm, based on mirror prox with some specific norm, meets the complexity of celebrated accelerated iterative Bregman projections (IBP), that is $\widetilde O(mn^2\sqrt n/\e)$, however, with no limitations unlike (accelerated) IBP, that is numerically unstable when regularization parameter is small. The second algorithm, based on area-convexity and dual extrapolation, improves the previously best-known convergence rates for Wasserstein barycenter problem enjoying $\widetilde O(mn^2/\e)$ complexity.
13.04.2021	Alexey Kroshnin (IITP RAS, HSE University)	Robust k-means clustering in metric spaces	In this talk we consider robust median of means based algorithms for the k-means clustering problem in a metric space. The main results are non-asymptotic excess distortion bounds that hold under the two bounded moments assumption in a general separable metric space. In the case of Hilbert space our bounds have the sub-Gaussian form depending on the probability mass of the lightest cluster of an optimal quantizer. In a special case of clustering in R^d we prove matching (up to constant factors) non-asymptotic upper and lower bounds on the excess distortion, which depend on the probability mass of the lightest cluster and on the variance. The talk is based on a joint work with Yegor Klochkov and Nikita Zhivotovskiy
20.04.2021	Quentin Paris (HSE University)	Online learning with exponential weights in metric spaces	This paper addresses the problem of online learning in metric spaces using exponential weights. We extend the analysis of the exponentially weighted average forecaster, traditionally studied in a Euclidean settings, to a more abstract framework. Our results rely on the notion of barycenters, a suitable version of Jensen's inequality and a synthetic notion of lower curvature bound in metric spaces known as the measure contraction property. We also adapt the online-to-batch conversion principle to apply our results to a statistical learning framework.
27.04.2021	Иван Оселедец (Сколтех)	Геометрия в моделях машинного обучения	Об использовании геометрического подхода к моделям машинного обучения, а также приведу обзор наших результатов последних лет, включая: а) применение гиперболических пространств в машинном обучении и создание рекомендательных моделей; б) построение интерпретируемых направлений в латентном пространстве генеративных сетей.
12.05.2021-15.05.2021	Conference	New Frontiers in High-dimensional Probability and Applications to Machine Learning	https://cs.hse.ru/iai/hdilab/hdpa2021
18.05.2021	Egor Klochkov (Cambridge University)	Fast rates for strongly convex optimization via stability	The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondrák, 2018, 2019), (Bousquet, Klochkov, Zhivotovskiy, 2020) contain a generally inevitable sampling error term of order 1/sqrt(n). When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the so-called Bernstein condition is satisfied, the term O(1/sqrt(n)) can be avoided, and high probability excess risk bounds of order up to O(1/n) are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate O(log n / n) for strongly convex and Lipschitz losses valid for any empirical risk minimization method. This resolves a question of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2009). We discuss how O(logn/n) high probability excess risk bounds are possible for projected gradient descent in the case of strongly convex and Lipschitz losses without the usual smoothness assumption. This is a joint work with Nikita Zhivotovskiy and Olivier Bousquet.
01.06.2021	Sergey Samsonov (HSE University)	UVIP: Model-Free Approach to Evaluate Reinforcement Learning Algorithms	Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL). Yet even a precise knowledge of the value function V^{\pi} corresponding to a policy \pi does not provide reliable information on how far is the policy \pi from the optimal one. We present a novel model-free upper value iteration procedure (UVIP) that allows us to estimate the suboptimality gap V(x) - V^{\pi}(x) from above and to construct confidence intervals for V. Our approach relies on upper bounds to the solution of the Bellman optimality equation via martingale approach. We provide theoretical guarantees for UVIP under general assumptions and illustrate its performance on a number of benchmark RL problems. The talk is based on our recent work with Denis Belomestny, Ilya Levin, Eric Moulines, Alexey Naumov and Veronika Zorina.
22.06.2021	Dmitrii Ostrovskii (University of Southern California)	Near-Optimal Model Discrimination with Non-Disclosure	In the standard setup of parametric M-estimation with convex loss, we consider two population risk minimizers associated with two possible distributions of a random observation, and then ask the following question: Given the value of one of the two minimizers and access to i.i.d. sampling from both distributions, what sample sizes allow to identify the “right” distribution, i.e., the one corresponding to the given value?Making the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide nearly matching upper and lower bounds on the sample complexity, showing it to be min{1/Δ^2,\sqrt{r}/Δ} up to a constant factor, where Δ is a measure of separation between the two distributions and r is the maximum (resp., minimum) of the ranks of the design covariance matrices in the case of the upper (resp., lower) bound. This bound is dimension-independent and rank-independent for large enough separation. We then extend this result in two directions: (i) for the general parametric setup in asymptotic regime; (ii) for generalized linear models in the small-sample regime n≤r and under weak moment assumptions. In both cases, we derive sample complexity bounds of a similar form, even under misspecification. Our testing procedures only access the known'' value through a certain functional of empirical risk. In addition, the number of observations that allows to reach statistical confidence for testing does not allow to "resolve" the two models -- that is, recover both minimizers up to O(Δ) prediction accuracy. These two properties open the prospect of applying our testing framework in practical tasks, where one would like to identify a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually inferred by the identifying agent. The talk will be following the recent paper arxiv:2012.02901.
12.10.2021	Evgeny Spodarev (Ulm University, Germany)	Spectral classification of fullerenes	Joint work with A. Bille (Skoltech, Moscow/Ulm University), V. Buchstaber (Steklov Mathematical Institute, Moscow) and S. Coste (INRIA, Paris). Fullerenes $C_n$ are 3-simple convex compact polytopes with $n$ vertices, $n\ge 20$ even, $n\neq 22$, 12 pentagons and $n/2-10$ hexagons as facets. They are mathematical models of famous carbon molecules which find wide applications in modern nanotechnologies. For the synthesis of $C_60$, R. Curl, H. Kroto and R. Smalley got a Nobel prize in chemistry (1996). Since the number of distinct fullerene isomers (which are combinatorially non—equivalent versions of $C_n$) grows as $O(n^9)$ for large $n$, their enumeration and classification is a non-trivial mathematical problem. We propose a solution to this problem using the Newton polynomials of the spectra of their dual graphs of hexagons. It appears that the degree of such polynomials has a logarithmic growth order with respect to $n$. If the time allows, the spectral linear ordering of characters of $C_n$ for different $n$ as well as the distribution of a random eigenvalue of Laplace-type matrices for the graphene, regular triangulation of the plane and infinite nanotubes with chiral vector $(p,q)$ will be touched upon.
23.11.2021-26.11.2021	Dmitry Vetrov (HSE University); Alexey Naumov (HSE University); Denis Derkach (HSE University); Mikhail Lazarev (HSE University); Nikita Kazeev (HSE University); Sergey Shirobokov (LCL, Twitter); Maxim Panov (Scoltech); Mikhail Burtsev (MIPT)	Third HSE-Yandex autumn school on generative models	https://indico.cern.ch/event/1082512/

Spring 2020

Date	Speaker	Title	Abstract
3.03.2020	Nazar Buzun (Skoltech)	Gaussian approximation of the maximum of the sum of random vectors.	In this paper we consider a Gaussian approximation for the maximum sum of independent random vectors in high dimensionality. Our approach is based on approximation by probability. In this case, the random variables depend on each other. From approximation by probability, approximation by distribution follows directly. We obtained a theoretical estimate of the rate of convergence to the maximum normal vector, which is a significant improvement over those obtained earlier.
6.03.2020	Alexandra Suvorikova (WIAS Berlin)	Shape-based domain adaptation via optimal transportation	Domain adaptation problem aims at learning a well performing model, trained on a source data S (images, vectors, e.t.c), applied then to different (but related) target sample T. Aside from being attractive due to obvious practical utility, the setting is challenging from theoretical point of view. In this work we introduce a novel approach to supervised domain adaptation consisting in a class-dependent fitting based on ideas from optimal transportation (OT) theory which considers S and T as two mixtures of distributions. A parametrized OT distance is used as a fidelity measure between S and T, providing a toolbox for modelling of possibly independent perturbations of mixture components. The method is than used for describing the adaptation of immune system in humans after moving to another climatic zone.
11.03.2020	Sergey Bobkov (University of Minnesota, HSE University)	Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem	We plan to discuss the so-called concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered
18.03.2020	Sergey Bobkov (University of Minnesota, HSE University)	Gaussian concentration, spherical concentration, and their applications to randomized models in the central limit theorem	We plan to discuss the so-called concentration phenomenon of the measure and its corresponding Sobolev inequalities with respect to multidimensional Gaussian measures and uniform distributions on a unit sphere in Euclidean space. In addition to the classical results describing the increase in the degree of concentration with increasing dimensionality, new estimates for large deviations will also be presented, with improvements in the standard inequalities for smooth functionals on the sphere. As an illustration of possible applications, problems on the rate of convergence to the normal law for weighted sums of dependent quantities will be considered
24.03.2020 (Zoom)	Maxim Kaledin (HSE University)	Risk-Sensitive Reinforcement Learning a review of PhD thesis by Aviv Tamar	The main issue which keeps the applications of reinforcement learning limited is the fact that robustness, stability and durability of the learned decision policies are hard to prove. This happens due to usage of expected utility criterion which is optimized with respect to policy in all famous algorithms of RL. The criterium arised from economics and game theory where expected utility theory was historically the first well-established concept of rationality. Being very popular and well-studied in both mathematics and applications, the concept itself is not the only possible way to define rationality. In particular, there is more general class of functionals called coherent risk measures introduced in Delbaen,Artzner1998. Coherent risk measures are widely known in finance for they address exactly the robustness problem and try to quantify risk differently; despite that, in RL coherent risk measures are not so known and there are not so many papers on its usage. My talk will be devoted to the review of PhD thesis by Aviv Tamar on exploiting risk measures in RL which represents in itself a very comprehensive study on how to construct learning algorithms on popular risk measures like CVAR and semi-deviation.
3.04.2020 (Zoom)	Maxim Panov and Nikita Kotelevskiy (Skoltech)	MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference	In this contribution, we propose a new computa- tionally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained us- ing Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically in- creasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expen- sive reverse kernels. Extensive numerical experi- ments show clear computational and performance improvements over state-of-the art methods.
21.04.2020 at 16 00 (Zoom)	Nikita Zhivotovskiy (Google Research, HSE University)	Fast classification rates in online and statistical learning with abstention.	This talk is devoted to a simple extension of statistical and online classification setups where the learner is allowed to abstain from a prediction by paying a cost (say, 0.49) only marginally smaller than the average cost 0.5 of a random bit prediction. We discuss how this model implies fast rates for the excess risk and the expected regret without any additional assumptions on the data generating mechanism and the loss function. Finally, we show the relations with recent results in aggregation theory and with the analysis of strongly convex losses. Based on recent works with Olivier Bousquet and Gergely Neu arXiv:1910.12756 and arxiv:2001.10623.
23.04.2020 at 17 00 (Zoom)	Anastasia Ivanova (MIPT, HSE University)

Fall 2019

Date	Speaker	Title	Abstract
17.09.2019	Michel Ledoux (Université de Toulouse)	On optimal matching of random samples	Optimal matching problems are random variational problems widely investigated in the mathematics and physics literature, and in probability theory and statistics in the study of rates of convergence of empirical measures. Two-dimensional matching of uniform samples gave in particular rise to deep results investigated from various view points (combinatorial, generic chaining, Fourier analysis). After a review of known results, we investigate more precisely the case of Gaussian samples, first in dimension one, on the basis of explicit representations of Kantorovich metrics and a sharp analysis of more general log-concave distributions in terms of their isoperimetric profile (joint work with S. Bobkov), and then in dimension two and higher following the PDE and transportation approach recently put forward by L. Ambrosio, F. Stra and D. Trevisan.
23-27.09.2019	Workshop	Recent advances in mass transportation
01.10.2019	Alexander Goldenshluger (University of Haifa, HSE University)	Density deconvolution under general assumptions	In this paper we study the problem of density deconvolution under general assumptions on the measurement error distribution. Typically deconvolution estimators are constructed using Fourier transform techniques, and it is assumed that the characteristic function of the measurement errors does not have zeros on the real line. This assumption is rather strong and is not fulfilled in many cases of interest. In this paper we develop a methodology for constructing optimal density deconvolution estimators in the general setting that covers vanishing and non--vanishing characteristic functions of the measurement errors. We derive upper bounds on the risk of the proposed estimators and provide sufficient conditions under which zeros of the corresponding characteristic function have no effect on estimation accuracy. Moreover, we show that the derived conditions are also necessary in some specific problem instances.
15.10.2019	Sergey Samsonov (HSE University)	Variance reduction for dependent sequences with applications to Stochastic Gradient MCMC	This talk will be devoted to a novel variance reduction approach for additive functionals of general dependent sequences. Our approach combines the use of control functions with the minimisation of an empirical asymptotic variance estimate. We analyse finite sample properties of the proposed method and derive convergence rates of the excess asymptotic variance to zero. We apply our methodology to Stochastic Gradient MCMC (SGMCMC) and discuss possible extensions for non-Markovian dependent sequences (SVRG-LD, SAGA-LD). Numerical evaluation of the algorithm will also be presented. The talk is based on a joint work with D. Belomestny, L. Iosipoy, E. Moulines and A. Naumov.
22.10.2019	Quentin Paris (HSE University)	Some open questions at the "intersection" between optimal transport and statistical learning.	This talk will breafly describe the field of optimal transport and show that statistical learning can be represented as an optimal transport problem. We will investigate a few questions related to this representation and describe some potential research perspectives.
26 - 29.11.2019	Autumn School	Autumn School on Generative models
3.12.2019	Nikita Puchkin (HSE University)	Sample complexity of learning a manifold with an unknown dimension	I will discuss the problem of learning a manifold from noisy observations in the case when the dimension of the true manifold is unknown. Many manifold learning procedures require the intrinsic dimension of the data as an input parameter. However, the existing literature provides theoretical guarantees on the dimension estimation only in the case, when the data lies exactly on the manifold. We study an algorithm, which adaptively chooses the manifold’s dimension and has an optimal sample complexity in terms of the Hausdorff loss. Besides, the procedure produces a manifold, such that its curvature does not exceed the curvature of the true manifold too much.

Spring 2019
Date	Speaker	Title	Abstract
29.01.2019 at CS HSE	Ekaterina Krymova (Duisburg Essen University)	On estimation of the noise variance in high dimensional linear regression model	We consider a problem of unknown noise variance estimation in high-dimensional linear regression model. To estimate a nuisance vector of regression coefficients we use a family of spectral regularisers of the maximum likelihood estimator. Noise variance estimation is based on an adaptive normalisation of the prediction error. We derive an upper bound for concentration of the proposed method around an ideal estimator (in case of zero nuisance).
05.02.2019	Vladimir Spokoiny (WIAS, HSE University)	Bayesian model selection	We consider different approaches to select a prior from an ordered family of Gaussian process priors (GPP) for a linear regression model . The empirical and full (hierarchic) Bayesian methods are studied with their benefits and drawbacks. A new method for the choice of a hyper prior is suggested. The method is based on the so called “propagation” condition and ensures a one-sided contraction and adaptivity of the procedure.
12.02.2019	Alexander Goldenshluger (University of Haifa, HSE University)	Optimal stopping of a random sequence with unknown distribution	The subject of this talk is the problem of optimal stopping of a sequence of i.i.d. random variables. When the distribution governing the random sequence is known, the optimal solution is given by backward induction. In this talk we will focus on the case when the distribution of observations is unknown. We propose a stopping rule that is based on relative ranks and study its performance as measured by the maximal relative regret over suitable nonparametric classes of distributions. It is shown that the proposed rule is first order asymptotically optimal and nearly rate–optimal in terms of the rate at which the relative regret converges to zero. We also develop a general method for numerical solution of sequential stopping problems with no distributional information, and use it in order to implement the proposed stopping rule. (Joint work with A. Zeevi)
19.02.2019	Alexey Ustimenko (HSE University)	Direct Optimization of Ranking Quality Metrics via Stochastic Approximation	In this paper, we present a novel approach for consistent direct optimization of any ranking quality metric that significantly outperforms the state-of-the-art methods on popular learning-to-rank datasets. The framework is based on a smoothing procedure and a method of local stochastic approximation of the ranking function inspired by gradient-free methods for blackbox optimization. We also present novel unbiased and effective estimates for gradients of the smoothed ranking quality metrics that theoretically outperform likelihood ratio method while their time complexity remains the same due to combinatorial structure of the problem.
26.02.2019	Denis Belomestny (Duisburg Essen University, HSE University)	MacKeen-Vlasov stochastic differential equations: simulation and estimation	In this talk we propose a novel projection-based particle method for solving McKean--Vlasov stochastic differential equations. This approach is based on a projection-type estimation of the marginal density of the solution in each time step. The projection-based particle method leads in many situations to a significant reduction of numerical complexity compared to the widely used kernel density estimation algorithms. We derive strong convergence rates and rates of density estimation. Furthermore we give explicit solutions for a class of McKean--Vlasov equations with affine drift. The performance of the proposed algorithm is illustrated by several numerical examples.
5.03.2019	Ivan Panin (Skoltech, IITP RAS)	Least squares in noiseless setting and Sobol indices convergence for random design	We briefly review recent progress in noiseless nonparametric regression and consider its application to global sensitivity analysis and particularly for estimation of Sobol indicies. Sobol (sensitivity) indices allow to quantify respective effects of input random variables (or combinations thereof) onto variance of a physical or mathematical model response. We focus on the problem of Sobol indices estimation via metamodel approach in the case of random design of experiments; on both nonasymptotic and asymptotic behavior of Sobol indices estimates and the model selection for increasing computational budget. We also consider the question of an optimal rate of Sobol indices convergence.
12.03.2019	Dmitry Yarotsky (Skoltech, IITP RAS)	Gradient descent in neural networks: mean field theory and optimal transportation	I will describe recent results in which gradient descent in shallow neural networks is approximated by an integro-differential equation appearing in the limit of infinite network width. In particular, I will discuss the connection to the optimal transport theory and will mention several modes in which the dynamics admits an approximate analytic solution.
19.03.2019	Egor Kosov (MSU, HSE University)	Lower bounds for measures of deviation of polynomials from their expectations	In the talk we will discuss the recent results on lower estimates of small ball probabilities for random variables that are polynomials in Gaussian and general logarithmically concave random variables.
26.03.2019	Nurlan Shagadatov (HSE University)	Martingale representations for Markov Chains with application to MCMC.	In this talk I will describe recent results of joint work with D. Belomestny and E. Moulines. I consider efficient variance reduction approach for Markov Сhain Monte Carlo algorithms relying on a new discrete time martingale representations. In particular, I will consider theoretical guarantees and generic method for Unadjusted Langevin Algorithm.
09.04.2019	Evgenii Chzhen ( Laboratoire d'Analyse et de Mathématiques Appliquées)	Semi-supervised confidence set classification	An image annotation can be seen as a multi-class classification problem with a large number of classes. In this context, confusion between classes can occur, and single label classification may be misleading. We will talk about the multi-class classification with confidence sets of a controlled expected cardinal in non-parametric settings. We will propose a semi-supervised estimator which is either minimax optimal or optimal up to a log factor depending on the way we measure the risk. We will describe the regimes when there is no supervised estimator that can improve the proposed semi-supervised estimator and when the semi-supervised estimation does not help from the minimax point of view. (joint work with C. Denis and M. Hebiri)
16.04.2019, CS HSE, 205 at 17:00	Eric Moulines (Ecole Polytechnique, HSE University)	Mini course Introduction to reinforcement learning 1	The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver.
23.04.2019, CS HSE, 503 at 17:00	Eric Moulines (Ecole Polytechnique)	Mini course Introduction to reinforcement learning 1	The minicourse is an elementary introduction of reinforcement learning. We will formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method. The course is an elementary introduction covering: — An introduction to reinforcement learning — Markov decision processes — Dynamic programming — Monte Carlo methods — Temporal difference learning We will follow closely the book “Reinforcement Learning: an introduction”, Barto Sutton, MIT Press and the use the pedagogical material proposed by David Silver.
3.06.2019-7.06.2019	Nikolai Krylov (University of Minnesota)	Mini course (3 lectures)	Topics: Basics of harmonic polynomials and spherical functions Variations on a theme of Jacobians: Brouwer’s fixed point theorem, Alexandrov’s maximum principle, solvability of Laplace’s equation, etc. Schedule: 7.06, 16:00 - 18:00 14.06, 16:00 - 18:00 28.06, 16:00 - 18:00 Venue: Faculty of Mathematics HSE, Usacheva str, 5, room 108

Fall 2018
Date	Speaker	Title	Abstract
26.12.2018	Qeuntin Paris (HSE University)	Learning functional minimizers on metric spaces.	In this talk, we will discuss recent developments motivated by statistical problems that arise in optimal transport. In particular we will discuss new results concerning the estimation of so called barycenters (or Frechet means) in the Wasserstein space. Before presenting our results, the talk will review some basic material from geometry and optimal transport. The end of the talk will discuss some open problems.
26.12.2018	Sergey Samsonov (HSE University, Skoltech)	Tansportation and concentration of measure inequalities for Markov Chains	In this talk I am going to observe some concentration of measure inequalities, starting from rather classical ones by Bakri, Ledoux. Then I will cover some interesting concentration results for Markov Chains, related with transportation inequalities following papers by Djellot, Guillin, Wu 2004 and Joulin, Olivier 2010, and concentration results for quadratic forms of vectors with dependent entries following Adamczak 2015 with particular applications to Unadjusted Langevin Dynamics.
13.12.2018	Sergey Samsonov (HSE University, Skoltech)	Concentration inequalities and their applications	In this talk I am going to cover some functional inequalities for dependent sequences of random variables, especially focusing on the results for Markov chains based on the work by Djellout, Guilin and Wu. As one of the applications we will consider Unadjusted Langevin Algorithm, following paper by Durmus, Moulines, and state some concentration results for it.
16.12.2018	Maxim Kaledin (HSE University, Skoltech)	Dynamic Programming in Stochastic Optimal Control Problems.	Stochastic control problems are well-known in financial mathematics (option pricing, portfolio management) and in many related areas. These problems even in case of discrete state space and known transition density are subject to curse of dimensionality when using deterministic algorithms. Kean&Wolpin(1994) and Rust (1997) were the first who proved that this can be avoided in case of discrete Markov Decision Processes with stochastic algorithms. The modern challenge is to figure out what happens with complexity when the transition density can only be approximated (RL,ML, Finance applications). During my talk I will cover known and new complexity estimates for stochastic control problems as well as several algorithms to solve MDP.
29.11.2018	Vladimir Ulyanov (HSE University, MSU)	Non-linear expectations: notion, facts, applications.	We give a short review on recent developments on problems of probability model under uncertainty by using the notion of nonlinear expectations and, in particular, sublinear expectations. We discuss as well a new law of large numbers (LLN) and central limit theorem (CLT) and provide non-asymptotic results. Classical LLN and CLT have been widely used in probability theory, statistics, data analysis as well as in many practical situations such as financial pricing and risk management. They provide a strong and convincing way to explain why in practice normal distributions are so widely utilized. But often a serious problem is that, in general, the “i.i.d.” condition is difficult to be satisfied. In practice, for the most real-time processes and data for which the classical trials and samplings become impossible, the uncertainty of probabilities and distributions can not be neglected. In fact the abuse of normal distributions in finance and many other industrial or commercial domains has been criticized. The new CLT does not need this strong “i.i.d.” assumption. Instead of fixing a probability measure P, it is introduced an uncertain subset of probability measures {P_a : a in A } and considered the corresponding sublinear expectation as sup{a in A} E_{a}X.
22.11.2018	Nikita Zhivotovskiy (HSE University)	Covariance estimation: missing observations and heavy tails.	In this talk, I will discuss some recent results on covariance estimation together with some work in progress. I will start with the subgaussian case when some of the observations are missing: we observe N i.i.d. vectors and want to estimate the covariance matrix. The problem is that some of the components of each vector are missing, that is these components are set to be equal to zero with some probability. We show how to improve the state of the art results in this scenario. Next, we discuss how to apply dimension-free Bernstein type inequalities for vectors with heavy tails under mild moment assumptions.
14.11.2018	Eric Moulines (Ecole Polytechnique, HSE University)	Low-rank Interaction with Sparse Additive Effects Model for Large Data Frames (NIPS paper)	Many applications of machine learning involve the analysis of large data frames — matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples — with missing values. Low-rank models, as studied by Udell et al. (2016), are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row and column, or covariate effects. In this paper, we introduce a low-rank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and low-rank design, to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sub-linearly to an optimal solution and is computationally efficient for large scale data sets. We show on simulated and survey data that the method has a clear advantage over current practices.
14.11.2018	Błażej Miasojedow (University of Warsaw)	Statistical inference for Continous Time Bayesian Networks.	Continuous Time Bayesian Networks (CTBN) are a class of multivariate Markov Jump Processes (MJP), where dependence between coordinates is described by a directed graph with possible cycles. CTBNs are a flexible tool for modelling phenomena from different areas such as chemistry, biology, social sciences etc. In the talk, I will discuss the problems of parameter learning and structure learning for CTBNs with a special attention to computational methods.
8.11.2018	Alexey Naumov (HSE University)	Gaussian approximations for maxima of large number of quadratic forms of high-dimensional random vectors	Let X_1, …, X_n be i.i.d. random vectors taking values in R^d, d \geq 1, and Q_1, …, Q_p, p \geq 1, be symmetric positive definite matrices. We consider the distribution function of vector (Q_j S_n, S_n), j = 1, …, p, where S_n = n^{-1/2}(X_1 + … + X_n), and prove the rate of Gaussian approximation with explicit dependence on n, p and d. We also compare this result with results of Bentkus (2003) and Chernozhukov, Chetverikov, Kato (2016). Applications to change point detection and model selection will be discussed as well. The talk is based on the joint project with F. Goetze, V. Spokoiny and A. Tikhomirov.
1.11.2018	Andzhey Koziuk (WIAS, Berlin)	Exact smoothning of a probability measure	In the talk an exact smoothing of indicator function is introduced and developed to allow differential calculus access the internal machinery behind a problem of measures comparison. Specifically, the problem of accuracy of non-classical analogue of Berry-Esseen inequality for a radnom elemets in Hilbert space is addressed with the help of the tool and aided with the classical Lindenberg chained sum construction.
25.10.2018	Nikita Zhivotoskiy (Technion, HSE University)	Robust estimation of the mean of a random vector	n this talk, we discuss the estimation of the mean random of a vector from an i.i.d. sample of its observations. The latest results show that using a special procedure, the mean of a random vector, for which only the existence of a covariance matrix is known, can be estimated as precisely as if the vector had a multidimensional normal distribution with the same covariance structure. The estimation of the covariance matrix with minimal constraints on the underlying distribution will be briefly discussed. The talk is based on the works of arxiv: 1702.00482 and arxiv: 1809.10462
18.10.2018	Alexey Ustimenko (HSE University, Skoltech)	Convergence of Laplacian Spectra from Point Clouds	The Laplacian-Beltrami operator (LBO) on manifolds and its spectral structure has been widely used in many applications such as manifold learning, spectral clustering, dimensionality reduction and so on. Recently, Point Integral Method (PIM) was proposed to estimate the LBO by the discrete Laplace operator constructed from a set of sample points. We would discuss a paper “Convergence of Laplacian Spectra from Point Clouds” by Zuoqiang Shi, Jian Sun and their’s proof of the convergence of the eigenvalues and eigenvectors obtained by PIM to the eigenvalues and eigenfunctions of the LBO with the Neumann boundary.
11.10.2018	Nikita Puchkin (HSE University)	Manifold Learning	I will discuss a problem of a manifold reconstruction from a noisy point cloud. There are plenty methods of manifold learning (ISOMAP, LLE, etc.) but only a few papers provide a theoretical justification of the proposed algorithms (Genovese et. al. (2012), Maggioni et. al. (2016), Aamari and Levrard (2017)). An approach based on a local polynomial estimation (Aamari, Levrard (2017)) reconstructs distinct parts of the manifold and the reconstructed manifold is often disconnected. Another approach (Genovese et. al. (2012)) tries to estimate a density with a support on the manifold. This approach requires a specific distribution of the noise, which is unlikely to hold in practice. A recently proposed method (Osher et. al. (2017)), based on an approximation of the Laplace-Beltrami operator, is aimed at a global reconstruction of the manifold and does not have the mentioned drawbacks. Unfortunately, the statistical properties of this method are not studied so far. In my talk, I will discuss some details concerning a theoretical justification of the procedure.
4.10.2018	Valentina Shumovskaya (HSE University, Skoltech)	Towards Hypothesis Testing for Random Graphs with Community Structure	The analysis of random graphs and network analysis recently become an active area of research. We will dicuss the problem of testing between two populations of inhomogeneous random graphs defined on the same set of vertices, where the null hypothesis means that the underlying edge probability matrices coincide. We propose a new approach of random graphs testing based on community detection procedure: we introduce a structural assumption, which is that our graphs have community structures with k communities.
27.09.2018	Alexander Goldenshluger (University of Haifa, HSE University)	Nonparamertic estimation in non-regular deconvolution problems.	I will discuss problems of nonparametric estimation in non-regular deconvolution models. In such settings standard estimation methods based on the Fourier transform are not applicable. I will review the existing literature on the topic, discuss general ideas for constructing optimal estimators, and present some preliminary results. Based on joint ongoing work with D. Belomestny.
20.09.2018	Alexander Timofeev (MIPT)	Density-sensitive semisupervised inference	In the talk, we will review the article written by Martin Azizyan, Aarti Singh and Larry Wasserman. We will consider semisupervised inference for the regression problem, based on a density-sensitive metric and show that the proposed semisupervised method outperforms any supervised one for certain sets. At the end of the talk, We will demonstrate how to calculate the density-sensitive metric.
13.09.2018	Katya Krymova (University of Duisburg-Essen)	On a sparse modification of projection approximation subspace tracking method	In this talk we revisit the well-known constrained projection approximation subspace tracking algorithm (CPAST) and derive non-asymptotic bounds on its error. Furthermore we introduce a novel sparse modification of CPAST which is able to exploit sparsity of the underlying covariance structure. We present a non-asymptotic analysis of the proposed algorithm and study its empirical performance on simulated and real data.
10.09.2018	Steve Oudot (Ecole Polytechnique)	Statistics and Learning with topological descriptors	I this survey talk I will review some of the recent advances in Topological Data Analysis to define stable descriptors for data, to use these descriptors for learning tasks, and to do statistics on them. Among other topics, I will address finding suitable metrics, defining and computing means and barycenters, deriving confidence intervals, laws of large numbers, central limit theorems, and kernel mean embeddings.

Have you spotted a typo?
Highlight it, click Ctrl+Enter and send us a message. Thank you for your help!
To be used only for spelling or punctuation mistakes.

International Laboratory of Stochastic Algorithms and High-Dimensional Inference

Scientific seminar