Publications
Interpretability and fairness are critical in computer vision and machine learning applications, in particular when dealing with human outcomes, e.g. inviting or not inviting for a job interview based on application materials that may include photographs. One promising direction to achieve fairness is to learn data representations that remove the semantics of protected characteristics and are therefore able to mitigate unfair outcomes. All available models, however, learn latent embeddings, which comes at the cost of interpretability. We propose to cast this problem as data-to-data translation, i.e. learning a mapping from an input domain to a fair target domain in which a fairness definition is enforced. Here the data domain can be images or any tabular data representation. This task would be straightforward if we had fair target data available, but this is not the case. To overcome this, we learn a highly unconstrained mapping by exploiting statistics of residuals – the difference between input data and its translated version – and the protected characteristics. When applied to the CelebA dataset of face images with the gender attribute as the protected characteristic, our model enforces equality of opportunity by adjusting the eyes and lips regions. Intriguingly, on the same dataset we arrive at similar conclusions when using semantic attribute representations of images for translation. On face images of the recent DiF dataset, with the same gender attribute, our method adjusts nose regions. On the Adult income dataset, also with the protected gender attribute, our model achieves equality of opportunity by, among other changes, obfuscating the wife and husband relationship. Analyzing such systematic changes allows us to scrutinize the interplay between the fairness criterion, the chosen protected characteristic, and prediction performance.
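As a rough illustration of the data-to-data translation setup above, the sketch below pairs an additive translation network with a residual-fidelity term and a group-gap penalty measured through a fixed downstream classifier. The feature dimension, network sizes, the demographic-parity-style gap, and the trade-off weight are assumptions made purely for illustration; the paper's residual-statistics-based losses and the equality-of-opportunity criterion are more involved.

```python
import torch
import torch.nn as nn

# Hypothetical feature dimension and a fixed downstream classifier are assumed here.
translator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

def translation_loss(x, protected, classifier):
    """x: (batch, 64) features; protected: (batch,) binary attribute;
    classifier: a fixed predictor returning one logit per example."""
    x_fair = x + translator(x)          # translation expressed via an additive residual
    residual = x_fair - x
    fidelity = residual.pow(2).mean()   # keep the translated data close to the input
    scores = torch.sigmoid(classifier(x_fair)).squeeze(-1)
    gap = (scores[protected == 1].mean() - scores[protected == 0].mean()).abs()
    return fidelity + 10.0 * gap        # trade-off weight chosen arbitrarily
```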
We extend the existing framework of semi-implicit variational inference (SIVI) and introduce doubly semi-implicit variational inference (DSIVI), a way to perform variational inference and learning when both the approximate posterior and the prior distribution are semi-implicit. In other words, DSIVI performs inference in models where the prior and the posterior can be expressed as an intractable infinite mixture of some analytic density with a highly flexible implicit mixing distribution. We provide a sandwich bound on the evidence lower bound (ELBO) objective that can be made arbitrarily tight. Unlike discriminator-based and kernel-based approaches to implicit variational inference, DSIVI optimizes a proper lower bound on ELBO that is asymptotically exact. We evaluate DSIVI on a set of problems that benefit from implicit priors. In particular, we show that DSIVI gives rise to a simple modification of VampPrior, the current state-of-the-art prior for variational autoencoders, which improves its performance.
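To make the notion of a semi-implicit distribution concrete, here is a minimal sketch: an implicit mixing distribution (a network pushing forward Gaussian noise) produces the parameters of an explicit Gaussian conditional, so the marginal density can only be estimated with Monte Carlo mixtures, whose log is in expectation a lower bound by Jensen's inequality; surrogates of this kind are what the SIVI/DSIVI bounds are assembled from. Layer sizes and dimensions are placeholders.

```python
import math
import torch
import torch.nn as nn

class SemiImplicitGaussian(nn.Module):
    """q(z) = E_{psi ~ q(psi)} N(z | mu(psi), sigma(psi)^2): the mixing distribution
    q(psi) is implicit (a network pushing forward Gaussian noise), so q(z) has no
    closed form, while sampling and the conditional density stay tractable."""

    def __init__(self, noise_dim=16, z_dim=8):
        super().__init__()
        self.noise_dim, self.z_dim = noise_dim, z_dim
        self.net = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * z_dim))

    def sample_psi(self, n):
        eps = torch.randn(n, self.noise_dim)           # implicit mixing distribution
        mu, log_sigma = self.net(eps).chunk(2, dim=-1)
        return mu, log_sigma

    def rsample(self, n):
        mu, log_sigma = self.sample_psi(n)
        return mu + log_sigma.exp() * torch.randn_like(mu)

    def log_density_surrogate(self, z, k=64):
        """Log of a K-sample mixture estimate of q(z); by Jensen's inequality its
        expectation lower-bounds log q(z)."""
        mu, log_sigma = self.sample_psi(k)                      # (k, z_dim)
        normal = torch.distributions.Normal(mu, log_sigma.exp())
        log_p = normal.log_prob(z.unsqueeze(-2)).sum(-1)        # (n, k)
        return torch.logsumexp(log_p, dim=-1) - math.log(k)
```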
Reducing the number of parameters is one of the most important goals in deep learning. In this article we propose an adaptation of Doubly Stochastic Variational Inference for Automatic Relevance Determination (DSVI-ARD) to neural network compression. We find this method to be especially useful in language modeling tasks, where the large number of parameters in the input and output layers is often excessive. We also show that DSVI-ARD can be applied together with encoder-decoder weight tying, allowing us to achieve even better sparsity and performance. Our experiments demonstrate that more than 90% of the weights in both the encoder and decoder layers can be removed with a minimal loss of quality.
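A sketch of the kind of ARD-style Bayesian layer this line of work builds on: a factorized Gaussian posterior per weight, an empirical-Bayes ARD prior under which the per-weight KL term collapses to 0.5*log(1 + mu^2/sigma^2), and pruning of weights with a low posterior signal-to-noise ratio at test time. This is a generic illustration rather than the paper's implementation; the local reparameterization trick and the pruning threshold are common choices assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARDLinear(nn.Module):
    """Bayesian linear layer with posterior N(mu, sigma^2) per weight and an
    empirical-Bayes ARD Gaussian prior; total loss = task NLL + kl() / dataset_size."""

    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.log_sigma = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # prune weights whose SNR |mu|/sigma falls below this

    def kl(self):
        snr_sq = self.mu ** 2 / (self.log_sigma.exp() ** 2 + 1e-8)
        return 0.5 * torch.log1p(snr_sq).sum()

    def forward(self, x):
        if self.training:
            # local reparameterization: sample the pre-activations directly
            mean = F.linear(x, self.mu, self.bias)
            var = F.linear(x ** 2, self.log_sigma.exp() ** 2)
            return mean + var.clamp_min(1e-8).sqrt() * torch.randn_like(mean)
        mask = (self.mu.abs() / self.log_sigma.exp() > self.threshold).float()
        return F.linear(x, self.mu * mask, self.bias)
```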
We describe the use of Monte Carlo modeling to determine the parameters of near-infrared light propagation through the tissues of the head, which is needed for optimizing the operation of brain–computer interfaces. The studies used a four-layer spherical model of the head consisting of skin, bone, gray matter, and white matter. The relationships between the parameters of the recorded radiation and the distance between the source and the detector were obtained.
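For intuition, here is a heavily simplified Monte Carlo photon-migration sketch: a single homogeneous layer, isotropic scattering, absorption handled by weight attenuation, and a histogram of re-emitted photon weight versus source-detector separation. The optical coefficients are assumed values; the paper's four-layer spherical head model, anisotropic scattering, and refractive-index mismatches are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_a, mu_s = 0.02, 1.0        # assumed absorption and scattering coefficients, 1/mm
mu_t = mu_a + mu_s

def simulate_photon(max_steps=1000):
    """Random-walk one photon launched into the medium at the origin.
    Returns (exit_radius_mm, surviving_weight) if it re-emerges, else None."""
    pos = np.zeros(3)
    direction = np.array([0.0, 0.0, 1.0])        # +z points into the tissue
    weight = 1.0
    for _ in range(max_steps):
        step = -np.log(rng.random()) / mu_t      # exponentially distributed free path
        new_pos = pos + step * direction
        if new_pos[2] < 0.0:                     # crossed back through the surface z = 0
            frac = pos[2] / (pos[2] - new_pos[2])
            exit_xy = pos[:2] + frac * step * direction[:2]
            return np.hypot(exit_xy[0], exit_xy[1]), weight
        pos = new_pos
        weight *= mu_s / mu_t                    # weight fraction surviving absorption
        if weight < 1e-3:                        # crude termination instead of roulette
            return None
        cos_t = 2.0 * rng.random() - 1.0         # isotropic scattering direction
        phi = 2.0 * np.pi * rng.random()
        sin_t = np.sqrt(1.0 - cos_t ** 2)
        direction = np.array([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t])
    return None

# Detected weight vs. source-detector separation (1 mm bins up to 40 mm).
bins = np.linspace(0.0, 40.0, 41)
detected = np.zeros(len(bins) - 1)
for _ in range(20_000):
    out = simulate_photon()
    if out is not None:
        r, w = out
        idx = np.searchsorted(bins, r) - 1
        if 0 <= idx < len(detected):
            detected[idx] += w
print(detected / detected.sum())
```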
Eight of the top ten supercomputers in the Top500 list published in November 2018 consist of computing nodes with hybrid architectures that require special programming techniques. Five of these systems are based on Nvidia GPUs. In this paper, we consider benchmark results of the brand-new hybrid supercomputer installed in March 2019 at NRU HSE. This system allows us to estimate the performance of several widely used materials science and machine learning codes, which we discuss in this work alongside the results available for older HPC systems.
This paper proposes a semi-conditional normalizing flow model for semi-supervised learning. The model uses both labeled and unlabeled data to learn an explicit model of the joint distribution over objects and labels. The semi-conditional architecture of the model allows us to efficiently compute the value and gradients of the marginal likelihood for unlabeled objects. The conditional part of the model is based on a proposed conditional coupling layer. We demonstrate the performance of the model on semi-supervised classification problems on different datasets. The model outperforms a baseline approach based on variational autoencoders on the MNIST dataset.
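A minimal sketch of what a conditional affine coupling layer can look like: the scale and shift networks receive the class label in addition to the untouched half of the input, so a stack of such layers defines a flow for p(x | y), and for unlabeled data log p(x) is a logsumexp over labels of log p(y) + log p(x | y). The splitting scheme, one-hot conditioning, and sizes are assumptions for illustration, not necessarily the paper's design.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling layer whose scale/shift networks also receive a class label."""

    def __init__(self, dim, n_classes, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d + n_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x, y_onehot):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_s, t = self.net(torch.cat([x1, y_onehot], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                  # keep scales well-behaved
        z2 = x2 * log_s.exp() + t
        log_det = log_s.sum(dim=1)                 # contribution to log |det J|
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z, y_onehot):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        log_s, t = self.net(torch.cat([z1, y_onehot], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (z2 - t) * (-log_s).exp()
        return torch.cat([z1, x2], dim=1)
```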
Bayesian inference is known to provide a general framework for incorporating prior knowledge or specific properties into machine learning models via a carefully chosen prior distribution. In this work, we propose a new type of prior distribution for convolutional neural networks, the deep weight prior (DWP), which exploits generative models to encourage a specific structure of trained convolutional filters, e.g., spatial correlations of weights. We define DWP in the form of an implicit distribution and propose a method for variational inference with this type of implicit prior. In experiments, we show that DWP improves the performance of Bayesian neural networks when training data are limited, and that initialization of weights with samples from DWP accelerates the training of conventional convolutional neural networks.
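The "initialization with samples from DWP" use case can be pictured with the following sketch: a decoder trained on filters of source networks (a placeholder architecture here) generates kernel slices, which are then copied into a fresh convolutional layer. The decoder layout, latent size, and kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KernelDecoder(nn.Module):
    """Placeholder decoder mapping latent codes to 3x3 kernel slices; in practice it
    would be the decoder of a generative model trained on filters of source networks."""

    def __init__(self, latent_dim=2, kernel_size=3):
        super().__init__()
        self.latent_dim, self.k = latent_dim, kernel_size
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, kernel_size * kernel_size))

    def sample_kernels(self, n):
        z = torch.randn(n, self.latent_dim)
        return self.net(z).view(n, self.k, self.k)

def init_conv_from_prior(conv: nn.Conv2d, decoder: KernelDecoder):
    """Initialize every (out, in) kernel slice of a conv layer with a prior sample."""
    out_c, in_c, k, _ = conv.weight.shape
    with torch.no_grad():
        conv.weight.copy_(decoder.sample_kernels(out_c * in_c).view(out_c, in_c, k, k))

decoder = KernelDecoder()
conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)
init_conv_from_prior(conv, decoder)
```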
Ordinary stochastic neural networks mostly rely on the expected values of their weights to make predictions, whereas the induced noise is mostly used to capture uncertainty, prevent overfitting, and slightly boost performance through test-time averaging. In this paper, we introduce variance layers, a different kind of stochastic layer. Each weight of a variance layer follows a zero-mean distribution and is parameterized only by its variance. We show that such layers can learn surprisingly well, can serve as an efficient exploration tool in reinforcement learning tasks, and provide a decent defense against adversarial attacks. We also show that a number of conventional Bayesian neural networks naturally converge to such zero-mean posteriors. We observe that in these cases the zero-mean parameterization leads to a much better training objective than conventional parameterizations where the mean is also learned.
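A variance layer is easy to state in code: the weight posterior is a zero-mean Gaussian with a learned variance, so with the local reparameterization trick the pre-activations are pure zero-mean noise whose scale carries all the information, and predictions are obtained by test-time averaging over several stochastic forward passes. The sketch below is a linear version under those assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VarianceLinear(nn.Module):
    """Linear layer with weights w ~ N(0, sigma^2): only the noise scale is learned.
    Local reparameterization: pre-activations are N(0, (x^2) @ sigma^2)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        var = F.linear(x ** 2, self.log_sigma.exp() ** 2)   # pre-activation variance
        eps = torch.randn_like(var)
        return var.clamp_min(1e-8).sqrt() * eps + self.bias
```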
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and ShakeShake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
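A minimal sketch of the SWA procedure: run SGD with a constant (or cyclical) learning rate and maintain a running average of the weights visited at the end of each epoch. The hyperparameters below are placeholders, and torch.optim.swa_utils provides an equivalent, more complete implementation with batch-norm handling built in.

```python
import copy
import torch

def train_with_swa(model, loader, loss_fn, epochs=150, swa_start=100, lr=0.05):
    """SGD with a constant learning rate; weights are averaged from epoch swa_start on."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    swa_model, n_avg = None, 0
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        if epoch >= swa_start:                     # running average of visited weights
            if swa_model is None:
                swa_model, n_avg = copy.deepcopy(model), 1
            else:
                for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                    p_swa.data.mul_(n_avg / (n_avg + 1)).add_(p.data / (n_avg + 1))
                n_avg += 1
    # Batch-norm running statistics of swa_model must be recomputed with one pass
    # over the training data before evaluation.
    return swa_model
```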
Bayesian methods have been successfully applied to sparsify weights of neural networks and to remove structural units from the networks, e.g., neurons. We apply and further develop this approach for gated recurrent architectures. Specifically, in addition to sparsification of individual weights and neurons, we propose to sparsify the preactivations of gates and information flow in LSTM. This makes some gates and information flow components constant, speeds up the forward pass, and improves compression. Moreover, the resulting structure of gate sparsity is interpretable and depends on the task.
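The sketch below illustrates the kind of structured sparsity being described: per-unit multiplicative variables on the LSTM gate preactivations, so that driving a variable to zero makes the corresponding gate component constant (only the bias remains). The plain L1 penalty is a simple stand-in for the Bayesian sparsification used in the paper, and the cell layout mirrors PyTorch's gate ordering.

```python
import torch
import torch.nn as nn

class SparseGateLSTMCell(nn.Module):
    """LSTM cell with per-unit multiplicative variables on the gate preactivations."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # one multiplicative variable per unit of the i, f, g, o preactivations
        self.gate_scale = nn.Parameter(torch.ones(4 * hidden_size))

    def forward(self, x, state):
        h, c = state
        pre = x @ self.cell.weight_ih.t() + h @ self.cell.weight_hh.t()
        pre = pre * self.gate_scale + self.cell.bias_ih + self.cell.bias_hh
        i, f, g, o = pre.chunk(4, dim=1)          # PyTorch gate order: i, f, g, o
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new

    def sparsity_penalty(self):
        return self.gate_scale.abs().sum()        # stand-in for the Bayesian objective
```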
We explore the recently introduced definition modeling technique, which provides a tool for evaluating different distributed vector representations of words by modeling their dictionary definitions. In this work, we study the problem of word ambiguity in definition modeling and propose a possible solution by employing latent variable modeling and soft attention mechanisms. Our quantitative and qualitative evaluation and analysis of the model show that taking word ambiguity and polysemy into account leads to performance improvements.
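One small piece of the kind of mechanism mentioned above can be sketched as soft attention over several candidate "sense" vectors of the word being defined: the decoder state queries the senses and the definition is generated from their weighted mixture. This is illustrative only, not the full definition-modeling architecture; the sense vectors and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SenseAttention(nn.Module):
    """Soft attention over candidate sense vectors of the word being defined."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)

    def forward(self, decoder_state, sense_vectors):
        # decoder_state: (batch, dim); sense_vectors: (batch, n_senses, dim)
        q = self.query(decoder_state).unsqueeze(1)             # (batch, 1, dim)
        scores = (q * sense_vectors).sum(-1)                   # (batch, n_senses)
        weights = torch.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * sense_vectors).sum(1)  # (batch, dim) mixture
```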
Modern computational approaches and machine learning techniques accelerate the invention of new drugs. Generative models can discover novel molecular structures within hours, while conventional drug discovery pipelines require months of work. In this article, we propose a new generative architecture, entangled conditional adversarial autoencoder, that generates molecular structures based on various properties, such as activity against a specific protein, solubility, or ease of synthesis. We apply the proposed model to generate a novel inhibitor of Janus kinase 3, implicated in rheumatoid arthritis, psoriasis, and vitiligo. The discovered molecule was tested in vitro and showed good activity and selectivity.
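As a simplified stand-in for the entangled conditional adversarial autoencoder, the sketch below shows a plain conditional AAE: molecules are assumed to be encoded as fixed-length feature vectors (an assumption; the paper works with molecular structures), the desired properties enter as a conditioning vector for both encoder and decoder, and a discriminator, trained adversarially against the encoder (training loop not shown), pushes the latent code towards a Gaussian prior. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CondAAE(nn.Module):
    """Plain conditional adversarial autoencoder over molecular feature vectors."""

    def __init__(self, x_dim=512, c_dim=4, z_dim=32, hidden=256):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))
        self.disc = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))   # real prior vs. encoded code

    def forward(self, x, c):
        z = self.enc(torch.cat([x, c], dim=1))
        x_rec = self.dec(torch.cat([z, c], dim=1))
        return z, x_rec

    def sample(self, c):
        """Generate candidates with requested properties c by decoding prior samples."""
        z = torch.randn(c.shape[0], self.z_dim)
        return self.dec(torch.cat([z, c], dim=1))
```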
The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10, CIFAR-100, and ImageNet.
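A compact sketch of the FGE recipe: starting from an already trained model, continue training with a cyclical learning rate and snapshot the weights at the end of each cycle; the snapshots' predictions are averaged at test time. The schedule shape and hyperparameters here are illustrative rather than the paper's exact settings.

```python
import copy
import math
import torch

def fast_geometric_ensemble(model, loader, loss_fn, cycles=6, cycle_len=4,
                            lr_max=0.05, lr_min=0.0005):
    """Cyclical-learning-rate fine-tuning with one snapshot per cycle."""
    opt = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    snapshots, it = [], 0
    iters_per_cycle = cycle_len * len(loader)
    for _ in range(cycles):
        for _ in range(cycle_len):
            for x, y in loader:
                t = (it % iters_per_cycle) / iters_per_cycle
                lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
                for g in opt.param_groups:
                    g["lr"] = lr
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
                it += 1
        snapshots.append(copy.deepcopy(model).eval())   # snapshot at the cycle's end
    return snapshots

def ensemble_predict(snapshots, x):
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in snapshots])
    return probs.mean(0)
```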
We consider the structured-output prediction problem through probabilistic approaches and generalize the "perturb-and-MAP" framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use double stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on large-scale character recognition and image segmentation, showing the benefits of our algorithms.
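The basic perturb-and-MAP idea can be shown for the unary-only case, where adding i.i.d. Gumbel noise to the potentials and taking the argmax is exact Gumbel-max sampling; for the pairwise log-supermodular models considered above, the per-node argmax would be replaced by a graph-cut MAP solver, which is where dynamic graph cuts come in. The potentials below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_and_map_unary(unaries, n_samples=100):
    """Perturb-and-MAP with per-node Gumbel perturbations: add Gumbel noise to the
    unary potentials and return the MAP assignment of each perturbed problem."""
    n_nodes, n_labels = unaries.shape
    samples = np.empty((n_samples, n_nodes), dtype=np.int64)
    for s in range(n_samples):
        gumbel = -np.log(-np.log(rng.random((n_nodes, n_labels))))
        samples[s] = np.argmax(unaries + gumbel, axis=1)
    return samples

# Marginals estimated from perturb-and-MAP samples (toy 3-node, 2-label problem).
unaries = np.log(np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]))
samples = perturb_and_map_unary(unaries, n_samples=5000)
print((samples[:, :, None] == np.arange(2)).mean(0))   # approximate marginals
```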
In dynamic malware analysis, programs are classified as malware or benign based on their execution logs. We propose applying monotonic classification models to this analysis, making the trained model's predictions consistent over execution time and provably stable to the injection of any noise or 'benign-looking' activity into the program's behavior. The predictions of such models change monotonically through the log, in the sense that adding new lines to the log can only increase the probability of the file being found malicious, which makes them suitable for real-time classification on a user's machine. We evaluate monotonic neural network models based on the work of Chistyakov et al. (2017) and demonstrate that they provide stable and interpretable results.
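A toy example of a classifier with the monotonicity property described above: each event contributes a non-negative amount of evidence, the evidence is accumulated with a cumulative sum, and a monotone link maps it to a probability, so appending log lines can only increase the malware score. This illustrates the property only; it is not the architecture of Chistyakov et al. (2017).

```python
import torch
import torch.nn as nn

class MonotonicLogClassifier(nn.Module):
    """Monotone-in-the-prefix classifier over a sequence of execution-log event ids."""

    def __init__(self, n_event_types, emb_dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_event_types, emb_dim)
        self.score = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.bias = nn.Parameter(torch.tensor(-3.0))

    def forward(self, events):                      # events: (batch, seq_len) int ids
        per_event = nn.functional.softplus(self.score(self.emb(events))).squeeze(-1)
        evidence = per_event.cumsum(dim=1)          # non-decreasing in the log prefix
        return torch.sigmoid(evidence + self.bias)  # P(malicious | first t lines), all t
```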
The objective of this work is to develop a predictive model for multiphase wellbore flows using a machine learning approach. An artificial neural network is developed and trained on a dataset generated with a numerical simulator of full-scale transient wellbore flows. After training is completed, the neural network is used to predict one of the key parameters of the wellbore flow, namely the bottomhole pressure. The novelty of this work lies in applying the neural network to analyze highly transient processes taking place in wellbores. In such processes, most parameters of interest can be represented by interdependent time series of variables linked through complex physical phenomena pertinent to the nature of multiphase flows. The proposed neural network with two hidden layers demonstrated the capability to predict the bottomhole pressure within a 5% normalized root mean squared error for many complex wellbore configurations and flows. It is also shown that relatively higher prediction errors are mainly observed for slug flows, where the transient nature of the flow is most pronounced. Finally, the developed model is tested on data affected by noise. It is demonstrated that although the prediction error increases slightly compared with the noise-free data, the model captures the essential features of the studied transient process. A description of the developed models, an analysis of various test cases, and possible future research directions are outlined.
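For concreteness, a sketch of a two-hidden-layer network for bottomhole pressure regression together with a normalized RMSE metric; the input feature layout, layer widths, and the normalization convention (RMSE divided by the target range) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BottomholePressureNet(nn.Module):
    """Two-hidden-layer fully connected regressor mapping a vector of wellbore
    measurements at a time step to the bottomhole pressure."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def normalized_rmse(pred, target):
    """RMSE divided by the target range (one common normalization convention)."""
    rmse = torch.sqrt(torch.mean((pred - target) ** 2))
    return rmse / (target.max() - target.min())
```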
We present a probabilistic model with discrete latent variables that control the computation time in deep learning models such as ResNets and LSTMs. A prior on the latent variables expresses the preference for faster computation. The amount of computation for an input is determined via amortized maximum a posteriori (MAP) inference. MAP inference is performed using a novel stochastic variational optimization method. The recently proposed adaptive computation time mechanism can be seen as an ad-hoc relaxation of this model. We demonstrate training using the general-purpose concrete relaxation of discrete variables. Evaluation on ResNet shows that our method matches the speed-accuracy trade-off of adaptive computation time, while allowing for evaluation with a simple deterministic procedure that has a lower memory footprint.
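A sketch of such a gating mechanism with a concrete relaxation: each residual block is wrapped with a binary execute-or-skip variable whose relaxed (RelaxedBernoulli) sample is used during training and which is thresholded deterministically at test time; a prior preferring the skip option would add a penalty on the gate probability to the training loss. The gate network and temperature are illustrative choices, not the paper's exact construction.

```python
import torch
import torch.nn as nn
from torch.distributions import RelaxedBernoulli

class GatedBlock(nn.Module):
    """Residual block with a binary execute-or-skip gate under a concrete relaxation."""

    def __init__(self, block, in_channels, temperature=0.5):
        super().__init__()
        self.block = block
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_channels, 1))
        self.temperature = temperature

    def forward(self, x):
        logit = self.gate(x)                                     # (batch, 1)
        if self.training:
            g = RelaxedBernoulli(torch.tensor(self.temperature),
                                 logits=logit).rsample()         # relaxed gate sample
        else:
            g = (logit > 0).float()                              # deterministic decision
        g = g.view(-1, 1, 1, 1)
        # A prior preferring g = 0 (less computation) would penalize
        # torch.sigmoid(logit) in the training objective.
        return x + g * self.block(x)
```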