Seminars 2020
The Lab holds invited talks on NLP, Recommender Systems, Data Mining, and related topics twice a month. For more details, see the Russian page.
The seminar "Automatic Processing and Analysis of Texts" is dedicated to various text processing tasks (tokenization, sentence segmentation, part-of-speech tagging, and syntactic parsing) and text analysis tasks (information extraction, construction and use of knowledge graphs, question answering systems, text classification, etc.).
Online Seminar "Entropy Approach in Topical Modeling"
Date: November 5, 2020.
Speakers: Sergey Koltsov, Leading Researcher, Laboratory of Social and Cognitive Informatics; Associate Professor, Department of Mathematics.
Vera Ignatenko, Researcher, Laboratory of Social and Cognitive Informatics; Associate Professor, Department of Mathematics.
Abstract: The talk considers how deformed entropies (the Rényi, Tsallis, and Sharma-Mittal entropies) can be used to analyze the behavior of a number of topic models (TM). It describes an approach, based on ideas from statistical physics, to analyzing how a TM depends on the number of topics. Within this framework, a collection of documents and words is viewed as a mesoscopic information system whose state is described by deformed entropies and whose behavior is determined by the number of clusters/topics; topic modeling is treated as a procedure for ordering this information system. Accordingly, the problem of choosing the optimal number of topics reduces to finding the minimum of the free energy, or the minimum of the nonequilibrium Rényi/Tsallis entropy, while semantic stability can be located using the Sharma-Mittal entropy. The talk shows how the hyperparameters of topic models can be tuned in entropic terms, both by grid search over hyperparameter values and by renormalization procedures; renormalizing topic models can substantially speed up the entropy approach computationally, which is crucial when working with big data. The talk also considers applying the entropy approach to hierarchical topic models and discusses its limitations. Finally, it presents computational results for topic models such as PLSA, vLDA (Blei), LDA (Gibbs sampling), GLDA (Gibbs sampling), and BigARTM, results of the renormalization procedures, and results for several hierarchical topic models (hPAM, hLDA, hARTM).
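Below is a loose Python sketch of the core idea: pick the number of topics that minimizes a deformed entropy of the fitted topic-word matrix. The energy/entropy definitions are a rough reconstruction from the description above (energy from the mass of highly probable words, entropy from their share, deformation parameter tied to the number of topics); consult the speakers' papers for the exact formulas. The corpus, vocabulary size, and topic grid are stand-ins.

```python
# A loose reconstruction of topic-number selection by entropy minimization;
# not the speakers' exact method.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(texts)

def renyi_entropy(phi):
    """Approximate Renyi entropy of a topic-word matrix phi (topics x words).

    Energy is taken from the mass of 'highly probable' words (p > 1/W),
    entropy from their share, and the deformation parameter is q = 1/T."""
    T, W = phi.shape
    hot = phi > 1.0 / W
    p_tilde = phi[hot].sum() / T            # average mass of highly probable words
    rho = hot.sum() / (T * W)               # share of highly probable words
    energy = -np.log(p_tilde)               # E = -ln(P~)
    entropy = np.log(rho)                   # S = ln(rho)
    free_energy = energy - T * entropy      # F = E - T*S
    return free_energy / (T - 1)            # Renyi entropy at q = 1/T

scores = {}
for T in range(2, 21, 3):
    lda = LatentDirichletAllocation(n_components=T, random_state=0).fit(X)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    scores[T] = renyi_entropy(phi)

print("topic counts and entropies:", scores)
print("minimum at T =", min(scores, key=scores.get))
```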
Online Seminar "Combining Neural Language Models for Word Sense Induction"
Date: December 8, 2020
Speaker: Nikolay Arefiev, Junior Researcher, Laboratory for Models and Methods of Computational Pragmatics, Ph.D.
Abstract: The word sense induction (WSI) task requires grouping text fragments containing a polysemous word into clusters corresponding to the senses of that word. The talk is devoted to the author's research on applying neural language models to generate lexical substitutes and on using those substitutes for Russian and English WSI. Approaches to combining the probability distributions estimated by different language models are considered as a way to improve the quality of the substitutes and the WSI results.
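As an illustration of the substitute-based approach, here is a minimal Python sketch that induces senses of the word "bank" by clustering contexts represented through masked-LM substitutes. The model choice (bert-base-uncased), the clustering method, and all sizes are assumptions for the example; the talk's method additionally combines distributions from several language models, e.g. by averaging their log-probabilities.

```python
# A minimal sketch of substitute-based word sense induction.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def substitutes(sentence, target, k=20):
    """Top-k substitutes for `target` masked out in `sentence`."""
    text = sentence.replace(target, tok.mask_token, 1)
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**ids).logits
    pos = (ids["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    top = logits[0, pos].topk(k).indices
    return [tok.decode([int(i)]).strip() for i in top]

contexts = [
    "He sat on the bank of the river.",
    "She deposited money in the bank.",
    "The bank raised its interest rates.",
    "Fishermen lined the bank at dawn.",
]
# Represent each context by a bag of its substitutes, then cluster.
docs = [" ".join(substitutes(c, "bank")) for c in contexts]
vecs = TfidfVectorizer().fit_transform(docs).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vecs)
print(labels)  # contexts with the same label get the same induced sense
```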
Online Seminar "Machine Reading Comprehension and Russian Language"
Date: September 17, 2020.
Speaker: Pavel Efimov.
He earned his Master's degree in Computer Science at Saint Petersburg State University and is now a PhD student at ITMO University.
Abstract: First, I will briefly survey machine reading comprehension (RC) and its flavors, as well as the methods and datasets used to tackle the task. Then I will focus on RC datasets for non-English languages, paying special attention to a Russian RC dataset, the Sberbank Question Answering Dataset (SberQuAD). SberQuAD has been widely used since its inception in 2017, but it was not described and analyzed properly in the literature until recently. In my presentation, I will provide a thorough analysis of SberQuAD and report several baselines.
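Since SberQuAD follows the SQuAD extractive format, baselines on it are typically reported with the standard exact-match and token-level F1 metrics. A minimal sketch of these metrics follows; the normalization here is simplified (the official SQuAD script, for instance, also strips English articles, which is irrelevant for Russian).

```python
# A minimal sketch of SQuAD-style EM/F1 answer metrics.
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("в 2017 году", "в 2017 году"))  # 1.0
print(f1("2017 году", "в 2017 году"))             # 0.8
```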
Online Seminar "RussianSuperGLUE"
Date: September 3, 2020.
Speaker: Alena Fenogenova, Chief Specialist, NLP R&D, CDS office, Sberbank.
Abstract: The talk presents a large benchmark for evaluating Russian language models: RussianSuperGLUE.
Online Seminar "Deep Active Learning: Reducing Annotation Effort for Automatic Sequence Tagging of Clinical and Biomedical Texts"
Date: May 13, 2020
Speaker: Alexey Zobnin, Associate Professor of the Faculty of Computer Science at the National Research University Higher School of Economics, leading developer of the geosearch service and directory of Yandex organizations.
Abstract: Active learning is a technique that helps to minimize the annotation budget required for the creation of a labeled dataset while maximizing the performance of a model trained on this dataset. It has been shown that active learning can be successfully applied to sequence tagging tasks in text processing in conjunction with deep learning models, even when only a limited amount of labeled data is available. Recent advances in transfer learning for natural language processing, based on deep pre-trained models such as ELMo and BERT, offer a much better ability to generalize from small annotated datasets than their shallow counterparts. The combination of deep pre-trained models and active learning thus yields a powerful approach to dealing with annotation scarcity. In this talk, we will present recent experimental results of deep active learning on clinical and biomedical data in English and Russian. We will consider state-of-the-art sequence tagging models in combination with several active learning strategies. Among NER and other sequence labeling tasks, we will discuss in particular the application of active learning to finding heart risk factors in EHRs, part of a biomedical research project on automated ischemic stroke prediction.
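A minimal sketch of the pool-based active learning loop underlying such experiments is given below, with a simple least-confidence query strategy. `tagger` and `oracle` are assumed interfaces (not any real library's API); the talk's setup uses deep pre-trained taggers and more elaborate strategies such as MNLP.

```python
# A minimal sketch of pool-based active learning for sequence tagging.
import numpy as np

def least_confidence(tagger, pool):
    """Score each unlabeled sequence by how unconfident the tagger is about it."""
    scores = []
    for tokens in pool:
        probs = tagger.predict_proba(tokens)   # (seq_len, n_tags); assumed API
        best = probs.max(axis=1)               # per-token confidence
        scores.append(1.0 - np.prod(best))     # low path confidence => high score
    return np.array(scores)

def active_learning_loop(tagger, labeled, pool, oracle, rounds=10, batch=50):
    for _ in range(rounds):
        tagger.fit(labeled)                    # retrain on the current labeled set
        idx = np.argsort(-least_confidence(tagger, pool))[:batch]
        labeled.extend(oracle(pool[i]) for i in idx)  # human annotates hard cases
        chosen = set(int(i) for i in idx)
        pool = [s for i, s in enumerate(pool) if i not in chosen]
    return tagger
```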
Online Seminar "Collaborative filtering and autoencoders"
Date: May 7, 2020
Speaker: Ilya Shenbin, researcher at the Samsung AI Lab at POMI RAS.
Abstract: Matrix factorization has become a standard collaborative filtering approach for building recommender systems. Despite its advantages, state-of-the-art results are achieved by alternative methods. The talk discusses two families of models: so-called linear autoencoders (for example, SLIM), whose essence is to learn an item-item similarity matrix, and their more flexible generalizations, deep autoencoders (mainly based on VAEs).
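To make the linear-autoencoder idea concrete, here is a minimal sketch of learning an item-item weight matrix from implicit feedback. For brevity it uses the closed-form EASE model (Steck, 2019), a close relative of SLIM, rather than SLIM's constrained optimization; the interaction data are toy values.

```python
# A minimal sketch of a linear autoencoder for collaborative filtering (EASE variant).
import numpy as np

def ease(X, reg=100.0):
    """Learn an item-item weight matrix B with zero diagonal from a
    user-item interaction matrix X (users x items)."""
    G = X.T @ X + reg * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                      # closed-form solution, column-wise
    np.fill_diagonal(B, 0.0)                 # forbid self-similarity
    return B

# Toy interactions: 4 users x 5 items (implicit feedback).
X = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)
B = ease(X, reg=1.0)
scores = X @ B                               # predicted affinity, incl. unseen items
print(np.round(scores, 2))
```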
Online Seminar "Linear algebra in vector representation of words problems"
Date: April 16, 2020
Speaker: Alexey Zobnin, Associate Professor of the Faculty of Computer Science at the National Research University Higher School of Economics, leading developer of the geosearch service and directory of Yandex organizations.
Abstract: In applied problems of automatic text processing, words are replaced by real-valued vectors of relatively small dimension, such that semantic and syntactic proximity of words corresponds to geometric proximity of the vectors. Typically, such vectors are obtained from the layers of a neural network or from low-rank matrix decompositions. We will consider two basic models for constructing such vectors: SVD of the PPMI matrix and word2vec SGNS. After analyzing the first model, we will propose a modification of the second that excludes context vectors from it; for this we will need theorems from classical linear algebra.
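A minimal sketch of the first model, word vectors from a truncated SVD of the PPMI matrix, is below; the corpus, the co-occurrence window (here simply the whole sentence), and the dimensionality are toy values.

```python
# A minimal sketch of word vectors via truncated SVD of a PPMI matrix.
import numpy as np
from itertools import combinations

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts; for simplicity the window is the whole sentence.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w, c in combinations(sent, 2):
        C[idx[w], idx[c]] += 1
        C[idx[c], idx[w]] += 1

# PPMI: positive pointwise mutual information.
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0)
ppmi[np.isnan(ppmi)] = 0

# Word vectors = top left singular vectors scaled by singular values.
U, S, _ = np.linalg.svd(ppmi)
d = 2
vectors = U[:, :d] * S[:d]
print({w: np.round(vectors[idx[w]], 2) for w in ("cat", "dog")})
```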
Online Seminar "From vector representations of words to hyperbolic space and back"
Date: April 2, 2020
Speaker: Zhenisbek Asylbekov
Abstract: The talk consists of two parts. In the first part, I will give a brief overview of our previous work on the transition from word vectors to Lobachevsky geometry via a binarized PMI matrix and complex networks. In the second part, we will talk about the reverse transition: we select random points in a hyperbolic disk and claim that these points are already representations of words; it only remains to determine which point corresponds to which word of a human language. This correspondence can be established approximately using the PMI matrix and graph matching methods.
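The following sketch illustrates the "reverse" direction on toy data: sampling quasi-uniform random points in a hyperbolic disk and connecting nearby ones, a construction known to yield complex networks with heavy-tailed degree distributions. Matching the resulting nodes to actual words via the PMI matrix is the talk's subject and is not reproduced here; the radius and threshold are illustrative.

```python
# A minimal sketch of a random hyperbolic graph in the native disk model.
import numpy as np

def sample_hyperbolic_disk(n, R=8.0, seed=0):
    """Quasi-uniform sampling in a hyperbolic disk of radius R: angles are
    uniform, radii follow the hyperbolic area density ~ sinh(r)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n)
    u = rng.uniform(0, 1, n)
    r = np.arccosh(1 + u * (np.cosh(R) - 1))   # inverse-CDF sampling
    return r, theta

def hyperbolic_distance(r1, t1, r2, t2):
    """Distance in the hyperbolic plane, polar coordinates."""
    dtheta = np.pi - abs(np.pi - abs(t1 - t2))
    arg = np.cosh(r1) * np.cosh(r2) - np.sinh(r1) * np.sinh(r2) * np.cos(dtheta)
    return np.arccosh(max(arg, 1.0))           # clamp guards rounding error

n, R = 100, 8.0
r, theta = sample_hyperbolic_disk(n, R)
# Edge whenever two points lie within distance R of each other.
A = np.array([[1 if i != j and hyperbolic_distance(r[i], theta[i], r[j], theta[j]) < R
               else 0 for j in range(n)] for i in range(n)])
print("mean degree:", A.sum(axis=1).mean())
```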
Online Seminar "Segmentation of network representation of text into sentences and formation of discourse in text synthesis problems"
Date: March 19, 2020
Speakers: Alexander Shvets, Postdoctoral Researcher, Natural Language Processing Group (TALN), Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona; Dmitry Devyatkin, Research Fellow, FRC IU RAS.
Abstract: The talk discusses the main subtasks of generating text from non-linguistic data and methods for solving them. Particular attention is paid to two subtasks: decomposing the original structured description into fragments corresponding to individual sentences (sentence packaging), and forming the discourse scheme of the text, i.e., determining the order in which information should appear. Because discourse-annotated resources are scarce, training complex models for discourse analysis is a non-trivial task; the talk presents preliminary results of experiments with pre-training discourse analysis models on a large automatically labeled text corpus. In natural language generation, researchers' attention is mostly focused on text-to-text generation; however, generating coherent texts from non-linguistic data, such as a knowledge graph or a network of linguistic annotations, is also a pressing task. Applications include generating virtual news feeds and reports from statistical information, producing weather and financial reports, and generating patient summaries to automate treatment and preventive care.
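As a toy illustration of the sentence-packaging subtask, the sketch below groups knowledge-graph triples into sentence-sized bundles with a naive shared-subject rule; the models discussed in the talk learn this segmentation rather than applying such a rule, and the triples are invented for the example.

```python
# A toy sketch of sentence packaging: bundle triples by shared subject.
from collections import defaultdict

triples = [
    ("Alan_Turing", "birth_place", "London"),
    ("Alan_Turing", "field", "computer science"),
    ("London", "country", "United Kingdom"),
]
packages = defaultdict(list)
for s, p, o in triples:
    packages[s].append((s, p, o))   # one bundle per subject => one sentence each
for subj, pack in packages.items():
    print(f"sentence about {subj}: {pack}")
```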
Seminar "Non-Autoregressive Island in Autoregressive World (Non-autoregressive language models)"
Date: March 12, 2020
Speaker: Mikhail Arkhipov, MIPT, Laboratory of Neural Systems and Deep Learning, DeepPavlov.
Abstract: The vast majority of current state-of-the-art models rely on autoregressive inference for modeling sequences. While it delivers top quality metrics, this approach has several intrinsic drawbacks, such as strictly sequential inference and exposure bias. Despite the efforts of the research community, current parallel approaches show lower quality while being, in some cases, an order of magnitude faster. In this talk, we will review approaches to parallel inference and discuss recent papers devoted to the subject (a toy sketch of one of them, Mask-Predict, follows the paper list below):
Non-Autoregressive Neural Machine Translation
Noisy parallel approximate decoding for conditional recurrent language model
Fast Decoding in Sequence Models Using Discrete Latent Variables
On the Discrepancy between Density Estimation and Sequence Generation
Mask-Predict: Parallel Decoding of Conditional Masked Language Models
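As a toy illustration of parallel decoding, here is a sketch of Mask-Predict-style iterative refinement (Ghazvininejad et al., 2019), the last paper in the list above. `model` stands for an assumed conditional masked language model that predicts all target positions in one pass; everything is simplified relative to the paper.

```python
# A toy sketch of Mask-Predict-style parallel decoding.
import numpy as np

def mask_predict(model, src, tgt_len, iterations=4, mask_id=0):
    tokens = np.full(tgt_len, mask_id)                 # start fully masked
    probs = np.zeros(tgt_len)
    for t in range(iterations):
        # One parallel pass: assumed to return numpy arrays of predicted
        # token ids and their probabilities for every target position.
        pred_tokens, pred_probs = model(src, tokens)
        tokens, probs = pred_tokens.copy(), pred_probs.copy()
        # Re-mask the least confident positions; their number decays linearly.
        n_mask = int(tgt_len * (iterations - t - 1) / iterations)
        if n_mask == 0:
            break
        worst = np.argsort(probs)[:n_mask]
        tokens[worst] = mask_id
        probs[worst] = 0.0
    return tokens
```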
Seminar "Incorporating knowledge bases into language models"
Date: February 20, 2020
Speaker: Danil Karpushkin, Sberbank AI laboratory
Abstract: The talk is devoted to methods for "injecting" prior knowledge into well-known transformer models. For this purpose, pre-built knowledge systems (knowledge bases, or KBs) are often used, whose entity structure we will try to incorporate into the models. The following papers were mentioned in the talk (a sketch of the idea they share follows the list):
ERNIE: Enhanced Language Representation with Informative Entities
ERNIE: Enhanced Representation through Knowledge Integration
KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation
Knowledge Enhanced Contextual Word Representations (aka KnowBERT)
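A common thread in these papers is fusing pretrained KB entity embeddings into the transformer's token representations at linked mentions. The sketch below shows one minimal version of that idea; the dimensions, the additive fusion rule, and all names are illustrative rather than any specific paper's recipe.

```python
# A minimal sketch of fusing entity embeddings into transformer hidden states.
import torch
import torch.nn as nn

class EntityFusion(nn.Module):
    """Adds projected entity embeddings to the hidden states of linked tokens."""
    def __init__(self, hidden_dim=768, entity_dim=200, n_entities=10000):
        super().__init__()
        self.entity_emb = nn.Embedding(n_entities, entity_dim)  # e.g., TransE vectors
        self.proj = nn.Linear(entity_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, hidden, entity_ids, entity_mask):
        # hidden:      (batch, seq, hidden_dim) transformer outputs
        # entity_ids:  (batch, seq) KB id linked to each token (0 = no entity)
        # entity_mask: (batch, seq) 1.0 where a token has a linked entity
        ent = self.proj(self.entity_emb(entity_ids))
        return self.norm(hidden + ent * entity_mask.unsqueeze(-1))

fusion = EntityFusion()
hidden = torch.randn(1, 6, 768)
entity_ids = torch.tensor([[0, 0, 42, 42, 0, 0]])   # tokens 2-3 linked to entity 42
entity_mask = (entity_ids > 0).float()
print(fusion(hidden, entity_ids, entity_mask).shape)  # torch.Size([1, 6, 768])
```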
Seminar "Emergence of language in games"
Date: February 13, 2020
Speaker: Ekaterina Artemova
Abstract: This talk will provide an overview of recent work on emergent communication. It is assumed that artificial agents are capable of developing a language by playing various cooperative games. In games of this type, agents need to collaborate to perform some task, such as guessing an object or a word, or finding a path. If a game starts from a tabula rasa setup, the agents have to communicate and thus develop their own language. We will discuss several recent papers that model different types of games and communication, as well as investigate the agents' internal representations.
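As a toy illustration of such a game, the sketch below trains a one-symbol referential game with REINFORCE: a sender describes a target object with a discrete symbol, and a receiver must pick the target among candidates. All architectures and sizes are illustrative, not taken from any of the discussed papers.

```python
# A toy referential (Lewis signaling) game trained with REINFORCE.
import torch
import torch.nn as nn

N_OBJECTS, N_SYMBOLS, N_CANDIDATES = 8, 8, 4

sender = nn.Linear(N_OBJECTS, N_SYMBOLS)            # object -> symbol logits
receiver = nn.Bilinear(N_SYMBOLS, N_OBJECTS, 1)     # (symbol, object) -> score
opt = torch.optim.Adam([*sender.parameters(), *receiver.parameters()], lr=1e-2)

for step in range(2000):
    # Sample candidate objects (one-hot) and a target among them.
    cands = torch.eye(N_OBJECTS)[torch.randint(N_OBJECTS, (N_CANDIDATES,))]
    target = torch.randint(N_CANDIDATES, (1,)).item()

    # Sender emits a discrete symbol for the target object.
    sym_dist = torch.distributions.Categorical(logits=sender(cands[target]))
    sym = sym_dist.sample()
    sym_onehot = torch.eye(N_SYMBOLS)[sym]

    # Receiver scores each candidate against the symbol and picks one.
    scores = receiver(sym_onehot.unsqueeze(0).expand(N_CANDIDATES, -1), cands)
    choice_dist = torch.distributions.Categorical(logits=scores.squeeze(-1))
    choice = choice_dist.sample()

    # Shared 0/1 reward; both policies updated by REINFORCE.
    reward = 1.0 if choice.item() == target else 0.0
    loss = -(sym_dist.log_prob(sym) + choice_dist.log_prob(choice)) * reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```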