The Lab holds invited talks on NLP, recommender systems, data mining, and related topics twice a month. For more details, see the Russian page.
The seminar “Automatic Processing and Analysis of Texts” is dedicated to various text processing tasks (tokenization, segmentation recovery, part-of-speech tagging, and syntactic parsing) and to the analysis of textual information (information extraction, construction and use of knowledge graphs, construction of question-answering systems, text classification, etc.).
Online Seminar “Matrix and tensor decompositions in natural language processing problems”
Date: July 15, 2021
Speaker: Alexey Grinchuk
He graduated from MIPT with a bachelor's degree in 2015 and completed the master's program at MIPT and Skoltech in 2017. Since 2017, he has been a graduate student at MIPT, applying matrix and tensor decompositions to various natural language processing (NLP) problems under the supervision of I.V. Oseledets. Since 2020, he has been working as a lead engineer at NVIDIA, focusing on speech recognition and machine translation.
Abstract: This work proposes methods for solving various natural language processing problems using matrix and tensor decompositions. A method for constructing word embeddings based on Riemannian optimization in the space of low-rank matrices is proposed. A mathematical model of word embeddings based on the tensor train decomposition is proposed, which requires fewer parameters than the classical representation as a dense matrix. A generalization of tensor neural networks is proposed, which makes it possible to analyze recurrent and fully connected networks with various nonlinearities between layers. A theoretical analysis of the generalization ability and expressive power of generalized recurrent tensor networks with ReLU-type nonlinearities is carried out.
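To make the parameter savings mentioned in the abstract concrete, here is a minimal sketch of a tensor-train (TT) factorized embedding matrix. It is not the speaker's implementation: the function names, the factorization of the vocabulary size and embedding dimension, and the TT rank are all assumptions chosen for illustration.

```python
import numpy as np

def tt_embedding_params(vocab_factors, dim_factors, rank):
    """Count parameters of a TT-factorized embedding matrix.

    Core k has shape (r_k, I_k, J_k, r_{k+1}), with boundary ranks equal to 1.
    All inner ranks are set to the same (assumed) value `rank`.
    """
    d = len(vocab_factors)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * vocab_factors[k] * dim_factors[k] * ranks[k + 1]
               for k in range(d))

def tt_embedding_row(cores, index_digits):
    """Reconstruct a single embedding vector from the TT cores.

    `index_digits` is the word index written in the mixed radix given by
    the vocabulary factors (see np.unravel_index below).
    """
    out = np.ones((1, 1))  # shape: (product of J_k so far, current rank)
    for core, i in zip(cores, index_digits):
        piece = core[:, i, :, :]  # shape: (r_k, J_k, r_{k+1})
        out = np.einsum("ar,rjs->ajs", out, piece).reshape(-1, piece.shape[-1])
    return out.reshape(-1)

# Hypothetical sizes: vocabulary 32*32*32 = 32768 words, dimension 8*8*8 = 512.
vocab_factors, dim_factors, rank = (32, 32, 32), (8, 8, 8), 16
dense_params = int(np.prod(vocab_factors)) * int(np.prod(dim_factors))
tt_params = tt_embedding_params(vocab_factors, dim_factors, rank)

# Random cores stand in for trained ones.
rng = np.random.default_rng(0)
ranks = [1, rank, rank, 1]
cores = [rng.normal(size=(ranks[k], vocab_factors[k], dim_factors[k], ranks[k + 1]))
         for k in range(3)]
word_id = 12345
digits = np.unravel_index(word_id, vocab_factors)
vector = tt_embedding_row(cores, digits)
```

With these assumed sizes the dense matrix has 16,777,216 entries, while the three TT cores together hold 73,728, and a single word vector is recovered by multiplying one slice from each core.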
Online Seminar “RuSentEval: diagnostic testing of language models in Russian”
Date: May 27, 2021
Speakers: Vladislav Mikhailov (Sberbank), Ekaterina Taktasheva (HSE), Elina Sigdal (HSE).
Abstract: RuSentEval is a new dataset for diagnostic testing (probing) of embedding and language models for the Russian language. It includes 14 datasets that cover various linguistic phenomena - from surface-level (number of words in a sentence) to syntactic (depth of the syntactic tree) and semantic (number and gender of the subject). The classic method of diagnostic testing is to train a classifier that predicts the presence of a particular phenomenon from a sentence vector. The behavior of the classifier can show, for example, which layers of the language model are more sensitive to low-level features and which are more sensitive to high-level features.
In our work, we used data from RuSentEval and SentEval (English) to conduct diagnostic testing of five multilingual transformers - including mBERT, mBART and LaBSE - and found that the models have a similar understanding of some features for both languages, despite their typological differences. However, mBART and LaBSE differ from the others (read how exactly in the article).
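The probing setup described in the abstract can be sketched as follows. This is an illustration only, not the RuSentEval code: the sentence vectors are synthetic, the two "layers" are simulated, and all function and variable names are hypothetical. One layer linearly encodes a binary property, the other is pure noise, and a logistic-regression probe is trained on each.

```python
import numpy as np

def train_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X_train, y_train, X_test, y_test):
    """Train a probe on one layer's vectors and report test accuracy."""
    w, b = train_probe(X_train, y_train)
    pred = (X_test @ w + b) > 0
    return float((pred == y_test).mean())

rng = np.random.default_rng(0)
n, dim = 400, 16
y = rng.integers(0, 2, size=n)  # synthetic binary linguistic property

# Simulated layer with no information about the property.
layer_noise = rng.normal(size=(n, dim))

# Simulated layer that encodes the property along one random direction.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
layer_informative = rng.normal(size=(n, dim)) + 2.0 * np.outer(2 * y - 1, direction)

split = 300  # first 300 sentences for training, the rest for testing
acc_informative = probe_accuracy(layer_informative[:split], y[:split],
                                 layer_informative[split:], y[split:])
acc_noise = probe_accuracy(layer_noise[:split], y[:split],
                           layer_noise[split:], y[split:])
print(f"informative layer: {acc_informative:.2f}, noise layer: {acc_noise:.2f}")
```

A probe that scores well above chance on one layer and near chance on another suggests that the first layer carries the property in a linearly accessible form. In practice the vectors would come from the layers of a real model rather than from a random generator.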
Online Seminar "Use of definitions in multilingual classification of semantic proximity of word occurrences and detection of semantic shifts of words for the Russian language"
Date: April 7, 2021
Speaker: Maxim Rachinsky, research assistant at the National Laboratory of Models and Methods of Computational Pragmatics.
Abstract: Referring to definitions from a dictionary is a familiar way for a person to find out what meanings a particular word has. We assume that a system that can select from a dictionary or glossary the correct definition for a specific word occurrence can also naturally solve the problems of classifying word occurrences by semantic proximity and detecting semantic shifts. This definition-based system took first place in the RuShiftEval competition.
Online Seminar "Four Dialogue Evaluation 2021 competitions"
Date: February 25, 2021
The seminar will present the Dialogue Evaluation 2021 competitions. We will discuss the problem statements behind each competition and present baseline approaches to solving them. Participants in each competition will be able to submit a paper based on their results to the Dialogue conference.
The RuNormAS (Russian Normalization of Annotated Spans) competition poses a normalization problem: bringing a part of the text (a named entity or a phrase) into its normal (initial) form. The main challenge is to correctly normalize the necessary words in the group without changing the rest (dependents, etc.), and to use the context correctly. The latter is especially important, since the initial form of many words can only be determined in context - for example, the word “Ivanova”, depending on the surrounding context, can have either “Ivanova” or “Ivanov” as its normal form.
Ivan Smurov, ABBYY, MIPT
Clustering, selection and generation of news headlines.
The goal of the competition is to collect and compare approaches to clustering news and selecting the best headline for each resulting cluster. News clustering is quite challenging for modern models, which makes it a good benchmark. In addition, text clustering is a common task in industry. Selecting or generating the best headline is its logical continuation.
Ilya Gusev, MIPT
Our competition is an opportunity to work with a semantic sketch, an object that gives a visual representation of the semantics of a word and its compatibility. The goal of the competition is to evaluate how illustrative the sketches are by trying to predict, based on the context of a word, the corresponding sketch from a given set of words.
Maria Ponomareva, ABBYY, HSE
The task of text simplification has several formulations, from which we choose the most popular: simplification at the sentence level. In this formulation, the task is to produce a simplified sentence from a complex one.
Ekaterina Artemova, HSE