The annual Conference on Empirical Methods in Natural Language Processing (EMNLP) takes place on November 7-11, 2021.
Among the papers presented at the conference are five written by researchers of the Faculty:
Artificial Text Detection via Examining the Topology of Attention Maps (L. Kushnareva, D. Cherniavskii, V. Mikhailov, Ekaterina Artemova, S. Barannikov, A. Bernstein, I. Piontkovskaya, D. Piontkovski, E. Burnaev)
NB-MLM: Efficient Domain Adaptation of Masked Language Models for Sentiment Analysis (Nikolay Arefyev, D. Kharchev, A. Shelmanov)
SPARQLing Database Queries from Intermediate Question Decompositions (Irina Saparina, Anton Osokin)
Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation (I. Provilkov, Andrey Malinin)
Uncertainty Measures in Neural Belief Tracking and the Effects on Dialogue Policy Performance (C. van Niekerk, Andrey Malinin, Ch. Geishauser, M. Heck, H. Lin, N. Lubis, S. Feng, M. Gašić).
We asked the authors to tell us about their research:
Modern text generation models show impressive results: they can compose a poem, change the style of a text, and even write a meaningful essay on a given topic. However, such models can also be used for malicious purposes, such as generating fake news, product reviews, and political content. Thus, a new challenge arises: learning to distinguish human-written texts from texts generated by neural language models. This is the task our paper addresses.
In this paper, we investigated the applicability of topological data analysis (TDA) techniques to the task of detecting generated sentences. We assumed that topological features derived from language models can encode the surface and structural properties of sentences needed for the task.
Generally speaking, TDA methods are rarely used in text processing. Our first result, therefore, is identifying several types of topological features: we show how to compute Betti numbers, barcodes, and distances from attention graphs to reference patterns, all derived from attention maps. Together these topological features form a vector representation, which can be regarded as an analogue of standard vector representations and used to train classifiers. Classifiers built on topological representations have two distinct advantages. First, in some cases they perform better than standard neural network classifiers. Second, they are more robust: a classifier trained to detect sentences generated by one model can also detect sentences generated by another.
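As a minimal sketch of one such feature family, the snippet below thresholds a BERT attention map into a graph and counts its connected components, i.e. the zeroth Betti number. The model name, the thresholds, and the union-find implementation are our choices for illustration; the paper's full feature set (barcodes, distances to reference patterns) is considerably richer.

```python
# Sketch: the simplest topological feature from an attention map. We
# threshold the map into an undirected graph over tokens and count
# connected components (the 0th Betti number). Illustration only; the
# paper also computes barcodes and distances to reference patterns.
import torch
from transformers import AutoTokenizer, AutoModel


def betti_0(attn: torch.Tensor, threshold: float) -> int:
    """Number of connected components of the graph with an edge (i, j)
    whenever the attention weight attn[i, j] exceeds the threshold."""
    n = attn.shape[0]
    adj = attn > threshold
    parent = list(range(n))  # union-find over tokens

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if adj[i, j] or adj[j, i]:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("A sentence to featurize.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one tensor per layer

# One feature per (layer, head, threshold); the resulting vector is fed
# to an ordinary classifier, e.g. logistic regression.
features = [
    betti_0(layer[0, head], t)
    for layer in attentions
    for head in range(layer.shape[1])
    for t in (0.01, 0.05, 0.1)
]
print(len(features), features[:6])
```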
The final part of the paper is devoted to interpreting the topological features. We show that, as expected, they successfully encode sentence length and syntactic tree depth. Overall, our paper is an interdisciplinary project at the interface of mathematics and natural language processing. We hope that our results will attract the attention of mathematicians and linguists alike and pose new research questions for both disciplines.
One of the authors of this article is Professor Dmitry Piontkovsky of the Department of Mathematics, Faculty of Economic Sciences, HSE University.
Head of the project group "Interlingual methods for polysemantic word meaning extraction" (Laboratory for Models and Methods of Computational Pragmatics)
The modern approach to training neural networks for text processing involves three stages. In the first stage, we show the neural network texts with some words hidden and teach it to guess the hidden words. This stage requires no human annotation of the training texts, so the network can be trained on terabytes of text downloaded from the Internet. In the second stage, the neural network learns the same thing, but from texts in the target domain (for example, movie reviews), so that it adapts to the kind of text it will later work with. In the third stage, the neural network learns to solve the target task (for example, distinguishing positive reviews from negative ones) from human-labeled examples, of which there are usually relatively few.
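As a quick illustration of the first-stage objective, a pretrained model can already guess hidden words out of the box; the model choice and example sentence below are ours, not the paper's.

```python
# A pretrained masked language model guessing a hidden word
# (illustration of the stage-one objective; model choice is ours).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill_mask("The acting was great but the plot was [MASK]."):
    print(f"{guess['token_str']:>12}  p={guess['score']:.3f}")
```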
In previous approaches, the words the neural network learned to guess in the first two stages were selected from the texts at random, and most of them are simply function words with no relation to the target task. In our paper, we propose teaching the model to guess primarily words related to the target task (for example, positive and negative characteristics of movies). Already at the adaptation stage, this focuses the network's resources on detecting the features relevant to the target task, which speeds up adaptation and improves the quality of the final model. Experiments show that the proposed approach is especially effective when adapting to large collections of texts.
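A minimal sketch of the idea, under our own simplifications: score each word by a Naive Bayes log-count ratio computed from a handful of labeled reviews, then mask high-scoring words more often during domain adaptation. The toy data and the mapping from scores to masking probabilities are ours, not the exact recipe from the paper.

```python
# Sketch: bias MLM masking toward task-relevant words. Words that
# separate positive from negative reviews (high Naive Bayes log-count
# ratio) get masked more often; function words get masked less.
import math
import random
from collections import Counter

pos_texts = ["a great touching film", "great cast and a moving story"]
neg_texts = ["a dull film with awful pacing", "awful script and dull jokes"]

pos_counts = Counter(w for t in pos_texts for w in t.split())
neg_counts = Counter(w for t in neg_texts for w in t.split())
vocab = set(pos_counts) | set(neg_counts)


def nb_score(word: str) -> float:
    """Absolute Naive Bayes log-count ratio: high for sentiment-bearing
    words, near zero for function words like 'a' or 'and'."""
    p = (pos_counts[word] + 1) / (sum(pos_counts.values()) + len(vocab))
    q = (neg_counts[word] + 1) / (sum(neg_counts.values()) + len(vocab))
    return abs(math.log(p / q))


def masking_probs(words, base=0.15):
    """Per-word masking probabilities proportional to the NB score,
    rescaled so their mean stays at the usual MLM rate of 15%."""
    scores = [nb_score(w) for w in words]
    mean = sum(scores) / len(scores) or 1.0
    return [min(1.0, base * s / mean) for s in scores]


words = "a great film with a dull script".split()
for w, p in zip(words, masking_probs(words)):
    masked = "[MASK]" if random.random() < p else w
    print(f"{w:>8}  p_mask={p:.2f}  -> {masked}")
```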
We consider the task of translating a natural-language question about a database into an executable query. Solving this problem would make it possible to work with databases without knowing query languages.
Many studies focus on generating SQL queries. Most often, neural network models are used, trained on data consisting of databases, questions about them, and the corresponding correct queries. However, such annotation is difficult to collect, as annotators must know the query language (e.g. SQL) to write the queries.
In our work, we wanted to eliminate the need for annotated queries during training while retaining the model's ability to generate executable queries. To do this, we used intermediate question representations, which, unlike full queries, can be crowdsourced.
Our system consists of two components: one generates intermediate question representations, and the other translates these representations into SPARQL queries. Importantly, only the first component is implemented as a neural network, and that network is trained on easy-to-collect annotations. The result is a system that performs on par with the best existing SQL generation methods while being less demanding of annotation.
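A toy sketch of the second, non-neural component, assuming a made-up two-field decomposition format and schema names of our own invention; the actual system translates QDMR-style question decompositions and supports far more operations.

```python
# Toy illustration of deterministically translating an intermediate
# question decomposition into SPARQL. The tiny grammar and the schema
# names here are ours; the real system handles QDMR-style decompositions.

def decomposition_to_sparql(steps):
    """steps: list of (operation, argument) pairs, e.g. what the neural
    first component might produce for 'names of players older than 30'."""
    where, target = [], None
    for op, arg in steps:
        if op == "select":                    # pick the set of entities
            target = "?x"
            where.append(f"?x a :{arg} .")
        elif op == "filter":                  # keep entities passing a test
            prop, cmp, value = arg
            where.append(f"?x :{prop} ?v . FILTER(?v {cmp} {value})")
        elif op == "project":                 # return an attribute of them
            where.append(f"?x :{arg} ?out .")
            target = "?out"
    return "SELECT " + target + " WHERE { " + " ".join(where) + " }"


steps = [("select", "Player"),
         ("filter", ("age", ">", 30)),
         ("project", "name")]
print(decomposition_to_sparql(steps))
# SELECT ?out WHERE { ?x a :Player . ?x :age ?v . FILTER(?v > 30) ?x :name ?out . }
```

Because this second step is rule-based, any representation the neural component emits is guaranteed to translate into a syntactically valid, executable query.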