The annual Conference on Empirical Methods in Natural Language Processing (EMNLP) takes place on November 7-11, 2021.
Among the papers presented at the conference are five written by researchers of the Faculty:
Artificial Text Detection via Examining the Topology of Attention Maps (L. Kushnareva, D. Cherniavskii, V. Mikhailov, Ekaterina Artemova, S. Barannikov, A. Bernstein, I. Piontkovskaya, D. Piontkovski, E. Burnaev)
NB-MLM: Efficient Domain Adaptation of Masked Language Models for Sentiment Analysis (Nikolay Arefyev, D. Kharchev, A. Shelmanov)
SPARQLing Database Queries from Intermediate Question Decompositions (Irina Saparina, Anton Osokin)
Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation (I. Provilkov, Andrey Malinin)
Uncertainty Measures in Neural Belief Tracking and the Effects on Dialogue Policy Performance (C. van Niekerk, Andrey Malinin, Ch. Geishauser, M. Heck, H. Lin, N. Lubis, S. Feng, M. Gašić).
We asked the authors to tell us about their research:
Modern text generation models show impressive results: they can compose a poem, change the style of a text, and even write a meaningful essay on a given topic. However, such models can also be used for malicious purposes, such as generating fake news, product reviews, and political content. Thus, a new challenge arises: learning to distinguish human-written texts from texts generated by neural language models. This is the task our paper addresses.
In this paper, we investigated the applicability of topological data analysis (TDA) techniques to the task of detecting generated sentences. We assumed that topological features derived from language models can encode the surface and structural properties of sentences needed for the task.
Generally speaking, TDA methods are rarely used in text processing. Our first result, therefore, is identifying several types of topological features: we show how to compute Betti numbers, barcodes, and distances from attention graphs to reference patterns, all derived from attention maps. Together these topological features form a vector representation, which can be regarded as an analogue of standard vector representations and used to train classifiers. Classifiers built on topological representations have two distinct advantages. First, in some cases they perform better than standard neural network classifiers. Second, they are more robust: a classifier trained to detect sentences generated by one model can also detect sentences generated by another.
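As a minimal sketch of one such feature family, the snippet below thresholds a BERT attention map into a graph and counts its connected components, i.e. the zeroth Betti number. The model name, the thresholds, and the union-find implementation are our choices for illustration; the paper's full feature set (barcodes, distances to reference patterns) is considerably richer.

```python
# Sketch: the simplest topological feature from an attention map. We
# threshold the map into an undirected graph over tokens and count
# connected components (the 0th Betti number). Illustration only; the
# paper also computes barcodes and distances to reference patterns.
import torch
from transformers import AutoTokenizer, AutoModel


def betti_0(attn: torch.Tensor, threshold: float) -> int:
    """Number of connected components of the graph with an edge (i, j)
    whenever the attention weight attn[i, j] exceeds the threshold."""
    n = attn.shape[0]
    adj = attn > threshold
    parent = list(range(n))  # union-find over tokens

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if adj[i, j] or adj[j, i]:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("A sentence to featurize.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one tensor per layer

# One feature per (layer, head, threshold); the resulting vector is fed
# to an ordinary classifier, e.g. logistic regression.
features = [
    betti_0(layer[0, head], t)
    for layer in attentions
    for head in range(layer.shape[1])
    for t in (0.01, 0.05, 0.1)
]
print(len(features), features[:6])
```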
The final part of the paper is devoted to interpreting the topological features. We show that, as expected, they successfully encode sentence length and syntactic tree depth. Overall, our paper is an interdisciplinary project at the interface of mathematics and natural language processing. We hope that our results will attract the attention of mathematicians and linguists alike and pose new research questions for both disciplines.
One of the authors of this article is Professor Dmitry Piontkovsky of the Department of Mathematics, Faculty of Economic Sciences, HSE University.
Head of the project group "Interlingual methods for polysemantic word meaning extraction" (Laboratory for Models and Methods of Computational Pragmatics)
The modern approach to training neural networks for text processing involves three stages. In the first stage, we show the neural network texts with some words hidden and teach it to guess the hidden words. This stage requires no human annotation of the training texts, so the network can be trained on terabytes of text downloaded from the Internet. In the second stage, the neural network learns the same thing, but from texts in the target domain (for example, movie reviews), so that it adapts to the kind of text it will later work with. In the third stage, the neural network learns to solve the target task (for example, distinguishing positive reviews from negative ones) from human-labeled examples, of which there are usually relatively few.
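As a quick illustration of the first-stage objective, a pretrained model can already guess hidden words out of the box; the model choice and example sentence below are ours, not the paper's.

```python
# A pretrained masked language model guessing a hidden word
# (illustration of the stage-one objective; model choice is ours).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill_mask("The acting was great but the plot was [MASK]."):
    print(f"{guess['token_str']:>12}  p={guess['score']:.3f}")
```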
In previous approaches, the words the neural network learned to guess in the first two stages were selected from the texts at random, and most of them are simply function words with no relation to the target task. In our paper, we propose teaching the model to guess primarily words related to the target task (for example, positive and negative characteristics of movies). Already at the adaptation stage, this focuses the network's resources on detecting the features relevant to the target task, which speeds up adaptation and improves the quality of the final model. Experiments show that the proposed approach is especially effective when adapting to large collections of texts.
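A minimal sketch of the idea, under our own simplifications: score each word by a Naive Bayes log-count ratio computed from a handful of labeled reviews, then mask high-scoring words more often during domain adaptation. The toy data and the mapping from scores to masking probabilities are ours, not the exact recipe from the paper.

```python
# Sketch: bias MLM masking toward task-relevant words. Words that
# separate positive from negative reviews (high Naive Bayes log-count
# ratio) get masked more often; function words get masked less.
import math
import random
from collections import Counter

pos_texts = ["a great touching film", "great cast and a moving story"]
neg_texts = ["a dull film with awful pacing", "awful script and dull jokes"]

pos_counts = Counter(w for t in pos_texts for w in t.split())
neg_counts = Counter(w for t in neg_texts for w in t.split())
vocab = set(pos_counts) | set(neg_counts)


def nb_score(word: str) -> float:
    """Absolute Naive Bayes log-count ratio: high for sentiment-bearing
    words, near zero for function words like 'a' or 'and'."""
    p = (pos_counts[word] + 1) / (sum(pos_counts.values()) + len(vocab))
    q = (neg_counts[word] + 1) / (sum(neg_counts.values()) + len(vocab))
    return abs(math.log(p / q))


def masking_probs(words, base=0.15):
    """Per-word masking probabilities proportional to the NB score,
    rescaled so their mean stays at the usual MLM rate of 15%."""
    scores = [nb_score(w) for w in words]
    mean = sum(scores) / len(scores) or 1.0
    return [min(1.0, base * s / mean) for s in scores]


words = "a great film with a dull script".split()
for w, p in zip(words, masking_probs(words)):
    masked = "[MASK]" if random.random() < p else w
    print(f"{w:>8}  p_mask={p:.2f}  -> {masked}")
```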
We consider the task of translating a natural-language question about a database into an executable query. Solving this problem would make it possible to work with databases without knowing query languages.
Many studies focus on generating SQL queries. Most often, neural network models are used, trained on data consisting of databases, questions about them, and the corresponding correct queries. However, such annotation is difficult to collect, as annotators must know the query language (e.g. SQL) to write the queries.
In our work, we wanted to eliminate the need for annotated queries during training while retaining the model's ability to generate executable queries. To do this, we used intermediate question representations, which, unlike full queries, can be crowdsourced.
Our system consists of two components: one generates intermediate question representations, and the other translates these representations into SPARQL queries. Importantly, only the first component is implemented as a neural network, and that network is trained on easy-to-collect annotations. The result is a system that performs on par with the best existing SQL generation methods while being less demanding of annotation.
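A toy sketch of the second, non-neural component, assuming a made-up two-field decomposition format and schema names of our own invention; the actual system translates QDMR-style question decompositions and supports far more operations.

```python
# Toy illustration of deterministically translating an intermediate
# question decomposition into SPARQL. The tiny grammar and the schema
# names here are ours; the real system handles QDMR-style decompositions.

def decomposition_to_sparql(steps):
    """steps: list of (operation, argument) pairs, e.g. what the neural
    first component might produce for 'names of players older than 30'."""
    where, target = [], None
    for op, arg in steps:
        if op == "select":                    # pick the set of entities
            target = "?x"
            where.append(f"?x a :{arg} .")
        elif op == "filter":                  # keep entities passing a test
            prop, cmp, value = arg
            where.append(f"?x :{prop} ?v . FILTER(?v {cmp} {value})")
        elif op == "project":                 # return an attribute of them
            where.append(f"?x :{arg} ?out .")
            target = "?out"
    return "SELECT " + target + " WHERE { " + " ".join(where) + " }"


steps = [("select", "Player"),
         ("filter", ("age", ">", 30)),
         ("project", "name")]
print(decomposition_to_sparql(steps))
# SELECT ?out WHERE { ?x a :Player . ?x :age ?v . FILTER(?v > 30) ?x :name ?out . }
```

Because this second step is rule-based, any representation the neural component emits is guaranteed to translate into a syntactically valid, executable query.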