About

The Centre for Language and Semantic Technologies is part of the HSE Faculty of Computer Science. It was created to address problems in natural language processing and to develop semantic technologies based on both interpretable artificial intelligence methods and modern machine learning models.

The centre's main objectives are:


1. developing and advancing interpretable machine learning and data mining methods for NLP and recommender systems

2. developing models that enhance the functionality of existing large language models by leveraging additional resources: linguistic models, knowledge models, search models, and planning algorithms

3. developing models and methods for automatic knowledge acquisition using large language models (LLMs), including methods for transfer learning between different languages and different tasks

4. developing models and methods for research, modelling, and analysis within the framework of complex systems theory

5. developing semantic analysis tools based on mathematical methods of formal concept analysis

Structure

International Laboratory of Intelligent Systems and Structural Analysis

We conduct research that enables the integration of structural and neural network representations in applied data analysis tasks

Laboratory of Models and Methods of Computational Pragmatics

We work on natural language processing (NLP), interpretable machine learning, and data mining; develop recommender systems and services; and advance multimodal clustering and classification methods that enable the creation of user interest profiles across multiple modalities

Laboratory of Complex Systems Modelling and Control

We conduct fundamental and applied scientific research in the mathematical modelling of complex systems: we study synchronisation phenomena, sudden regime changes, quasi-regularities, and self-organisation; evaluate the effectiveness of rare-event forecasting algorithms; and develop methods for controlling complex systems

Semantics Analysis Laboratory

We study natural language as a whole within the natural science paradigm, using methods of computer science and applied mathematics

Management

Sergei Kuznetsov

Director of the Centre, Doctor of Sciences, Professor

Marina Zhelyazkova

Deputy Director of the Centre, Candidate of Sciences

Publications

  • Data Analytics and Management in Data Intensive Domains: 25th International Conference, DAMDID/RCDL 2023, Moscow, Russia, October 24–27, 2023, Revised Selected Papers

    This book constitutes the post-conference proceedings of the 25th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2023, held in Moscow, Russia, on October 24–27, 2023.


    The 21 papers presented here were carefully reviewed and selected from 75 submissions. These papers are organized in the following topical sections: Data Models and Knowledge Graphs; Databases in Data Intensive Domains; Machine Learning Methods and Applications; Data Analysis in Astronomy; and Information Extraction from Text. Papers from keynote talks have also been included in this book.


    Vol. 2086: Communications in Computer and Information Science. Springer, 2024.

  • Free energy of neural network can predict accuracy after pruning

    Neural networks are powerful tools capable of achieving state-of-the-art performance across a wide range of tasks; however, their effectiveness often comes at the cost of extremely large numbers of parameters, which can hinder their deployment in resource-constrained environments. To address this issue, various pruning techniques have been proposed to reduce model size and complexity while preserving performance. In this study, we first propose a thermodynamic perspective for analyzing the behavior of neural networks during the pruning process based on magnitude-based weight pruning. Second, we demonstrate that by employing the thermodynamic concept of free energy, the selection procedure for the pruning level can be significantly simplified and accelerated. Thus, in this work, we propose a fast method for selecting the pruning threshold by computing the network’s free energy. We evaluate our method on classification tasks in the domains of natural language processing and computer vision, considering models such as multilayer perceptrons (MLP), encoder–decoder transformers, encoder-only transformers, pretrained transformers, VGG, ResNet, and DenseNet. Experimental results demonstrate that our approach provides a good approximation of the optimal pruning threshold for MLP and transformer-based models while significantly reducing the computational time (at least 70 times) compared to evaluating model accuracy.

    Physica A: Statistical Mechanics and its Applications. 2025. Vol. 681. P. 1-16.
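    To fix ideas, here is a minimal, hypothetical sketch of only the baseline operation the paper builds on, magnitude-based weight pruning, with the pruning level supplied by hand. The free-energy criterion for selecting that level is the paper's own contribution and is not reproduced here; all names below are illustrative.

        # Hedged sketch: magnitude-based pruning only; the paper's free-energy
        # threshold selection is NOT implemented here.
        import numpy as np

        def magnitude_prune(weights: np.ndarray, prune_level: float) -> np.ndarray:
            """Zero out the fraction `prune_level` of weights with smallest magnitude."""
            flat = np.abs(weights).ravel()
            k = int(prune_level * flat.size)
            if k == 0:
                return weights.copy()
            threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
            return weights * (np.abs(weights) > threshold)

        # toy usage: prune 70% of a random weight matrix
        rng = np.random.default_rng(0)
        w = rng.normal(size=(256, 256))
        w_pruned = magnitude_prune(w, prune_level=0.7)
        print(f"sparsity: {np.mean(w_pruned == 0):.2f}")  # ~0.70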

  • Book chapter

    Tariq W., Popov V., Gromov V.

    Building a Clean Bartangi Language Corpus and Training Word Embeddings for Low-Resource Language Modeling

    In this paper, we showcase a comprehensive end-to-end pipeline for creating a superior Bartangi language corpus and using it for training word embeddings. The critically low-resource Pamiri language of Bartangi, which is spoken in Tajikistan, has difficulties such as morphological complexity, orthographic variety, and a lack of data. In order to overcome these obstacles, we gathered a raw corpus of roughly 6,550 phrases, used the Uniparser-Morph-Bartangi morphological analyzer for linguistically accurate lemmatization, and implemented a thorough cleaning procedure to eliminate noise and ensure proper tokenization. The resulting lemmatized corpus greatly lowers word sparsity and raises the standard of linguistic analysis. The processed corpus was then used to train two different Word2Vec models, Skipgram and CBOW, with a vector size of 100, a context window of 5, and a minimum frequency threshold of 1. The resulting word embeddings were visualised using dimensionality reduction techniques such as PCA (Pearson, 1901) and t-SNE (van der Maaten and Hinton, 2008), and assessed using intrinsic methods such as nearest-neighbor similarity tests. Our tests show that meaningful semantic representations can be obtained even from tiny datasets by combining informed morphological analysis with clean preprocessing. One of the earliest computational datasets for Bartangi, this resource serves as a vital basis for upcoming NLP tasks, such as language modeling, semantic analysis, and low-resource machine translation. To promote more research in Pamiri and other under-represented languages, we make the corpus, lemmatizer pipeline, and trained embeddings publicly available.

    In bk.: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2025). Shumen: INCOMA Ltd, 2025. P. 1256-1262.
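    As an illustration of the embedding-training step described above, here is a hedged sketch using gensim's Word2Vec with the hyperparameters stated in the abstract (vector size 100, window 5, minimum frequency 1). The corpus file name and its one-sentence-per-line format are assumptions, not details from the paper.

        from gensim.models import Word2Vec

        # assumption: lemmatized corpus stored one sentence per line,
        # with space-separated tokens
        with open("bartangi_lemmatized.txt", encoding="utf-8") as f:
            sentences = [line.split() for line in f if line.strip()]

        # hyperparameters as stated in the abstract; sg selects Skipgram (1) or CBOW (0)
        skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
        cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

        # intrinsic evaluation: nearest neighbours of a query lemma
        query = sentences[0][0]  # pick some token that is present in the corpus
        print(skipgram.wv.most_similar(query, topn=5))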

  • Working paper

    Mirkin B., Parinov A., Halynchyk M. et al.

    Versions of least-squares k-means algorithm for interval data

    Recently, k-means clustering has been extended to so-called interval data. In contrast to the conventional data case, interval data feature values are intervals rather than single reals. This paper further explores the least-squares criterion for k-means clustering to tackle the issue of initialization, that is, finding a proper set of initial cluster centers for interval data clustering. Specifically, we extend, for interval data, a Pythagorean decomposition of the data scatter into the sum of two items, one being a genuine k-means least-squares criterion, the other a complementary criterion requiring the clusters to be numerous and anomalous. We therefore propose a method for obtaining anomalous clusters one by one. After a run of the method, we start k-means iterations from the centers of the most numerous of the found anomalous clusters. We test and validate our proposed BIKM algorithm on versions of two newly introduced interval datasets.

    Mathematical Methods of Decision Analysis in Economics, Business and Politics. WP7. HSE Publishing House, 2024
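    A minimal sketch of least-squares k-means on interval data may help fix ideas: each feature value is a pair (lower, upper), the squared distance sums the squared gaps of both endpoints, and centers are componentwise means. The paper's actual contribution, the anomalous-cluster (BIKM) initialization, is not reproduced; this sketch falls back to random initial centers.

        import numpy as np

        def interval_kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
            """Least-squares k-means for interval data.

            X has shape (n, m, 2): X[i, j] = (lower, upper) bound of feature j
            for object i. Returns cluster labels and interval-valued centers.
            """
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), size=k, replace=False)]  # random init, not BIKM
            for _ in range(n_iter):
                # squared endpoint-wise distances of every object to every center
                d = ((X[:, None] - centers[None]) ** 2).sum(axis=(2, 3))
                labels = d.argmin(axis=1)
                new_centers = centers.copy()
                for c in range(k):
                    members = X[labels == c]
                    if len(members):  # keep the old center if a cluster empties
                        new_centers[c] = members.mean(axis=0)
                if np.allclose(new_centers, centers):
                    break
                centers = new_centers
            return labels, centers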

All publications