The centre's main objectives are:
1. developing and advancing interpretable machine learning and data mining methods for NLP and recommender systems
2. developing models that enhance the functionality of existing large language models by leveraging additional resources: linguistic models, knowledge models, search models, and planning algorithms
3. developing models and methods for automatic knowledge acquisition using large language models (LLMs), including methods for transfer learning between different languages and different tasks
4. developing models and methods for research, modelling, and analysis within the framework of complex systems theory
5. developing semantic analysis tools based on mathematical methods in formal concept theory
Structure
International Laboratory of Intelligent Systems and Structural Analysis
We conduct research that enables the integration of structural and neural network representations in applied data analysis tasks
Laboratory of Models and Methods of Computational Pragmatics
We work on natural language processing (NLP), interpretable machine learning, and data mining, develop recommender systems and services, and advance multimodal clustering and classification methods that enable the creation of user interest profiles across multiple modalities
Laboratory of Complex Systems Modelling and Control
We conduct fundamental and applied research in the mathematical modelling of complex systems: we study synchronisation phenomena, sudden regime changes, quasi-regularities, and self-organisation, evaluate the effectiveness of rare-event forecasting algorithms, and work on the control of complex systems
Semantics Analysis Laboratory (in Russian)
Study of natural language as a whole within the natural science paradigm using methods of computer science and applied mathematics
Management
Director of the Centre, Doctor of Sciences, Professor
Deputy Director of the Centre, Candidate of Sciences
Publications
-
Book
Data Analytics and Management in Data Intensive Domains: 25th International Conference, DAMDID/RCDL 2023, Moscow, Russia, October 24–27, 2023, Revised Selected Papers
This book constitutes the post-conference proceedings of the 25th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2023, held in Moscow, Russia, during 24-27 October 2023.
The 21 papers presented here were carefully reviewed and selected from 75 submissions. These papers are organized in the following topical sections: Data Models and Knowledge Graphs; Databases in Data Intensive Domains; Machine learning methods and applications; Data Analysis in Astronomy & Information extraction from text. Papers from keynote talks have also been included in this book.
Vol. 2086: Communications in Computer and Information Science. Springer, 2024.
-
Article
SynEL: A synthetic benchmark for entity linking
Large language models (LLMs) offer significant potential for constructing commonsense knowledge graphs from text, demonstrating adaptability across diverse domains. However, their effectiveness varies significantly with domain-specific language, highlighting a critical need for specialized benchmarks to assess and optimize knowledge graph construction sub-tasks like named entity recognition, relation extraction, and entity linking. Currently, domain-specific benchmarks are scarce. To address this gap, we introduce SynEL, a novel benchmark developed for evaluating text-based knowledge extraction methods, validated using customer support dialogues. We present a comprehensive methodology for benchmark construction, propose two distinct approaches for generating synthetic datasets, and evaluate accumulated hallucinations. Our experiments reveal that existing LLMs experience a significant performance drop, with micro-F1 scores decreasing by up to 25 absolute points when extracting low-resource entities compared to high-resource entities from sources like Wikipedia. Furthermore, by incorporating synthetic datasets into the training process, we achieved an improvement in micro-F1 scores of up to 10 absolute points. We publicly release our benchmark and generation code to demonstrate its utility for fine-tuning and evaluating LLMs.
PLOS One. 2026. Vol. 1. No. 1. P. 1-18.
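For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how a micro-averaged F1 score for entity linking can be computed, and what a difference in "absolute points" means. It is not the authors' evaluation code: the function name `micro_f1`, the representation of annotations as sets of (mention, entity id) pairs, and the toy data are all assumptions made for illustration.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over documents.

    gold, pred: lists of sets of (mention, entity_id) pairs, one set per
    document (a hypothetical representation chosen for this sketch).
    Counts are pooled over all documents before computing precision/recall.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # links predicted and present in gold
        fp += len(p - g)   # links predicted but not in gold
        fn += len(g - p)   # gold links that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration only).
gold = [{("Paris", "Q90"), ("Seine", "Q1471")}, {("router", "E1")}]
pred = [{("Paris", "Q90"), ("Seine", "Q2")}, {("router", "E1"), ("modem", "E2")}]

score = micro_f1(gold, pred)          # tp=2, fp=2, fn=1, so F1 = 4/7
perfect = micro_f1(gold, gold)        # 1.0
# Difference in absolute points on a 0-100 scale.
drop_points = (perfect - score) * 100
print(f"micro-F1 = {score:.3f}, drop = {drop_points:.1f} absolute points")
```

Pooling the counts across documents before dividing is what distinguishes micro-averaging from macro-averaging, where a score is computed per document (or per class) and then averaged.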
-
Book chapter
KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines
We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts – each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities – the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at https://github.com/Humor-Research/KoWit-24
In bk.: Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing. Shumen: INCOMA Ltd, 2025. P. 125-132.
-
Working paper
Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset
Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image segmentation, but their development depends on well-annotated training datasets. However, there is a notable lack of publicly available MRA datasets with detailed brain vessel annotations. To address this gap, we propose HessNet, a novel lightweight semi-supervised neural network that incorporates Hessian matrices for 3D segmentation of complex structures such as tubular structures. The network has only 6000 parameters, can run on the CPU, and significantly reduces the resource requirements for training neural networks. Its vessel segmentation accuracy on a minimal training dataset reaches state-of-the-art results. HessNet helped us create a large, semi-manually annotated brain vessel dataset of brain MRA images based on the IXI dataset (200 annotated images). Annotation was performed by three experts under the supervision of three neurovascular surgeons after applying HessNet. It provides high vessel segmentation accuracy and allows experts to focus only on the most complex and important cases. The dataset is available at https://git.scinalytics.com/terilat/VesselDatasetPartly.
Statistical mechanics. arXiv, 2025