Large volumes of demographic data are now available and serve as sources for extracting demographic sequences that domain experts can further analyse and interpret. Since traditional statistical methods cannot meet the emerging needs of demography, we use modern methods of pattern mining and machine learning to achieve better results. In particular, our collaborators, the demographers, are interested in two main problems: predicting the next event in a personal life trajectory and finding patterns of demographic events that are interesting with respect to the gender feature.
The main goal of this paper is to compare the accuracy of different methods on these tasks. We considered interpretable methods, such as decision trees, alongside semi- and non-interpretable methods, such as SVMs with custom kernels and neural networks. The best accuracy is obtained with a two-channel convolutional neural network. All obtained results and discovered patterns are passed to the demographers for further investigation.
Deep learning is a term used to describe artificial intelligence (AI) technologies. AI deals with how computers can solve complex problems in the same way that humans do. Computer vision (CV) and natural language processing (NLP) stand out as the largest AI areas. To imitate human vision and the human ability to express meaning and feelings through language, deep learning exploits artificial neural networks trained on real-life evidence.
While most vision-related tasks are solved using common methods nearly irrespective of target domains, NLP methods strongly depend on the properties of a given language. Linguistic diversity complicates deep learning for NLP. This chapter focuses on deep learning applications to processing the Russian language.
Today, increased attention is drawn to network representation learning, a technique that maps the nodes of a network into vectors in a low-dimensional embedding space. A network embedding constructed this way aims to preserve node similarity and other specific network properties. The embedding vectors can later be used for downstream machine learning problems such as node classification, link prediction, and network visualization. Naturally, some networks have text information associated with them. For instance, in a citation network, each node is a scientific paper associated with its abstract or title; in a social network, users may be viewed as nodes and their posts as textual attributes. In this work, we explore how combining existing methods of text and network embeddings can increase accuracy on downstream tasks, and we propose modifications to popular architectures to better capture textual information in network embedding and fusion frameworks.
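As a minimal illustration of the combination setting (a hedged sketch, not one of the architectures proposed in this work), a late-fusion baseline can simply concatenate each node's structural embedding with an embedding of its text; all names below are hypothetical:

```python
import numpy as np

def late_fusion(struct_emb, text_emb, nodes):
    """Concatenate structural and textual vectors per node.

    struct_emb / text_emb: dicts mapping node -> np.ndarray, e.g. DeepWalk
    vectors and averaged word vectors of the node's abstract. Illustrative
    only; the work explores richer ways of combining the two signals.
    """
    return np.stack([np.concatenate([struct_emb[n], text_emb[n]])
                     for n in nodes])

# The fused matrix can feed any standard classifier for node labels
# or a pairwise scorer for link prediction.
```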
We present a novel dataset of sports broadcasts comprising 8,781 games. The dataset contains 700 thousand comments and 93 thousand related news documents in Russian. We run an extensive series of experiments with modern extractive and abstractive approaches. The results demonstrate that BERT-based models show modest performance, reaching a ROUGE-1 F-measure of up to 0.26. In addition, human evaluation shows that neural approaches can generate plausible, though inaccurate, news based on the broadcast text.
There is a wide variety of demographic data that can be analyzed with modern data mining methods to achieve better results. On the one hand, our main task is to compare different methods for next-event prediction and gender prediction; on the other hand, we pay special attention to interpretable patterns describing demographic behavior in the studied problems. We considered interpretable methods, such as decision trees and their ensembles, as well as semi- or non-interpretable methods, such as SVMs with different customized kernels tailored to demographers' needs and neural networks, respectively. The best accuracy results were obtained with two-channel convolutional neural networks.
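For concreteness, here is a minimal PyTorch sketch of what a two-channel CNN over demographic event sequences might look like; the channel semantics, layer sizes, and vocabulary sizes are our assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class TwoChannelCNN(nn.Module):
    """Illustrative two-channel 1D CNN over a personal life trajectory.

    Channel 1: embeddings of event types (e.g. marriage, first child).
    Channel 2: embeddings of discretized time gaps between events.
    All hyperparameters are assumptions made for this sketch.
    """
    def __init__(self, n_event_types=12, n_gap_bins=20, emb_dim=32,
                 n_filters=64):
        super().__init__()
        self.event_emb = nn.Embedding(n_event_types, emb_dim)
        self.gap_emb = nn.Embedding(n_gap_bins, emb_dim)
        self.event_conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.gap_conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(2 * n_filters, n_event_types)  # next-event logits

    def forward(self, events, gaps):
        # events, gaps: (batch, seq_len) integer-coded sequences
        e = self.event_emb(events).transpose(1, 2)  # (batch, emb, seq)
        g = self.gap_emb(gaps).transpose(1, 2)
        e = torch.relu(self.event_conv(e)).max(dim=2).values  # global max pool
        g = torch.relu(self.gap_conv(g)).max(dim=2).values
        return self.out(torch.cat([e, g], dim=1))
```

For gender prediction, the same backbone would end in a two-class output layer instead of next-event logits.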
Dealing with relational data has always required significant computational resources, domain expertise, and task-dependent feature engineering to incorporate structural information into a predictive model. Recently, a family of automated graph feature engineering techniques has been proposed across different streams of literature. So-called graph embeddings provide a powerful tool for constructing vectorized feature spaces for graphs and their components, such as nodes, edges, and subgraphs, while preserving intrinsic graph properties. Using the constructed feature spaces, many machine learning problems on graphs can be solved via standard frameworks suitable for vectorized feature representations.
Our survey aims to describe the core concepts of graph embeddings and to provide several taxonomies for their description. First, we take a methodological view and distinguish three types of graph embedding models: those based on matrix factorization, random walks, and deep learning. Next, we describe how different types of networks affect the ability of models to incorporate structural and attribute data into a unified embedding. Going further, we thoroughly evaluate applications of graph embeddings to machine learning problems on graphs, among which are node classification, link prediction, clustering, visualization, compression, and a family of whole-graph embedding algorithms suitable for graph classification, similarity, and alignment problems. Finally, we review existing applications of graph embeddings in computer science domains, formulate open problems, and provide experimental results explaining how different embedding and graph properties relate to the four classic machine learning problems on graphs: node classification, link prediction, clustering, and graph visualization.
As a result, our survey covers the new, rapidly growing field of network feature engineering, presents an in-depth analysis of models based on network types, and reviews a wide range of applications to machine learning problems on graphs.
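As a concrete instance of the random-walk family discussed above, a DeepWalk-style embedding treats truncated random walks as sentences and trains word2vec on them; the sketch below uses illustrative hyperparameters.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, num_walks=10, walk_len=20):
    """Uniform random walks; each walk becomes a 'sentence' for word2vec."""
    walks, nodes = [], list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = list(graph.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()
model = Word2Vec(random_walks(G), vector_size=64, window=5, min_count=0, sg=1)
vec = model.wv['0']  # node 0's embedding, a feature vector for downstream tasks
```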
Modern information access systems hold the promise of giving users direct access to key information from authoritative primary sources such as the scientific literature, but non-experts tend to avoid these sources due to their complex language, internal vernacular, or the readers' lack of prior background knowledge. Text simplification approaches can remove some of these barriers, keeping users from falling back on shallow information from sources that prioritize commercial or political incentives over correctness and informational value. The CLEF 2021 SimpleText track addresses the opportunities and challenges of text simplification approaches for improving scientific information access head-on. We aim to provide appropriate data and benchmarks, starting with pilot tasks in 2021, and to create a community of NLP and IR researchers working together to resolve one of the greatest challenges of today.
Drugs and diseases play a central role in many areas of biomedical research and healthcare. Aggregating knowledge about these entities across a broad range of domains and languages is critical for information extraction (IE) applications. To facilitate text mining methods for analyzing and comparing patients' health conditions and adverse drug reactions reported on the Internet with traditional sources such as drug labels, we present a new corpus of Russian-language health reviews.
The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products, intended for the detection of health-related named entities and the effectiveness of pharmaceutical products. The corpus consists of two parts: a raw part and a labeled part. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labeled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Each sentence is labeled for the presence or absence of health-related issues; sentences mentioning such issues are additionally labeled at the expression level to identify fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present baseline models for the named entity recognition (NER) and multilabel sentence classification tasks on this corpus. Our RuDR-BERT model achieved a macro F1 score of 74.85% on the NER task. On the sentence classification task, it achieves a macro F1 score of 68.82%, a gain of 7.47% over the score of a BERT model trained on Russian data.
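A minimal sketch of the token-classification setup behind such an NER baseline follows; the label set is hypothetical, and a public multilingual checkpoint stands in for RuDR-BERT.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO tag set; the actual RuDReC inventory is richer.
labels = ['O', 'B-DRUG', 'I-DRUG', 'B-DISEASE', 'I-DISEASE']
name = 'bert-base-multilingual-cased'  # stand-in for the RuDR-BERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name,
                                                        num_labels=len(labels))

enc = tok('После приёма аспирина началась сильная головная боль.',
          return_tensors='pt')
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0]
print([labels[int(i)] for i in pred])  # tags are random until fine-tuned
```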
This book constitutes the proceedings of the 8th International Conference on Analysis of Images, Social Networks and Texts, AIST 2019, held in Kazan, Russia, in July 2019.
The 24 full papers and 10 short papers were carefully reviewed and selected from 134 submissions (of which 21 papers were rejected without being reviewed). The papers are organized in topical sections on general topics of data analysis; natural language processing; social network analysis; analysis of images and video; optimization problems on graphs and network structures; analysis of dynamic behaviour through event data.
Applications such as machine translation, speech recognition, and information retrieval require efficient handling of noun compounds, as they are one of the possible sources of out-of-vocabulary words. In-depth processing of noun compounds requires not only splitting them into smaller components (or even roots) but also identifying instances that should remain unsplit because they are idiomatic in nature. We develop a two-fold deep learning-based approach to noun compound splitting and idiomatic compound detection for German, which we train on a newly collected corpus of annotated German compounds. Our neural noun compound splitter operates on the sub-word level and outperforms the current state of the art by about 5%.
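To make the splitting task concrete (this is a toy baseline, far simpler than the neural splitter described above), a greedy recursive splitter over a hand-made lexicon, with rudimentary handling of the German linking element 's', could look as follows:

```python
def split_compound(word, vocab, min_len=3):
    """Greedy recursive compound splitting against a toy lexicon.

    Returns a list of parts or None if no full split exists. Purely
    illustrative; real splitters must handle many more linking elements.
    """
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        # try the head as-is and with a trailing linking 's' removed
        for h in (head, head[:-1] if head.endswith('s') else None):
            if h in vocab and (rest in vocab or split_compound(rest, vocab)):
                return [h] + (split_compound(rest, vocab) or [rest])
    return None

vocab = {'arbeit', 'markt', 'politik'}
print(split_compound('arbeitsmarktpolitik', vocab))  # ['arbeit', 'markt', 'politik']
```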
Lexical substitution, i.e., the generation of plausible words that can replace a particular target word in a given context, is an extremely powerful technology that can serve as a backbone of various NLP applications, including word sense induction and disambiguation, lexical relation extraction, and data augmentation. In this paper, we present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, RoBERTa, and XLNet. We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved further if information about the target word is injected properly. We compare several existing and new target word injection methods for each LM/MLM, using both intrinsic evaluation on lexical substitution datasets and extrinsic evaluation on word sense induction (WSI) datasets. On two WSI datasets we obtain new SOTA results. We also analyze the types of semantic relations between target words and the substitutes generated by different models or given by annotators.
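As a minimal illustration of MLM-based substitute generation with and without target injection (the pattern below is one simple variant, not the paper's full set of injection methods):

```python
from transformers import pipeline

fill = pipeline('fill-mask', model='bert-base-uncased')

# No injection: the model sees only the context, not the target word.
print([c['token_str'] for c in
       fill('The coach praised the whole [MASK] after the match.', top_k=5)])

# Simple injection: keep the target visible next to the mask so that the
# model conditions on it ('team and [MASK]').
print([c['token_str'] for c in
       fill('The coach praised the whole team and [MASK] after the match.',
            top_k=5)])
```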
We propose a hybrid technique for black-box testing of virtual assistants (VAs) in the financial sector. The specifics of this highly regulated industry impose numerous limitations on the testing process: GDPR and other data protection requirements, the absence of interaction logs with real users, restricted access to internal data, etc. These limitations also reduce the applicability of the few VA testing methods that are widely described in the research literature. The approach suggested in this paper consists of semi-controlled logging of interactions with trained testers and subsequent augmentation of the collected data for automated testing.
SemEval-2020 Task 1 is devoted to the detection of changes in word meaning over time. The first subtask asks whether a particular word has acquired or lost any of its senses during a given time period. The second subtask requires estimating the change in the frequencies of the word's senses. We submitted two solutions for both subtasks. The first solution performs word sense induction (WSI) and then makes the decision based on the induced word senses. We extend an existing WSI method based on clustering of lexical substitutes generated with neural language models and adapt it to the task. The second solution exploits a well-known approach to semantic change detection: building word2vec SGNS vectors for each time period, aligning them with Orthogonal Procrustes, and calculating the cosine distance between the resulting vectors. While the WSI-based solution performs better in Subtask 1, which requires binary decisions, the second solution outperforms it in Subtask 2 and obtains the third-best result in this subtask.
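The alignment-based pipeline can be sketched in a few lines; training of the SGNS vectors is omitted, and the helper below (a hedged sketch, not the submitted code) only shows alignment and distance computation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def semantic_change(emb_old, emb_new, shared_vocab, word):
    """Align two SGNS spaces and measure cosine distance for one word.

    emb_old / emb_new: dicts word -> vector trained on the two periods.
    Details such as frequency filtering and normalization are simplified.
    """
    A = np.stack([emb_old[w] for w in shared_vocab])
    B = np.stack([emb_new[w] for w in shared_vocab])
    R, _ = orthogonal_procrustes(A, B)  # rotation mapping old space onto new
    u, v = emb_old[word] @ R, emb_new[word]
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos  # larger distance suggests a stronger change of meaning
```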
We analyze comparative questions, i.e., questions asking to compare different items, that were submitted to Yandex in 2012. Responses to such questions might differ substantially from the usual "ten blue links" and could, for example, aggregate pros and cons of the different options as direct answers. However, changing the result presentation is an intricate decision, which makes the classification of comparative questions a highly precision-oriented task.
From a year-long Yandex log, we annotate a random sample of 50,000 questions, 2.8% of which are comparative. For these annotated questions, we develop a precision-oriented classifier by combining carefully hand-crafted lexico-syntactic rules with feature-based and neural approaches, achieving a recall of 0.6 at a perfect precision of 1.0. After running the classifier on the full year-long log (on average, there is at least one comparative question per second), we analyze 6,250 comparative questions using more fine-grained subclasses (e.g., should the answer be a "simple" fact or rather a more verbose argument), for which individual classifiers are trained. An important insight is that more than 65% of the comparative questions demand argumentation and opinions, i.e., reliable direct answers to comparative questions require more than the facts from a search engine's knowledge graph.
In addition, we present a qualitative analysis of the underlying comparative information needs (separated into 14 categories like consumer electronics or health), their seasonal dynamics, and possible answers from community question answering platforms.
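To make the rule-based side concrete, here is a toy lexico-syntactic pattern in the spirit of (but not taken from) the hand-crafted rules; the real rules are applied to Russian questions and are far more careful about precision.

```python
import re

# Flag a question as comparative only on an unambiguous surface pattern.
COMPARATIVE = re.compile(
    r'\b(which|what)\b.*\b(better|worse|faster|cheaper)\b.*\bor\b', re.I)

print(bool(COMPARATIVE.search('Which is better, tea or coffee?')))  # True
print(bool(COMPARATIVE.search('How to make coffee at home?')))      # False
```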
Intelligent personal assistants (IPAs) use humor to engage and entertain users as well as to mitigate performance limitations. To understand the types of humorous interactions users have with IPAs, we developed a classification of humorous utterances that includes categories such as questions about the IPA's personality, requests for jokes, and rhetorical statements. To illustrate the usefulness of the classification for analyzing IPA interactions, we used it to compare the four major IPAs on their responses to humorous utterances. A representative sample of 96 humorous utterances for each humor category and IPA type was developed and tested by 14 participants. The study found that IPA responses to specific requests for jokes received the highest humor ratings from users. It also found that, overall, Alexa was rated as the most humorous IPA, followed by Google Assistant and Cortana. An interpretation of the findings in light of humor theories and IPA features is provided.
Out-of-vocabulary words remain a challenge in cross-lingual natural language processing tasks, and transliteration from the source to the target language or script is one solution. In this study, we collect a personal name dataset covering 445 Wikidata languages (37 scripts), train Transformer-based multilingual transliteration models on 6 high-resource and 4 less-resourced languages, compare them with the bilingual models of Merhav and Ash (2018), and find that multilingual models perform better for less-resourced languages. We discover that intrinsic evaluation, i.e., comparison to a single gold standard, might not be appropriate for the transliteration task due to its high variability. For this reason, we propose extrinsic evaluation of transliteration via the cross-lingual named entity list search task (e.g., personal name search in a contact list). Our code and datasets are publicly available online.
In this paper, our focus is the connection between language technologies and research in neurolinguistics, and their mutual influence. We present a review of brain imaging-based neurolinguistics studies, focusing on natural language representations such as word embeddings and pre-trained language models. The mutual enrichment of neurolinguistics and language technologies leads to the development of brain-aware natural language representations. The importance of this research area is underscored by its medical applications.
This paper describes our approach to the “DeftEval: Extracting Definitions from Free Text in Textbooks” competition held as a part of SemEval-2020. The task was devoted to finding and labeling definitions in texts. DeftEval was split into three subtasks: sentence classification, sequence labeling, and relation classification. Our solution ranked 5th in the first subtask, and 23rd and 21st in the second and third subtasks, respectively. We applied simultaneous multi-task learning with Transformer-based models for subtasks 1 and 3 and a single BERT-based model for named entity recognition.
Artificial General Intelligence (AGI) is showing growing performance in numerous applications: beating human performance in Chess and Go, using knowledge bases and text sources to answer questions (SQuAD), and even passing human examinations (the Aristo project). In this paper, we describe the results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning, and text generation. Competing systems passed the final native-language exam (in Russian), including versatile grammar tasks (test and open questions) and an essay, achieving a top score of 69%, with 68% being the average human result. During the competition, a baseline for the task and essay parts was proposed, and more than 80 systems were submitted, showing different approaches to task understanding and reasoning. All data and solutions can be found on GitHub: https://github.com/sberbank-ai/combined_solution_aij2019