Publications
The paper addresses topical questions in data science education. It aims to introduce and justify a framework that allows flexible evaluation of the processes of a data expedition and of the digital media created during it. For these purposes, the authors explore features of digital media artefacts which are specific to data expeditions and are essential to accurate evaluation. Rubrics, a powerful but hard-to-formalize evaluation method, are also discussed in application to digital media artefacts. Moreover, the paper documents the experience of creating rubrics according to the suggested framework. The rubrics were successfully adopted in two data-driven journalism courses. The authors also formulate recommendations on data expedition evaluation, which should take into consideration the structural features of a data expedition, the distinctive features of digital media, etc.
We present a study on co-authorship network representation based on network embedding combined with additional information from topic modeling of research papers and a new edge embedding operator. We use a link prediction (LP) model to construct a recommender system for finding collaborators with similar research interests. Extracting topics for each paper, we construct a keyword co-occurrence network and use its embedding to further generalize author attributes. Standard graph feature engineering and network embedding methods were combined to construct a co-author recommender system formulated as an LP problem and as prediction of future graph structure. We evaluate our approach on a dataset containing temporal information on over 25 years of research articles from National Research University Higher School of Economics indexed in the Russian Science Citation Index and Scopus. Our model of network representation shows better performance on the stated binary classification tasks on several co-authorship networks.
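To make the idea of an edge embedding operator concrete, here is a minimal sketch of four operators commonly used with node embeddings in link prediction (average, Hadamard, weighted L1, weighted L2). These are standard choices and not the new operator proposed in the paper, which the abstract does not specify; the author names and embedding values are hypothetical.

```python
def average(u, v):
    """Element-wise mean of two node embeddings."""
    return [(a + b) / 2 for a, b in zip(u, v)]


def hadamard(u, v):
    """Element-wise product of two node embeddings."""
    return [a * b for a, b in zip(u, v)]


def weighted_l1(u, v):
    """Element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]


def weighted_l2(u, v):
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


# Hypothetical 2-dimensional embeddings for two authors.
embeddings = {"author_a": [1.0, 2.0], "author_b": [3.0, 4.0]}

# An edge (candidate co-authorship) is represented by combining the
# embeddings of its two endpoints; the result can be fed to any
# binary classifier that predicts whether the link exists.
edge_features = hadamard(embeddings["author_a"], embeddings["author_b"])
print(edge_features)
```

All four operators are symmetric in their arguments, which suits undirected co-authorship graphs: the feature vector of a pair does not depend on the order of the two authors.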
These days, adaptivity is at the cutting edge of modern education. Technologies are developing rapidly and bringing new possibilities to educators. Thus, diverse types of adaptive learning environments have appeared during the last decades. Material Science and Engineering Education (MSEE) has a solid formalized foundation, which consists of standards, recommendations and clear rules. Moreover, investigators report on the growing role of computers in teaching and learning in MSEE. This opens up great prospects for computer adaptive learning systems based on a material science and engineering ontology. This paper aims to justify the general pedagogical foundations of adaptivity and to collect requirements for a computer adaptive learning system. As an additional result, we introduce the architecture of an ontology-based adaptive learning system for MSEE.
In this paper, we present the solution of our Avito team for the RecSys Challenge 2018, which took 3rd place in the main track. The goal of the competition was to recommend music tracks for automatic playlist continuation. As a part of this challenge, Spotify released a large public dataset, which allowed us to train a rather complex algorithm. Our approach consists of two stages: collaborative filtering for candidate selection and gradient boosting for final prediction. The combination of these two models performed well with the playlist and track metadata given.
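The first stage of such a two-stage pipeline can be illustrated with a minimal sketch. It assumes a simple item co-occurrence score as the collaborative-filtering signal, which is a stand-in and not necessarily the model the team used; the playlists and track identifiers are hypothetical. The second stage (a gradient boosting model that re-ranks these candidates using playlist and track metadata) is not shown.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(playlists):
    """Count how often each ordered pair of distinct tracks shares a playlist."""
    cooc = defaultdict(Counter)
    for tracks in playlists:
        for a, b in permutations(set(tracks), 2):
            cooc[a][b] += 1
    return cooc


def select_candidates(seed_tracks, cooc, k=3):
    """Stage 1: score unseen tracks by summed co-occurrence with the seeds."""
    seeds = set(seed_tracks)
    scores = Counter()
    for seed in seeds:
        for track, count in cooc.get(seed, {}).items():
            if track not in seeds:
                scores[track] += count
    # Sort by score (descending), then by identifier so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical toy data: each playlist is a list of track identifiers.
playlists = [
    ["t1", "t2", "t3"],
    ["t1", "t2", "t4"],
    ["t2", "t3", "t4"],
    ["t5", "t6"],
]
cooc = build_cooccurrence(playlists)
candidates = select_candidates(["t1", "t2"], cooc)
print(candidates)
```

Splitting the work this way keeps the expensive second-stage model cheap to apply: it only has to score a short candidate list per playlist instead of the full track catalogue.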
Logical frameworks allow the specification of deductive systems using the same logical machinery. Linear logical frameworks have been successfully used for the specification of a number of computational, logical and proof systems. Their success relies on the fact that formulas can be distinguished as linear, which behave intuitively as resources, and unbounded, which behave intuitionistically. Commutative subexponentials enhance the expressiveness of linear logical frameworks by allowing the distinction of multiple contexts. These contexts may behave as multisets of formulas or sets of formulas. Motivated by applications in distributed systems and in type-logical grammar, we propose a linear logical framework containing both commutative and non-commutative subexponentials. Non-commutative subexponentials can be used to specify contexts which behave as lists, not multisets, of formulas. In addition, motivated by our applications in type-logical grammar, where the weakening rule is disallowed, we investigate the proof theory of formulas that can only contract, but not weaken. In fact, our contraction is non-local. We demonstrate that under some conditions such formulas may be treated as unbounded formulas, which behave intuitionistically.
This volume contains the refereed proceedings of the 6th International Conference on Analysis of Images, Social Networks, and Texts (AIST 2017). The previous conferences during 2012–2016 attracted a significant number of students, researchers, academics, and engineers working on interdisciplinary data analysis of images, texts, and social networks. The broad scope of AIST made it an event where researchers from different domains, such as image and text processing, exploiting various data analysis techniques, can meet and exchange ideas. We strongly believe that this may lead to cross fertilisation of ideas between researchers relying on modern data analysis machinery. Therefore, AIST brought together all kinds of applications of data mining and machine learning techniques. The conference allowed specialists from different fields to meet each other, present their work, and discuss both theoretical and practical aspects of their data analysis problems. Another important aim of the conference was to stimulate scientists and people from industry to benefit from the knowledge exchange and identify possible grounds for fruitful collaboration. The conference was held during July 27–29, 2017. The conference was organised in Moscow, the capital of Russia, on the campus of Moscow Polytechnic University. This year, the key topics of AIST were grouped into six tracks:
1. General topics of data analysis chaired by Sergei Kuznetsov (Higher School of Economics, Russia) and Amedeo Napoli (LORIA, France)
2. Natural language processing chaired by Natalia Loukachevitch (Lomonosov Moscow State University, Russia) and Alexander Panchenko (University of Hamburg, Germany)
3. Social network analysis chaired by Stanley Wasserman (Indiana University, USA)
4. Analysis of images and video chaired by Victor Lempitsky (Skolkovo Institute of Science and Technology, Russia) and Andrey Savchenko (Higher School of Economics, Russia)
5. Optimisation problems on graphs and network structures chaired by Panos Pardalos (University of Florida, USA) and Michael Khachay (IMM UB RAS and Ural Federal University, Russia)
6. Analysis of dynamic behaviour through event data chaired by Wil van der Aalst (Eindhoven University of Technology, The Netherlands) and Irina Lomazova (Higher School of Economics, Russia)
One of the novelties this year was the introduction of a new specialised track on process mining (Track 6).
This book constitutes the proceedings of the 6th International Conference on Analysis of Images, Social Networks and Texts, AIST 2017, held in Moscow, Russia, in July 2017.
The 29 full papers and 8 short papers were carefully reviewed and selected from 127 submissions. The papers are organized in topical sections on natural language processing; general topics of data analysis; analysis of images and video; optimization problems on graphs and network structures; analysis of dynamic behavior through event data; social network analysis.
This book constitutes the proceedings of the 7th International Conference on Analysis of Images, Social Networks and Texts, AIST 2018, held in Moscow, Russia, in July 2018.
The 29 full papers were carefully reviewed and selected from 107 submissions (of which 26 papers were rejected without being reviewed). The papers are organized in topical sections on natural language processing; analysis of images and video; general topics of data analysis; analysis of dynamic behavior through event data; optimization problems on graphs and network structures; and innovative systems.
In this paper we show that, for a given co-authorship network, we can construct a recommender system for finding collaborators with similar research interests defined via keywords and topic modelling. We suggest a new link embedding method and evaluate our model on the co-authorship network of National Research University Higher School of Economics (NRU HSE).
Relativisation involves dependencies which, although unbounded, are constrained with respect to certain island domains. The Lambek calculus L can provide a very rudimentary account of relativisation limited to unbounded peripheral extraction; the Lambek calculus with bracket modalities Lb can further condition this account according to island domains. However, in naïve parsing/theorem-proving by backward chaining sequent proof search for Lb, the bracketed island domains, which can be indefinitely nested, have to be specified in the linguistic input. In realistic parsing, word order is given, but such hierarchical bracketing structure cannot be assumed to be given. In this paper we show how to realise parsing that induces the bracketing structure in backward chaining sequent proof search with Lb.
Co-authorship networks contain invisible patterns of collaboration among researchers. The process of writing a joint paper can depend on various factors, such as friendship, common interests, and university policy. We show that, given a temporal co-authorship network, it is possible to predict future publications. We treat the problem of recommending collaborators as a link prediction problem, using graph embeddings obtained from the co-authorship network. We run experiments on data from the HSE publications graph and compare our approach with relevant models.
This paper is devoted to mathematical modelling of breast cancer progression with regard to its stages. Given the relation between the primary tumor (PT) and metastases (MTS), the problem of studying the breast cancer (BC) process is twofold: firstly, it is important to describe the whole natural history of BC to understand the process as a whole; secondly, it is necessary to predict the period of clinical MTS manifestation. In order to understand the growth processes of BC at each stage, CoMBreC was proposed as a new research tool. CoMBreC has three parts: CoMPaS (stages I-II), CoM-III (stage III) and CoM-IV (stage IV). The new model rests on an exponential growth model and complementing formulas. For the first time, it allows us to calculate different growth periods of PT and MTS in patients with/without lymph node MTS: 1) the non-visible period for PT; 2) the non-visible period for MTS; 3) the visible period for MTS. Calculations via CoMBreC correspond to survival data considering the stage of BC. It may help to improve the accuracy of predicting the BC process using an original mathematical model referred to as CoMBreC and the corresponding software. Consequently, the thesis concentrates on: 1) modelling the whole natural history of PT and MTS in patients with/without lymph node MTS; 2) developing an adequate and precise CoMBreC that reflects the relations between PT and MTS; 3) analysing the CoMBreC scope of application. CoMBreC was implemented as an iOS application, a new predictive tool that: 1) is a solid foundation for developing future studies of BC models; 2) does not require any expensive diagnostic tests; 3) is the first predictor of survival in breast cancer that makes a forecast using only current patient data.
The goal of this research is to improve the accuracy of predicting the breast cancer (BC) process using the original mathematical model referred to as CoMPaS. CoMPaS is an original mathematical model, with corresponding software, built by modelling the natural history of the primary tumor (PT) and secondary distant metastases (MTS); it reflects the relations between the PT and MTS. CoMPaS is based on an exponential growth model, consists of a system of determinate nonlinear and linear equations, and corresponds to the TNM classification. It allows us to calculate the different growth periods of PT and MTS: 1) a non-visible period for PT, 2) a non-visible period for MTS, and 3) a visible period for MTS. CoMPaS has been validated using 10-year and 15-year survival clinical data considering tumor stage and PT diameter. The following are calculated by CoMPaS: 1) the number of doublings for the non-visible and visible growth periods of MTS and 2) the tumor volume doubling time (days) for the non-visible and visible growth periods of MTS. The diameters of the PT and secondary distant MTS increased simultaneously. In other words, the non-visible growth period of the secondary distant MTS shrinks, leading to a decrease in the survival of patients with breast cancer. CoMPaS correctly describes the growth of the PT for patients at the T1aN0M0, T1bN0M0, T1cN0M0, T2N0M0 and T3N0M0 stages, who do not have MTS in the lymph nodes (N0). Additionally, CoMPaS helps to consider the appearance and evolution period of secondary distant MTS (M1). CoMPaS correctly describes the growth period of PT corresponding to BC classification (parameter T), the growth period of secondary distant MTS and the 10-15-year survival of BC patients considering the BC stage (parameter M).
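The two quantities named above, the number of doublings and the volume doubling time, follow directly from any exponential growth model. The sketch below shows the generic textbook calculation for a spherical tumor and is not the CoMPaS system of equations itself; the starting diameter, final diameter and growth period are hypothetical inputs chosen only for illustration.

```python
import math


def sphere_volume(diameter_mm):
    """Volume of a sphere with the given diameter, in cubic millimetres."""
    return math.pi / 6.0 * diameter_mm ** 3


def doublings_between(d_start_mm, d_end_mm):
    """Number of volume doublings needed to grow from one diameter to another.

    Under exponential growth V(t) = V0 * 2**(t / DT), so the number of
    doublings is log2(V_end / V_start), which equals 3 * log2(d_end / d_start).
    """
    return math.log2(sphere_volume(d_end_mm) / sphere_volume(d_start_mm))


def doubling_time_days(period_days, n_doublings):
    """Volume doubling time DT, given a growth period and a doubling count."""
    return period_days / n_doublings


# Hypothetical inputs: growth from a 0.01 mm diameter to a 10 mm diameter
# over an assumed period of 3000 days.
n = doublings_between(0.01, 10.0)
dt = doubling_time_days(3000.0, n)
print(f"doublings: {n:.2f}, doubling time: {dt:.1f} days")
```

Because volume scales with the cube of the diameter, a thousandfold increase in diameter corresponds to roughly 30 volume doublings, which is why a large share of tumor growth can take place before the tumor reaches a detectable size.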
We solve the argument mining problem by investigating discourse and communicative text structure. A new formal graph-based structure called a communicative discourse tree (CDT) is defined. It consists of a discourse tree with additional labels on edges, which stand for verbs. These verbs represent communicative actions. Discourse trees are based on rhetorical relations, extracted from a text according to Rhetorical Structure Theory. The problem is tackled as a binary classification task, where the positive class corresponds to texts with arguments and the negative class corresponds to texts with no arguments. Feature engineering for the classification task is conducted to decide which syntactic and discourse features are associated with logical argumentation. A text classification framework based on syntactic, discourse and communicative discourse text structures is implemented with a number of learning approaches. An evaluation on a combined dataset is provided.
In this paper, we describe a special recommender approach based on features extracted from images of clothes. (The first author is the 1st place winner of the Open HSE Student Research Paper Competition (NIRS) in 2017, Computer Science nomination, with the topic “Extraction of Visual Features for Recommendation of Products”, as an alumnus of the 2017 “Data Science” master program at the Computer Science Faculty, HSE, Moscow.) The method of feature extraction relies on a pre-trained deep neural network adapted to the dataset through transfer learning. Recommendations are generated by the neural network as well. All the experiments are based on items in the Clothing, Shoes and Jewelry category of the Amazon product dataset. It is demonstrated that the proposed approach outperforms the baseline collaborative filtering method.