Dean — Ivan Arzhantsev
First Deputy Dean— Tamara Voznesenskaya
Deputy Dean for Research and International Relations - Sergei Obiedkov
Deputy Dean for finance and administration - Irina Gergart
Phone: +7 (495) 772-95-90 * 12332
Moscow, 3 Kochnovsky Proezd (near metro station 'Aeroport').
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how the statistics works and why some statistical models are better than others. After this notion of a "good model" is introduced, a natural question arises: it is possible that for some piece of data there is no good model? If yes, how often these bad ("non-stochastic") data appear "in real life"? Another, more technical motivation comes from algorithmic information theory. In this theory a notion of complexity of a finite object (=amount of information in this object) is introduced; it assigns to every object some number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistic provides a more fine-grained classification: for each finite object some curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve. In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information (Kolmogorov complexity) theory.
A small improvement in the structure of a material could potentially lower manufacturing costs. Thus, the free material design can be formulated as an optimization problem. However, due to its large scale, second-order methods cannot solve the free material design problem in a reasonable time. We formulate the free material optimization (FMO) problem into a saddle-point form in which the inverse of the stiffness matrix $A(E)$ in the constraint is eliminated. The size of $A(E)$ is generally large, denoted as $N \times N$. This is the first formulation of FMO without $A(E)^{-1}$. We apply the primal-dual subgradient method [Y. Nesterov, Math. Program., 120 (2009), pp. 221--259] to solve the restricted saddle-point formula. This is the first gradient-type method for FMO. Each iteration of our algorithm takes a total of $\mathcal{O}(N^2)$ floating-point operations and an auxiliary vector storage of size $\mathcal{O}(N)$, compared with formulations having the inverse of $A(E)$ which requires $\mathcal{O}(N^3)$ arithmetic operations and an auxiliary vector storage of size $\mathcal{O}(N^2)$. To solve the problem, we developed a closed-form solution to a semidefinite least squares problem and an efficient parameter update scheme for the gradient method. We also approximate a solution to the bounded Lagrangian dual problem. The problem is decomposed into small problems, each having only an unknown of $k\times k$ ($k=3$ or $6$) matrix, and can be solved in parallel. The iteration bound of our algorithm is optimal for a general subgradient scheme. Finally, we present promising numerical results.\
This paper is a collection of thoughts and observations, being partly a review and partly a report of current research, on recent work in various aspects of Grunbaum colorings, their existence and usage. In particular, one of the most striking significances of Grunbaum’s Conjecture in the 2-dimensional case is its equivalence to the 4-Color Theorem. The notion of Grunbaum coloring is extended from the 2-dimensional case to the case of arbitrary finite hyper-dimensions.
Dualization of a monotone Boolean function on a finite lattice can be represented by transforming the set of its minimal 1 values to the set of its maximal 0 values. In this paper we consider finite lattices given by ordered sets of their meet and join irreducibles (i.e., as a concept lattice of a formal context). We show that in this case dualization is equivalent to the enumeration of so-called minimal hypotheses. In contrast to usual dualization setting, where a lattice is given by the ordered set of its elements, dualization in this case is shown to be impossible in output polynomial time unless P = NP. However, if the lattice is distributive, dualization is shown to be possible in subexponential time.
Strongly-Optimal Structure Preserving Signatures from Type II Pairings: Synthesis and Lower Bounds
We present a multiplayer first-person shooter (FPS) game with advanced intelligent non-playable characters (NPC) under computer control. The game is specially adapted for playing in VR headset so the simulator sickness symptoms are significantly reduced. The demo allows users to play with the other human and NPC players in a shooter game made in Unreal Engine 4. User can verify his/her game skills versus evolving human-like NPCs with a level adjusting model. The humanness of NPC was verified with Alan Turing game test beating 52\% record from BotPrize'12 competition.
The major issues hampering progress in the treatment of cancer patients are distant metastases and drug resistance to chemotherapy. Metastasis formation is a very complex process, and looking at gene signatures alone is not enough to get deep insight into it. This paper reviews traditional and novel approaches to identify gene signature biomarkers and intratumoural fluid pressure both as a novel way of creating predictive markers and as an obstacle to cancer therapy. Finally recently developed in vitro systems to predict the response of individual patient derived cancer explants to chemotherapy are discussed.
Small open reading frames (sORFs) and genes for non-coding RNAs are poorly investigated components of most genomes. Our analysis of 1391 ORFs recently annotated in the soybean symbiont Bradyrhizobium japonicum USDA 110 revealed that 78% of them contain less than 80 codons. Twenty-one of these sORFs are conserved in or outside Alphaproteobacteria and most of them are similar to genes found in transposable elements, in line with their broad distribution. Stabilizing selection was demonstrated for sORFs with proteomic evidence and bll1319_ISGA which is conserved at the nucleotide level in 16 alphaproteobacterial species, 79 species from other taxa and 49 other Proteobacteria. Further we used Northern blot hybridization to validate ten small RNAs (BjsR1 to BjsR10) belonging to new RNA families. We found that BjsR1 and BjsR3 have homologs outside the genus Bradyrhizobium, and BjsR5, BjsR6, BjsR7, and BjsR10 have up to four imperfect copies in Bradyrhizobium genomes. BjsR8, BjsR9, and BjsR10 are present exclusively in nodules, while the other sRNAs are also expressed in liquid cultures. We also found that the level of BjsR4 decreases after exposure to tellurite and iron, and this down-regulation contributes to survival under high iron conditions. Analysis of additional small RNAs overlapping with 3’-UTRs revealed two new repetitive elements named Br-REP1 and Br-REP2. These REP elements may play roles in the genomic plasticity and gene regulation and could be useful for strain identification by PCR-fingerprinting. Furthermore, we studied two potential toxin genes in the symbiotic island and confirmed toxicity of the yhaVhomolog bll1687 but not of the newly annotated higB homolog blr0229_ISGA in E. coli. Finally, we revealed transcription interference resulting in an antisense RNA complementary to blr1853, a gene induced in symbiosis. The presented results expand our knowledge on sORFs, non-coding RNAs and repetitive elements in B. japonicum and related bacteria.
Recent work on structure-preserving signatures studies optimality of these schemes in terms of the number of group elements needed in the verification key and the signature, and the number of pairing-product equations in the verification algorithm. While the size of keys and signatures is crucial for many applications, another important aspect to consider for performance is the time it takes to verify a given signature. By far, the most expensive operation during verification is the computation of pairings. However, the concrete number of pairings that one needs to compute is not captured by the number of pairing-product equations considered in earlier work. To fill this gap, we consider the question of what is the minimal number of pairings that one needs to compute in the verification of structure-preserving signatures. First, we prove lower bounds for schemes in the Type II setting that are secure under chosen message attacks in the generic group model, and we show that three pairings are necessary and that at most one of these pairings can be precomputed. We also extend our lower bound proof to schemes secure under random message attacks and show that in this case two pairings are still necessary. Second, we build an automated tool to search for schemes matching our lower bounds. The tool can generate automatically and exhaustively all valid structure-preserving signatures within a user-specified search space, and analyze their (bounded) security in the generic group model. Interestingly, using this tool, we find a new randomizable structure-preserving signature scheme in the Type II setting that is optimal with respect to the lower bound on the number of pairings, and also minimal with respect to the number of group operations that have to be computed during verification.
Biological nitrogen fixation plays a crucial role in the nitrogen cycle. An ability to fix atmospheric nitrogen, reducing it to ammonium, was described for multiple species of Bacteria and Archaea. The transcriptional regulatory network for nitrogen fixation was extensively studied in several representatives of the class Alphaproteobacteria. This regulatory network includes the activator of nitrogen fixation NifA, working in tandem with the alternative sigma-factor RpoN as well as oxygen-responsive regulatory systems, one-component regulators FnrN/FixK and two-component system FixLJ. Here we used a comparative genomics approach for in silico study of the transcriptional regulatory network in 50 genomes of Alphaproteobacteria. We extended the known regulons and proposed the scenario for the evolution of the nitrogen fixation transcriptional network. The reconstructed network substantially expands the existing knowledge of transcriptional regulation in nitrogen-fixing microorganisms and can be used for genetic experiments, metabolic reconstruction, and evolutionary analysis. © 2016 Tsoy, Ravcheev, Čuklina and Gelfand.
This paper considers the problem of the deductive verification of the Linux kernel code that is concurrent and accesses shared data. The presence of shared data does not allow applying traditional deductive verification techniques, so we consider how to verify such a code by proving its compliance to a given specification of a certain synchronization discipline. The approach is illustrated by the examples of a spinlock specification and a simplified specification of the read-copy-update (RCU) API.
This paper proposes a new object model of data for the in-depth analysis of network traffic. In contrast to the model used by most modern network analyzers (for example, Wireshark and Snort), the proposed model supports data stream reassembling with subsequent parsing. The model also provides a convenient universal mechanism for binding parsers, thus making it possible to develop completely independent parsers. Moreover, the proposed model allows processing modified—compressed or encrypted—data. This model forms the basis of the infrastructure for the in-depth analysis of network traffic.
This paper presents a memory model with nonoverlapping memory areas (regions) for the deductive verification of C programs. This memory model uses a core language that supports arbitrary nesting of structures, unions, and arrays into other structures and allows reducing the number of user-provided annotations as a result of region analysis. This paper also describes semantics of the core language and normalizing transformations for translating an input annotated C program into a program in the core language. In addition, an encoding is proposed for modeling the memory state of a core-language program with logical formulas as described by the model semantics of the core language. For the model semantics, the soundness and completeness theorems are proved. Additional constraints on the typing context of the core-language terms are described that determine the result of the region analysis enabling the complete modeling of a limited class of programs without using additional annotations. A proof sketch for the theorem stating completeness of the proposed region analysis for a limited class of programs is presented.
Multi-level multi-agent systems (MASs) with dynamic structure are widely used in solving important applied problems in telecommunication, transportation, social, and other systems. Therefore, ensuring correct behavior of such systems is an actual and important task. One of the most error-prone stages of system development in the framework of model-oriented approach is the implementation stage, in the course of which a program code is constructed based on the model developed. This paper presents an algorithm for automated translation of MAS models represented as nested Petri nets into systems of distributed components. Nested Petri nets are the extension of Petri nets in the framework of the nets-within-nets approach, which assumes that tokens in a Petri net may themselves be Petri nets, possess autonomous behavior, and interact with other tokens of the net. This makes it possible to model MASs with dynamic structure in a natural way. The translation presented in this paper preserves distribution level and important behavioral properties (safety, liveness, and conditional liveness) of the original model and ensures fairness of the target system execution. The use of such translation makes it possible to automate construction of distributed MASs by models of nested Petri nets. As a test example, translation of nested Petri nets into systems of distributed components was implemented on the basis of the EJB component technology.
The paper deals with the classical extremal problem concerning colorings of hypergraphs. The problem is to find the value m(n,r), equal to the minimum number of edges in a n-uniform hypergraph with chromatic number greater than r. We obtain new upper and lower bounds for m(n,r) in the case when the parameter r is very large in comparison with n.
In order to solve robust PageRank problem a saddle-point Mirror Descent algorithm for solving convex-concave optimization problems is enhanced and studied. The algorithm is based on two proxy functions, which use specificities of value sets to be optimized on (min-max search). In robust PageRank case the ones are entropy-like function and square of Euclidean norm. The saddle-point Mirror Descent algorithm application to robust PageRank leads to concrete complexity results, which are being discussed alongside with illustrative numerical example.
We derive two convergence results for a sequential alternating maximization procedure to approximate the maximizer of random functionals such as the realized log likelihood in MLE estimation. We manage to show that the sequence attains the same deviation properties as shown for the profile M-estimator by Andresen and Spokoiny (2013), that means a finite sample Wilks and Fisher theorem. Further under slightly stronger smoothness constraints on the random functional we can show nearly linear convergence to the global maximizer if the starting point for the procedure is well chosen. ©2016 Andreas Andresen, and Vladimir Spokoiny.
Algorithmic statistics is a part of algorithmic information theory (Kolmogorov complexity theory) that studies the following task: given a finite object x (say, a binary string), find an `explanation' for it, i.e., a simple finite set that contains x and where x is a `typical element'. Both notions (`simple' and `typical') are defined in terms of Kolmogorov complexity. It is known that this cannot be achieved for some objects: there are some ``non-stochastic'' objects that do not have good explanations. In this paper we study the properties of maximally non-stochastic objects; we call them ``antistochastic''. In this paper, we demonstrate that the antistochastic strings have the following property: if an antistochastic string x has complexity k, then any k bit of information about x are enough to reconstruct x (with logarithmic advice). In particular, if we erase all but k bits of this antistochastic string, the erased bits can be restored from the remaining ones (with logarithmic advice). As a corollary we get the existence of good list-decoding codes with erasures (or other ways of deleting part of the information). Antistochastic strings can also be used as a source of counterexamples in algorithmic information theory. We show that the symmetry of information property fails for total conditional complexity for antistochastic strings.
ecomposition is an important phase in the design of medium and large-scale systems. Various architectures of software systems and decomposition methods are studied in numerous publications. Presently, formal specifications of software systems are mainly used for experimental purposes; for this reason, their size and complexity are relatively low. As a result, in the development of a nontrivial specification, different approaches to the decomposition should be compared and the most suitable approach should be chosen. In this paper, the experience gained in the deductive verification of the formal specification of the mandatory entity-role model of access and information flows control in Linux (MROSL DP-model) using the formal Event-B method and stepwise refinement technique is analyzed. Two approaches to the refinementbased decomposition of specifications are compared and the sources and features of the complexity of the architecture of the model are investigated.
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons.
Our second contribution is an extension of the Ward and Wardp algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Wardpβ, is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index.
We perform numerous experiments using data sets with two types of noise, insertion of noise features and blurring within-cluster values of some features. These experiments allow us to conclude: (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; (ii) A-Wardpβ provides better cluster recovery than both Ward and Wardp.
This paper suggests a new randomized forecasting method based on entropy-robust estimation for the probability density functions (PDFs) of random parameters in dynamic models described by the systems of linear ordinary differential equations. The structure of the PDFs of the parameters and measurement noises with the maximal entropy is studied. We generate the sequence of random vectors with the entropy-optimal PDFs using a modification of the Ulam–von Neumann method. The developed method of randomized forecasting is applied to the world population forecasting problem.
In this paper, we study scalar multivariate non-stationary subdivision schemes with integer dilation matrix M and present a unifying, general approach for checking their convergence and for determining their Hölder regularity (latter in the case M=mI,m≥2). The combination of the concepts of asymptotic similarity and approximate sum rules allows us to link stationary and non-stationary settings and to employ recent advances in methods for exact computation of the joint spectral radius. As an application, we prove a recent conjecture by Dyn et al. on the Hölder regularity of the generalized Daubechies wavelets. We illustrate our results with several examples.
OBJECTIVE: Despite the long history of digital radiology, one of its most critical aspects-information security-still remains extremely underdeveloped and poorly standardized. To study the current state of radiology security, we explored the worldwide security of medical image archives. MATERIALS AND METHODS: Using the DICOM data-transmitting standard, we implemented a highly parallel application to scan the entire World Wide Web of networked computers and devices, locating open and unprotected radiology servers. We used only legal and radiology-compliant tools. Our security-probing application initiated a standard DICOM handshake to remote computer or device addresses, and then assessed their security posture on the basis of handshake replies. RESULTS: The scan discovered a total of 2774 unprotected radiology or DICOM servers worldwide. Of those, 719 were fully open to patient data communications. Geolocation was used to analyze and rank our findings according to country utilization. As a result, we built maps and world ranking of clinical security, suggesting that even the most radiology-advanced countries have hospitals with serious security gaps. CONCLUSION: Despite more than two decades of active development and implementation, our radiology data still remains insecure. The results provided should be applied to raise awareness and begin an earnest dialogue toward elimination of the problem. The application we designed and the novel scanning approach we developed can be used to identify security breaches and to eliminate them before they are compromised. © American Roentgen Ray Society.
Van Lambalgen’s theorem states that a pair (α, β) of bit sequences is Martin-Löf random if and only ifα is Martin-Löf random and β is Martin-Löf random relative to α. In [Information and Computation 209.2 (2011): 183-197, Theorem 3.3], Hayato Takahashi generalized van Lambalgen’s theorem for computable measures P on a product of two Cantor spaces; he showed that the equivalence holds for each β for which the conditional probability P(⋅|β) is computable. He asked whether this computability condition is necessary. We give a positive answer by providing a computable measure for which van Lambalgen’s theorem fails. We also present a simple construction of a computable measure for which conditional measure is not computable. Such measures were first constructed by Ackerman et al. ([1]).
We obtain sufficient conditions for the differentiability of solutions to stationary Fokker--Planck--Kolmogorov equations with respect to a parameter. In particular, this gives conditions for the differentiability of stationary distributions of diffusion processes with respect to a parameter.
In this paper we address the problem of finding the most probable state of a discrete Markov random field (MRF), also known as the MRF energy minimization problem. The task is known to be NP-hard in general and its practical importance motivates numerous approximate algorithms. We propose a submodular relaxation approach (SMR) based on a Lagrangian relaxation of the initial problem. Unlike the dual decomposition approach of Komodakis et al., 2011 SMR does not decompose the graph structure of the initial problem but constructs a submodular energy that is minimized within the Lagrangian relaxation. Our approach is applicable to both pairwise and high-order MRFs and allows to take into account global potentials of certain types. We study theoretical properties of the proposed approach and evaluate it experimentally.
Sophistication and logical depth are two measures that express how complicated the structure in a string is. Sophistication is defined as the minimal complexity of a computable function that defines a two-part description for the string that is shortest within some precision; the second can be defined as the minimal computation time of a program that is shortest within some precision. We show that the Busy Beaver function of the sophistication of a string exceeds its logical depth with logarithmically bigger precision, and that logical depth exceeds the Busy Beaver function of sophistication with logarithmically bigger precision. We also show that sophistication is unstable in its precision: constant variations can change its value by a linear term in the length of the string.
We present a new recommender system developed for the Russian interactive radio network FMhost. To the best of our knowledge, it is the first model and associated case study for recommending radio stations hosted by real DJs rather than automatically built streamed playlists. To address such problems as cold start, gray sheep, boosting of rankings, preference and repertoire dynamics, and absence of explicit feedback, the underlying model combines a collaborative user-based approach with personalized information from tags of listened tracks in order to match user and radio station profiles. This is made possible with adaptive tag-aware profiling that follows an online learning strategy based on user history. We compare the proposed algorithms with singular value decomposition (SVD) in terms of precision, recall, and normalized discounted cumulative gain (NDCG) measures; experiments show that in our case the fusion-based approach demonstrates the best results. In addition, we give a theoretical analysis of some useful properties of fusion-based linear combination methods in terms of graded ordered sets.
The Universe of RNA Structures DataBase (URSDB) stores information obtained from all RNA-containing PDB entries (2935 entries in October 2015). The content of the database is updated regularly. The database consists of 51 tables containing indexed data on various elements of the RNA structures. The database provides a web interface allowing user to select a subset of structures with desired features and to obtain various statistical data for a selected subset of structures or for all structures. In particular, one can easily obtain statistics on geometric parameters of base pairs, on structural motifs (stems, loops, etc.) or on different types of pseudoknots. The user can also view and get information on an individual structure or its selected parts, e.g. RNA–protein hydrogen bonds. URSDB employs a new original definition of loops in RNA structures. That definition fits both pseudoknot-free and pseudoknotted secondary structures and coincides with the classical definition in case of pseudoknot-free structures. To our knowledge, URSDB is the first database supporting searches based on topological classification of pseudoknots and on extended loop classification.
In this paper we study interactive “one-shot” analogues of the classical Slepian–Wolf theorem. Alice receives a value of a random variable X, Bob receives a value of another random variable Y that is jointly distributed with X. Alice’s goal is to transmit X to Bob (with some error probability εε). Instead of one-way transmission we allow them to interact. They may also use shared randomness.
Let K be an algebraically closed field of characteristic zero, G_m be its multiplicative group, and G_a be its additive group. Consider a linear algebraic group G=(G_m)^r\times G_a. We study equivariant G-embeddings, i.e. normal G-varieties X which contains G as an open orbit. We prove that X is a toric varieties and all such embeddings comes from Demazure roots of the corresponding fan of polyhedral cones.
We prove the existence of positive linear switching systems (continuous time), whose trajectories grow to infinity, but slower than a given increasing function. This implies that, unlike the situation with linear ODE, the maximal growth of trajectories of linear systems may be arbitrarily slow. For systems generated by a finite set of matrices, this phenomenon is proved to be impossible in dimension 2, while in all bigger dimensions the sublinear growth may occur. The corresponding examples are provided and several open problems are formulated.
The paper is focused on the problem of multi-class classification of composite (piecewise-regular) objects (e.g., speech signals, complex images, etc.). We propose a mathematical model of composite object representation as a sequence of independent segments. Each segment is represented as a random sample of independent identically distributed feature vectors. Based on this model and statistical approach we reduce the task to a problem of composite hypothesis testing of segment homogeneity. Several nearest-neighbor criteria are implemented, for some of them the well-known special cases (e.g., the Kullback-Leibler minimum information discrimination principle, the probabilistic neural network) are highlighted. It is experimentally shown that the proposed approach improves the accuracy when compared with contemporary classifiers.
Characteristic curves of a Hamilton–Jacobi equation can be seen as action minimizing trajectories of fluid particles. However this description is valid only for smooth solutions. For nonsmooth “viscosity” solutions, which give rise to discontinuous velocity fields, this picture holds only up to the moment when trajectories hit a shock and cease to minimize the Lagrangian action. In this paper we discuss two physically meaningful regularization procedures, one corresponding to vanishing viscosity and another to weak noise limit. We show that for any convex Hamiltonian, a viscous regularization allows us to construct a nonsmooth flow that extends particle trajectories and determines dynamics inside the shock manifolds. This flow consists of integral curves of a particular “effective” velocity field, which is uniquely defined everywhere in the flow domain and is discontinuous on shock manifolds. The effective velocity field arising in the weak noise limit is generally non-unique and different from the viscous one, but in both cases there is a fundamental self-consistency condition constraining the dynamics.
Nowadays data-sets are available in very complex and heterogeneous ways. Mining of such data collections is essential to support many real-world applications ranging from healthcare to marketing. In this work, we focus on the analysis of “complex” sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of formal concept analysis and its extension based on “pattern structures”. Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures along with projections (i.e. a data reduction of sequential structures) are able to enumerate more meaningful patterns and increase the computing efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analysing interesting patient patterns from a French healthcare data-set on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported in this use-case which is the main motivation for this work.
A multiplier bootstrap procedure for construction of likelihood-based confidence sets is considered for finite samples and a possible model misspecification. Theoretical results justify the bootstrap validity for a small or moderate sample size and allow to control the impact of the parameter dimension pp: the bootstrap approximation works if p3/np3/n is small. The main result about bootstrap validity continues to apply even if the underlying parametric model is misspecified under the so-called small modelling bias condition. In the case when the true model deviates significantly from the considered parametric family, the bootstrap procedure is still applicable but it becomes a bit conservative: the size of the constructed confidence sets is increased by the modelling bias. We illustrate the results with numerical examples for misspecified linear and logistic regressions.
The paper presents a least squares framework for divisive clustering. Two popular divisive clustering methods, Bisecting K-Means and Principal Direction Division, appear to be versions of the same least squares approach. The PDD recently has been enhanced with a stopping criterion taking into account the minima of the corresponding one-dimensional density function (dePDDP method). We extend this approach to Bisecting K-Means by projecting the data onto random directions and compare thus modified methods. It appears the dePDDP method is superior at datasets with relatively small numbers of clusters, whatever cluster intermix, whereas our version of Bisecting K-Means is superior at greater cluster numbers with noise entities added to the cluster structure.
This paper studies the first differences w(t) of the International Sunspot Numbers daily series, ISSN, over the 1850-2013 time span. The one-day correlations r1 between w(t) and w(t+1) are computed within 4 year sliding windows and found to shift from negative to positive values near the end of Cycle 17. They remain positive during the last Grand Maximum and until ~2009, when they fall to zero. We test an autoregressive process of order 1 (AR(1)) as a model that can reproduce the high frequency component of ISSN: we compute r1 for this AR(1) process, and find that it is negative. Positive values of r1 are found only if the process involves positive correlation: this leads us to suggest that the births of successive spots are positively correlated. We also show that the two-day correlation r2 of ISSN is, as expected, closer to 0 than r1. Finally, we identify two prominent regime changes in ~1915 and ~2009, strengthening previous evidence of major anomalies of solar activity at these dates.
Program obfuscation is a semantic-preserving transformation aimed at bringing a program into a form that impedes understanding of its algorithm and data structures or prevents extracting certain valuable information from the text of the program. Since obfuscation may find wide use in computer security, information hiding and cryptography, security requirements to program obfuscators have become a major focus of interest in the theory of software obfuscation starting from the pioneering works in this field. In this paper we give a survey of various definitions of obfuscation security and basic results that establish possibility or impossibility of secure program obfuscation under certain cryptographic assumptions.
Errors in implicative theories coming from binary data are studied. First, two classes of errors that may affect implicative theories are singled out. Two approaches for finding errors of these classes are proposed, both of them based on methods of Formal Concept Analysis. The first approach uses the cardinality minimal (canonical or Duquenne–Guigues) implication base. The construction of such a base is computationally intractable. Using an alternative approach one checks possible errors on the fly in polynomial time via computing closures of subsets of attributes. Both approaches are interactive, based on questions about the validity of certain implications. Results of computer experiments are presented and discussed.
Process-aware information systems (PAIS) are systems relying on processes, which involve human and software resources to achieve concrete goals. There is a need to develop approaches for modeling, analysis, improvement and monitoring processes within PAIS. These approaches include process mining techniques used to discover process models from event logs, find log and model deviations, and analyze performance characteristics of processes. The representational bias (a way to model processes) plays an important role in process mining. The BPMN 2.0 (Business Process Model and Notation) standard is widely used and allows to build conventional and understandable process models. In addition to the flat control flow perspective, subprocesses, data flows, resources can be integrated within one BPMN diagram. This makes BPMN very attractive for both process miners and business users. In this paper, we describe and justify robust control flow conversion algorithms, which provide the basis for more advanced BPMN-based discovery and conformance checking algorithms. We believe that the results presented in this paper can be used for a wide variety of BPMN mining and conformance checking algorithms. We also provide metrics for the processes discovered before and after the conversion to BPMN structures. Cases for which conversion algorithms produce more compact or more involved BPMN models in comparison with the initial models are identified.
In this paper a novel clustering algorithm is proposed as a version of the Seeded Region Growing (SRG) approach for the automatic recognition of coastal upwelling from Sea Surface Temperature (SST) images. The new algorithm, One Seed Expanding Cluster (SEC), takes advantage of the concept of approximate clustering due to Mirkin (1996, 2013) to derive a homogeneity criterion in the format of a product rather than the conventional difference between a pixel value and the mean of values over the region of interest. It involves a boundary-oriented pixel labeling so that the cluster growing is performed by expanding its boundary iteratively. The starting point is a cluster consisting of just one seed, the pixel with the cold est temperature. The baseline version of the SEC algorithm uses the Otsu’s thresholding method to fine-tune the homogeneity threshold. Unfortunately, this method does not always lead to a satisfactory solution. Therefore, we introduce a self-tuning version of the algorithm in which the homogeneity threshold parameter is abolished and the similarity threshold derived from the approximation criterion also serves as a homogeneity parameter.
This paper presents several definitions of “optimal patterns” in triadic data and results of experimental comparison of five triclustering algorithms on real-world and synthetic datasets. The evaluation is carried over such criteria as resource efficiency, noise tolerance and quality scores involving cardinality, density, coverage, and diversity of the patterns. An ideal triadic pattern is a totally dense maximal cuboid (formal triconcept). Relaxations of this notion under consideration are: OAC-triclusters; triclusters optimal with respect to the least-square criterion; and graph partitions obtained by using spectral clustering. We show that searching for an optimal tricluster cover is an NP-complete problem, whereas determining the number of such covers is #P-complete. Our extensive computational experiments lead us to a clear strategy for choosing a solution at a given dataset guided by the principle of Pareto-optimality according to the proposed criteria.
We consider the problems of computing the maximal and the minimal non-empty suffixes of substrings of a longer text of length . n. For the minimal suffix problem we show that for every . τ, . 1≤τ≤logn, there exists a linear-space data structure with . O(τ) query time and . O(nlogn/τ) preprocessing time. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of the text in . O(kτ) time, where . k is the number of distinct factors in the decomposition. For the maximal suffix problem, we give a linear-space structure with . O(1) query time and . O(n) preprocessing time. In other words, we simultaneously achieve both the optimal query time and the optimal construction time. © 2015 Elsevier B.V.
This paper describes a modified version of an algorithm suggested earlier by the authors for optimizing of ads allocation in sponsored search on the main page of search results in response to user search queries to a web search engine. It is demonstrated that the modification of the algorithm reduces appreciably the searching procedure and the algorithm complexity. And finally, the new algorithm undergoes experimental testing on real data sets provided by Yandex. © 2015, Pleiades Publishing, Ltd.
The work deals with combinatorial problems concerning colorings of non-uniform hypergraphs. Let $H=(V,E)$ be a hypergraph with minimum edge-cardinality $n$. We show that if $H$ is a simple hypergraph (i.e. every two distinct edges have at most one common vertex) and $$ \sum_{e\in E}r^{1-|e|}\leqslant c\sqrt{n}, $$ for some absolute constant $c>0$, then $H$ is $r$-colorable. We also obtain a stronger result for triangle-free simple hypergraphs by proving that if $H$ is a simple triangle-free hypergraph and $$ \sum_{e\in E}r^{1-|e|}\leqslant c\cdot n, $$ for some absolute constant $c>0$, then $H$ is $r$-colorable.
We initiate a comprehensive study of the complexity of computing Boolean functions by polynomial threshold functions (PTFs) on general Boolean domains (as opposed to domains such as $\zon$ or $\mon$ that enforce multilinearity). A typical example of such a general Boolean domain, for the purpose of our results, is $\{1,2\}^n$. We are mainly interested in the length (the number of monomials) of PTFs, with their degree and weight being of secondary interest. First we motivate the study of PTFs over the $\{1,2\}^n$ domain by showing their close relation to depth two threshold circuits. In particular we show that PTFs of polynomial length and polynomial degree compute exactly the functions computed by polynomial size $\ensuremath{{\rm \sf THR}} \circ \ensuremath{{\rm \sf MAJ}}$ circuits. We note that known lower bounds for $\ensuremath{{\rm \sf THR}} \circ \ensuremath{{\rm \sf MAJ}}$ circuits extends to the likely strictly stronger model of PTFs (with no degree restriction). We also show that a ``max-plus'' version of PTFs are related to $\mathsf{AC}^0 \circ \ensuremath{{\rm \sf THR}}$ circuits. We exploit this connection to gain a better understanding of threshold circuits. In particular, we show that (super-logarithmic) lower bounds for 3-player randomized communication protocols with unbounded error would yield (super-polynomial) size lower bounds for $\ensuremath{{\rm \sf THR}} \circ \ensuremath{{\rm \sf THR}}$ circuits. Finally, having thus motivated the model, we initiate structural studies of PTFs. These include relationships between weight and degree of PTFs, and a degree lower bound for PTFs of constant length.
The Multifractal Embedded Branching Process (MEBP) process and Canonical Embedded Branching Process (CEBP) process were introduced by Decrouez and Jones (2012). The CEBP is a process in which the crossings of dyadic intervals constitute a branching process. An MEBP process is defined as a mul- tifractal time-change of a CEBP process, where the time-change is such that both it and the CEBP can be simulated on-line. In this paper, under various moment conditions, we show that CEBP processes have a constant modulus of continuity, obtain the Hausdorff spectrum of the time-change, and thus obtain the Hausdorff spectrum of an MEBP process.
We show that the Voronoi conjecture is true for parallelohedra with simply connected δ-surfaces. That is, we show that if the boundary of parallelohedron P remains simply connected after removing closed nonprimitive faces of codimension 2, then P is affinely equivalent to a Dirichlet–Voronoi domain of some lattice. Also, we construct the π-surface associated with a parallelohedron and give another condition in terms of a homology group of the constructed surface. Every parallelohedron with a simply connected δ-surface also satisfies the condition on the homology group of the π-surface.
We study a stability property of probability laws with respect to small violations of algorithmic randomness. Some sufficient condition of stability is presented in terms of Schnorr tests of algorithmic randomness. Most probability laws, like the strong law of large numbers, the law of iterated logarithm, and even Birkhoff's pointwise ergodic theorem for ergodic transformations, are stable in this sense. Nevertheless, the phenomenon of instability occurs in ergodic theory. Firstly, the stability property of Birkhoff's ergodic theorem is non-uniform. Moreover, a computable non-ergodic measure-preserving transformation can be constructed such that the ergodic theorem is non-stable.
The theory of separating codes has been applied in several areas of science ranging from automata synthesis to the protection of distribution rights. In this paper, we introduce a relaxed version of separating and secure frameproof codes and show that for the relaxed definitions these two notions are different, as opposed to the original definitions when these notions coincide. Moreover, we also discuss how this new relaxed versions of the codes can be used to construct a family of fingerprinting codes.
Since their inception in 1962, Petri nets have been used in a wide variety of application domains. Although Petri nets are graphical and easy to understand, they have formal semantics and allow for analysis techniques ranging from model checking and structural analysis to process mining and performance analysis. Over time Petri nets emerged as a solid foundation for Business Process Management (BPM) research. The BPM discipline develops methods, techniques, and tools to support the design, enactment, management, and analysis of operational business processes. Mainstream business process modeling notations and workflow management systems are using token-based semantics borrowed from Petri nets. Moreover, state-of-the-art BPM analysis techniques are using Petri nets as an internal representation. Users of BPM methods and tools are often not aware of this. This paper aims to unveil the seminal role of Petri nets in BPM.
A new data structure for efficient similarity search in very large datasets of high-dimensional vectors is introduced. This structure called the inverted multi-index generalizes the inverted index idea by replacing the standard quantization within inverted indices with product quantization. For very similar retrieval complexity and pre-processing time, inverted multi-indices achieve a much denser subdivision of the search space compared to inverted indices, while retaining their memory efficiency. Our experiments with large datasets of SIFT and GIST vectors demonstrate that because of the denser subdivision, inverted multi-indices are able to return much shorter candidate lists with higher recall. Augmented with a suitable reranking procedure, multi-indices were able to significantly improve the speed of approximate nearest neighbor search on the dataset of 1 billion SIFT vectors compared to the best previously published systems, while achieving better recall and incurring only few percent of memory overhead. © 2014 IEEE.
This study is dedicated to the introduction of a novel method that automatically extracts potential structural alerts from a data set of molecules. These triggering structures can be further used for knowledge discovery and classification purposes. Computation of the structural alerts results from an implementation of a sophisticated workflow that integrates a graph mining tool guided by growth rate and stability. The growth rate is a well-established measurement of contrast between classes. Moreover, the extracted patterns correspond to formal concepts; the most robust patterns, named the stable emerging patterns (SEPs), can then be identified thanks to their stability, a new notion originating from the domain of formal concept analysis. All of these elements are explained in the paper from the point of view of computation. The method was applied to a molecular data set on mutagenicity. The experimental results demonstrate its efficiency: it automatically outputs a manageable number of structural patterns that are strongly related to mutagenicity. Moreover, a part of the resulting structures corresponds to already known structural alerts. Finally, an in-depth chemical analysis relying on these structures demonstrates how the method can initiate promising processes of chemical knowledge discovery. © 2015 American Chemical Society.
Activities such as clinical investigations (CIs) or financial processes are subject to regulations to ensure quality of results and avoid negative consequences. Regulations may be imposed by multiple governmental agencies as well as by institutional policies and protocols. Due to the complexity of both regulations and activities, there is great potential for violation due to human error, misunderstanding, or even intent. Executable formal models of regulations, protocols and activities can form the foundation for automated assistants to aid planning, monitoring and compliance checking. We propose a model based on multiset rewriting where time is discrete and is specified by timestamps attached to facts. Actions, as well as initial, goal and critical states may be constrained by means of relative time constraints. Moreover, actions may have non-deterministic effects, i.e. they may have different outcomes whenever applied. We present a formal semantics of our model based on focused proofs of linear logic with definitions. We also determine the computational complexity of various planning problems. Plan compliance problem, for example, is the problem of finding a plan that leads from an initial state to a desired goal state without reaching any undesired critical state. We consider all actions to be balanced, i.e. their pre- and post-conditions have the same number of facts. Under this assumption on actions, we show that the plan compliance problem is PSPACE-complete when all actions have only deterministic effects and is EXPTIME-complete when actions may have non-deterministic effects. Finally, we show that the restrictions on the form of actions and time constraints taken in the specification of our model are necessary for decidability of the planning problems. Copyright © Cambridge University Press 2015
A tropical (or min-plus) semiring is a set $\mathbb{Z}$ (or $\mathbb{Z \cup \{\infty\}}$) endowed with two operations: $\oplus$, which is just usual minimum, and $\odot$, which is usual addition. In tropical algebra a vector $x$ is a solution to a polynomial $g_1(x) \oplus g_2(x) \oplus \ldots \oplus g_k(x)$, where $g_i(x)$'s are tropical monomials, if the minimum in $\min_i(g_{i}(x))$ is attained at least twice. In min-plus algebra solutions of systems of equations of the form $g_1(x)\oplus \ldots \oplus g_k(x) = h_1(x)\oplus \ldots \oplus h_l(x)$ are studied. In this paper we consider computational problems related to tropical linear system. We show that the solvability problem (both over $\mathbb{Z}$ and $\mathbb{Z} \cup \{\infty\}$) and the problem of deciding the equivalence of two linear systems (both over $\mathbb{Z}$ and $\mathbb{Z} \cup \{\infty\}$) are equivalent under polynomial-time reductions to mean payoff games and are also equivalent to analogous problems in min-plus algebra. In particular, all these problems belong to $\mathsf{NP} \cap \mathsf{coNP}$. Thus, we provide a tight connection of computational aspects of tropical linear algebra with mean payoff games and min-plus linear algebra. On the other hand we show that computing the dimension of the solution space of a tropical linear system and of a min-plus linear system is $\mathsf{NP}$-complete.
We study a stability property of probability laws with respect to small violations of algorithmic randomness. Some sufficient condition of stability is presented in terms of Schnorr tests of algorithmic randomness. Most probability laws, like the strong law of large numbers, the law of iterated logarithm, and even Birkhoff's pointwise ergodic theorem for ergodic transformations, are stable in this sense. Nevertheless, the phenomenon of instability occurs in ergodic theory. Firstly, the stability property of Birkhoff's ergodic theorem is non-uniform. Moreover, a computable non-ergodic measure-preserving transformation can be constructed such that the ergodic theorem is non-stable.
Consideration was given to optimization of the queue control strategy in the MlGl1l queuing system where decision about continuing or stopping admission of customers is made at the service completion instants of each customer in compliance with the distribution on the set of decisions depending on the number of customers remaining in the system. The mean specific income in the stationary mode was used as the efficiency criterion and the set of permissible strategies coincided with the set of homogeneous randomised Markov strategies. It was proved that if there exists an optimal strategy, then it is degenerate and threshold with one point on control switching, that is, if the number of customers in the system exceeds a certain level, then admission of customers must be stooped or, otherwise, it must be continued.
In this paper we study the structure of suffix trees. Given an unlabelled tree $\tau$ on $n$ nodes and suffix links of its internal nodes, we ask the question "Is $\tau$ a suffix tree?", i.e., is there a string $S$ whose suffix tree has the same topological structure as $\tau$? We place no restrictions on $S$, in particular we do not require that $S$ ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. We prove that $\tau$ is a suffix tree if and only if it is realized by a string $S$ of length $n-1$, and we give a linear-time algorithm for inferring $S$ when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].
We apply the suboptimal sequential nonparametric hypotheses testing approach for effectiveness of a statistical decision by sample space reducing. Numerical examples of the sample space reducing are given when an appropriate reducing makes it possible to construct robust sequential nonparametric hypotheses testing with a smaller mean duration time then one on the total sample space. © 2014 IEEE.
Formal concept analysis (FCA) is a well-founded method for data analysis and has many applications in data mining. Pattern structures is an extension of FCA for dealing with complex data such as sequences or graphs. However the computational complexity of computing with pattern structures is high and projections of pattern structures were introduced for simplifying computation. In this paper we introduce o-projections of pattern structures, a generalization of projections which defines a wider class of projections preserving the properties of the original approach. Moreover, we show that o-projections form a semilattice and we discuss the correspondence between o-projections and the representation contexts of o-projected pattern structures. © 2015 Springer International Publishing Switzerland.
An extension of the problem of reconstruction of words from a given multiset of its subwords supposedly generated by unit shifts of a window of fixed length along such words. In the extension, feasible solutions must satisfy additional constraints. The case is considered when these constraints are specified by forbidden words. A solution to the problem is obtained as a result of searching for Euler paths or Euler cycles in a de Bruijn multidigraph with the additional operation of reduction of its edges and subsequent application of special algebraic operations of multiplication of adjacency matrices as in the first part of this article. © 2015, Springer Science+Business Media New York.
The paper defines an annotated suffix tree (AST) - a data structure used to calculate and store the frequencies of all the fragments of the given string or a collection of strings. The AST is associated with a string to text scoring, which takes all fuzzy matches into account. We show how the AST and the AST scoring can be used for Natural Language Processing tasks. Copyright © by the paper's authors. Copying only for private and academic purposes.
The cell membrane is "stuffed" with proteins, whose transmembrane (TM) helical domains spontaneously associate to form functionally active complexes. For a number of membrane receptors, a modulation of TM domains' oligomerization has been shown to contribute to the development of severe pathological states, thus calling for detailed studies of the atomistic aspects of the process. Despite considerable progress achieved so far, several crucial questions still remain: How do the helices recognize each other in the membrane? What is the driving force of their association? Here, we assess the dimerization free energy of TM helices along with a careful consideration of the interplay between the structure and dynamics of protein and lipids using atomistic molecular dynamics simulations in the hydrated lipid bilayer for three different model systems - TM fragments of glycophorin A, polyalanine and polyleucine peptides. We observe that the membrane driven association of TM helices exhibits a prominent entropic character, which depends on the peptide sequence. Thus, a single TM peptide of a given composition induces strong and characteristic perturbations in the hydrophobic core of the bilayer, which may facilitate the initial "communication" between TM helices even at the distances of 20-30 Å. Upon tight helix-helix association, the immobilized lipids accommodate near the peripheral surfaces of the dimer, thus disturbing the packing of the surrounding. The dimerization free energy of the modeled peptides corresponds to the strength of their interactions with lipids inside the membrane being the lowest for glycophorin A and similarly higher for both homopolymers. We propose that the ability to accommodate lipid tails determines the dimerization strength of TM peptides and that the lipid matrix directly governs their association. © 2015 American Chemical Society.
An approach to multiple labeling research papers is explored. We develop techniques for annotating/labeling research pa- pers in informatics and computer sciences with key phrases taken from the ACM Computing Classification System. The techniques utilize a phrase-to-text relevance measure so that only those phrases that are most relevant go to the anno- tation. Three phrase-to-text relevance measures are experi- mentally compared in this setting. The measures are: (a) co- sine relevance score between conventional vector space repre- sentations of the texts coded with tf-idf weighting; (b) pop- ular characteristic of probability of term generation BM25; and (c) an in-house characteristic of conditional probability of symbols averaged over matching fragments in suffix trees representing texts and phrases, CPAMF. In an experiment conducted over a set of texts published in journals of the ACM and manually annotated by their authors, CPAMF outperforms both the cosine measure and BM25 by a wide margin.