Tel.: +7 (495) 772-95-90, ext. 12332
125319, Moscow, Kochnovsky Proezd, 3 (near the Aeroport metro station).
Dean: Ivan Vladimirovich Arzhantsev
First Deputy Dean of the Faculty: Tamara Vasilyevna Voznesenskaya
Deputy Dean for Research and International Relations: Sergei Alexandrovich Obiedkov
Deputy Dean for Administrative and Financial Affairs: Irina Alexandrovna Plisetskaya
Algorithmic statistics is a part of algorithmic information theory (Kolmogorov complexity theory) that studies the following task: given a finite object x (say, a binary string), find an "explanation" for it, i.e., a simple finite set that contains x and in which x is a "typical element". Both notions ("simple" and "typical") are defined in terms of Kolmogorov complexity. It is known that this cannot be achieved for some objects: there are "non-stochastic" objects that do not have good explanations. In this paper we study the properties of maximally non-stochastic objects, which we call "antistochastic". We demonstrate that antistochastic strings have the following property: if an antistochastic string x has complexity k, then any k bits of information about x are enough to reconstruct x (with logarithmic advice). In particular, if we erase all but k bits of an antistochastic string, the erased bits can be restored from the remaining ones (with logarithmic advice). As a corollary, we obtain the existence of good list-decodable codes for erasures (and other ways of deleting part of the information). Antistochastic strings can also serve as a source of counterexamples in algorithmic information theory: we show that the symmetry of information property fails for total conditional complexity on antistochastic strings.
Multiplicative matrix semigroups with constant spectral radius (c.s.r.) are studied and applied to several problems of algebra, combinatorics, functional equations, and dynamical systems. We show that all such semigroups are characterized by means of irreducible ones. Each irreducible c.s.r. semigroup defines walks on a Euclidean sphere, and all its nonsingular elements are similar (in the same basis) to orthogonal matrices. We classify all nonnegative c.s.r. semigroups and all low-dimensional ones. For higher dimensions, we describe five classes and leave open the problem of the completeness of that list. The problem of algorithmically recognizing the c.s.r. property is proved to be polynomially solvable for irreducible semigroups and undecidable for reducible ones.
Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions for selecting an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered striking correlations between the Minkowski profile, defined as a mapping of the Minkowski exponent values to the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from comparing the obtained partitions to the ground truth. Our findings are confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
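A minimal sketch of the weight update that encodes this intuition, assuming the standard MWK-means rule in which a feature's weight within a cluster is inversely related to its Minkowski dispersion; the exponent p and the toy data are illustrative:

```python
import numpy as np

def mwk_feature_weights(cluster_points, center, p):
    """Cluster-specific feature weights in the MWK-means style:
    low-dispersion features receive high weights."""
    # Per-feature Minkowski dispersion of the cluster around its center.
    D = np.sum(np.abs(cluster_points - center) ** p, axis=0)
    D = np.maximum(D, 1e-12)            # guard against zero dispersion
    inv = D ** (-1.0 / (p - 1.0))       # the usual 1/(p-1) exponent
    return inv / inv.sum()              # normalize: weights sum to 1

# Feature 0 has low variance, feature 1 high variance.
pts = np.array([[0.0, 0.0], [0.1, 2.0], [-0.1, -2.0]])
w = mwk_feature_weights(pts, pts.mean(axis=0), p=2.0)
```

As expected, the low-variance feature receives almost all of the weight.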
In this paper we study interactive "one-shot" analogues of the classical Slepian–Wolf theorem. Alice receives a value of a random variable X; Bob receives a value of another random variable Y that is jointly distributed with X. Alice's goal is to transmit X to Bob (with some error probability ε). Instead of one-way transmission we allow them to interact; they may also use shared randomness. We show that for every natural r Alice can transmit X to Bob using $(1+\frac{1}{r})H(X|Y)+r+O(\log_2(1/\varepsilon))$ bits on average in $\frac{2H(X|Y)}{r}+2$ rounds on average. Setting $r=\lceil\sqrt{H(X|Y)}\rceil$ and using a result of Braverman and Garg [2], we conclude that every one-round protocol π with information complexity I can be compressed to a (many-round) protocol with expected communication about $I+2\sqrt{I}$ bits. This improves a result of Braverman and Rao [3], who had $I+5\sqrt{I}$. Further, we show (by setting $r=\lceil H(X|Y)\rceil$) how to solve this problem (transmitting X) using $2H(X|Y)+O(\log_2(1/\varepsilon))$ bits and 4 rounds on average. This improves a result of Brody et al. [4], who had $4H(X|Y)+O(\log(1/\varepsilon))$ bits and 10 rounds on average. At the end of the paper we discuss how many bits Alice and Bob may need to communicate on average beyond H(X|Y). The main question is whether the upper bounds mentioned above are tight. We provide an example of (X, Y) such that transmitting X from Alice to Bob with error probability ε requires $H(X|Y)+\Omega(\log_2(1/\varepsilon))$ bits on average.
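The role of the parameter r can be checked numerically: with the O(log(1/ε)) term dropped, the bound (1 + 1/r)H(X|Y) + r is minimized when r is close to the square root of H(X|Y), giving about H(X|Y) + 2·sqrt(H(X|Y)) bits. A sketch (the value H = 100 is illustrative):

```python
import math

def avg_bits(H, r):
    # Leading terms of the protocol's average communication:
    # (1 + 1/r) * H(X|Y) + r, with the O(log(1/eps)) term omitted.
    return (1.0 + 1.0 / r) * H + r

H = 100.0
best_r = min(range(1, 200), key=lambda r: avg_bits(H, r))
# Balancing the H/r and r terms gives r = sqrt(H) and about H + 2*sqrt(H) bits.
```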
We consider Gillette's two-person zero-sum stochastic games with perfect information. For each $k\in\mathbb{N}=\{0,1,\dots\}$ we introduce an effective reward function, called k-total. For $k=0$ and $k=1$ this function is known as mean payoff and total reward, respectively. We restrict our attention to the deterministic case. For all k, we prove the existence of a saddle point which can be realized by uniformly optimal pure stationary strategies. We also demonstrate that k-total reward games can be embedded into $(k+1)$-total reward games.
A small improvement in the structure of a material could potentially lower manufacturing costs. Thus, free material design can be formulated as an optimization problem. However, due to its large scale, second-order methods cannot solve the free material design problem in reasonable time. We formulate the free material optimization (FMO) problem in a saddle-point form in which the inverse of the stiffness matrix $A(E)$ in the constraint is eliminated. The size of $A(E)$ is generally large, denoted as $N \times N$. This is the first formulation of FMO without $A(E)^{-1}$. We apply the primal-dual subgradient method [Y. Nesterov, Math. Program., 120 (2009), pp. 221--259] to solve the restricted saddle-point formulation. This is the first gradient-type method for FMO. Each iteration of our algorithm takes a total of $\mathcal{O}(N^2)$ floating-point operations and an auxiliary vector storage of size $\mathcal{O}(N)$, compared with formulations involving the inverse of $A(E)$, which require $\mathcal{O}(N^3)$ arithmetic operations and auxiliary vector storage of size $\mathcal{O}(N^2)$. To solve the problem, we developed a closed-form solution to a semidefinite least squares problem and an efficient parameter update scheme for the gradient method. We also approximate a solution to the bounded Lagrangian dual problem. The problem is decomposed into small problems, each having as its only unknown a $k\times k$ ($k=3$ or $6$) matrix, and can be solved in parallel. The iteration bound of our algorithm is optimal for a general subgradient scheme. Finally, we present promising numerical results.
Dualization of a monotone Boolean function on a finite lattice can be represented as the transformation of the set of its minimal 1-values into the set of its maximal 0-values. In this paper we consider finite lattices given by ordered sets of their meet and join irreducibles (i.e., as the concept lattice of a formal context). We show that in this case dualization is equivalent to the enumeration of so-called minimal hypotheses. In contrast to the usual dualization setting, where a lattice is given by the ordered set of its elements, dualization in this case is shown to be impossible in output polynomial time unless P = NP. However, if the lattice is distributive, dualization is shown to be possible in subexponential time.
Small open reading frames (sORFs) and genes for non-coding RNAs are poorly investigated components of most genomes. Our analysis of 1391 ORFs recently annotated in the soybean symbiont Bradyrhizobium japonicum USDA 110 revealed that 78% of them contain fewer than 80 codons. Twenty-one of these sORFs are conserved within or outside the Alphaproteobacteria, and most of them are similar to genes found in transposable elements, in line with their broad distribution. Stabilizing selection was demonstrated for sORFs with proteomic evidence and for bll1319_ISGA, which is conserved at the nucleotide level in 16 alphaproteobacterial species, 79 species from other taxa, and 49 other Proteobacteria. Further, we used Northern blot hybridization to validate ten small RNAs (BjsR1 to BjsR10) belonging to new RNA families. We found that BjsR1 and BjsR3 have homologs outside the genus Bradyrhizobium, and that BjsR5, BjsR6, BjsR7, and BjsR10 have up to four imperfect copies in Bradyrhizobium genomes. BjsR8, BjsR9, and BjsR10 are present exclusively in nodules, while the other sRNAs are also expressed in liquid cultures. We also found that the level of BjsR4 decreases after exposure to tellurite and iron, and this down-regulation contributes to survival under high-iron conditions. Analysis of additional small RNAs overlapping with 3'-UTRs revealed two new repetitive elements, named Br-REP1 and Br-REP2. These REP elements may play roles in genomic plasticity and gene regulation and could be useful for strain identification by PCR fingerprinting. Furthermore, we studied two potential toxin genes in the symbiotic island and confirmed the toxicity of the yhaV homolog bll1687, but not of the newly annotated higB homolog blr0229_ISGA, in E. coli. Finally, we revealed transcription interference resulting in an antisense RNA complementary to blr1853, a gene induced in symbiosis. The presented results expand our knowledge of sORFs, non-coding RNAs, and repetitive elements in B. japonicum and related bacteria.
Biological nitrogen fixation plays a crucial role in the nitrogen cycle. An ability to fix atmospheric nitrogen, reducing it to ammonium, was described for multiple species of Bacteria and Archaea. The transcriptional regulatory network for nitrogen fixation was extensively studied in several representatives of the class Alphaproteobacteria. This regulatory network includes the activator of nitrogen fixation NifA, working in tandem with the alternative sigma-factor RpoN as well as oxygen-responsive regulatory systems, one-component regulators FnrN/FixK and two-component system FixLJ. Here we used a comparative genomics approach for in silico study of the transcriptional regulatory network in 50 genomes of Alphaproteobacteria. We extended the known regulons and proposed the scenario for the evolution of the nitrogen fixation transcriptional network. The reconstructed network substantially expands the existing knowledge of transcriptional regulation in nitrogen-fixing microorganisms and can be used for genetic experiments, metabolic reconstruction, and evolutionary analysis. © 2016 Tsoy, Ravcheev, Čuklina and Gelfand.
This paper proposes a new object model of data for the in-depth analysis of network traffic. In contrast to the model used by most modern network analyzers (for example, Wireshark and Snort), the proposed model supports reassembly of data streams with subsequent parsing. The model also provides a convenient universal mechanism for binding parsers, thus making it possible to develop completely independent parsers. Moreover, the proposed model allows processing modified (compressed or encrypted) data. This model forms the basis of the infrastructure for the in-depth analysis of network traffic.
This paper considers the problem of deductive verification of Linux kernel code that is concurrent and accesses shared data. The presence of shared data does not allow applying traditional deductive verification techniques, so we consider how to verify such code by proving its compliance with a given specification of a certain synchronization discipline. The approach is illustrated by the examples of a spinlock specification and a simplified specification of the read-copy-update (RCU) API.
This paper presents a memory model with nonoverlapping memory areas (regions) for the deductive verification of C programs. The memory model uses a core language that supports arbitrary nesting of structures, unions, and arrays within other structures and, as a result of region analysis, reduces the number of user-provided annotations. The paper also describes the semantics of the core language and the normalizing transformations that translate an input annotated C program into a program in the core language. In addition, an encoding is proposed for modeling the memory state of a core-language program with logical formulas, as described by the model semantics of the core language. For the model semantics, soundness and completeness theorems are proved. Additional constraints on the typing context of core-language terms are described that determine the result of the region analysis, enabling complete modeling of a limited class of programs without additional annotations. A proof sketch for the theorem stating completeness of the proposed region analysis for this limited class of programs is presented.
Multi-level multi-agent systems (MASs) with dynamic structure are widely used in solving important applied problems in telecommunication, transportation, social, and other systems. Therefore, ensuring correct behavior of such systems is a relevant and important task. One of the most error-prone stages of system development within the model-oriented approach is the implementation stage, in the course of which program code is constructed from the developed model. This paper presents an algorithm for the automated translation of MAS models represented as nested Petri nets into systems of distributed components. Nested Petri nets are an extension of Petri nets within the nets-within-nets approach, which assumes that tokens in a Petri net may themselves be Petri nets, possess autonomous behavior, and interact with other tokens of the net. This makes it possible to model MASs with dynamic structure in a natural way. The translation presented in this paper preserves the distribution level and important behavioral properties (safety, liveness, and conditional liveness) of the original model and ensures fairness of the target system's execution. Such a translation makes it possible to automate the construction of distributed MASs from nested Petri net models. As a test example, translation of nested Petri nets into systems of distributed components was implemented on the basis of the EJB component technology.
To solve the robust PageRank problem, a saddle-point Mirror Descent algorithm for convex-concave optimization problems is enhanced and studied. The algorithm is based on two proxy functions chosen to exploit the structure of the sets over which the min-max search is performed; in the robust PageRank case these are an entropy-like function and the squared Euclidean norm. Applying the saddle-point Mirror Descent algorithm to robust PageRank leads to concrete complexity results, which are discussed alongside an illustrative numerical example.
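For intuition, here is a minimal sketch of one Mirror Descent step with an entropy-like proxy function: the update is multiplicative and keeps the iterate on the probability simplex, the natural domain of PageRank-type variables. The step size and gradient below are illustrative:

```python
import numpy as np

def entropy_md_step(x, grad, eta):
    """One Mirror Descent step with an entropy-like proxy function:
    a multiplicative update that stays on the probability simplex."""
    y = x * np.exp(-eta * grad)
    return y / y.sum()

x0 = np.full(4, 0.25)                  # uniform start on the simplex
g = np.array([1.0, 0.0, 0.0, 0.0])     # coordinate 0 has the largest gradient
x1 = entropy_md_step(x0, g, eta=0.5)
```

The mass on coordinate 0 shrinks while the iterate remains a probability vector.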
We derive two convergence results for a sequential alternating maximization procedure that approximates the maximizer of random functionals such as the realized log-likelihood in maximum likelihood estimation. We show that the sequence attains the same deviation properties as established for the profile M-estimator by Andresen and Spokoiny (2013), that is, finite-sample Wilks and Fisher theorems. Further, under slightly stronger smoothness constraints on the random functional, we show nearly linear convergence to the global maximizer if the starting point of the procedure is well chosen. © 2016 Andreas Andresen and Vladimir Spokoiny.
Decomposition is an important phase in the design of medium- and large-scale systems. Various architectures of software systems and decomposition methods are studied in numerous publications. Presently, formal specifications of software systems are mainly used for experimental purposes; for this reason, their size and complexity are relatively low. Consequently, in the development of a nontrivial specification, different approaches to decomposition should be compared and the most suitable one chosen. In this paper, the experience gained in the deductive verification of the formal specification of the mandatory entity-role model of access and information flow control in Linux (MROSL DP-model) using the formal Event-B method and the stepwise refinement technique is analyzed. Two approaches to the refinement-based decomposition of specifications are compared, and the sources and features of the complexity of the model's architecture are investigated.
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons.
Our second contribution is an extension of the Ward and Ward$_p$ algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Ward$_{p\beta}$, is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index.
We perform numerous experiments using data sets with two types of noise: insertion of noise features and blurring of within-cluster values of some features. These experiments allow us to conclude that (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; and (ii) A-Ward$_{p\beta}$ provides better cluster recovery than both Ward and Ward$_p$.
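A rough sketch of the anomalous-pattern idea (not the authors' exact A-Ward procedure): repeatedly grow a cluster around the point farthest from the grand mean, record its center, remove its members, and use the surviving centers to seed the merge phase. The function name, loop cap, and toy data are illustrative:

```python
import numpy as np

def anomalous_pattern_init(X, max_clusters=10):
    """Sketch of anomalous-pattern initialisation: grow a cluster around
    the point farthest from the grand mean, remove it, repeat."""
    grand = X.mean(axis=0)
    remaining = X.copy()
    centers = []
    while len(remaining) and len(centers) < max_clusters:
        far = np.argmax(((remaining - grand) ** 2).sum(axis=1))
        center = remaining[far]
        for _ in range(100):                   # cap the refinement loop
            d_c = ((remaining - center) ** 2).sum(axis=1)
            d_g = ((remaining - grand) ** 2).sum(axis=1)
            members = d_c < d_g                # closer to center than to mean
            if not members.any():              # keep at least the seed point
                members = d_c == d_c.min()
            new_center = remaining[members].mean(axis=0)
            if np.allclose(new_center, center):
                break
            center = new_center
        centers.append(center)
        remaining = remaining[~members]
    return np.array(centers)

# Two well-separated blobs yield two anomalous-pattern centers.
X = np.vstack([np.zeros((5, 2)), 10.0 * np.ones((5, 2))])
cents = anomalous_pattern_init(X)
```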
This paper suggests a new randomized forecasting method based on entropy-robust estimation of the probability density functions (PDFs) of random parameters in dynamic models described by systems of linear ordinary differential equations. The structure of the maximum-entropy PDFs of the parameters and measurement noises is studied. We generate the sequence of random vectors with the entropy-optimal PDFs using a modification of the Ulam–von Neumann method. The developed method of randomized forecasting is applied to the world population forecasting problem.
In this paper, we study scalar multivariate non-stationary subdivision schemes with an integer dilation matrix M and present a unifying, general approach for checking their convergence and for determining their Hölder regularity (the latter in the case $M=mI$, $m\geq 2$). The combination of the concepts of asymptotic similarity and approximate sum rules allows us to link the stationary and non-stationary settings and to employ recent advances in methods for exact computation of the joint spectral radius. As an application, we prove a recent conjecture by Dyn et al. on the Hölder regularity of the generalized Daubechies wavelets. We illustrate our results with several examples.
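The exact computation of the joint spectral radius (JSR) mentioned above relies on specialized algorithms, but a brute-force lower bound is easy to sketch: the JSR of a set of matrices is at least rho(P) ** (1/k) for any product P of k of them. The diagonal matrices below are illustrative, chosen so the exact answer is known:

```python
import numpy as np
from itertools import product

def jsr_lower_bound(mats, max_len=6):
    """Brute-force lower bound on the joint spectral radius (JSR):
    JSR >= rho(P) ** (1/k) for every product P of k matrices."""
    best = 0.0
    for k in range(1, max_len + 1):
        for word in product(mats, repeat=k):
            P = word[0] if k == 1 else np.linalg.multi_dot(word)
            rho = max(abs(np.linalg.eigvals(P)))
            best = max(best, rho ** (1.0 / k))
    return best

# For these diagonal matrices the JSR equals 2 (always choose A).
A = np.diag([0.5, 2.0])
B = np.diag([2.0, 0.5])
bound = jsr_lower_bound([A, B])
```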
OBJECTIVE: Despite the long history of digital radiology, one of its most critical aspects, information security, still remains extremely underdeveloped and poorly standardized. To study the current state of radiology security, we explored the worldwide security of medical image archives. MATERIALS AND METHODS: Using the DICOM data-transmitting standard, we implemented a highly parallel application to scan the entire World Wide Web of networked computers and devices, locating open and unprotected radiology servers. We used only legal and radiology-compliant tools. Our security-probing application initiated a standard DICOM handshake to remote computer or device addresses, and then assessed their security posture on the basis of handshake replies. RESULTS: The scan discovered a total of 2774 unprotected radiology or DICOM servers worldwide. Of those, 719 were fully open to patient data communications. Geolocation was used to analyze and rank our findings according to country utilization. As a result, we built maps and a world ranking of clinical security, suggesting that even the most radiology-advanced countries have hospitals with serious security gaps. CONCLUSION: Despite more than two decades of active development and implementation, our radiology data still remains insecure. The results provided should be applied to raise awareness and begin an earnest dialogue toward elimination of the problem. The application we designed and the novel scanning approach we developed can be used to identify security breaches and to eliminate them before they are compromised. © American Roentgen Ray Society.
Van Lambalgen's theorem states that a pair (α, β) of bit sequences is Martin-Löf random if and only if α is Martin-Löf random and β is Martin-Löf random relative to α. In [Information and Computation 209.2 (2011): 183-197, Theorem 3.3], Hayato Takahashi generalized van Lambalgen's theorem to computable measures P on a product of two Cantor spaces; he showed that the equivalence holds for each β for which the conditional probability P(⋅|β) is computable. He asked whether this computability condition is necessary. We give a positive answer by providing a computable measure for which van Lambalgen's theorem fails. We also present a simple construction of a computable measure for which the conditional measure is not computable. Such measures were first constructed by Ackerman et al. [1].
We obtain sufficient conditions for the differentiability of solutions to stationary Fokker--Planck--Kolmogorov equations with respect to a parameter. In particular, this gives conditions for the differentiability of stationary distributions of diffusion processes with respect to a parameter.
In this paper we address the problem of finding the most probable state of a discrete Markov random field (MRF), also known as the MRF energy minimization problem. The task is known to be NP-hard in general, and its practical importance motivates numerous approximate algorithms. We propose a submodular relaxation approach (SMR) based on a Lagrangian relaxation of the initial problem. Unlike the dual decomposition approach of Komodakis et al. (2011), SMR does not decompose the graph structure of the initial problem but constructs a submodular energy that is minimized within the Lagrangian relaxation. Our approach is applicable to both pairwise and high-order MRFs and makes it possible to take into account global potentials of certain types. We study theoretical properties of the proposed approach and evaluate it experimentally.
Sophistication and logical depth are two measures that express how complicated the structure in a string is. Sophistication is defined as the minimal complexity of a computable function that defines a two-part description for the string that is shortest within some precision; logical depth can be defined as the minimal computation time of a program that is shortest within some precision. We show that the Busy Beaver function of the sophistication of a string exceeds its logical depth with logarithmically bigger precision, and that logical depth exceeds the Busy Beaver function of sophistication with logarithmically bigger precision. We also show that sophistication is unstable in its precision: constant variations can change its value by a linear term in the length of the string.
The paper is focused on the problem of multi-class classification of composite (piecewise-regular) objects (e.g., speech signals, complex images, etc.). We propose a mathematical model of composite object representation as a sequence of independent segments. Each segment is represented as a random sample of independent identically distributed feature vectors. Based on this model and a statistical approach, we reduce the task to a problem of composite hypothesis testing of segment homogeneity. Several nearest-neighbor criteria are implemented; for some of them, well-known special cases (e.g., the Kullback–Leibler minimum information discrimination principle and the probabilistic neural network) are highlighted. It is experimentally shown that the proposed approach improves accuracy compared with contemporary classifiers.
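A minimal sketch of the Kullback-Leibler minimum information discrimination rule mentioned above: a segment, represented by a feature histogram, is assigned to the class whose reference histogram is KL-closest. The class histograms are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between discrete distributions."""
    p, q = p + eps, q + eps        # smoothing to avoid log(0)
    return float(np.sum(p * np.log(p / q)))

def classify_min_kl(segment_hist, class_hists):
    """Minimum information discrimination: pick the KL-closest class."""
    return min(class_hists, key=lambda c: kl(segment_hist, class_hists[c]))

class_hists = {"speech": np.array([0.7, 0.2, 0.1]),
               "music":  np.array([0.1, 0.3, 0.6])}
label = classify_min_kl(np.array([0.6, 0.3, 0.1]), class_hists)
```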
A multiplier bootstrap procedure for the construction of likelihood-based confidence sets is considered for finite samples and possible model misspecification. Theoretical results justify the bootstrap validity for a small or moderate sample size and make it possible to control the impact of the parameter dimension $p$: the bootstrap approximation works if $p^3/n$ is small. The main result about bootstrap validity continues to apply even if the underlying parametric model is misspecified, under the so-called small modelling bias condition. In the case when the true model deviates significantly from the considered parametric family, the bootstrap procedure is still applicable but becomes somewhat conservative: the size of the constructed confidence sets is increased by the modelling bias. We illustrate the results with numerical examples for misspecified linear and logistic regressions.
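A toy illustration of the multiplier-bootstrap idea for the simplest likelihood-based estimator, the sample mean: the data are reweighted with i.i.d. mean-one multipliers and the confidence set is read off from quantiles of the recentred bootstrap statistic. The multiplier distribution and sample sizes are illustrative, not the paper's setting:

```python
import numpy as np

def multiplier_bootstrap_ci(x, n_boot=2000, alpha=0.05, seed=0):
    """Multiplier bootstrap for the sample mean: reweight the data with
    i.i.d. mean-one multipliers and take quantiles of the recentred
    bootstrap statistic."""
    rng = np.random.default_rng(seed)
    n, mean = len(x), x.mean()
    w = rng.exponential(1.0, size=(n_boot, n))   # mean-one multiplier weights
    boot = (w * x).sum(axis=1) / w.sum(axis=1)   # reweighted means
    lo, hi = np.quantile(boot - mean, [alpha / 2, 1 - alpha / 2])
    return mean - hi, mean - lo

x = np.random.default_rng(1).normal(5.0, 1.0, size=200)
ci = multiplier_bootstrap_ci(x)
```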
Program obfuscation is a semantic-preserving transformation aimed at bringing a program into a form that impedes understanding of its algorithm and data structures or prevents extracting certain valuable information from the text of the program. Since obfuscation may find wide use in computer security, information hiding and cryptography, security requirements to program obfuscators have become a major focus of interest in the theory of software obfuscation starting from the pioneering works in this field. In this paper we give a survey of various definitions of obfuscation security and basic results that establish possibility or impossibility of secure program obfuscation under certain cryptographic assumptions.
Errors in implicative theories coming from binary data are studied. First, two classes of errors that may affect implicative theories are singled out. Two approaches for finding errors of these classes are proposed, both based on methods of Formal Concept Analysis. The first approach uses the cardinality-minimal (canonical, or Duquenne–Guigues) implication base. The construction of such a base is computationally intractable. Using the alternative approach, one checks possible errors on the fly in polynomial time via computing closures of subsets of attributes. Both approaches are interactive and based on questions about the validity of certain implications. Results of computer experiments are presented and discussed.
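The closure computation behind the second, on-the-fly approach can be sketched in a few lines: close an attribute set under a list of implications by forward chaining, which runs in polynomial time in the size of the input. The implications below are illustrative:

```python
def closure(attrs, implications):
    """Close an attribute set under implications A -> B by forward
    chaining: while some implication fires, add its conclusion."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed

# a -> b and {b, c} -> d; starting from {a, c} both implications fire.
imps = [({"a"}, {"b"}), ({"b", "c"}, {"d"})]
result = closure({"a", "c"}, imps)
```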
In this paper a novel clustering algorithm is proposed as a version of the Seeded Region Growing (SRG) approach for the automatic recognition of coastal upwelling from Sea Surface Temperature (SST) images. The new algorithm, One Seed Expanding Cluster (SEC), takes advantage of the concept of approximate clustering due to Mirkin (1996, 2013) to derive a homogeneity criterion in the form of a product rather than the conventional difference between a pixel value and the mean of values over the region of interest. It involves boundary-oriented pixel labeling, so that the cluster grows by expanding its boundary iteratively. The starting point is a cluster consisting of a single seed, the pixel with the coldest temperature. The baseline version of the SEC algorithm uses Otsu's thresholding method to fine-tune the homogeneity threshold. Unfortunately, this method does not always lead to a satisfactory solution. Therefore, we introduce a self-tuning version of the algorithm in which the homogeneity threshold parameter is abolished and the similarity threshold derived from the approximation criterion also serves as a homogeneity parameter.
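A toy sketch of region growing from the coldest pixel, with a plain temperature threshold standing in for the paper's product-form homogeneity criterion; the SST grid is illustrative:

```python
import numpy as np
from collections import deque

def grow_from_coldest(sst, threshold):
    """Toy seeded-region-growing: start from the coldest pixel and
    absorb 4-neighbours whose temperature is below `threshold`."""
    seed = np.unravel_index(np.argmin(sst), sst.shape)
    region = {seed}
    frontier = deque([seed])
    while frontier:
        i, j = frontier.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (i + di, j + dj)
            if (0 <= n[0] < sst.shape[0] and 0 <= n[1] < sst.shape[1]
                    and n not in region and sst[n] < threshold):
                region.add(n)
                frontier.append(n)
    return region

sst = np.array([[14.0, 14.5, 18.0],
                [13.5, 14.2, 18.5],
                [18.2, 18.8, 19.0]])     # a cold upwelling patch, top-left
upwelling = grow_from_coldest(sst, threshold=15.0)
```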
This paper presents several definitions of "optimal patterns" in triadic data and the results of an experimental comparison of five triclustering algorithms on real-world and synthetic datasets. The evaluation is carried out over criteria such as resource efficiency, noise tolerance, and quality scores involving cardinality, density, coverage, and diversity of the patterns. An ideal triadic pattern is a totally dense maximal cuboid (formal triconcept). The relaxations of this notion under consideration are: OAC-triclusters; triclusters optimal with respect to the least-squares criterion; and graph partitions obtained by spectral clustering. We show that searching for an optimal tricluster cover is an NP-complete problem, whereas determining the number of such covers is #P-complete. Our extensive computational experiments lead us to a clear strategy for choosing a solution for a given dataset, guided by the principle of Pareto-optimality with respect to the proposed criteria.
We consider the problems of computing the maximal and the minimal non-empty suffixes of substrings of a longer text of length n. For the minimal suffix problem we show that for every τ, 1 ≤ τ ≤ log n, there exists a linear-space data structure with O(τ) query time and O(n log n / τ) preprocessing time. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of the text in O(kτ) time, where k is the number of distinct factors in the decomposition. For the maximal suffix problem, we give a linear-space structure with O(1) query time and O(n) preprocessing time; in other words, we simultaneously achieve both the optimal query time and the optimal construction time. © 2015 Elsevier B.V.
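The Lyndon decomposition mentioned above can be computed from scratch with Duval's algorithm; a naive minimal-suffix function serves as a reference for the query the paper's data structure answers in O(τ) time:

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of
    Lyndon words in linear time."""
    factors, i = [], 0
    while i < len(s):
        j, k = i + 1, i
        while j < len(s) and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def minimal_suffix(s):
    """Naive reference: the lexicographically minimal non-empty suffix."""
    return min(s[i:] for i in range(len(s)))

facts = lyndon_factorization("banana")
```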
This paper describes a modified version of an algorithm suggested earlier by the authors for optimizing the allocation of ads in sponsored search on the main page of search results returned by a web search engine in response to user queries. It is demonstrated that the modification appreciably reduces the search procedure and the algorithm's complexity. Finally, the new algorithm is experimentally tested on real data sets provided by Yandex. © 2015, Pleiades Publishing, Ltd.
Since their inception in 1962, Petri nets have been used in a wide variety of application domains. Although Petri nets are graphical and easy to understand, they have formal semantics and allow for analysis techniques ranging from model checking and structural analysis to process mining and performance analysis. Over time Petri nets emerged as a solid foundation for Business Process Management (BPM) research. The BPM discipline develops methods, techniques, and tools to support the design, enactment, management, and analysis of operational business processes. Mainstream business process modeling notations and workflow management systems are using token-based semantics borrowed from Petri nets. Moreover, state-of-the-art BPM analysis techniques are using Petri nets as an internal representation. Users of BPM methods and tools are often not aware of this. This paper aims to unveil the seminal role of Petri nets in BPM.
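The token-based semantics borrowed from Petri nets amounts to a simple rule: a transition is enabled when its input places hold enough tokens, and firing it moves tokens from inputs to outputs. A minimal sketch with an illustrative two-step process:

```python
def enabled(marking, pre):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in pre.items())

def fire(marking, pre, post):
    """Token-based semantics: consume tokens from the input places and
    produce tokens in the output places."""
    m = dict(marking)
    for p, n in pre.items():
        m[p] = m.get(p, 0) - n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m

# A two-step business process: start -> task -> done.
m0 = {"start": 1}
t1 = ({"start": 1}, {"task": 1})   # (preset, postset) of each transition
t2 = ({"task": 1}, {"done": 1})
m1 = fire(m0, *t1)
m2 = fire(m1, *t2)
```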
A new data structure for efficient similarity search in very large datasets of high-dimensional vectors is introduced. This structure, called the inverted multi-index, generalizes the inverted index by replacing the standard quantization within inverted indices with product quantization. For very similar retrieval complexity and preprocessing time, inverted multi-indices achieve a much denser subdivision of the search space than inverted indices, while retaining their memory efficiency. Our experiments with large datasets of SIFT and GIST vectors demonstrate that, because of the denser subdivision, inverted multi-indices return much shorter candidate lists with higher recall. Augmented with a suitable reranking procedure, multi-indices significantly improved the speed of approximate nearest neighbor search on a dataset of 1 billion SIFT vectors compared with the best previously published systems, while achieving better recall and incurring only a few percent of memory overhead. © 2014 IEEE.
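The indexing step can be sketched in a few lines: split each vector into halves, quantize each half with its own small codebook, and use the pair of sub-codeword ids as the cell of the inverted multi-index. The codebooks here are fixed toy ones rather than learned by k-means:

```python
import numpy as np

def multi_index_cell(x, codebook1, codebook2):
    """Inverted multi-index cell of x: split x into two halves and
    quantize each half with its own codebook; the cell is the pair of
    nearest sub-codeword ids (product quantization of the index space)."""
    h = len(x) // 2
    i = np.argmin(((codebook1 - x[:h]) ** 2).sum(axis=1))
    j = np.argmin(((codebook2 - x[h:]) ** 2).sum(axis=1))
    return int(i), int(j)

cb1 = np.array([[0.0, 0.0], [10.0, 10.0]])   # toy codebooks, K = 2 each:
cb2 = np.array([[0.0, 0.0], [10.0, 10.0]])   # 2 x 2 = 4 cells in total
cell = multi_index_cell(np.array([9.0, 9.0, 1.0, 0.0]), cb1, cb2)
```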
This study is dedicated to the introduction of a novel method that automatically extracts potential structural alerts from a data set of molecules. These triggering structures can be further used for knowledge discovery and classification purposes. Computation of the structural alerts results from an implementation of a sophisticated workflow that integrates a graph mining tool guided by growth rate and stability. The growth rate is a well-established measure of contrast between classes. Moreover, the extracted patterns correspond to formal concepts; the most robust patterns, named the stable emerging patterns (SEPs), can then be identified thanks to their stability, a new notion originating from the domain of formal concept analysis. All of these elements are explained in the paper from a computational point of view. The method was applied to a molecular data set on mutagenicity. The experimental results demonstrate its efficiency: it automatically outputs a manageable number of structural patterns that are strongly related to mutagenicity. Moreover, some of the resulting structures correspond to already known structural alerts. Finally, an in-depth chemical analysis relying on these structures demonstrates how the method can initiate promising processes of chemical knowledge discovery. © 2015 American Chemical Society.
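The growth rate that guides the graph mining tool is the standard contrast measure for emerging patterns: the ratio of a pattern's relative supports in the two classes. A minimal sketch (function name and the example counts are illustrative, not from the paper):

```python
def growth_rate(support_pos, support_neg, n_pos, n_neg):
    """Growth rate of a pattern from the negative class to the positive one:
    ratio of relative supports; infinite when the pattern never occurs in
    the negative class (a "jumping" emerging pattern)."""
    rel_pos = support_pos / n_pos
    rel_neg = support_neg / n_neg
    return float("inf") if rel_neg == 0 else rel_pos / rel_neg
```

A subgraph occurring in 40 of 100 mutagenic molecules but only 10 of 100 non-mutagenic ones thus has growth rate 4; stability then filters the formal concepts among such patterns.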
Activities such as clinical investigations (CIs) or financial processes are subject to regulations to ensure quality of results and avoid negative consequences. Regulations may be imposed by multiple governmental agencies as well as by institutional policies and protocols. Due to the complexity of both regulations and activities, there is great potential for violation due to human error, misunderstanding, or even intent. Executable formal models of regulations, protocols and activities can form the foundation for automated assistants to aid planning, monitoring and compliance checking. We propose a model based on multiset rewriting where time is discrete and is specified by timestamps attached to facts. Actions, as well as initial, goal and critical states, may be constrained by means of relative time constraints. Moreover, actions may have non-deterministic effects, i.e., applying them may lead to different outcomes. We present a formal semantics of our model based on focused proofs of linear logic with definitions. We also determine the computational complexity of various planning problems. The plan compliance problem, for example, is the problem of finding a plan that leads from an initial state to a desired goal state without reaching any undesired critical state. We consider all actions to be balanced, i.e., their pre- and post-conditions have the same number of facts. Under this assumption on actions, we show that the plan compliance problem is PSPACE-complete when all actions have only deterministic effects and is EXPTIME-complete when actions may have non-deterministic effects. Finally, we show that the restrictions on the form of actions and time constraints taken in the specification of our model are necessary for decidability of the planning problems. Copyright © Cambridge University Press 2015
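A minimal sketch of the timed ingredient of such a model — facts carrying timestamps and actions guarded by relative time constraints — under illustrative assumptions (the fact names and the single `now - ts <= max_age` constraint form are ours, not the paper's syntax):

```python
def enabled(state, now, pre_facts, max_age):
    """An action is enabled at time `now` only if each required fact is
    present in the state and its relative age satisfies the constraint.
    `state` maps fact names to the timestamps at which they were created."""
    for fact in pre_facts:
        ts = state.get(fact)
        if ts is None or now - ts > max_age:
            return False
    return True
```

For instance, with a fact `consent_signed` created at time 0 and a rule requiring consent no older than 10 days, the action is enabled at day 5 but not at day 20 — the kind of violation a compliance checker would flag.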
A tropical (or min-plus) semiring is a set $\mathbb{Z}$ (or $\mathbb{Z} \cup \{\infty\}$) endowed with two operations: $\oplus$, which is just usual minimum, and $\odot$, which is usual addition. In tropical algebra a vector $x$ is a solution to a polynomial $g_1(x) \oplus g_2(x) \oplus \ldots \oplus g_k(x)$, where the $g_i(x)$'s are tropical monomials, if the minimum in $\min_i(g_{i}(x))$ is attained at least twice. In min-plus algebra solutions of systems of equations of the form $g_1(x)\oplus \ldots \oplus g_k(x) = h_1(x)\oplus \ldots \oplus h_l(x)$ are studied. In this paper we consider computational problems related to tropical linear systems. We show that the solvability problem (both over $\mathbb{Z}$ and $\mathbb{Z} \cup \{\infty\}$) and the problem of deciding the equivalence of two linear systems (both over $\mathbb{Z}$ and $\mathbb{Z} \cup \{\infty\}$) are equivalent under polynomial-time reductions to mean payoff games and are also equivalent to analogous problems in min-plus algebra. In particular, all these problems belong to $\mathsf{NP} \cap \mathsf{coNP}$. Thus, we provide a tight connection of computational aspects of tropical linear algebra with mean payoff games and min-plus linear algebra. On the other hand, we show that computing the dimension of the solution space of a tropical linear system and of a min-plus linear system is $\mathsf{NP}$-complete.
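The "minimum attained at least twice" condition is easy to state in code. A minimal sketch over $\mathbb{Z}$ (representing each tropical monomial $g_i(x) = c_i \odot x^{a_i} = c_i + \langle a_i, x\rangle$ as a `(constant, exponent-vector)` pair is our illustrative encoding):

```python
def is_tropical_root(monomials, x):
    """x solves g_1(x) + g_2(x) + ... + g_k(x) (tropical sum = min) iff the
    minimum over the monomial values c_i + <a_i, x> is attained at least
    twice. `monomials` is a list of (constant, exponent-vector) pairs."""
    values = [c + sum(a_j * x_j for a_j, x_j in zip(a, x))
              for (c, a) in monomials]
    m = min(values)
    return values.count(m) >= 2
```

For the tropical polynomial $x_1 \oplus x_2 \oplus 1$, the point $(1, 1)$ is a root (all three monomials evaluate to 1), while $(0, 5)$ is not (the minimum 0 is attained only once).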
In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question “Is τ a suffix tree?”, i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S, in particular we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. We prove that τ is a suffix tree if and only if it is realized by a string S of length n − 1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014]. © Springer International Publishing Switzerland 2015.
We apply a suboptimal sequential nonparametric hypothesis testing approach to improve the effectiveness of a statistical decision by reducing the sample space. Numerical examples of sample space reduction are given in which an appropriate reduction makes it possible to construct a robust sequential nonparametric hypothesis test with a smaller mean duration time than the one on the total sample space. © 2014 IEEE.
Formal concept analysis (FCA) is a well-founded method for data analysis and has many applications in data mining. Pattern structures are an extension of FCA for dealing with complex data such as sequences or graphs. However, the computational complexity of computing with pattern structures is high, and projections of pattern structures were introduced to simplify computation. In this paper we introduce o-projections of pattern structures, a generalization of projections that defines a wider class of projections preserving the properties of the original approach. Moreover, we show that o-projections form a semilattice, and we discuss the correspondence between o-projections and the representation contexts of o-projected pattern structures. © 2015 Springer International Publishing Switzerland.
An extension is considered of the problem of reconstructing words from a given multiset of their subwords supposedly generated by unit shifts of a window of fixed length along such words. In the extension, feasible solutions must satisfy additional constraints. The case is considered when these constraints are specified by forbidden words. A solution to the problem is obtained by searching for Euler paths or Euler cycles in a de Bruijn multidigraph with the additional operation of reduction of its edges and subsequent application of special algebraic operations of multiplication of adjacency matrices, as in the first part of this article. © 2015, Springer Science+Business Media New York.
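The unconstrained Euler-path step can be sketched as follows (Hierholzer's algorithm on the de Bruijn multidigraph; this sketch omits the edge-reduction and adjacency-matrix machinery that handles the forbidden-word constraints, and assumes an Euler path exists):

```python
from collections import defaultdict

def reconstruct(windows):
    """Reconstruct a word from the multiset of its length-k windows via an
    Euler path in the de Bruijn multidigraph: nodes are (k-1)-grams, and
    each window w contributes an edge w[:-1] -> w[1:]."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    for w in windows:
        graph[w[:-1]].append(w[1:])
        indeg[w[1:]] += 1
    # An Euler path starts at a node whose out-degree exceeds its in-degree
    # (if none exists, the path is a cycle and any node works).
    start = next((v for v in graph if len(graph[v]) > indeg[v]),
                 windows[0][:-1])
    stack, path = [start], []
    while stack:                      # Hierholzer's algorithm
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])
```

For example, the length-3 windows of "banana" are {ban, ana, nan, ana}, and the Euler path ba → an → na → an → na spells the word back out.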
The paper defines an annotated suffix tree (AST), a data structure used to calculate and store the frequencies of all fragments of a given string or collection of strings. The AST is associated with a string-to-text scoring that takes all fuzzy matches into account. We show how the AST and the AST scoring can be used for Natural Language Processing tasks. Copyright © by the paper's authors. Copying only for private and academic purposes.
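A naive quadratic sketch of the information an AST stores, together with one toy prefix-based scoring (the AST computes these frequencies compactly on a tree, and its actual scoring matches all suffixes of the query against the tree; both the function names and the scoring formula here are illustrative):

```python
from collections import Counter

def fragment_frequencies(strings):
    """Frequencies of all fragments (substrings) of a string collection --
    the counts an annotated suffix tree stores at its nodes."""
    freq = Counter()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                freq[s[i:j]] += 1
    return freq

def toy_score(freq, query, corpus_len):
    """Hypothetical fuzzy string-to-text score: average relative frequency
    of the query's prefixes, so partial matches still contribute."""
    prefixes = [query[:k] for k in range(1, len(query) + 1)]
    return sum(freq[p] for p in prefixes) / (len(prefixes) * corpus_len)
```

Because every fragment is counted, a query scores above zero even when it only partially matches the text — the fuzzy-matching behavior the abstract refers to.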
The cell membrane is "stuffed" with proteins, whose transmembrane (TM) helical domains spontaneously associate to form functionally active complexes. For a number of membrane receptors, a modulation of TM domains' oligomerization has been shown to contribute to the development of severe pathological states, thus calling for detailed studies of the atomistic aspects of the process. Despite considerable progress achieved so far, several crucial questions still remain: How do the helices recognize each other in the membrane? What is the driving force of their association? Here, we assess the dimerization free energy of TM helices along with a careful consideration of the interplay between the structure and dynamics of protein and lipids, using atomistic molecular dynamics simulations in the hydrated lipid bilayer for three different model systems: TM fragments of glycophorin A, polyalanine and polyleucine peptides. We observe that the membrane-driven association of TM helices exhibits a prominent entropic character, which depends on the peptide sequence. Thus, a single TM peptide of a given composition induces strong and characteristic perturbations in the hydrophobic core of the bilayer, which may facilitate the initial "communication" between TM helices even at distances of 20-30 Å. Upon tight helix-helix association, the immobilized lipids accommodate near the peripheral surfaces of the dimer, thus disturbing the packing of the surrounding lipids. The dimerization free energy of the modeled peptides correlates with the strength of their interactions with lipids inside the membrane, being lowest for glycophorin A and similarly higher for both homopolymers. We propose that the ability to accommodate lipid tails determines the dimerization strength of TM peptides and that the lipid matrix directly governs their association. © 2015 American Chemical Society.