The concept of anomalous clustering applies to finding individual clusters on a digital geographic map supplied with a single feature such as brightness or temperature. An algorithm derived within the individual anomalous cluster framework extends the so-called region growing algorithms. Yet our approach differs in that the algorithm parameter values are not expert-driven but rather derived from the anomalous clustering model. This novel framework is successfully applied to the problem of automatically delineating coastal upwelling, a natural phenomenon that occurs seasonally in coastal waters, from Sea Surface Temperature (SST) maps.
The ideal type model by Mirkin and Satarov (1990) expresses data points as convex combinations of some “ideal type” points. However, this model cannot prevent the ideal type points from lying far away from the observations and, in fact, requires them to do so. Archetypal analysis by Cutler and Breiman (1994) and proportional membership fuzzy clustering by Nascimento et al. (2003) propose different ways of avoiding this pitfall. We propose one more way out: assuming that the ideal types are mutually orthogonal and transforming the model by multiplying it by its transpose. The resulting additive fuzzy clustering model for relational data is akin to that more recently analysed by Mirkin and Nascimento (2012) in a different context. The one-by-one clustering approach to the ideal type model is reformulated here so that it naturally leads to a spectral clustering algorithm for finding fuzzy membership vectors. The algorithm is proven to be computationally valid and competitive against popular relational fuzzy clustering algorithms.
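The transformation described above can be written out as a short sketch. The notation is an assumption made here for illustration and is not taken from the paper: Y is the N x V data matrix, Z the N x K matrix of memberships with columns z_k, A the K x V matrix whose rows a_k are the ideal types, and E a residual matrix.

```latex
% Assumed notation (not from the source): Y, Z, A, E, a_k, z_k, \mu_k.
\begin{align*}
  Y &= ZA + E, \qquad z_{ik} \ge 0, \quad \sum_{k=1}^{K} z_{ik} = 1
      && \text{(convex combinations of ideal types)} \\
  AA^{\top} &= \operatorname{diag}\bigl(\mu_1^2, \dots, \mu_K^2\bigr),
      \qquad \mu_k^2 = a_k a_k^{\top}
      && \text{(mutually orthogonal ideal types)} \\
  YY^{\top} &\approx Z A A^{\top} Z^{\top}
      = \sum_{k=1}^{K} \mu_k^2 \, z_k z_k^{\top}
      && \text{(model multiplied by its transpose)}
\end{align*}
```

Under these assumptions the ideal types no longer appear explicitly: the relational matrix YY^T is approximated by a sum of weighted rank-one terms, one per fuzzy membership vector, which is why the clusters can be extracted one by one.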
The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions for selecting an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered striking correlations between the Minkowski profile, defined as a mapping of the Minkowski exponent values into the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from the comparison of the obtained partitions to the ground truth. Our findings were confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
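The idea that low-dispersion features receive larger cluster-specific weights can be illustrated with a minimal sketch. It assumes the weight formula commonly given for MWK-means, w_v = 1 / sum_u (D_v / D_u)^(1/(p-1)), where D_v is the within-cluster Minkowski dispersion of feature v; the function names and the small `eps` guard are hypothetical, and the cluster centre is simply passed in rather than computed as a true Minkowski centre.

```python
def feature_dispersions(cluster, center, p):
    """Per-feature dispersion D_v = sum_i |y_iv - c_v|**p within one cluster."""
    return [
        sum(abs(row[v] - center[v]) ** p for row in cluster)
        for v in range(len(center))
    ]


def mwk_feature_weights(dispersions, p, eps=1e-9):
    """Cluster-specific weights w_v = 1 / sum_u (D_v / D_u)**(1 / (p - 1)).

    Assumed formula; eps is a hypothetical guard against zero dispersion.
    """
    if p <= 1:
        raise ValueError("the weight formula needs a Minkowski exponent p > 1")
    d = [x + eps for x in dispersions]
    exponent = 1.0 / (p - 1.0)
    return [1.0 / sum((dv / du) ** exponent for du in d) for dv in d]


# Toy cluster: feature 0 is tight, feature 1 is widely spread.
cluster = [[1.0, 10.0], [1.1, 20.0], [0.9, 30.0]]
center = [1.0, 20.0]  # illustrative centre (the feature means)
weights = mwk_feature_weights(feature_dispersions(cluster, center, p=2), p=2)
```

By construction the weights of a cluster sum to one, and the tight feature dominates in this toy example. Changing `p` changes both the dispersions and the weights, which is why the resulting partition depends on the exponent.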
Approximate cluster structures are those of formal concepts and n-concepts with added numerical intensity weights. The talk presents theoretical results and computational methods for approximate clustering and n-clustering as extensions of the algebraic-geometrical properties of numerical matrices (SVD and the like) to situations where one or most of the elements of the solutions to be found are expressed by binary vectors. The theory embraces such methods as k-means, consensus clustering, network clustering, biclusters and triclusters, and provides natural data analysis criteria, effective algorithms and interpretation tools.
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons.
Our second contribution is an extension of the Ward and Wardp algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Wardpβ, is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index.
We perform numerous experiments using data sets with two types of noise: insertion of noise features and blurring of within-cluster values of some features. These experiments allow us to conclude that: (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; (ii) A-Wardpβ provides better cluster recovery than both Ward and Wardp.
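The anomalous pattern initialisation mentioned above can be sketched as follows. This is a minimal illustration of the general anomalous pattern procedure (the reference point is fixed at the grand mean and clusters are extracted one at a time), not the authors' implementation; all function names and the `max_iter` cap are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def anomalous_pattern(points, reference, max_iter=100):
    """Indices of one anomalous cluster: points nearer to its centroid than to reference."""
    seed = max(range(len(points)), key=lambda i: sq_dist(points[i], reference))
    centroid = list(points[seed])  # start from the point farthest from the reference
    members = [seed]
    for _ in range(max_iter):
        new_members = [
            i for i, p in enumerate(points)
            if sq_dist(p, centroid) < sq_dist(p, reference)
        ] or [seed]  # guard: never return an empty cluster
        if new_members == members:
            break
        members = new_members
        centroid = mean([points[i] for i in members])
    return members


def anomalous_initial_partition(points):
    """Extract anomalous patterns one by one until every point is assigned."""
    reference = mean(points)  # grand mean, kept fixed throughout
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        local = anomalous_pattern([points[i] for i in remaining], reference)
        clusters.append([remaining[j] for j in local])
        taken = set(local)
        remaining = [idx for j, idx in enumerate(remaining) if j not in taken]
    return clusters


points = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [5, -5]]
clusters = anomalous_initial_partition(points)
```

An agglomerative procedure such as Ward's would then begin merging from `clusters` instead of from singletons, so far fewer merge steps are needed.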
In this paper a novel clustering algorithm is proposed as a version of the Seeded Region Growing (SRG) approach for the automatic recognition of coastal upwelling from Sea Surface Temperature (SST) images. The new algorithm, One Seed Expanding Cluster (SEC), takes advantage of the concept of approximate clustering due to Mirkin (1996, 2013) to derive a homogeneity criterion in the format of a product rather than the conventional difference between a pixel value and the mean of values over the region of interest. It involves a boundary-oriented pixel labeling so that the cluster is grown by iteratively expanding its boundary. The starting point is a cluster consisting of just one seed, the pixel with the coldest temperature. The baseline version of the SEC algorithm uses Otsu’s thresholding method to fine-tune the homogeneity threshold. Unfortunately, this method does not always lead to a satisfactory solution. Therefore, we introduce a self-tuning version of the algorithm in which the homogeneity threshold parameter is abolished and the similarity threshold derived from the approximation criterion also serves as a homogeneity parameter.
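The boundary-expanding growth from a single cold seed can be sketched as follows. This is a simplified illustration, not the SEC algorithm itself: the product-form test used here, c * (t - c/2) > 0 on temperatures centred at the grand mean (with c the current cluster mean), is an assumed stand-in for the paper's homogeneity criterion, and the function name and 4-neighbourhood are hypothetical choices.

```python
from collections import deque


def grow_cold_cluster(sst):
    """Grow one cluster from the coldest pixel of a 2D temperature grid.

    Returns the set of (row, col) positions in the cluster.
    """
    rows, cols = len(sst), len(sst[0])
    grand_mean = sum(map(sum, sst)) / (rows * cols)
    centred = [[t - grand_mean for t in row] for row in sst]

    # Seed: the pixel with the coldest temperature.
    seed = min(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: sst[rc[0]][rc[1]],
    )
    cluster = {seed}
    total = centred[seed[0]][seed[1]]  # running sum of centred values
    boundary = deque([seed])

    while boundary:
        r, c = boundary.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in cluster:
                continue
            cluster_mean = total / len(cluster)
            # Assumed product-form test: accept a pixel that is closer to the
            # cluster mean than to the grand mean (zero after centring).
            if cluster_mean * (centred[nr][nc] - cluster_mean / 2) > 0:
                cluster.add((nr, nc))
                total += centred[nr][nc]
                boundary.append((nr, nc))
    return cluster


sst = [
    [12, 12, 18, 18],
    [12, 13, 18, 18],
    [18, 18, 18, 18],
]
cold_region = grow_cold_cluster(sst)
```

No user-supplied threshold appears in this sketch: the acceptance test depends only on the cluster mean and the grand mean, which mirrors the self-tuning idea in spirit. Pixels rejected once are not revisited here, a simplification a fuller implementation might drop.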
This paper presents several definitions of “optimal patterns” in triadic data and the results of an experimental comparison of five triclustering algorithms on real-world and synthetic datasets. The evaluation is carried out over criteria such as resource efficiency, noise tolerance and quality scores involving cardinality, density, coverage, and diversity of the patterns. An ideal triadic pattern is a totally dense maximal cuboid (formal triconcept). The relaxations of this notion under consideration are: OAC-triclusters; triclusters optimal with respect to the least-squares criterion; and graph partitions obtained by using spectral clustering. We show that searching for an optimal tricluster cover is an NP-complete problem, whereas determining the number of such covers is #P-complete. Our extensive computational experiments lead us to a clear strategy for choosing a solution for a given dataset, guided by the principle of Pareto-optimality according to the proposed criteria.
Abstract. A suffix-tree-based method for measuring the similarity of a key phrase to an unstructured text is proposed. The measure involves less computation and does not depend on the length of the text or the key phrase. This applies to the following tasks in semantic text analysis:
Finding interrelations between key phrases over a set of texts;
Annotating a research article by topics from a taxonomy of the domain;
Clustering relevant topics and mapping clusters on a domain taxonomy.
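A suffix-tree-style similarity of the kind described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it uses an uncompressed suffix trie with occurrence counts instead of a compact suffix tree, and it scores a key phrase by averaging conditional character frequencies along the matched paths. All function names are hypothetical.

```python
def build_suffix_trie(fragments):
    """Trie of all suffixes of the text fragments, with occurrence counts per node."""
    root = {"count": 0, "children": {}}
    for fragment in fragments:
        for start in range(len(fragment)):
            node = root
            for ch in fragment[start:]:
                child = node["children"].setdefault(ch, {"count": 0, "children": {}})
                child["count"] += 1
                node = child
    # The root count is the total of its children's counts.
    root["count"] = sum(c["count"] for c in root["children"].values())
    return root


def suffix_score(root, s):
    """Average conditional frequency count(child) / count(parent) along the match of s."""
    node, total, depth = root, 0.0, 0
    for ch in s:
        child = node["children"].get(ch)
        if child is None:
            break
        total += child["count"] / node["count"]
        node, depth = child, depth + 1
    return total / depth if depth else 0.0


def phrase_similarity(root, phrase):
    """Mean suffix score over all suffixes of the key phrase; a value in [0, 1]."""
    if not phrase:
        return 0.0
    return sum(suffix_score(root, phrase[i:]) for i in range(len(phrase))) / len(phrase)


trie = build_suffix_trie(["fuzzy clustering", "spectral clustering"])
related = phrase_similarity(trie, "clustering")
unrelated = phrase_similarity(trie, "regression")
```

Because each suffix score is divided by its match depth and the total is divided by the number of suffixes, the result stays within [0, 1] whatever the length of the phrase or the text. In practice the text would usually be split into short fragments before building the trie, to keep its size manageable.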