eLearning Archive: A Primer on Clustering: II. Tendency Assessment and Cluster Validity
This course is part of our eLearning Archive, which includes older courses that may not be current or as user-friendly as courses designed more recently.
This course, the second in a series of three, discusses several approaches to the first and third problems of clustering identified in module I: pre-clustering tendency assessment and post-clustering cluster validation. The target audience comprises advanced undergraduate and graduate students majoring in engineering and science, and practicing engineers and scientists interested in either research about or applications of clustering to real-world problems such as data mining, image analysis and bioinformatics. Some of the subject matter in this course is available in textbooks (most notably some of the material about cluster validity functionals), and some of it is the object of (my) current research. The references contain pointers to some excellent papers on these topics, and on a number of related or competing methods that have been proposed and studied by others.
I begin with a simple numerical example that establishes the necessity for both assessment and validity. Then I discuss the visual assessment of tendency (VAT) family of algorithms: VAT, sVAT and coVAT. These algorithms produce images that enable a user to make useful guesses about the number of clusters to seek in relational data before proceeding with a partitioning method for finding the clusters. Since object data can always be converted to relational form by computing pairwise distances, these methods are well defined for all types of unlabeled numerical data. The coVAT algorithm provides a means for estimating the number of clusters in each of the four problems associated with rectangular relational data: row clusters, column clusters, joint (pure) clusters, and mixed co-clusters.
The second half of this course presents some examples of cluster validation using scalar measures, or indices, of cluster validity. Several examples from each of the three major categories of indices (crisp, fuzzy and probabilistic) are presented.
This course concludes with a numerical example that compares 23 indices of all three types on clusters in 12 sets of data drawn from mixtures of Gaussian distributions having either 3 or 6 components. Some indices of each type do pretty well in this example, while others do very badly. I don't think this problem has a general "solution", but since we use clustering in many, many applications, we keep trying to find good indices to validate algorithmic outputs.
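As a rough illustration of the tendency-assessment idea described above, the following Python sketch converts object data to relational form by computing pairwise Euclidean distances and then reorders the dissimilarity matrix in the VAT style. It is not taken from the course materials: the function names (`pairwise_distances`, `vat_order`, `vat_image`) are hypothetical, and the Prim-like ordering is an assumption based on the commonly published description of VAT.

```python
import numpy as np

def pairwise_distances(X):
    """Convert object data (n x p) into an n x n Euclidean dissimilarity matrix."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def vat_order(R):
    """Return a VAT-style ordering of the n objects described by R.

    Assumed procedure: start at one endpoint of the largest dissimilarity,
    then repeatedly append the unordered object that is closest to any
    object already ordered.
    """
    n = R.shape[0]
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    order = [int(i)]
    remaining = set(range(n)) - {int(i)}
    # nearest[q] = smallest dissimilarity from q to any ordered object
    nearest = R[i].copy()
    while remaining:
        candidates = np.array(sorted(remaining))
        j = int(candidates[np.argmin(nearest[candidates])])
        order.append(j)
        remaining.remove(j)
        nearest = np.minimum(nearest, R[j])
    return order

def vat_image(R):
    """Return the reordered dissimilarity matrix and the ordering used."""
    P = vat_order(R)
    return R[np.ix_(P, P)], P

# Small made-up example: two well-separated groups, stored interleaved.
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.5, 0.0],
              [10.0, 10.5], [0.0, 0.5], [10.5, 10.0]])
R_star, P = vat_image(pairwise_distances(X))
```

Displaying `R_star` as a grayscale image (for example with matplotlib's `imshow`) would be expected to show dark blocks along the diagonal, one per apparent cluster, which is the visual cue used to guess how many clusters to seek.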
What you will learn:
- Review scalar measures of Validity
- Examine Visual Assessment of Tendency (VAT)
- Discuss VAT for small, square data sets
Who should attend: Electrical engineer, Systems engineer, Hardware engineer, Design engineer, Product engineer, Communication engineer
Instructor
James Bezdek
Jim received the BS in Civil Engineering from the U. of Nevada and the PhD in Applied Mathematics from Cornell University. Jim is past president of NAFIPS, IFSA and the IEEE NNC (aka CIS), is the founding editor of the International Journal of Approximate Reasoning and the IEEE Transactions on Fuzzy Systems, is a fellow of the IEEE and IFSA, and is a recipient of the IEEE 3rd Millennium, IEEE Fuzzy Systems Pioneer and IEEE Frank Rosenblatt medals.
Publication Year: 2008
ISBN: 1-4244-1441-5