Clustering Tendency Applied to Chemical Feature Selection

Abstract

Methods for feature selection in cluster analysis are not yet well established, although research has clearly demonstrated that extraneous descriptors can mask natural clusters in data. The goal of this work has been to use each variable’s contribution to clustering tendency to distinguish the variables that contribute to clustering from those that do not. It is also important to choose the smallest subsets of variables that will support clustering.

A modified version of Hopkins’ statistic is used to evaluate the degree to which each variable in a pool of measured or calculated variables contributes to the clustering tendency of a data set. The value of clustering tendency in choosing reasonable sets of variables will be demonstrated in examples using real and artificial data sets. Since clustering is exploratory in nature, there may be more than one set of useful variables.
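
The abstract does not describe the modification made to Hopkins’ statistic, so the following is only a minimal sketch under stated assumptions, not the authors’ procedure. It uses the standard Hopkins statistic (values near 0.5 suggest spatial randomness, values near 1 suggest clustering) and estimates each variable’s contribution as the drop in the statistic when that variable is left out. The function names `hopkins` and `variable_contributions`, the leave-one-out scheme, the sampling fraction, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def hopkins(X, m=None, rng=None):
    """Standard Hopkins statistic (not the modified version in the paper).

    X is an (n, d) array whose columns are assumed to be on comparable scales.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if m is None:
        m = max(1, n // 10)  # assumed sampling fraction of about 10%

    # m real points sampled without replacement
    idx = rng.choice(n, size=m, replace=False)
    # m artificial points drawn uniformly inside the bounding box of the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))

    # nearest-neighbor distance from each artificial point to the data
    du = np.sqrt(((U[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    # nearest-neighbor distance from each sampled real point to the other data
    D = np.sqrt(((X[idx][:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    D[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    dw = D.min(axis=1)

    u, w = (du ** d).sum(), (dw ** d).sum()
    return u / (u + w)

def variable_contributions(X, repeats=20, seed=0):
    """Hypothetical leave-one-variable-out contribution to clustering tendency.

    Returns, for each column j, mean H(all variables) - mean H(without j).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    def mean_h(data):
        return float(np.mean([hopkins(data, rng=rng) for _ in range(repeats)]))

    h_all = mean_h(X)
    return np.array([h_all - mean_h(np.delete(X, j, axis=1))
                     for j in range(X.shape[1])])

# Artificial data: column 0 is bimodal (clustered), columns 1 and 2 are noise.
demo_rng = np.random.default_rng(1)
x0 = np.concatenate([demo_rng.normal(0.0, 0.03, 100),
                     demo_rng.normal(1.0, 0.03, 100)])
X_demo = np.column_stack([x0,
                          demo_rng.uniform(0, 1, 200),
                          demo_rng.uniform(0, 1, 200)])
print(variable_contributions(X_demo))
```

In this sketch a large positive value marks a variable whose removal destroys the clustering tendency, while a value near zero marks a variable that adds little. Because the statistic depends on random sampling, the values are averaged over several repeats.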

Keywords

Clustering, Hopkins’ statistic, Acrylates

