Progeny Clustering: A Method to Identify Biological Phenotypes
Date
2015-08-12
Publisher
Nature Publishing Group
Abstract
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any
type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here,
we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally
computationally efficient, for finding the ideal number of clusters. The algorithm employs a novel Progeny
Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the
clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and
data space. Our method was shown to be successful and robust when applied to two synthetic datasets
(a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two
standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a
cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA)
dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional
synthetic dataset as well as in the cell phenotype dataset, and it was the only method
that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset.
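
The stability idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "progeny" are synthetic points built by drawing each feature independently from the members of one initial cluster, that stability is scored as the ratio of within-origin to between-origin co-occurrence probability, and that plain k-means is the underlying clustering step. The reference-dataset correction mentioned in the abstract is omitted, and all function names and parameters (`kmeans`, `sample_progeny`, `stability_score`, `n_progeny`, `n_repeats`) are hypothetical. The score is only meaningful for k of at least 2.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (assumed clustering step); returns integer labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def sample_progeny(X, labels, k, n_progeny, rng):
    """Assumed sampling scheme: each progeny takes every feature value
    from a randomly chosen member of its cluster of origin."""
    n_features = X.shape[1]
    progeny, origin = [], []
    for j in range(k):
        members = X[labels == j]
        if len(members) == 0:
            continue  # skip clusters that ended up empty
        idx = rng.integers(0, len(members), size=(n_progeny, n_features))
        progeny.append(members[idx, np.arange(n_features)])
        origin.append(np.full(n_progeny, j))
    return np.vstack(progeny), np.concatenate(origin)


def stability_score(X, k, n_progeny=10, n_repeats=20, seed=0):
    """Simplified score: mean co-occurrence of same-origin progeny divided
    by mean co-occurrence of different-origin progeny (higher = more stable)."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X, k, rng)
    co = None
    for _ in range(n_repeats):
        P, origin = sample_progeny(X, labels, k, n_progeny, rng)
        progeny_labels = kmeans(P, k, rng)
        same = (progeny_labels[:, None] == progeny_labels[None, :]).astype(float)
        co = same if co is None else co + same
    co /= n_repeats  # co-occurrence probability matrix over repeats
    same_origin = origin[:, None] == origin[None, :]
    off_diag = ~np.eye(len(origin), dtype=bool)
    true_p = co[same_origin & off_diag].mean()
    false_p = co[~same_origin].mean()
    return true_p / (false_p + 1e-9)  # small constant avoids division by zero


# Toy data: two well-separated two-dimensional blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0, 0.5, (50, 2)),
               data_rng.normal(10, 0.5, (50, 2))])
for k in (2, 3, 4):
    print(k, stability_score(X, k))
```

On toy data like this, the candidate k with the highest score would be taken as the estimated number of clusters. Because only the small progeny sets are re-clustered in each repeat, rather than the full dataset, a scheme of this kind keeps the repeated clustering cheap.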
Description
Publisher's PDF.
Citation
Hu, C. W. et al. Progeny Clustering: A Method to Identify Biological Phenotypes. Sci. Rep. 5, 12894; doi: 10.1038/srep12894 (2015).