Clustering evolving proteins into homologous families

Chan, Cheong Xin, Mahbob, Maisarah and Ragan, Mark A. (2013) Clustering evolving proteins into homologous families. BMC Bioinformatics, 14 120.1-120.11. doi:10.1186/1471-2105-14-120


Author Chan, Cheong Xin
Mahbob, Maisarah
Ragan, Mark A.
Title Clustering evolving proteins into homologous families
Journal name BMC Bioinformatics
ISSN 1471-2105
Publication date 2013-04
Sub-type Article (original research)
DOI 10.1186/1471-2105-14-120
Open Access Status DOI
Volume 14
Start page 120.1
End page 120.11
Total pages 11
Place of publication London, United Kingdom
Publisher BioMed Central
Collection year 2014
Language eng
Formatted abstract
Background: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches.

Results: Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment.

Conclusions: Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ

Document type: Journal Article
Collections: Official 2014 Collection
School of Chemistry and Molecular Biosciences
Institute for Molecular Bioscience - Publications
 
Citation counts: Cited 3 times in Thomson Reuters Web of Science
Cited 3 times in Scopus
Created: Sun, 23 Jun 2013, 00:04:25 EST by System User on behalf of Institute for Molecular Bioscience