Searching for convergence in phylogenetic Markov chain Monte Carlo

Beiko, R. G., Keith, J. M., Harlow, T. J. and Ragan, M. A. (2006) Searching for convergence in phylogenetic Markov chain Monte Carlo. Systematic Biology, 55(4): 553-565. doi:10.1080/10635150600812544


Author Beiko, R. G.
Keith, J. M.
Harlow, T. J.
Ragan, M. A.
Title Searching for convergence in phylogenetic Markov chain Monte Carlo
Journal name Systematic Biology
ISSN 1063-5157
Publication date 2006-08
Sub-type Article (original research)
DOI 10.1080/10635150600812544
Volume 55
Issue 4
Start page 553
End page 565
Total pages 13
Editor C. Simon
Place of publication Philadelphia
Publisher Taylor & Francis Inc
Collection year 2006
Language eng
Subject C1
230203 Statistical Theory
780105 Biological sciences
Abstract Markov chain Monte Carlo (MCMC) is a methodology that is gaining widespread use in the phylogenetics community and is central to phylogenetic software packages such as MrBayes. An important issue for users of MCMC methods is how to select appropriate values for adjustable parameters such as the length of the Markov chain or chains, the sampling density, the proposal mechanism, and, if Metropolis-coupled MCMC is being used, the number of heated chains and their temperatures. Although some parameter settings have been examined in detail in the literature, others are frequently chosen with more regard to computational time or personal experience with other data sets. Such choices may lead to inadequate sampling of tree space or an inefficient use of computational resources. We performed a detailed study of convergence and mixing for 70 randomly selected, putatively orthologous protein sets with different sizes and taxonomic compositions. Replicated runs from multiple random starting points permit a more rigorous assessment of convergence, and we developed two novel statistics, delta and epsilon, for this purpose. Although likelihood values invariably stabilized quickly, adequate sampling of the posterior distribution of tree topologies took considerably longer. Our results suggest that multimodality is common for data sets with 30 or more taxa and that this results in slow convergence and mixing. However, we also found that the pragmatic approach of combining data from several short, replicated runs into a metachain to estimate bipartition posterior probabilities provided good approximations, and that such estimates were no worse in approximating a reference posterior distribution than those obtained using a single long run of the same length as the metachain. Precision appears to be best when heated Markov chains have low temperatures, whereas chains with high temperatures appear to sample trees with high posterior probabilities only rarely. 
[Bayesian phylogenetic inference; heating parameter; Markov chain Monte Carlo; replicated chains.]
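The metachain idea described in the abstract, pooling post-burn-in samples from several short replicated runs to estimate bipartition posterior probabilities, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `burn_in` parameter, and the representation of each sampled tree as a set of bipartitions are all assumptions made for the example. The `max_frequency_gap` helper is a generic between-run comparison and is not the delta or epsilon statistic from the paper, which the abstract does not define.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence

# Assumed representation: a bipartition is the set of taxon names on one
# (canonically chosen) side of a split; a sampled tree is a set of bipartitions.
Bipartition = FrozenSet[str]
Tree = FrozenSet[Bipartition]


def bipartition_frequencies(samples: Sequence[Tree]) -> Dict[Bipartition, float]:
    """Fraction of sampled trees that contain each bipartition."""
    if not samples:
        raise ValueError("no sampled trees to summarise")
    counts: Counter = Counter()
    for tree in samples:
        counts.update(tree)  # each bipartition counted once per tree
    return {bip: n / len(samples) for bip, n in counts.items()}


def metachain_frequencies(runs: Sequence[Sequence[Tree]],
                          burn_in: int = 0) -> Dict[Bipartition, float]:
    """Pool post-burn-in samples from all runs into one metachain."""
    pooled: List[Tree] = [tree for run in runs for tree in run[burn_in:]]
    return bipartition_frequencies(pooled)


def max_frequency_gap(runs: Sequence[Sequence[Tree]], burn_in: int = 0) -> float:
    """Largest between-run spread in any bipartition's sampled frequency.

    A hypothetical diagnostic for this sketch only: large values suggest
    the replicated runs have not yet sampled the same distribution.
    """
    per_run = [bipartition_frequencies(run[burn_in:]) for run in runs]
    all_bips = set().union(*per_run)
    return max(
        (max(f.get(b, 0.0) for f in per_run) - min(f.get(b, 0.0) for f in per_run)
         for b in all_bips),
        default=0.0,
    )
```

Pooling weights every retained sample equally, so runs of unequal length contribute in proportion to the number of samples they keep after burn-in.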
Keyword Bayesian Phylogenetic Inference
Heating Parameter
Markov Chain Monte Carlo
Replicated Chains
Evolutionary Biology
Bayesian-inference
Proposal Distributions
Protein Families
Dna-sequences
Gene-transfer
Tree Space
Likelihood
Model
Mcmc
Algorithms
Q-Index Code C1

 
Citation counts: Thomson Reuters Web of Science: cited 38 times
Scopus: cited 40 times
Created: Wed, 15 Aug 2007, 09:12:40 EST