A Word-Oriented Approach to Alignment Validation

Beiko, Robert G., Chan, Cheong Xin and Ragan, Mark A. (2005) A Word-Oriented Approach to Alignment Validation. Bioinformatics, 21(10): 2230-2239. doi:10.1093/bioinformatics/bti335


Author Beiko, Robert G.
Chan, Cheong Xin
Ragan, Mark A.
Title A Word-Oriented Approach to Alignment Validation
Journal name Bioinformatics
ISSN 1367-4803
Publication date 2005-02-01
Sub-type Article (original research)
DOI 10.1093/bioinformatics/bti335
Open Access Status File (Author Post-print)
Volume 21
Issue 10
Start page 2230
End page 2239
Total pages 10
Place of publication Oxford
Publisher Oxford University Press
Collection year 2005
Language eng
Subject 239901 Biological Mathematics
279999 Biological Sciences not elsewhere classified
270208 Molecular Evolution
780105 Biological sciences
270199 Biochemistry and Cell Biology not elsewhere classified
C1
Abstract Motivation: Multiple sequence alignment at the level of whole proteomes requires a high degree of automation, precluding the use of traditional validation methods such as manual curation. Since evolutionary models are too general to describe the history of each residue in a protein family, there is no single algorithm/model combination that can yield a biologically or evolutionarily optimal alignment. We propose a 'shotgun' strategy where many different algorithms are used to align the same family, and the best of these alignments is then chosen with a reliable objective function. We present WOOF, a novel 'word-oriented' objective function that relies on the identification and scoring of conserved amino acid patterns (words) between pairs of sequences. Results: Tests on a subset of reference protein alignments from BAliBASE showed that WOOF tended to rank the (manually curated) reference alignment highest among 1060 alternative (automatically generated) alignments for a majority of protein families. Among the automated alignments, there was a strong positive relationship between the WOOF score and similarity to the reference alignment. The speed of WOOF and its independence from explicit considerations of three-dimensional structure make it an excellent tool for analyzing large numbers of protein families.
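The 'shotgun' strategy described in the abstract (align one family with several programs, then keep the alignment that an objective function ranks highest) can be illustrated with a minimal sketch. The scoring below is NOT the WOOF function from the paper: it is a deliberately simple stand-in that counts identical, gap-free words of length k occupying the same columns in each pair of sequences. The names `shared_word_score`, `toy_objective`, `pick_best`, the word length `k=3`, and the example alignments are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

GAP = "-"


def shared_word_score(row_a, row_b, k=3):
    """Count gap-free windows of k columns in which both rows carry the same word."""
    hits = 0
    for i in range(len(row_a) - k + 1):
        word_a = row_a[i:i + k]
        word_b = row_b[i:i + k]
        # Windows containing a gap in either row are not treated as words.
        if GAP in word_a or GAP in word_b:
            continue
        if word_a == word_b:
            hits += 1
    return hits


def toy_objective(alignment, k=3):
    """Toy stand-in for an objective function: sum of pairwise shared-word counts."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(shared_word_score(a, b, k) for a, b in combinations(alignment, 2))


def pick_best(candidates, k=3):
    """Score every candidate alignment and return (name, score) of the best one."""
    scores = {name: toy_objective(aln, k) for name, aln in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Two made-up alignments of the same three sequences, as if produced by
# two different alignment programs (hypothetical data).
candidates = {
    "aligner_A": ["MKVLAG-TW", "MKVLAGSTW", "MKV-AGSTW"],
    "aligner_B": ["MKVLAGTW-", "MKVLAGSTW", "-MKVAGSTW"],
}

best_name, best_score = pick_best(candidates)
print(best_name, best_score)
```

Separating the objective function from the selection step means any other scoring function that accepts an alignment and returns a number could be passed through the same selection logic.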
Keyword multiple sequence alignment
objective function
sequence analysis
Computer Science, Interdisciplinary Applications
Biotechnology & Applied Microbiology
Biochemical Research Methods
Mathematics, Interdisciplinary Applications
Q-Index Code C1
Additional Notes This is a pre-copy-editing, author-produced PDF of an article accepted for publication in Bioinformatics following peer review. The definitive publisher-authenticated version of Robert G. Beiko, Cheong Xin Chan and Mark A. Ragan, A word-oriented approach to alignment validation, Bioinformatics 2005 21(10): 2230-2239; doi:10.1093/bioinformatics/bti335 is available online at: http://dx.doi.org/doi:10.1093/bioinformatics/bti335. Copyright 2005 Oxford Journals. All rights reserved.

 
Citation counts: Cited 11 times in Thomson Reuters Web of Science
Cited 10 times in Scopus
Created: Mon, 04 Sep 2006, 10:00:00 EST by Cheong Xin Chan on behalf of Institute for Molecular Bioscience