Identifying Change-Points in Biological Sequences via Sequential Importance Sampling

Sofronov, George Yu., Evans, Gareth E., Keith, Jonathan M. and Kroese, Dirk P (2009) Identifying Change-Points in Biological Sequences via Sequential Importance Sampling. Environmental Modeling & Assessment, 14(5): 577-584. doi:10.1007/s10666-008-9160-8

Author Sofronov, George Yu.
Evans, Gareth E.
Keith, Jonathan M.
Kroese, Dirk P
Title Identifying Change-Points in Biological Sequences via Sequential Importance Sampling
Journal name Environmental Modeling & Assessment
ISSN 1420-2026
Publication date 2009-10
Year available 2008
Sub-type Article (original research)
DOI 10.1007/s10666-008-9160-8
Volume 14
Issue 5
Start page 577
End page 584
Total pages 8
Editor Jerzy A. Filar
Place of publication Netherlands
Publisher Springer
Collection year 2009
Language eng
Subject C1
970101 Expanding Knowledge in the Mathematical Sciences
010401 Applied Statistics
Abstract The genomes of complex organisms, including the human genome, are highly structured. This structure takes the form of segmental patterns of variation in various properties and may be caused by the division of genomes into regions of distinct function, by the contingent evolutionary processes that gave rise to genomes, or by a combination of both. Whatever the cause, identifying the change-points between segments is potentially important, as a means of discovering the functional components of a genome, understanding the evolutionary processes involved, and fully describing genomic architecture. One property of genomes that is known to display a segmental pattern of variation is GC content. The GC content of a portion of DNA is the proportion of GC pairs that it contains. Sharp changes in GC content can be observed in human and other genomes. Such change-points may be the boundaries of functional elements or may play a structural role. We model genome sequences as a multiple change-point process, that is, a process in which sequential data are separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. We consider a Sequential Importance Sampling approach to change-point modeling using Monte Carlo simulation to find estimates of change-points as well as parameters of the process on each segment. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates for the locations of change-points in artificially generated sequences and compare the accuracy of these estimates to those obtained via Markov chain Monte Carlo and a well-known method, IsoFinder. We also provide examples with real data sets to illustrate the usefulness of this method.
Keyword Comparative genomics
Multiple change-point problem
Q-Index Code C1
Q-Index Status Provisional Code
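
Illustrative note on the abstract's definition of GC content: the short Python sketch below computes the proportion of G and C bases in consecutive windows of a toy sequence, which makes a sharp change in GC content visible. It is not the Sequential Importance Sampling method of the paper. The function names, the window size, and the toy sequence are assumptions made here for illustration, and counting G and C bases on a single strand is used as a stand-in for counting GC pairs.

```python
def gc_content(seq: str) -> float:
    """Proportion of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC content of consecutive, non-overlapping windows of seq."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


# Hypothetical toy sequence: an AT-only segment followed by a GC-only
# segment, so the single change-point lies at position 20.
toy = "ATTATAATTAATATTAATTA" + "GCGGCCGCGCCGGCGCGGCC"
profile = windowed_gc(toy, 10)
print(profile)
```

In this toy case the windowed profile jumps between the second and third windows, which is where the two segments meet. Real genomic data are noisier, which is why the abstract describes a statistical change-point model rather than a simple threshold.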

Document type: Journal Article
Collections: Excellence in Research Australia (ERA) - Collection
School of Mathematics and Physics
Citation counts: Thomson Reuters Web of Science: cited 8 times
Scopus: cited 11 times
Created: Wed, 31 Mar 2010, 11:21:04 EST by Kay Mackie on behalf of School of Mathematics & Physics