Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

Hudson, Nicholas J., Porto-Neto, Laercio R., Kijas, James, McWilliam, Sean, Taft, Ryan J. and Reverter, Antonio (2014) Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest. BMC Bioinformatics, 15: 66. doi:10.1186/1471-2105-15-66


Author Hudson, Nicholas J.
Porto-Neto, Laercio R.
Kijas, James
McWilliam, Sean
Taft, Ryan J.
Reverter, Antonio
Title Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest
Journal name BMC Bioinformatics
ISSN 1471-2105
Publication date 2014-03-07
Year available 2014
Sub-type Article (original research)
DOI 10.1186/1471-2105-15-66
Open Access Status DOI
Volume 15
Issue Article ID.66
Total pages 21
Place of publication London, United Kingdom
Publisher BioMed Central Ltd.
Language eng
Subject 1315 Structural Biology
1303 Biochemistry
1312 Molecular Biology
1706 Computer Science Applications
2604 Applied Mathematics
Abstract Background: Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contain patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE). In principle, the composition of an entire genome can be represented by a single CE value that quantifies allele representation and order.
Results: We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dog and 49 mouse samples. All human ethnic groups can be clustered by CE, and the clusters recover phylogeography based on traditional fixation index (FST) analyses. CE analysis of other mammals results in segregation by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high-resolution CE 'sliding window' scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores, including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather to compositionally complex regions that are shared by many individuals of a given population. Unlike FST, CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome.
Conclusions: We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows one to clearly visualise the evolution of individual genomes and populations through a formal, mathematically rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.
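Illustration: the abstract describes scoring genotype data by how well DEFLATE compresses it, both genome-wide and in a sliding window. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes genotypes are encoded as a string of the characters 0, 1 and 2 (one per SNP), and it assumes CE is the fractional size reduction after compression; the abstract does not give the exact definition or encoding. The function names, window size and step are hypothetical.

```python
import zlib


def compression_efficiency(genotypes: str, level: int = 9) -> float:
    """Fractional size reduction of a genotype string under DEFLATE.

    Assumption: CE = 1 - compressed_size / original_size. The published
    definition may differ. zlib.compress wraps the DEFLATE stream in a
    small zlib header and checksum, which slightly lowers the score for
    short inputs.
    """
    raw = genotypes.encode("ascii")
    if not raw:
        raise ValueError("genotype string must not be empty")
    compressed = zlib.compress(raw, level)
    return 1.0 - len(compressed) / len(raw)


def sliding_window_ce(genotypes: str, window: int, step: int):
    """Return (start_index, CE) for each full window along the string."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, compression_efficiency(genotypes[start:start + window]))
        for start in range(0, len(genotypes) - window + 1, step)
    ]
```

Under these assumptions, a long run of identical genotype calls (for example a run of homozygosity) scores close to 1, while a compositionally mixed stretch scores lower. Comparing per-window scores across individuals grouped by population would be a separate step that this sketch does not cover.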
Keyword Information compression
Phylogeography
Selection signatures
Q-Index Code C1
Q-Index Status Confirmed Code
Grant ID B.BSC.0344
Institutional Status UQ

Document type: Journal Article
Sub-type: Article (original research)
Collections: Official 2015 Collection
Institute for Molecular Bioscience - Publications
Citation counts: Thomson Reuters Web of Science: cited 4 times
Scopus: cited 3 times
Created: Tue, 29 Apr 2014, 10:36:45 EST by System User on behalf of Institute for Molecular Bioscience