An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome

Ribeiro, Antonio, Golicz, Agnieszka, Hackett, Christine Anne, Milne, Iain, Stephen, Gordon, Marshall, David, Flavell, Andrew J. and Bayer, Micha (2015) An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome. BMC Bioinformatics, 16(1): 382.1-382.16. doi:10.1186/s12859-015-0801-z


Author Ribeiro, Antonio
Golicz, Agnieszka
Hackett, Christine Anne
Milne, Iain
Stephen, Gordon
Marshall, David
Flavell, Andrew J.
Bayer, Micha
Title An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome
Journal name BMC Bioinformatics
ISSN 1471-2105
Publication date 2015-11-11
Year available 2015
Sub-type Article (original research)
DOI 10.1186/s12859-015-0801-z
Open Access Status DOI
Volume 16
Issue 1
Start page 382.1
End page 382.16
Total pages 16
Place of publication London, United Kingdom
Publisher BioMed Central
Language eng
Formatted abstract
Background
Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling: quality of the reference sequence, read length, choice of mapper, choice of variant caller, mapping stringency, and filtering of SNPs by read mapping quality and by read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive.
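
The counting logic behind this design can be illustrated with a minimal sketch. Because the simulated reads carry no errors and no variants, every SNP call that survives filtering is by construction a false positive, so counting FPs reduces to counting retained calls. The `SnpCall` record, the `count_false_positives` helper, the threshold values and the three toy calls below are all hypothetical and chosen only for illustration; they are not the authors' pipeline or data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical record for one SNP call (not a real VCF parser)."""
    position: int
    mapping_quality: float  # read mapping quality supporting the call
    read_depth: int         # number of reads covering the site


def count_false_positives(calls: Iterable[SnpCall],
                          min_mapq: float = 0.0,
                          min_depth: int = 0) -> int:
    """Count calls that pass both filters.

    Assumes the reads were simulated without errors or variants,
    so every retained call is a false positive.
    """
    return sum(
        1
        for call in calls
        if call.mapping_quality >= min_mapq and call.read_depth >= min_depth
    )


# Toy calls, invented for illustration only.
calls = [
    SnpCall(position=101, mapping_quality=60.0, read_depth=25),
    SnpCall(position=2500, mapping_quality=3.0, read_depth=40),
    SnpCall(position=9100, mapping_quality=42.0, read_depth=2),
]

unfiltered = count_false_positives(calls)
filtered = count_false_positives(calls, min_mapq=20.0, min_depth=5)
print(unfiltered, filtered)
```

With the default thresholds of zero, nothing is filtered and all calls count as false positives; raising either threshold can only reduce the count, which is consistent with the abstract's observation that a lack of SNP filtering increases the number of FP SNPs.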

Results
The number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pair (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated, and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases.

Conclusions
The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration.
Keyword False positive
Mapping stringency
Misassembly
Read length
Read mismapping
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ

Document type: Journal Article
Collections: School of Agriculture and Food Sciences
Official 2016 Collection
 
Citation counts: Thomson Reuters Web of Science: cited 3 times
Scopus: cited 3 times
Created: Tue, 24 Nov 2015, 10:25:49 EST by System User on behalf of School of Agriculture and Food Sciences