Brassica and wheat are important crops for agriculture in Australia and worldwide. Their production is challenged by biotic stresses such as diseases and by environmental factors including drought and soil salinity.
In comparison to the model species Arabidopsis thaliana and rice, the genomes of Brassica and wheat are both large and complex. This size and complexity make it more difficult to determine their genome sequences.
The sequence information produced by Second Generation Sequencing (SGS) technologies allows researchers to identify, for example, large numbers of molecular genetic markers, which can be used to study heritable traits and for applied crop improvement.
SGS technologies are speeding up genome sequencing, but they have led to vast increases in the amount of data, resulting in major computational challenges. To manage this data, new computational systems have to be designed to support SGS-based research.
This thesis describes the design, implementation and validation of the SGSautoSNP pipeline, a new approach to calling SNPs in large and complex crop genomes using SGS sequences. In our method, the reference genome sequence is used only to assemble the reads, and SNPs are then called between these assembled reads. The pipeline includes gene prediction and SNP annotation, and it identifies regions of low SNP density, which are more conserved than regions of high SNP density.
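The idea of calling SNPs between assembled reads, rather than against the reference, can be illustrated with a minimal Python sketch. It is not the SGSautoSNP implementation: the function name `call_snp`, the cultivar labels, the `min_reads_per_allele` threshold and the rule that reads from one cultivar must agree with each other are all assumptions made for illustration.

```python
from collections import defaultdict


def call_snp(column, min_reads_per_allele=2):
    """Decide whether one aligned position looks like a SNP between cultivars.

    column: list of (cultivar, base) pairs, one per read covering the position.
    Returns a dict mapping each cultivar to its allele, or None if no SNP is
    called. The reference base is deliberately never consulted.
    """
    bases_by_cultivar = defaultdict(set)
    allele_counts = defaultdict(int)
    for cultivar, base in column:
        bases_by_cultivar[cultivar].add(base)
        allele_counts[base] += 1

    # Assumption: all reads from a single cultivar must carry the same base.
    if any(len(bases) > 1 for bases in bases_by_cultivar.values()):
        return None
    # Assumption: exactly two alleles are observed across cultivars.
    if len(allele_counts) != 2:
        return None
    # Assumption: each allele needs a minimum number of supporting reads.
    if min(allele_counts.values()) < min_reads_per_allele:
        return None
    return {cv: next(iter(bases)) for cv, bases in bases_by_cultivar.items()}


# Hypothetical pileup column: two cultivars, two reads each.
example = [("cv1", "G"), ("cv1", "G"), ("cv2", "T"), ("cv2", "T")]
print(call_snp(example))  # {'cv1': 'G', 'cv2': 'T'}
```

Because only the reads are compared with each other, an error in the reference sequence at this position would not change the call in this sketch.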
A total of 638,593 SNPs in the Brassica napus AA genome and 881,289 SNPs in the wheat group 7 chromosome arms were identified using the SGSautoSNP pipeline. Validation of 20 B. napus AA genome SNPs resulted in a SNP prediction accuracy of around 95%. Of the 28 wheat SNPs that were used for validation of the SGSautoSNP pipeline, 26 (93%) produced the expected genotype.
By combining the SGSautoSNP pipeline with SnpEff, it was possible to determine whole-genome SNP trends, transition to transversion ratios and SNP frequencies across chromosomes. Annotation of B. napus AA genome SNPs has revealed that 0.5% of predicted SNPs are classified as “high effect” SNPs, which could affect protein structure or the encoded amino acid sequence.
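The transition to transversion ratio mentioned above can be computed from a list of allele pairs. The following Python sketch uses the standard definition (a transition exchanges two purines or two pyrimidines, every other substitution is a transversion); the function names and the input format are assumptions for illustration and do not reflect how SGSautoSNP or SnpEff report this value.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(allele_a, allele_b):
    """True for A<->G or C<->T, i.e. a change within one base class."""
    pair = {allele_a, allele_b}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions.

    snps: iterable of (allele_a, allele_b) pairs with allele_a != allele_b.
    Returns float('inf') if there are no transversions.
    """
    snps = list(snps)
    transitions = sum(1 for a, b in snps if is_transition(a, b))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


# Hypothetical SNPs: two transitions (A/G, C/T) and two transversions (A/C, G/T).
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]))  # 1.0
```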
The molecular markers, genes, genetic and marker annotations and gene ontology produced by the SGSautoSNP pipeline are stored in a newly developed database called SGSautoSNPdb. This information is linked to other databases so that researchers can access it quickly and in a biologist-friendly manner.
Together, the SGSautoSNP pipeline and SGSautoSNPdb provide tools that help us understand how natural selection has shaped the evolution of crop genomes, as well as SNPs that can be applied to improve crops in order to secure a sufficient food source into the future.