Bioinformatic identification of functional noncoding elements and expressed noncoding RNAs

Chol Hee Jung (2010). Bioinformatic identification of functional noncoding elements and expressed noncoding RNAs. PhD Thesis, Institute for Molecular Bioscience, The University of Queensland.

Attached Files
Name Description MIMEType Size Downloads
s41123359_PhD_abstract.pdf s41123359_PhD_abstract.pdf application/pdf 34.79KB 1
s41123359_PhD_totalthesis.pdf s41123359_PhD_totalthesis.pdf application/pdf 6.20MB 13
Author Chol Hee Jung
Thesis Title Bioinformatic identification of functional noncoding elements and expressed noncoding RNAs
School, Centre or Institute Institute for Molecular Bioscience
Institution The University of Queensland
Publication date 2010-06
Thesis type PhD Thesis
Supervisor Professor John Mattick
Total pages 305
Total colour pages 46
Total black and white pages 259
Subjects 06 Biological Sciences
Abstract/Summary The majority of the genome in multicellular organisms is composed of non-protein-coding (noncoding) sequences, the amount of which increases with developmental complexity. This observation and other evidence support the contention that these noncoding regions house numerous functional sequence elements, including cis-acting protein-binding sites and regulatory noncoding RNAs (ncRNAs), many of which are conserved. Over recent years, the number of identified ncRNA genes has increased. Indeed, it is likely that higher eukaryotes produce a comparable if not greater number of ncRNAs than protein-coding mRNAs. All annotated conserved regions of the genome are associated with functional elements and transcripts, but most conserved regions remain unannotated, suggesting that many more of these elements remain to be identified. On this basis, I analyzed both conserved noncoding regions and sequenced fragments of expressed ncRNAs of the genomically and transcriptomically well-characterized model organism Drosophila melanogaster to identify potentially functional sequence elements and ncRNAs.
The first approach was based upon a search for over-represented sequence motifs, on the hypothesis that different classes of regulatory RNAs or cis-acting regulatory sequences might possess core sequence motifs, such as those found in small nucleolar RNAs (snoRNAs), which may be used to parse these sequences and identify new subclasses. The conserved noncoding regions of D. melanogaster were searched for over-represented tetranucleotide pairs (a choice that best covers the possibilities while remaining computationally tractable) separated by specific distances of up to 100 bp (termed 'pattern-cores'). Among over 17,000 over-represented pattern-cores, 473 showed the highest information content in their surrounding sequences and were extended using the program MEME into longer motifs defined by position-specific scoring matrices (PSSMs). These motifs were then classified into 23 groups based on their similarity. The whole genome was scanned for genomic sites of the motifs using threshold values that effectively separated true from false positives. The results identified five groups of known functional elements: a subset of tRNAs, motifs immediately downstream of histone genes, and three types of protein-binding sites, including one recognized by the chromatin insulator protein Su(Hw). Two novel groups with large numbers of instances, DLM3 and DLM4, were investigated in more detail and showed strong evidence for functional potential, including their abundance in the genome, conservation across (and only within) other Drosophila species, location in specific genomic regions (DLM3), strong predicted RNA folding energy (DLM4) and positive signals in Northern hybridization analysis (DLM4). Motifs in some other groups also showed functional potential, such as enrichment in the promoter regions of genes in specific biological process categories or in genomic loci covered by short RNA sequence data. These findings suggest that there may be many more such motifs, especially lineage-specific motifs, to be discovered in other genomes by these strategies.
The second approach employed an empirical analysis of short-read, high-density RNA sequencing data. Published datasets of short RNA sequences from D. melanogaster were combined and used to assemble tag-contigs. Tag-contigs identified most known small ncRNAs (such as tRNAs, snRNAs and snoRNAs), and showed distinctive characteristics associated with different classes of small ncRNAs. By using these characteristics in conjunction with the typical sequence motifs of snoRNAs, 7 novel box H/ACA and 26 box C/D snoRNAs were identified. In addition, one novel snRNA and hundreds of putative ncRNA candidates of uncharacterized classes were predicted, 15 out of 21 of which showed corresponding signals in subsequent Northern hybridization analysis. The combined use of small RNA sequence data from various tissues also successfully inferred the expression profiles of the putative ncRNA candidates. This approach was then extended to the nematode Caenorhabditis elegans to identify hundreds of putative ncRNAs with specific expression profiles.
The pattern-core approach accurately identified over-represented sequence motifs and can be modified to accommodate different gap sizes between the two tetranucleotides or to alter the size of the two co-dependent short sequence elements. The tag-contig approach is a simple yet effective way to gather preliminary candidates of novel noncoding RNAs, but it (as yet) only skims the surface of the great complexity of small RNA species. Moreover, excessive accumulation of sequence data can cause ambiguity in the 5'/3' cleavage sites of the candidates. Thus, additional computational and data-driven analyses need to be developed for better prediction, identification and understanding of ncRNAs. Additionally, since some species of small noncoding RNAs can be defined by distinctive sequence features within them, applying the pattern-core approach to the tag-contig data would be one way to better classify putative noncoding RNAs.
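As an illustration of the pattern-core idea described in the abstract (pairs of tetranucleotides separated by a gap of up to 100 bp), the following is a minimal Python sketch, not the thesis implementation. The function names, the definition of the gap as the number of bases between the two tetranucleotides, and the observed/expected ratio under an independence assumption are all assumptions made for illustration; the abstract does not state which over-representation statistic was used.

```python
from collections import Counter

K = 4  # tetranucleotide length, as stated in the abstract
VALID = set("ACGT")


def count_pattern_cores(seqs, max_gap=100):
    """Count (left tetramer, gap, right tetramer) triples.

    Assumption: `gap` is the number of bases between the end of the left
    tetramer and the start of the right one. Tetramers containing
    characters other than A, C, G, T are ignored.
    """
    pair_counts = Counter()
    kmer_counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        kmers = []
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            kmers.append(kmer if set(kmer) <= VALID else None)
        kmer_counts.update(k for k in kmers if k is not None)
        for i, left in enumerate(kmers):
            if left is None:
                continue
            for gap in range(max_gap + 1):
                j = i + K + gap  # start index of the right tetramer
                if j >= len(kmers):
                    break
                right = kmers[j]
                if right is not None:
                    pair_counts[(left, gap, right)] += 1
    return pair_counts, kmer_counts


def enrichment(pair_counts, kmer_counts):
    """Crude observed/expected ratio, assuming the two tetramers occur
    independently (hypothetical scoring, for illustration only)."""
    total = sum(kmer_counts.values())
    scores = {}
    for (left, gap, right), observed in pair_counts.items():
        expected = kmer_counts[left] * kmer_counts[right] / total
        scores[(left, gap, right)] = observed / expected
    return scores


if __name__ == "__main__":
    pairs, kmers = count_pattern_cores(["ACGTTTTTGGCC"], max_gap=10)
    ranked = sorted(enrichment(pairs, kmers).items(), key=lambda kv: -kv[1])
    for key, score in ranked[:3]:
        print(key, round(score, 2))
```

A real analysis would need a proper background model and a significance test with multiple-testing correction; the ratio here only shows how candidate pattern-cores could be enumerated and ranked.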
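The tag-contig step mentioned in the abstract can be pictured as merging mapped short reads that overlap on the same strand into contiguous blocks. The sketch below is one possible reading, not the thesis procedure: it assumes reads have already been mapped to genomic coordinates, and the tuple layout, the function name `build_tag_contigs`, and the rule that abutting reads are merged are hypothetical choices.

```python
from collections import defaultdict


def build_tag_contigs(reads):
    """Merge overlapping or abutting reads into tag-contigs.

    `reads` is an iterable of (chrom, strand, start, end) tuples with
    0-based, half-open coordinates (an assumed input format).
    Returns a list of (chrom, strand, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for chrom, strand, start, end in reads:
        by_key[(chrom, strand)].append((start, end))

    contigs = []
    for (chrom, strand), spans in sorted(by_key.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        n_reads = 1
        for start, end in spans[1:]:
            if start <= cur_end:
                # Read overlaps or touches the current contig: extend it.
                cur_end = max(cur_end, end)
                n_reads += 1
            else:
                contigs.append((chrom, strand, cur_start, cur_end, n_reads))
                cur_start, cur_end, n_reads = start, end, 1
        contigs.append((chrom, strand, cur_start, cur_end, n_reads))
    return contigs


if __name__ == "__main__":
    example = [
        ("2L", "+", 100, 121),
        ("2L", "+", 110, 132),
        ("2L", "+", 300, 322),
        ("2L", "-", 105, 126),
    ]
    for contig in build_tag_contigs(example):
        print(contig)
```

Keeping the read count per contig gives a rough expression signal; as the abstract notes, very deep data can blur the 5'/3' ends of such contigs, so end positions from a merge like this should be treated with caution.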
Keyword noncoding RNA; regulatory elements; next-generation sequencing
Additional Notes colour pages: 46 pages 3 (1 page), 5 (1 page), 10 (1 page), 25 (1 page), 34-35 (2 pages), 37 (1 page), 39 (1 page), 52-55 (4 pages), 57 (1 page), 58 (1 page), 60-61 (2 pages), 64 (1 page), 72 (1 page), 82 (1 page), 88 (1 page), 91-92 (2 pages), 94 (1 page), 96-97 (2 pages), 255-257 (3 pages), 259 (1 page), 271-287 (17 pages), landscape pages: 57 (1 page), 99-101 (3 pages), 122-182 (61 pages), 242 (1 page), 264-287 (24 pages)

Created: Wed, 22 Sep 2010, 17:03:23 EST by Mr Chol Hee Jung on behalf of Library - Information Access Service