THE COMPLETE HUMAN GENOME SEQUENCE has been resolved for more than a decade, yet the fraction that is biologically functional remains uncertain. Classical protein-coding genes compose less than 2% of the sequence, while the majority is transcribed in dynamic and specific patterns during differentiation and development. Hence, the pervasive nature of mammalian transcription stands in stark contrast to how little of it has been functionally annotated. The key to resolving this discrepancy likely resides in the structural and functional analysis of the extensive non-protein-coding regions of the genome, which scale proportionately with organismal complexity throughout metazoan evolution. Moreover, there is increasing, albeit still limited, evidence that non-protein-coding RNA transcripts arising from the genome perform a multitude of functional roles in the cell, primarily in the regulation of epigenetic processes.
This thesis investigates the structures of RNA molecules encoded in mammalian genomes, with the aim of characterizing the biological foundation of the expansive genomic landscape of undetermined function. In the first part of the thesis I describe the structural predictions of non-protein-coding RNAs involved in the regulation of mammary development, breast cancer, melanoma, and mitochondrial disease. These findings support the idea that non-protein-coding portions of mammalian genomes may indeed convey function through RNA structure, and are therefore presumably subject to evolutionary selection. However, these non-coding regions display only patchy evidence of sequence conservation, consistent with observations that less than ~10% of the mammalian genome sequence is conserved throughout evolution. The studies behind those observations do not consider RNA structure conservation, which follows different evolutionary dynamics from those that apply when the function of a given locus derives from sequence constraint alone. Indeed, genomic loci that function through RNA structure display greater evolutionary plasticity than, for instance, protein-coding regions, because mutations may be incorporated more liberally so long as they maintain base-pairing in the resulting structure. Past reports premised on evaluating RNA structure conservation are limited in the breadth of sampled loci and produce an unsettling number of false positives and divergent predictions.
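The tolerance for compensatory mutations described above can be illustrated with a minimal sketch. It is not a method from this thesis: the sequences, the hairpin structure, and the helper names (`paired_positions`, `is_compatible`, `sequence_identity`) are hypothetical, and the only assumptions are standard dot-bracket notation and the usual Watson-Crick and G-U wobble pairing rules.

```python
# Illustrative only: hypothetical sequences and helper names, not thesis data.
# Canonical Watson-Crick pairs plus G-U wobble pairs.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_positions(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def is_compatible(sequence, structure):
    """True if every paired position in the structure can form a valid base pair."""
    return all(
        (sequence[i], sequence[j]) in VALID_PAIRS
        for i, j in paired_positions(structure)
    )


def sequence_identity(a, b):
    """Fraction of aligned positions that are identical (equal-length input assumed)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


structure = "((((....))))"    # a 4-bp stem closing a 4-nt loop
ancestor = "GGCAGAAAUGCC"     # hypothetical ancestral hairpin
compensated = "GCCUGAAAAGGC"  # two stem pairs changed on both strands
disrupted = "GCCAGAAAUGCC"    # one stem position changed on one strand only

print(is_compatible(ancestor, structure))
print(is_compatible(compensated, structure))
print(is_compatible(disrupted, structure))
print(sequence_identity(ancestor, compensated))
```

In this toy case the compensated sequence differs from the ancestor at four of twelve positions yet still fits the hairpin, whereas a single uncompensated substitution breaks a stem pair. A screen based on sequence identity alone would miss this kind of conservation.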
I address these methodological impediments in the subsequent portions of this thesis through the development of an algorithm for benchmarking the performance of RNA secondary structure prediction. Two refined, energy-based consensus structure prediction algorithms (RNAZ and SISSIZ) were tested on a broad set of true positives that reflects experimental methodologies, which enables independent performance evaluation under variable parameters. The results expose the complementary nature of the two algorithms, highlighting SISSIZ's strength at detecting evolutionarily conserved RNA structures where sequence conservation is limited.
I subsequently describe the development of a hybrid algorithm for massively parallel, comparative genomic screens of RNA secondary structure conservation, based on the optimal performance range of each tested algorithm as determined from the aforementioned benchmarking. When applied to consistency-based multiple genome alignments of 35 mammals, this approach confidently identifies over 4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails a false discovery rate of 8.1%, a historic low for such analyses. These predictions comprise 13.6% of the human genome, and 88% of them fall outside any known sequence-constrained element. The findings reported in this thesis represent a lower bound; extrapolations suggest that over 30% of the genome is under purifying selection for RNA structure, consistent with other indices suggesting that a large proportion of the mammalian genome is functional. The extensive set of functional transcriptomic annotations presented in this thesis provides a comprehensive resource to aid in uncovering the precise molecular mechanisms underlying complex diseases, development and evolution, as well as the beginnings of a potential basis for parsing structure-function relationships in the vast numbers of noncoding RNAs expressed from mammalian genomes.
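The figures quoted above can be combined with some back-of-the-envelope arithmetic, sketched below. Two assumptions are mine rather than the thesis's: the prediction count is set to exactly 4 million (the text says "over 4 million", so this is a floor), and the false discovery rate is read in its usual sense, as the expected fraction of reported predictions that are false. The variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumption: exactly 4 million predictions (the abstract says "over 4 million").
n_predictions = 4_000_000
false_discovery_rate = 0.081  # 8.1%, read as expected fraction of false predictions
genome_fraction = 0.136       # predictions cover 13.6% of the human genome
outside_known = 0.88          # 88% fall outside known sequence-constrained elements

# Expected number of false and true predictions at the stated FDR.
expected_false = n_predictions * false_discovery_rate
expected_true = n_predictions - expected_false

# Share of the genome covered by predictions outside known constrained elements.
# Assumption: the 88% applies to genomic coverage rather than to prediction counts.
novel_fraction = genome_fraction * outside_known

print(f"Expected false predictions: {expected_false:,.0f}")
print(f"Expected true predictions:  {expected_true:,.0f}")
print(f"Genome share outside known elements: {novel_fraction:.1%}")
```

Under these assumptions, roughly 12% of the human genome would be covered by predicted structures that lie outside any known sequence-constrained element, and the large majority of the predictions would be expected to be genuine.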