In recent years, it has become apparent that large numbers of non-protein-coding RNAs (ncRNAs) are expressed from the genomes of mammals and other complex organisms, most of which appear to be expressed in a developmentally regulated manner. A substantial (and increasing) proportion of these ncRNAs have been shown to be functional, primarily as regulatory molecules. With the increasing recognition of the importance of ncRNAs in multicellular organisms, together with the accelerating rate of ncRNA discovery and functional association, there was a clear need for a mammalian ncRNA database as a basis for the structured exploration of these ncRNAs by the research community. Although there were existing proteomics and/or structural ncRNA databases, none was specifically dedicated to regulatory ncRNAs. The resulting database (RNAdb), built by collating a wide range of sources, contained as at the end of 2007 the sequences and annotations of several hundred thousand putative mammalian regulatory ncRNAs. These include a wide range of microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), PIWI-interacting RNAs (piRNAs) and larger mRNA-like ncRNAs, including putative antisense transcripts, as well as ncRNAs predicted on the basis of structural features and alignments. Across its original (2005) and updated (2007) releases, this database has been cited in almost 60 publications. The derivation of most of these ncRNAs from intergenic regions of the genome also prompted an analysis of sequence conservation patterns in the non-coding portion of the human and other mammalian genomes. This was made possible by the advent of vertebrate whole-genome sequencing projects and subsequent multi-species alignments.
In 2004, during initial processing of multi-species alignments, my colleagues and I searched for regions of high conservation within non-coding regions, on the postulate that such conservation would reflect negative selection acting on functional elements. We discovered almost 500 sequences of at least 200 base pairs (bp), together with more than 5,000 sequences of at least 100 bp, that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Importantly, the 3-way comparison not only increased the stringency of the search but also ruled out library cross-contamination as a source of these sequences. These ultraconserved elements (UCEs) are far more highly conserved than protein-coding sequences, and the majority are located outside coding sequence (CDS) regions. These UCEs also show 96% identity with chicken, which diverged from mammals ~310 million years ago (Mya). If the low substitution rate in UCEs had remained constant, these elements should also be present with a high level of identity in fish (~450 Mya). However, this is not the case, suggesting that many of these elements appeared in the amniotes or tetrapods, or that the molecular clock has slowed down in these lineages, or both. It was also apparent that some of the original UCEs were not 100% conserved in other mammals, suggesting that these sequences were evolving very slowly and that a new operational definition was required, one that did not depend on any particular reference set.
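The constant-rate argument can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only and is not the analysis performed in this work: it assumes a Jukes-Cantor substitution model (an assumption introduced here), treats the 96% chicken identity as a single average value, and uses the divergence times quoted above. The function names are hypothetical. Under these assumptions a strict clock would predict an identity of roughly 94% with fish.

```python
import math


def jc_distance(identity):
    """Jukes-Cantor distance (substitutions per site) from an observed identity."""
    p = 1.0 - identity  # observed proportion of differing sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_identity(distance):
    """Expected identity for a given Jukes-Cantor distance (inverse of jc_distance)."""
    p = 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))
    return 1.0 - p


def extrapolate_identity(identity, t_known_mya, t_target_mya):
    """Expected identity at t_target_mya if the rate seen at t_known_mya were constant."""
    # Two lineages accumulate changes since the split, hence the factor of 2.
    rate = jc_distance(identity) / (2.0 * t_known_mya)
    return jc_identity(rate * 2.0 * t_target_mya)


# Values taken from the text: 96% identity with chicken (~310 Mya), fish at ~450 Mya.
expected_fish = extrapolate_identity(0.96, 310.0, 450.0)
print(f"Expected identity with fish under a strict clock: {expected_fish:.3f}")
```

The point of the sketch is the comparison rather than the exact figure: identity in fish that falls well below the extrapolated value is what motivates the alternatives of later origin, a slowed clock, or both.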
Taking advantage of the subsequent availability of multiple genomes, 13,736 UCEs were identified in the human genome that are identical over at least 100 bp in at least 3 out of 5 placental mammals, including 2,189 sequences of at least 200 bp, thereby greatly expanding the repertoire of known UCEs, and the evolution of these sequences was investigated in opossum, chicken, frog, and fish. These analyses showed a massive genome-wide acquisition and expansion of UCEs (in both the size and the number of sequences recruited as ostensibly functional units) during tetrapod and then amniote evolution, accompanied by a slowdown of the molecular clock, particularly in the amniotes, a process consistent with their functional exaptation in these lineages. The majority of tetrapod-specific UCEs are non-coding and associated with genes involved in regulation of transcription and development. In contrast, fish genomes contain relatively few UCEs, the majority of which are common to all bony vertebrates. These elements are different from other conserved non-coding elements, and appear to be important regulatory innovations that became fixed following the transition of vertebrates from the sea to the land.
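The "identical in at least k of n species over at least a minimum length" criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: it assumes the input is a gapped multiple alignment given as equal-length strings with `-` for gaps, it interprets the criterion as requiring the same k species to match the reference across an entire run, and it merges overlapping or adjacent runs from different species subsets. The function names and the toy sequences are hypothetical. Setting k equal to the number of species reproduces the stricter all-species definition.

```python
from itertools import combinations


def identical_runs(reference, others, min_len):
    """Return (start, end) column intervals where every sequence in `others`
    matches `reference` exactly and the reference has no gap."""
    runs = []
    start = None
    for i, ref_base in enumerate(reference):
        # A gap in any sequence breaks the run: a gapped reference column is
        # rejected outright, and a gap elsewhere cannot equal a reference base.
        ok = ref_base != "-" and all(seq[i] == ref_base for seq in others)
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(reference) - start >= min_len:
        runs.append((start, len(reference)))
    return runs


def k_of_n_conserved(reference, species, k, min_len):
    """Intervals (alignment columns, end-exclusive) where at least `k` of the
    aligned `species` are identical to `reference` over >= `min_len` columns."""
    ref = reference.upper()  # ignore soft-masking (lowercase) differences
    seqs = {name: seq.upper() for name, seq in species.items()}
    if any(len(seq) != len(ref) for seq in seqs.values()):
        raise ValueError("all aligned sequences must have the same length")

    intervals = []
    for subset in combinations(sorted(seqs), k):
        intervals.extend(identical_runs(ref, [seqs[n] for n in subset], min_len))

    # Merge overlapping or adjacent intervals found with different subsets.
    intervals.sort()
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy alignment (hypothetical data); real use would pass min_len=100 and k=3.
toy_reference = "ACGTACGTAC"
toy_species = {
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTACGTTC",
    "species_c": "TCGTACGAAC",
}
print(k_of_n_conserved(toy_reference, toy_species, k=2, min_len=4))
```

The returned intervals are alignment columns rather than genome coordinates; mapping them back to the reference assembly would be a separate step. Enumerating species subsets is practical for small n such as 5, but would need a different strategy for large alignments.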