The makeup of vertebrate genomes has been shaped by a number of complex evolutionary
processes including sequence mutation, duplication, deletion, transposition
and natural selection. As only a very small fraction of the vertebrate genome encodes
proteins (approximately 1.5% in human), the functional role, if any, of the vast
majority of sequence is still largely unknown. Transposable elements (TEs) are a
major constituent of all vertebrate genomes. The human genome contains over three
million recognisable transposon insertions that are separated, on average, by ~500 bp
and account for around 45% of the genome.
This thesis shows that, despite the density of insertion across the genome, the
human and mouse genomes each contain almost 1000 transposon-free regions (TFRs)
over 10 kb in length. The majority of human TFRs correlate with orthologous TFRs
in the mouse, despite the fact that most transposons are lineage specific. Many human
TFRs also overlap with orthologous TFRs in the marsupial opossum, indicating that
these regions have remained refractory to transposon insertion for long evolutionary
periods. Over 90% of the bases covered by TFRs are non-coding, and much of this
non-coding sequence is not highly conserved. Most TFRs are not associated with unusual nucleotide
composition, but are significantly associated with genes encoding developmental
regulators.
TFRs are specifically resistant to transposon insertions, while other forms of
non-TE insertion appear to be tolerated at rates similar to those in other genomic sequences.
Furthermore, some families of transposons appear to be tolerated on the borders of
TFRs more frequently than other types of TEs. Together these results suggest that
it is not the potential of TEs to disrupt function by insertional mutagenesis that is
the primary pressure preventing TE insertions becoming fixed in these regions, but
rather the effect may be dependent on sequence-specific recognition of TE-derived
sequence by the cell.
To investigate the evolutionary origin of TFRs, the available non-mammalian
genomes were examined to establish which were suitable for TFR annotation. The
zebrafish genome contains 470 TFRs over 10 kb and a further 3,951 TFRs over
5 kb, which is comparable to the number identified in mammals. Two thirds of
zebrafish TFRs over 10 kb are orthologous to TFRs in at least one mammal, and
many have orthologous TFRs in all three mammalian genomes as well as in the
genome of Xenopus tropicalis. This indicates that the mechanism responsible for
the maintenance of TFRs has been active at these loci for over 450 million years.
Furthermore, syntenically conserved TFRs are more enriched for regulatory
genes than lineage-specific TFRs.
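The annotation of TFRs above a length threshold, as described for the 10 kb and 5 kb
sets above, can be illustrated with a minimal sketch. It is an illustration under stated
assumptions, not the pipeline used in this thesis: transposon annotations are assumed to
be given as half-open (start, end) intervals on a single chromosome, the function name
transposon_free_regions and its parameters are hypothetical, and a gap is reported when
it is at least min_length long. A real analysis would also need to exclude assembly gaps,
which this sketch ignores.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on one chromosome (an assumption of this sketch).
Interval = Tuple[int, int]


def transposon_free_regions(
    te_intervals: List[Interval],
    chrom_length: int,
    min_length: int = 10_000,
) -> List[Interval]:
    """Return the gaps between TE annotations that are at least min_length long."""
    regions: List[Interval] = []
    cursor = 0  # end coordinate of the furthest-reaching TE seen so far
    for start, end in sorted(te_intervals):
        # The stretch from cursor to start contains no annotated TE.
        if start - cursor >= min_length:
            regions.append((cursor, start))
        # max() handles overlapping or nested TE annotations.
        cursor = max(cursor, end)
    # Trailing gap between the last TE and the end of the chromosome.
    if chrom_length - cursor >= min_length:
        regions.append((cursor, chrom_length))
    return regions


# Toy annotations (invented for illustration, not real data).
example_tes = [(2_000, 2_300), (2_250, 2_600), (14_000, 14_500), (20_000, 20_400)]
tfrs_10kb = transposon_free_regions(example_tes, chrom_length=40_000)
tfrs_5kb = transposon_free_regions(example_tes, chrom_length=40_000, min_length=5_000)
```

Lowering min_length from 10 kb to 5 kb on the same toy input adds one further region,
mirroring the way the 5 kb set extends the 10 kb set.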
Finally, this thesis analyses the relationship between TFRs and the recently defined
“bivalent” chromatin domains in embryonic stem (ES) cells. Bivalent domains are
primarily restricted to ES cells and are strongly associated with transcriptionally
silenced genes that, when activated, induce lineage-specific differentiation. Initial
results, derived from a partial analysis of the genome, suggested that bivalent
domains were strongly associated with the presence of TFRs. However, using
recently published whole-genome annotation of bivalent domains, we find that only
a subset of TFRs are associated with bivalent domains.
In conclusion, the results presented in this thesis demonstrate that TFRs represent
regions of the genome that have been maintained free of transposon insertions for
at least the last 450 million years. TFRs are enriched for many important genomic
features such as regulatory genes, ultraconserved elements and bivalent domains.
However, the molecular and genetic basis for the exclusion of transposon sequence
from these extended regions is still unclear. We suggest that TFRs contain
extended regulatory sequences that contribute to the precise expression of genes
central to early vertebrate development, and can be used as predictors of important
regulatory regions. Furthermore, these results suggest that the analysis of nonrandom
patterns of different classes of sequences within genomes, in contrast to the
traditional focus on primary sequence conservation, may offer new opportunities for
the detection of functional elements within the genome.