Genome structure and evolution
Introduction
• Sequence evolution and alignment
– Context-dependent substitution models
– Indels
– Statistical alignment
Uncertainty in Homology Inferences: Assessing and Improving Genomic Sequence Alignment
Gerton Lunter, Andrea Rocco, Naila Mimouni, Andreas Heger, Alexandre Caldeira, and Jotun Hein
• Non-coding DNA
– Identification
– Estimation
– Evolution (adaptation?)
– Expression (where? function?)
• Evolution of expression levels
How do gene expression levels evolve over time?
Is selection or drift the main factor?
Collaboration with Philipp Khaitovich (Shanghai); Raphaelle Chaix (Paris)
• Structured Coalescent
Population genetics with geographical structure
(E.g. island populations; HIV transmission in different risk groups)
Collaboration with Chris Holmes (Oxford), Oliver Pybus (Oxford), Alexei Drummond (New Zealand), Andrew Rambaut (Edinburgh)
New sequencing technologies
• ABI: 1100 bases / read, 1 Mb / day
• Roche / 454: 250 bases / read, 100 Mb / 7.5 hours
• Solexa / Illumina: 35-50 bases / read, 2 Gb / 6.5 hours
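To compare the three platforms on one scale, the quoted throughputs can be converted to megabases per 24 hours. This is only a rough sketch: it assumes runs follow each other without any downtime and that 1 Gb = 1000 Mb, neither of which is stated on the slide.

```python
# Throughput figures quoted on the slide, as (megabases per run, hours per run).
# Assumption: runs are back-to-back with no set-up time; 1 Gb = 1000 Mb.
platforms = {
    "ABI": {"mb_per_run": 1.0, "hours_per_run": 24.0},
    "Roche / 454": {"mb_per_run": 100.0, "hours_per_run": 7.5},
    "Solexa / Illumina": {"mb_per_run": 2000.0, "hours_per_run": 6.5},
}


def mb_per_day(mb_per_run, hours_per_run):
    """Scale the yield of one run to a 24-hour day."""
    return mb_per_run * 24.0 / hours_per_run


throughput = {name: mb_per_day(**spec) for name, spec in platforms.items()}

for name, value in throughput.items():
    print(f"{name}: {value:.0f} Mb/day")
```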
This week
1. Genome structure and evolution -- general introduction, biological motivation
2. Bayesian statistics; Stochastic processes; substitutions
3. Pairwise alignments; probabilistic alignments; hidden Markov models
4. Applications: Neutral indel model; amount of functional sequence in human genome;
positive selection on non-coding DNA
5. Population genetics; Lewontin’s “paradox of variation”; modelling geographic
population structure
Genome structure and evolution
1. Genome structure – Double helix
1. Genome structure - Chromatin
DNA is packed into chromosomes in a hierarchical way:
The DNA double helix is coiled around histone octamers. About 147 base pairs wrap around a single octamer (about 1.65 turns). These “beads” (nucleosomes) are separated by a spacer (linker) sequence roughly 50 nt long.
Histone beads pack into 30 nm fibers.
Fibers are tied up into scaffolds.
Condensed scaffolds make up the macroscopic form of chromosomes.
1. Genome structure - Chromosomes
Human cells have 46 chromosomes: 22 pairs of ordinary chromosomes (autosomes), with one member of each pair from the father and one from the mother, and two sex chromosomes (X from the mother, X or Y from the father).
In preparation for normal cell division (mitosis), chromosomes are replicated, but the two copies remain joined at their centromere (prophase). This gives the chromosomes their “X” shape. Both “halves” of the X are called (sister) chromatids.
When cells are not dividing, chromosomes are decondensed and do not show this “X” shape.
http://biology.unm.edu/ccouncil/Biology_124/Summaries/Sex.html
1. Genome structure – DNA methylation
DNA methylation is an epigenetic marker that controls or regulates many biological functions:
- Control of gene expression
- Control of DNA replication
- Control of the cell cycle
- and more
Cytosines are methylated by an enzyme (C-methyltransferase), which targets CpG dinucleotides (Cytosine, phosphate, Guanine).
Methylation patterns are established during early development, and maintained over many generations by maintenance methyltransferases, which copy the methylation status to a newly synthesized strand (note: CpG is its own reverse complement).
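The remark that CpG is its own reverse complement can be checked directly. The sketch below uses a made-up example sequence and hypothetical helper names; it shows that every CpG on one strand sits opposite a CpG on the other strand, which is what allows the methylation status to be copied across.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def cpg_positions(seq):
    """Start positions of every CG dinucleotide in seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


top = "ATCGGACGT"  # illustrative sequence, not from the slides
bottom = reverse_complement(top)

print(reverse_complement("CG"))  # the dinucleotide maps onto itself
print(cpg_positions(top), cpg_positions(bottom))
# A CpG starting at position i on one strand starts at len(seq) - 2 - i
# on the other strand.
```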
1. Genome structure – Histone modifications
Histones have tails which can be modified in
various ways, and at several locations. Each
(combination of) modifications has a different
biological function (“Histone code”).
Histones are involved in many essential
biological processes including
- Gene regulation
- DNA repair
- Chromosome condensation / mitosis
“Until the early 1990s, histones were dismissed as
merely packing material for nuclear DNA”
(Wikipedia). Extreme conservation of histone
proteins (found all the way back to archaea)
suggests that they are involved in important
biological pathways.
Nature 447, 433-440(24 May 2007)
http://chemistry.gsu.edu/faculty/Zheng/
1. Genome structure – Sequence structure
What does the human genome sequence consist of?
Total size: 2,858,160,000 bp
- Protein-coding genes: About 20,500
- Protein-coding exons: About 220,000; cover 1.2% of genome
- Transposable elements: About 45% of genome
- Tandem repetitive sequence: Few %
- Heterochromatin: Few %
- Unknown: About half
Conserved: About 5%
Biologically functional: ? (>5%)
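A few derived quantities follow from these figures by simple arithmetic. The sketch below uses only the numbers quoted above; since those are rounded ("about"), the results are rough estimates as well.

```python
# Figures as quoted on the slide (all approximate).
genome_size_bp = 2_858_160_000
n_genes = 20_500
n_exons = 220_000
exon_fraction = 0.012   # protein-coding exons cover 1.2% of the genome
te_fraction = 0.45      # transposable elements cover about 45%

exonic_bp = genome_size_bp * exon_fraction
mean_exon_length_bp = exonic_bp / n_exons
exons_per_gene = n_exons / n_genes
te_bp = genome_size_bp * te_fraction

print(f"Protein-coding sequence: {exonic_bp / 1e6:.1f} Mb")
print(f"Mean exon length: {mean_exon_length_bp:.0f} bp")
print(f"Exons per gene: {exons_per_gene:.1f}")
print(f"TE-derived sequence: {te_bp / 1e9:.2f} Gb")
```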
1. Genome structure – GC content
GC content of mammalian genomes is variable, and shows long-distance structure.
Regions of “fairly homogeneous” GC content are called “isochores”. They cannot be defined exactly, but reflect the fact that the genome does show compositional discontinuities.
(Clay & Bernardi, Trends in Biotechn 2002, 20(6), p. 237.)
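Long-distance GC structure is usually made visible by computing GC content in windows along the sequence. The following is a minimal sketch with hypothetical function names; the window and step sizes are free parameters (real isochore analyses use windows of many kilobases), and the toy sequence is made up.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return float("nan")  # window contains only N's or gaps
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window, step):
    """GC fraction in consecutive windows, as (start, fraction) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence with an AT-rich half followed by a GC-rich half.
toy = "ATATATATAT" + "GCGCGCGCGC"
profile = windowed_gc(toy, window=10, step=10)
print(profile)
```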
1. Genome structure - Genes
Darwin (1809-1882) used the term “gemmule” to denote a microscopic unit of inheritance. A major problem in his day: why do traits not “blend out” by mixing?
Mendel (1822-1884) was the first to suggest the existence of factors conveying traits from parent to offspring, and the pattern of their inheritance (e.g., two copies per individual, one from each parent; segregation during gamete production; different traits segregate independently), solving the problem of blending.
1889: Hugo de Vries coined term “pangen”, later shortened to “gene”.
1910: Thomas Hunt Morgan: genes reside on specific chromosomes
1941: Specific genes code for specific proteins. “One gene one enzyme” hypothesis.
1977: Roberts and Sharp discover introns
2003: Genes often overlap; single genes have multiple products.
1. Genome structure - Genes
Eukaryotic protein-coding genes consist of:
- Upstream region (with regulatory signals)
- Promoter region, with transcription initiation site (e.g. TATA box)
- 5’ untranslated region (5’ UTR)
- Translation initiation site (includes start codon)
- Alternating sequence of exons (protein-coding) and introns
- Translation stop site (stop codon)
- 3’ UTR
- Polyadenylation (poly-A) signal
- Transcription stop site
[Figure: gene structure diagram, with labels Upstream region, Promoter, 3’ UTR]
1. Genome structure – Transposable elements
TEs are “selfish genes” which when activated can insert copies of themselves into the
genome. When this happens in the germline, these insertions are transmitted to the
next generation.
The vast majority of TEs can be classified into four families, based on the mechanism by which they copy themselves:
- LINEs (Long Interspersed Nuclear Elements, autonomous)
- SINEs (Short Interspersed Nuclear Elements, use LINE proteins for life cycle)
- LTR elements (Long Terminal Repeats; derived from retroviruses)
- DNA transposons (replicate without RNA intermediary)
1. Genome structure – Transposable elements
TEs were discovered by Barbara McClintock in the 1950s, in maize, where they are very active.
In human somatic cells, TE insertions can cause disease.
TEs are mostly neutral or deleterious. Most are not useful (for us), so there is no selection pressure to keep them (in the human population); nevertheless, many have remained just by chance. They are useful as a proxy for neutrally evolving sequence.
A small proportion of TE-derived sequence has in fact been recruited into useful biological roles, and is now highly conserved.
1. Genome structure – Transposable elements
The age of a TE can be determined (approximately) by counting the average number of substitutions from the consensus sequence, which is assumed to be the ancestral state.
A histogram of TEs versus age shows the activity over time. Alus have been very active, but recently things have quieted down in human.
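The dating idea can be sketched as follows. Two things here are assumptions that do not come from the slides: the Jukes-Cantor correction for multiple substitutions at the same site, and the neutral rate passed to `age_in_years` (the value in the example is purely illustrative). The function names and sequences are hypothetical. Because the copy diverges from the consensus along a single lineage, the distance is divided by the rate once, not twice.

```python
import math


def p_distance(copy, consensus):
    """Fraction of aligned, ungapped positions where the copy differs."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # indel columns are ignored in this sketch
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor(p):
    """Substitutions per site under the Jukes-Cantor model (needs p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def age_in_years(copy, consensus, subs_per_site_per_year):
    """Approximate age: corrected distance divided by an assumed neutral rate."""
    return jukes_cantor(p_distance(copy, consensus)) / subs_per_site_per_year


consensus = "ACGTACGTAC"   # made-up consensus
te_copy = "ACGTACGTAA"     # made-up copy with one substitution
rate = 2.2e-9              # illustrative assumption, per site per year
print(p_distance(te_copy, consensus))
print(f"{age_in_years(te_copy, consensus, rate):.3g} years")
```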
2. Genome evolution - introduction
In the course of a human lifetime, the genome is used, damaged, repaired, copied and handed down to offspring
cells dozens of times. In the process, the genome is changed. This change is called a mutation.
At first this involves just a single individual. If the change has no phenotypic consequence, no selection acts
against (or for) the mutation, and chance determines whether the individual’s offspring will carry the mutation,
and so on. The process by which the frequency of the mutation changes in the population is called (random)
genetic drift.
Of all neutral mutations in a population of 2N haploid genomes, a fraction 1/(2N) will eventually spread through the
entire population. The mutation is said to have gone to fixation. Mutations that have a beneficial effect have a
(much) larger probability of getting fixed (once they reach a non-negligible population frequency), while
deleterious mutations have almost no chance of going to fixation. Mutations that have become fixed in the
population are called substitutions.
(Note: “substitutions” usually refer to single nucleotide substitutions, but the term “indel substitution” is also used.)
When comparing genomes from different species, what you see are all the fixed mutations (substitutions) that have occurred since the two species split. Mutations that are not fixed but reside in either of the two individuals whose genomes were sequenced are called polymorphisms, and will also be included. Usually these form a small proportion and the distinction is ignored (but note that polymorphisms may well be deleterious, while substitutions rarely are).
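The 1/(2N) fixation probability of a new neutral mutation can be checked by simulation. The sketch below assumes a Wright-Fisher model (constant population size, non-overlapping generations, random sampling of parents), which is one standard way to model drift but is not named on the slide; the population size, number of trials, and function names are arbitrary illustrative choices.

```python
import random


def neutral_mutation_fixes(n_genomes, rng):
    """Follow one new neutral mutation until it is lost or fixed."""
    copies = 1  # a new mutation starts as a single copy among n_genomes
    while 0 < copies < n_genomes:
        freq = copies / n_genomes
        # Next generation: every genome picks its parent at random, so the
        # new count is a binomial draw with success probability freq.
        copies = sum(1 for _ in range(n_genomes) if rng.random() < freq)
    return copies == n_genomes


def estimate_fixation_probability(n_genomes, trials, seed=1):
    """Fraction of simulated new mutations that reach fixation."""
    rng = random.Random(seed)
    fixed = sum(neutral_mutation_fixes(n_genomes, rng) for _ in range(trials))
    return fixed / trials


n_genomes = 20  # this is 2N in the notation of the text
estimate = estimate_fixation_probability(n_genomes, trials=10_000)
print(f"simulated: {estimate:.4f}   expected 1/(2N): {1 / n_genomes:.4f}")
```

With a small 2N the simulation finishes quickly; for realistic population sizes the same logic holds, but most runs end with the mutation being lost within a few generations.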
2. Genome evolution – nucleotide substitutions
Basically two causes: damage, and copy errors during replication.
The two causes can be teased apart by comparing species with
different generation times. More generations per unit of time
mean more copying errors, while the rate of damage might stay
relatively constant.
Errors are recognized and repaired by specific and highly efficient
repair mechanisms.
The resulting error rate is low: about 3 × 10^-8 per nucleotide per generation in humans.
The repair mechanism is extremely important: damage to this
system increases the likelihood of getting cancer.
The rate of mutagenesis is higher in males than in females (see e.g.
Berlin et al., J Molec Evol 62(2) 226-233), probably due to more
cell divisions in the male germline. This results in low mutation
rates on the X, and high mutation rates on the Y chromosome.
In mammals, the rate of transitions (pyrimidine-to-pyrimidine or purine-to-purine) is about twice the rate of transversions (pyrimidine-to-purine or vice versa).
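The transition / transversion distinction is easy to state in code. This is a small sketch with hypothetical function names and a made-up list of observed substitutions. Note that each base has one possible transition and two possible transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a nucleotide substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_ratio(substitutions):
    """Ratio of transition count to transversion count."""
    kinds = [classify_substitution(ref, alt) for ref, alt in substitutions]
    ts = kinds.count("transition")
    tv = kinds.count("transversion")
    return ts / tv if tv else float("inf")


observed = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]  # made-up data
ratio = ts_tv_ratio(observed)
print(ratio)
```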
2. Genome evolution – CpG mutation rate
Methylation of Cytosine (mC) involves adding a methyl group (CH3) on
to the C5 carbon.
Accidental de-amination of the C4 carbon turns a mC into a normal
Thymine.
This results in a mismatch, but the “wrong” base cannot be identified,
since both are in the “alphabet”.
Result: substitution rate on CpG dinucleotides is about 15x higher than
for ordinary C’s or G’s.
(The same process on the reverse-strand mC causes a high mutation
rate on the “G”).
Over time, this causes CpGs to be about 4x underrepresented
compared to the expectation based on C and G frequencies.
For sequences that are not methylated (in the germline), this
mechanism does not apply, resulting in “high” (i.e. normal) levels
of CpG in so-called “CpG islands”. These are often promoters of
ubiquitously expressed genes.
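The under-representation of CpG is commonly quantified as an observed/expected ratio: the number of CG dinucleotides divided by the number expected from the C and G frequencies alone. A minimal sketch follows; the function name and test sequences are made up. A ratio near 1 means "normal" CpG levels, as in CpG islands, while the roughly 4x depletion mentioned above corresponds to a ratio around 0.25.

```python
def cpg_obs_exp(seq):
    """Observed / expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # no CpG is possible without both C and G
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    n_cpg = seq.count("CG")
    return n_cpg * len(seq) / (n_c * n_g)


print(cpg_obs_exp("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_obs_exp("CCCCGGGG"))  # same base composition, a single CpG
print(cpg_obs_exp("GGGGCCCC"))  # same base composition, no CpG at all
```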
2. Genome evolution – Transcription-coupled repair
• When RNA polymerase II encounters a mutated nucleotide, it stops. This triggers the transcription-coupled repair (TCR) pathway, which repairs the mutation.
• Failure of TCR leads to Cockayne syndrome, an extreme form of accelerated aging.
• TCR is strand-asymmetric (mutations in the untranscribed strand are not corrected by TCR), and leads to asymmetric mutation rates in transcribed regions.
2. Genome evolution - Indels
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
(The three gapped regions above are indels.)
When the ancestral sequence is not known, insertions and deletions cannot be
distinguished, and are often referred to as “indels”.
Indels form an important source of sequence change – more on this later.
Most small indels are in fact deletions (deletions outnumber insertions by about a factor of 3 in human).
Indels can have any size, up to several Mb. The majority are 1 nt indels.
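The indels in the two aligned sequences shown at the top of this slide can be located by looking for runs of gap characters. A short sketch, with a hypothetical helper name:

```python
import re

# The two rows of the pairwise alignment shown on the slide.
row_a = "CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA"
row_b = "CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA"


def gap_runs(row):
    """(start column, length) of every run of '-' in one alignment row."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", row)]


assert len(row_a) == len(row_b)  # rows of an alignment have equal length

# A gap in one row means the other row has extra sequence there; without the
# ancestral sequence we cannot tell whether that is an insertion or a deletion.
print("gaps in first row: ", gap_runs(row_a))
print("gaps in second row:", gap_runs(row_b))
```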
2. Genome evolution – Indel mechanisms
During replication, the template and copy can become separated.
If this happens in a tandem-repetitive region, there is a possibility of incorrect re-pairing (slippage).
This can lead to both short insertions and deletions.
Long stretches of short-period tandem repeats (microsatellites) are particularly prone to slippage. This is the reason behind the fast evolution of microsatellite length.
The gene encoding huntingtin contains a repeat region of CAG triplets. Expansion of the number of CAG units beyond 36 causes Huntington’s disease.
http://www.sci.sdsu.edu/~smaloy/
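Microsatellite length can be measured by finding the longest uninterrupted run of the repeat unit. The sketch below is illustrative only: the function names and sequences are made up, and the threshold of 36 CAG units is simply the figure quoted above, not a clinical rule.

```python
import re


def longest_repeat_run(seq, unit="CAG"):
    """Number of units in the longest uninterrupted tandem run of `unit`."""
    pattern = f"(?:{re.escape(unit)})+"
    runs = re.findall(pattern, seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def exceeds_threshold(seq, unit="CAG", threshold=36):
    """True if the longest run has more than `threshold` units."""
    return longest_repeat_run(seq, unit) > threshold


short_allele = "ATG" + "CAG" * 20 + "CCGCCA"   # made-up sequences
long_allele = "ATG" + "CAG" * 40 + "CCGCCA"

print(longest_repeat_run(short_allele), exceeds_threshold(short_allele))
print(longest_repeat_run(long_allele), exceeds_threshold(long_allele))
```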
2. Genome evolution – indel mechanisms
Recombination between direct repeats in a single
chromosome leads to a (potentially Mb size)
deletion.
Recombination requires (near) sequence identity over a fairly large region (100s of nt?), so these deletions are mostly not very small.
Unequal recombination (involving similar or
identical regions at different chromosomal
locations) can also lead to insertions
(segmental duplications).
In the picture, unequal recombination between
sister chromatids at replication is shown.
The same process may also happen between
parental (homologous) chromosomes.
http://www.sci.sdsu.edu/~smaloy/
2. Genome evolution – Recombination
Mechanism of recombination:
1. Double-stranded break (DSB) formation
2. Broken ends get digested
3. Single strands invade region with
high sequence similarity
4. Repair and re-synthesis → Holliday junction
5. Holliday junction resolution:
Crossing over (black arrows), or…
… NO crossing over (grey arrows)
Gabriel Marais, Trends Genet 19(6), 2003
2. Genome evolution - types of recombination

• Double-stranded breaks appear:
– Accidentally (somatic & germ cells): “repair” recombination
– Deliberately (germ cells at meiosis): “sexual” (or “meiotic”) recombination
• Different (but overlapping) pathways
• Preference for:
– sister chromatid in repair recombination
– parental chromosome in sexual recombination
• Recombination is obligatory during meiosis.
• Rate of recombination is >1 per generation per chromosome.
2. Genome evolution – Gene conversion

• Gene conversion = copying of one stretch of DNA into another
• Single-stranded DNA can invade sister chromatid
– Identical DNA, so no mutations
• If single strand invades parental chromosome:
– Without crossing over: gene conversion
– With crossing over: gene conversion + recombination
• When the nicked strand invades a non-homologous but sequence-similar region (as in unequal recombination), gene conversion causes “sideways copying” of genetic material. This causes similarities to increase / persist.
• The effect of gene conversion (without recombination) on the genome sequence is equivalent to two recombination events happening close to each other (order 1 kb).
2. Genome evolution – Biased Gene Conversion
Mutation bias for GC as a side effect of gene conversion
• Two repair mechanisms:
– Base Excision Repair (BER): targets “hetero-mismatches” (AG, TG, AC, TC); efficient, replaces just one base; favours GC
– Nucleotide Excision Repair (NER): targets AA, CC, TT, GG mismatches; digests ~1 kb and resynthesizes; favours the unbroken strand, no nucleotide bias
• Second source of biased gene conversion:
– AT sites seem to be a target for DSBs in sexual recombination
– DSB strand gets digested, copied back from other allele
– Result: bias towards GC
2. Genome evolution - Recombination hotspots
• Rate of recombination is measured in centiMorgans (cM). Two genetic loci are 1 cM apart if, on average, 1 recombination per 100 generations occurs between them.
• Recombination rate not uniform:
– Background rate ~0.04 cM/Mb
– Average rate ~1 cM/Mb
– 0.5% of genome >15 cM/Mb
• Recombination hotspot = gene conversion hotspot
• Cause of hotspots not known:
– CCTCCCT motif?
– Bias for high GC
• One mutation can change hotspot activity
– “DNA2” locus in MHC region, CT suppresses hotspot
• Perhaps differences in recombination rates have, over time, caused the current isochore structure through biased gene conversion.
Myers, Bottolo, Freeman, McVean, Donnelly, Science 310, Oct 2005
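The centiMorgan definition and the quoted rates can be combined in a small calculation. The sketch assumes a constant rate across the region considered and, for the short distances used here, treats the expected number of crossovers per generation as the recombination probability; the region lengths and function names are illustrative choices, not taken from the slide.

```python
def genetic_distance_cm(length_bp, rate_cm_per_mb):
    """Genetic length of a region, assuming a constant rate across it."""
    return length_bp / 1e6 * rate_cm_per_mb


def crossovers_per_generation(distance_cm):
    """1 cM corresponds to 1 recombination per 100 generations."""
    return distance_cm / 100.0


# Rates quoted on the slide, in cM/Mb.
rates = {"background": 0.04, "average": 1.0, "hotspot": 15.0}

# Expected crossovers per generation in a 1 Mb region at each rate.
for label, rate in rates.items():
    cm = genetic_distance_cm(1_000_000, rate)
    print(f"{label}: {cm:.2f} cM, {crossovers_per_generation(cm):.4f} per generation")

# A 2 kb stretch at the hotspot rate versus 2 kb at the background rate.
hotspot_cm = genetic_distance_cm(2_000, rates["hotspot"])
background_cm = genetic_distance_cm(2_000, rates["background"])
print(f"hotspot / background over 2 kb: {hotspot_cm / background_cm:.0f}x")
```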
2. Genome evolution – double stranded break repair
Accidental breaks are also repaired through the non-homologous end joining (NHEJ) pathway, which does not require homologous sequence.
Evolutionarily very old pathway: yeast and some bacterial species have NHEJ.
It repairs most breaks correctly, but is also able to induce translocations (chromosome rearrangements).
Gill and Fast BMC Molecular Biology 2007 8:24 doi:10.1186/1471-2199-8-24
2. Genome evolution – chromosomal rearrangements
Mouse chromosomes (1-19 and X) coloured according to homology with human chromosomes (1-22 and X). In the roughly 2 × 80 million years that separate humans and mice, many chromosomal rearrangements have occurred.