Evolution – Phylogenetic trees
• Our goal would be to construct relationships such as those depicted below,
also known as phylogenetic trees.
• As expected we would need a model to explain the variability, preferably a
mathematical model.
• Equipped with a mathematical model of molecular evolution we will be able
to define the notion of Phylogenetic distance, and then try to go back and
address the original question, but there is a big detour before…
What is DNA?
The genetic code
DNA is a long polymer made from repeating units called nucleotides. These
nucleotides are adenine (A), guanine (G), cytosine (C) and thymine (T).
In living organisms, DNA does not usually exist as a single molecule, but
instead as a pair of molecules that are held tightly together.
A, G = purines
C, T = pyrimidines
(Chemically similar)
Rungs of the ladder: A – T, G – C
What is DNA?
Local patriotism
King’s College’s scientists played a leading role in the discovery of
DNA structure in the 1950s
Maurice F. Wilkins
1916 - 2004
(Nobel Prize 1962)
Rosalind Franklin
1920 - 1958
What is DNA?
The genetic code
From an informational or computational point of view, DNA is a pair of
complementary sequences (where knowing one fully recovers the other)
Or simply:
GAGTCTGGCAACAACTGTTGATA
CTCAGACCGTTGTTGACAACTAT
Or even just one strand (you can work out the other one yourself; there are good reasons for this):
AGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAA …
Directionality is important
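The "knowing one recovers the other" idea can be sketched in a few lines of Python. This is only an illustration: the names PAIR, complement and reverse_complement are made up for this sketch, and it assumes a strand contains nothing but the letters A, C, G, T.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Partner strand, written base by base in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand written in the opposite direction."""
    return complement(strand)[::-1]


top = "GAGTCTGGCAACAACTGTTGATA"
print(complement(top))           # CTCAGACCGTTGTTGACAACTAT
print(reverse_complement(top))   # the same partner, read right to left
```

Applying complement twice returns the original strand, which is why storing a single strand loses no information. The reversed form is included because the two strands run in opposite directions.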
What is DNA? - Genes
Some sections of the DNA (and BY NO MEANS all) encode instructions for the
manufacturing of proteins. These sections are called Genes.
About 97% of DNA does not encode proteins and is often called “junk DNA”.
In genes, triplets of consecutive bases form codons, e.g. AAC, AGC, …
Each codon specifies an amino-acid to be placed in a protein chain,
i.e. amino acids are the building blocks of proteins
What is DNA?
Amino acids
There is some redundancy – let’s count:
There are 20 amino acids (+ one “stop” command), and in principle 4^3 = 64 codons.
More than one codon codes for the same amino acid!
(in many cases, changing the 3rd base in a codon does not change the amino acid)
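The counting argument can be checked by enumerating every triplet. A minimal sketch: the variable names are illustrative, and the figure of 21 meanings is simply the 20 amino acids plus one "stop" command counted above.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of bases is a possible codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_codons = len(codons)   # 4 ** 3
n_meanings = 20 + 1      # 20 amino acids plus one "stop" command

print(n_codons, n_meanings)   # 64 21
print(codons[:4])             # ['AAA', 'AAC', 'AAG', 'AAT']
```

With 64 codons and only 21 meanings, at least one meaning must be shared by several codons, which is the redundancy noted above.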
DNA replication
There are many proofreading mechanisms
Nevertheless, mistakes occur; these are called
mutations:
1. Base substitution, e.g. G -> C
Either a transition – T <-> C, A <-> G
Or a transversion – T,C <-> A,G
P (transition) > P (transversion)
2. Deletion – rarer
3. Insertion – rarer
DNA replication. The double helix is unwound and each strand acts as
a template. Bases are matched to synthesize the new partner strands.
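The two kinds of base substitution listed above can be told apart by checking whether both bases belong to the same chemical class. A small sketch, with a hypothetical helper name substitution_type:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(old: str, new: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    if old == new:
        return "none"
    pair = {old, new}
    # A transition stays within one class (purine <-> purine or
    # pyrimidine <-> pyrimidine); a transversion crosses classes.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(substitution_type("T", "C"))   # transition
print(substitution_type("G", "C"))   # transversion

# Count how many of the 12 possible substitutions fall in each class.
all_changes = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
n_transitions = sum(substitution_type(a, b) == "transition" for a, b in all_changes)
print(n_transitions, len(all_changes) - n_transitions)   # 4 8
```

Only 4 of the 12 possible substitutions are transitions, so the statement P(transition) > P(transversion) is an observation about mutation behaviour, not a consequence of counting.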
Mutations – Problems
A major problem is deducing the amount of mutation that has occurred
during evolution.
Have a look at the following DNA sequences, coming from 3 successive generations:
S1 : ACCTGCGCTA …
S2 : ACGTGCACTA …
S3 : ACGTGCGCTA …
What is the mutation rate?
Comparing S1 and S2 suggests 2/10 per generation.
But comparing S1 and S3 suggests 1/10 per two generations, which is much lower.
Actually, we had 3 mutations during these two generations
(C -> G at site 3, G -> A at site 7, then A -> G back again at site 7).
Therefore, there is a risk of underestimating mutation rates due to
hidden mutations or back mutations.
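The undercount can be reproduced by counting differing sites between each pair of sequences. This sketch uses only the ten bases shown for S1, S2 and S3; the helper name differences is illustrative.

```python
def differences(a: str, b: str) -> int:
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


S1 = "ACCTGCGCTA"
S2 = "ACGTGCACTA"
S3 = "ACGTGCGCTA"

step_1 = differences(S1, S2)     # mutations seen from generation 1 to 2
step_2 = differences(S2, S3)     # mutations seen from generation 2 to 3
end_to_end = differences(S1, S3)

print(step_1, step_2, end_to_end)   # 2 1 1
```

Step by step, 2 + 1 = 3 changes are visible, but the end-to-end comparison sees only 1, because the change at site 7 was reversed in the second generation.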
Probabilistic DNA – a reminder
Given the following sequence of 40 bases, what would the next
one be?
AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG
Assuming that this is a random sample, and that there is no systematic bias
or dependence between successive bases,
the most probable next base would be T, the most frequent base in the sample (14 of the 40 sites).
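Under the independence assumption stated above, the best guess for the next base is the most frequent base in the sample. A short sketch using the standard-library Counter (the variable names are illustrative):

```python
from collections import Counter

seq = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"

counts = Counter(seq)
freqs = {base: counts[base] / len(seq) for base in "ACGT"}

print(len(seq))                # 40
print(counts.most_common(1))   # [('T', 14)]
print(freqs["T"])              # 0.35
```

The estimated probabilities are just the relative frequencies, so they sum to 1 over the four bases.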
Probabilistic DNA – example 1
S0 : AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG
S1 : AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG
Estimate the mutation rate C -> T
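One way to approach this estimate, assuming the two sequences are aligned site by site and that "rate" means the fraction of C sites in S0 that show T in S1 (an assumption about what the slide intends), is sketched below:

```python
S0 = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"
S1 = "AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG"

# Sites where the ancestral sequence S0 has a C.
c_sites = [k for k, base in enumerate(S0) if base == "C"]

# Of those, the sites where the descendant S1 shows a T.
c_to_t = sum(1 for k in c_sites if S1[k] == "T")

rate_given_c = c_to_t / len(c_sites)   # estimate of P(S1 = T | S0 = C)
rate_per_site = c_to_t / len(S0)       # estimate of P(S1 = T and S0 = C)

print(c_to_t, len(c_sites))   # 2 11
print(round(rate_given_c, 3), rate_per_site)   # 0.182 0.05
```

The two numbers answer different questions: 2/11 conditions on the site having been a C, while 2/40 is the frequency of the C -> T pattern over all sites.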
Probabilistic DNA – example 1
S0 : ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT
S1 : ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC
Frequencies of S1 = i and S0 = j in the 40-site sequence comparison
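A frequency table of this kind can be tallied directly from the two sequences. The sketch below assumes rows are indexed by the base in S1 and columns by the base in S0, both in the order A, G, C, T; that layout is a choice made for illustration.

```python
from collections import Counter

BASES = "AGCT"
S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"

# Key (i, j) counts the sites with S1 = i and S0 = j.
pair_counts = Counter(zip(S1, S0))

# Rows: S1 = i; columns: S0 = j.
table = [[pair_counts[(i, j)] for j in BASES] for i in BASES]

print(" ", *BASES)
for i, row in zip(BASES, table):
    print(i, *row)
```

The diagonal entries count unchanged sites and the off-diagonal entries count substitutions, so the whole table sums to 40.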
Estimates of conditional probabilities P(S1 = i | S0 = j)
Try this at home!
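For checking a hand calculation, here is a sketch that estimates P(S1 = i | S0 = j) by dividing each joint count by the number of sites where S0 = j. The dictionary layout and names are illustrative, not taken from the slides.

```python
from collections import Counter

BASES = "AGCT"
S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"

pair_counts = Counter(zip(S1, S0))   # key (i, j): S1 = i and S0 = j
s0_counts = Counter(S0)              # number of sites with S0 = j

# cond[(i, j)] estimates P(S1 = i | S0 = j).
cond = {
    (i, j): pair_counts[(i, j)] / s0_counts[j]
    for i in BASES
    for j in BASES
}

for i in BASES:
    print(i, *(f"{cond[(i, j)]:.3f}" for j in BASES))
```

Each column (fixed j) sums to 1, since every site with S0 = j must show some base in S1.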