Evolution – Phylogenetic trees
• Our goal is to construct relationships such as those depicted below, known as phylogenetic trees.
• As expected, we need a model to explain the variability, preferably a mathematical model.
• Equipped with a mathematical model of molecular evolution, we will be able to define the notion of phylogenetic distance and then go back to the original question. But there is a big detour first…

What is DNA? The genetic code
• DNA is a long polymer made from repeating units called nucleotides. The four bases are adenine (A), guanine (G), cytosine (C) and thymine (T).
• In living organisms, DNA does not usually exist as a single molecule, but as a pair of molecules held tightly together.
• A, G = purines; C, T = pyrimidines (the bases within each group are chemically similar).
• Rungs of the ladder: A – T and G – C.

What is DNA? Local patriotism
• Scientists at King's College played a leading role in the discovery of the structure of DNA in the 1950s:
  Maurice F. Wilkins, 1916 - 2004 (Nobel Prize 1962)
  Rosalind Franklin, 1920 - 1958

What is DNA? The genetic code
• From an information (computational) point of view, DNA is a pair of complementary sequences: knowing one fully recovers the other. Or simply:

  GAGTCTGGCAACAACTGTTGATA
  CTCAGACCGTTGTTGACAACTAT

• Or even just one strand (you can work out the other one yourself, and there are good reasons for doing this):

  AGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAA …

• Directionality is important.

What is DNA? Genes
• Some sections of the DNA (and BY NO MEANS all) encode instructions for manufacturing proteins. These sections are called genes.
• 97% of DNA does not code for proteins and is called "junk DNA".
• In genes, triplets of consecutive bases form codons, e.g. AAC, AGC, …
• Each codon specifies an amino acid to be placed in a protein chain; amino acids are the building blocks of proteins.

What is DNA? Amino acids
• There is some redundancy. Let's count: there are 20 amino acids (plus one "stop" command), and in principle 4^3 = 64 codons.
• More than one codon codes for the same amino acid! (In many cases, changing the third base in a codon does not change the amino acid.)

DNA replication
• There are many proofreading mechanisms. Nevertheless, mistakes occur, the so-called mutations:
  1. Base substitution, e.g. G -> C. This is either a transition (T <-> C, A <-> G) or a transversion (T,C <-> A,G), with P(transition) > P(transversion).
  2. Deletion (rarer)
  3. Insertion (rarer)
• [Figure: DNA replication. The double helix is unwound and each strand acts as a template. Bases are matched to synthesize the new partner strands.]

Mutations – Problems
• A major problem is to deduce how much mutation occurred during evolution. Have a look at the following DNA sequences from three successive generations:

  S1 : ACCTGCGCTA …
  S2 : ACGTGCACTA …
  S3 : ACGTGCGCTA …

• What is the mutation rate? Comparing S1 and S2 suggests 2/10 per generation. But comparing S1 and S3 suggests 1/10 per two generations, which is much lower.
• Actually, there were 3 mutations during these two generations: two between S1 and S2 (sites 3 and 7) and one between S2 and S3 (site 7 mutating back from A to G).
• Therefore, there is a risk of underestimating mutation rates because of hidden mutations or back mutations.

Probabilistic DNA – a reminder
• Given the following sequence of 40 bases, what would be the next one?

  AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG

• Assuming that this is a random sample, with no systematic bias and no dependence between successive bases, the most probable next base would be T (T is the most frequent base in the sample: 14 of the 40 sites).
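The two counting arguments above (differences between generations, and base frequencies in a sample) can be checked with a short Python sketch. The function names `count_differences` and `base_frequencies` are illustrative choices, not part of the slides, and the frequency estimate relies on the same assumption stated above: sites are independent and the sample is unbiased.

```python
from collections import Counter


def count_differences(a, b):
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def base_frequencies(seq):
    """Relative frequency of each base, used as a naive probability estimate."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "AGCT"}


# Three successive generations from the "Mutations - Problems" slide.
s1, s2, s3 = "ACCTGCGCTA", "ACGTGCACTA", "ACGTGCGCTA"

# Differences visible when only the first and last generation are compared.
observed = count_differences(s1, s3)

# Mutations counted generation by generation (includes the back mutation).
actual = count_differences(s1, s2) + count_differences(s2, s3)

# The 40-base sample from the "Probabilistic DNA" reminder.
sample = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"
freqs = base_frequencies(sample)
most_probable = max(freqs, key=freqs.get)

print("observed S1 vs S3:", observed)
print("step-by-step total:", actual)
print("base frequencies:", freqs)
print("most probable next base:", most_probable)
```

Comparing `observed` with `actual` shows the underestimate caused by the back mutation at site 7.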
Probabilistic DNA – example 1

  S0 : AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG
  S1 : AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG

• Estimate the mutation rate C -> T.

Probabilistic DNA – example 1

  S0 : ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT
  S1 : ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC

• [Table: Frequencies of S1 = i and S0 = j in 40-site sequence comparison]
• [Table: Estimates of conditional probabilities P(S1 = i | S0 = j)]
• Try this at home!
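One way to try this at home is the Python sketch below. It tabulates how often S1 = i occurs at a site where S0 = j, then divides each count by the number of sites with S0 = j to estimate P(S1 = i | S0 = j). The helper names (`pair_counts`, `conditional_probabilities`, `print_table`) are hypothetical, and the estimate assumes, as in the reminder slide, that sites are independent and that only substitutions occurred (no insertions or deletions), so the sequences align site by site.

```python
from collections import Counter

BASES = "AGCT"


def pair_counts(s0, s1):
    """Count sites with S1 = i and S0 = j, keyed as (i, j)."""
    if len(s0) != len(s1):
        raise ValueError("sequences must have the same length")
    return Counter(zip(s1, s0))


def conditional_probabilities(s0, s1):
    """Estimate P(S1 = i | S0 = j) as count(i, j) / count(S0 = j)."""
    counts = pair_counts(s0, s1)
    totals = Counter(s0)
    return {(i, j): counts[(i, j)] / totals[j]
            for i in BASES for j in BASES if totals[j] > 0}


def print_table(table, fmt):
    """Print a 4x4 table with rows S1 = i and columns S0 = j."""
    print("S1\\S0 " + " ".join(f"{j:>6}" for j in BASES))
    for i in BASES:
        print(f"{i:<6}" + " ".join(format(table[(i, j)], fmt) for j in BASES))


# First pair: estimate the mutation rate C -> T.
S0_A = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"
S1_A = "AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG"
rate_c_to_t = conditional_probabilities(S0_A, S1_A)[("T", "C")]
print("estimated P(S1 = T | S0 = C):", round(rate_c_to_t, 3))

# Second pair: full frequency table and conditional probabilities.
S0_B = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1_B = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"
counts_b = pair_counts(S0_B, S1_B)
probs_b = conditional_probabilities(S0_B, S1_B)

print_table(counts_b, ">6d")
print_table(probs_b, ">6.3f")
```

Each column of the probability table sums to 1, because every site with S0 = j ends up as exactly one of the four bases in S1. For the first pair, 2 of the 11 sites with C in S0 show T in S1, which gives the C -> T estimate.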