Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Evolutionary Genetics: Part 1 Polymorphism in DNA S. chilense S. peruvianum Winter Semester 2012-2013 Prof Aurélien Tellier FG Populationsgenetik Color code Color code: Red = Important result or definition Purple: exercise to do Green: some bits of maths Population genetics Evolution = changes between generations of frequency of characters, traits or alleles Why study the genetics of populations? the population is the main unit at which selection acts !!!!!!!!! Useful definitions DNA has 4 bases Adenine, Guanine, Cytosine, Thymine Haploid= organisms with one set of non-paired chromosomes Diploids, Polyploids Tetraploid for Maize, Up to Hexaploid for Wheat Decaploid for strawberries Pentapoid to Duodecaploid for sugarcane Chromosomal location of a gene is a locus Several alleles can be observed at a locus (one from mother and one from father) The complete set of alleles in a species or population = gene pool The occurrence of one allele in proportion to total in gene pool = allele frequency Population genetics Definition: population genetics is the study of the frequencies of alleles in populations as well as their temporal and spatial changes Populations and species show variability: what type and how much genetic variation exist within populations/species? what are the forces that influence the amount of variation within populations? First question: what is the variability at the genetic (DNA) level? Polymorphism in DNA: Sanger sequencing Polymorphism in DNA: Illumina NGS sequencing www.seqanswers.com Polymorphism in DNA: Illumina NGS sequencing Polymorphism in DNA: Illumina NGS sequencing Polymorphism in DNA: Illumina NGS sequencing Polymorphism in DNA: Illumina NGS sequencing www.seqanswers.com Polymorphism in DNA: Illumina NGS sequencing Polymorphism in DNA: new new generation… Polymorphism in DNA: how it looks like Some definitions Coding / non-coding DNA Some definitions For the coding DNA: start codon, stop codon Some definitions: point mutations Point mutation = change in the DNA base (e.g. T becomes G) Insertions – deletions = removal or insertion of bases exercise 1.1 and 1.2 use DNASP and the file 055twolines.fas Some definitions: point mutations Consequences on the protein: Synonymous mutations: do not change the codon and the Amino Acid Non-synonymous mutations: change the Amino Acid Silent sites = non-coding regions + synonymous sites - Frameshift mutation: change reading frame (due to indels) - Nonsense mutation: stop codon is introduced Positions in the codon: -A position is fourfold degenerate if any nucleotide change specifies the same AA (only 3rd position of a codon) ex: Glycine codons -Twofold degenerate if two our of 4 changes specifies the same AA (ex 3rd position of Glutamic Acid) -Threefold degenerate? 3rd position Isoleucine -Non-degenerate: any change specifies a different AA Practical exercise Using database to find info on sequences Exercise: Download the file: data-plants.fas 1) Open it with DNASP. What do you see? 2) Go to www.ncbi.nih.gov to the BLAST tools 3) Look at the options, these are plant sequences. Can you retrieve where the sequences are from? 4) You will be directed to the results of the BLAST: lets look at them by moving the cursor on the lines. Scores for aligments. 5) What are the best hits? Then you will be directed to the GenBank directory of sequences. What information are there for these sequences? Practical exercise Practical exercise Practical exercise Then we will insert information from the database into the sequences in DNASP. Place the coding region using the: data -> assign coding regions Then you see changes in the way DNASP shows the sequences (see example) Open also in Mesquite Can you find out how many changes are silent, synonymous or non-synonymous? How many SNPs are there? (Single Nucleotide Polymorphisms = mutational changes) Different types of data Patterns of diversity can be observed in populations, species or among species. Phylogenetic trees fall in the between species comparison class Type of data: DNA sequences (SNPs), proteines sequences, microsatellites Microsatellites are short stretches of repeated DNA: TATATATATATATA What matters is the number of motif repeats One can look at their size using electrophoresis gel, but they contain less information than SNP data. Their mutation rate is also higher due to the ripping of the Polymerase on them. Questions to ask when looking at data -Are the sequences already aligned? -Are the data from one population or more than one? Or different species? -Are the data from sexually or asexually reproducing organisms? -Are the sequences from coding or non-coding, or both DNA? -Are the data from one locus or several loci? -Do we see all sites or only the variable ones (SNP, indels or both)? -Do we see all sequences or only the different ones? -Are the data from microsatellites or SNPs? => Go to Exercise 1.3 for also the Solanum data Population genetics: 4 evolutionary forces random genomic processes (mutation, duplication, recombination, gene conversion) molecular diversity natural selection random spatial process (migration) random demographic process (drift) Population genetics investigates the laws governing the genetic structure of populations, and changes in allele frequencies over time Population genetics: 4 evolutionary forces random genomic processes (mutation, duplication, recombination, gene conversion) molecular diversity natural selection random spatial process (migration) phenotypic variability random demographic process (drift) We want to infer the role of the evolutionary forces from sequence data (very useful tool is the coalescent theory) Divergence and mutation rate Molecular clock when DNA is passed from one generation to the next there is a constant probability called µ that a mutation occurs because the polymerase is not error free we assume the rate is constant (though over long periods of time this may not be true) Probability of mutation of what? at a site (per site mutation rate) at an entire locus (locus mutation rate) genome wide mutation rate µ is the probability per generation of a mutation, and (1- µ) is the probability that no mutation occurs Molecular clock thus P[no mutation for t generations] = (1- µ)t as long as µ << t we can use an approximation: 1+x≈ex and 1 - x ≈ e -x result: P[no mutation for t generations] = e –µt graph of the exponential? Molecular clock Maths 1 : Probability, Expectation and Variance, exponential distribution Molecular clock T is the time to the next mutation P[no mutation for t generations] = P[T ≥ t ] = e –µt has E[T ] = 1 µ and Var[T ] = 1 µ2 Separated by at least 2t generations, and split between Drosophila populations = 100 000 generations (10 000 years) Molecular clock D is the number of polymorphic sites between the sequences, also called the divergence K mutations have appeared along the branches of the two descendants populations, which have length of 2t Molecular clock Maths 2: estimator assuming K = D thus E[D] = E[K] = 2µt so we obtain a first estimator for µ: D µˆ = 2t So we can calculate the mutation rate if we know the time of divergence and the divergence from molecular data!!!! Exercise 1.4: calculate the mutation rate in Drosophila