Genome Evolution. Amos Tanay 2012

Genome evolution, Lecture 5: Selection vs. mutation/recombination. Species

Mutation-Selection balance

When an allele is weakly deleterious, mutation can play a major role in driving allele frequencies.

Genotype          AA       Aa         aa
Fitness           $1$      $1-hs$     $1-s$
Frequency (HW)    $p^2$    $2pq$      $q^2$

New allele frequency, without mutation:

$$p' = \frac{p^2 w_{11} + pq\,w_{12}}{p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}}$$

New allele frequency, assuming mutation $A \to a$ at rate $\mu$ (back mutation is ignored, since $q \ll 1$):

$$p' = (1-\mu)\,\frac{p^2 + pq(1-hs)}{p^2 + 2pq(1-hs) + q^2(1-s)}$$

What is the equilibrium frequency of the deleterious allele? At equilibrium the loss of $a$ alleles to selection balances the gain from mutation ($s q^2 \approx \mu$ when $h = 0$, $hs\,q \approx \mu$ when $h > 0$), so:

$$h = 0:\ \hat{q} \approx \sqrt{\mu/s} \qquad\qquad h > 0:\ \hat{q} \approx \frac{\mu}{hs}$$

Mutation-Selection balance: Huntington disease

A neurological genetic disease appearing after age 35, resulting from a dominant mutation. How does this disease survive in the human population?
Although it may be fatal, the fitness is not very low because of the late age of onset (estimated $w_{12} = 0.81$).
Human population: 70 per million (Europe) to 1 per million (Africa).
Since $h > 0$, we can estimate the mutation rate at the Huntington locus as $\mu \approx hs\,\hat{q}$: from $10^{-6} \times (1-0.81) = 1.9 \times 10^{-7}$ to $70 \times 10^{-6} \times (1-0.81) = 1.3 \times 10^{-5}$.

Mutation-Selection balance: Haldane-Muller

The average fitness of the population, given recurrent mutation at rate $\mu$ at a locus with selection coefficient $s$ against the mutant allele.

Assume perfect recessivity ($h = 0$, $\hat{q} \approx \sqrt{\mu/s}$):

$$\bar{w} = 1 - \hat{q}^2 s = 1 - \frac{\mu}{s}\,s = 1 - \mu$$

Assuming partial dominance ($h > 0$, $\hat{q} \approx \mu/(hs)$):

$$\bar{w} = 1 - 2\hat{p}\hat{q}\,hs - \hat{q}^2 s = 1 - 2\left(1 - \frac{\mu}{hs}\right)\frac{\mu}{hs}\,hs - \left(\frac{\mu}{hs}\right)^2 s \approx 1 - 2\mu$$

The Haldane-Muller principle: the effect of mutation on the average population fitness depends only on the mutation rate, not on the fitness of the alleles!

Overdominance

A SNP in the beta-globin gene makes the encoded protein defective.
The resulting red blood cells are curved and elongated, and are removed from the circulation.
Individuals homozygous for the mutation will usually die from anemia without intensive care.
Heterozygous individuals have mild anemia, but cope better with the malaria parasite Plasmodium falciparum (maybe because infected red cells become sickled).
[Figure: (historical) malaria distribution; sickle-cell anemia (wiki)]

Other types of selection

Different fitness for different individuals, e.g., male vs. female. For example, male genes that take up female resources in mammals. This was suggested to lead to the phenomenon of imprinting, in which cells express only the maternal or only the paternal allele. Imprinted genes are much like haploids.

Other types of selection (continued)

• Frequency- and density-dependent selection: when the fitness depends on the frequency of the allele or on the population size.
• Fecundity selection: different reproductive potential for mating pairs.
• Effects of a heterogeneous environment.
• Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing the reproductive potential of your homologous chromosome).
• Sexual selection: males advertising their reproductive potential, or confronting other males.
• Kin selection ("origin of altruism").

Recombination and selection

Linkage and selection

Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection!
[Figure: linked beneficial and weakly deleterious alleles. Labels: selective sweep, or hitchhiking effect, or genetic draft (Gillespie); Hill-Robertson effect]

Linkage and selection (continued)

The variance in allele frequency is used to define the effective population size:

$$V(p) = \frac{p(1-p)}{2N_e}$$

Simplistically, assume a neutral locus is evolving while selective sweeps affect a fully linked locus at rate $\rho$. A sweep will fix the neutral allele with probability $p$, and we further assume that the sweep happens instantly:

$$V(p) = p(1-p)\left[\frac{1-\rho}{2N_e} + \rho\right] = \frac{p(1-p)}{2N_l}, \qquad N_l \approx \frac{N_e}{1 + 2N_e\rho}$$

This is very rough, but it demonstrates the basic intuition: sweeps reduce the effectiveness of selection in a way that can be quantified through a reduction in the effective population size. More generally:

$$N_l = \frac{N_e}{1 + 2N_e\rho\,C}$$

where $C$ is the average frequency of the neutral allele after the sweep.

Cost of sex

Wasting half of your genes on non-reproductive individuals.
Selective advantage of an asexual gene = 2-fold!
Still, sex is prevalent among complex species.
It even persists when both asexual and sexual reproduction are available, as in S. cerevisiae:
• Mating locus MAT, types a and alpha
• Haploids grow quickly when all is well
• Mating occurs when times are rough
• Meiosis takes the diploid back to haploids…

Benefits of sexual reproduction

Fighting genetic draft: clearing deleterious mutations
• Can this add up to a factor of 2?
• (Alexey Kondrashov's theory: epistasis of deleterious alleles makes sex beneficial)
Buffering variation
DNA repair through recombination (even in somatic tissues)
Fighting mutation interference: more effective/rapid adaptation
• The Red Queen hypothesis

Morran et al., Running with the Red Queen: Host-Parasite Coevolution Selects for Biparental Sex. Science 8 July 2011: vol. 333 no. 6039, 216-218

What is a species?

• Multiple definitions..
• Free flow of genetic information within a population
• Weak (or zero) flow of information across species barriers
[Figure: Strain 1, Strain 2; Species 1, Species 2]

We change the Wright-Fisher or Moran model by removing the assumption of random mixing. Instead, we can assume that subpopulations are more likely to mate among themselves. Different models are possible; all end up increasing the genetic distance between subpopulations.
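One way to see how dropping random mixing increases the distance between subpopulations is a small simulation. The sketch below is only an illustration under stated assumptions, not the model used in the lecture: two demes of equal size, a neutral biallelic locus, and a hypothetical migration parameter `m` (the fraction of each deme's gamete pool that comes from the other deme). The function names are made up for this example.

```python
import random


def wright_fisher_two_demes(n, m, generations, p0=0.5, seed=0):
    """Neutral Wright-Fisher drift in two demes of n diploids each.

    m is the fraction of each deme's gamete pool drawn from the other deme
    (m = 0.5 corresponds to random mixing, m = 0 to complete isolation).
    Returns a list of (p1, p2) allele frequencies, one pair per generation.
    """
    rng = random.Random(seed)
    p1 = p2 = p0
    history = [(p1, p2)]
    for _ in range(generations):
        # Gamete pools after exchange between the two demes.
        pool1 = (1 - m) * p1 + m * p2
        pool2 = (1 - m) * p2 + m * p1
        # Binomial sampling of 2n gametes per deme (genetic drift).
        p1 = sum(rng.random() < pool1 for _ in range(2 * n)) / (2 * n)
        p2 = sum(rng.random() < pool2 for _ in range(2 * n)) / (2 * n)
        history.append((p1, p2))
    return history


def mean_divergence(n, m, generations, replicates=200):
    """Average |p1 - p2| in the final generation over independent replicates."""
    total = 0.0
    for rep in range(replicates):
        p1, p2 = wright_fisher_two_demes(n, m, generations, seed=rep)[-1]
        total += abs(p1 - p2)
    return total / replicates


if __name__ == "__main__":
    for m in (0.5, 0.05, 0.0):
        print(m, mean_divergence(n=50, m=m, generations=50))
```

Under these assumptions the average frequency difference between the demes should grow as `m` shrinks, which is the qualitative point made above; the deme sizes and migration values are arbitrary.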
Speciation

The phenomenon of new species emergence is called speciation.
It is well accepted that speciation is driven by the formation of reproductive barriers.
Allopatric speciation: occurs through geographical separation
Parapatric speciation: occurs without geographical separation but with weak flow of genetic information
Sympatric speciation: occurs while information is flowing
Barriers can be genetic, physical, or behavioral.

Allopatric speciation

"Finally, then, I suppose that a large number of closely allied or representative species... were originally formed in parts formerly isolated" (Darwin)
[Figure: Åland Islands, Glanville fritillary population: same species. Charis butterflies in South America: different species]

Reproductive barriers

Factors that limit gene flow: geography, habitat, sexual preferences, season, pollinator.
Many factors can contribute to forming a barrier: physical incompatibility, hybrid sterility (mule), pre-zygotic infertility, post-zygotic lethality.

Sympatric speciation

Following Darwin, and prior to population genetics and genetics in general, evolutionary biologists considered sympatric speciation the leading factor generating new species. The idea was that species adapt to niches while co-existing in the same habitat.
Sympatric speciation is, however, difficult to explain using the standard population genetics of interbreeding populations. Mayr (and Dobzhansky) made strong arguments suggesting that allopatric speciation is the major (or only) driver of biodiversity.

Evidence for sympatric speciation

Studies of cichlid fish species in African lakes showed incredible diversity: 500 endemic species in Lake Victoria, up to 1000 in Lake Malawi.
The history of some of these lakes may have included massive dry-out and geographical separation.
In a smaller lake (shown here is Barombi Mbo in Cameroon), dry-out is geographically unlikely, and several species (7) with a probable common ancestor do suggest sympatry.

Selection vs. drift

Direct selection on the barrier
Indirect selection
• The character selected for causes the barrier
• The character selected for affects genes that also cause the barrier
• Hitchhiking
Drift
• Plain drift
• Bottlenecks drive deleterious alleles to fixation
Reinforcement
• Dobzhansky's scenario: partial separation in allopatry
• Hybrids are unfit (if they are dead we already have species)
• Selection creates a reproductive barrier
• Many theoretical limitations, but solutions exist
• Still controversial

Species trees

Speciation is irreversible! (with some minor exceptions: think parasites)
We end up with a branching process, forming a tree.
[Figure: strains branching into species (Species 1 to Species 4) over time, with one extinction, up to the present time]

Facts on trees

• A tree is a connected graph without cycles
• We will use directed trees: each edge/lineage has a direction (time)
• Directed acyclic graph (DAG): a directed graph without cycles
• A binary tree: one or zero parents (incoming edges), two or zero children (outgoing edges)
• A binary tree on n extant species will have n-1 inner nodes (prove)
• Each node partitions a binary tree into three disconnected parts (up, left, right)
• The root of the tree is the only node without parents
• Topological order: a permutation of the nodes such that each node appears after its parents
• BFS/DFS

Evolutionary inference

We can usually observe only the extant populations.
But we want to infer the history of the evolutionary process:
- What did the ancestral populations/species look like?
(nodes in the tree)
- What was the evolutionary process that turned an ancestral genome into an extant one? (edges in the tree)
So we will develop methods for inference: estimating the values of missing variables based on partial observations.

Do we need inference?

Getting direct evidence on evolutionary history is only partially possible.
The fossil record has probably given us more evolutionary understanding than any other resource (definitely more than genomes). But it cannot teach us much about evolution at the genome level, and we cannot use it to learn how to read the genome itself.
New technologies promise to sequence the genomes of extinct species (mammoth, Neanderthals). But this is inherently limited by material availability.

Why do we have a chance with inference?

We are trying to infer the past based on the present. Does this make any sense at all?
The past is correlated with the present: with A the past and B the present, a low substitution probability in $\Pr(B \mid A)$ means a high correlation $\mathrm{COV}(A, B)$, and

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$

Maximum parsimony

If we assume that the traits on the tree change slowly, then an ancestral trait is usually the same as the extant one.
For each ancestral node, we have evidence coming in from three directions, and almost always two of them should agree.
Formally: given a tree T and observations $S_i$ (from some alphabet) on the extant species:
1) compute the minimal number of changes along the tree;
2) find the possible values at each ancestral node, given an evolutionary scenario involving the minimal number of changes.
[Figure: small tree with leaves labeled C and A and unknown ancestral states (?), contrasting a scenario with 2 substitutions against one with 1 substitution]
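The first part of the problem stated above (the minimal number of changes) can be sketched with Fitch-style set operations. This is a minimal illustration, assuming a rooted binary tree encoded as nested tuples and one character per extant species; the function name `fitch_up` and the tree encoding are choices made for this example. It returns only the bottom-up candidate sets, so a second top-down pass would still be needed to list every possible ancestral value.

```python
def fitch_up(tree, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    leaf_states: dict mapping each leaf name to its observed character.
    Returns (candidate_set, changes): the candidate states at the root of
    this subtree and the minimal number of changes inside the subtree.
    """
    if isinstance(tree, str):
        # Extant species: the candidate set is the observation itself.
        return {leaf_states[tree]}, 0
    left_set, left_changes = fitch_up(tree[0], leaf_states)
    right_set, right_changes = fitch_up(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


if __name__ == "__main__":
    tree = (("s1", "s2"), "s3")
    states = {"s1": "A", "s2": "C", "s3": "A"}
    root_set, changes = fitch_up(tree, states)
    print(root_set, changes)  # {'A'} 1
```

In the usage example, s1 and s2 disagree, so their parent gets the set {A, C} at the cost of one change, and the third leaf then resolves the root to A.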
Computing the parsimony score

Maximum parsimony algorithm (following Fitch 1971): start with D = 0 and up_set[i], a bitvector for each node.

    Up(i):
        if (extant) { up_set[i] = S_i; return }
        Up(right(i)); Up(left(i))
        up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
        if (up_set[i] = ∅) {
            D += 1
            up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
        }

Compute the minimal number of changes by calling Up(root).
[Figure: tree with leaves S1, S2, S3 and unknown ancestral nodes (?), annotated with up_set[4] and up_set[5]]

Parsimony "inference"

Set[i] = up_set[i] ∩ down_set[i]

Algorithm (following Fitch 1971): Up(i) as in the previous slide, followed by a top-down pass:

    Down(i):
        down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
        if (down_set[i] = ∅) {
            down_set[i] = up_set[sib(i)] ∪ down_set[par(i)]
        }
        if (not extant) { Down(left(i)); Down(right(i)) }

    Algorithm:
        D = 0
        Up(root)
        down_set[root] = ∅
        Down(right(root)); Down(left(root))

[Figure: the same tree with leaves S1, S2, S3, annotated with up_set[3], down_set[4] and down_set[5]]

Genomic sequencing

In its first 100 years, evolutionary theory was about organismal traits.
Starting from the 1960s, molecular traits became available (mostly looking at proteins).
Since the 1990s, and to its full extent today, we can cheaply sequence whole genomes.
It is expected that within a few years, technology will routinely allow studying whole genomes in large population samples. For example:
• The 3-billion-dollar Human Genome Project can now be done by a single lab within a few weeks for $5,000, and the price is rapidly dropping
• The 1000 Genomes Project

Sequencing technology is rapidly evolving: Illumina GAII (here at WIS), ~40,000,000 reads of ~36 bp each, $5k-10k. Jan 2010: 300 million reads, 150 bp x 2…
Genome evolution: nucleotides are not simple traits

[Figure: example sequences illustrating deletion, insertion, point mutation (substitution), and duplication (GGAACC to GGAAGGAACC)]

We transform nucleotides to traits using alignment.
An alignment specifies which positions in two or more genomes represent the same "trait", assuming they are the outcome of a single genealogy.
As we are seeing, this need not be well defined (e.g. duplications), but we will usually have to assume it is.
A basic pairwise alignment optimization problem is solved using dynamic programming.
Pairwise alignment: find the alignment minimizing the number (or some linear cost) of mismatches (including deletion/insertion characters).
Affine gap pairwise alignment: find the alignment minimizing the cost of mismatches + the cost of gaps (a fixed cost for a new gap, another cost for each gap character).
(see any standard text on computational genomics)

The alignment dynamic programming graph (for reference)

a.k.a. Smith-Waterman, Needleman-Wunsch
[Figure: dynamic programming grid with the Species 1 sequence ATCTGATC along index i and the Species 2 sequence TGCATAC along index j; edges correspond to match/mismatch and gap moves]

Global alignment (initialize $s_{0,0}$ to 0):

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

Local alignment:

$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

How can we align the whole query to part of the database?

Multiple alignment

The problem: given a set of sequences (each from a different species), find their optimal multiple alignment.
Multiple alignment cost: many possible definitions. In most of these the problem is NP-hard.
In fact, we should be looking for the complete evolutionary history of these sequences. Therefore, the optimal alignment should in principle define the genealogy of each nucleotide, such that these histories are reasonable.
In practice, multiple alignment algorithms use heuristics based on these ideas.
Designing and implementing a really principled version of these algorithms is not easy.
1. Pairwise alignment (distances)
2. Build a "guide tree"
3. Align from leaves to root, each time a pair (sequences or profiles)

…ACGAATAGCAGATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGAT…
…ACGTTTAGCAAATGGGCAGATGGCAGTCTAGA-----AGCATGAGACTAGATAGAT…
…ACGAATAGCAAAT------ATGCCAGTCTAGATCGAAAGCATGCCACTAGATAGAT…

Genome alignment

Given a set of genomes, each consisting of several billion nts, the problem becomes quite intensive.
Heuristics are used to search for pieces of alignment (BLAST).
Pieces are then combined into chains of large fragments.
Genome alignment can be projected over some reference genome. Complex situations with duplications and large deletions and insertions require complex solutions and are routinely ignored.
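Both the pairwise step of the progressive scheme and the alignment pieces that get chained in genome alignment rest on the pairwise dynamic program. The following is a minimal sketch of that recurrence in its global and local forms, assuming a simple linear scoring scheme (match = 1, mismatch = -1, gap = -1). These score values and the function name `alignment_score` are illustrative choices, not taken from the lecture, and the sketch returns only the optimal score, not the alignment itself.

```python
def alignment_score(v, w, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    Global mode follows the Needleman-Wunsch recurrence; local mode follows
    Smith-Waterman (scores are clamped at 0 and the best cell is returned).
    The scoring values are illustrative defaults with a linear gap cost.
    """
    rows, cols = len(v) + 1, len(w) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized along the borders.
        for i in range(1, rows):
            s[i][0] = i * gap
        for j in range(1, cols):
            s[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            cell = max(
                s[i - 1][j - 1] + delta,  # match or mismatch
                s[i - 1][j] + gap,        # gap in w
                s[i][j - 1] + gap,        # gap in v
            )
            if local:
                cell = max(0, cell)       # a local alignment may restart
            s[i][j] = cell
            best = max(best, cell)
    return best if local else s[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))                      # 2
    print(alignment_score("TTACGTT", "GGACGGG", local=True))   # 3
```

An affine gap cost, as mentioned in the pairwise alignment slide, would need separate tables for gap opening and gap extension; it is left out here to keep the sketch short.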