genomic diversity and differentiation
heading toward exam 3

genome region of arbitrary size
• what can you measure and describe?
• what else might you want to know?
• if given these data and nothing else, what could you say about them?

learning goals for coalescent theory
• how do patterns in sequence data tell us about effective population size?
• what if there are multiple populations contributing information?
• how is our answer changed if the population changes in size, or if there is selection for a particular allele?
• why is this important for understanding phylogenetics (species trees)?

patterns
• mutations happen at a more-or-less constant rate at random locations along the genome (assumptions can be tested)
• drift, selection, gene flow, recombination, etc. influence how these mutations turn into patterns
• we interpret with statistical models - mostly beyond this class

assume genealogy
• descent with modification
• focus on non-reticulate gene trees
• assume every mutation happens at a new genome location
(Avise 1987, 1994)

neutral model
• assume all these mutations have NO effect on fitness (null model)
• thus, only drift influences whether an allele goes to fixation
• remember: the probability that an allele goes to fixation is its frequency in the population
• so every new mutation has a low but equal probability that it will get FIXED (frequency 100%)

SPECIES, GENE COPY, POPULATION (DEME)
• so you are collecting data, not generally knowing the history of inheritance or how discrete these units may be (actually discrete, resolvably discrete)
• we are working on how to infer (at least probabilities) how this diversity partitions in space (population), time (frequencies), across genome (paralogs), across species (orthologs)
• also: copy number variation among loci, among populations, among species

how many whales? (Roman and Palumbi 2003)
currently ~10,000 humpback whales; prewhaling (genetic estimate) maybe ~250,000
how could there be so many?
1. count whales - currently done using censusing and monitoring of whaling vessels, about 10,000 humpback whales in Atlantic
2. collect DNA samples from some of them, and sequence at least one gene (more is better!)
3. remember π is proportional to effective population size (times mutation rate µ)
4. we know µ (~0.00000001 substitutions per DNA replication/reproduction) from fossil and biogeographic data, and we can calculate π (average # differences between every pair of sequences)
5. Ne = π/µ, adjusted for inheritance of marker (haploid, maternally inherited mtDNA, versus diploid, biparental nuclear gene)
6. Ne of humpback whales ~250,000 even though only 10,000 whales now!
7. the genetic diversity is older than human whaling efforts and tells us about the past

AUTOSOMES: ALL 4 COPIES CAN CONTRIBUTE MUTATIONS
MTDNA: ONE COMPONENT CONTRIBUTES MUTATIONS
WHEN PEOPLE REFER TO THE SMALLER EFFECTIVE SIZE OF THE MITOCHONDRIAL GENOME, THEY ARE REFERRING TO COPY NUMBER, NOT THE NUMBER OF INDIVIDUALS IN THE POPULATION!

another look at Ne: drift
• neutrality: mean Time to Most Recent Common Ancestor (tmrca) = time to homozygosity = -4Ne [ p ln(p) + (1-p) ln(1-p) ] generations (DO NOT MEMORIZE THIS)
• proportional to Ne; for p=0.5, ~2.77Ne generations
• heterozygosity declines by 1/(2Ne) per generation
• compare nuclear gene vs. mitochondrial gene...?

basic summary stats
• S, number of segregating sites (how many below?)
• π, average number of differences among sequences (what is it below?)
• ηi, folded site pattern: how many segregating sites appear i times?
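The three summary statistics just defined can be computed mechanically, which is a handy way to check a hand count. The sketch below is only an illustration: it uses the six aligned sequences that accompany this slide, the function names are made up for this example, and the folded spectrum assumes each segregating site has exactly two states (which holds for these data).

```python
from collections import Counter
from itertools import combinations

# The six aligned sequences shown with this slide (26 sites each).
SEQS = [
    "caccgtattagcattatgctggtata",
    "cgccgtactggcattatgctggtata",
    "caccgtactagcattgtgctggtatg",
    "caccgtactagcattatgccggtatg",
    "cactgtactggcattatgctggtgta",
    "cactgtactggcattatgctggtata",
]


def segregating_sites(seqs):
    """Return 0-based indices of alignment columns with more than one state."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def pairwise_diversity(seqs):
    """pi: number of differences averaged over every pair of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def folded_spectrum(seqs):
    """eta_i: count of segregating sites whose rarer state appears i times.

    Assumes two states per segregating site (illustrative simplification).
    """
    spectrum = Counter()
    for i in segregating_sites(seqs):
        state_counts = Counter(seq[i] for seq in seqs)
        spectrum[min(state_counts.values())] += 1
    return dict(spectrum)


if __name__ == "__main__":
    print("S   =", len(segregating_sites(SEQS)))
    print("pi  =", round(pairwise_diversity(SEQS), 3))
    print("eta =", folded_spectrum(SEQS))
```

Note that π here is a count per sequence; dividing by the alignment length gives the per-site value that would be compared with a per-site mutation rate.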
caccgtattagcattatgctggtata
cgccgtactggcattatgctggtata
caccgtactagcattgtgctggtatg
caccgtactagcattatgccggtatg
cactgtactggcattatgctggtgta
cactgtactggcattatgctggtata

standard coalescent
• sample size n has n-1 coalescent events
• the step with i lineages has length Ti, E[Ti] = 2/(i(i-1)), measured in units of N
• genetic (label) differences have no fitness consequence
• single population
• constant population size (for now)
TREE IS UNKNOWN; ANALYSIS IS ASKING WHICH TREES FIT THE DATA AND WHAT THAT TELLS US ABOUT THE INTERVALS BETWEEN BRANCH NODES

mutation
• # mutations (K) Poisson distributed on genealogy, based on total time t = (Ttotal)
• Poisson process: stochastic, each time interval is independent, waiting time is exponentially distributed across time intervals (but when many branches, multiplies opportunity in interval)

Applications
The classic example of phenomena well modelled by a Poisson process is deaths due to horse kick in the Prussian army, as shown by Ladislaus Bortkiewicz in 1898. The following examples are also well modelled by the Poisson process:
• Requests for telephone calls at a switchboard.
• Goals scored in a soccer match.
• Requests for individual documents on a web server.
• Particle emissions due to radioactive decay by an unstable substance. In this case the Poisson process is nonhomogeneous in a predictable manner - the emission rate declines as particles are emitted.

Ewens distribution (DO NOT MEMORIZE THIS)
• under neutral model, mutations arise at rate µ and are lost or drift to higher frequency (frequency proportional to AGE)
• thus we’ve come to expect a certain distribution of allele frequencies, e.g. p=q is unlikely
• generally a small number of very common alleles, and increasing number of very rare alleles (DO RECOGNIZE THIS)

um, huh?
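A small simulation may make the standard coalescent and the Poisson mutation step more concrete. This is a minimal sketch under the slide's assumptions (single population, constant size, neutrality), with time in the slide's units of N. The function names and the `rate` parameter (mutations per lineage per unit of coalescent time) are hypothetical and chosen only for illustration.

```python
import random


def expected_interval(i):
    """E[T_i] = 2 / (i (i - 1)): expected time while i lineages remain."""
    return 2.0 / (i * (i - 1))


def expected_tmrca(n):
    """Sum of E[T_i] for i = 2..n; algebraically 2 (1 - 1/n)."""
    return sum(expected_interval(i) for i in range(2, n + 1))


def expected_total_length(n):
    """E[T_total]: interval T_i is carried by i branches at once."""
    return sum(i * expected_interval(i) for i in range(2, n + 1))


def simulate_tree(n, rng):
    """Draw exponential waiting times; return (tmrca, total branch length)."""
    tmrca = total = 0.0
    for i in range(n, 1, -1):
        t = rng.expovariate(i * (i - 1) / 2.0)  # mean 2 / (i (i - 1))
        tmrca += t
        total += i * t  # more branches multiply the opportunity to mutate
    return tmrca, total


def poisson_draw(mean, rng):
    """Count unit-rate events in [0, mean]; this count is Poisson(mean)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_mutations(n, rate, rng):
    """K: number of mutations dropped on one simulated genealogy."""
    _, total = simulate_tree(n, rng)
    return poisson_draw(rate * total, rng)


if __name__ == "__main__":
    rng = random.Random(2024)
    print("E[tmrca], n=6:", round(expected_tmrca(6), 3))
    print("E[T_total], n=6:", round(expected_total_length(6), 3))
    print("ten draws of K:", [simulate_mutations(6, 1.0, rng) for _ in range(10)])
```

Because E[tmrca] = 2(1 - 1/n), the expected time to the common ancestor approaches 2 (in these units) as the sample grows, and most of that time is spent waiting for the last two lineages to coalesce.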
• here is the context: DRIFT causes some alleles to increase in frequency, some to be lost (moving forward in time)
• moving back in time from NOW, the same process can explain the frequency of alleles in the context of how individuals are related (most recent common ancestor)
• this means we have expectations for how long it takes for a sample of sequences from NOW to coalesce to a common ancestor in the past (about 2 times effective population size)
• one reason two separate evolutionary populations may not APPEAR completely different: it takes time for ancestral diversity to sort out

(now) >1 population?
• this pop descended from ‘red allele’ ancestor
• this pop descended from ‘green allele’ ancestor
• let's imagine two populations that rarely exchange migrants but have a common ancestry in the recent evolutionary past
• drift (moving forwards in time from ancestral population) leads to many individuals that descended from one particular allele, a different one in each population

how do we know two populations?
• evolutionary biology: the populations tell us who they are!
• shown at right are two LOCATIONS, not necessarily two distinct populations
• may be one evolutionary population
• however: if one is 90% A1 and 10% A2, and the other is 10% A1 and 90% A2, that means overall 50% A1, 50% A2
• should see 25% A1A1 homozygotes, 25% A2A2 if Hardy-Weinberg fits
• instead see overall ~41% A1A1, 41% A2A2 because we are ‘pooling’ 2 diverged populations

excess of common alleles
• excess homozygosity could mean that two evolutionary populations are being analyzed as though they are one
• so we don’t trust “even” allele frequencies: now think frequency dependent selection, balancing selection, or pooling of multiple evolutionary populations

neutral theory: sort of like Goldilocks story
• just right = “neutral”: η1=2 η2=2 η3=1 η4=1
• excess rare alleles = purifying selection or population expansion: η1=3 η2=2 η3=1 η4=0
• excess common alleles = positive selection or long-term decline: η1=0 η2=1 η3=2 η4=3 (2, +1 for “η5”)

learning goals for coalescent theory
• how do patterns in sequence data tell us about effective population size?
• what if there are multiple populations contributing information?
• how is our answer changed if the population changes in size, or if there is selection for a particular allele?
• why is this important for understanding phylogenetics (species trees)?

why is this important for understanding phylogenetics (species trees)?
• coalescent theory lets us test our assumptions of how DNA sequences evolve before we use them to reconstruct phylogeny
• coalescent theory explains why recently-diverged populations may not yet have synapomorphies despite already being on different evolutionary paths
• this model gives us basis for estimating time to ancestor of ANY two sequences

DNA characters are just like phenotypic characters
• 4 character states A,C,T,G plus information in insertion-deletion, gene copy number, etc.
• same concerns of homology and shared descent apply

“mitochondrial Eve” sets up misunderstanding
• every locus sampled now has a point in the past where all current alleles coalesce to a common ancestor
• in recently diverged species, diversity is often older than the species
[figure: human population isolated ~200 kya; labels: Ne, isolation]

understanding coalescence
1. larger effective size (Ne), more diversity
2. when time between branching events is short relative to Ne, more likely that allelic diversity is older than branching event
"This coalescence does not mean that the population originally consisted of a single individual with that ancestral allele. It just means that particular individual’s allele was the one that, out of all the alleles present at that time, later became fixed in the population."

phylogeny inference
• 2 basic approaches: algorithm vs. criterion
• “neighbor joining” shown in book is an algorithm that generates a single tree by finding shortest “distances” (proportion of differences at nucleotide sites)
• algorithm approaches do not help identify our uncertainty: one answer comes out, whether well supported or not

criterion-based phylogeny
• 30 tips results in 8.7 x 10^36 possible trees
• computer search necessary

3 of >10,000 possible trees
• which fits data best? depends on the criterion
• 11 changes; 7 changes = most parsimonious of these 3; 11 changes

criteria used in phylogeny
• parsimony - the fewest # of changes indicates the most acceptable tree topology
• maximum likelihood - both topology (arrangement of branches) and branch lengths are iteratively searched for tree(s) that fit statistical model of molecular evolution (e.g. transitions > transversions)
• Bayesian - criterion is still maximum likelihood, search strategy is different (sums result over many similar-likelihood trees)

why different criteria?
1. we are making our assumptions explicit for inference of the unknown
2. different scientists have different backgrounds that drive their assumptions
3. using multiple methods/criteria lets us test how safe our assumptions are
4. next time: how do we decide if a tree hypothesis is strongly supported?
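
The tree count quoted on the criterion-based phylogeny slide can be checked with a few lines of arithmetic. The sketch below assumes the slide is counting unrooted, strictly bifurcating trees, for which the standard count is (2n-5)!! = 3 x 5 x ... x (2n-5). That assumption is an inference from the fact that the formula reproduces roughly 8.7 x 10^36 for 30 tips. The function names are made up for this example.

```python
def unrooted_tree_count(n_tips):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!!"""
    if n_tips < 3:
        raise ValueError("need at least 3 tips")
    count = 1
    for k in range(3, 2 * n_tips - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


def rooted_tree_count(n_tips):
    """Number of rooted bifurcating trees: (2n - 3)!!

    Rooting adds one more attachment point, so the count equals the
    unrooted count for n + 1 tips.
    """
    if n_tips < 2:
        raise ValueError("need at least 2 tips")
    return unrooted_tree_count(n_tips + 1)


if __name__ == "__main__":
    for n in (4, 5, 8, 10, 30):
        print(f"{n:>2} tips: {unrooted_tree_count(n):.3e} unrooted trees")
```

The count grows so fast that scoring every tree is out of reach well before 30 tips, which is why criterion-based methods rely on heuristic searches rather than evaluating all topologies.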