QTL mapping

Simple Mendelian traits are caused by a single locus and come in the 'all-or-none' flavor.
A quantitative trait is one to which many loci contribute. The phenotype can therefore vary in a 'quantitative' manner.

Ades 2008, NHGRI. Modified from Mike White slides, 2010.

Goals of QTL mapping

To identify the loci that contribute to phenotypic variation:
1. Cross two parents with extreme phenotypes
2. Score the progeny for the phenotype
3. Genotype the progeny at markers across the genome
4. Associate the observed phenotypic variation with the underlying genetic variation
5. Ultimate goal: identify causal polymorphisms that explain the phenotypic variation

Backcross

Phenotype: drug tolerance (80% vs. 20% viability)
Usually have at least 100 individuals
(Broman and Sen 2009)

Intercross

Phenotype: drug tolerance (80% vs. 20% viability)
(Broman and Sen 2009)

Backcross vs. Intercross

• An intercross recovers all three possible genotypes (AA, BB, AB). This allows detection of dominance with both alleles and provides estimates of the degree of dominance.
• A backcross has more power to detect QTL with fewer individuals.
• A backcross may be the only possible scheme when crossing two different species.

Genetic map: specific markers spaced across the genome

Markers can be:
• SNPs at particular loci
• Variable-length repeats, e.g. Alu repeats
• All polymorphisms (if whole genomes are available)
Ideally, markers should be spaced every 10-20 cM and span the whole genome.

Genotype data: determine the allele at all markers in each F2

Phenotype data

Test which markers correlate with the phenotype

1. Missing data problem: use marker data to infer intervening genotypes
2. Model selection problem: how do the QTL across the genome combine with the covariates to generate the phenotype?
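The missing data problem above can be illustrated with a simplified two-marker calculation for a backcross. This is only a sketch: the function names are hypothetical, the Haldane map function (no crossover interference) is an assumption that does not come from the slides, and genotyping errors are ignored. Real software typically handles whole chromosomes at once.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM.

    Assumption: Haldane map function (no crossover interference).
    """
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_aa_between(left, right, d_left_cm, d_right_cm):
    """P(genotype AA at a position | flanking marker genotypes), backcross.

    left, right: observed genotypes ("AA" or "AB") at the flanking markers.
    d_left_cm, d_right_cm: distances from the position to each marker in cM.
    """
    r1 = haldane_r(d_left_cm)
    r2 = haldane_r(d_right_cm)

    def step(a, b, r):
        # Probability of moving from genotype a to genotype b across an
        # interval with recombination fraction r (backcross: two states).
        return 1.0 - r if a == b else r

    # Joint probability of each candidate genotype with the flanking markers
    # (the common prior factor of 1/2 cancels in the ratio).
    p_aa = step(left, "AA", r1) * step("AA", right, r2)
    p_ab = step(left, "AB", r1) * step("AB", right, r2)
    return p_aa / (p_aa + p_ab)

# Position midway between two markers that are 10 cM apart.
print(prob_aa_between("AA", "AA", 5.0, 5.0))  # both flanks AA: close to 1
print(prob_aa_between("AA", "AB", 5.0, 5.0))  # flanks disagree: ambiguous
```

When the flanking markers agree, the intervening genotype is nearly certain. When they disagree, a crossover occurred somewhere in the interval and the genotype at the position is uncertain, which is why dense, evenly spaced markers help.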
Test which markers correlate with the phenotype (Broman and Sen 2009)

Marker regression: simple t-test (or ANOVA) at each marker
Marker 1: no QTL
Marker 2: significant QTL (population means are different)

Marker regression

Advantages:
• Simple test: standard t-test/ANOVA
• Covariates (e.g. gender, environment) are easy to incorporate
• No genetic map necessary, since the test is done separately on each marker

Disadvantages:
• Any individuals with missing marker data must be omitted from the analysis
• Does not effectively consider positions between markers
• Does not test for genetic interactions (e.g. epistasis)
• The apparent effect size of the QTL (and hence the power to detect it) is reduced by incomplete linkage to the marker
• Difficult to pinpoint QTL position, since only the marker positions are considered

Interval mapping

• Lander and Botstein 1989
• In addition to examining phenotype-genotype associations at markers, look for associations between markers by inferring the genotype at a putative QTL (Q)
• The methods for calculating genotype probabilities between markers typically use hidden Markov models to account for additional factors, such as genotyping errors

Interval mapping (Broman and Sen 2009)

Advantages:
• Takes account of missing genotype information: all individuals are included
• Can scan for QTL at locations in between markers
• QTL effects are better estimated

Disadvantages:
• More computation time required
• Still only a single-QTL model: cannot separate linked QTL or examine interactions among QTL

LOD scores

• Measure of the strength of evidence for the presence of a QTL at each marker location
• LOD(λ) = log10 likelihood ratio comparing the hypothesis of a QTL at position λ versus that of no QTL:

$$\mathrm{LOD}(\lambda) = \log_{10}\left\{ \frac{\Pr(y \mid \text{QTL at } \lambda,\ \mu_{AA\lambda},\ \mu_{AB\lambda},\ \sigma_{\lambda})}{\Pr(y \mid \text{no QTL},\ \mu,\ \sigma)} \right\}$$

• A LOD of 3 means that the data are 10^3 times more likely under the top (QTL) model than under the bottom (no-QTL) model

LOD curves

How do you know which peaks are really significant?
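As a minimal sketch of the LOD score at a single marker, assume normally distributed phenotypes with one mean per genotype class and a common variance. Under that assumption the log10 likelihood ratio reduces to (n/2) · log10(RSS0/RSS1), where RSS0 is the residual sum of squares around the overall mean and RSS1 is the residual sum of squares around the genotype-class means. The function name and the toy data are hypothetical. This is marker regression, not interval mapping, so individuals with a missing genotype (None) are dropped.

```python
import math

def lod_at_marker(phenotypes, genotypes):
    """LOD score at one marker under a normal model with common variance.

    phenotypes: list of floats.
    genotypes:  list of genotype labels (e.g. "AA", "AB"); None = missing.
    Note: raises ZeroDivisionError if the genotype classes fit the data
    perfectly (RSS1 == 0), which can happen with tiny toy data sets.
    """
    # Marker regression must omit individuals with missing marker data.
    pairs = [(y, g) for y, g in zip(phenotypes, genotypes) if g is not None]
    n = len(pairs)

    # Null model: a single mean for everyone.
    grand_mean = sum(y for y, _ in pairs) / n
    rss0 = sum((y - grand_mean) ** 2 for y, _ in pairs)

    # QTL model: one mean per genotype class.
    groups = {}
    for y, g in pairs:
        groups.setdefault(g, []).append(y)
    rss1 = 0.0
    for ys in groups.values():
        class_mean = sum(ys) / len(ys)
        rss1 += sum((y - class_mean) ** 2 for y in ys)

    return (n / 2.0) * math.log10(rss0 / rss1)

# Toy backcross data: the second marker separates the phenotypes, the first does not.
y = [1.0, 2.0, 5.0, 6.0]
print(lod_at_marker(y, ["AA", "AB", "AA", "AB"]))
print(lod_at_marker(y, ["AA", "AA", "AB", "AB"]))
```

Scanning this function across all markers gives a crude LOD curve. Interval mapping would replace the hard genotype classes with genotype probabilities at positions between markers.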
LOD threshold

• Consider the null hypothesis that there are no QTL genome-wide
[Figure panels: one location; genome-wide (Broman and Sen 2009)]
1. Randomize the phenotype labels relative to the genotypes
2. Conduct interval mapping and determine the maximum LOD score genome-wide
3. Repeat a large number of times (1,000-10,000) to generate a null distribution of maximum LOD scores

LOD threshold

• 1000 permutations
10% 'genome-wide error rate' (GWER) = LOD 3.19 (i.e. with no true QTL, 10% of genome scans would still produce a maximum LOD at least this high by chance)
5% GWER = LOD 3.52
• Boundary of the peak is often taken as the points where the LOD curve crosses (max LOD - 1.5) (or - 1.8 for an intercross)
• Often these regions are very large and encompass many (hundreds of) genes

Lessons from QTL mapping studies about genetic architecture

* Often have a few big-effect QTL and many modifier QTL with small effects on the phenotype
Need lots of power (good phenotypic measurements and many individuals) to detect QTL with small effects
* Recombination in F2s can reveal negative effects segregating in the parents
e.g. can find a resistant-parent allele associated with sensitivity
Mackay review: loci with complementary effects are often found nearby
* Effects of an allele can be context dependent
Environment-specific effects: gene x environment (GxE) interactions
Genomic context: epistatic (i.e.
gene-gene) interactions are likely very common … but difficult to detect

An alternative approach: Genome Wide Association Studies (GWAS)

Here the phenotypes and genotypes come from many different individuals from a population.
Identify SNPs that are significantly associated with the trait across many individuals.

[Figure: diagram relating phenotypes for 65 strains to genotypes for 65 strains, population structure, phylogenetic relatedness, and random error. Strains are grouped as laboratory, clinical, wine, baking, biofuel, other fermentation, oak, and nature.]

Typically use a mixed linear model to test for significance:
y = μ + a + other stuff + error
where y is the phenotypic variation, μ is the phenotypic mean, a is the additive genetic effects across all involved genes, and error is random error.

Identify SNPs that are significantly associated with the trait.
[Figure: phenotype (y-axis) by genotype (x-axis: AA vs. TT)]

A very important control for both types of mapping: controlling for covariates

Sometimes a SNP can appear correlated with phenotypic variation … but it can be due to some other feature that co-varies with the SNP and the phenotype.
The clearest example: population structure
Other examples:
- gender of the individuals
- shared environments for subgroups
- an example from our yeast studies: ploidy differences when some F2s are haploid and some are diploid

Example: S. cerevisiae strains (Liti et al.
2009)
[Figure: phenotype (y-axis) by genotype (x-axis: AA vs. TT) for vineyard strains and oak strains]

Mixed linear model identifies SNPs with a significant p-value.
Often plot the -log(p) across the genome (Manhattan plot).
Again, the p-value cutoff comes from permutations (randomize the strain-phenotype labels and perform mapping on the randomized data 10,000 times).

How to find the causative SNP/polymorphism in giant regions?

Often very challenging to find which SNP(s) or polymorphisms (copy-number differences, rearrangements, etc.) are causal.
Some strategies people use:
- Look at what's known about the genes in the peak
CAUTION: very easy to get led by what 'seems likely'
- Look at signatures of selection within the population, e.g. differences in FST
- Look for derived alleles
- Look for coding changes, or genes in the region with severe expression differences
- Combine with other data, e.g. other mapping studies (QTL + GWAS), genomic datasets
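The permutation procedure used for both the LOD threshold and the GWAS p-value cutoff can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular package. It scans marker by marker with a simple normal-model LOD instead of interval mapping or a mixed linear model, all function names are hypothetical, and the data are simulated with no true QTL.

```python
import math
import random

def lod(phenotypes, genotypes):
    """Single-marker LOD: (n/2) * log10(RSS_null / RSS_by_genotype).

    Assumes no missing genotypes and RSS_by_genotype > 0.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    rss0 = sum((y - mean) ** 2 for y in phenotypes)
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    rss1 = sum(
        sum((y - sum(ys) / len(ys)) ** 2 for y in ys) for ys in groups.values()
    )
    return (n / 2.0) * math.log10(rss0 / rss1)

def null_max_lods(phenotypes, markers, n_perm=1000, seed=0):
    """Sorted null distribution of the genome-wide maximum LOD score."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        # Step 1: randomize phenotypes relative to genotypes.
        rng.shuffle(shuffled)
        # Step 2: scan every marker and keep only the genome-wide maximum.
        maxima.append(max(lod(shuffled, m) for m in markers))
    # Step 3: the collected maxima form the null distribution.
    return sorted(maxima)

def lod_threshold(sorted_maxima, alpha=0.05):
    """LOD cutoff reached or exceeded by roughly a fraction alpha of null scans."""
    k = math.ceil((1.0 - alpha) * len(sorted_maxima)) - 1
    return sorted_maxima[min(max(k, 0), len(sorted_maxima) - 1)]

# Simulated backcross: 40 individuals, 10 unlinked markers, no real QTL.
rng = random.Random(42)
markers = [[rng.choice(["AA", "AB"]) for _ in range(40)] for _ in range(10)]
phenotypes = [rng.gauss(0.0, 1.0) for _ in range(40)]

null = null_max_lods(phenotypes, markers, n_perm=200, seed=1)
print("10% genome-wide threshold:", lod_threshold(null, 0.10))
print(" 5% genome-wide threshold:", lod_threshold(null, 0.05))
```

Taking the maximum over the whole genome in each permutation is what makes the cutoff genome-wide: it accounts for testing many positions at once, so the 5% threshold is higher than a 5% cutoff for a single marker would be.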