Download Association

Strategies for gene identification in complex traits: Association studies

What is an association study?
Objective: Is there a statistical relation between
• Genomic variation at one or more sites
• Phenotypic variation
  – Presence/absence of a disease
  – Levels of a disease-related trait
Principle: Compare 2 groups that are expected to differ in their prevalence of disease-susceptibility alleles

Analytical Issues in Genetic Association Studies
• Sampling design
• Markers (typed; map density)
• Unit of analysis
• Statistical testing

Linkage disequilibrium between 2 tightly linked loci
                     Marker 2
    Marker 1      A2        a2
    A1            A1A2      A1a2      f(A1)
    a1            a1A2      a1a2      f(a1)
                  f(A2)     f(a2)
• Allelic association: f(i,j) ≠ f(i) × f(j), i.e., haplotype frequency ≠ product of allele frequencies
• LD decays with time/generations and genetic distance (recombination)

Measures of allelic association
• D' (Lewontin's); r² (correlation)
• 0 ≤ r² ≤ D' ≤ 1
• D' ~ recombinational events in the genomic region
• r² ~ the 2 SNPs carry the same information
• D' can be high while r² is not:
  – D' = 1; r² = 1: only haplotypes A1A2 and a1a2 are present
  – D' = 1; r² < 1: haplotypes A1A2, a1A2 and a1a2 are present

Power in Population-based vs.
Family-based Analysis

TDT
• Genotype 3 subjects/family
• Phenotype 1 subject/family
• Increased power with multiple affected sibs
• Generally immune to population stratification
• Family structure provides some error-checking and haplotype information
• Full trios may not be available

Case-Control
• Genotype 2 subjects to equal one trio
• Phenotype 2 subjects to equal one trio
• Increased power with 3:1 controls:cases
• Susceptible to population stratification

Most common forms of markers
• Repeated sequences of 2, 3 or 4 nucleotides (microsatellites)
  – reasonably frequent in genome
  – highly polymorphic/informative → useful in linkage analysis
  – few disease susceptibility gene variants are likely STRs
• Single Nucleotide Polymorphisms (SNPs): “one” letter of the code is altered
  – very frequent in genome (1/500 to 1/1000 base pairs)
  – exonic SNPs may or may not cause an amino acid change
  – many disease susceptibility gene variants are likely SNPs

Unit of Analysis in Genetic Association Studies
• Allele vs. Genotype
  – Dominance can be considered in genotype analysis
  – Extra degree of freedom in genotype analysis
  – Not clear which is optimal
• Single SNP vs. Haplotype
  – Haplotypes capture evolutionary history
  – Need for haplotype imputation
  – Single SNP optimal if functional SNP is included

What are we hoping from a genetic association study?
Situation of interest: trait variation is influenced by
• The typed variant (marker = causal variant): Direct Association
OR
• A second variant (marker in LD with causal variant): Indirect Association

Likelihood of detecting a true association?
• Genetic effects of the causal allele on trait susceptibility/variation: relative risk & allele frequency
• LD between the marker and the causal variant (marker map & LD patterns in the genomic region of the causal variant)

Detectable Genetic effects (1)
Variant with modest effects (OR~1.6): power as a function of allele frequency
[Figure: power (0% to 100%) under nominal P-values of 5%, 1%, 0.1% and 0.01%, for f=0.05 and f=0.10; N=2,000 (1,000 cases + 1,000 controls)]

Detectable Genetic effects (2)
Rare (f=0.02) causal variant: power as a function of OR
[Figure: power (0% to 100%) under nominal P-values of 5%, 1%, 0.1% and 0.01%, for OR=1.5 and OR=2.0; N=2,000 (1,000 cases + 1,000 controls)]

Detectable Genetic effects?
→ Association is powerful to detect causal variants that are
  - Common (>10%) with relatively modest effects (RR)
  - Less common (~5%) but with substantial effects (RR>2)

Likelihood of detecting a true association?
LD between the marker and the causal variant: r² = 1 (direct association), 0 < r² < 1, or r² = 0
• For a given power, the required N increases with 1/r²:
    r²:   1        0.8      0.5      0.20     0
    N:    1,000    1,250    2,000    5,000    ∞
• For a given N, power ranges from maximal (r² = 1) to null (r² = 0)

Hot spots and Haplotype blocks
• LD is variable: recombination does not occur with equal probability at all points in the genome; there are « hot » and « cold » spots
• Recently, it has been suggested that the genome falls into « blocks », with little haplotype diversity within blocks: mean block size seems to be about 14 kb in Caucasians and about 8 kb in Africans

Detectable Causal Variants?
• Causal polymorphism is known and typed (direct association)
or
• There are markers that are highly correlated to the causal variant:
  - The causal locus lies in a « cold » spot (« LD blocks »)
  - The « best » map density to be used will depend on the LD patterns of the region
  → implications on statistical significance (multi-test correction)

Human Genome
• The human genome consists of about 3 × 10^9 base pairs (3-6 × 10^6 SNPs) and contains about 25,000 genes
• Much of the DNA is either in introns or in intergenic regions
→ Trait variation: a few hundred (functional) variants may make a meaningful contribution to variation in any single phenotype
→ Prior probability that a variant selected at random will influence a given trait is very low

Genetic variants to be typed? Choices have to be made
Two complementary approaches:
• Functional: incorporates assessments of the likely functional effect of variation within a gene or region of interest.
• Tagging: exploits the presence of LD in many parts of the genome.
[Figure: Significance of association with AD, for SNPs immediately surrounding APOE (<100 kb); Martin et al., AJHG, 2000]

Selection of variants: Functional approach
Target polymorphisms which are themselves putative causal variants.
Critical issues:
• Identification of candidate polymorphisms
  – Beyond mutations altering the amino acid sequence (nsSNPs), little is known about the potential effect of non-coding sequence on gene regulation & expression
• MAF of functional variants is skewed (MAF<5%): power to detect uncommon variants with modest effects?
→ Potential to be the most powerful (direct association) design, but may be limited to the discovery of some of the genetic causes of …

Selection of variants: Indirect Association
The polymorphism is a surrogate for the causal variant.
But it is necessary to type several surrounding markers to have a high chance of picking up the indirect association.
Questions: Do we need to type all markers in the region?
Can we reduce genotyping costs & multi-test burden without decreasing the power « too much »?

Tagging approaches
Type a subset of variants that captures a high amount of the information in common regional haplotypes.
Various strategies (SNP & haplotype tagging), but there is still debate as to the best methods [Johnson et al. Nat Genet, 200]
[Figure: Power as a function of average spacing of tags (kb), for tags picked at r² = 1, 0.8, 0.5 and 0.3, and at random; De Bakker, Nat Genet, Nov 2005]
→ A marker map density of ~1 tagSNP/5 kb (r²>0.8) captures >80% of common variation

Tagging approach: Limits
• Less powerful than direct studies
• There cannot be a definite negative result, since we cannot exclude the possibility that a causal variant exists but is not picked up by the markers chosen
• Intrinsic biological merit of tagSNPs as markers for complex trait susceptibility variants?
→ « Common disease, common variant » hypothesis
Supported by the few variants consistently shown to be associated with common diseases:
  – APOE & Alzheimer
  – Macular degeneration & …

In practical terms, an observed statistical association will be due to …
1. Direct association: the allele itself is functional and directly affects the expression of the phenotype
2. Indirect association: the allele is in linkage disequilibrium with an allele at another locus that directly affects the expression of the phenotype
3. Chance or artifact, e.g., confounding or selection bias
→ Study design aims to maximize detection of “true” findings while controlling (minimizing) the rate of “false” findings

“False” Association findings
1. Chance: measured by the nominal P value of the test, i.e., the probability that a typed marker is found associated when H0 (no association) is true.
→ Multi-test problem: the rate of “false” findings of a given experiment increases with the number of markers tested.
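The multi-test problem stated above can be illustrated with a short calculation. This is a minimal sketch, assuming all tests are independent and every null hypothesis is true (markers in LD are correlated, so the real family-wise rate is somewhat lower); the function names and the Bonferroni correction are illustrative choices, not something the slides prescribe.

```python
import math


def expected_false_positives(n_markers, alpha):
    """Expected number of markers passing threshold alpha when H0 holds everywhere."""
    return n_markers * alpha


def familywise_error_rate(n_markers, alpha):
    """Probability of at least one false finding, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_markers


def bonferroni_threshold(n_markers, target_fwer=0.05):
    """Per-marker P-value threshold that keeps the family-wise rate near target_fwer."""
    return target_fwer / n_markers


if __name__ == "__main__":
    for m in (1, 100, 10_000, 500_000):
        print(m,
              expected_false_positives(m, 0.05),
              familywise_error_rate(m, 0.05),
              bonferroni_threshold(m))
```

With 500,000 markers tested at a nominal 5% level, about 25,000 markers are expected to pass by chance alone, which is why a much stricter per-marker threshold (or one of the designs listed next) is needed.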
• Solutions
  – Simulation: empirical p-values
  – Replication and/or use of multi-phase designs

Multi-phase designs
• Are an efficient way to reduce the multi-test problem. For example:
  1. 2,000 cases + 2,000 controls with a 500,000 SNP chip
  2. Further 2,000 + 2,000 for the best 100,000 SNPs
  3. Further 4,000 + 4,000 for the best 10,000 SNPs
• Computation of the characteristics of …

“False” Association findings
2. Artifact (confounding, selection bias, population stratification, genotyping errors): affects the prior probability of a “chance” finding
→ The significance of a finding is no longer controlled by the nominal P-value.
• Solutions
  - Careful matching of cases & controls
  - Use homogeneous populations
  - Use family-based controls
  - Use genomic control or other similar methods
  - Use QC methods for scoring genotyping errors (Clayton et al., Nat Genet, 2005)

Prospects for whole-genome screens: estimated numbers of « common » SNPs (MAF>5%)
• Direct studies of nsSNPs: ~30,000 - 50,000 SNPs
• Indirect studies of genes: ~300,000 - 500,000 SNPs
• « Nearly » whole genome: 500,000 - 1,000,000
• Whole genome: ~2,000,000 - 4,000,000

Choice of markers
• Optimal choice of markers requires detailed mapping of LD, e.g. based on HapMap data
• Truly optimal solutions are computationally intensive. Current chip designers are using single-marker r² cluster-based algorithms

Choices of markers have to be made
• The strategy used to define the subsets of variants to be typed has a substantial effect on the power & the quality of the study.
• Greater understanding of genomic variation has allowed more logical choices. Nonetheless, variant selection is always a pragmatic compromise.

Research key questions
• Are common human diseases due to common variants or multiple rare variants?
• Will rare or common SNPs be better candidates for a particular disease?
• Can large differences between populations in the frequency of an allele be merely due to chance?
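The two measures of allelic association introduced earlier, D' and r², can be computed directly from one haplotype frequency and the two allele frequencies. The sketch below uses the standard textbook definitions (D = f(A1,A2) - f(A1)·f(A2), D' = |D|/Dmax, r² = D² divided by the product of the four allele frequencies); the function and argument names are illustrative assumptions, not taken from the slides.

```python
import math


def ld_measures(p_hap, p_a, p_b):
    """Return (D, D', r2) for two biallelic markers.

    p_hap: frequency of the A1A2 haplotype, f(A1,A2)
    p_a:   frequency of allele A1 at marker 1, f(A1)
    p_b:   frequency of allele A2 at marker 2, f(A2)
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both markers must be polymorphic")
    # D is the departure of the haplotype frequency from the product of allele frequencies.
    d = p_hap - p_a * p_b
    # The largest |D| attainable given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    # Only haplotypes A1A2 (0.3) and a1a2 (0.7) are present.
    print(ld_measures(p_hap=0.3, p_a=0.3, p_b=0.3))
    # Haplotypes A1A2 (0.2), a1A2 (0.3) and a1a2 (0.5) are present.
    print(ld_measures(p_hap=0.2, p_a=0.2, p_b=0.5))
```

The first call corresponds to the two-haplotype case (D' = 1 and r² = 1); the second to the three-haplotype case, where D' is still 1 but r² is about 0.25, so the two SNPs no longer carry the same information.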

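The rule that, for a given power, the required sample size grows with 1/r² can be checked against the figures quoted in the slides (1,000 subjects at r² = 1). This is a minimal sketch of that scaling only; the function name is an illustrative assumption, and a real power calculation would also depend on effect size, allele frequency and significance threshold.

```python
import math


def required_sample_size(n_direct, r2):
    """Sample size needed through a marker in LD (r2) with the causal variant.

    n_direct is the sample size that would suffice if the causal variant
    itself were typed (r2 = 1). Uses the approximation N = n_direct / r2.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    if r2 == 0.0:
        return math.inf  # the marker carries no information on the causal variant
    return n_direct / r2


if __name__ == "__main__":
    for r2 in (1.0, 0.8, 0.5, 0.2, 0.0):
        print(r2, required_sample_size(1_000, r2))
```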