IBD Estimation in Pedigrees
Gonçalo Abecasis
University of Oxford

3 Stages of Genetic Mapping
  - Are there genes influencing this trait? Epidemiological studies
  - Where are those genes? Linkage analysis
  - What are those genes? Association analysis

Relationship Checking

Where are those genes?

Tracing Chromosomes
  Sometimes it is easy…
  [Figure: pedigree labelled with alleles 1 and 2]

Sharing, or Not?
  [Figure: pedigree labelled with allele 1 and unknown (?) alleles]

Data
  - Polymorphic markers, e.g. microsatellite repeats, SNPs
  - Allele frequency
  - Location

Task
  - Phase markers
  - Place recombinants

Complexity of the Problem
  - For each meiosis
      - In a pedigree with n non-founders, there are 2n meioses, each with 2 possible outcomes
  - For each location
      - One for each of m markers
  - Up to $4^{nm}$ distinct outcomes

Elston-Stewart Algorithm
  - Factorize likelihood by individual
      - Each step assigns phase for all markers, for one individual
  - Complexity $n \cdot 4^{m}$
  - Small number of markers
  - Large pedigrees
      - With little inbreeding

Lander-Green Algorithm
  - Factorize likelihood by marker
      - Each step assigns phase for one marker, for all individuals in the pedigree
  - Complexity $m \cdot 4^{n}$
  - Large number of markers
      - Assumes no interference
  - Relatively small pedigrees

Markov-Chain Monte-Carlo
  - Approximate solutions
      - Explore only most likely outcomes
  - Remove restrictions
      - Pedigree size
      - Number of markers
      - Inbreeding
  - Assuming no interference
  - Computationally intensive

Popular Packages
  - Elston-Stewart Algorithm
      - LINKAGE / FASTLINK (Lathrop et al, 1985)
      - VITESSE (O'Connell and Weeks, 1995)
  - Lander-Green Algorithm
      - Genehunter (Kruglyak et al, 1995)
      - Allegro (Gudbjartsson et al, 2000)
  - MCMC
      - Simwalk2 (Sobel et al, 1996)
      - LOKI (Heath, 1998)

1. Enumerate Possibilities
  - Enumerate geneflow patterns
  - Gene-flow pattern:
      - Sets transmitted allele for each meiosis
      - Implies founder allele for each individual
  [Figure: gene-flow patterns V1, V2, V3, V4 for Meiosis 1 and Meiosis 2]

2. Founder Allele Sets
  - For each gene flow pattern v, enumerate set A(G,v)
      - All allele states $a = [a_1, \ldots, a_{2f}]$
      - Compatible with both:
          - Gene flow v
          - Genotypes G
  - The likelihood is
      $L(v \mid G) = 2^{-2n} \sum_{a \in A(G,v)} \prod_{i} f(a_i)$
      - $f(a_i)$ is the frequency of allele $a_i$

For example ...
  [Figure: three pedigree panels titled Genotypes, Gene Flow and Founder Alleles]
  - Four meioses. Three "1" alleles required.
  - Likelihood = $\left(\tfrac{1}{2}\right)^{4} f(a_1)^{3}$

Single Marker Probabilities
  - We now have ...
      - Likelihood for each gene flow pattern
          - Conditional on genotypes
          - Conditional on allele frequencies
          - Conditional on a single marker
  - Probability for each gene-flow pattern
      $P(v) = L(v) / \sum_{v} L(v)$

3. Allowing for Recombination
  - Transition Probability
      $T(v_a \to v_b, \theta) = (1-\theta)^{\,n - r(v_a, v_b)}\; \theta^{\,r(v_a, v_b)}$
  - Transition Matrix (rows: Location A; columns: Location B)

           v1                                       v2                                       …
      v1   $(1-\theta)^{n_{meiosis}}$               $(1-\theta)^{n_{meiosis}-1}\,\theta$     …
      v2   $(1-\theta)^{n_{meiosis}-1}\,\theta$     $(1-\theta)^{n_{meiosis}}$               …
      …    …                                        …                                        …

Moving along chromosome
  - Input
      - Vector v of likelihoods at location A
      - Matrix T of transition probabilities A→B
  - Output
      - Vector v' of likelihoods at location B
      - Conditional on likelihoods at A
  - For k vectors, requires $k^2$ operations
      $L(v'_i \mid v) = \sum_{j} L(v_j)\, T(v_j \to v'_i, \theta)$

Elston and Idury Algorithm
  - Requires $k \log_2 k$ operations
  [Figure: the transition T applied to entries 1, …, 16 is split into two half-size problems on entries 1, …, 8 and 9, …, 16]
      top half of T[1, …, 16]    = $(1-\theta)$ T[1, …, 8]  + $\theta$ T[9, …, 16]
      bottom half of T[1, …, 16] = $(1-\theta)$ T[9, …, 16] + $\theta$ T[1, …, 8]

Moving Along Chromosome
  Each vector has one entry per gene-flow pattern v1, v2, …, vk; L(1 | ·) and L(2 | ·) are the vectors at locations 1 and 2.
      $L(1 \mid G_1)\; T(1, 2) = L(2 \mid G_1, \theta_{12})$
      $L(2 \mid G_1, \theta_{12}) * L(2 \mid G_2) = L(2 \mid G_1, \theta_{12}, G_2)$

Markov-Chains
  - Single Marker
  - Left Conditional
  - Right Conditional
  - Full Likelihood

MERLIN
  - Fast multipoint calculations
  - Non-parametric linkage analyses
  - Error detection
      - e.g., unlikely obligate recombinants
  - Haplotyping
      - most likely, exhaustive lists, sampling

Sparse Gene Flow Trees
  [Figure: PACKED TREE and SPARSE TREE, each with leaf likelihoods L1 and L2.
   Legend: node with zero likelihood; node identical to sibling; L1, L2 = likelihood for this branch]

Dense maps
  - Computational challenge
      - Require more memory
      - Require Lander-Green algorithm
      - Limited pedigree size
  - Computational advantages
      - Reduced recombination between markers
      - Approximate solutions possible if steps with many recombinants are ignored

MERLIN: Example Pedigrees

MERLIN: Timings
  Timings for Simultaneous Linkage Analysis, Haplotyping and IBD Estimation
  Grandparents Genotyped

  Pedigree                                  A (x1000)   B         C            D
  Genehunter (exact)                        36s         59m44s    -            -
  Allegro (exact)                           17s         2m06s     4h29m02s*    -
  Merlin (exact)                            10s         44s       42m37s       -
  Merlin (approximation, 2 recombinants)    13s         2s        5s           32s

  Simulations generated a map of 50 microsatellite markers at 1 cM spacing. The expected number of recombinants between consecutive markers is 0.4 (pedigree D). All timings are for a 700 MHz Pentium computer, using 2 GB of RAM.
  * Also using 20 GB of RAID storage for disk swapping

MERLIN: Memory Usage
  Timings for Haplotyping the Data of Keavney et al (1998)

  Method                                    Time             Memory
  Allegro (exact)                           21 min 25 sec    1500 MB
  Merlin (exact)                            48 sec           128 MB
  Merlin (approximation, 0 recombinants)    <1 sec           4 MB
  Merlin (approximation, 1 recombinant)     3 sec            4 MB
  Merlin (approximation, 2 recombinants)    23 sec           32 MB
  Merlin (approximation, 3 recombinants)    1 min 50 sec     64 MB

Command Line Options

Effect of Genotyping Error
  - Modest levels are likely
      - Up to 1% may be typical
  - Mendelian inheritance checks
      - Detect up to 30% of errors for SNPs
  - Effect on power
      - Linkage vs. Association
      - SNPs vs. Microsatellites

Affected Sib Pair Sample
  [Figure: plot of Average LOD]

Unselected Sample
  [Figure: Average lod retained (% of maximum) vs. Map position (cM)]

Association Analysis
  [Figure: Average LOD retained (% of maximum) vs. Error rate]

Error Detection
  - Genotype errors can introduce unlikely recombinants
  - Change likelihood
      - Replace $(1-\theta)$ with …
  - Test sensitivity of likelihood to each genotype
  - Detects errors that have largest effect on linkage
  [Figure: pedigree haplotypes with alleles 1 and 2; X marks a recombination]

Practical Exercise
  Lon Cardon
  Stacey Cherny
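
As a starting point for experimenting with the steps in "Single Marker Probabilities", "3. Allowing for Recombination" and "Elston and Idury Algorithm", here is a minimal Python sketch. It is an illustration under stated assumptions, not MERLIN's or any other package's implementation. Assumptions: each gene-flow pattern is encoded as an integer whose bits are the outcomes of the individual meioses, n in the transition probability is taken to be the number of meioses, r(va, vb) is taken to be the number of meioses at which two patterns differ, and the likelihood values are made up. All function names are hypothetical.

```python
def hamming(a, b):
    """Number of meioses whose outcome differs between patterns a and b."""
    return bin(a ^ b).count("1")


def normalise(lik):
    """Single-marker probabilities: P(v) = L(v) / sum over v of L(v)."""
    total = sum(lik)
    return [x / total for x in lik]


def transition_naive(lik, theta, n_meioses):
    """Direct update using k^2 operations, with k = 2**n_meioses.

    out[i] = sum_j lik[j] * (1 - theta)**(n - r) * theta**r,
    where r is the number of meioses at which patterns i and j differ.
    """
    k = 1 << n_meioses
    out = [0.0] * k
    for i in range(k):
        for j in range(k):
            r = hamming(i, j)
            out[i] += lik[j] * (1 - theta) ** (n_meioses - r) * theta ** r
    return out


def transition_fast(lik, theta):
    """Divide-and-conquer update using k log2 k operations.

    The vector is split on the highest meiosis bit; each half is updated
    recursively, then the halves are mixed with weights (1 - theta), theta.
    """
    k = len(lik)
    if k == 1:
        return list(lik)
    half = k // 2
    lo = transition_fast(lik[:half], theta)  # highest meiosis bit = 0
    hi = transition_fast(lik[half:], theta)  # highest meiosis bit = 1
    top = [(1 - theta) * a + theta * b for a, b in zip(lo, hi)]
    bottom = [(1 - theta) * b + theta * a for a, b in zip(lo, hi)]
    return top + bottom


def move_and_condition(lik_a, lik_b_single, theta):
    """Move likelihoods from location A to B, then multiply element-wise
    by the single-marker likelihoods at B."""
    moved = transition_fast(lik_a, theta)
    return [m * s for m, s in zip(moved, lik_b_single)]


# Made-up example: 4 meioses, so 16 gene-flow patterns.
n_meioses = 4
theta = 0.1
lik_a = [float((3 * i) % 7) for i in range(1 << n_meioses)]
lik_b = [1.0 if i % 2 == 0 else 0.0 for i in range(1 << n_meioses)]

slow = transition_naive(lik_a, theta, n_meioses)
fast = transition_fast(lik_a, theta)
print("naive and fast updates agree:",
      all(abs(s - f) < 1e-12 for s, f in zip(slow, fast)))
print("probabilities at B:",
      normalise(move_and_condition(lik_a, lik_b, theta)))
```

The recursive version mirrors the 16-entry figure: the first eight outputs combine the two half-size results with weights (1-θ) and θ, and the last eight use the same results with the weights swapped. Comparing it against the direct k² loop on small inputs is a cheap way to check an implementation before trying larger pedigrees.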