Download Clustering for Accuracy, Performance, and Alternative

Disease Gene Identification: A Practical Guide to Techniques

Candidate gene, interval (linkage), association (disequilibrium)

Difficult for both groups (biological / computational). Biologists tend to understand the biological justifications; computerists are better qualified to tackle the underlying math and associated computations. (The statistics befuddle us both.)

Overview
• Review
• Linkage analysis
  – markers
  – SNPs and microarrays
  – pooling
  – parallel genotyping
  – what is a LOD score
      • 2-point
      • multipoint
  – TDT
  – programs
      • pros and cons
  – files and formats
• Linkage disequilibrium
  – experimental results
• Demo

Review
• meiosis: produces haploid gametes and is the mechanism for transmission of genetic material, independent assortment, and recombination between loci
• marker: an informative marker is used to observe the genetic state at a particular genomic location, which enables the observation of the transmission
• links to marker D15S160

Marker D15S643 (153) and Genotypes
  – (min, max) 143,159; CEPH 147,145; 149,149

AATTGCTCTGAGTTCTGAGGC
>chr15:72,091,076-72,091,409
CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA
ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA
TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG
TTCTGAGGC (22) TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT (41)
CTCAAG (6) acacacacacacacacacacacacacacacacacacacacacac
acacac (50) TCCTCAAGC (9) CGTTAGACTCCATTCCCATGTAGTA (25) TCCAAATAAG
TTTTACAGCAAGACACACTGGAGAGATTGAAGCT
TACTACATGGGAATGGAGTCTAACG
ATGATGTACCCTTACCTCAGATTGC

D15S160 (both strands shown, top strand first)
>chr15:72091076-72091409
CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA
GTCGACTAGAAATCCTTTGTAAATCCCCCTCCTCCGTGAGGAAAGTTTAT
ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA
TGGAAAGAAATCTGTCCAAAGACTAGACTAAGTTCCGGTGTAGGACCGGT
TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG
AGACCAAAGACATTGAGTCTCTTAATGACGAGGACTA TTTAACGAGACTC
TTCTGAGGC TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT
AAGACTGGC ATGACGACAGTATAACGTAAGAGGCTGGTAAAAGGTCCAGA
CTCAAGacacacacacacacacacacacacacacacacacacacacacac
GAGTTCtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtg
acacacTCCTCAAGC CGTTAGACTCCATTCCCATGTAGTA TCCAAATAAG
tgtgtgAGGAGTTCG GCAATCTGAGGTAAGGGTACATCAT AGGTTTATTC
TTTTACAGCAAGACACACTGGAGAGATTGAAGCT
AAAATGTCGTTCTGTGTGACCTCTCTAACTTCGA

30 cycles yields 2^30 = 1.07x10^9 molecules

PCR cycle (figure): heat; primers, nts, enzymes; extension. Repeat for 30 cycles, then run on gel.

Gel (figure): allele sizes 159, 157, 155, 153, 151, 149, 147, 145, 143. Genotypes: (min, max) 143,159; CEPH 147,145; 149,149; 153 (genomic).

DNA Pooling

Pooling is a Potential Alternative to Genotyping
• Pool
  – parents and offspring
  – affecteds and unaffecteds
  – fathers, mothers, offspring
• Advantage
  – high-throughput
  – cost
• Disadvantage
  – does it work?

Example (figure): fathers, mothers, offspring

“SNPs”
• Single-Nucleotide Polymorphisms
• 1 every 1000 bp (estimated)
• 2,972,052 SNPs submitted to dbSNP
  – 50% of all SNPs are in question
  – 10% of UTRs have SNPs
• 100,000 - 500,000 SNPs needed (for association)
• Why don’t we do this?
  – $$$

Mutation Detection (figure: Strachan, Human Molecular Genetics 2, pg 412)

Minisequencing (figure: Strachan, Human Molecular Genetics 2, pg 412)

Fundamental Genetics
• meiosis
  – humans (Hs) are diploid
  – meiosis produces haploid gametes
  – mechanism for transmission of genetic material to offspring
  – recombination by cross-over (Holliday structure) or by independent segregation of homologous pairs

Fundamental Genetics (Background for Linkage Analysis)
• Rule of Segregation
  – offspring receive ONE allele (genetic material) from the pair of alleles possessed by EACH parent
• Rule of Independent Assortment
  – alleles of one gene can segregate independently of alleles of other genes
  – (Linkage Analysis relies on the violation of the Independent Assortment Rule)

Genetic Marker … Prelude to LA
  – A genetic marker allows for the observation of the genetic state at a particular genomic location (locus). A genotype is the measured state of a genetic marker.
      • May never be feasible to sequence cases directly.
  – An “informative” marker is often “heterozygous,” or “polymorphic,” and enables the observation of the inheritance of genetic material.

Monogenic and Polygenic Diseases
  – monogenic (Mendelian): one gene
      • “simple” (dominant and recessive) Mendelian inheritance
      • direct correspondence between one gene mutation and one disorder
      • majority of disease genes found are monogenic
  – polygenic (complex, nonmendelian): multiple genes
      • heterogeneity and epistasis
      • combinatorics
      • no longer have direct correspondence between one gene and disorder
      • majority of disorders are probably polygenic: complexity of organisms and observed pathways

...Monogenic and Polygenic Diseases
• phenocopy
• reduced penetrance
  – Example: sickle cell anemia
      • “classic” recessive disorder
      • defect in red blood cells (hemoglobin)
      • but… infant hemoglobin gene can “leak”
      • wide range of phenotypes

H-W
f(AA) = p^2, f(Aa) = 2pq, f(aa) = q^2
(p + q)^2 = p^2 + 2pq + q^2
(p + q + r)^2 = p^2 + q^2 + r^2 + 2pq + 2pr + 2qr

Dominant and Recessive Penetrance Modeled
penetrance = P(pt | gt)

Model                            DD    Dd    dd
Dominant                         1     1     0
Recessive                        0     0     1
Dominant, reduced penetrance     0.9   0.9   0.0
Recessive, reduced penetrance    0     0     0.8

D-R Heterogeneous, DD Epistatic

D-R heterogeneous:
        AA   Aa   aa
BB      1    1    0
Bb      1    1    0
bb      1    1    1

DD epistatic:
        AA   Aa   aa
BB      1    1    0
Bb      1    1    0
bb      0    0    0

• reduced penetrance
• number of multilocus genotypes: 3, 9, 27, 81, 243, … 3^n

Linkage
• theta = recombination fraction = R/(NR + R)
• Close = 0, Far = 0.5
• M2 and B: 3/(4+3) = 0.43
• A and B: 2/7 = 0.29
• What about M1 and B (0.57)????

Linkage Analysis
Goal: find a marker “linked” to a disease gene.
• LOD score = log of likelihood ratio
• LR[θ; data] == k P[data; θ]
• theta = estimate of genetic distance (recombination fraction) between marker and disease
  = proportion of recombinant gametes / total gametes

…Linkage Analysis
• Linkage analysis calculates the likelihood that the inheritance pattern of the phenotype (disease) is supported by the observed inheritance patterns (genotypes) in a pedigree.
• parametric
  – requires a precise genetic model
  – linkage analysis
• nonparametric
  – no model
  – association (linkage disequilibrium), TDT, IBD, ASP, etc.

So which one (parametric or nonparametric)???
• Even with its dependence on a genetic model, parametric analysis is typically as powerful as, or more powerful than, nonparametric analysis for identifying candidate loci…
• however, parametric (linkage) analysis generally requires families (people have to be related)…
• however, some nonparametric methods (association) are not limited by this…
• however, it has been shown that some nonparametric methods, such as ASP, would require infeasibly large samples to detect susceptibility (Risch and Merikangas 1996).

Study Design
• genome-wide screen by linkage
• then narrow candidate region by disequilibrium mapping
OR
• candidate gene approaches

Linkage Analysis and Problems with Nonmendelian Disorders
  – few monogenic models, easy to test
  – more difficult to find models explaining inheritance in polygenic disorders
  – diagnostic criteria are often more difficult to establish for nonmendelian disorders
      • BP
      • obesity
      • psychiatric disorders (autism, schizophrenia)
      • Bardet-Biedl syndrome

How to Address Difficulties
• Seek families in which the disease segregates in a near-mendelian manner.
• Use affected pedigree members only in a parametric analysis.
• Use a nonparametric (model-free) method of linkage analysis.

Elston-Stewart Algorithm
• Human Heredity, 21: 523-542 (1971)
• Example? See handout.
• Take home: linkage calculation is difficult, complicated, and tedious; best left to computer programs.
• Kruglyak L, et al. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 58:1347-1363.
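
To make the slide definitions theta = R/(NR + R) and "LOD score = log of likelihood ratio" concrete, here is a minimal Python sketch. It assumes the simplest possible setting, which the slides do not spell out: phase-known, fully informative meioses that can each be scored as recombinant or nonrecombinant. Real pedigrees need the Elston-Stewart or Lander-Green machinery in programs such as FASTLINK or GeneHunter. The function names are illustrative and do not come from any linkage package.

```python
import math


def recombination_fraction(recombinants: int, nonrecombinants: int) -> float:
    """theta = R / (NR + R), the proportion of recombinant gametes."""
    total = recombinants + nonrecombinants
    if total == 0:
        raise ValueError("need at least one informative meiosis")
    return recombinants / total


def two_point_lod(theta: float, recombinants: int, nonrecombinants: int) -> float:
    """LOD(theta) = log10[ L(theta) / L(0.5) ] for phase-known meioses.

    Assumption: each meiosis is independently recombinant with
    probability theta, so L(theta) = theta^R * (1 - theta)^NR.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    total = recombinants + nonrecombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1.0 - theta))
    log_l_null = total * math.log10(0.5)  # no linkage: theta = 0.5
    return log_l_theta - log_l_null


def best_lod_on_grid(recombinants: int, nonrecombinants: int, steps: int = 50):
    """Return (max LOD, theta) over a grid of theta values in (0, 0.5]."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max((two_point_lod(t, recombinants, nonrecombinants), t) for t in grid)


if __name__ == "__main__":
    # The "M2 and B" example from the Linkage slide: R = 3, NR = 4.
    print(recombination_fraction(3, 4))
    print(best_lod_on_grid(3, 4))
    # Ten hypothetical meioses with no recombinants give much stronger support.
    print(best_lod_on_grid(0, 10))
```

With only seven meioses and theta near 0.43, the maximized LOD stays close to zero, which is why LOD scores are summed across many pedigrees before they are compared with a threshold.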
Linkage Analysis Programs
• FASTLINK: 2-point
  – O(n^2), where n = number of markers
  – O(n), where n = number of people
• GeneHunter: multipoint, 2-point, parametric, and non-parametric LOD (NPL)
  – NPL == alleles shared IBD
  – O(n^2), where n = number of people
  – O(n), where n = number of markers
  – 2n - f < 16 (n = nonfounders, f = founders)

GeneHunter
• GeneHunter
  – multipoint
  – 2-point
  – parametric
  – non-parametric LOD (NPL)
• NPL == alleles shared IBD
• typically expressed as a “p-value”
• “significant” threshold is not as obvious

GeneHunter
• genome-wide p value == probability that the observed value will be exceeded anywhere in the genome, assuming the null hypothesis of no linkage

Criteria for linkage in complex diseases (Lander and Kruglyak 1995)
  – Suggestive linkage: a lod score or p value that would be expected to occur once by chance in a whole genome scan
  – Significant linkage: a lod score or p value that would be expected to occur by chance 0.05 times in a whole genome scan (p = 0.05)
  – Highly significant linkage: a lod score or p value that would be expected to occur by chance 0.001 times in a whole genome scan
  – Confirmed linkage: a significant linkage observed in one study is regarded as confirmed when a further study finds a lod score or p value that would be expected to occur 0.01 times by chance in a specific search of the candidate region

2-point vs. multipoint (figure: markers M, M2, M3, M4 and the disease locus, with theta between a marker and the disease)

Linkage Disequilibrium
  – Association: particular alleles at 2 or more loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies
  – aka disequilibrium
  – disease-bearing chromosomes must descend from one or a few individuals
  – generally need cases vs. controls
  – TDT (heterozygous marker transmitted), HRR (untransmitted alleles as control)
  – IBD
  – allelic associations (outbred populations) maintained at only <<= 1 cM
  – ASHG 2001
      • LD maintained up to 20-30 kb
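
The definition of allelic association above (alleles occurring together at frequencies different from those predicted from the individual allele frequencies) is commonly quantified with the disequilibrium coefficient D = P(AB) - P(A)P(B), along with the normalized measures |D'| and r^2. The slides do not name these measures, so the sketch below is an illustration using standard population-genetics definitions. The function name and the haplotype frequencies in the usage example are hypothetical.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    for p in (p_ab, p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie in [0, 1]")

    # Observed haplotype frequency minus the frequency expected
    # if the two alleles were independent.
    d = p_ab - p_a * p_b

    # Largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(ld_measures(p_ab=0.40, p_a=0.50, p_b=0.60))
    # Independent loci: P(AB) equals P(A) * P(B), so D is zero.
    print(ld_measures(p_ab=0.25, p_a=0.50, p_b=0.50))
```

Whether a nonzero D is "significantly different" from zero still requires a statistical test on the underlying counts; this sketch only computes the point estimates.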
Complex Disorders: Case Study
• Schizophrenia
  – LOD of 6
  – tried many different diagnostic criteria, then selected the best
• Breast cancer
  – 1990: locus mapped to 17q21 (BRCA1)
  – confirmed by 2 more groups
  – narrowed to 8 cM
  – 1994: second locus mapped to 13q12 (BRCA2)
  – BRCA1 cloned in 1994, BRCA2 in 1995

Transmission Disequilibrium Test (TDT): Linkage Disequilibrium
• Spielman, et al. Transmission Test for Linkage Disequilibrium: The Insulin Gene Region and Insulin-dependent Diabetes Mellitus (IDDM). Am J Hum Genet 52:506-516, 1993.
• TDT = (n12 – n21)^2/(n12 + n21), where, across N families, n12 is the number of heterozygous parents who transmit allele 1 (and not allele 2) to an affected offspring and n21 is the number who transmit allele 2 (and not allele 1)

LOD Properties
• Lods are additive across pedigrees
• “Significant” linkage for LOD >= 3.0
• Heterogeneity LOD (Het-LOD)
  – LOD calculation over another parameter, alpha, where alpha is the proportion of families linked to the disease
  – can only raise the LOD score
• Typically perform LODs over 2 models, dominant and recessive
  – which may affect your cutoff
  – if you maximize over all parameters, you run the risk of erroneously obtaining a “significant” LOD

Files and Formats
• datain.dat
  – genetic model
  – marker description
  – allele frequencies
• pedin.ped
  – pre-makeped file
  – used by genehunter
  – pedigree information, affection status, genotypes

Example: pedin.ped
1000 1 0 0 1 1 5 4 6 4
1000 2 0 0 2 1 4 3 6 6
1000 3 1 2 1 2 5 4 6 6
1000 4 1 2 1 2 4 3 6 4
(pedigree figure: individuals 1, 2, 3, 4)

Example: pedin.dat
1000 1 0 0 3 0 0 1 1 1 5 4 6 4  Ped: 1000  Per: 1
1000 2 0 0 3 0 0 2 0 1 4 3 6 6  Ped: 1000  Per: 2
1000 3 1 2 0 4 4 1 0 2 5 4 6 6  Ped: 1000  Per: 3
1000 4 1 2 0 0 0 1 0 2 4 3 6 4  Ped: 1000  Per: 4

Example: datain.dat
11 0 0 5 << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF1)
1 2 3 4 5 6 7 8 9 10 11
1 2 << AFFECTION, NO. OF ALLELES
0.9 0.1 << GENE FREQUENCIES
1 << NO. OF LIABILITY CLASSES
0 0.99 0.99 << PENETRANCES
3 5 D13S794 << ALLELE NUMBERS, NO. OF ALLELES
(annotation: marker name goes here, e.g. GATA48C10)
0.22957894 0.51578945 0.24736843 0.005263158 0.0020 << GENE FREQUENCIES

Where does the data come from?
• A lab
• A spreadsheet
• A napkin
• A database
Where do allele frequencies come from?
  – Published estimates
  – You calculate them
• Example: autism, 400 people, 300 markers = 120,000 genotype pairs

Demo

Code sources and documents (links)
• Linkage Analysis
  – FASTLINK
  – GeneHunter2

Fundamental Genetics and Probability Concepts
• meiosis and sampling
• patterns of inheritance
• monogenic and complex inheritance
  – phenocopy
  – reduced penetrance
• DNA variation
  – polymorphisms, SNPs, and mutations
• positional cloning

Examples (figure slides)

BBS4 Pedigree (figure)

Dom-Rec Heterozygous Screen (figure): genes A, B?, b

Uninformative Marker (figure)

Informative Marker (figure)

• Given the following observations: family structure, affection status, genotypes, and disease allele frequencies. Assuming a model for the disease, can we calculate the probability that these observations “fit” the assumed model???

BBS2 genetic mapping (figure: C16, positions 1 to 12)

BBS2 genetic mapping (figure: affected vs. unaffected, C16, positions 1 to 12)

Summary
• Disease Gene Identification
  – challenges
  – interval localization
      • genotyping and genetic markers, linkage analysis, allele sharing, association studies (“SNiPs”), homozygosity mapping
  – disease gene identification techniques
• Take home
  – A complex disorder (with interacting genes) has yet to be characterized

Allele Sharing
• tries to show that affected family members inherit the same chromosomal regions more often than expected by chance

Allele Sharing Example
Needs at least sibs.

Association Studies
• “Allelic association studies provide the most powerful method for locating genes of small effect contributing to complex diseases and traits.” Daniels, Am J Hum Genet 62:1189-1197, 1998.
• Linkage analysis
  – genome-wide screen, 400 markers ~ 10 cM (10 Mb) apart; association needs 4000+ polymorphic markers
  – generally need nuclear family or larger
• Association finds “linkage disequilibrium”

Association Studies
• “Association is simply a statistical statement about the co-occurrence of alleles or phenotypes. Allele A is associated with disease D if people who have D also have A more (or maybe less) often than would be predicted from the individual frequencies of D and A in the population.” Pg. 286, Human Molecular Genetics 2, Tom Strachan

Examples
• HLA-DR4 (antigen marker)
  – 36% in UK
  – 78% with rheumatoid arthritis
• CF (RFLP markers XV2.c (X1, X2), KM19 (K1, K2))

  Marker alleles    CF (case)    Normal (control)
  X1, K1            3            49
  X1, K2            147          19
  X2, K1            8            70
  X2, K2            8            25

  – CF associated with X1, K2 in ‘89 (Strachan)

Hardy-Weinberg Equilibrium
• Rule that relates allelic and genotypic frequencies in a population of diploid, sexually reproducing individuals if that population has random mating, large size, no mutation or migration, and no selection
• Assumptions
  – allelic frequencies will not change in a population from one generation to the next
  – genotypic frequencies are determined in a predictable way by allelic frequencies
  – the equilibrium is neutral: if perturbed, it will reestablish within one generation of random mating at the new allelic frequency

Equilibrium

Homozygosity Mapping
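
As a small illustration of the Hardy-Weinberg slides, the Python sketch below computes f(AA) = p^2, f(Aa) = 2pq, f(aa) = q^2 and checks the claim that a perturbed population returns to equilibrium after one generation of random mating. It assumes a single biallelic autosomal locus and the idealized conditions listed above (random mating, large size, no mutation, migration, or selection). The function names and the starting genotype frequencies are hypothetical.

```python
def hardy_weinberg(p: float) -> dict:
    """Expected genotype frequencies for allele frequency p of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_frequency(genotypes: dict) -> float:
    """Frequency of allele A: all of AA plus half of Aa."""
    return genotypes["AA"] + 0.5 * genotypes["Aa"]


def one_generation_of_random_mating(genotypes: dict) -> dict:
    """Offspring genotype frequencies under random union of gametes."""
    return hardy_weinberg(allele_frequency(genotypes))


if __name__ == "__main__":
    # Hypothetical perturbed population with no heterozygotes at all.
    start = {"AA": 0.5, "Aa": 0.0, "aa": 0.5}
    gen1 = one_generation_of_random_mating(start)
    gen2 = one_generation_of_random_mating(gen1)
    print(gen1)
    print(gen2)  # same as gen1: equilibrium is reached after one generation
```

The allele frequency is unchanged by random mating here, so the second generation reproduces the first, which matches the "reestablish within one generation" statement on the slide.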
