Intro to Computational Genetics
Workshop in Bioinformatics
Eran Halperin

The Human Genome Project
“What we are announcing today is that we have reached a milestone… that is, covering the genome in… a working draft of the human sequence.”
“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC, June 26, 2000

The Vision of Personalized Medicine
Genetic and epigenetic variants, together with measurable environmental and behavioral factors, would be used for personalized treatment and diagnosis.

Example: Warfarin
• An anticoagulant drug, useful in the prevention of thrombosis.
• Warfarin was originally used as rat poison.
• The optimal dose varies across the population.
• Genetic variants in VKORC1 and CYP2C9 affect the variation of the personalized optimal dose.

Association Studies
Genetic variants such as Single Nucleotide Polymorphisms (SNPs) and Copy Number Variants (CNVs) are tested for association with the trait.
• SNP = Single Nucleotide Polymorphism
• SNPs are usually bi-allelic.

Where should we look?
Cases:
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

Controls:
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

Associated SNP (marked in the figure)

Published Genome-Wide Associations through 6/2009
439 published GWA at p < 5 × 10^-8
NHGRI GWA Catalog, www.genome.gov/GWAStudies

Complex disease
Genetic Factors + Environmental Factors → Complex disease
• Multiple genes may affect the disease.
• Therefore, the effect of every single gene may be negligible.

How does it work?
• For every SNP we can construct a contingency table of allele counts:

            A    G    Total
Cases       a    b    n
Controls    c    d    n

where n = a + b = c + d. The allele frequencies and the test statistic are

\[ p_1 = \frac{a}{n}, \qquad p_2 = \frac{c}{n}, \qquad p = \frac{p_1 + p_2}{2} \]

\[ T = \frac{n\,(p_1 - p_2)^2}{2\,p\,(1 - p)} \]

Under the null hypothesis of no association, T approximately follows a chi-square distribution with one degree of freedom.

Results: Manhattan Plots

The curse of dimensionality: correcting for multiple testing
• In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.
• If we set the p-value threshold for each test to be 0.05, by chance alone we will “find” about 5% of the SNPs to be associated with the disease.
• This needs to be corrected.

Bonferroni Correction
• If the number of tests is n, we set the threshold to be 0.05/n.
• A very conservative test. If the tests are independent, it is reasonable to use it. If the tests are correlated, it can be far too strict:
– Example: if all SNPs are identical, there is effectively a single test, so dividing the threshold by n reduces the false positive rate more than needed and we lose a lot of power.

Data
International consortium (HapMap) that aims to genotype the genomes of 270 individuals from four different populations.
- Launched in 2002.
- First phase (2005): ~1 million SNPs for 270 individuals from four populations
- Second phase (2007): ~3.1 million SNPs for 270 individuals from four populations
- Third phase (ongoing): > 1 million SNPs for 1115 individuals across 11 populations
HUJI 2006

Other Data Sources
• Human Genome Diversity Project: 50 populations, 1000 individuals, 650k SNPs
• POPRES: 6000 individuals (controls)
• ENCODE Project: resequencing, discovery of new SNPs
• 1000 Genomes project
• dbGaP

Haplotypes
• Can 1,000,000 SNPs tell us everything?
• No, but they can still tell us a lot about the rest of the genome.
– SNPs in physical proximity are correlated.
– A sequence of alleles along a chromosome is called a haplotype.

Haplotype Data in a Block (Daly et al., 2001)
Block 6 from Chromosome 5q31

LD structure

Phasing - haplotype inference
Haplotypes (one chromosome from the mother, one from the father): ATCCGA, AGACGC
Genotype: A {G,T} {A,C} C G {A,C}
• Cost-effective genotyping technology gives genotypes and not haplotypes.
Possible phases:
ATACGA / AGCCGC
AGACGA / ATCCGC
…

Inferring Haplotypes From Trios
(0 and 1 are homozygous genotypes, 2 is heterozygous, ? is an allele not yet resolved)

            Genotype   Haplotypes as sites are resolved
Parent 1    122112     1??11?  →  10?11?  →  10011?
                       1??11?  →  11?11?  →  11111?
Parent 2    210022     ?100??  →  1100??  →  11000?
                       ?100??  →  0100??  →  01001?
Child       120222     1?0???  →  100???  →  10011?
                       1?0???  →  110???  →  11000?

Assumption: No recombination

Population Substructure
• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
– Many association signals are going to be found
– The vast majority of them are false. Why?

Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

Natural Selection
• Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene → different allele frequencies in LCT

Genetic Drift
• Even without selection, the allele frequencies in the population are not fixed across time.
• Consider the following case:
– We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
– We assume a constant population size, no mutation, no selection

Genetic Drift: The Wright-Fisher Model
Generation 1: allele frequency 1/9
Generation 2: allele frequency 1/9
Generation 3: allele frequency 1/9
Generation 4: allele frequency 1/3

Ancestral population → migration → genetic drift → different allele frequencies

Population Substructure
• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
– Many association signals are going to be found
– The vast majority of them are false. What can we do about it?

Jakobsson et al., Nature 451: 998-1003

Principal Component Analysis
• Dimensionality reduction
• Based on linear algebra
• Intuition: find the ‘most important’ features of the data
• Plot the data on a one-dimensional line for which the ‘spread’ is maximized.
• In our case, we want to look at two dimensions at a time.
• The original data has many dimensions: each SNP corresponds to one dimension.
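The PCA idea above (project individuals onto the line along which the spread is largest) can be sketched on toy data. This is a minimal illustration and not the analysis behind the slides: the function names, the simulated two-population data, and the drift parameter are all hypothetical, and only the first principal component is computed, using power iteration on the Gram matrix of the centered genotypes.

```python
import random

def simulate_genotypes(n_per_pop, n_snps, drift_sd, rng):
    """Toy data (hypothetical): two populations whose allele frequencies
    drifted apart from shared ancestral frequencies.
    Each genotype is a 0/1/2 count of one allele."""
    ancestral = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    genotypes = []
    for _ in range(2):
        freqs = [min(0.99, max(0.01, f + rng.gauss(0.0, drift_sd)))
                 for f in ancestral]
        for _ in range(n_per_pop):
            # Two independent allele draws per SNP give a 0/1/2 genotype.
            genotypes.append([(rng.random() < f) + (rng.random() < f)
                              for f in freqs])
    return genotypes

def first_principal_component(rows, n_iter=200):
    """Score of each individual on PC1, found by power iteration on the
    Gram matrix of the column-centered data (one column per SNP)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(gi, v)) for gi in gram]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(g * x for g, x in zip(gi, v))
                     for vi, gi in zip(v, gram))
    # Scale the unit eigenvector so entries are projections onto PC1.
    return [vi * eigenvalue ** 0.5 for vi in v]

rng = random.Random(0)
G = simulate_genotypes(n_per_pop=20, n_snps=500, drift_sd=0.15, rng=rng)
scores = first_principal_component(G)
print("mean PC1, population 1:", sum(scores[:20]) / 20)
print("mean PC1, population 2:", sum(scores[20:]) / 20)
```

In this toy setting the two populations should fall on opposite sides of zero along PC1, which is the kind of structure the HapMap PCA plots display with real data. In practice a linear algebra library would normally be used instead of hand-written power iteration.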
HapMap Populations
MKK LWK YRI GIH ASW CHD CEU MEX JPT CHB TSI

HapMap PCA 1-2

HapMap PCA 1-3

HapMap PCA 1,2,4

Ancestry Inference:
• To what extent can population structure be detected from SNP data?
• What can we learn from these inferences?
Novembre et al., 2008

Ancestry inference in recently admixed populations
Puerto Rican Population (GALA study, E. Burchard)
[Figure: percent racial admixture (European, African, Native American), from 0% to 100%, for individual subjects 1-90.]

Recombination Events
[Figure: Copy 1 and Copy 2 recombine to form the child chromosome.]
Probability r_i for recombination in position i.

Recently Admixed Populations
After generation 1
After generation 2
After generation 10

W: Recombination indicators
g: Generations
Z: Ancestral states
r: Recombination rate
X: Alleles
α: Admixture fraction
p, q: Allele frequencies

[Figure: overall accuracy by chromosome.]

Applications:
• Population genetics (admixture events, recombination events, selection forces, migration patterns)
• Potential applications in personalized medicine
• Finding new associations (through admixture mapping)

Admixture Mapping
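The variables listed for recently admixed populations (W, g, Z, r, X, α, p, q) can be tied together in a small generative sketch. The slides name the variables but not the exact model, so the form used here is an assumption: ancestry switches between neighboring SNPs with probability 1 - exp(-g·r), a fresh ancestry is drawn with probability α for the first population, and the allele is drawn from that population's frequency. The function name is hypothetical.

```python
import math
import random

def simulate_admixed_haplotype(p, q, r, g, alpha, rng):
    """Sketch of one admixed haplotype (assumed model, see lead-in).
    p[i], q[i]: allele frequencies at SNP i in the two ancestral populations
    r[i]: per-generation recombination rate between SNP i-1 and SNP i
    g: generations since admixture
    alpha: admixture fraction (probability of ancestry from population 1)
    Returns (W, Z, X): recombination indicators, ancestral states, alleles."""
    n = len(p)
    W, Z, X = [0] * n, [0] * n, [0] * n
    for i in range(n):
        if i > 0:
            # Assumption: probability of at least one crossover in g generations.
            W[i] = int(rng.random() < 1.0 - math.exp(-g * r[i]))
        if i == 0 or W[i] == 1:
            Z[i] = int(rng.random() < alpha)  # ancestry drawn afresh
        else:
            Z[i] = Z[i - 1]                   # ancestry copied from the previous SNP
        freq = p[i] if Z[i] == 1 else q[i]
        X[i] = int(rng.random() < freq)       # allele given the ancestry
    return W, Z, X

rng = random.Random(1)
n_snps = 60
W, Z, X = simulate_admixed_haplotype(
    p=[0.9] * n_snps, q=[0.1] * n_snps, r=[0.005] * n_snps,
    g=10, alpha=0.3, rng=rng)
print("ancestry:", "".join(map(str, Z)))
print("alleles: ", "".join(map(str, X)))
```

Increasing g in this sketch shortens the runs of constant ancestry, which mirrors the progression from generation 1 to generation 10 on the slides. Ancestry inference runs in the opposite direction: given the alleles X, estimate the hidden states Z.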