Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Complex inheritance/disease Many Other Genes SNP Resources: Variation Discovery, HapMap and the EGP Variant Gene Environment Disease Mark J. Rieder Department of Genome Sciences mrieder@u. washington..edu [email protected] Diabetes Obesity Cancer Strategies for Genetic Analysis C/C C/C C/T C/T C/C C/C C/T Strategies for Genetic Analysis (GWAS) Populations Association Studies Samples with phenotype data (continuous, case-control) (n = 100's - 1000's) Genotype samples with commercial 'chips' • Affymetrix - Random SNP design (v.5, v.6) • lllumina - preselected tagSNP design (650Y, 1M) C/T C/T C/T Schizophrenia Celiac Disease Autism Two hypotheses: 1- common disease/common variant? 2- common disease/many rare variants? NIEHS SNPs Workshop Jan 10-11, 2008 Families Linkage Studies Heart Disease Multiple Sclerosis Asthma C/C C/C C /C C/C 40% T, 60% C Cases 15% T, 85% C Perform statistical association with each SNP Calculate p-value for each SNP Controls Simple Inheritance Complex Inheritance Single Gene Multiple Genes Rare Variants Common Variants ~600 Short Tandem Repeat Markers Polymorphic Markers > 300,000 -1,000,000 Single Nucleotide Polymorphisms (SNPs) Region with plausible statistical association 1 SNP Resources: SNP discovery and cataloging Development of a genome-wide SNP map: How many SNPs? 1. SNP discovery/genotyping: Genome-wide approaches SNP Consortium HapMap 2. The current state of SNP resources (dbSNP) 3. Comprehensive SNP discovery NIEHS SNPs - Environmental Genome Project Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1- 5% MAF) - 1/300 bp SNP Databases - “How to” to” Manual for finding SNPs In class - Tutorial How has SNP discovery progressed toward this goal? HapMap Project: Create a genome-wide SNP map Phase I 1M SNPs Phase II 4M SNPs Genotype SNPs in four populations: • • • • CEPH (CEU) (Europe - n = 90, trios) Yoruban (YRI) (Africa - n = 90, trios) Japanese (JPT) (Asian - n = 45) Chinese (HCB) (Asian - n =45) Density ~ 1 SNP/kb 626 genes 15 Mbp 91,000 SNPs To produce a genome-wide map of common variation Common Variant/Common Disease Density - 1 SNP/166 bp 2 Finding SNPs: Marker Discovery and Methods Finding SNPs: Sequence-based SNP Mining Genomic mRNA SNP discovery has proceeded in two distinct phases: RT errors? cDNA Library 1 - SNP Identification/Discovery Define the alleles Map this to a unique place in the genome Sequencing Quality EST Overlap 2 - SNP Characterization (HapMap genotyping) Determination of the genotype in many individuals Population frequency of SNPs DNA SEQUENCING BAC Library RRS Library BAC Overlap Shotgun Overlap Sequence Overlap - SNP Discovery GTTACGCCAATACAGG GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGC GTTACGCCAATACAGCATCCAGGAGATTACC HapMap SNP Discovery: Prior to Genotyping Finding SNPs: Sequence-based SNP Mining Initiation of project planning (July 2001): 2.8 million SNPs (1.4 million validated) - 1/1900 bp Nov 2003 - 5.7 million (2 million validated) - 1/1500 bp Feb 2004 - 7.2 million (3.3 million validated) - 1/900 bp BAC = Bacterial Artificial Chromosome Primary vector for DNA cloning in the HGP DNA from multiple individuals Clone large fragments into BACs (unknown sequence) Generate more SNPs: Random Shotgun Sequencing Fragment DNA Sequence and Reassemble (known sequence) Genomic DNA (multiple individuals) Other Sources of SNPs: Perlegen (Affymetrix (Affymetrix chips) SNP data (chr22) Sequence chromatograms from Celera project Assembly with other overlapping BACs Sequence and align (reference sequence) TACGCCT TACGCCT ATA TCA TCA AGGAGAT GTTACGCCA GTTACGCCAATACAGGATCC ATACAGGATCC AGGAGATTACC GTTACGCCAATACAGG GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGC GTTACGCCAATACAGCATCCAGGAGATTACC 3 Draft Human Genome SNP Discovery: dbSNP database SNP data submitted to dbSNP: Clustering dbSNP processing of SNPs dbSNP -NCBI SNP database SNPs submitted by research community (submitted SNPs = ss#) Unique mapping to a genome location (reference eference SNP NP = rs#) Validated Unvalidated (by 2hit-2allele) Finding SNPs: Marker Discovery and Methods HapMap Discovery Increased SNP Density and Validated SNPs 12.0 TotalTotalSNPs Reference SNPs HapMap SNP Discovery 12.00 10.0 Validated SNPs Validated SNPs 11+ million rs SNPs 10.00 8.0 SNPs(millions) SNPs (millions) SNP discovery has proceeded in two distinct phases: 14.00 6.0 4.0 5.6 million validated rs SNPs 6.00 4.00 2.0 1 - SNP Identification/Discovery Define the alleles Map this to a unique place in the genome 8.00 2 - SNP Characterization (HapMap genotyping) Determination of the genotype in many individuals Population frequency of SNPs 2.00 0.00 0.0 Jan-03 Mar-03 Jan-03 Mar-03 Jun-03 Aug-03 Oct-03 Jan-04 Mar-04 Jun-04 Sep-04 Jan-05 Mar-05 Jun-06 Mar-07 Jun-03 Aug-03 Nov-03 Jan-04 Mar-04 Jun-04 dbSNP Release Sep-04 Jan-05 Mar-05 dbSNP Release 4 HapMap Project: Create a genome-wide SNP map Finding SNPs: Genotype Data Adds Value to SNPs Indiv.#1: Indiv.#1: GTTACGCCAATACAG[G GTTACGCCAATACAG[G/G]ATCCAGGAGATTACC Indiv.#2: Indiv.#2: GTTACGCCAATACAG[C GTTACGCCAATACAG[C/G]ATCCAGGAGATTACC Indiv.#3: Indiv.#3: GTTACGCCAATACAG[C GTTACGCCAATACAG[C/C]ATCCAGGAGATTACC Genotype SNPs in four populations: • • • • Confirms SNP as “real” real” and “informative” informative” Minor Allele Frequency (MAF) - common or rare MAF differs by different population Detection of SNP x SNP correlations (Linkage Disequilibrium) Determine tagging SNPs (tagSNPs (tagSNPs)) Determine haplotypes CEPH (CEU) (Europe - n = 90, trios) Yoruban (YRI) (Africa - n = 90, trios) Japanese (JPT) (Asian - n = 45) Chinese (HCB) (Asian - n =45) To produce a genome-wide map of common variation Interleukin 6 – Collapsed Dataset: Minor Allele Frequencey (MAF) > 10 % dbSNP: All validated SNPs now have been genotyped! 6.00 African-Am. Validated SNPs SNPs with Genotypes SNPs(millions) 5.00 European-Am. HapMap Genotyping 5.6 million validated rs SNPs 4.00 3.00 2.00 1.00 0.00 Jan-03 Mar-03 Jun-03 Aug-03 Oct-03 Jan-04 Mar-04 Jun-04 Sep-04 Jan-05 Mar-05 May-06 Mar-07 dbSNP Release • Most of the SNPs in the genome are rare • Significant differences in SNP frequency can exist • Correlation between SNPs can be used to select the most informative SNPs HapMap release = 4 M validated QC SNPs 5 Development of a genome-wide SNP map: How many SNPs? Finding SNPs: Sequence-based SNP Mining Genomic DNA SEQUENCING Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (> 1-5% MAF) - 1/300 bp Feb 2001 - 1.42 million (1/1900 bp) Nov 2003 - 2.0 million (1/1500 bp) Feb 2004 - 3.3 million (1/900 bp) Mar 2007 - 5.6 million (validated - 1/535 bp) Fraction of SNPs Discovered 1.0 BAC Overlap Shotgun Overlap Random Shotgun cDNA Library EST Align to Reference Overlap GTTACGCCAATACAGG GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGC GTTACGCCAATACAGCATCCAGGAGATTACC SNP discovery is dependent on your sample population size { RRS Library RANDOM Sequence Overlap - SNP Discovery When will we have them all? 2 chromosomes BAC Library mRNA SNP Characterization/Genotyping GTTACGCCAATACAGG GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGC GTTACGCCAATACAGCATCCAGGAGATTACC HapMap data doesn’ doesn’t capture low frequency SNPs 88 0.5 Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million common SNPs (>1-5% MAF) - 1/300 bp Mar 2007 - 5.6 million (validated/mapped - 1/535 bp) 2 0.0 0.0 0.1 0.2 0.3 0.4 When will we have them all? 0.5 HapMap data doesn’ doesn’t capture low frequency alleles Used as a proxy for all common variation through linkage disequilibrium Minor Allele Frequency (MAF) 6 Targeted SNP Discovery Increasing Sample Size Improves SNP Discovery Directed analysis: cSNPs GTTACGCCAATACAGGATCCAGGAGATTACC { GTTACGCCAATACAGG GTTACGCCAATACAGC GTTACGCCAATACAGCATCCAGGAGATTACC 2 chromosomes Arg-Cys 3’ Val-Val Fraction of SNPs Discovered 5’ 1.0 PCR amplicons Complete analysis: cSNPs, Linkage Disequilbrium and Haplotype Data 5’ 3’ Arg-Cys Val-Val 96 Resequencing (n=48) 48 24 16 HapMap Based on ~ 8 chromosomes 88 0.5 2 PCR amplicons 0.0 •Generate SNP data from complete genomic resequencing (i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence) 0.0 0.1 0.2 0.3 0.4 0.5 Minor Allele Frequency (MAF) Increasing SNP Density: HapMap ENCODE Project ENCODE = ENCyclopedia ENCyclopedia Of DNA Elements Catalog all functional elements in 1% of the genome (30 Mb) Goal: Comprehensively identify all common sequence variation in candidate genes 10 Regions x 500 kb/region (Pilot Project) David Altschuler (Broad), Richard Gibbs (Baylor) 16 CEU, 16 YRI, 8 HCB, 8 JPT Comprehensive PCR based resequencing across these regions Initial biological focus: Candidate environmental response genes involved in DNA repair, cell cycle, apoptosis, metabolism, cell signaling, and oxidative stress. 15,367 dbSNP 16,248 New SNPs 50% of SNPs in dbSNP Approach: Direct resequencing of genes Samples: PDR-90 ethnically diverse individuals representative of U.S. population (397 genes) EGP95-95 samples from four ethnic groups (227 genes) 5 Mb/31,500 SNPs = 1/160 bp (24 HapMap Asians, 22 HapMap Europeans, 12 HapMap Yorubans, 15 African Americans, 22 Hispanic) 7 Summary of NIEHS SNPs genotypes in dbSNP Development of a genome-wide SNP map: How many SNPs? Nickerson and Kruglyak, Nature Genetics, 2001 ~ 10 million SNPs (>1% MAF-estimated) - 1/300 bp 626 genes sequenced 15 Mb scanned > 90,000 genotyped SNPs identified > 8 million genotypes deposited in dbSNP NIEHS SNPs = 1/166 bp (n = 95, 4 pops) HapMap ENCODE = 1/160 bp (n = 48, 3 pops) Comprehensive resequencing can identify the vast majority of SNPs in a region Nov 2005 - Zaitlen et al. Genome Research 15:1594-1600 SNP Discovery: dbSNP database dbSNP (Perlegen/HapMap) Transcription coupled repair Mismatch repair NIEHS SNPs Double strand break repair 60% Pathways 15% { { 25% 25% Cell cycle Apoptosis Minor Allele Freq. (MAF) Rarer and population specific SNPs are found by resequencing 8 Protein disrupting SNPs in EGP data nsSNPs are found in ~60% of genes Deleterious nsSNPs are preferentially low frequency Computational inference of nsSNP function SIFT = Sorting Intolerant From Tolerant Evolutionary comparison of non-synonymous SNPs Classifications: Tolerant, Intolerant Ng and Henikoff, Genome Research, 12:436-446, 2002 PolyPhen - Polymorphism Phenotyping Structural protein characteristics and evolutionary comparison Classifications: benign, possibly damaging, probably damaging Ramensky, et al., Nucleic Acids Research, 30:17:3894-3900, 2002 9 Illumina NIEHS SNPs Genotyping EGP and HapMap Genotyping PDR = 90 ethnically diverse individuals representative of U.S. population Genotypes Allele Frequency HWE European (CEU,n=60) EGP p1 panel PDR ethnically diverse representation of US n = 90 (anonymous) 397 genes/55,000 SNPs African (YRI, n =60) Each well samples 1536 SNPs in one individual For each HapMap sample 5 x 1536 (7680 genotyped SNPs) 3,000,000 genotypes generated (total ~400 samples) Asian (HCB, n = 45) Select SNPs Asian (JPT, n = 45) ~7600 SNPs Illumina Afr.-American Afr.-American (n = 62) Hispanic (n = 60) NIEHS SNPs Genotype Data CEPH 0.341 Japanese Chinese African American 0.274 0 . 282 0.931 0 . 610 0.626 0.533 Hispanic 0.415 Yoruban PDR (397 genes) SNPs characterized in six different major major populations. populations. 0.841 CEPH 0 . 952 0 . 410 0 . 761 Japanese 0 . 420 0 . 765 Chinese 0 . 589 African American Population Allele Frequency Correlations Illumina NIEHS SNPs Genotyping 10 Hispanic EGP and HapMap Compatibility HapMap/US Pop. n = 332 Panel 2 (HapMap based, n=95) European (CEU,n=60) European (CEU,n=22) African (YRI, n =60) EGP p2 DNA panel HapMap Integration (~4 million SNPs) High Density Genic Coverage (EGP) African (YRI, n =12) Asian (HCB, n = 45) Asian (JPT, n = 45) Asian (JPT & HCB, n = 24 Low Density Genome Coverage (HapMap) Afr.-American Afr.-American (n = 15) Afr.-American Afr.-American (n = 62) Hispanic (n = 60) = EGP SNP discovery (1/200 bp) = HapMap SNPs (~1/1000 bp) Hispanic (n = 22) Summary: The Current State of SNP Resources Approximately 10 million common SNPs exist in the human genome (1/300 bp). Random SNP discovery processes generate many SNPs (HapMap) Random approaches to SNP discovery have reached limits of discovery and validation (1/600 bp; 50% SNP validation) Most validated SNPs (5+ million) have been genotyped by the HapMap (3 pops) Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap) NIEHS SNPs has generated SNP data on >600 candidate genes and 90 K SNPs along with large-scale HapMap+ characterization data for of SNPs generated using the PDR 11