Software tools for the analysis of medically important sequence variations

Gabor T. Marth, D.Sc.
Boston College, Department of Biology
http://bioinformatics.bc.edu/marthlab
Pfizer visit, March 7, 2006

Our lab focuses on three main projects…
1. software tools for clinical case-control association studies
2. software for SNP discovery in clonal and resequencing data
3. connecting HapMap and pharmaco-genetic data

1. We are developing computer software to aid tagSNP selection and association testing
[Diagram: software overview. Components: GUI; user input; user control interface; input data views; study specification; reference samples; gene annotations; tags; representative computational samples; computational sample database; tag evaluation; marker selection; association testing; LD views; association statistics]
[Figure: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes 0 to 1, with series by marker separation: 1-4, 5-9, 10-17, 18-26]
(discussed in more detail)

2. We build computer tools for SNP discovery
• inherited (germ line) polymorphisms are important as they can predispose to disease
• looking for SNPs and short INDELs
\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S} \frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1,\ldots,S_N)}{\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
\]
Marth et al. Nature Genetics 1999
• we have a 5-year NIH R01 grant to re-develop our computer package, PolyBayes©, our SNP discovery tool originally developed while the PI was at the Washington University Medical School

Apply our tools for genome-scale SNP mining
[Diagram: EST, WGS and BAC sequences aligned to the genome reference; ~10 million]
Sachidanandam et al. Nature 2001

Extend our methods for SNP detection in medical resequencing data from traditional Sanger sequencers…
[Figure: sequence traces labeled Homozygous C, Heterozygous C/T, Homozygous T]
… and in 454 pyrosequence data
• detection of heterozygotes in medical re-sequencing data
• accurate base calling for de novo sequencing
[Figure: 454 sequence from the NCBI Trace Archive]
(discussed in more detail)
Figure from Nordfors et al. Human Mutation 19:395-401 (2002)

Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms)
• the detection of somatic mutations, and their distinction from inherited polymorphisms, will be important to separate predisposing variants from mutations that occur during disease progression, e.g. in cancer
[Figure © Brian Stavely, Memorial University of Newfoundland]
(discussed in more detail)

Process DNA methylation data obtained with sequencing
• DNA methylation is important, e.g. because hypo- and hypermethylation are consistently present in various cancers
• we are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bisulfite conversion of unmethylated cytosines
Issa. Nature Reviews Cancer, 4, 2004: 988-993
Lewin et al. Bioinformatics, 20:3005-3012, 2004

… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development
• somatic mutations
• chromosome rearrangements
• methylation profiles
• chromatin structure
• copy number changes
• gene expression profiles
• repeat expansions

3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes
• predicting metabolic phenotypes (ADR) based on haplotype markers
• evolutionary origin of drug metabolizing enzyme polymorphisms

Computer software to aid case-control association studies: tagSNP selection and association testing (details)
[Figure: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes 0 to 1, with series by marker separation: 1-4, 5-9, 10-17, 18-26]
Dr. Eric Tsung

Clinical case-control association studies: concepts
• association studies are designed to find disease-causing genetic variants
• cases and controls are genotyped at various polymorphisms
• we search for “significant” marker allele frequency differences between cases and controls
[Diagram: clinical cases with AF(cases); clinical controls with AF(controls)]

Association study designs
• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”)
• direct or indirect: genotyping the causative variant itself, or a marker that is co-inherited with the causative variant
• single-SNP marker or multi-SNP haplotype marker
• single-stage or multi-stage

Marker (tag) selection for association studies
For economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen
1. hypothesis driven (i.e. based on gene function)
2. LD-driven: based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with

The International HapMap project
The International HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure
http://www.hapmap.org

LD varies across samples
There are large differences in LD between different human populations…
[Figure: LD in the European reference (CEU) and the African reference (YRI)]
… and even between samples from the same population.
[Figure: LD in other European samples]

Sample-to-sample LD differences make tagSNP selection problematic
Groups of SNPs that are in LD in the HapMap reference samples may not be in LD in a future set of clinical samples…
… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples…
… possibly resulting in missed disease associations.
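The tag-transfer problem described above can be illustrated with a small sketch. It assumes phased, biallelic haplotypes coded as 0/1 lists and uses the standard r2 statistic; the function names and the r2 >= 0.8 cutoff are illustrative assumptions, not taken from the software presented here.

```python
from typing import List, Sequence


def r_squared(haplotypes: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Pairwise LD (r^2) between sites i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # allele-1 frequency at site i
    p_j = sum(h[j] for h in haplotypes) / n          # allele-1 frequency at site j
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic site has undefined r^2; treated as "no LD" here
    d = p_ij - p_i * p_j                             # disequilibrium coefficient D
    return d * d / denom


def tagged_sites(haplotypes: Sequence[Sequence[int]], tag: int,
                 threshold: float = 0.8) -> List[int]:
    """Sites whose r^2 with the tag meets the (assumed) threshold."""
    n_sites = len(haplotypes[0])
    return [j for j in range(n_sites)
            if j != tag and r_squared(haplotypes, tag, j) >= threshold]


# Toy data: site 0 tags site 1 in the "reference" but not in the "clinical" sample.
reference = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
clinical = [[0, 0, 0], [0, 1, 1], [1, 1, 0], [1, 0, 1]]
print(tagged_sites(reference, tag=0))  # sites represented by the tag in the reference
print(tagged_sites(clinical, tag=0))   # sites still represented in the other sample
```

Comparing the two lists shows a tag chosen in one sample losing its coverage in another, which is the failure mode the slide warns about.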
Natural marker allele frequency differences confound association testing
• the HapMap reference samples are much smaller than clinical sample sizes (reference samples: ~120 chromosomes; cases: 500-2,000 chromosomes; controls: 500-2,000 chromosomes)
• it is difficult to accurately assess both marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation of marker allele frequency differences between cases and controls
• it is therefore difficult to assess the statistical significance of candidate associations
[Diagram: AF(cases) compared with AF(controls)]

We are developing technology for assessing sample-to-sample variance in silico
We estimate LD differences between HapMap and future clinical samples…
…by generating “computational” samples representing future clinical samples…
… and use computational “proxy” samples for tabulating LD and allele frequency differences.
[Diagram: the reference is used for tag selection, tag evaluation and association testing in cases and controls; computational “cases” and “controls” stand in for the clinical samples]

Two methods of computational sample generation
Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. It is a full model but computationally slow.
Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. It is an approximation but fast.
[Diagram: HapMap samples used to generate computational “HapMap”, “cases” and “controls” samples]
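The mosaic idea behind Method 2 can be sketched in a few lines. This is a deliberately simplified copying process, not the full Li & Stephens model and not the implementation used in this project: the per-site `switch_prob`, the function names and the absence of a mutation step are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def mosaic_haplotype(reference: Sequence[Sequence[int]], switch_prob: float,
                     rng: random.Random) -> List[int]:
    """Build one new haplotype by copying segments from reference haplotypes."""
    n_sites = len(reference[0])
    source = rng.randrange(len(reference))       # haplotype currently being copied
    new_hap = []
    for site in range(n_sites):
        # With probability switch_prob, jump to another source haplotype;
        # this mimics a recombination breakpoint between adjacent sites.
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))
        new_hap.append(reference[source][site])
    return new_hap


def computational_sample(reference: Sequence[Sequence[int]], size: int,
                         switch_prob: float = 0.05,
                         seed: Optional[int] = None) -> List[List[int]]:
    """Generate `size` mosaic haplotypes from the reference panel."""
    rng = random.Random(seed)
    return [mosaic_haplotype(reference, switch_prob, rng) for _ in range(size)]


# Toy reference panel of four haplotypes over six sites.
panel = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
proxy_cases = computational_sample(panel, size=10, switch_prob=0.2, seed=1)
proxy_controls = computational_sample(panel, size=10, switch_prob=0.2, seed=2)
```

Generating many such pairs of proxy samples and tabulating LD or allele frequency differences across them is one way to approximate the sample-to-sample variability discussed above.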
Computational samples
[Figure: LD panels for HapMap (CEU); Computational (PAC); Extra genotypes (Estonia); Computational (Coalescent)]

MARKER EVALUATION with computational samples
Test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group

MARKER SELECTION with computational samples
Select tags in multiple consecutive sets of computational samples and choose the best-performing tags for the association study

ASSOCIATION TESTING with computational samples
Tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences, which is used to decide if a candidate association is statistically significant
[Diagram: repeated pairs of computational “cases” and “controls”; AF(cases) compared with AF(controls)]

Do computational samples represent future clinical genotypes realistically?
We quantify the quality of representation by comparing the correlation of LD between corresponding pairs of markers (i.e. we ask: if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
[Figure: scatter plot, both axes 0 to 1]

LD difference: comparison to extra experimental genotypes
• we have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA)
[Figure values: 0.949 +/- 0.013; 0.963 +/- 0.014; 0.978 +/- 0.010]

AF difference: comparisons to extra experimental genotypes
[Figure: AF Diff, Comp Samples plotted against AF Diff, Estonian Data; both axes 0 to 0.06]
• according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability

A new marker selection and association testing software tool
• data visualization
• gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)
• representative computational sample generation
• advanced tag selection functionality
• advanced association testing functionality
• multi-level user customization including user conveniences, e.g. tag prioritization based on SNP assay score
[Diagram: tags; gene annotations; LD views; reference samples; representative computational samples; association statistics]
[Figure: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes 0 to 1, with series by marker separation: 1-4, 5-9, 10-17, 18-26]

User community
• companies designing new generations of whole-genome or specialized SNP arrays
• researchers comparing alternative platforms (e.g. Affymetrix 500K and the Illumina 300K) to find the one most suitable for their study
• clinical researchers designing candidate gene studies
• researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples)
• the association testing features should be useful for analysts regardless of study design

Base calling and SNP detection in sequence traces including 454 data
Aaron Quinlan

Base calling and SNP detection in sequence traces including 454 “pyrogram” data
• PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects
• medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces
[Diagram: two double-stranded 5’-3’ sequences differing at one position, C/G vs. T/A]

Heterozygote detection in sequence traces
[Figure: individual traces for Ind. 1, Ind. 2, Ind. 3, Ind. 4]
• we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions

Aggregating information from multiple traces
Forward/reverse sequences from the same individual: P(GT | Read) = .98 and P(GT | Read) = .87
Resultant genotype call: P(GT) = .993

Discovery vs. genotyping
discovery: “uninformed prior”
• don’t know if site is polymorphic
• have to test each site
• Prior(CT) = .001
genotyping: “informed prior”
1. site is known to be polymorphic
2. allele frequency estimate
• Prior(CT) = 0.34

Our heterozygote detection works better than other methods

Method        Fraction of Data   False Discovery   Fraction of           Fraction of
              Analyzed           Rate              Heterozygotes Found   Homozygotes Found
PolyBayes+    85.1               0.0375            86.60%                97.8%
Polyphred 5   86.17              0.0389            83.16%                82.63%

Performance measured on ~1000 alignments covering a 500 kb region of chromosome 4

Base calling for “pyrograms”
[Figure: example pyrogram from the NCBI Trace Archive; sequence TCAGGGGGGGGGGGACGACAAGGCGTGGGGA; values 26 55 24 15 10 7 5 4 2 1 0 0]
• readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle
• the identity of consecutive bases is very reliable but the length of mononucleotide runs (base number) is difficult to quantify (great for re-sequencing, but problematic for de novo sequencing)
• we have access to standardized data formats

SNP genotyping with pyrosequencers
We are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces
Nordfors et al. Human Mutation 19:395-401 (2002)

Somatic mutation detection
Michael Stromberg

Somatic mutations
The detection of somatic mutations, and their distinction from inherited polymorphisms, is important to separate predisposing variants from mutations that occur during disease progression, e.g. in cancer
[Figure © Brian Stavely, Memorial University of Newfoundland]
1. detect the mutations
2. classify whether somatic or inherited

Detecting somatic mutations with comparative data
• based on comparison of cancer and normal tissue from the same individual
• cancer tissue is often highly heterogeneous and the somatic mutant allele may be present at low allele frequency

Detecting somatic mutations with subtraction
• if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence
• search for evidence that these mutations are genetic
• subtract apparent mutations that are present in sequence variation databases

Detecting somatic mutations with subtraction
• we have applied our methods for somatic mutation detection in murine mitochondrial sequences
[Figure: heteroplasmy and homoplasmy]
• we will be applying our methods to human nuclear DNA from our collaborators

Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems
• the HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles
• important drug metabolizing enzymes are relatively few in number, well studied, and at known genome locations; many associated phenotypes are well described
• many functional alleles are known, and of high frequency (common)
• multi-SNP alleles are highly predictive of metabolic phenotype
• clinical phenotype (adverse drug reaction) is less predictable
• ideal candidate for applying haplotype resources

Multi-marker haplotypes as accurate markers for ADRs?
[Diagram: chain of prediction]
• genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes
• computational prediction based on haplotype structure
• functional allele (known metabolic polymorphism)
• molecular phenotype (drug concentration measured in blood plasma)
• clinical endpoint (adverse drug reaction)

Resources
• functional alleles
• LD and haplotype structure in the HapMap reference samples, based on high-density SNP map
• specifics of enzyme-drug interactions
• existing DME P genotyping chips

Evolutionary questions
• mutations single-origin or recurrent?
• geographic origin of mutations?
• mutation age?
• analysis based on complete local variation structure and haplotype background of functional mutations
• specifics of the selection process that led to specific functional alleles?

Proposed steps of analysis
• complete polymorphic structure?
• ethnicity? haplotype block?
• additional functional SNPs?
• haplotypes vs. functional alleles?
• haplotypes vs. metabolic phenotype?
• haplotypes vs. ADR phenotype?
[Diagram: haplotype; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR)]