From Genotype to Phenotype
Tim Conrad
Head of AG Computational Proteomics
Institut für Mathematik & Informatik, Freie Universität Berlin

Today
• Personalized Genomes
• SNP Analyses
• Effects on Genes
• "One Gene -> One Protein" hypothesis revisited
• Role of proteins
• Gene / Proteins for disease detection

Different genes, different traits
It's the genes' fault!

Personalized Genomics
What are they analyzing?

What determines the individual?
• Two unrelated organisms of the same species (higher mammals) share about 99.5% of their DNA sequence.
• The remaining 0.5% or so determines the "individual".
• These differences can be caused by, e.g., SNPs.
• A SNP is defined as a genomic locus where two or more alternative bases occur with appreciable frequency (>1%).
• SNPs occur every several hundred bases (about 10,000,000 in total).
• Most of these have no phenotypic effect: fewer than 1% of all human SNPs affect protein function, because most lie in non-coding regions.
• SNPs can have a proxy function.
• Whole-genome SNP analysis is possible.
[Figure: Single Nucleotide Polymorphism (SNP). Picture source: Wikipedia]

Next Generation Sequencing
• 23andMe uses Illumina's DNA Analysis BeadChips.
• HumanHap550-Quad+ platform
• Reads more than 550,000 SNPs, together with a 23andMe custom-designed set that analyzes more than 30,000 additional SNPs.

Common Haplotypes
• The number of tag SNPs that capture most of the information about the patterns of human genetic variation is estimated at about 300,000 to 600,000, far fewer than the 10 million common SNPs.

What does it tell us?
Genome-wide Association Studies
Idea:
• Scan (tag) SNPs across many individuals to associate alleles with a particular disease.
• Use a detected association to detect, treat, and prevent the disease.
E.g. eye color, blood type, disease X, …

Very simple possible proposition
"The SNP at APOE position 4075 affects lipid levels!!!"
[Figure: Apolipoprotein E gene map. Axis from 0 to 5.5; exons 1 to 4; marked variant positions 73, 308, 471, 545, 560, 624, 832, 1163, 1522, 1575, 1998, 2440, 2907, 3106, 3673, 3701*, 3937, 4036, 4075, 4951, 5229A, 5229B, 5361.]

The One Gene-One Enzyme Hypothesis
• George Beadle and Edward Tatum (1941) showed a direct relationship between genes and enzymes in the haploid fungus Neurospora crassa.
• This led to their one gene-one enzyme hypothesis and a share of the 1958 Nobel Prize in Physiology or Medicine.
• They built their theory on the work of Archibald Garrod, who proposed a relationship between genes and protein production in 1902.

Garrod: Biochemical Genetics
…is commonly caused by a mutation on chromosome 12 in the phenylalanine hydroxylase gene.
Phenylalanine is an essential amino acid, but an excess is harmful, so it is normally converted to tyrosine. Excess phenylalanine affects the CNS, causing intellectual disability, slow growth, and early death.

"Real-World Biochemistry"
Aspartame = a dipeptide: aspartylphenylalanine methyl ester
Aspartame is metabolized in the body to its components: aspartic acid, phenylalanine, and methanol. Like other amino acids, it provides 4 calories per gram. Since it is about 180 times as sweet as sugar, the amount of aspartame needed to achieve a given level of sweetness is less than 1% of the amount of sugar required. Thus 99.4% of the calories can be replaced. (NutraSweet)
Look at your diet soda cans and read the warning.

[Figure: Contributing factors in causal models (Janssens 2008)]

Onset age and secular trends reveal context-specific gene expression
• North Karelians in Finland had the world's highest heart disease rate.
• Over a period of 20 years, this was reduced by 75% through dietary change.
• Yet there were no changes in the Finnish gene pool.
Source: Finnish Institute of Public Health newsletter publication

Complex traits involve many loci with many alleles each
• Each locus contributes a small amount that depends on its genomic context.
• Usually, a few loci have a few alleles that contribute more in a given population or sample. These Quantitative Trait Loci (QTL) are what mapping finds.
• Classical polygenes: individual effects too small to identify, but substantial in aggregate.

Gene networks
http://www.jmdbase.jp/JmdBaseExt/img/AllGeneNetwork.png

A more realistic proposition
"Some combination(s) of SNPs at (or near) APOE affect(s) some lipid levels in some way(s) at least under some condition(s), at least to some extent." (Possibly!)

How would we know?
Are propositions like these falsifiable? Verifiable? Replicable? Predictive? Significant? Deductive? Probabilistic? Intuitive? Based on preconceptions or scientific inertia?

From genes to proteins

Proteomics in Disease Treatment
• Nearly all major diseases (more than 98% of all hospital admissions) are caused by a particular pattern in a group of genes.
• Isolating this group by comparing the tens of thousands of genes in each of many genomes would be very impractical.
• Looking at the proteomes of the cells associated with the disease is much more efficient.

Why Proteomics?
In contrast to the static genome, the proteome is highly dynamic!

[Figure: Central Dogma and Degradation (Central Dogma revisited). Labels: DNA / genome, mRNA / transcriptome, protein / proteome, metabolite / metabolome, degradation; clinical response, drug response, biomarker.]

Current Projects
[Diagram labels: biomarker identification; protein identification; Bioprint; degradation + QTL analysis; raw data analysis, feature selection, protein ID. Team: Tim, Stefan, Pooja + Axel, Sharon, N/N, Sandro.]

Introduction
Blood is a complex mixture of thousands of different molecules. Some of these are proteins that occur only when we suffer from a disease.
• How can we identify these proteins and use them in diagnostics?
• Does a disease have a "fingerprint"?
• Can it help to match patients and drugs?
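As a rough illustration of the genome-wide association idea from the first part of the talk (scan a SNP across cases and controls and ask whether an allele is associated with the disease), the following sketch runs a Pearson chi-square test on a 2x2 allele-count table for a single SNP. This is a generic textbook test, not necessarily the analysis used by any study mentioned in the slides; the function name `allele_association` and the example counts are assumptions made up for illustration.

```python
import math
from typing import Tuple


def allele_association(case_minor: int, case_major: int,
                       ctrl_minor: int, ctrl_major: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls, columns are the two alleles of one SNP.
    Assumes every row and column total is non-zero.
    Returns (chi-square statistic, p-value).
    """
    observed = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: allele seen 60/100 times in cases, 40/100 times in controls.
chi2, p = allele_association(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

In a real scan over hundreds of thousands of tag SNPs, the per-SNP p-values would additionally need a multiple-testing correction, and a significant association would still be only the "very simple proposition" that the later slides call into question.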
Medical Analysis Pipeline
[Figure: analysis pipeline. Plot of number of proteins against protein mass. Annotations: "Molecule Seq.: AIILLQY, Similarity to p53: 98%"; "Possible protein fingerprint for this disease".]

Proteomics Fingerprinting Pipeline
TC et al. (2006), Lecture Notes in Computer Science, 4216
(1) Very sensitive peak finding across spectra from the same group
Aim: enable reliable peak assignment and property comparisons across spectra
• Assign peaks to clusters (masterpeaks).
• (6d) Multi-dimensional Bayesian clustering based on property similarity (since clusters are not spherical)
• Stages: preprocessing, superimposing, clustering
• Each spectrum: point in ℝ^100,000; reduced representations: points in ℝ^5000 and ℝ^2000

Why bother with the small peaks?
• Detection and processing of many more (i.e. smaller) peaks in less time
• Small peaks are often hormones or other important low-abundance peptides.
Better fingerprints!
[Figure annotation: w/o Cell, ~80% of all peaks]

Benchmarks – Cell Port
Peak detection of a 1d spectrum (2d case: ~1500 1d spectra)

Platform       Cores   Time (1d spectrum)
Intel 3.2GHz   1       9 sec
Intel 3.2GHz   2       7 sec
Intel 3.2GHz   1       250 ms (6 min)
PS3            1       110 ms
PS3            6       22 ms
QS22           12      18 ms

Times in parentheses refer to the 2d case; the slide also shows (~3 h) and (0.5 min).
Code variants compared: "bad" Java code and highly optimized C code.
Joint work with Christopher Thöns and Marco Ziegert.

Find the Features
Find alignments (pairs) of masterpeaks by
• calculating the Jensen-Shannon divergence of their distributions (height, peak center, …, width)
• bipartite graph matching for each alignment
Each spectrum: point in ℝ^500
[Figure labels: 100s … … (1000s) 100s]

Find the Fingerprints
• Fingerprint of dimension d = 100s (ColonCA, n=86; …; Healthy, n=150)
• Fingerprint of dimension m <= d
• Perform manifold learning for dimensionality reduction.
• Each spectrum: point in ℝ^2
Assumption: the informative part of the data lies on a manifold of lower dimension. Find a mapping such that the reconstruction error is small.

Mass Spectrum Representation
• Each 1d spectrum: point in ℝ^10,000
• Peak finding & clustering (spectra of the same group, e.g. healthy) to filter out noise. Each spectrum: point in ℝ^3000
• Feature selection (Disease A vs. Healthy). Each spectrum: point in ℝ^10 (values are peak intensities)

Mass Data Processing
+ the above within grid or cluster systems (~240 CPU cores)

It works
• Biomarker identification
• Fiedler, TC et al. (2009) Serum Peptidome Profiling Revealed Platelet Factor 4 as a Potential Discriminating Peptide Associated With Pancreatic Cancer. Clinical Cancer Research.
• Strenziok, TC et al. (2009) Serum proteomic profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry in testicular germ cell cancer patients. World J of Urology.
Preliminary results: fingerprints for 5 different cancer types, two of which have proven successful in first clinical studies.

De novo peptide identification
Transform the MS/MS spectrum into a spectrum graph:
− generate a node for each peak
− connect nodes by a directed edge if their mass difference equals the mass of some amino acid
− connect nodes by an undirected edge if they correspond to complementary ions
− this yields a set of directed edges (E) and undirected edges (H)
− paths correspond to peptides
− search for (longest) antisymmetric paths (NP-hard)

De novo peptide identification
Solve the longest antisymmetric path problem by linear optimization.
For each directed edge (i,k), introduce one variable x_ik ∈ {0,1}:
− if x_ik = 1, edge (i,k) belongs to the path
− if x_ik = 0, edge (i,k) does not belong to the path
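The spectrum-graph construction described above can be sketched in a few lines. The example below builds directed edges between peaks whose mass difference matches an amino acid residue and then finds a longest path by dynamic programming over the mass-ordered nodes. It is a simplified illustration only: it ignores the undirected edges and the antisymmetry constraint that make the real problem NP-hard, it uses just four residues, and the names `RESIDUE_MASS`, `build_spectrum_graph`, and `longest_path` as well as the tolerance value are assumptions, not part of the original work.

```python
from typing import Dict, List, Tuple

# Monoisotopic residue masses in Da for a small subset of amino acids.
# A real tool would use all 20 residues (and possibly modifications).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
}

Edge = Tuple[int, int, str]  # (source node, target node, amino acid label)


def build_spectrum_graph(peaks: List[float], tol: float = 0.01) -> Tuple[List[float], List[Edge]]:
    """One node per peak; a directed edge i -> k if mass[k] - mass[i] matches a residue."""
    masses = sorted(peaks)
    edges: List[Edge] = []
    for i, m_i in enumerate(masses):
        for k in range(i + 1, len(masses)):
            diff = masses[k] - m_i
            for aa, m_aa in RESIDUE_MASS.items():
                if abs(diff - m_aa) <= tol:
                    edges.append((i, k, aa))
    return masses, edges


def longest_path(n_nodes: int, edges: List[Edge]) -> str:
    """Longest path in the DAG; returns the residue string spelled along it.

    Nodes are indexed in increasing mass order, so every edge goes from a
    lower to a higher index and index order is a topological order.
    """
    best_len = [0] * n_nodes
    best_seq = [""] * n_nodes
    for i, k, aa in sorted(edges):  # sorted by source index
        if best_len[i] + 1 > best_len[k]:
            best_len[k] = best_len[i] + 1
            best_seq[k] = best_seq[i] + aa
    end = max(range(n_nodes), key=lambda j: best_len[j])
    return best_seq[end]


# Hypothetical peak list: a ladder spelling G, A, S plus one unrelated peak.
masses, edges = build_spectrum_graph([0.0, 57.02146, 128.05857, 215.0906, 300.0])
print(longest_path(len(masses), edges))
```

Without the antisymmetry constraint the longest path in such a graph can be found in time linear in the number of edges, which is why it is an attractive building block for harder formulations.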
Lagrangian relaxation
− relax the antisymmetry constraints and add a penalty term to the objective function
− solve the dual problem by iteratively solving the relaxed (Lagrange) problem, i.e. by computing longest paths in a DAG (easy)

From degradomes towards disease-associated proteases
Experiment: take blood from one donor and produce mass spectra every x hours.
Yi et al. Journal of Proteome Research 2007, 6, 1768-1781

Generation of serum peptide signatures
Abundant blood proteins:
  A-C-D-E-F-G-H-I-K-L-M-N-P-Q-R-S-T-V-W-Y
+ endoproteases:
  A-C-D-E-F-G-H-I-K-L-M-N + P-Q-R-S-T-V-W-Y
+ exoprotease (residues removed from the right-hand end):
  A-C-D-E-F-G-H-I-K-L-M-N
  A-C-D-E-F-G-H-I-K-L-M
  A-C-D-E-F-G-H-I-K-L
  A-C-D-E-F-G-H-I-K
  A-C-D-E-F-G-H-I
  A-C-D-E-F-G-H
+ exoprotease (residues removed from the left-hand end):
  A-C-D-E-F-G-H-I-K-L-M-N
  C-D-E-F-G-H-I-K-L-M-N
  D-E-F-G-H-I-K-L-M-N
  E-F-G-H-I-K-L-M-N
  F-G-H-I-K-L-M-N
  G-H-I-K-L-M-N

Degradation Graph
Transform multiple spectra into a degradation graph using
• known precursor peptides
• MS2 identifications

Model generation
\frac{dC_1}{dt} = -k_{1,2} C_1 - k_{1,34} C_1
\frac{dC_2}{dt} = k_{1,2} C_1
\frac{dC_3}{dt} = k_{1,34} C_1
\frac{dC_4}{dt} = k_{1,34} C_1 - k_{4,5} C_4
\frac{dC_5}{dt} = k_{4,5} C_4
Generate mathematical models based on the graph to
• estimate reaction parameters
• classify diseased/healthy samples
• ...

Glueing it together
Genomics + Proteomics (+ even other information) = BioPrints

Extensions
BioPrints AIM: given bio-material (fluids, DNA, skin, …) of ONE individual, determine a multi-dimensional fingerprint and relate it to particular diseases.
• Fingerprinting based
• Disease based
• Integration of different data sources
• How to link?

Acknowledgements
BioInformatics @ Free University Berlin
• AG BioComputing: Prof. Christof Schütte
• AG Algorithmic Bioinformatics: Prof. Knut Reinert
• AG Comp. Proteomics: Stephan Aiche, Sandro Andreotti, Sharon Bruckner, Pooja Gupta, Axel Rack
ILM / University Hospital Leipzig: Prof. Joachim Thiery
Charité Berlin
Microsoft Research, Cambridge: Dr. Andre Hagehülsmann
IBM Deutschland

More information online at msproteomics.net
Thank you very much!
Tim Conrad
AG Computational Proteomics
www.msproteomics.net

Summary
Further questions
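As a supplementary sketch for the "Model generation" slide, the five rate equations can be integrated numerically to see how peptide concentrations evolve over time. The code below reads the equations as first-order kinetics with time derivatives on the left-hand side and uses a plain forward-Euler scheme; the function name `simulate_degradation`, the rate constants, the step size, and the initial condition are all assumptions chosen for illustration, not values from the talk.

```python
from typing import List


def simulate_degradation(k12: float, k134: float, k45: float,
                         c1_start: float = 1.0,
                         dt: float = 0.001, steps: int = 5000) -> List[float]:
    """Forward-Euler integration of the five-species degradation model.

    Species 1 is cleaved either into species 2 (rate k12) or into the pair
    3 + 4 (rate k134); species 4 is degraded further into species 5 (rate k45).
    Returns the concentrations [C1, C2, C3, C4, C5] after steps * dt time units.
    """
    c = [c1_start, 0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        c1, _c2, _c3, c4, _c5 = c
        derivs = [
            -(k12 + k134) * c1,     # dC1/dt
            k12 * c1,               # dC2/dt
            k134 * c1,              # dC3/dt
            k134 * c1 - k45 * c4,   # dC4/dt
            k45 * c4,               # dC5/dt
        ]
        c = [x + dt * dx for x, dx in zip(c, derivs)]
    return c


# Hypothetical rate constants (per time unit); simulated time span is 5 units.
final = simulate_degradation(k12=0.5, k134=0.3, k45=0.2)
print([round(x, 4) for x in final])
```

Estimating reaction parameters, as the slide proposes, would then amount to fitting k12, k134, and k45 so that simulated concentrations match the peak intensities measured in the time series of spectra; a stiff or long-running system would call for a proper ODE solver rather than forward Euler.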