Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Protein phosphorylation wikipedia , lookup
Histone acetylation and deacetylation wikipedia , lookup
Cell nucleus wikipedia , lookup
Magnesium transporter wikipedia , lookup
Signal transduction wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Protein moonlighting wikipedia , lookup
List of types of proteins wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Protein–protein interaction wikipedia , lookup
What is Biology? “A branch of knowledge that deals with living organisms and vital processes” Computational Methods in Systems Biology Nir Friedman The hottest scientific frontier of our times • Many great processes have been figured out • Much is still unknown Maya Schuldiner Tremendous impact on Medicine • Both diagnosis, prognosis, and treatment . 2 Biological Systems are Complex Bakers Yeast Saccharomyces Cereviciae •The System is NOT just a sum of its parts 3 •Used to make bread and beer •The simplest cell that still resembles human cells 4 The Age of Genomes What is Systems Biology? “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” 404 Complete Microbial Genomes (Thousands in progress) 31 Complete Eukaryotic Genomes (315 in progress!) 3 Complete Plant Genomes (6 in progress) • The last decades lead to revolution on how we can examine and understand biological systems Characterized by • High-throughput assays • Integration of multiple forms of experiments & knowledge • Mathematical modeling 5 Bacteria 1.6Mb 1600 genes 95 96 97 Eukaryote Animal Human 13Mb 100Mb 3Gb ~6000 genes ~20,000 genes ~30,000 genes? 98 99 00 01 02 03 04 05 06 Individual Genomes? 07 08 09 10 6 1 Ask Not What Systems Biology Can do For you…. . 8 Why Biology for NIPS Crowd? Flow of Information in Biology Quantity • Data-intense discipline: Too vast for manual interpretation Systematic • Collection of data on all genes/proteins/… Multi-faceted • Measurements of complementary aspects of cellular function, development and disease states • Challenge of integration and fusion of multiple data DNA RNA Recipe (in safe) Working copy Protein The resulting dish Phenotype The Review Has the potential to be medically applicative! 9 10 The “Post-Genomic Era” Systematic is Not Just More Outline Assays DNA Genomic sequences Variations within a population 11 … RNA Quantity Structure Degradation rate … DNA Protein Quantity Location Modifications Interactions … RNA Protein Phenotype Phenotype Stores genetically inherited information of four nucleotide types (A, C, G, T) Two complementary strands creating base pairs (bp) 105 bp in bacteria, 3x109 in humans 6 X1013 in wheat Genetic interventions Environmental interventions … Sequence 12 2 Understanding Genome Sequences ~3,289,000,000 characters: aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga tttaggttgtttccagtttttactggcacagatacggcaatgaatataat tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca . . . Goal: Identify components encoded in the DNA sequence 13 14 Open Reading Frame Finding Open Reading Frames ATGCTCAGCGTGACCTCA . . . CAGCGTTAA M L S V T S . . . Q ATGCTCAGCGTGACCTCA . . . CAGCGTTAA R STP Protein-encoding DNA sequence consists of a sequence of 3 letter codons Starts with the START codon (ATG) Ends with a STOP codon (TAA, TAG, or TGA) M L S V T S . . . Q R STP Try all possible starting points 3 possible offsets 2 possible strands Simple algorithm finds all ORFs in a genome Many of these are spurious (are not real genes) How do we focus on the real ones? 15 16 Using Additional Genomes Phylogentic Tree of Yeasts S. cerevisiae Basic premise “What is important is conserved” ~10M years S. paradoxus S. mikatae S. bayanus C. glabrata S. castellii Evolution = Variation + Selection • Variation is random • Selection reflects function K. lactis A. gossypii K. waltii D. hansenii C. albicans Y. lipolytica Idea: Instead of studying a single genome, compare related genomes A real open reading frame will be conserved 17 N. crassa M. graminearum M. grisea A. nidulans S. pombe 18 Kellis et al, Nature 2003 3 Evolution of Open Reading Frame S. cerevisiae S. paradoxus S. mikatae S. bayanus ATGCTCAGCGTGACCTCA ATGCTCAGCGTGACATCA ATGCTCAGGGTGACA--A ATGCTCAGG---ACA--A . . . . . . . . Conserved Variable Frame shift Spurious ORF . . . . ATG not conserved Frame shift changes interpretation of downstream seq Conserved positions Examples Confirmed ORF Variable positions A deletion Greedy algorithm to find conserved ORFs surprisingly Sequencing effective (> 99% accuracy) on verified yeast data error 19 Defining Conservation Naïve approach Consensus between all species Problem: A A A A grained distances between species Ignores the tree topology Ignores A C A G Variable A C A C A C A A A G C A A G C A A T C C Goal: More sensitive and robust methods Probabilistic Model of Evolution Conserved A T C C A C C A Rough % conserv 100 33 55 55 21 Aardvark Bison Chimp Dog Assumptions: Positions (columns) are independent of each other Each branch is a reversible continuous time discrete state Markov process 22 Conserved vs. unconserved Two hypotheses: P (a " c |t + t ') = ! P (a " b |t )P (b " c |t ') b 2 3 4 1 Conserved Short branches (fewer mutations) P (a )P (a ! b |t ) = P (b )P (b ! a |t ) governed by a rate matrix Q Q a,b = d P(a " b | t) dt t= 0 P(a " b | t) = [e tQ ] Elephant Random variables – sequence at current day taxa or at ancestors Potentials/Conditional distribution – represent the probability of evolutionary changes along each branch Parameterization of Phylogenies 23 [Kellis et al, Nature 2003] 20 Use log a,b 24 2 3 4 1 Unconserved Long branches (more mutations) P(position | unconserved ) P(position | conserved) [Boffelli et al, Science 2003] ! 4 Genes Are Better Conserved Challenges % conserved log Fast/Slow Other types of genomic elements Small polypeptides (peptohormones, neuropeptides) RNA coding genes • rRNA, tRNA, snoRNA… • miRNA Regulatory regions [Boffelli et al, Science 2003] 25 27 Transcription Factor Binding Sites Regulatory Elements Relatively short words (6-20bp) Recognition is not perfect • Binding sites allow variations Often conserved *Essential Cell Biology; p.268 28 29 Challenges Other types of genomic elements Small polypeptides (peptohormones, neuropeptides) RNA coding genes • rRNA, tRNA, snoRNA… • miRNA Regulatory regions Recognition of elements without comparisons Clearly sequence contains enough information to “parse” it within the living cell 30 Outline DNA RNA Protein Phenotype Copied from DNA template Conveys information (mRNA) Can also perform function (tRNA, rRNA, …) Single stranded, four nucleotide types (A,C, G, U) For each expressed gene there can be as few as 1 molecule and up to 10,000 molecules per cell. 31 5 Gene Expression High Throughput Gene Expression Transcription Translation Extract Same DNA content different phenotype Difference is in regulation of expression of genes Very 33 RNA expression levels of 10,000s of genes in one experiment Microarray 34 Dynamic Measurements Conditions Expression: Supervised Approaches Labeled samples Time courses perturbations (genetic & environmental) Biopsies from different patient populations … Feature selection + Classification Classifier confidence Genes Different Potential diagnosis/prognosis tool the disease state ⇒ insights about underlying processes Characterizes 35 Gasch et al. Mol. Cell 2001 Segman et al, Mol. Psych. 2005 36 Expression: Unsupervised P-value =< 0.027 Papers Compendia 26 datasets from Whitehead and Stanford Various tumors Stimulated PBMC B lymphoma Breast cancer Stimulated immune PCA Viral infection Fibroblast EWS/FLI Prostate cancer Cluster Fibroblast infection Neuro tumors Fibroblast serum NCI60 Gliomas HeLa cell cycle Lung cancer 37 Eisen et al. PNAS 1998; Alter et al, PNAS 2000 39 Leukemia Liver cancer Segal et al Nat. Gen. 2004 6 Apoptosis DNA damage / nucleotide metabolism Apoptosis Immune MMPs Immune Cancer types Goal: Reconstruct Cellular Networks Signaling & growth regulation Signaling Immune Muscle Immune Immune Cytoskeleton & ECM Adhesion & signaling Synapse & signaling Metabolism Chromatin Breast Cytoskeleton (IF & MT) Modules Cancer span wide range of phenomena •Tumor type specific •Tissue specific •Generic across many tumors Signaling & development Protein biosynthesis Translation, degradation & folding Nucleotide metabolism Signaling, development & oxidative phos. Cell lines Cell cycle ECM IF & keratins Metabolism, detox & immune Signaling & growth regulation Metabolism, detox & immune Signaling Signaling Signaling Signaling & CNS Biocarta. http://www.biocarta.com/ Tissues Immune Liver Immune Lung / Hemato Immune AD/CNS CNS Hemao Hemato Cell lines Liver Hemato Immune Breast\liver Hemato Lung/AD Hemati 40 >0.4 Hemato 0 Leukemia >0.4 Segal et al Nat. Gen. 2004 41 First Attempt: Bayesian networks Gene A Second Attempt: Module Networks Gene B MAPK of cell wall integrity pathway SLT2 Gene C Gene D RLM1 Regulation Function 1 Gene E One gene One variable An instance: microarray sample Use standard approaches for learning networks Friedman et al, JCB 2000 42 CRH1 Regulation Function 2 Validation Module 25 (59 genes) Tpk1 Kin82 Msn4 Usv1 How do we evaluate ourselves? Statistical validation • Ability to generalize (cross validation test) Nrg1 150 Module 4 (42 genes) Hap4 Mth1 Module 1 (55 genes) Atp1 44 Atp3 PTP2 Regulation Function 4 Segal et al, Nature Genetics 2003 43 Test Data Log-Likelihood (gain per instance) et al. 2001: Yeast Response to Environmental Stress Module 2 173 Yeast arrays (64 genes) 2355 Genes Tpk2 50 modules Regulation Function 3 Idea: enforce common regulatory program Statistical robustness: Regulation programs are estimated from m*k samples Organization of genes into regulatory modules: Concise biological description Learned Network (fragment) Gasch YPS3 One common regulation function 100 50 -50 -100 -150 Atp16 45 Bayesian network performance 0 0 100 200 300 400 Number of modules 500 7 Validation Visualization & Interpretation How do we evaluate ourselves? Statistical validation Biological interpretation • Annotation database • Literature reports • Other experiments, potentially different experiment types GO Cis-regulatory motifs Functional annotations Molecular Pathways (KEGG GeneMAPP) Visualization Interpretation Hypotheses Function Dynamics Regulation Expression profiles 47 Hap4 Msn4 Oxid. Phosphorylation (26, 5x10-35) Mitochondrion (31, 7x10-32) Aerobic Respiration (12, 2x10-13) 46 HAP4 Motif 29/55; p<2x10-13 STRE (Msn2/4) 32/55; p<103 HAP4+STRE 17/29; p<7x10-10 -500 -400 -300 -200 -100 Gene set coherence (GO, MIPS, KEGG) Match between regulator and targets Match between regulator and cis-reg motif Validation How do we evaluate ourselves? Statistical validation Biological interpretation Experiments • Test causal predictions in the real system • Lead to additional understanding beyond the prediction • Experimental validation of three regulators ♦ 3/3 Match between regulator and condition/logic p-values using hypergeometric dist; corrected for multiple hypotheses 48 successful results Segal et al, Nature Genetics 2003 49 Challenges Outline New methodologies for the huge amount of existing RNA profiles • Meta analysis • Better mechanistic models • Contrasting new profiles with existing databases • Visualization DNA RNA Protein Phenotype Proteins Other 50 measurements • Degradation rates • Localization are the main executers of cellular function Building blocks are 20 different amino-acid Synthesized from mRNA template Acquires a sequence dependent 3-D conformation Proteomics: Systematic Study of Proteins 51 8 Why Measure Proteins? Challenges in Proteomics RNA Level ≠ Protein level Protein quantity is not a direct function of RNA levels Problematic recognition: No generic mechanism to detect different protein forms Protein Thousands Level ≠ Activity level Activity of proteins is regulated by many additional mechanisms • Cellular localization • Post-translational modifications • Co-factors (protein, RNA, …) of different proteins in the typical cell Protein abundances vary over several orders of magnitude 52 53 Making a Protein Generic TAP-Tag Libraries for Abundance ~4500 Yeast strains have been TAP tagged TAG • • Tags make a protein generic Underlying assumption is that the tag does not change the protein All proteins have the same tag 1. Inability to pool strains 2. Each experiment is done on a “different” strain • •How much is each protein expressed? •What is the proteome under different conditions? 54 55 Why Study Protein Complexes? Using TAP-Tag to Find Complexes # Most proteins in the cell work in protein complexes or through protein/protein interactions # To understand how proteins function we must know: - who they interact with - when do they interact - where do they interact - what is the outcome of that interaction 56 . 9 Large Scale Pull Downs Provide Information on Protein Complexes •Both labs used the same proteins as bait •Each lab got slightly different results •The results depended dramatically on analysis method *Gavin et al. Nature 2006 *Gavin et al. Nature 2006 *Krogan et al. Nature 2006 . *Krogan et al. Nature 2006 59 Making a Protein Generic We can now define a yeast “interactome” •Isnt full use of data •Static picture . 1. Fluorescent proteins allow us to visualize the proteins within the cell. 2. Allow us to measure individual cells and the variation/ noise within a population 61 Cellular Localization Using GFP Tags What can it teach us? Challenges in Fluorescence-based Approaches Better Vision processing will allow to do this in High-Throughput and answer questions like: • Changes in localization in response to cellular cues • Changes in localization in response to environment cues • Changes in localization in various genetic backgrounds • Dynamics of localization changes A library of yeast GFP fusion strains has been used to localize nearly all yeast proteins 62 Huh et al Nature 2003 A collection of cloned C. elegans promoters is being created for similar purposes Genome Research 14:2169-2175, 2004 63 10 Single Cell Measurements: Flow Cytometry Cells THROUGHPUT THE MAJOR BOTTLENECK pass through a flow cell one at a time Lasers focused on the flow cell excite fluorescent protein fusions Allows multiple measurements (cell size, shape, DNA content) Applications: Protein abundance Protein-protein interactions Single-cell measurements . 65 Comparison of mRNA to Protein Levels Allows Identification of Post-transcriptional Regulation High Throughput Flow Cytometer Rich Poor media media Observed behaviors No change in both change Change in protein, but not mRNA Coordinated 7 seconds/sample counts per sample Log2 Poor/Rich protein Compare ~50,000 66 Log2 Poor/Rich mRNA 67 Newman et al Nature 2006 Noise in Biological Systems Challenges Proteomics is in its infancy - easier to make an impact Measurement of 10,000 individual cells allows measurement of variation (noise) in a biological context factors that affect levels of noise in gene expression: • Abundance, mode of transcriptional regulation, subcellular localization 68 Nature 441, 840-846(15 June 2006) Integrating this data with other proteomic/genomic data to better predict protein function Higher Throughput methods such as flow cytometry will allow generation of varied data: Different growth conditions, Cell cycle, Stress, Mating Tagging is mammalian cells becoming more feasable near future should bring proteomic data on human cells 69 11 Outline Single Gene KO Phenotypic RNA DNA Protein Screen Phenotype Traits that selection can apply to, the observable characteristics in the DNA can cause a change in a phenotype. • Shape and size • Growth rate • How many years your liver can survive alcohol damage…. Mutations 70 Giaver et ., 2002 71 Starting to Probe the Cellular Network What is a genetic interaction (Epistasis)? The effect of a mutation in one gene on the phenotype of a mutation in a second gene. Genotype WT Δ geneA Δ geneB Genetic Interaction •The effect of a mutation in one gene on the phenotype of a mutation in a second gene •Different type of interaction - not physical Δ Growth Rate 1 x (x <= 1) y (y <= 1) geneA Δ geneB xy (Product) DIFFERENT TYPE OF INTERACTION - NOT PHYSICAL 72 74 What is a genetic interaction? Genetic Interaction None Aggravating Alleviating Growth Rate xy less than xy greater than xy Systematic Method of Analyzing Double Mutants X ∆X:NAT ∆Y:KAN ∆X:NAT A X ×B ×Y 75 C ∆Y:KAN ×A ×B X WT Y C 76 ∆X:NAT ∆Y:KAN ∆X:NAT ∆Y:KAN Double deletion mutants are made systematically Colony sizes are measured in high throughput Tong et al., 2001 12 E-MAPS Defining Protein Complexes Epistasis Mini Array Profiles On A Co-complex proteins have • similar interaction patterns • alleviating interactions B C Off 77 Aggravating Alleviating Aggravating Schuldiner et al., Cell 2005 Alleviating 78 Outline Challenges for the future Only a small fraction of the information has been utilized in E-MAPS made so far E-MAPS to cover all yeast cellular processes to come out until the end of 2007 Extending this to human cells is now feasible using gene silencing techniques Amount of data scales exponentially - Higher organisms - more genes RNA DNA Protein Phenotype Combined Insights Model-free approach Model-based approach . 80 Why Integrate Data? Model-Free Approach Location Gene Nuc Cyto Expression Phenotype Mito Rich Poor Salt Kan Binding sites RAP1 HSF1 GCN4 YAL001C YAL002W YAL003W YAR040W attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat High-throughput assays: • Observations about one aspect of the system • Often noisy and less reliable than traditional assays • Provide partial account of the system 81 YAR041C Treat different observations about elements as multivariate data • Clustering • Statistical tests 82 13 Model-Free Approach Finding bi-clusters in large compendium of functional data Tanay et al PNAS 2004 83 Model-Free Approach Pros: No assumptions about data • Unbiased • Can be applied to many data types Can use existing tools to analyze combined data Cons: No assumptions about data • Interpretation is post-analysis • No sanity check Cannot deal with data from different modalities (interactions, other types of genetic elements) 84 Model-Based Approach Explaining Expression DNA binding proteins Non-coding region Gene What is a model? “A description of a process that could have generated the observed data” attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat Idealized, simplified, cartoonish Describes the system & how it generates observations 85 Activator Repressor Coding region Binding sites RNA transcript Key Question: Can we explain changes in expression? General concept: Transcription factor binding sites in promoter region should “explain” changes in transcription 86 Explaining Expression A Stab at Model-Based Analysis Relevant data: Expression under environmental perturbations Expression under transcription factors KOs Predicted binding sites of transcription factors Protein-DNA interactions of transcription factors Protein levels/location of transcription factors … 87 ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC CCAAT AGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCTTTGTGGT GATAC TCGACTGC CCAAT ACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACCACCCCCCGCTC CCAAT GATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCGCTGCTCGTAGCATG TCGACTGC GATAC CCAAT TCGACTGC GATAC CTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTTTTTTTTTTTTGCTAGCAC GCAGTT CCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCTCGTCGTCGATGCATCGTACGTA CCAAT TCGACTGC GCAGTT CCAAT GCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTCGATCGTAGCTGCTCGCCCCCCCCCCC GATAC TCGACTGC CCAAT GCAGTT CGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAGCTGAATTATATATATATATATACGGCG Motifs TCGACTGC Motif Profiles TCGACTGC + GATAC GATAC CCAAT CCAAT Genes Sequence GCAGTT GCAGTT + CCAAT Expression Profiles 88 14 Unified Probabilistic Model Sequence Sequence S1 S2 S3 S4 S1 Motifs Motif Profiles Gene Expression Segal et al, RECOMB 2002, ISMB 2003 Sequence S1 S2 S3 Expression Segal et al, RECOMB 2002, ISMB 2003 Probabilistic Model Observed S4 Experiment Gene Expression Profiles 90 Sequence Regulatory Modules Sequence S1 Motifs S2 S3 S4 Motif profile Motifs R1 R2 R3 Expression Profiles 91 S4 Module Unified Probabilistic Model Motif Profiles S3 R1 R2 R3 Experiment Expression Profiles 89 Sequence S2 Motifs R1 R2 R3 Motif Profiles Sequence genes Sequence Unified Probabilistic Model Expression profile R1 R2 R3 Experiment Module Motif Profiles ID Gene Level Observed Expression Segal et al, RECOMB 2002, ISMB 2003 Expression Profiles 92 Model-Based Approach Experiment Module ID Gene Level Expression Segal et al, RECOMB 2002, ISMB 2003 Physical Interactions Pros: Incorporates biological principles • Suggests mechanisms • Incorporate diverse data modalities Declarative semantics -- easy to extend Cons: Reconstruction depends on the model Biological principles • Bias 93 94 15 Physical Interactions Relational Markov Network Interaction between two proteins makes it more probable that they • share a function • reside in the same cellular localization • their expression is coordinated • have similar genetic interactions •… Can we exploit this to make better inference of properties of proteins? Probabilistic Represent patterns hold for all groups of objects local probabilistic dependencies P1 .N 0 0 1 1 Protein Nucleus Cytoplasm P2.M 0 1 0 1 φ 0 0 0 -1 Mitochndria P1 .N 0 0 0 0 1 1 1 1 P2.N 0 0 1 1 0 0 1 1 I.E 0 1 0 1 0 1 0 1 φ 0 0 0 -1 0 -1 0 2 Interaction Exists Protein Nucleus Cytoplasm Mitochndria 95 96 Relational Markov Network Compact Adding Noisy Observations Add model Allows to infer protein attributes by combining • Interaction network topology (observed) • Observations about neighboring proteins class for experimental assay View assay result as stochastic function (CPD) of underlying biology GFP image Protein Cytoplasm Nucleus Nucleus Cytoplasm Mitochndria Mitochndria Interaction Exists Directed CPD Protein Nucleus Cytoplasm Mitochndria 97 98 Uncertainty About Interactions Design Plan Add interaction assays as noisy sensors for interactions GFP image Protein Nucleus Cytoplasm Nucleus Cytoplasm Mitochndria Mitochndria Simultaneous prediction Interaction Exists Taf1 Protein Nucleus Mitochndria 99 Relational Markov Network Tbf1 Med17 Cytoplasm Assay Cln5 Srb1 Interact Mcm1 Pre7 Cdk8 Pre9 Pup3 Med1 Taf10 Pre5 Med5 100 16 Relational Markov Network Add Relational Markov Models potentials over interactions Protein Combine (Noisy) interaction assays (Noisy) protein attribute assays Preferences over network structures Protein Nucleus Nucleus Interaction Exists Potential over Interaction Interaction Exists Exists To find a coherent prediction of the interaction network Protein Nucleus 101 102 Discussion The Need for Computational Methods Every day papers are published with highthroughput data that is not analyzed completely or not used in all ways possible Experiment The bottlenecks right now are the time and ideas to analyze the data Modeling & Simulation Low-level analysis High-level analysis 104 106 What are the Options? Analyze published data • Abundant, easy to obtain • Method oriented • Don’t have to bump into biologists • Two million other groups have that data too Collaborate with an experimental group • Be involved in all stages of project • Understand the system and the data better • Have priority on the data • Involved in generating & testing biological hypotheses • Goal oriented Start 107 your own experimental group…(yeah, sure) Questions to Keep in Mind Crucial questions to ask about biological problems What quantities are measured? Which aspects of the biological systems are probed How are they measured? How this measurement represents the underlying system? Bias and noise characteristics of the data Why are these measurements interesting? Which conclusions will make the biggest impact? 108 17 Acknowledgements Slides: The Computational Bunch •Yoseph Barash •Ariel Jaimovich •Tommy Kaplan •Daphne Koller •Noa Novershtern •Dana Pe’er •Itsik Pe’er •Aviv Regev •Eran Segal The Biologist Crowd •David Breslow •Sean Collins •Jan Ihmels •Nevan Krogan •Jonathan Weissman Special thanks: Gal Elidan, Ariel Jaimovich 109 18