Informatics Software Development and Computational Biology
proteomics myriad

Bioinformatics Industrial Applications in High Throughput Proteomics
Alan F. James
Director of Software Development
Myriad Proteomics, Inc., Salt Lake City

What is Proteomics?
• Proteomics refers to the study of the protein constituents and protein activities of a cell, a tissue, or an organism.
• Proteomics may be seen from several viewpoints:
  – Protein Expression
  – Protein Interaction (Interactome)
  – …

Challenges in Proteomics
• Proteins are all different: some degrade easily, some are sticky, many require accessory factors.
• Proteins are more complex than DNA: there are several protein forms per gene, and proteins are post-translationally modified.
• There isn’t really ONE proteome in humans.
• Proteins change:
  – with cell type
  – during differentiation
  – during development
  – in response to stimuli
  – with cell cycles
• So which Proteome do you study?

Methods of Analyzing Proteomes
– Expression, Abundance, Distribution (figure: normal cell vs. cancer cell)
– Structural Genomics
– Protein-Protein Interaction Analysis
  • Yeast two-hybrid system
  • Mass spectrometry of protein complexes
[Figure: example interaction network; node labels include PDIRP5, CASP3, MPO-XYZ, NCALD, OS-9, and two nodes marked "novel"]

Methods of Analyzing Proteomes by Comprehensive Surveys of Protein-Protein Interactions
Yeast two-hybrid (Y2H)
• Measures association between two proteins.
• Allows very high throughput.
Mass Spectrometry
• Allows identification of the proteins in a complex of many proteins (2-100) that carry out some cellular function.

Y2H Background Information: Gene Activity in Yeast
Yeast transcription factors are composed of a DNA Binding Domain and a Transcriptional Activation Domain. The DNA Binding Domain recruits the Activation Domain to the yeast gene, which allows the yeast gene to be active.
[Figure: a transcription factor, made of a DNA Binding Domain and an Activation Domain, activating a yeast gene]

Principles of the Yeast Two-Hybrid System
1.
The DNA Binding Domain is separated from the Transcriptional Activation Domain of a transcription factor.
2. Libraries of human proteins are fused to both domains to create “hybrid” proteins.
3. The recruitment of the Activation Domain to the yeast gene is now mediated by interactions of the human proteins.
[Figure: the three steps above. The transcription factor is split (1); human proteins are fused to the DNA Binding Domain and to the Activation Domain (2); the interaction of Human Protein X with Human Protein Y brings the Activation Domain to the yeast gene (3)]

Yeast Two-Hybrid Screens: Assay for Interactions
Scenario A: Human Proteins X and Y do not Interact
• Prey: Activation Domain fused to Human Protein Y
• Bait: DNA Binding Domain fused to Human Protein X
• No Reporter Gene Activity
• Readout: No growth of yeast colonies
Scenario B: Human Proteins X and Z do Interact
• Prey: Activation Domain fused to Human Protein Z
• Bait: DNA Binding Domain fused to Human Protein X
• Reporter Gene is active
• Readout: Yeast colonies grow

Directed vs. Random Approach
Directed: selecting specific proteins as baits for Y2H analysis.
Random: using individual baits picked at random from libraries of baits.
The random approach can be used to rapidly generate large amounts of interaction data.

Random Two-Hybrid (R2H) Process Overview
Library Construction
1. Produce DNA Binding Domain (BD) and Activation Domain (AD) libraries from cDNA synthesized from mRNA libraries using random primers.
Pick BD-Colonies
2. Put yeast colonies containing BD-hybrid proteins into 96-well culture plates.
Mating w/AD-Library
3. Add yeast containing the AD-hybrid proteins to the 96-well plates with the yeast colonies picked in (2.); allow yeast mating to occur.
Selection Plating
4.
Plate yeast matings onto dishes containing selective medium that allows yeast to grow only if the human hybrid proteins interact.
Incubation
5. Allow several days for yeast that contain interacting human proteins to grow.
Pick Growing Yeast
6. Pick yeast colonies containing interacting human proteins (“Positives”) and put them into 96-well culture plates.
Amplify Human DNA
7. Amplify the human DNA that encodes the interacting proteins by PCR.
DNA Sequencing
8. Sequence the amplified DNA and identify the interacting proteins.

Mass Spectrometry
Vital tasks in cells are often performed by Multi-Protein Complexes (MPC).

Mass Spectrometry
[Figure, shown in several builds: mass spectrometry workflow. Labels: Cell Biology; Gene (cDNA) Cloning with vectors pENTR and pDEST1 to pDEST5; Protein “Baits”; Protein Purification Handles (Affinity Tags); Protein “Preys”; MPC; Mass Spectrometry]

Pulldown Assay
[Figure: pulldown workflow]
• Bait Protein with Purification Tag
• Incubate with cell extract
• Complex formation on Affinity Beads (Associated Proteins are retained, Non-binding Proteins are not)
• Elute
• Separate proteins
• Identify by Mass Spectrometry

Mass Spectrometry Procedure
• Purified protein complex
• Protein separation
• Protein digestion
• Mass Spec. analysis
• Mass spectrum
• Database Searching (Peptide Mass Fingerprint Search)
• Protein ID

Summary of Protein-Protein Interaction Analysis Methods
Random Yeast Two-Hybrid: Yields sets of binary associations between protein fragments (that may represent protein-protein interactions).
[Figure: binary associations drawn as pairs of nodes labeled with letters]
Mass Spectrometry: Yields sets of n-ary associations among proteins (that may represent protein complexes).
[Figure: n-ary associations drawn as groups of nodes labeled with letters]

The Goal: Biological Relevance
[Figure: pathway diagram, adapted from http://www.kegg.com. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Druggable” Enzyme; Other Enzymes. Pathway labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS; Underlying Pathway]

Role of Bioinformatics in Proteomics
[Figure: layered diagram of Data Collection, Analysis, and Interpretation, rising from Data through Information to Knowledge. Labels, from top to bottom:]
Knowledge
• Identification as potential drug target
• Identification of participation in disease pathway
Manual/Experimental Data Analysis
• Identification of participation in protein complexes
• Identification of protein interaction networks
Information
• Identification of binary and n-ary interactions
Automated Data Analysis
• Identification of Loci/Domains/Proteins
• Blast/PMF Searches
• Mass Peak List Determination
Automated Data Reduction
• Base Calling
• Data Warehousing
Data
• Data Collection
• LIMS
Contributing disciplines: Biology, Computational Biology, Software Development

Bioinformatics Techniques Used in Proteomics
• Robot programming
• Software engineering
• Database modeling and design
• Data warehouses and Data Marts
• Database federation
• Grid Computing
• Information Visualization
• Graph analysis, graph layout and display
• Hidden Markov Models
• Bayesian networks
• Statistical models
• Signal Processing
• Algorithm development
• …

Objectives of Bioinformatics in Proteomics
1. Automate and manage high-throughput laboratory processes.
2. Retrieve, collect, and store experimental interaction data.
3. Analyze, reduce, and extend experimental interaction data.
4. Mine and visualize interaction analysis results.
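
As an illustration of the two result types summarized earlier (binary associations from Y2H, n-ary associations from mass spectrometry), here is a minimal Python sketch of one way they could be merged into a single network. The protein names, the `build_network` function, and the choice to expand each complex into all member pairs are assumptions made for this example; they are not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical example data; the letters are placeholders, not real results.
binary_associations = [("A", "B"), ("B", "C"), ("A", "B")]  # Y2H-style bait/prey pairs
nary_associations = [{"A", "H", "K"}, {"B", "C", "D"}]       # mass-spec-style complexes


def build_network(pairs, complexes):
    """Merge binary pairs and n-ary sets into one undirected adjacency map.

    Each neighbor entry records which kind of evidence supports the link.
    Expanding a complex into all member pairs is a simplifying assumption:
    co-purification does not show that every pair is in direct contact.
    """
    network = defaultdict(lambda: defaultdict(set))
    for a, b in pairs:
        network[a][b].add("binary")
        network[b][a].add("binary")
    for members in complexes:
        for a, b in combinations(sorted(members), 2):
            network[a][b].add("n-ary")
            network[b][a].add("n-ary")
    return network


network = build_network(binary_associations, nary_associations)
for protein in sorted(network):
    neighbors = {n: sorted(kinds) for n, kinds in sorted(network[protein].items())}
    print(protein, neighbors)
```

Keeping the evidence type on every link lets a later query separate pairs seen only in a complex from pairs that were also observed as a direct binary association.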

Automate and Manage Laboratory Processes
Laboratory Automation
• High-throughput proteomics is not possible without a high degree of laboratory automation.
• Instruments and robotics must interact directly and reliably with the LIMS (Laboratory Information Management System).

Automate and Manage Laboratory Processes
Laboratory Information Management System (LIMS)
• High-throughput proteomics is not possible without a sophisticated LIMS.
• The LIMS provides the foundation for all automated data collection, reduction, and analysis.
• Multiple LIMS systems are required (e.g., Y2H, Sequencing, Gene Cloning, Protein Pull-down, Mass Spec.).
• May collect very large amounts of data.
• Fast runtime performance of the LIMS is essential to deal with the high volume of transactions and possible near real-time interactions between the LIMS and robotics and instruments.
• High availability of the LIMS and supporting computer systems is required to support production laboratories and time-critical operations.
• May be one of the most (if not the most) labor-intensive (programming, database management, and system management) and expensive software systems in the enterprise.

Automate and Manage Laboratory Processes
Functions of the Laboratory Information Management System (LIMS)
• Track samples consistently through a protocol so that each sample:
  – Is identified.
  – Is linked to the appropriate results.
  – Is linked to the protocol used to process the sample.
  – Is linked to any related samples, reagents, etc.
  – Can be located physically.
• Manage and enforce the protocol used to process a sample.
• Capture laboratory quality control information and provide displays, reports, statistical analyses, etc. to allow management and quality control of the laboratory.
• Provide interfaces for laboratory personnel, robotics, and instruments to support high-throughput operations.
• Capture results directly from laboratory instruments.
• Provide experimental results in a format suitable for analytical programs.
• Provide the interface between analytical systems and instruments (such as Mass Spectrometers) that require real-time (or near real-time) analysis during operation.
• Manage laboratory personnel work lists, incident alerting, reporting and correction, etc.

Automate and Manage Laboratory Processes
LIMS Architecture
[Figure: LIMS architecture. Components shown: Web-based Management Clients (Servlets, JSP, CGI Script); Web Application Server; LIMS Data Warehouse(s) (ODS); Lab Workstations (Java Application); LIMS Server (Java Socket Application); LIMS Database(s); Analysis Databases; Robots or Instruments. Connections are labeled XML and SQL Net]

Collect, Store, and Retrieve Experimental Data
Yeast two-hybrid Data
• Electropherograms for sequence forward and reverse reads
• Sequences and sequence quality scores from base-calling
• Robot/Instrument Operational Parameters
• Quality control data
  – Distributions of positive colonies within a search
  – Distributions of sequencing reaction success/failure within a plate
Yeast two-hybrid Data Collection Challenges
• Transmission of electropherograms from remote sequencing facility and associated error handling.
• Relating/correlating data received from remote sequencing facility with LIMS data.
• Archival of electropherograms.
• Retrieval of archived electropherograms.

Collect, Store, and Retrieve Experimental Data
Mass Spectrometry Data
• Spectrograms
  – Multiple Instruments (MALDI-TOF, Electrospray/Ion Trap, etc.)
  – Multiple spectrogram types (MS, MS/MS)
  – Individual samples may be analyzed with multiple instruments and mass spectrogram types.
  – False Positive/Contamination Control Sample Spectrograms
• Mass Peak Lists derived from spectrograms
• Mass Spectrometry Instrument Operational Parameters
Mass Spectrometry Data Collection Challenges
• Individual experiments will generate many spectrograms.
• Interfacing with instrument to retrieve spectrograms and mass peak lists.
• Archival of spectrograms and mass peak lists.
• Retrieval of archived spectrograms and mass peak lists.

Collect, Store, and Retrieve Experimental Data
External Data Sources
• NCBI LocusLink, RefSeq, GenBank, …
• SwissProt, PFAM, …
• Gene Ontology, …
• KEGG, …
• PubMed, Manually curated papers, …
External Data Sources Challenges
• Wide variety of data formats.
• Integrating or federating disparate data sources with internal databases.
• Sometimes questionable quality of data.
• Data sources frequently change/evolve.
  – Changes may invalidate previous analysis results.
  – May require analysis databases to support versioning of results.

Analyze, Reduce, and Extend Experimental Data
• The goal of data analysis is to extract or discover biological relevance from the raw data.
• Raw data must be “cleaned”, filtered, and transformed.
  – Vector/adaptor identification & clipping
  – Sequence assembly
  – Consensus sequence identification
  – Peptide mass fingerprint (PMF) searching
  – False positive detection/filtering
• Data representations must be modeled and developed.
  – How to represent interaction data?
    • Sequences? Electropherograms? Mass Peak Lists?
    • Interactions? Pathways? Sequence Annotations?
    • Many other biological concepts / processes / functions?
  – How to organize data structures to enable querying (analysis) involving
    • Many tables
    • >1 million rows in some tables
    • Filtering, aggregation, and computation of data
• Analysis algorithms must be developed/adapted.
• Statistical models must be developed/validated.
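
To make the questions about representing interaction data and filtering false positives more concrete, the following is a small Python sketch. It assumes a hypothetical record type, `Observation`, holding one bait/prey pair seen in one search, and a hypothetical function, `pair_frequencies`, that counts how often each pair recurs after dropping pairs that involve curated exclusion lists. The field names, exclusion sets, and sample values are invented for illustration and do not describe the actual schema used by the author.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observed bait/prey pair from one search (hypothetical schema)."""
    search_id: int
    bait: str
    prey: str


def pair_frequencies(observations, excluded_baits, excluded_preys):
    """Count how often each bait/prey pair was observed.

    Pairs whose bait or prey is on a curated exclusion list are skipped,
    which is one simple way to filter known false positives.
    """
    counts = Counter()
    for obs in observations:
        if obs.bait in excluded_baits or obs.prey in excluded_preys:
            continue  # drop curated false positives before counting
        counts[(obs.bait, obs.prey)] += 1
    return counts


# Hypothetical sample data, not real experimental results.
observations = [
    Observation(1, "X", "Z"),
    Observation(2, "X", "Z"),
    Observation(2, "X", "P"),   # "P" is on the prey exclusion list below
    Observation(3, "S", "Q"),   # "S" is on the bait exclusion list below
    Observation(4, "W", "Z"),
]

frequencies = pair_frequencies(observations, excluded_baits={"S"}, excluded_preys={"P"})
for pair, count in sorted(frequencies.items()):
    print(pair, count)
```

A pair seen in several independent searches is arguably better supported than one seen once, so a count like this could feed a later confidence estimate; the exact statistical model is left open here.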

Example: consequences of naïve data modeling
[Figure not recoverable]

Example: Y2H Data Analysis Process Flow
[Figure: process flow; each step is listed with its annotations]
• Y2H Laboratory
• Send/Receive Lab Sequence
• Track Sequence: Submitted, Versioning
• Perform Basecalling: Sequence String, Quality Score, Quality Matrix
• Perform QC and Clean Lab Sequence: Failed, Requeue, Vector Clipping, Repeat Masking, Low Quality Filter
• Annotate/Identify Lab Sequences: BLAST, Parameters, Version; Homologous Seqs, Splice Variants; Domain Search
• Construct Interaction Pair: Frequency of Interaction, Confidence Level; Collect False Positive, Self Activators
• Construct Interaction Map: Visualization, Query, Compare, Difference
• Integrate External Evidence: Gene Expression, Pathway, Disease
• Perform Downstream Analysis

Dealing with False Positives
• False positives will always be generated.
  – Y2H
    • “Self-activating” baits.
    • “Promiscuous” preys.
  – Mass Spectrometry
    • Proteins that interact directly with affinity beads.
    • Proteins that interact directly with affinity tags.
    • Contaminants.
• False positives are very hard to detect and distinguish from real positives.
• False positives must be addressed both biologically and informatically:
  – Known false positives can be “subtracted” from Y2H AD/BD libraries before experiments.
  – Mass spectrometry control experiments with affinity beads, affinity tags, and background contaminants can be “subtracted” from results.
  – Known false positives can be “subtracted” during analysis.
  – Statistical tests can be developed to help identify possible false positives during analysis.

Mine and Visualize the Results of Analysis
• Proteomics-specific data mining tools are required to extract meaningful knowledge from massive amounts of data.
  – Flexible searching capabilities.
  – Flexible filters to reduce the amount of data.
  – Multiple views of the data.
  – Ad-hoc query tools for unanticipated data mining needs.
  – Data warehouses and/or data marts are required to support data mining without impacting performance-sensitive LIMS and analytic systems.
• Visualization tools are required to visually organize the data and reveal meaningful patterns.
  – Quality control visualizations.
  – Interaction network visualizations.
  – Interaction network visualizations with experimental data overlays.
  – Disease and metabolic pathway visualizations with interaction network overlays.

Quality Control Visualization (1)
Scatter Plot
[Figure: scatter plot; x-axis SEARCHID (24148 to 39413), y-axis 0 to 160]

Quality Control Visualization (2)
Plate-by-plate Sequencing Purity Monitor
[Figure: grid of sequencing plates, IDs RA0000055 through RB0000136; each plate is drawn by well, with rows A to P and columns 1 to 24]

Quality Control Visualization (3)
Y2H Interaction Map with Curated Promiscuous Protein Annotation
[Figure: interaction map; interacting baits highlighted with their pronet annotation]
[Figure axis: prey. Legend: Prey Annotated, Bait Annotated]

Interaction Network Sub-Graph Visualization
[Figure not recoverable]

Y2H Interaction Network Sub-Graph Visualization with Protein Pull-down Overlay
[Figure: network sub-graph; nodes carry labels such as loc1, loc6, loc21, and loc35]

Pathway with Interaction Network Annotation
[Figure: pathway diagram, adapted from http://www.kegg.com. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Druggable” Enzyme; Other Enzymes. Pathway labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS; Underlying Pathway]

Acknowledgements