HAPPI: A BIOINFORMATICS DATABASE PLATFORM ENABLING NETWORK BIOLOGY STUDIES

SudhaRani Mamidipalli

Submitted to the faculty of the Bioinformatics Graduate Program in partial fulfillment of the requirements for the degree Master of Science in Bioinformatics in the School of Informatics, Indiana University

May 2006

Accepted by the Graduate Faculty of Indiana University, in partial fulfillment of the requirements for the degree of Master of Science.

Master’s Thesis Committee:
Dr. Jake Yue Chen, Ph.D., Chairperson
Dr. Snehasis Mukhopadhyay, Ph.D.
Dr. Bonnie Blazer-Yost, Ph.D.

Dedicated to Amma, Daddy, Sudhakar & Laasya

Acknowledgements

There are many people whose time and efforts have been instrumental in making this thesis a success. First and foremost, to my best friend and husband Sudhakar, a source of wisdom, encouragement, strength, and especially entertainment … thank you for supporting me always and without question in whatever intellectual pursuit I attempt. Special thanks to my daughter, Laasya, for being patient and for reminding me what truly matters … and for being my inspiration through this effort.

It is difficult to overstate my gratitude to my research advisor, Dr. Jake Yue Chen. With his enthusiasm, his inspiration, and his great efforts to explain things clearly and simply, he helped to make bioinformatics fun for me. Throughout my research work, he provided encouragement, sound advice, and good teaching that helped me achieve greater heights and realize my full potential. I thank Dr. Snehasis Mukhopadhyay for being a supportive, strong guiding force as an academic advisor and member of my committee. I would also like to thank Dr. Bonnie Blazer-Yost for all her support and guidance throughout the whole process. It was absolutely invaluable.
My heartfelt thanks are due to Stephanie Burks and Teresa Hunter of Research and Technical Services and Kimberly Melluck of the School of Informatics for their timely help with computing resources. It gives me great pleasure to acknowledge the members of my research group, Zhong, Usha, Bhanu, Pranav and Laavanya, for providing insight and encouragement throughout my research work. Finally, I would like to thank my parents for their tremendous love, endless support, and many prayers. They have always believed in me and always been there for me.

ABSTRACT

SudhaRani Mamidipalli

HAPPI: A BIOINFORMATICS DATABASE PLATFORM ENABLING NETWORK BIOLOGY STUDIES

The publication of the draft human genome, consisting of 30,000 genes, is merely the beginning of genome biology. A new way to understand the complexity and richness of the molecular and cellular functions of proteins in biological processes is through the understanding of biological networks. These networks include protein-protein interaction networks, gene regulatory networks, and metabolic networks. In this thesis, we study human protein-protein interaction networks using informatics techniques. First, we performed a thorough literature survey to document the experimental methods used to detect and collect protein interactions, the current public databases that store these interactions, and the computational software used to predict, validate and interpret protein networks. Then, we developed the Human Annotated Protein-Protein Interaction (HAPPI) database to manage a wealth of integrated information related to protein functions, protein-protein functional links, and protein-protein interactions. Approximately 12900 proteins from Swissprot, 57900 proteins from Trembl, 52186 protein-domains from Swisspfam, 4084 gene-pathways from KEGG, 2403190 interactions from STRING and 51207 interactions from OPHID were integrated from these public databases into a single relational database platform using Oracle 10g on an IU Supercomputing grid.
We further assigned a confidence score to each protein interaction pair to help assess the quality and reliability of each protein-protein interaction. We hosted the database on the Discovery Informatics and Computing web site, where it is now publicly accessible. The HAPPI database differs from other protein interaction databases in the following aspects: 1) it focuses on human protein interactions and contains approximately 860000 high-confidence protein interaction records, making it one of the most complete and reliable sources of human protein interaction data today; 2) it includes thorough protein domain, gene and pathway information for interacting proteins, thereby providing a whole view of protein functional information; 3) it contains a consistent ranking score that can be used to gauge the confidence of protein interactions. To show the benefits of the HAPPI database, we performed a case study on the insulin signaling pathway in collaboration with a biology team on campus. We began with two sets of proteins, set A and set B, that had previously been well studied as separate processes. We queried these proteins against the HAPPI database and derived high-confidence protein interaction data sets annotated with known KEGG pathways. We then organized these protein interactions in a network diagram. The end result shows many novel hub proteins that connect set A or B proteins. Some hub proteins are even novel members outside of any annotated pathway, making them interesting targets to validate in subsequent biological studies.

TABLE OF CONTENTS

List of Tables
List of Figures

I. Introduction
  1.1. Introduction to Protein Interactions
  1.2. Contributions of the thesis

II. Background
  2.1. Literature Review
    2.1.1. The Life Science Discovery Process
    2.1.2. Methods to detect Protein Interactions
    2.1.3. Databases to document Protein Interactions
    2.1.4. Computational methods and Protein Interactions
  2.2. Problem Statement
  2.3. Research Question
  2.4. Hypothesis

III. Materials
  3.1. Bioinformatics Databases used
    3.1.1. Protein Interaction Databases
    3.1.2. Protein Annotation Databases
    3.1.3. Protein-Domain Databases
    3.1.4. Pathway Databases
    3.1.5. Bibliographic Databases
  3.2. Software Languages used
    3.2.1. Perl
    3.2.2. PHP
    3.2.3. HTML
    3.2.4. SQL
    3.2.5. PSQL
  3.3. Relational Databases used
    3.3.1. Oracle
    3.3.2. PostgreSQL
  3.4. Software Tools used
    3.4.1. Komodo
    3.4.2. SQL Loader
    3.4.3. SSH / SFTP
    3.4.4. Ultra edit
    3.4.5. Erwin
    3.4.6. Toad
    3.4.7. Aqua Data Studio
    3.4.8. SQL Tools
    3.4.9. Endnote
  3.5. University Computing Resources used
    3.5.1. BioX
    3.5.2. Zen
    3.5.3. Research Database Complex (RDC)

IV. Procedures and Interventions
  4.1. Method Roadmap for HAPPI database
  4.2. Architecture
  4.3. Data Integration
    4.3.1. Data Warehouse Approach
    4.3.2. Data Acquisition
    4.3.3. Data Reduction
    4.3.4. Feature Selection
    4.3.5. Meta-data specification
    4.3.6. Database Design
    4.3.7. Data Storage/Loading
  4.4. Query Processing
  4.5. User Interface
  4.6. Unified Scoring Model

V. Results and Discussion

VI. Case Study

VII. Conclusion
  7.1. Directions for future work

VIII. Appendices
  Appendix A
  Appendix B
  Appendix C

References

Curriculum vitae

List of Tables

Table 2.1. Databases of protein interactions
Table 2.2. Software tools to predict interactions between proteins
Table 2.3. Comparison of interaction features across human databases
Table 4.1. An overview of Data Acquisition from different data sources
Table 4.2. An Overview of Data Reduction of Protein Integrated Data
Table 4.3. Line types and codes of a sequence entry in Uniprot database
Table 4.4. Summary of feature selection from Protein Integrated databases
Table 4.5. An overview of tables loaded from Protein Integrated databases
Table 4.6. Analysis of String database scores
Table 4.7. Analysis of Ophid database scores
Table 5.1. Comparison of SGK1 interacting proteins across P.I. databases
Table 5.2. Comparison of database features across P.I. databases

List of Figures

Figure 2.1. Information-driven discovery process
Figure 2.2. Different levels of observation of protein interactions
Figure 3.1. Snapshot of String Protein Interaction database
Figure 3.2. Snapshot of Ophid Protein Interaction database
Figure 3.3. Snapshot of SwissProt manually annotated Protein Sequence database
Figure 3.4. Snapshot of Pfam Protein Domain database
Figure 3.5. Kegg Pathway database showing Insulin Signaling Pathway
Figure 3.6. Snapshot of PubMed literature database
Figure 4.1. Method Roadmap for HAPPI database
Figure 4.2. Hardware Architecture of HAPPI database
Figure 4.3. Three-Tier Software Architecture: Structure and Technologies
Figure 4.4. Integration of Protein Annotation, Interaction, Domain, Sequence and Pathway Data
Figure 4.5. Initial Data Model of HAPPI database
Figure 4.6. User Interface Flow Diagram of HAPPI database
Figure 4.7. String database score distributions
Figure 4.8. Flow Chart for Score Consolidation
Figure 5.1. The Query page of HAPPI database
Figure 5.2. The Interaction Results page of HAPPI database
Figure 5.3. The Interaction Annotation page of HAPPI database
Figure 5.4. The Protein Annotation Page of HAPPI database
Figure 6.1. Method Roadmap for case study
Figure 6.2. Insulin Signaling Pathway
Figure 6.3. Visualization of Insulin Pathway Interaction Network using Pathway studio
Figure 6.4. Visualization of Insulin Pathway Protein Interaction Network

I. Introduction

1.1. Introduction to Protein Interactions

The study of comprehensive collections of all the proteins within the cells of an organism and the molecular interactions among them, such as physical binding or regulatory modification, is known as Protein Interactomics [1]. Its biological significance ultimately lies in the validation of disease bio-markers in a molecular network context and in the identification of better drug targets.

The PROTEin complement expressed by a genOME is called the proteome. According to Dr. Steven Briggs [2], the concept of a proteome is fundamentally different from that of a genome: “while the genome is virtually static and can be well defined for an organism, the proteome continually changes in response to external and internal events”. Proteomics is divided into two main categories:

1. Expression Proteomics: the study of all gene products present in a tissue, a cell, or an organelle, along with their modifications.
2. Functional Proteomics: the examination of the changes that arise in response to a change in the biological system of interest, and thus the study of the functions of proteins within complex networks. This includes the study of protein-protein interactions (Protein Interactomics).

The study of protein interactions plays a vital role in understanding how proteins function within the cell. Publication of the draft human genome and proteomics-based protein profiling studies opened a new era in protein interaction analysis. Understanding the characteristics of protein interactions in a given cellular proteome will be the next milestone in cellular biochemistry [3].

A comprehensive collection and integration of information on human proteins, their features, and their functions would be invaluable to biologists in several ways. For instance, the type of domains found in proteins generally predicts the functional class or
The exact sub cellular localization of proteins and their distribution within body tissues is also important to protein function [4]. Ultimately, it is vital to know any possibility of association of proteins with human diseases, as this dictates their involvement in certain pathways [5]. About 30,000 genes of the human genome are expected to give rise to 1,000,000 proteins through a series of post-translational modifications and gene splicing mechanisms [3]. Posttranslational modifications such as phosphorylation and ubiquitination can extremely influence the activity of proteins and are generally used as regulatory mechanisms in signal transduction pathways [6]. Although a few of these proteins can be expected to work in relative isolation, the majority of them is expected to interact with other proteins in complexes and networks to combine the innumerable elements of processes that impact cellular structure and function. Because proteins act together with other proteins, knowing the identity and characteristic features of interacting proteins along with the relevant binding sites can boost hypothesis-driven studies and explanation of regulatory networks [7]. Protein functions can be extracted from protein-protein interactions. If two proteins interact with each other and if the function of one protein is known, then some relevant information about the function of other interacting protein can be obtained [8]. When two proteins are said to interact with each other, the domains that constitute proteins also interact physically with each another to perform the required functions. Hence understanding an interaction between two proteins at the domain level is very essential in getting a whole view of the protein interaction network, and thereby protein functions. Domains are considered as structural subunits or building blocks of proteins that are maintained during evolution. Proteins constitute either one domain or more than one domain. 
Many proteins are built from several domains. Single-domain proteins are found to make up around 34% of proteins in prokaryotes and 20% in eukaryotes [9]. Generally, each domain performs its own function for the protein, such as spanning the plasma membrane, binding DNA, or providing a surface that binds specifically to another protein [4]. Homologous domains, that is, domains related by descent, can be grouped together to form a superfamily. Motifs are smaller elements of a protein that are important in protein-protein interactions and protein function. Examples include the coiled-coil and the nuclear localization signal [10].

Network prediction is a ‘systems’ problem that requires a ‘systems’ approach. Systems biology is an emerging discipline that uses experimental techniques and bioinformatics to help understand biological systems on a global scale [11]. One of the key problems in systems biology is integrating gene and protein data effectively with other information such as the PubMed literature [12]. Other problems include limited experimental data, network complexity and infinite network solutions.

1.2. Contributions of the Thesis

This thesis makes two major contributions to the field of human protein-protein interactomics:

1. A useful resource that enables human protein interactome researchers to carry out large-scale protein interaction network studies.
   a. A comprehensive database of 86,000 highly reliable human protein interactions
   b. A consistent ranking score to assess the confidence of protein-protein interactions
2. A database framework that gives bench biologists an integrated view of annotated protein interactions in a web-based environment.
   a. Provides a whole view of the functional information of interacting proteins.

II. Background

2.1 Literature Review

The protein–protein interaction data stored in public databases provide access to both experimental and predicted data.
A literature survey was carried out on the life science discovery process, on life science data sources (in particular those covering protein interactions), and on the associated computational methods and tools.

2.1.1. The Life Science Discovery Process

Bioinformatics plays a major role in managing scientific data and thereby supporting life science discovery. Bioinformatics can be defined as the application of information technology to the life sciences for a better understanding of life. It includes a broad range of functions such as data acquisition, data reduction, data analysis, data management, data integration, storage, statistics, and visualization. Computational approaches need information integrated from a variety of data sources [13]. For example, protein interactomics covers information on interacting proteins, but this information must be combined with protein annotation databases, pathway databases and domain databases to get a clear picture of how proteins function within a cell. Thus, integration bioinformatics is becoming an emerging frontier for life sciences research.

Figure 2.1. Information-driven discovery process [13] (diagram labels: Genomics, Gene Expression, Proteomics, Systems Biology, Bioinformatics, Databases, Networks, Pathways, Computational Tools)

Figure 2.1 illustrates the importance of integrating biological data obtained from both experimental and computational methods for novel discovery. Data integration is an active, ongoing topic in the life sciences. According to access and architecture, data integration solutions can be roughly divided into three major categories: the data warehousing approach, the distributed or federated approach, and the mediator approach.

2.1.2. Methods to detect Protein Interactions

Until recently, experimentally determined protein–protein interactions were gathered to analyze potential functions of proteins [14]. These experimental methods detect protein-protein interactions at different levels of resolution [15]. Figure 2.2
shows the different levels of observation of protein interactions in detail. The first is an ‘atomic observation’, in which the protein interaction is detected at the atomic level; an example is X-ray crystallography. The second is a ‘direct interaction observation’, in which the protein interaction is detected at the protein level; an example is the two-hybrid experiment. The third is a ‘complex observation’, in which interactions are detected at the level of complexes. A complex may comprise more than one protein, and it is not known which proteins in one complex interact with which proteins in another. Examples include immunoprecipitation and mass-spectral analysis. The fourth category uses an activity bioassay to detect interactions at the cellular level; an example is a proliferation assay of cells stimulated by a receptor-ligand interaction.

Figure 2.2. Different levels of observation of protein interactions [16] (levels shown: atomic, direct, complex, cellular). The figure was adapted from the central dogma of molecular biology figure in the BIOL 103: Principles of Biology course of Queens University of Charlotte [16].

In terms of the availability of experimental data in the literature, the complex interaction level is the most commonly represented, followed by the cellular interaction level, the direct interaction level, and finally the atomic observation level [15].

2.1.3. Databases to document Protein Interactions

This section reviews the literature on the different types of biological databases dedicated to protein interactions. Features such as availability, content, statistics and the types of interactions documented in each database were surveyed. Table 2.1 compares the features of these protein interaction databases.
| Database | Acronym | URL | Content | Interaction Statistics | Availability | Type of Interactions / Networks |
|---|---|---|---|---|---|---|
| Database of Interacting Proteins | DIP | http://dip.doe-mbi.ucla.edu/ | Catalog of protein-protein interactions | 55732 interactions among 19051 proteins in 110 species | Free for both Academic and Commercial users | Experimental; Physical network |
| Biomolecular Interaction Network Database | BIND | http://bind.ca | Biomolecular interaction complexes and pathways | 201896 interactions in 1528 organisms | Free for both Academic and Commercial users | Experimental; Physical network |
| MIPS mammalian Protein-Protein Interaction database | MPPI | http://mips.gsf.de/proj/ppi/ | A new resource of high-quality human protein interaction data | 1800 interactions among 900 proteins from 10 mammalian species | Free for both Academic and Commercial users | Experimental; Physical and Genetic networks |
| Search Tool for the Retrieval of Interacting Genes or Proteins | STRING | http://string.embl.de/ | A database of predicted functional associations among genes or proteins | 23256408 interactions among 736429 proteins in 179 species | Free for Academic but not for Commercial users | Experimental, Predicted |
| Human Protein Reference Database | HPRD | http://hprd.org | Comprehensive collection of protein features, post-translational modifications and protein-protein interactions | 33710 interactions among 20097 proteins in human organism | Free for Academic but not for Commercial users | Experimental; Physical network |
| Human Protein Interaction Database | HPID | http://hpid.org | Provides human protein interaction information and predicts potential interactions between proteins submitted by users | 8565 interactions among 1690 proteins in human organism | Free for both Academic and Commercial users | Experimental, Structural and Predicted |

Table 2.1. Databases of protein interactions

2.1.4. Computational methods and Protein Interactions

Recently, computational methods have come to play an important role in determining interactions between proteins.
They are used to predict interactions, to validate results and to analyze the protein networks derived from interaction databases. Several computational approaches are available today for predicting interactions between proteins. Protein-protein interactions can be inferred on the basis of functional relationships between proteins, such as patterns of domain fusion and protein occurrence, sequence and structural analysis, correlation of functional genomic features, conserved interactions in other organisms, and the literature. Recently developed methods for inferring protein–protein interactions can be broadly classified into physical and functional linkages [17].

The methods of the physical linkage type are as follows:

1. Interspecies interaction transfer based on the interacting sequence motif pairs identified in yeast two-hybrid screens. [18,19]
2. Interactions inferred from correlated mutations. [20]
3. Co-occurrence of sequence domains. [21,22]
4. Structure assignment followed by threading-based interaction energy evaluation. [23]
5. Ortholog-based transfer of interactions between species followed by experimental validation. [24]

The methods of the functional linkage type are as follows:

1. Network topology based functional annotation. [25]
2. Phylogenetic profile method. [26]
3. Phylogenetic profile enhancements. [27,28,29]
4. Rosetta stone or gene fusion method. [30,31]

Web-based tools and Protein Interactions

Table 2.2 lists current web-based tools to predict, validate, explore, analyze and visualize protein-protein interaction networks.
| Web tool | Acronym | URL | Features |
|---|---|---|---|
| Automated Detection and Validation of Interaction by Co-Evolution | ADVICE | http://advice.i2r.a-star.edu.sg/ | Prediction and validation of protein-protein interactions |
| BioLayout Java | - | http://cgg.ebi.ac.uk/services/biolayout/ | An automatic graph layout algorithm for similarity and network visualization |
| electrostatic surface of Functional-site | eF-site | http://ef-site.hgc.jp/eF-site/ | A database of molecular surfaces of protein functional sites, displaying the electrostatic potentials and hydrophobic properties together on the Connolly surfaces of the active sites |
| Expression Profiler | EP:PPI | http://ep.ebi.ac.uk/EP/PPI/ | Explores protein interaction data using expression data |
| IntAct Project | IntAct | http://www.ebi.ac.uk/intact/index.html | Database and toolkit for the storage, presentation and analysis of protein interactions |
| InterWeaver | - | http://interweaver.i2r.a-star.edu.sg/ | A web server of interaction reports |
| Interaction Prediction through Tertiary Structure | InterPreTS | http://speedy.embl-heidelberg.de/people/patrick/interprets/index.html | Predicts the potential interaction of two proteins from three-dimensional information of protein complexes |
| Medusa | - | http://www.bork.embl-heidelberg.de/medusa/ | An interface to the STRING protein interaction database and a general graph visualization tool |
| Surface Properties of Interfaces – Protein Protein interfaces | SPIN-PP Server | http://trantor.bioc.columbia.edu/cgi-bin/SPIN/ | A database of all protein-protein interfaces for protein-protein interactions in the Protein Data Bank |
| Protein-Protein Interaction Server | - | http://www.biochem.ucl.ac.uk/bsm/PP/server/server_help.html | A means of calculating a series of descriptive parameters for the interface between any two proteins in a three-dimensional protein structure |
| Protein Interactions VisualizatiOn Tool | PIVOT | http://www.cs.tau.ac.il/~rshamir/pivot/ | A Java-based tool for visualizing protein-protein interactions |
| Protein Interaction Map Walker | PIMWalker | http://pim.hybrigenics.com/pimriderext/pimwalker/ | An interactive tool for displaying protein interaction networks |
| PathBLAST | - | http://chianti.ucsd.edu/pathblast/ | Network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution |
| Interface for Sequence Prediction Of Target | iSPOT | http://cbm.bio.uniroma2.it/ispot/ | Prediction of protein-protein interactions mediated by families of peptide recognition modules |
| WebInterViewer | - | http://165.246.44.45/hpid/webforms/Visualization.aspx | Visualizes and analyzes large-scale protein interaction networks in 3D space |
| infer Protein Protein Interactions | iPPI | http://www.bioinfo.cu/iPPI/ | Infers protein-protein interactions through homology search |
| interface for Pfam | iPfam | http://www.sanger.ac.uk/Software/Pfam/iPfam/ | Visualization of protein-protein interactions at domain and amino acid resolutions |
| Virtual Ligand Screening | - | http://www.molsoft.com/vls.html | A computer technique that simulates the interaction between proteins and small molecules that might be good leads for potential new drugs (example: HIV-1 protease) |

Table 2.2. Software tools to predict interactions between proteins

2.2. Problem Statement

Websites hosting human protein interactions were studied extensively (Tab.2.3.) and compared with respect to features such as ranking, protein domain, pathway, gene, annotation, sequence, co-citation and database cross-references. A link alone is not counted as a feature, except for database cross-references. Features are marked as complete (+) or incomplete (-).
| Feature | DIP | HPRD | STRING | INTACT | OPHID |
|---|---|---|---|---|---|
| Interacting proteins list | + | + | + | + | + |
| Domain information | - | + | - | - | - |
| Pathway information | - | - | - | - | - |
| Database cross-reference information | + | + | - | - | + |
| Gene information | + | + | - | + | - |
| Protein sequence | + | + | + | + | - |
| Protein annotation information | - | + | - | - | + |
| Ranking feature for protein interactions | - | - | + | - | - |
| Comments | Only links for domain info. | Domain and sequence info. only for input protein but not for all interacting proteins | Emphasis on protein interaction determination from different perspectives | - | Annotation and source information for all interacting proteins |

Table 2.3. Comparison of interaction features across human databases

The above comparison shows that none of the existing human interaction databases provides a complete overview of interacting proteins. Surprisingly, only the STRING database appears to rank interacting protein pairs. HPRD appears to hold more information than the other databases, but that information is limited to the input protein and does not cover all interacting proteins. Hence, this study shows that the human protein interactomics field needs a complete and reliable resource for protein interactions.

2.3. Research Question

How best can we integrate, organize and represent data on human protein-protein interactions, domains, pathways, co-citation and annotation computationally in order to extract new biological knowledge (e.g., validating disease bio-markers in a molecular network context and identifying better drug targets)?

2.4. Hypothesis

The literature survey on the life science discovery process, the information integration environment for life science discovery, life science databases and methods (in particular for protein interactomics), and data integration laid a good foundation for defining the scope and goals of this research on human protein interactomics. Protein interaction networks act as the backbone of current functional genomic research.
These networks lay the foundation for systems biology analysis of the cell. To extract novel information from protein interaction data, computational development of two basic types is necessary:

(1) Database infrastructure that enables efficient storage and retrieval, with a database design that lets different databases communicate with each other and lets researchers access the information in these databases [32], and
(2) Software tools to identify relationships between data and to generate hypotheses that can be tested experimentally [33].

A robust data integration system should in turn have the following fundamental features [13]:

1. Accessing and retrieving relevant data from different databases
2. Transforming the retrieved data into a designated data model for integration
3. Providing a rich common data model for exploring retrieved data and presenting integrated data objects to end-user applications
4. Providing a high-level expressive language to create complex queries across multiple databases and to facilitate data manipulation, transformation and integration tasks
5. Managing query optimization and other complex issues

III. MATERIALS

3.1. Bioinformatics Databases used

The following bioinformatics databases were used in this research. They cover a wide variety of biological data, ranging from proteins, domains and pathways to protein-protein interactions.

3.1.1 Protein Interaction Databases

STRING

STRING [34] is a database of known and predicted protein-protein interactions. The interactions (Fig.3.1.) include direct (physical) and indirect (functional) associations. The associations are obtained from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by comparing them against a common reference set, and presents the evidence in a consistent web interface.
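To make the idea of ranked associations concrete, here is a minimal Python sketch that keeps only the associations at or above a confidence cutoff and orders them from most to least confident. The record layout, the 0 to 1 score scale, the cutoff value, and the protein names are assumptions made for illustration; they are not STRING's actual file format or scoring scheme.

```python
def rank_associations(associations, min_score=0.7):
    """Filter and rank scored protein-protein associations.

    associations: iterable of (protein_a, protein_b, score) tuples,
                  with score assumed to lie between 0 and 1.
    min_score:    hypothetical confidence cutoff; associations scoring
                  below it are dropped.
    Returns the surviving associations, highest score first.
    """
    kept = [record for record in associations if record[2] >= min_score]
    return sorted(kept, key=lambda record: record[2], reverse=True)


# Made-up example records; names and scores are not real data.
records = [
    ("ProtA", "ProtB", 0.95),
    ("ProtA", "ProtC", 0.40),
    ("ProtB", "ProtD", 0.72),
]
ranked = rank_associations(records)
for protein_a, protein_b, score in ranked:
    print(protein_a, protein_b, score)
```

A single numeric score per pair is what makes this kind of cutoff and ordering possible, whichever evidence sources contributed to the score.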
Figure 3.1. Snapshot of String Protein Interaction database

OPHID

OPHID [35] is an online database of human protein-protein interactions. It explores known and predicted protein-protein interactions and facilitates bioinformatics initiatives in exploring protein interaction networks (Fig.3.2.). It was built by mapping high-throughput (HTP) model organism (yeast, mouse, Drosophila and C. elegans) data to human proteins. The database currently contains 47656 interactions with 10652 proteins.

Figure 3.2. Snapshot of Ophid Protein Interaction database

3.1.2. Protein Annotation Databases

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, and PIR.

SWISSPROT

The Swiss-Prot Protein Knowledgebase is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases [36]. Swiss-Prot distinguishes itself from other protein sequence databases by three criteria:

a. Annotation: Two classes of data, the core data and the annotation, can be distinguished in Swiss-Prot. For each sequence entry the core data (Fig.3.3.) consists of the sequence data, the citation information and the taxonomic data. The annotation consists of functions of the protein, posttranslational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
b. Minimal redundancy: All the data for a given protein is merged to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
c. Integration with other databases: Swiss-Prot is currently cross-referenced to more than 50 different databases. This extensive network of cross-references allows Swiss-Prot to play a major role in biomolecular database interconnectivity.

Figure 3.3. Snapshot of SwissProt manually annotated Protein Sequence database

The current Swiss-Prot release is version 49.3 as of 21-Mar-2006 and contains 212425 sequence entries, comprising 77942645 amino acids abstracted from 139653 references. The database currently contains 13633 human proteins.

TREMBL

UniProtKB/TrEMBL is a computer-annotated protein sequence database complementing the UniProtKB/Swiss-Prot Protein Knowledgebase. It contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and also protein sequences extracted from the literature or submitted to UniProtKB/Swiss-Prot. The database is enriched with automated classification and annotation. The current TrEMBL release is version 32.3 as of 21-Mar-2006 and contains 2666963 entries comprising 857415579 amino acids.

3.1.3. Protein-Domain Databases

PFAM

Pfam is a database of multiple alignments of protein domains or conserved protein regions.

Figure 3.4. Snapshot of Pfam Protein Domain database

There are two types of Pfam families. Pfam-A families are accurate, human-crafted multiple alignments, whereas Pfam-B is an automatic clustering of the rest of Swiss-Prot using the program Domainer [37]. Pfam-A (high quality) families are shown as just their names. Pfam-B (low quality, automatically clustered and aligned) families are shown as Pfam-B_xxxx. The Pfam-A hits are hyperlinked to the domain annotation. The above figure (Fig.3.4.) shows the start position, end position, source and score of the Pfam-A domains of protein FAAH_HUMAN. SwissPfam is an annotated description of how Pfam domains map to (possibly multidomain) SwissProt entries. The current release has 1285025 entries and was indexed 28-Feb-2005.

3.1.4.
Pathway Databases

KEGG [38] is a suite of databases and associated software that integrates the current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/DRUG/GLYCAN/REACTION databases).

Figure 3.5. Kegg Pathway database showing Insulin Signaling Pathway

The above figure (Fig.3.5.) visualizes the Insulin Signaling Pathway in detail. There are around 37,110 pathways, 290 reference pathways and 1,411,118 genes in the KEGG database.

3.1.5. Bibliographic Databases

PubMed

PubMed [39] is a service of the U.S. National Library of Medicine that includes over 16 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubMed (Fig.3.6.) includes links to full text articles and other related resources.

Figure 3.6. Snapshot of PubMed literature database

In addition, it provides a Batch Citation Matcher, which allows users to match their citations to PubMed citations using bibliographic information such as journal, volume, issue, page number, and year.

3.2. Software Languages used

3.2.1. Perl

Perl is a programming language developed for text manipulation and also used for web development and other tasks. ActivePerl is ActiveState's quality-assured, ready-to-install distribution of Perl, available for AIX, HP-UX, Linux, Mac OS X, Solaris, and Windows. The standard ActivePerl distribution 5.6.1.638 was downloaded and installed on a Windows machine. Perl was used for parsing the Swissprot, Trembl, String, SwissPfam, Ophid, and Kegg database files.

3.2.2. PHP

PHP (recursive acronym for PHP: Hypertext Preprocessor) is a widely used open source general-purpose scripting language that is especially suited for Web development and can be embedded into HTML. PHP version 4.3.2 was used for developing the website.

3.2.3.
HTML

Hyper Text Mark-Up Language (HTML), an application of Standard Generalized Mark-Up Language (SGML) for electronic publishing, is the specific standard used for the World Wide Web. It was used for creating web pages with hypertext and other information to be displayed in a web browser. It was also used to structure information.

3.2.4. SQL

SQL is an ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. SQL statements were used to retrieve and update data in a database. The query and update commands together form the Data Manipulation Language (DML) part of SQL, whereas the commands that create or delete tables form the Data Definition Language (DDL) part of SQL.

3.2.5. PSQL

psql is a terminal-based front-end to PostgreSQL. Queries were typed in interactively and issued to PostgreSQL to obtain the desired results. In a few cases, input was also given in the form of a file. In normal operation, psql provides a prompt with the name of the database to which psql is currently connected, followed by the string =>. psql version 8.1.3 was used to access the String protein interaction database.

3.3. Relational Databases used

3.3.1. Oracle

Oracle, a flexible and cost-effective database for managing enterprise information, was chosen as the back end. An Oracle database-mediated information management system (IMS) was developed to manage the information on human protein interactions effectively and efficiently. The human protein interaction data was obtained from the STRING and OPHID databases, the protein annotation data from the UNIPROT database, the protein domain data from the PFAM database and the protein pathway data from the KEGG database. BIO10G, an Oracle database available on the inquire-g server, was used for storing all of the above-mentioned human protein interaction data in the form of tables under the nsudhara schema. Oracle 10g client (10.1.0.2.0) software was downloaded and installed on a Windows machine.
Then it was configured to connect to BIO10G by providing the hostname, port number and net service name. Oracle proved to be an excellent choice for the back end because tables are easy to create and information is easy to query using SQL.

3.3.2. PostgreSQL

PostgreSQL is a free object-relational database server (database management system), released under a flexible BSD-style license. As the STRING human protein interaction database is distributed for PostgreSQL, the PostgreSQL database installed and configured on the BIOX server was used to store the human protein interaction data.

3.4. Software Tools used

3.4.1. Komodo

ActiveState Komodo is a professional integrated development environment (IDE) for dynamic languages, providing a powerful workspace for editing, debugging, and testing programs. Komodo 3.5.2 software was downloaded and installed on the Windows platform. Komodo's Rx Toolkit was used for creating regular expressions for parsing the Swissprot, Trembl, Pfam, String and Kegg database files. Rx Toolkit takes a regular expression and some sample data and reports the matches, the groups, the number of matches, etc. Once a regular expression had been built and debugged, a Perl program was written to open a file for input, use the regular expression to parse the information of interest, and write that information to an output file.

3.4.2. SQL Loader

SQL*Loader, a bulk loader utility, was used for moving data from parsed external files into the Oracle database. There are four stages to loading data using SQL*Loader:

1. Create a data file: The data file is a text file that contains the parsed data. There is one record per line and a comma separates each attribute value.
2. Create the relation for the data: The relation must exist in the database before SQL*Loader can load data into it.
3. Create a control file: The control file describes the structure of the data and indicates the relation into which the data should be loaded. This file states that each line or record in the data file contains attributes corresponding to the attributes in the table. The attribute names must be in the same order in the control file as the attribute values are in the data file.
4. Run SQL*Loader: SQL*Loader reads the control file and loads the data.

SQL*Loader was executed from a DOS command window. It creates a number of files as it loads the data, including a log file that describes what happened and any errors that may have occurred.

3.4.3. SSH / SFTP

SSH/SFTP Secure Shell Client (version 3.3.2) is a secure network connection system that establishes an encrypted connection to a remote machine. It also provides a secure file transfer program that transfers files between the local machine and a remote machine or server. SSH was used on a Windows machine to connect remotely to the biox, discover and inquire-g servers. SFTP was used on Windows to transfer files from the remote server to the local machine and vice versa. Files and directories are dragged from one view and dropped on a target directory in the other to start a transfer.

3.4.4. Ultra Edit

The UltraEdit-32 text editor was used for easy viewing and editing of code and variables. As a disk-based text editor, it supports files in excess of 4 GB and uses minimal RAM even for multi-megabyte files. Other features include syntax highlighting, project/workspace support, column/block mode editing, formatting, a hexadecimal editor and multi-byte support with integrated IME support.

3.4.5. Erwin

AllFusion ERwin Data Modeler, a powerful database development tool, was used to create the data model. Data models visualize data structures to organize, manage and moderate data complexities, database technologies and the deployment environment.

3.4.6.
Toad

Toad for Oracle is a powerful, low-overhead tool used to create and execute queries, as well as to build and manage database objects. Toad version 7.6 was used with Oracle 9i. When Oracle 9i was upgraded to Oracle 10g, the existing Toad version did not support Oracle 10g, as it was designed to work only with Oracle versions 7.3.4 to 9.2. As a result, Toad freeware 8.5 was used with Oracle 10g for a limited time.

3.4.7. Aqua Data Studio

Aqua Data Studio, a database query tool, was used to create, edit, and execute SQL scripts, as well as to browse and visually modify database structures. Aqua Data Studio 4.0.1 was downloaded from iuware and installed on a Windows machine. It provided an integrated database environment with a single consistent interface to all major relational databases, such as Oracle and PostgreSQL. The database servers used were Oracle's BIO10G and PostgreSQL's BIOX. This made it possible to handle multiple tasks simultaneously from one application.

3.4.8. SQL Tools

SQLTools, a lightweight and robust tool for Oracle, was used for database development. It is free, small and does not require any installation.

3.4.9. EndNote

EndNote was used to retrieve, organize, and print bibliographies and bibliographic references. EndNote 9.0 bibliographic software available at Indiana University was downloaded and installed on a Windows machine. It was used to search online bibliographic databases such as PubMed, organize references and images, and create bibliographies and figure lists instantly. By reading the abstracts, one can decide whether a reference is relevant to the subject of interest or not.

3.5. University Computing Resources used

3.5.1. BIOX

The host name for biox is in-info-bio3.ads.iu.edu. Its storage capacity is around 1.1 TB. The operating system is Red Hat Enterprise Linux 4.
The development tools available on this machine were Python 2.3.4, Perl 5.8.5, PHP 5.02, MySQL 4.1.14, Oracle 10.2.1, Gcc 3.4.3, Apache 1.3.11, BioPerl 1.5.1 and PostgreSQL 8.1. The server was used for storing large databases such as STRING, and PostgreSQL was used for accessing the STRING database.

3.5.2. Zen

Zen is the name of a LaCie server located at Indiana University Purdue University Indianapolis. The website is http://zen.informatics.iupui.edu and the main purpose of this server is to store and share bioinformatics project related data. The size of this server is around 1860 GB. Zen was used to store raw data, parsed data and scripts (Perl, PHP, control files etc.) related to human protein interactions.

3.5.3. Research Database Complex (RDC)

The inquire-g database server and the discover web server constitute the Research Database Complex at Indiana University. Their host names are inquire-g.uits.indiana.edu and discover.uits.indiana.edu respectively. They are accessed via ssh2, as it offers a more secure connection with encrypted text. The inquire-g server was used to store data in the Oracle database. The discover web server was used for publishing the research work on the World Wide Web.

IV. Procedures and Interventions

4.1. Method Roadmap for HAPPI Database

The method roadmap for the HAPPI database (Fig.4.1.) gives an overview of the tasks, including the input or motivation and the output of this research work. One can get a quick understanding of the research work by studying the roadmap instead of going through all the pages.

[Figure 4.1 roadmap, in order: Understand the complexity of protein functions; Literature Survey; Data Integration; Quality Assessment of Protein Interactions; HAPPI Development; Case Study; A whole view of functional information of interacting proteins]

Figure 4.1. Method Roadmap for HAPPI database

4.2. Architecture

The HAPPI database is a classical 3-tier web application. The three-tier architecture of the HAPPI database is discussed in the following paragraphs. The following figures (Fig.4.2. and Fig.4.3.)
give an overview of the structure and technologies of HAPPI's hardware and software architectures.

Presentation Tier

This layer implements the "look and feel" of an application. It is responsible for the presentation of data, receiving user inputs and controlling the user interface. It receives an HTTP request and returns a response in the form of an HTML document. As a web authoring markup language, HTML was used for defining content structures and rendering a web page.

[Figure 4.2 labels: Web Browser; World Wide Web; Discover; Inquire-g; Apache; PHP; HTML; Oracle; Postgres; Web Layer; Application Layer; Database Layer]

Figure 4.2. Hardware Architecture for HAPPI database

Application Logic Tier

This is the layer in which the business logic exists. The business logic of the application decides whether all conditions are met and implements the use case scenarios. It processes each request according to the research rules, for example, deciding whether to reject input data or send it to the database. The bulk of the functionality of the program is found in the application layer. PHP was used for this tier.

[Figure 4.3 labels: client side (user, Internet browser, client-side scripting logic in JavaScript); server side: Presentation Layer (template file containing all HTML code in the application; rendering), Logic Layer (main PHP script controlling the logic and flow of the application), Data Access Layer (Oracle 10g database)]

Figure 4.3. Three-Tier Software Architecture: Structure and Technologies

Data Access Tier

This is the layer that manages the persistence of application information. It is powered by the relational database server Oracle. Functions are used to execute database server-side processes pertinent to data integrity. Views are mostly used for presenting data to applications, as they offer some level of security and can be used as aliases that hide the physical structures of database tables. Database tables are used primarily for storing data. PHP offers two extension modules that can be used to connect to Oracle.
The first is the standard Oracle functions (ORA) and the second is the Oracle Call Interface functions (OCI). The OCI extension module was used here to connect to Oracle from PHP because it offers more options; for example, unlike ORA, OCI includes support for CLOBs, BLOBs, BFILEs, ROWIDs, etc. The 3-tier architecture was chosen owing to advantages such as flexibility, maintainability, reusability, scalability and reliability [40].

4.3. Data Integration

4.3.1. Data Warehouse Approach

The data warehouse approach was used as the solution for data integration. Databases were assembled into a centralized system with a global data schema and an indexing system for integration and navigation. Data was integrated into a central data repository, and was cleaned and transformed during the loading process. While a variety of data models are used for data warehouses, including XML and ASN.1, the relational data model, the most popular choice, was adopted and implemented in Oracle. A relational database management system (RDBMS) offers the advantage of a mature and widely accepted database technology and a high-level standard query language (SQL). A drawback of the approach is that, as the number of databases in a data warehouse grows, the cost of storage, maintenance, and updating data can become prohibitive. Its advantage is that the data are readily accessed without Internet delay or bandwidth limitation in network connections. Vigorous data cleansing to remove potential errors and duplications was performed before entering data in the warehouse. A major strength of a data warehousing approach is that it permits cleansing and filtering of data because an independent copy of the data is being maintained [13]. Warehousing exerts a load on the remote sources only at data refresh times, and changes in the remote sources do not directly affect the warehouse's availability.
In a nutshell, data warehousing is probably the best approach for interaction data that must be cleaned, transformed and curated, and for which only the best query performance is adequate. Thus a limited data warehouse was chosen, a popular solution in the life sciences for data mining of large databases in which carefully prepared datasets are critical for success.

4.3.2. Data Acquisition

The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation. The data is available in different formats, i.e. XML, FASTA and flat file. UniProt databases are accessed from the web at http://www.uniprot.org and downloaded from http://www.uniprot.org/database/download.shtml. Uniprot Knowledgebase Release 6.2 was used.

SwissProt Knowledgebase Release 48.8, i.e. uniprot_sprot.dat.gz, was downloaded through ftp in flat file format to the ZEN storage server. Then the file uniprot_sprot.dat, of size 713 MB, was extracted using the WinRAR application.

Trembl Knowledgebase Release 31.2, i.e. uniprot_trembl.dat.gz, was downloaded through ftp in flat file format to the ZEN storage server. Then the file uniprot_trembl.dat, of size 3.3 GB, was extracted using the WinRAR application.

STRING is a database of known and predicted protein-protein interactions. It uses a relational database system (PostgreSQL) to store primary data and precomputed predictions. The data is available in COG mode (flat files), Protein mode (flat files), and as database dumps. It is available free of charge for licensing to academic institutions.

The following figure and table (Fig.4.4. and Tab.4.1.) give an overview of the different bioinformatics databases used to integrate data into the HAPPI database, along with their descriptions, versions and sizes.
[Figure 4.4: the source databases Swissprot, Trembl, Pfam, Kegg, String and Ophid, together with IPI ID mapping, feed into HAPPI; the diagram is annotated with total and reduced record counts for curated proteins, computer-annotated proteins, domains, pathways, interactions, associations and co-citations]

Figure 4.4. Integration of Protein Annotation, Interaction, Domain, Sequence and Pathway Data

The KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database was downloaded by FTP as an anonymous user. Then it was unzipped using the gunzip utility.

OPHID, the Online Predicted Human Interaction Database, is available for download in two formats, PSI XML and plain text. A form with the terms of download and use was filled in, and an email containing a link to download the data was sent in response. By clicking the link, the human protein interaction data, i.e. ophid1140247825213.txt, was downloaded to the ZEN storage server.

SwissPfam, the domain database of SwissProt and Trembl proteins, Release 18, was downloaded as an anonymous user from sanger.ac.uk/pub/databases/Pfam.

In a nutshell, the following files were downloaded from different bioinformatics database servers for protein interaction data analysis:

File                                  Description                  Download Date   Version         Source    Size
Uniprot_sprot.dat.gz                  Swissprot Knowledgebase      1/18/2006       48.8            UniProt   713 MB
uniprot_trembl.dat.gz                 TrEMBL Knowledgebase         10/20/2005      31.2            UniProt   3.3 GB
Interaction.data.v6.2.sql.gz          Protein Interactions         2/28/2006       6.2             STRING    1.6 GB
Primary.data.v6.2.sql.gz              Protein Players              3/1/2006        6.2             STRING    265 MB
Homology.data.v6.2.sql.gz             Protein homology             3/1/2006        6.2             STRING    4.5 GB
Protein.links.detailed.v6.3.txt.gz    Protein network data         3/1/2006        6.3             STRING    172 MB
COG.links.detailed.v6.3.txt.gz        Association scores           3/1/2006        6.3             STRING    13 MB
Genes.tar.gz                          Genes and Pathways           3/9/2006        38              KEGG      1.1 GB
Swisspfam.gz                          Domains                      10/5/2005       18              PFAM      72 MB
ophid1140247825213.txt                Human Protein Interactions   2/20/2006       Not Available   OPHID     334 KB

Table 4.1. An overview of Data Acquisition from different data sources

4.3.3. Data Reduction

As the research work is focused on the human organism, steps were taken to reduce the data of all organisms to the data of Homo sapiens. The table (Tab.4.2.) shows the total number of records, the reduced records, and the identifier used to reduce the records of each bioinformatics database integrated into the HAPPI database.

Swissprot: A Perl program was written to extract only the human proteins from the total set of proteins. To do so, each protein record was analyzed to distinguish human from non-human records. Because the UniProt identifier on the ID line starts each protein entry, the parser checked the first line of each entry. If the protein identifier had 'human' in its name, the record was considered a human protein and was written to the output file. If the protein identifier did not have 'human' in its name, the record was considered a non-human protein and was not written to the output file. Upon executing this parser, 12905 human proteins were written to the output file from a total of 280594 proteins of all organisms.

Trembl: As this database also belongs to the Uniprot Knowledgebase, the same procedure as for Swissprot was followed. A Perl parser was written to extract human proteins out of total proteins of all organisms.
2,212,675 proteins were reduced to 57,924 human proteins.

SwissPfam: A Perl parser was written to extract human proteins from the proteins of all organisms. Upon execution, this parser reduced 1530770 total proteins to 52186 human proteins.

Ophid: As this database consists of only human protein interactions, the data was taken as is for the research work.

String: As String uses a PostgreSQL database for data storage, the PostgreSQL database system residing on the BIOX server was utilized. Under Postgres, a local database named 'rani' was created. Then the SQL dump file of String was executed under this database using psql. Upon execution of this file, several String tables were created and loaded. Then the tables and their contents were analyzed. The String interaction table has around 30,104,706 interactions for 180 organisms. As the research work is concerned only with the human organism, another table was created with human proteins only. As the taxonomy identifier of the human organism is '9606', each record was checked, and if the record had the human taxonomy identifier, it was written to the newly created table, which yielded 2,403,198 human interactions. Then the table contents were copied to a text file with a delimiter.

Kegg: A Perl parser was written to distinguish between genes with pathways and genes with no reported pathways. As the main purpose of integrating the Kegg database is to extract pathway information, the parser was executed to create a new output file containing the genes that have pathways. There are around 23,464 genes, out of which only 4084 genes had pathway information.
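The Swissprot and Trembl reduction step described above can be illustrated with a minimal sketch. The thesis parsers were written in Perl and are not reproduced here; the Python version below only assumes that entries are flat-file records terminated by a '//' line and that human entries carry an entry name ending in '_HUMAN' on the ID line (as in FAAH_HUMAN). The function names and the abbreviated sample records are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group flat-file lines into entries; each entry ends with a '//' line."""
    entry: List[str] = []
    for line in lines:
        entry.append(line.rstrip("\n"))
        if line.startswith("//"):
            yield entry
            entry = []
    # Lines after the last '//' (an unterminated entry) are dropped.


def is_human(entry: List[str]) -> bool:
    """True when the ID line's entry name ends in '_HUMAN'."""
    if not entry or not entry[0].startswith("ID"):
        return False
    fields = entry[0].split()
    return len(fields) > 1 and fields[1].endswith("_HUMAN")


def filter_human(lines: Iterable[str]) -> List[List[str]]:
    """Keep only the entries that pass the human check."""
    return [entry for entry in iter_entries(lines) if is_human(entry)]


# Abbreviated, illustrative records (real entries carry many more lines).
sample = [
    "ID   FAAH_HUMAN     STANDARD;      PRT;   579 AA.",
    "//",
    "ID   FAAH_MOUSE     STANDARD;      PRT;   579 AA.",
    "//",
]
human_entries = filter_human(sample)
print(len(human_entries), human_entries[0][0].split()[1])
```

Testing the suffix of the entry name, rather than searching for 'human' anywhere in the line, is a choice made for this sketch; it avoids matching the word inside descriptions, but it is not necessarily how the original parser was written.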
Database    Data Reduction identifier   Total Records   Reduced Records
SwissProt   SwissProt ID                280,594         12,905 proteins
Trembl      TrEMBL ID                   2,212,675       57,924 proteins
Pfam        SwissProt ID                1,530,770       52,186 proteins
String      Taxonomy ID                 30,104,706      2,403,198 interactions for 17,636 proteins
Kegg        Pathway                     23,464          4084 genes with pathways
Ophid       None                        51,207          51,207 interactions for 7002 proteins

Table 4.2. An Overview of Data Reduction of Protein Integrated Data

4.3.4. Feature Selection

SwissProt: The structure of a sequence entry was analyzed for selecting the required features of each protein. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below (Tab.4.3.).

Line code   Content                          Occurrence in an entry
ID          Identification                   Once; starts the entry
AC          Accession number(s)              Once or more
DT          Date                             Three times
DE          Description                      Once or more
GN          Gene name(s)                     Optional
OS          Organism species                 Once
OG          Organelle                        Optional
OC          Organism classification          Once or more
OX          Taxonomy cross-reference         Once
RN          Reference number                 Once or more
RP          Reference position               Once or more
RC          Reference comment(s)             Optional
RX          Reference cross-reference(s)     Optional
RG          Reference group                  Once or more (Optional if RA line)
RA          Reference authors                Once or more (Optional if RG line)
RT          Reference title                  Optional
RL          Reference location               Once or more
CC          Comments or notes                Optional
DR          Database cross-references        Optional
KW          Keywords                         Optional
FT          Feature table data               Optional
SQ          Sequence header                  Once
(blanks)    Sequence data                    Once or more
//          Termination line                 Once; ends the entry

Table 4.3. Line types and codes of a protein sequence entry in Uniprot database [36]

As shown in the above table, some line types are found in all entries, others are optional.
Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). A Perl program was then written to parse only the required features (given in the table below) of each protein.

TrEMBL: The general structure of an entry in TrEMBL is identical to that of the SwissProt database. The data class distinguishes the fully annotated entry from the computer-annotated entry of each protein. The data class for Trembl is Preliminary, whereas for Swissprot it is Standard. To parse the required features of this database, a Perl program was written and executed.

Pfam: The domain structure of the Pfam database was analyzed for feature selection. Domains constitute two types of regions. PfamA regions are the regions of proteins that are predicted by the Pfam collection of hidden Markov models to belong to a family. These are strongly trusted matches to a family and are very unlikely to be false matches. PfamB regions are regions of proteins that belong to a PfamB family. Pfam-B is an automatically generated supplement to Pfam that is generated from the PRODOM database. A Perl parser was written to select the domain features of human proteins.

Kegg: The Kegg pathway database structure was analyzed for feature selection. A typical record consists of the name of the gene, its description, its pathway information, the source organism, ortholog information, the chromosome location of the gene, associated motifs, database cross-references, codon usage, amino acid sequence and nucleotide sequence. A Perl parser was written to select the gene name, description and pathway information of each record.

The following table (Tab.4.4.) gives the summary of the selected features of each record of the bioinformatics databases integrated into the HAPPI database.
Database   Features Selected
Uniprot    Entry Name, Total number of Amino Acids, Accession Number(s), Date of creation and last modification of the database entry, Description of Protein, Gene name, Organism species, Taxonomy cross-reference(s), Bibliographic cross-reference(s), Database cross-references, Keywords, Sequence, Molecular weight, and Check Value
Pfam       Uniprot Identifier, Uniprot Accession Number, Domain(s) name, Domain(s) description, Domain(s) Identifier, Domain(s) occurrence, and Domain(s) Position
String     Ensembl Protein Identifier, Ensembl Protein Interactor, Neighborhood score, Gene fusion score, Co-occurrence score, Coexpression score, Experimental score, Database score, Text mining score, Physical sub score, Combined score, and Mapping between Ensembl Protein Identifier and Uniprot Protein ID
Ophid      Swissprot Protein Identifier, Swissprot Protein Interactor, Source of Dataset
Kegg       Gene name, Synonym(s) of gene, Description of gene, and Pathway information

Table 4.4. Summary of feature selection from Protein Integrated databases

4.3.5. Meta-data specification

Metadata is data describing data, that is, data that provides documentation on other data managed within an application or environment. A new metadata record is created for each dataset. The decision to classify data as a dataset is called 'granularity' [41]. A dataset might be the data from a public data source or from experiments. The metadata record describes all the important information about the dataset, such as where the data was collected, when it was collected, and who collected it. It contains information that enables users to interpret the data easily (such as what the column headings mean, what the units are, etc.). Metadata records are stored in an Oracle database. Though there are many different metadata standards in use by the scientific world, the most widely used metadata standard, Dublin Core (DC), was used.
The Dublin Core Metadata Element Set consists of 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights [42]. The metadata elements created in this project are the unique identifier, name of the creator, date created, table name, number of attributes, names of the attributes, sizes of the attributes, table source name, associated fact, table subject, table description and table size (number of records in the table). The benefits of metadata include data management, duplicate data reduction, concise dataset description, and so on. Here metadata is mapped to a tabular format.

4.3.6. Database Design

Program-data independence was achieved by creating independent tables, thus insulating the data from program changes. End users were determined to be researchers, biologists, and protein interactomics specialists. These users are considered to be casual users by the database designers.

Data Model

The data model (Fig.4.5.) describes data as entities, relationships and attributes. Entities are the things about which we seek information.
Attributes are the data we collect about the entities. Relationships provide the structure needed to draw information from multiple entities.

[Figure 4.5. Initial Data Model for HAPPI Database. Entities shown: species, protein_species, genes_proteins, proteins, human_protein_sequence, crossreferences, human_protein_interactions, human_protein_domain, identifiers_proteins, and domain. Key columns include species_id, uniprot_id, and domain_id.]

Figure 4.5 displays the initial data model designed for the HAPPI database. The initial data model was later modified to include the Kegg and Ophid databases. In this schema, databases such as Swissprot, Trembl, String, Ophid, Pfam and Kegg have been brought together. Selected attributes from each of these databases contribute to the structure of the HAPPI database. The column names generally describe the data they contain.

4.3.7. Data Storage or Loading
Oracle’s SQL Loader was used to load large amounts of parsed data into tables. Once the data files and the tables to be loaded had been created, a control file was written for each table to tell Oracle how to load data from the corresponding data file. SQL Loader was then executed to read the control file and load the data. It produced a log file that describes the load and any errors that occurred. The following table (Tab.4.5.) summarizes the different tables used for data storage.
Public Database   Total Records   Parsed/Loaded Records   Schema     Table
SwissProt         280,594         12,905                  nsudhara   human_protein_uniprot_sprot
Trembl            2,212,675       57,924                  nsudhara   human_protein_uniprot_trembl
Pfam              1,530,770       52,186                  nsudhara   human_protein_domain_swisspfam
String            30,104,706      2,403,198               nsudhara   new_human_string_interactions
Ophid             51,207          51,207                  nsudhara   ophid_human_interactions
Kegg              23,464          4,084                   nsudhara   kegg_human_pathway
String            1,533,898       1,114,360               nsudhara   string_human_prot_abstracts
Table 4.5. An overview of tables loaded from Protein Integrated databases

4.4. Query Processing
PHP’s OCI extension module was used to connect PHP to Oracle because it offers more options, such as support for CLOBs, BLOBs, BFILEs, and ROWIDs. The OCIError() function was used to obtain an array containing the error code, message, offset and SQL text. The error for a specific session or cursor was obtained by supplying the appropriate handle as an argument to OCIError(). Without any arguments, OCIError() returns the last encountered error.

The most direct way to speed up the selection of data, i.e. search optimization, is to use an index. An index is essentially a structure of pointers to rows of data in a table. Indexes were created on all the required attributes across the different tables in the database. An index improves the performance of database queries by ordering rows to speed access.

4.5. User Interface: HAPPI Database
User interface-flow diagrams [43] were used to give a high-level overview of the user interface for the HAPPI database. This architectural approach was adopted to understand the complete user interface of the system. Factors such as usability, clarity, simplicity, and speed were considered in the design of the HAPPI website. The following figure (Fig.4.6.) gives a high-level overview of the HAPPI website.
The HAPPI website is available at http://discover.uits.indiana.edu:8340/ProteinInteractions/index.html

[Figure 4.6. HAPPI Database: User Interface Flow Diagram. Nodes: HAPPI main page (takes a protein as input; Submit button); list of interacting proteins with their descriptions and scores for a given protein; interacting protein pair link, leading to the list of genes, pathways, co-citations and domains of the interacting protein pair; individual protein link, leading to the sequence, annotation and cross-reference information of that protein; outgoing co-citation, pathway, domain, gene and database cross-reference links (PubMed Literature, Kegg Pathway, Pfam Domain, Entrez Gene, MIM Disease, PDB Structure, Prosite, Ensembl Gene, Gene Ontology, Reactome, ProDom, PIR).]

4.6. Unified Scoring Model
We used a ranking method that works in principle by clustering interaction confidence scores from different data collection methods. A unified scoring model was developed for ranking the importance and reliability of human protein-protein interactions integrated from both the String and Ophid protein interaction databases. First, two scoring systems, one for String and one for Ophid protein interactions, were developed individually; then a unified mechanism was proposed to combine them. The objective of the unified scoring model is to assess the quality of protein-protein interactions. For a given input protein, protein interactions are retrieved from the String and Ophid protein interaction databases. If an interaction is found only in String, the String Protein Interaction Confidence score is used; if it is found only in Ophid, the Ophid Protein Interaction Confidence score is used. If an interaction is found in both databases, the unified scoring model is used. The final score is assigned to each interacting pair of proteins.
String Protein Interaction Confidence Scores

Type of String Score    Minimum Score   Maximum Score
Neighborhood score      24              799
Gene Fusion score       1               898
Co-occurrence score     431             975
Co-expression score     200             231
Array score             24              552
Experimental score      24              999
Database score          24              996
Text mining score       24              816
Combined score          150             999
Table 4.6. Analysis of String database scores

String used several methods (neighborhood, gene fusion, co-occurrence, co-expression, experiments, databases, text mining) to predict protein interactions. String interaction data and their confidence scores from the different sources were analyzed, and the distribution of the String interaction data was examined thoroughly. Based on this distribution (Tab. 4.6.), a 5-star ranking scale was developed.

Rating based on combined score distributions
* (1 star): <20%
** (2 star): 20-32%
*** (3 star): 32-70%
**** (4 star): 70-90%
***** (5 star): >90%
Note: the unit of the score is 0.1%
Figure 4.7. String database score distributions

Manual clustering was done to achieve maximum representation of interactions in each bin. The String combined score ranges from 0.001 to 0.999. Based on the combined score distributions, a 5-star ranking model was developed (Fig.4.7.). If the String combined score is less than 0.2, a protein-protein interaction is given 1 star. Interactions with a combined score from 0.2 to 0.32 are given 2 stars, from 0.33 to 0.7 3 stars, from 0.71 to 0.9 4 stars, and above 0.9 5 stars. Confidence levels were defined as low, medium, high and the highest. A String combined score of 20% or better is considered low confidence, 50% or better medium confidence, 75% or better high confidence, and 95% or better the highest confidence. A higher confidence score indicates a more reliable protein-protein interaction.
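The star binning described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in HAPPI: the function name `string_star_rating` and the table `STAR_UPPER_BOUNDS` are hypothetical, the bins follow the percentage ranges of Fig. 4.7, and the sketch assumes that scores arrive in String's raw unit of 0.1% and that a score falling exactly on a bin boundary (32%, 70%, 90%) belongs to the lower bin.

```python
# Star bins taken from the percentage ranges of Fig. 4.7.
# Scores are expressed in raw units of 0.1% (for example, 150 means 15.0%).
# Assumption: a score exactly on a boundary belongs to the lower bin.
STAR_UPPER_BOUNDS = [(199, 1), (320, 2), (700, 3), (900, 4)]  # (inclusive upper bound, stars)


def string_star_rating(raw_combined_score: int) -> int:
    """Map a String combined score (0-1000, unit 0.1%) to a 1-5 star rating."""
    if not 0 <= raw_combined_score <= 1000:
        raise ValueError("combined score must lie between 0 and 1000")
    for upper_bound, stars in STAR_UPPER_BOUNDS:
        if raw_combined_score <= upper_bound:
            return stars
    return 5  # anything above 90%


if __name__ == "__main__":
    for raw in (150, 200, 320, 321, 700, 901, 999):
        print(raw, "*" * string_star_rating(raw))
```

Working in integer units of 0.1% avoids floating-point comparisons at the bin edges; a caller holding fractional scores would multiply by 1000 and round before calling the function.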
Ophid Protein Interaction Confidence Scores
Ophid Human Protein Interaction data was thoroughly examined (Tab.4.7.) from the perspective of the sources and datasets used in data collection. High confidence scores were given to direct interactions, medium confidence scores to interactions inferred from high-quality mammalian data, and low confidence scores to interactions inferred from low-quality mammalian or non-mammalian data.

Source             Data Set               Confidence   Total Interactions
C. elegans         CORE_1                 0.5          1,223
                   CORE_2                 0.5
                   NON_CORE               0.4
                   LITERATURE             0.5
                   SCAFFOLD               0.5
                   INTEROLOG              0.6
                   CE_DATA                0.5
S. cerevisiae      low                    0.3          18,034
                   medium                 0.35
                   high                   0.4
                   MIPS                   0.4
M. musculus        Mouse                  0.7          1,800
                   AfCS                   0.6
                   Suzuki                 0.6
                   RikenDIP               0.7
                   RikenLit               0.6
                   RikenBIND              0.7
D. melanogaster    FlyHigh                0.65         3,883
                   FlyLow                 0.5
                   FlyCellCycle           0.6
LUMIER             WranaHigh              0.4          620
                   WranaMedium            0.4
                   WranaLow               0.3
H. sapiens         JonesErbB1             0.75         6,396
                   Pawson                 0.75
                   StelzlLow              0.75
                   StelzlMedium           0.75
                   StelzlHigh             0.75
                   VidalHuman_core        0.75
                   VidalHuman_non_core    0.75
Known Human PPI    MINT                   0.8          17,096
                   HPRD                   0.8
                   BIND                   0.8
Total                                                  49,052
Table 4.7. Analysis of Ophid database scores

Ophid [11] was built by mapping high-throughput data from yeast, mouse, drosophila and C. elegans to human proteins. Confidence scores were assigned to each type of dataset by taking the reliability of the source and the dataset into account. High confidence scores were given to known human protein-protein interactions, such as those from MINT and BIND, whereas low confidence scores were assigned to predicted human protein-protein interactions that were built by mapping HTP model organism data to human proteins.

Unified Scoring Method
The unified scoring model (Fig.4.8.) was developed to account for the fact that an interacting protein pair may exist in both the String and Ophid protein interaction databases.
If a protein-protein interaction pair is found in both the String and Ophid interaction databases, the following steps are applied. If the interaction source from String includes ‘database’ or the Ophid source indicates ‘String’, the assigned STRING score is used. Otherwise, the following scoring formula is used:

Final Score (S) = 1 - (1 - Score [STRING]) * (1 - Score [OPHID])

[Figure 4.8. Flow Chart for Score Consolidation. Steps: input protein; extract the STRING interactors and the OPHID interactors; check whether a STRING interactor also appears in Ophid and whether an OPHID interactor also appears in String; if not, use the STRING score or the OPHID score respectively; if so, check for a database source or a String source and use the STRING score when one is present, otherwise compute the combined score S = 1 - (1 - Score [String]) * (1 - Score [Ophid]).]

V. Results and Discussion
The diversity of protein-related data was taken into account before designing an optimum database system. During the development of the HAPPI database, every category of interaction data was represented, and the database was designed to be visual and user-friendly. We chose to develop a database, based on a bioinformatics analysis, that includes all the known human proteins and their interactions. The database is publicly available and can be accessed within the Discovery Informatics and Computing Group website. The Webpage for the query has been designed to be as simple to use as possible without losing precision. Figure 5.1 shows a screenshot of the query page, indicating that the protein has to be given as a Uniprot identifier in order to extract all known and predicted interactions. Once the protein is submitted, we get a list of interacting proteins along with their descriptions, followed by a 5-star score (Fig.5.2.). Within the program, all the interactors are sorted by reliability, i.e. by score. As a result, all 5-star interactors are shown at the top, followed by the 4-star, 3-star, 2-star and 1-star interactors, if any. This allows the user to view the interactors in order of interaction reliability.
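The score consolidation of Section 4.6 and the result ordering described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the HAPPI implementation: the names `final_score`, `rank_interactors` and the flag `use_string_only` are hypothetical, scores are assumed to be fractions between 0 and 1, a missing score is represented by `None`, and the interactor names in the example are placeholders rather than real query results.

```python
from typing import List, Optional, Tuple


def final_score(string_score: Optional[float],
                ophid_score: Optional[float],
                use_string_only: bool = False) -> float:
    """Consolidate the String and Ophid confidence scores of one interaction.

    use_string_only stands for the condition in the text: the String source
    includes 'database' or the Ophid source indicates 'String'.
    """
    if string_score is None and ophid_score is None:
        raise ValueError("the interaction must be present in at least one database")
    if ophid_score is None:
        return string_score  # found only in String
    if string_score is None:
        return ophid_score  # found only in Ophid
    if use_string_only:
        return string_score
    # Found in both databases: S = 1 - (1 - Score[STRING]) * (1 - Score[OPHID])
    return 1.0 - (1.0 - string_score) * (1.0 - ophid_score)


def rank_interactors(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order interactors from the most to the least reliable final score."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder interactors, not real HAPPI data.
    scored = [
        ("interactor_1", final_score(0.6, 0.5)),
        ("interactor_2", final_score(0.9, None)),
        ("interactor_3", final_score(None, 0.75)),
    ]
    for name, score in rank_interactors(scored):
        print(f"{name}\t{score:.3f}")
```

Because the formula treats the two scores as independent pieces of evidence, the combined score is never lower than either input, so an interaction supported by both databases ranks at least as high as it would with one source alone.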
Figure 5.1. The query page of HAPPI database

Figure 5.2. The Interaction Results Page of HAPPI database

In addition, clicking the interaction icon between a pair of proteins displays detailed domain, gene, co-citation and pathway information for the interacting proteins (Fig.5.3.). Within that information, links to gene, abstract, domain and pathway records are provided for users who want further detail. The individual proteins are also clickable and lead to extensive information about a protein, including its sequence information, annotation information, and related database cross-reference information (Fig.5.4.). An important aspect of proteomic analysis lies in the information around interacting proteins and in mapping the corresponding binding sites [10]. Understanding proteins and their binding partners in the context of a network provides insight into the function of proteins.

Figure 5.3. The Interaction Annotation Page of HAPPI database

Figure 5.4. The Protein Annotation Page of HAPPI database

Serine/threonine-protein kinase SGK1 as a Representative Entry in HAPPI Database
Proteins such as NEDD4, INS, P85A, INSR, PDK1, SGK1, PTEN, GRB2, and SCNAA were used to test the functionality of the HAPPI database. SGK1 was used as a representative entry to illustrate the breadth and depth of annotation in the HAPPI database and to highlight the importance of integrating protein interaction data (Tab.5.1.).

Columns: SGK1 Interacting Proteins, String, Ophid, HPRD, HAPPI
Rows (interacting proteins): NEDD4, hPDK1, IMA2, MK07, SGK3, AKT1/PKB, NEDD4-2, NHERF-2, Q96G51, HIP-2, S6K, PDK2, ENaC
Table 5.1. Comparison of SGK1 interacting proteins across P.I. databases

SGK1 was searched against the String, Ophid, HPRD and HAPPI protein interaction databases. The top proteins with a score of more than 4 stars were retrieved and compared with the other protein interaction databases. The results showed that the HAPPI interaction database captures all the interacting proteins that were identified by the other interaction databases.
In addition, other factors around these interacting protein pairs were compared, such as domain information, pathway information, gene information, sequence information, annotation information and database cross-reference information (Tab.5.2.). As the HAPPI database integrates interaction information from both the String and Ophid databases, it seems to perform better than the other databases in giving the integrated information of an interacting pair of proteins.

Columns: Features compared, String, Ophid, HPRD, HAPPI
Rows (features): List of Interacting Proteins, Scoring, Pathway Information, Gene Information, Annotation Information, Co-citation Information, Domain Information, Sequence Information, Keywords
Table 5.2. Comparison of HAPPI database features with P.I. databases

VI. Case Study
To show the benefits of the HAPPI database, we performed a case study (Fig.6.1.) on the Insulin Signaling pathway in collaboration with a biology team on campus. We began by taking two sets of proteins that were previously well studied as separate processes, set A and set B. We queried these proteins against the HAPPI database and derived high-confidence protein interaction data sets annotated with known KEGG pathways. We then organized these protein interactions on a network diagram. The end result shows many novel hub proteins that connect set A and set B proteins. Some hub proteins are even novel members outside of any annotated pathway, making them interesting targets to validate in subsequent biological studies. According to Dr. Blazer-Yost, “The peptide hormone insulin is best known as the agent that is necessary for the stimulation of glucose uptake into cells. However, this hormone is also a master regulator of many other biochemical processes including intermediary metabolism, nutrient absorption, utilization and storage and even specific growth factor effects. It has been estimated that insulin exerts a direct or indirect influence over hundreds of biochemical intermediates [44].
The action of insulin is initiated by peptide binding to a cell surface receptor. The receptor activation controls downstream signaling elements. The effects of the hormone vary with cell type and are dependent on the presence and compartmentalization of insulin target proteins and lipids. Because most cells in the body contain insulin receptors, the number of potential responses is as varied as the different types of cells found in an intact organism. It is not surprising, therefore, that some of the insulin stimulated pathways are not completely elucidated (Fig.6.2.). While there may be some commonality and overlap in the various potential pathways, traditional approaches makes it difficult to discern all the possible combinations and permutations even if some of the components of the pathway in a particular cell type are already known.”

[Figure 6.1. Flow Chart for case study. Steps: Insulin Stimulated Pathway; Proteins involved in Insulin Pathway and Proteins modified by Insulin treatment; Query against HAPPI Database; Derive High Confidence Interacting Proteins; Study KEGG Pathways; Develop Visualization Network; Target hub proteins that connect set A and B Proteins.]

Dr. Blazer-Yost’s research laboratory primarily focuses on the interactions between ENaC and its associated hormones, i.e. insulin, ADH, and aldosterone. It has been shown that insulin-stimulated Na+ reabsorption through ENaC is mediated by the phosphoinositide pathway, but many of the effectors downstream of the initial activation of PI3-kinase are unknown. The laboratory further investigates the PI 3-Kinase pathway in ADH-stimulated Cl- and Na+ transport in mouse principal kidney cortical collecting duct cells, in order to understand the role of the PI 3-Kinase pathway and the biochemical pathways involved in hormone-induced Cl- and Na+ reabsorption in the distal nephron of the mammalian kidney [45].
The illumination of such pathways has important medical significance, since common health conditions such as hypertension, congestive heart failure and renal diseases are caused by an imbalance in Na+ transport in the kidney. Until now, traditional methods have been used, such as the cell culture model of the mpkCCD cells, electrophysiology to monitor ion transport, and data interpretation using SigmaPlot. As advances in supercomputing and informatics continue to play a critical role in research and development, a network-based bioinformatics solution was proposed to deliver insights into the behavior of cells, proteins and pathways. Dr. Blazer-Yost mentioned that “This bioinformatics solution would be helpful in identifying potential unknown signaling components within the framework of isolated known pathway intermediates. The insulin-stimulated pathway which forms the starting point for this application results in increased sodium reabsorption across the polarized epithelial cells formed by the principal cells of the mammalian distal nephron.”

[Figure 6.2. Insulin Signaling Pathway (contributed by Dr. Blazer-Yost). Labeled components: insulin, insulin receptor, IRS, PI3-Kinase (labeled 110 and 85), PIP2, PIP3, PDK1, PDK2, SGK, NEDD-4, ENaC, transport vesicle, Na+, aldosterone, nucleus, and tight junction; unknown components are marked with "?".]

Asking relevant biological questions guided us in understanding the problem and in finding solutions:
1. How are these signaling pathways assembled?
2. Are there any relevant kinase signaling pathways?
3. What are the key molecules involved in these pathways?
4. What are the proteins involved in this pathway?
5. What domains do these proteins constitute?
6. What are the likely interacting proteins and their role in ion transportation?

Two sets of proteins were taken as input to this case study.
Set A proteins are proteins that were known to be involved in insulin-stimulated sodium transport in renal cells. Set B proteins are proteins that were modified by insulin treatment in the membrane/membrane-associated fraction, as determined by peptide mass fingerprinting via MALDI-TOF MS [Appendix A]. The latter experiments were performed in the Blazer-Yost and Witzmann laboratories at IUPUI. These proteins were then queried against the HAPPI database, and the high-confidence interactors along with their pathways were extracted. Based on the research experience of Dr. Blazer-Yost, the proteins of interest were then selected [Appendix B].

Figure 6.3. Visualization of Insulin Pathway Protein Interaction Network Using Pathway Studio

After capturing a list of interacting proteins of particular interest and the information around those proteins, a network was built from the input protein list, the interacting protein list and the pathways these proteins are involved in. Figure 6.3 shows the visualization of the insulin pathway protein interaction network. First, a demo version of PathwayStudio [46] was used to visualize the network. The Search tool was used to find and discern a list of proteins based on a name or keyword. The Build Pathway tool was used to find regulatory paths between two or more proteins. Proteins were represented as nodes and interactions as edges. However, the network was too dense to discern individual protein-protein interactions, and the pathways were not shown. The network was therefore drawn manually, with Venn diagrams representing pathways and lines representing interactions (Fig.6.4.). The input proteins were colored red, but the P85A protein was colored white. This network gave a good representation of the interacting proteins and of the pathways in which they are present.
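The idea of treating proteins as nodes and interactions as edges, and then looking for proteins that connect the two input sets, can be sketched as below. This is only an illustration of the principle: the helper names `build_network` and `bridging_proteins` are hypothetical, the definition of a bridging protein (one with at least one interactor in set A and at least one in set B) is an assumption made for this sketch, and the identifiers in the example are placeholders rather than case study data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_network(interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected network: each protein maps to its set of interactors."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for protein_a, protein_b in interactions:
        neighbors[protein_a].add(protein_b)
        neighbors[protein_b].add(protein_a)
    return neighbors


def bridging_proteins(neighbors: Dict[str, Set[str]],
                      set_a: Set[str],
                      set_b: Set[str]) -> List[str]:
    """Return proteins that interact with members of both input sets.

    Results are ordered by number of interactors (most connected first),
    with ties broken alphabetically.
    """
    degree = {}
    for protein, partners in neighbors.items():
        if partners & set_a and partners & set_b:
            degree[protein] = len(partners)
    return sorted(degree, key=lambda p: (-degree[p], p))


if __name__ == "__main__":
    # Placeholder identifiers, not real proteins.
    set_a = {"A1", "A2"}
    set_b = {"B1", "B2"}
    interactions = [("X", "A1"), ("X", "B1"), ("X", "B2"),
                    ("Y", "A2"), ("Y", "B1"), ("Z", "A1")]
    print(bridging_proteins(build_network(interactions), set_a, set_b))
```

In practice the interaction list would be restricted to high-confidence pairs before the network is built, so that the candidate hubs reflect reliable interactions only.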
[Figure 6.4. Visualization of Insulin Pathway Protein Interaction Network. Proteins shown: AL2S7, PICK1, ARHG1, RAC1, CSK, Q7RTZ3, SRC, CSKP, KPCD, PTPA, GRK4, PTEN, PI52A, P3C2B, P3C2G, PLCD1, PI5PA, PLCG1, PLCB1, DYRK3, PI4KA, KPCA, P85A, PK3CA, P55G, GRB2, PKD2, RASH, RRAS, INS, RAF1, KPCZ, MP2K1, IGF2, PDPK1, ELK1, PRKX, INSR, PTN1, IRS, AAPK1, SRBS1, FRAP, CBLB, PTPRF, SHC1, SOCS1, GTR4, SNX1, AAPK1, AAKG1, SGK1, PLEK, IMA2, AQP2, SCNAA, NEDD4, WWP1, WWP2.]

The major pathways studied in this case study were Regulation of actin cytoskeleton, proteins of the tight junctional complex, Phosphatidylinositol signaling intermediates, Insulin Signaling pathway and Ubiquitin mediated Proteolysis. Each pathway is given a color, and the proteins are placed according to their role in a pathway. Pathways are represented in the form of a Venn diagram. They overlap because a few proteins belong to more than one pathway. The ubiquitin mediated proteolysis pathway is drawn separately, as it has no proteins in common with the other 4 pathways. Interestingly, however, the proteins in this pathway interact with proteins in the Insulin signaling pathway. The visualization network suggested several interactions that can now be tested experimentally. Novel information was found around the SGK, NEDD4, PDPK1 and PTEN proteins. SGK was already known to interact with NEDD4, which validates this framework. In addition, SGK was found to interact with the PDPK1 protein in the Insulin pathway, which in turn interacts with KPCA, which in turn interacts with the PTEN protein. As an example of more detailed information, the phosphatase PTEN appears to be linked to proteins in three of the four domains. PTEN is a potential research target in the Blazer-Yost laboratory, and this schema illustrates additional targets of the phosphatase that were not previously considered.
In addition, many of the other proteins that interact directly or indirectly with components in multiple domains are of potential interest and can now be considered by the research laboratory for the first time.

VII. Conclusion
The publication of the draft human genome consisting of 30,000 genes is merely the beginning of genome biology. A new way to understand the complexity and richness of the molecular and cellular function of proteins in biological processes is through the understanding of biological networks. These networks include protein-protein interaction networks, gene regulatory networks, and metabolic networks. Hence, interaction databases documenting protein-protein interactions are a necessary tool for the network biology of the 21st century. HAPPI, the Human Annotated Protein-Protein Interaction database, fulfills such needs. First, the interaction database provides a single relational database platform that allows users to validate protein interactions by comparing results with previous experiments. Second, the collection and organization of known and predicted protein interactions acts as a great resource for building up interaction networks into pathways. Third, information related to the gene, pathway, domain and co-citation of interacting proteins provides a whole view of the functional information of proteins. Fourth, the properties of protein networks can be studied.

Directions for Future Work
The bioinformatics databases used may become outdated, depending on how frequently new versions of each database are released. Hence, having the latest versions of the integrated databases is essential for obtaining relevant interaction data for analysis. Protein structures make it possible to distinguish the large number of domain-domain and protein-protein interactions, which in turn can separate the biologically relevant interactions from non-relevant interactions.
Hence, providing the individual structures of interacting proteins and the combined structures of protein interactions is pivotal in enriching the information around interacting proteins. In addition, this database can be extended to worm, fly, and yeast protein interactions. As domain-domain interactions play an important role in forming a global view of protein-protein interactions, assessing the quality and reliability of interacting domain pairs in the form of scores helps to increase the reliability of protein-protein interactions. Data expansion, multi-leveled processes and better graphical displays are other future considerations for improving the HAPPI database management system. The more protein interaction databases are integrated into the HAPPI database, the more reliable the confidence scores of protein-protein interactions become. Therefore, more interaction databases can be added to the HAPPI database to increase the confidence in interaction scores. The code should also be optimized so that the results page of the HAPPI database is displayed quickly.

VIII. Appendices
Appendix A: List of Proteins and their associated pathways related to Case Study
Protein AAPK1,AAPK2,AAKG1,AAKB1,AAKB2,AAKG3,AAKG2 Pathway KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Insulin signaling pathway [PATH:hsa04910]. AP1S2,ANXA8,ANXA4,ANX11,ANXA6,ANXA5,ANX13 NO PATHWAY ARHG1,ARHG7 KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Apoptosis [PATH:hsa04210]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910].
AKT2 AKA12, AL2S7,ACTS,AKAP5,AP3S1,AQP2 NO PATHWAY BMX NO PATHWAY KEGG: TGF-beta signaling pathway [PATH:hsa04350]. KEGG: Hedgehog signaling pathway [PATH:hsa04340]. KEGG: Cytokine-cytokine receptor interaction [PATH:hsa04060]. BMP2 BTK KEGG: B cell receptor signaling pathway [PATH:hsa04662]. CCL28 KEGG: Cytokine-cytokine receptor interaction [PATH:hsa04060]. CSKP KEGG: Tight junction [PATH:hsa04530]. CABIN, CSN3,CENG1 NO PATHWAY 60 CAV1 KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910]. CBLB CXA1 KEGG: Gap junction [PATH:hsa04540]. KEGG: Cell Communication [PATH:hsa01430]. CSK KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. CNTN1 KEGG: Cell adhesion molecules (CAMs) [PATH:hsa04514]. DCAK1,DLG4,DOK1 NO PATHWAY DYR1B,DYRK2 NO PATHWAY KEGG: Nicotinate and nicotinamide metabolism [PATH:hsa00760]. KEGG: Benzoate degradation via CoA ligation [PATH:hsa00632]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. DYRK3,DYRK4,DYR1A ELK1 KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Dorso-ventral axis formation [PATH:hsa04320]. KEGG: Calcium signaling pathway [PATH:hsa04020]. ERBB4 EPHA2,EPHA3 KEGG: Axon guidance [PATH:hsa04360]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Calcium signaling pathway [PATH:hsa04020]. FAK2 61 FRAP KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Nicotinate and nicotinamide metabolism [PATH:hsa00760]. KEGG: Benzoate degradation via CoA ligation [PATH:hsa00632]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. 
KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. GRK6/GRK7/GRK5/GRK4 GBLP,GRB10,GRB2,GAB2,GAB1,GNDS,GBB3 NO PATHWAY GTR4 KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Insulin signaling pathway [PATH:hsa04910]. GNA12,GBG12 KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: MAPK signaling pathway [PATH:hsa04010]. HNRPK NO PATHWAY INSR KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Dentatorubropallidoluysian atrophy (DRPLA) [PATH:hsa05050]. KEGG: Adherens junction [PATH:hsa04520]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Dentatorubropallidoluysian atrophy (DRPLA) [PATH:hsa05050]. KEGG: Type I diabetes mellitus [PATH:hsa04940]. KEGG: Insulin signaling pathway [PATH:hsa04910]. INS IRS1,IRS2 KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Insulin signaling pathway [PATH:hsa04910]. IGF1R KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Adherens junction [PATH:hsa04520]. 62 IKKE KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. ITK INSRR,IMA2,IRK1,IGF2 NO PATHWAY JAD1A NO PATHWAY KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. JAK1 KEGG: Nicotinate and nicotinamide metabolism [PATH:hsa00760]. KEGG: Benzoate degradation via CoA ligation [PATH:hsa00632]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KC1G1 KCNE4,KPCD2,KTN1 NO PATHWAY KPCA KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Cholera - Infection [PATH:hsa05110]. KEGG: Tight junction [PATH:hsa04530]. 
KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Calcium signaling pathway [PATH:hsa04020]. KPCG KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. 63 KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Calcium signaling pathway [PATH:hsa04020]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Calcium signaling pathway [PATH:hsa04020]. KPCB KPCZ KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KPCD KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Tight junction [PATH:hsa04530]. KPCI KEGG: Tight junction [PATH:hsa04530]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KSYK KLC1,KINN,KINH NO PATHWAY KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. LCK LYN KEGG: B cell receptor signaling pathway [PATH:hsa04662]. MARCS,MYO5A NO PATHWAY MP2K1 KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. 64 KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. 
KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Dorso-ventral axis formation [PATH:hsa04320]. KEGG: Insulin signaling pathway [PATH:hsa04910]. NEDD4 KEGG: Ubiquitin mediated proteolysis [PATH:hsa04120]. NSF,NHRF2,NED4L,NEK6 NO PATHWAY NCK1 KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Axon guidance [PATH:hsa04360]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. OCRL PRKX KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Hedgehog signaling pathway [PATH:hsa04340]. KEGG: Calcium signaling pathway [PATH:hsa04020]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Axon guidance [PATH:hsa04360]. PAK6 PTN1 KEGG: Adherens junction [PATH:hsa04520]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. PTN6 65 KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Adherens junction [PATH:hsa04520]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. PTN11 KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. PI52A PLCB2/PLCB1/PLCB4/PLCB3 KEGG: Gap junction [PATH:hsa04540]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. 
KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Calcium signaling pathway [PATH:hsa04020]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Calcium signaling pathway [PATH:hsa04020]. PLCD1 PLCG1 KEGG: Cholera - Infection [PATH:hsa05110]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Calcium signaling pathway [PATH:hsa04020]. 66 KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Cholera - Infection [PATH:hsa05110]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Calcium signaling pathway [PATH:hsa04020]. PLCG2 PCTK3,PTPRE,PICK1,PCBP1,PFD5,PTBP1,PLEK NO PATHWAY PDPK1 KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. PI4KA,PI5PA PSA7,PSA2,PSB2,PSA5,PSA1,PSA3 KEGG: Proteasome [PATH:hsa03050]. PLK1 KEGG: Cell cycle [PATH:hsa04110]. PKD2,PTEN,PTN21 NO PATHWAY KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Apoptosis [PATH:hsa04210]. KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. 
KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910]. PK3CA,PK3CD,PK3CB PK3CG KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Apoptosis [PATH:hsa04210]. KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Apoptosis [PATH:hsa04210]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. P3C2B,P3C2G KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Apoptosis [PATH:hsa04210]. KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910]. P55G KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. P85A KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Apoptosis [PATH:hsa04210].
KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. KEGG: Insulin signaling pathway [PATH:hsa04910]. KEGG: Cell adhesion molecules (CAMs) [PATH:hsa04514]. KEGG: Adherens junction [PATH:hsa04520]. KEGG: Insulin signaling pathway [PATH:hsa04910]. PTPRF PTPA KEGG: Tight junction [PATH:hsa04530]. RHG01,RABE1,RB33B,RARA,RET NO PATHWAY RIPK3,RIPK4,RHOA,RASK,RGS2 NO PATHWAY KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Dorso-ventral axis formation [PATH:hsa04320]. KEGG: Insulin signaling pathway [PATH:hsa04910]. RAF1 KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. RAC1 69 KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Adherens junction [PATH:hsa04520]. KEGG: Axon guidance [PATH:hsa04360]. KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: Gap junction [PATH:hsa04540]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Dorso-ventral axis formation [PATH:hsa04320]. KEGG: Axon guidance [PATH:hsa04360]. KEGG: Insulin signaling pathway [PATH:hsa04910]. 
RASH,RRAS,RASM,RRAS2,RASN RASA1 KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Huntington's disease [PATH:hsa05040]. KEGG: Axon guidance [PATH:hsa04360]. RASA2 KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Axon guidance [PATH:hsa04360]. SHC1 KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Insulin signaling pathway [PATH:hsa04910]. SRC KEGG: Gap junction [PATH:hsa04540]. KEGG: Tight junction [PATH:hsa04530]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Adherens junction [PATH:hsa04520]. SOCS1 KEGG: Type II diabetes mellitus [PATH:hsa04930]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. KEGG: Insulin signaling pathway [PATH:hsa04910]. STK11 KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. STX1A KEGG: Parkinson's disease [PATH:hsa05020]. 70 KEGG: SNARE interactions in vesicular transport [PATH:hsa04130]. SNX1,STK6,SH3K1,SL9A2,SYPH NO PATHWAY SNIL1,S22A3,S10A2,SRBS1 NO PATHWAY SL9A1 KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. SCNAA,SCNNB,SCNND,SCNNG,SGK1,SGK2,SGK3 NO PATHWAY KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. SYNJ1,SYNJ2 KEGG: T cell receptor signaling pathway [PATH:hsa04660]. TEC KEGG: Nicotinate and nicotinamide metabolism [PATH:hsa00760]. KEGG: Benzoate degradation via CoA ligation [PATH:hsa00632]. KEGG: Inositol phosphate metabolism [PATH:hsa00562]. KEGG: Phosphatidylinositol signaling system [PATH:hsa04070]. TTK TBB4,TBB2 KEGG: Gap junction [PATH:hsa04540]. KEGG: Adipocytokine signaling pathway [PATH:hsa04920]. KEGG: Jak-STAT signaling pathway [PATH:hsa04630]. TYK2 TESK2,TBA2,TUB NO PATHWAY TLN1,TIE1 NO PATHWAY TXK KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Neuroactive ligand-receptor interaction [PATH:hsa04080]. KEGG: Calcium signaling pathway [PATH:hsa04020]. V1AR KEGG: Neuroactive ligand-receptor interaction [PATH:hsa04080]. 
V2R VPS16,VPS41,VP33A NO PATHWAY KEGG: Ubiquitin mediated proteolysis [PATH:hsa04120]. KEGG: Dentatorubropallidoluysian atrophy (DRPLA) [PATH:hsa05050]. WWP1,WWP2 WBP2 NO PATHWAY YPEL2 NO PATHWAY Q96BD6,Q9BQ83,Q15464,Q8WWN9 NO PATHWAY Q8N556,Q8NAL1,Q8N317 NO PATHWAY KEGG: B cell receptor signaling pathway [PATH:hsa04662]. KEGG: Regulation of actin cytoskeleton [PATH:hsa04810]. KEGG: MAPK signaling pathway [PATH:hsa04010]. KEGG: Toll-like receptor signaling pathway [PATH:hsa04620]. KEGG: Leukocyte transendothelial migration [PATH:hsa04670]. KEGG: Focal adhesion [PATH:hsa04510]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. KEGG: Wnt signaling pathway [PATH:hsa04310]. KEGG: Adherens junction [PATH:hsa04520]. KEGG: Axon guidance [PATH:hsa04360]. Q7RTZ3 KEGG: T cell receptor signaling pathway [PATH:hsa04660]. KEGG: Natural killer cell mediated cytotoxicity [PATH:hsa04650]. ZAP70 1433G,1433Z,1433B,1433E,1433F,1433T KEGG: Cell cycle [PATH:hsa04110]. 2A5E,1433F NO PATHWAY

Appendix B: Pathway significant proteins

ADIPOCYTOKINE SIGNALING PATHWAY
AAPK1/2 AAKG1/2/3 AAKB1/2 AKT2 FRAP GTR4 IRS1/2 JAK1 STK11 TYK2

INSULIN SIGNALING PATHWAY
AAPK1/2 AAKG1/2/3 AAKB1/2 AKT2 CBLB ELK1 FRAP GTR4 INSR INS IRS1/2 KPCI/Z MP2K1 PDPK1 PK3CA/B/D/G PRKX PTN1 PTPRF P55G P85A RAF1 RASH/M/N RRAS RRAS2 SHC1 SOCS1

B CELL RECEPTOR SIGNALING PATHWAY
AKT2 BTK KPCB KSYK LYN PTN6 PK3CA/B/D/G P3C2B/G P55G P85A PLCG2 RAC1 RASH/M/N RRAS RRAS2 Q7RTZ3

TOLL-LIKE RECEPTOR SIGNALING PATHWAY
AKT2 IKKE PK3CA/B/D/G P3C2B/G P55G P85A RAC1 Q7RTZ3

MAPK SIGNALING PATHWAY
AKT2 ELK1 GNA12 GBG12 IKKE KPCA/B/G MP2K1 PRKX RAF1 RAC1 RASH/M/N RRAS RASN RASA1/2 Q7RTZ3

CELL ADHESION MOLECULES
CNTN1 PTPRF

NICOTINATE AND NICOTINAMIDE METABOLISM
DYRK3/4 DYR1A GRK4/5/6/7 KC1G1 TTK

APOPTOSIS
AKT2 PK3CA/B/D/G P3C2B/G P55G P85A

TIGHT JUNCTION
AKT2 CSKP KPCA/B/D/G/I/Z RASH/M/N RRAS RRAS2 SRC

FOCAL ADHESION
AKT2 CAV1 IGF1R KPCA/B MP2K1 PAK6 PDPK1 PK3CA/B/D/G P55G P85A RAF1 RAC1 RASH/M/N RRAS RRAS2 SHC1 SRC Q7RTZ3

JAK-STAT SIGNALING PATHWAY
AKT2 CBLB JAK1 PTN6 PTN11 PK3CA/B/D/G P3C2B/G P55G P85A TYK2

T CELL RECEPTOR SIGNALING PATHWAY
AKT2 CBLB ITK LCK NCK1 PAK6 PTN6 PLCG1 PK3CA/B/D/G P3C2B/G P55G P85A RASH/M/N RRAS RRAS2 TEC ZAP70

HEDGEHOG SIGNALING PATHWAY
BMP2 PRKX

REGULATION OF ACTIN CYTOSKELETON
ARHG1/7 CSK GNA12 GBG12 INS MP2K1 PAK6 PI52A PK3CA/B/D/G P3C2B/G P55G P85A RAF1 RAC1 RASH/M/N RRAS RRAS2 SL9A1 Q7RTZ3

GAP JUNCTION
CXA1 KPCA/B/G MP2K1 PRKX PLCB1/2/3/4 RAF1 RASH/M/N RRAS RRAS2 SRC TBB2/4

CYTOKINE-CYTOKINE RECEPTOR INTERACTION
BMP2 CCL28

BENZOATE DEGRADATION VIA CoA LIGATION
DYRK3/4 DYR1A GRK4/5/6/7 KC1G1 TTK

CHOLERA INFECTION
KPCA PLCG1/2

INOSITOL PHOSPHATE METABOLISM
DYRK3/4 DYR1A GRK4/5/6/7 KC1G1 OCRL PI52A PLCB1/2/3/4 PLCD1 PLCG1/2 PI4KA PI5PA PK3CA/B/D/G SYNJ1/2 TTK

NEUROACTIVE LIGAND-RECEPTOR INTERACTION
V1AR V2R

PHOSPHATIDYLINOSITOL SIGNALING SYSTEM
DYR1A DYRK3/4 GRK4/5/6/7 KC1G1 KPCA/B/G OCRL PI52A PLCB1/2/3/4 PLCD1 PLCG1/2 PI4KA PI5PA PK3CA/B/D/G P3C2B/G P55G P85A SYNJ1/2 TTK

TYPE II DIABETES MELLITUS
FRAP GTR4 INSR INS IRS1/2 KPCD/Z PK3CA/B/D/G P55G P85A SOCS1

UBIQUITIN MEDIATED PROTEOLYSIS
NEDD4 WWP1/2

DORSO-VENTRAL AXIS FORMATION
ERBB4 MP2K1 RAF1 RASH/M/N RRAS RRAS2

CALCIUM SIGNALING PATHWAY
ERBB4 FAK2 KPCA/B/G PRKX PLCB1/2/3/4 PLCD1 PLCG1/2 V1AR

LEUKOCYTE TRANSENDOTHELIAL MIGRATION
FAK2 ITK KPCA/B/G PTN11 PLCG1 PK3CA/B/D P85A RAC1 TXK Q7RTZ3

ADHERENS JUNCTION
INSR IGF1R PTN1/6 PTPRF RAC1 SRC Q7RTZ3

NATURAL KILLER CELL MEDIATED CYTOTOXICITY
FAK2 KPCA KSYK LCK MP2K1 PAK6 PTN6/11 PLCG1/2 PK3CA/B/D P85A RAF1 RAC1 RASH/M/N RRAS RRAS2 SHC1 Q7RTZ3 ZAP70

CELL CYCLE
PLK1 1433B 1433F 1433E 1433G 1433Z 1433T

AXON GUIDANCE
EPHA2/3 NCK1 PAK6 RAC1 RASH/M/N RRAS RRAS2 RASA1/2 Q7RTZ3

DENTATORUBROPALLIDOLUYSIAN ATROPHY (DRPLA)
INSR INS WWP1/2

Wnt SIGNALING PATHWAY
KPCA/B/G PRKX PLCB1/2/3/4 RAC1 Q7RTZ3

Appendix C: Physical Schema / Rudimentary Data Dictionary

Uniprot Protein Table:

Attribute Name                  Type      Size  Constraints
Uniprot_ID                      VARCHAR2  20    PK, NOT NULL
Amino_Acids                     VARCHAR2  20    NOT NULL
Uniprot_Acc                     VARCHAR2  800   NOT NULL
Data_Info                       VARCHAR2  2000  NOT NULL
Protein_Desc                    VARCHAR2  4000  NOT NULL
Gene                            VARCHAR2  800   FK, NOT NULL
Organism                        VARCHAR2  100   NOT NULL
Taxonomy_ID                     VARCHAR2  100   NOT NULL
Primary_ref_id                  VARCHAR2  500   NULL
Db_ref                          CLOB      -     NULL
Keywords                        VARCHAR2  4000  NULL
Mol_wt                          VARCHAR2  50    NOT NULL
Check_value                     VARCHAR2  50    NOT NULL
Protein_seq                     CLOB      -     NOT NULL

Pfam Domain Table:

Attribute Name                  Type      Size  Constraints
Uniprot_ID                      VARCHAR2  20    PK, NOT NULL
Uniprot_acc                     VARCHAR2  800   NOT NULL
Domain_name                     VARCHAR2  100   NULL
Domain_desc                     VARCHAR2  1000  NULL
Domain_no                       VARCHAR2  100   NULL
Domain_pos                      VARCHAR2  400   NULL
Domain_ID                       VARCHAR2  100   PK, NOT NULL

String Protein Interaction Table:

Attribute Name                  Type      Size  Constraints
Ensembl_protein_id_a            VARCHAR2  40    PK, NOT NULL
Ensembl_protein_id_b            VARCHAR2  40    PK, NOT NULL
Equiv_nscore                    INTEGER   -     NULL
Equiv_nscore_transferred        INTEGER   -     NULL
Equiv_fscore                    INTEGER   -     NULL
Equiv_pscore                    INTEGER   -     NULL
Equiv_hscore                    INTEGER   -     NULL
Array_score                     INTEGER   -     NULL
Array_score_transferred         INTEGER   -     NULL
Experimental_score              INTEGER   -     NULL
Experimental_score_transferred  INTEGER   -     NULL
Database_score                  INTEGER   -     NULL
Database_score_transferred      INTEGER   -     NULL
Textmining_score                INTEGER   -     NULL
Textmining_score_transferred    INTEGER   -     NULL
Subscore_physical               VARCHAR2  10    NULL
Combined_score                  INTEGER   -     NULL

Ophid Protein Interaction Table:

Attribute Name                  Type      Size  Constraints
Uniprot_acc1                    VARCHAR2  100   FK, NOT NULL
Uniprot_acc2                    VARCHAR2  100   FK, NOT NULL
Dataset                         VARCHAR2  100   NOT NULL

Kegg Pathway Table:

Attribute Name                  Type      Size  Constraints
Gene_name                       VARCHAR2  255   FK, NOT NULL
Gene_synonyms                   VARCHAR2  255   NOT NULL
Gene_desc                       VARCHAR2  255   NULL
Pathway                         VARCHAR2  1000  NULL

Protein Identifiers Table:

Attribute Name                  Type      Size  Constraints
Uniprot_protein_id              VARCHAR2  40    FK, NOT NULL
Ensembl_protein_id              VARCHAR2  40    FK, NOT NULL

Description of Tables and Attributes

1. Uniprot Protein Table: stores manually annotated and computationally analyzed records with protein sequence and functional annotation.

Uniprot_ID: Identifies a protein sequence. It usually consists of up to 11 uppercase alphanumeric characters. The general naming convention can be symbolized as X_Y, where X is a mnemonic code of at most 5 alphanumeric characters representing the protein name, the ‘_’ sign serves as a separator, and Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species.
Amino_Acids: Total number of amino acids in the sequence.
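The X_Y naming convention given for Uniprot_ID can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_entry_name and the sample identifier INSR_HUMAN are hypothetical choices for the example, not part of the HAPPI schema or its loading scripts.

```python
import re

# Pattern follows the convention described for Uniprot_ID: X_Y, where X and Y
# are each at most 5 uppercase alphanumeric characters (11 characters in all).
ENTRY_NAME = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]{1,5})")


def split_entry_name(uniprot_id):
    """Return (protein_mnemonic, species_code), or None if the ID does not fit.

    Hypothetical helper for illustration only.
    """
    match = ENTRY_NAME.fullmatch(uniprot_id)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Sample identifier chosen for illustration.
    print(split_entry_name("INSR_HUMAN"))
```

A check of this kind could be run before loading rows into the Uniprot Protein Table to catch malformed identifiers early; whether the original loading pipeline did so is not stated.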
Uniprot_Acc: A stable way of identifying entries from release to release; includes primary and secondary accession numbers.
Data_Info: The date of creation and last modification of the database entry.
Protein_Desc: General descriptive information about the stored sequence.
Gene: The name of the gene that codes for the stored protein sequence.
Organism: The organism that was the source of the stored sequence.
Taxonomy_ID: The identifier of a specific organism in a taxonomic database.
Primary_ref_id: Includes Medline, PubMed and Digital Object Identifiers.
Db_ref: Includes pointers to information related to entries and found in data collections other than Uniprot.
Keywords: Provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories.
Mol_wt: Molecular weight of the protein, rounded to the nearest Dalton.
Check_value: The 64-bit CRC (Cyclic Redundancy Check) value of the sequence (‘CRC64’).
Sequence: Contains the standard IUPAC one-letter amino acid codes.

2. Pfam Domain Table: stores domain information of proteins.

Uniprot_ID: Identifies a protein sequence.
Uniprot_Acc: A stable way of identifying entries from release to release; includes primary and secondary accession numbers.
Domain_ID: Unique identifier of a protein domain in the Pfam database.
Domain_name: Name of the protein domain.
Domain_desc: Description of the protein domain.
Domain_occurrence: The number of occurrences of each domain in a particular protein.
Domain_pos: The start and end position of each occurrence of a domain in a protein.

3. String Protein Interaction Table: stores known and predicted protein-protein interactions with scores. The score is a combined measure from the different prediction algorithms.

Ensembl_protein_id_a: The identifier of a protein.
Ensembl_protein_id_b: The identifier of the interacting protein.
Nscore: Conserved neighborhood score of interacting proteins, i.e. genes that occur repeatedly in close neighborhood in genomes (maximum allowed intergenic distance is 300 base pairs).
Co-occurrence score: Shows the presence or absence of linked orthologous groups across species.
Gene fusion score: Shows the individual gene fusion events per species.
Dbscore: Shows that the interaction is documented in databases.
Experimental score: Shows that the interaction was obtained from an experiment.
Text mining score: Shows that the interaction is mentioned in publications.
Total score: Sum of all the above-mentioned scores of the interacting proteins.

4. Ophid Interaction Table: stores known and predicted human protein-protein interactions. It has been built from yeast, mouse, Drosophila and C. elegans HTP data.

Dataset: The source of the dataset.
UniProt protein_id_a: The unique identifier of a protein.
Uniprot protein_id_b: The unique identifier of the interacting protein.

5. Kegg Pathway Table: stores the pathway information of genes.

Gene_name: A unique identifier for a human gene.
Gene_synonyms: Other names for the human gene.
Gene_desc: Description of the gene.
Pathway: Identifier of the pathway, i.e. the Kegg pathway ID followed by a description of the pathway.

References

1. Chen J.Y., Sivachenko A.Y., Li L. Initial large-scale exploration of protein-protein interactions in human brain. Proceedings of IEEE Computational Systems Biology (CSB), Stanford, CA, 2003, 18-23.
2. Briggs S. The Emerging Field of Systems Biology and its potential role in understanding disease. Division of Biological Sciences, University of California, San Diego, Biosphere Winter, 2004/5, 7.
3. Golemis E. Toward an Understanding of Protein Interactions. In Protein-Protein Interactions – A Molecular Cloning Manual. Cold Spring Harbor Laboratory Press, 2002, 1-5.
4. Nakai K. Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 2000, 54:277-344.
5. Hanash S.
Disease proteomics. Nature 2003, 422:226-232.
6. Karin M., Ben-Neriah Y. Phosphorylation meets ubiquitination: The control of NF-κB activity. Annual Rev. Immunology 2000, 18:621-663.
7. Pawson T., Nash P. Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300:445-452.
8. Albert R., Jeong H., Barabasi A.L. Error and attack tolerance of complex networks. Nature 2000, 406:378-382.
9. Apic G., Gough J., Teichmann S.A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 2001, 310:311–325.
10. Peri S. et al. Development of Human Protein Reference Database as an Initial Platform for Approaching Systems Biology in Human. Genome Research 2003, 13:2363-2371.
11. Kitano H. Computational systems biology. Nature 2002, 420:206-210.
12. Birney E., Clamp M., Hubbard T. Databases and tools for browsing genomes. Annual Rev. Genomics Hum. Genetics 2002, 3:293-310.
13. Lacroix Z., Critchlow T. Bioinformatics: Managing Scientific Data. Morgan Kaufmann series in multimedia information and systems, 2003, 1-32.
14. Golemis E. Protein–Protein Interactions: A Molecular Cloning Manual. Cold Spring Harbor Laboratory Press, 2002.
15. Xenarios I., Eisenberg D. Protein Interaction Databases. Current Opinion in Biotechnology 2001, 12:334-339.
16. http://campus.queens.edu/faculty/jannr/bio103/helpPages/c11DNA.htm
17. Salwinski L., Eisenberg D. Computational methods of analysis of protein-protein interactions. Current Opinion Structural Biology 2003, 13:377-382.
18. Wojcik J., Schachter V. Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics 2001, 17 (suppl 1):S296-S305.
19. Wojcik J., Boneca I.G., Legrain P. Prediction, assessment and validation of protein interaction maps in bacteria. J. Mol. Biol. 2002, 323:763–770.
20. Pazos F., Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 2002, 47:219–227.
21.
Sprinzak E., Margalit H. Correlated sequence-signatures as markers of protein–protein interaction. J. Mol. Biol. 2001, 311:681–692.
22. Deng M., Mehta S., Sun F., Chen T. Inferring domain–domain interactions from protein–protein interactions. Genome Res. 2002, 12:1540–1548.
23. Lu L., Lu H., Skolnick J. Multiprospector: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002, 49:350–364.
24. Matthews L.R., Vaglio P., Reboul J., Ge H., Davis P., Garrels J., Vincent S., Vidal M. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 2001, 11:2120–2126.
25. Zhou X., Kao M.C., Wong W.H. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl. Acad. Sci. USA 2002, 99:12783–12788.
26. Pellegrini M., Marcotte E.M., Thompson J., Eisenberg D., Yeates T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 1999, 96:4285–4288.
27. Pazos F., Valencia A. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng. 2001, 14:609–614.
28. Liberles D.A., Thoren A., Heijne G.V., Elofsson A. The use of phylogenetic profiles for gene predictions. Current Genomics 2002, 3:131–137.
29. Vert J.P. A tree kernel to analyse phylogenetic profiles. Bioinformatics 2002, 18 (suppl 1):S276–S284.
30. Marcotte E.M., Pellegrini M., Ng H.L., Rice D.W., Yeates T.O., Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285:751-753.
31. Enright A.J., Iliopoulos I., Kyrpides N.C., Ouzounis C.A. Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402:86-90.
32. Clark B.F. Towards a total human protein map. Nature 1981, 292:491-492.
33. Anderson N.G., Anderson N.L. Behring Inst. Mitt. 1979, 63:169–210.
34.
von Mering C., Jensen L.J., Snel B., Hooper S.D., Krupp M., Foglierini M., Jouffre N., Huynen M.A., Bork P. STRING: known and predicted protein–protein associations integrated and transferred across organisms. Nucleic Acids Res. 2005, 33 (Database issue):D433-7.
35. Brown K.R., Jurisica I. Online Predicted Human Interaction Database. Bioinformatics 2005, 21(9):2076-2082.
36. Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005, 33:D154-159.
37. Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L., Studholme D.J., Yeats C., Eddy S.R. The Pfam Protein Families Database. Nucleic Acids Research 2004, 32 (Database issue):D138-D141.
38. Kanehisa M. et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34:D354-357.
39. NCBI http://www.ncbi.nih.gov/
40. Jain P., Kircher M., Parameswaran K. http://posa3.org/workshops/ThreeTierPatterns/
41. Bellinzona M., Quercia D., Capece J.C., Campbell K.L. MAERC/IFAS Agro ecosystem research program information system development and implementation. Technical Report, University of Florida, 2002.
42. Dublin Core Metadata Element Set, version 1.1: Reference Description, http://dublincore.org./documents/dces, 2004.
43. Ambler S.W. Agile model driven development with UML2, The Object Primer Third Edition, 2004, Chapter 6.
44. Blazer-Yost B.L., Vahle J.C., Byars J.M., Bacallao R. Real-time three dimensional imaging of lipid signal transduction: Apical membrane insertion of epithelial Na+ channels. Am. J. Physiol. Cell Physiol. 2004, 287:C1569-C1576.
45. Blazer-Yost B.L., Nofziger C. The role of the phosphoinositide pathway in hormonal regulation of the epithelial sodium channel. Adv. Exp. Med. Biol. 2004, 559:359-368.
46.
Nikitin A., Egorov S., Daraselia N., Mazo I. Pathway studio – the analysis and navigation of molecular networks. Bioinformatics 2003, 19(16):2155-2157.

Curriculum Vitae

SudhaRani Mamidipalli
http://informatics.iupui.edu/people/profile.php?id=230
3856 Cornwallis Lane, Carmel IN 46032
Phone (317) 873 4746

Objective

Seeking a full-time position in Bioinformatics, Computational Biology, Information Management, Database Management, Protein Interactomics or related areas.

Education

M.S. Bioinformatics, Indiana University, Indianapolis, GPA 3.9 / 4.0, May 2006
  o Thesis titled “HAPPI: A Bioinformatics Database Platform Enabling Network Biology Studies”, advised by Dr. Jake Yue Chen
Post Graduate Diploma in Computer Applications, Hyderabad, India
B.S. Agricultural Science, Acharya N G Ranga Agricultural University, India
Sun Certified Java2 Programmer, Sun Microsystems, USA
Certificate in Web Markup & style coding and Access End-User, Indiana University

Skills

Scripting Languages: Perl, PHP, and JavaScript
Markup Languages: XML, HTML, and XHTML
Languages: Java, SQL, PSQL, C, C++, COBOL, JCL, and CICS
Platforms: UNIX (Solaris), Linux (Red Hat, Suse), Windows, MVS, DOS
Databases: Oracle, MySQL, DB2, PostgreSQL, MS Access, and SQL Server
Software Tools: SSH, Exceed, Toad, Erwin, Endnote, Komodo, AquaData Studio
Scientific Tools: Blast, FASTA, Spotfire, Affymetrix Suite, CyberLab, Phase, Lab Track, Vector NTI, and TotalChrom
Proteomics Tools: mzXML, ReAdW, mzXML2Other, PeptideProphet
Bioinformatics Databases: String, Ophid, UniprotKB/Swiss-prot, UniprotKB/Trembl, Pfam, Kegg, Ensembl, IPI, Entrez, Refseq

Projects, Research Papers and Presentations

  Gene Expression Data Management and Analysis
  NCBI – Website Analysis
  Medical Databases - Electronic Medical Records
  Intelligent Electronic Laboratory Notebooks
  Querying Multiple Bioinformatics Information Sources - Can Semantic Web Research Help?
  Genetic Algorithms for Protein Folding Simulations
  Surrostat Biomarker Analysis System
  Information Representation, Retrieval and Visual Presentation in Bioinformatics

Professional Experience

* School of Informatics & School of Science, IUPUI, August 2004 – May 2006
  Research Assistant, Discovery Informatics and Computing Group
  Researched protein interactomics: mining functional links between proteins.
  Developed OligoMatcher, an application for oligo sequence analysis, and SafMap, a sequence annotated feature mapping tool.
  Installed and configured the Oracle client and Toad for database development.
  Downloaded (ftp), parsed (Perl regular expressions) and loaded data into tables (sqlloader).
  Defined metadata (Dublin Core standard) for data management and reduction of duplication.
  Developed conceptual (Entity-Relationship diagrams), logical and physical data models by identifying entities, attributes and relationships, and assigning keys.
  Normalized data for data integrity, optimized queries, and faster index creation and sorting.
  Used data management tools (Spotfire) to visually and statistically interrogate the data.
  Administered Lacie Storage Server (size: 2TB, users: 17).
  Trained new users of the core informatics resources.

* Dow AgroSciences, Indianapolis, May 2005 – August 2005
  R&D Intern, Discovery Research Information Management
  Installed, configured and administered the EMBOSS suite of molecular biology programs, wrappers for EMBOSS, and wEMBOSS on the DAS Bioinformatics intranet, which enables project, data and results management.
  Integrated public biological databases into the DAS Bioinformatics environment.
  Established a pipeline for tandem mass spectrometry (MS/MS) data analysis, validation and quantification for proteomics.
  Solved thorny problems by communicating with DAS users and external experts.
  Evaluated the DAS Bioinformatics website and proposed usability enhancements.
  Promoted work to molecular biologists through seminars, conferences and meetings.
  Influenced Trait Genetics and Technologies scientists in using bioinformatics tools.

* School of Medicine, Indiana University, January 2005 – May 2005
  Independent Study, Pharmacogenetic Bioinformatics
  Application of bioinformatics in SNP discovery of the INDO gene.

* University Information Technology Services, May 2004 – August 2004
  Web Analyst, Indiana University
  Collected user and technical requirements from professors and administrators.
  Designed and developed the Music Theory Placement Exam for the School of Music, IUPUI.

* CGI (formerly IMRGlobal), Bangalore, India, October 1997 – May 2000
  Software Engineer
  Involved in analysis, design, development, and maintenance of mainframe applications.
  Conducted weekly project meetings to resolve issues and problems.
  Developed a test strategy that included baseline, unit, parallel and integration testing of jobs.
  Interacted with the onsite team daily via conference calls.
  Planned and conducted peer reviews for all deliverables.
  Maintained a detailed record of decisions taken.
  Clients included Michelin Tires, Fleming Foods, and John Hancock Insurance.

* Compcore Info tech (India) Ltd., Hyderabad, India, January 1996 – September 1997
  Y2K Project Trainee
  Downloaded, checked and unzipped the inventory by sub-system and language.
  Analyzed the programs using the Transform-2000 tool.
  Applied macros to convert the tool outputs to the required format.
  Prepared weekly time-sheets to track the hours spent on the project.
  Adhered to quality control standards to ensure the correctness and quality of work.

Publication

Mamidipalli SudhaRani, Mathew Palakal, Shuyu Dan Li. OligoMatcher: analysis and selection of specific oligonucleotide sequences for gene silencing by antisense or siRNA. Applied Bioinformatics Journal. In press.

Book Chapter

Mamidipalli SudhaRani, Jake Yue Chen. “Network Biology Informatics: Enabling human protein interactomics studies” in ‘Current Topics in Human Genetics: Studies of Complex Diseases’. In press.
Abstract

Arafayene, M., Mamidipalli, S., Philips, S., Cao, D., Flockhart, D., Wilkes, D., Skaar, T. Identification of Functional Genetic Variants of the Indoleamine 2,3-Dioxygenase Gene. American Association for Cancer Research Annual Meeting 2006.

Academic Honors

Featured as a top graduate from the School of Informatics at the IUPUI Commencement ceremony, 2006.

Professional Activities

Grand Awards Judge, Intel International Science and Engineering Fair (ISEF 2006) - Medicine and Health Sciences Category
Paper Reviewer, ACM Symposium on Applied Computing (SAC 2006) - Bioinformatics Track

Affiliations

Women and Hi Tech
Informatics Women’s Organization
IUPUI Computer Science Club