HAPPI: A BIOINFORMATICS DATABASE PLATFORM
ENABLING NETWORK BIOLOGY STUDIES
SudhaRani Mamidipalli
Submitted to the faculty of the Bioinformatics Graduate Program
in partial fulfillment of the requirements
for the degree
Master of Science in Bioinformatics
in the School of Informatics
Indiana University
May 2006
Accepted by the Graduate Faculty of Indiana University, in partial
Fulfillment of the requirements for the degree of Master of Science
_________________________________
Dr. Jake Yue Chen, Ph.D., Chairperson
__________________________________
Dr. Snehasis Mukhopadhyay, Ph.D.
Master’s Thesis
Committee
__________________________________
Dr. Bonnie Blazer-Yost, Ph.D.
Dedicated to
Amma, Daddy,
Sudhakar & Laasya
Acknowledgements
There are many people whose time and efforts have been instrumental in making this
thesis a success.
First and foremost, to my best friend and husband Sudhakar, a source of wisdom,
encouragement, strength, and especially entertainment … thank you for supporting me
always and without question in whatever intellectual pursuit I attempt. Special thanks to
my daughter, Laasya for being patient and for reminding me what truly matters… and for
being my inspiration through this effort.
It is difficult to overstate my gratitude to my research advisor, Dr. Jake Yue Chen.
With his enthusiasm, his inspiration, and his great efforts to explain things clearly and
simply, he helped to make bioinformatics fun for me. Throughout my research work, he
provided encouragement, sound advice, and good teaching that helped me reach greater
heights and realize my full potential.
I thank Dr. Snehasis Mukhopadhyay for being a supportive, strong guiding
force as an academic advisor and member of my committee. I would also like to thank
Dr. Bonnie Blazer-Yost for all the support and guidance throughout the whole process. It
was absolutely invaluable.
My heartfelt thanks are due to Stephanie Burks and Teresa Hunter of Research and
Technical Services and Kimberly Melluck of the School of Informatics for their timely help
with computing resources.
It gives me great pleasure to acknowledge the members of my research group,
Zhong, Usha, Bhanu, Pranav and Laavanya, for providing insight and encouragement
throughout my research work.
Finally, I would like to thank my parents for their tremendous love, endless
support, and many prayers. They have always believed in me and always been there for
me.
ABSTRACT
SudhaRani Mamidipalli
HAPPI: A BIOINFORMATICS DATABASE PLATFORM
ENABLING NETWORK BIOLOGY STUDIES
The publication of the draft human genome consisting of 30,000 genes is merely the
beginning of genome biology. A new way to understand the complexity and richness of
molecular and cellular function of proteins in biological processes is through
understanding of biological networks. These networks include protein-protein interaction
networks, gene regulatory networks, and metabolic networks. In this thesis, we focus on
human protein-protein interaction networks using informatics techniques.
First, we performed a thorough literature survey to document different experimental
methods to detect and collect protein interactions, current public databases that store
these interactions, computational software to predict, validate and interpret protein
networks. Then, we developed the Human Annotated Protein-Protein Interaction
(HAPPI) database to manage a wealth of integrated information related to protein
functions, protein-protein functional links, and protein-protein interactions.
Approximately 12900 proteins from Swissprot, 57900 proteins from Trembl, 52186
protein-domains from Swisspfam, 4084 gene-pathways from KEGG, 2403190
interactions from STRING and 51207 interactions from OPHID public databases were
integrated into a single relational database platform using Oracle 10g on an IU
Supercomputing grid. We further assigned a confidence score to each protein interaction
pair to help assess the quality and reliability of protein-protein interaction. We hosted the
database on the Discovery Informatics and Computing web site, which is now publicly
accessible.
The HAPPI database differs from other protein interaction databases in the following
aspects: 1) It focuses on human protein interactions and contains approximately 860000
high-confidence protein interaction records—one of the most complete and reliable
sources of human protein interaction today; 2) It includes thorough protein domain, gene
and pathway information of interacting proteins, therefore providing a whole view of
protein functional information; 3) It contains a consistent ranking score that can be used
to gauge the confidence of protein interactions.
To show the benefits of HAPPI database, we performed a case study using Insulin
Signaling pathway in collaboration with a biology team on campus. We began by taking
two sets of proteins that were previously well studied as separate processes, set A and set
B. We queried these proteins against the HAPPI database, and derived high-confidence
protein interaction data sets annotated with known KEGG pathways. We then organized
these protein interactions on a network diagram. The end result shows many novel hub
proteins that connect set A or B proteins. Some hub proteins are even novel members
outside of any annotated pathway, making them interesting targets to validate for
subsequent biological studies.
TABLE OF CONTENTS
List of Tables
List of Figures
I. Introduction
1.1. Introduction to Protein Interactions
1.2. Contributions of the thesis
II. Background
2.1. Literature Review
2.1.1. The Life Science Discovery Process
2.1.2. Methods to detect Protein Interactions
2.1.3. Databases to document Protein Interactions
2.1.4. Computational methods and Protein Interactions
2.2. Problem Statement
2.3. Research Question
2.4. Hypothesis
III. Materials
3.1. Bioinformatics Databases used
3.1.1. Protein Interaction Databases
3.1.2. Protein Annotation Databases
3.1.3. Protein-Domain Databases
3.1.4. Pathway Databases
3.1.5. Bibliographic Databases
3.2. Software Languages used
3.2.1. Perl
3.2.2. PHP
3.2.3. HTML
3.2.4. SQL
3.2.5. PSQL
3.3. Relational Databases used
3.3.1. Oracle
3.3.2. PostgreSQL
3.4. Software Tools used
3.4.1. Komodo
3.4.2. SQL Loader
3.4.3. SSH / SFTP
3.4.4. Ultra edit
3.4.5. Erwin
3.4.6. Toad
3.4.7. Aqua Data Studio
3.4.8. SQL Tools
3.4.9. Endnote
3.5. University Computing Resources used
3.5.1. BioX
3.5.2. Zen
3.5.3. Research Database Complex (RDC)
IV. Procedures and Interventions
4.1. Method Roadmap for HAPPI database
4.2. Architecture
4.3. Data Integration
4.3.1. Data Warehouse Approach
4.3.2. Data Acquisition
4.3.3. Data Reduction
4.3.4. Feature Selection
4.3.5. Meta-data specification
4.3.6. Database Design
4.3.7. Data Storage/Loading
4.4. Query Processing
4.5. User Interface
4.6. Unified Scoring Model
V. Results and Discussion
VI. Case Study
VII. Conclusion
7.1. Directions for future work
VIII. Appendices
Appendix A
Appendix B
Appendix C
References
Curriculum vitae
List of Tables
Table 2.1. Databases of protein interactions
Table 2.2. Software tools to predict interactions between proteins
Table 2.3. Comparison of interaction features across human databases
Table 4.1. An overview of Data Acquisition from different data sources
Table 4.2. An Overview of Data Reduction of Protein Integrated Data
Table 4.3. Line types and codes of a sequence entry in Uniprot database
Table 4.4. Summary of feature selection from Protein Integrated databases
Table 4.5. An overview of tables loaded from Protein Integrated databases
Table 4.6. Analysis of String database scores
Table 4.7. Analysis of Ophid database scores
Table 5.1. Comparison of SGK1 interacting proteins across P.I. databases
Table 5.2. Comparison of database features across P.I. databases
List of Figures
Figure 2.1. Information-driven discovery process
Figure 2.2. Different levels of observation of protein interactions
Figure 3.1. Snapshot of String Protein Interaction database
Figure 3.2. Snapshot of Ophid Protein Interaction database
Figure 3.3. Snapshot of SwissProt manually annotated Protein Sequence database
Figure 3.4. Snapshot of Pfam Protein Domain database
Figure 3.5. Kegg Pathway database showing Insulin Signaling Pathway
Figure 3.6. Snapshot of PubMed literature database
Figure 4.1. Method Roadmap for HAPPI database
Figure 4.2. Hardware Architecture of HAPPI database
Figure 4.3. Three-Tier Software Architecture: Structure and Technologies
Figure 4.4. Integration of Protein Annotation, Interaction, Domain, Sequence and Pathway Data
Figure 4.5. Initial Data Model of HAPPI database
Figure 4.6. User Interface Flow Diagram of HAPPI database
Figure 4.7. String database score distributions
Figure 4.8. Flow Chart for Score Consolidation
Figure 5.1. The Query page of HAPPI database
Figure 5.2. The Interaction Results page of HAPPI database
Figure 5.3. The Interaction Annotation page of HAPPI database
Figure 5.4. The Protein Annotation Page of HAPPI database
Figure 6.1. Method Roadmap for case study
Figure 6.2. Insulin Signaling Pathway
Figure 6.3. Visualization of Insulin Pathway Interaction Network using Pathway studio
Figure 6.4. Visualization of Insulin Pathway Protein Interaction Network
I. Introduction
1.1. Introduction to Protein Interactions
The study of comprehensive collections of all the proteins and the molecular interactions
among them, such as physical binding or regulatory modification, within cells of an
organism is known as Protein Interactomics [1]. Its biological significance ultimately lies
in the validation of disease bio-markers in molecular network context and in the
identification of better drug targets.
The PROTEin complement expressed by a genOME is called Proteome. According to Dr.
Steven Briggs [2], the concept of proteome is fundamentally different from that of a
genome: “while the genome is virtually static and can be well defined for an organism,
the proteome continually changes in response to external and internal events”. Proteomics
is divided into two main categories:
1. Expression Proteomics – the study of all gene products present in a tissue, a cell, an
organelle along with their modifications.
2. Functional Proteomics – examining the changes that arise in response to a change in
the biological system of interest, thus studying the functions of proteins within
complex networks. This includes the study of protein-protein interactions (Protein
Interactomics).
The study of protein interactions plays a vital role in understanding how
proteins function within the cell. Publication of the draft human genome and
proteomics-based protein profiling studies brought a new era in protein interaction analysis.
Understanding the characteristics of protein interactions in a given cellular proteome will
be the next milestone in advancing cellular biochemistry [3].
A comprehensive collection and integration of information on human proteins,
their features, and functions would be invaluable to biologists in several ways. For
instance, the types of domains found in a protein generally predict its functional class or
biological role. The exact subcellular localization of proteins and their
distribution within body tissues are also important to protein function [4]. Ultimately, it is
vital to know any possible association of proteins with human diseases, as this
dictates their involvement in certain pathways [5].
About 30,000 genes of the human genome are expected to give rise to 1,000,000 proteins
through a series of post-translational modifications and gene splicing mechanisms [3].
Post-translational modifications such as phosphorylation and ubiquitination can strongly
influence the activity of proteins and are generally used as regulatory mechanisms in
signal transduction pathways [6]. Although a few of these proteins can be expected to
work in relative isolation, the majority are expected to interact with other proteins
in complexes and networks to combine the innumerable elements of processes that
impact cellular structure and function. Because proteins act together with other proteins,
knowing the identity and characteristic features of interacting proteins, along with the
relevant binding sites, can boost hypothesis-driven studies and the explanation of regulatory
networks [7]. Protein functions can also be inferred from protein-protein interactions: if two
proteins interact with each other and the function of one protein is known, then some
relevant information about the function of the other interacting protein can be obtained [8].
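The idea of transferring function between interacting partners can be illustrated with a small Python sketch. It is only a toy example under stated assumptions: the protein identifiers, the function labels and the majority-vote rule are hypothetical and are not taken from the thesis or from any particular database.

```python
from collections import Counter, defaultdict

def infer_functions(interactions, known_functions):
    """Suggest a function for each unannotated protein from its annotated partners.

    interactions: iterable of (protein_a, protein_b) pairs (hypothetical IDs).
    known_functions: dict mapping a protein to its known function label.
    Returns a dict mapping each unannotated protein to the most common
    function among its annotated interaction partners.
    """
    # Build an undirected neighbour map from the interaction pairs.
    neighbours = defaultdict(set)
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)

    suggestions = {}
    for protein, partners in neighbours.items():
        if protein in known_functions:
            continue  # already annotated, nothing to infer
        votes = Counter(known_functions[p] for p in partners if p in known_functions)
        if votes:
            # Majority vote over annotated partners (an illustrative rule only).
            suggestions[protein] = votes.most_common(1)[0][0]
    return suggestions

# Hypothetical toy data.
interactions = [("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P5", "P6")]
known_functions = {"P1": "kinase signalling", "P3": "kinase signalling", "P4": "transport"}
print(infer_functions(interactions, known_functions))
```

Here P2 receives a suggestion because most of its annotated partners share one label, while P5 and P6 receive none because neither has an annotated partner.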
When two proteins are said to interact with each other, the domains that constitute the
proteins also interact physically with one another to perform the required functions.
Hence, understanding an interaction between two proteins at the domain level is
essential to getting a whole view of the protein interaction network, and thereby of protein
functions. Domains are considered structural subunits or building blocks of proteins
that are conserved during evolution. A protein consists of one or more domains, and many
proteins are built from several domains. Single-domain proteins
account for around 34% of proteins in prokaryotes and 20% in eukaryotes [9].
Generally, each domain has its own function within the protein, such as spanning the
plasma membrane, binding DNA, or providing a surface that binds specifically to another
protein [4]. Homologous domains, that is, domains related by descent, can
be grouped together to form a superfamily. Motifs are smaller elements of a protein that
are important in protein-protein interactions and protein function. Examples include
the coiled-coil and the nuclear localization signal [10].
Network prediction is a ‘systems’ problem that requires a ‘systems’ approach. Systems
biology is an emerging discipline that utilizes experimental techniques and bioinformatics
to help understand biological systems on a global scale [11]. One of the key problems to
solve under systems biology is to integrate gene and protein data effectively with other
information such as PubMed literature [12]. The other problems may include limited
experimental data, network complexity and infinite network solutions.
1.2. Contributions of the Thesis
This thesis makes two major contributions to the field of human protein-protein
interactomics:
1. A useful resource that enables human protein interactome researchers to carry out
large-scale protein interaction network studies.
a. A comprehensive, highly reliable human protein interaction database of 86,000
interactions
b. A consistent ranking score to assess the confidence of protein-protein
interactions
2. A database framework that gives bench biologists an integrated view of annotated
protein interactions in a web-based environment.
a. Provides a whole view of the functional information of interacting proteins.
II. Background
2.1. Literature Review
Public databases of protein–protein interactions provide access to both
experimental and predicted data. A literature survey was carried out on the life science
discovery process, on life science data sources (in particular those for protein interactions),
and on the associated computational methods and tools.
2.1.1. The Life Science Discovery Process
Bioinformatics has been playing a major role in managing scientific data and thereby
supporting life science discovery. Bioinformatics can be defined as the application of
information technology to life sciences for a better understanding of life. It includes a
broad range of functions such as data acquisition, data reduction, data analysis, data
management, data integration, storage, statistics, and visualization. The computational
approaches need information integration from a variety of data sources [13]. For
example, protein interactomics covers interacting-protein information, but it must be
combined with protein annotation databases, pathway databases and domain databases to
give a clear picture of how proteins function within a cell. Thus, integration bioinformatics
is becoming an emerging frontier for life sciences research.
Figure 2.1. Information-driven discovery process [13] (diagram labels: Genomics, Gene Expression, Proteomics, Systems Biology, Bioinformatics Databases, Networks, Pathways, Computational Tools)
The above figure (Fig.2.1.) illustrates the importance of integrating biological data
obtained from both experimental and computational methods for novel discovery. Data
integration is an active research topic in the life sciences. According to their access methods
and architectures, data integration solutions can be roughly divided into three major
categories: the data warehousing approach, the distributed or federated approach, and the
mediator approach.
2.1.2. Methods to detect Protein Interactions
Until recently, experimentally determined protein–protein interactions were gathered to
analyze potential functions of proteins [14]. These experimental methods detect
protein-protein interactions at different levels of resolution [15]. The figure (Fig.2.2.) shows
the different levels of observation of protein interactions. It was adapted from the central
dogma of molecular biology figure of the BIOL 103: Principles of Biology course of Queens
University of Charlotte [16]. The first level is ‘atomic observation’, in which the protein
interaction is detected at the atomic level. An example is X-ray crystallography.
The second is ‘direct interaction observation’, in which the protein interaction is observed at the
protein level. An example is the two-hybrid experiment. The third is ‘complex observation’,
in which interactions are detected at the complex level. A complex may comprise more than
one protein, and it is not known which proteins in the first complex interact with which
proteins in the second complex. Examples include immunoprecipitation and mass-spectral analysis.
The fourth category uses activity bioassays to detect interactions at the cellular level.
An example is the proliferation assay of cells stimulated by a receptor-ligand interaction.
In terms of the availability of experimental data in the literature, the complex interaction
level is the most commonly represented, followed by the cellular interaction, the direct
interaction, and finally the atomic observation level [15].
Figure 2.2. Different levels of observation of protein interactions [16] (diagram labels: Atomic level, Direct level, Complex level, Cellular level)
2.1.3. Databases to document Protein Interactions
This section reviews the different types of biological databases dedicated to
protein interactions. Features such as availability, content, statistics and types of
interactions documented in interaction databases were surveyed. The table (Tab.2.1.)
compares different features of protein interaction databases.
Database | Acronym | URL | Content | Interaction Statistics | Availability | Type of Interactions / Networks
Database of Interacting Proteins | DIP | http://dip.doe-mbi.ucla.edu/ | Catalog of protein-protein interactions | 55732 interactions among 19051 proteins in 110 species | Free for both Academic and Commercial users | Experimental; Physical network
Biomolecular Interaction Network Database | BIND | http://bind.ca | Biomolecular interaction complexes and pathways | 201896 interactions in 1528 organisms | Free for both Academic and Commercial users | Experimental; Physical network
MIPS mammalian Protein-Protein Interaction database | MPPI | http://mips.gsf.de/proj/ppi/ | A new resource of high-quality human protein interaction data | 1800 interactions among 900 proteins from 10 mammalian species | Free for both Academic and Commercial users | Experimental; Physical and Genetic networks
Search Tool for the Retrieval of Interacting Genes or Proteins | STRING | http://string.embl.de/ | A database of predicted functional associations among genes or proteins | 23256408 interactions among 736429 proteins in 179 species | Free for Academic but not for Commercial users | Experimental, Predicted
Human Protein Reference Database | HPRD | http://hprd.org | Comprehensive collection of protein features, post-translational modifications and protein-protein interactions | 33710 interactions among 20097 proteins in human organism | Free for Academic but not for Commercial users | Experimental; Physical network
Human Protein Interaction Database | HPID | http://hpid.org | Provides human protein interaction information and predicts potential interactions between proteins submitted by users | 8565 interactions among 1690 proteins in human organism | Free for both Academic and Commercial users | Experimental, Structural and Predicted
Table 2.1. Databases of protein interactions
2.1.4. Computational methods and Protein Interactions
Recently computational methods have been playing an important role in determining
interactions between proteins. They are used to predict the interactions, to validate the
results and to analyze the protein networks developed from interaction databases.
Several computational approaches are available today for predicting interactions between
proteins. Protein-protein interactions can be inferred on the basis of functional
relationships between proteins such as patterns of domain fusion and protein occurrence,
sequence and structural analysis, correlation of functional genomic features, the
conserved interactions in other organisms and literature. Recently developed methods for
the inference of protein–protein interactions can be broadly classified into physical and
functional linkages [17].
The methods under physical linkage type are as follows:
1. Interspecies interaction transfer based on the interacting sequence motif pairs
identified in yeast two-hybrid screens. [18,19]
2. Interactions inferred from correlated mutations. [20]
3. Co-occurrence of sequence domains. [21, 22]
4. Structure assignment followed by threading-based interaction energy evaluation.
[23]
5. Ortholog-based transfer of interactions between species followed by experimental
validation. [24]
The methods under functional linkage type are as follows:
1. Network topology based functional annotation. [25]
2. Phylogenetic profile method. [26]
3. Phylogenetic profile enhancements. [27,28,29]
4. Rosetta stone or gene fusion method.[30,31]
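To make one of the functional linkage methods listed above concrete, the following Python sketch illustrates the general idea behind the phylogenetic profile method: proteins whose presence/absence patterns across a set of genomes match are predicted to be functionally linked. The profiles, the protein names and the exact-match threshold are hypothetical, and published implementations use more refined similarity measures than this simple agreement fraction.

```python
from itertools import combinations

def profile_similarity(p, q):
    """Fraction of genomes in which two presence/absence profiles agree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(1 for x, y in zip(p, q) if x == y)
    return matches / len(p)

def predict_links(profiles, threshold=1.0):
    """Return protein pairs whose profile similarity reaches the threshold.

    profiles: dict mapping a protein name to a tuple of 0/1 values, one per
    genome (1 = a homolog is present, 0 = absent). Hypothetical data layout.
    """
    links = []
    for a, b in combinations(sorted(profiles), 2):
        if profile_similarity(profiles[a], profiles[b]) >= threshold:
            links.append((a, b))
    return links

# Hypothetical profiles over five genomes.
profiles = {
    "protA": (1, 0, 1, 1, 0),
    "protB": (1, 0, 1, 1, 0),
    "protC": (0, 1, 1, 0, 1),
}
print(predict_links(profiles))
```

With the default threshold only identical profiles are linked, so protA and protB are paired while protC is not.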
Web-based tools and Protein Interactions
The following table (Tab.2.2.) lists the current web-based tools to predict, validate, explore,
analyze and visualize protein-protein interaction networks.
Web tool | Acronym | URL | Features
Automated Detection and Validation of Interaction by Co-Evolution | ADVICE | http://advice.i2r.a-star.edu.sg/ | Prediction and validation of protein-protein interactions
BioLayout Java | - | http://cgg.ebi.ac.uk/services/biolayout/ | An automatic graph layout algorithm for similarity and network visualization
electrostatic surface of Functional-site | eF-site | http://ef-site.hgc.jp/eF-site/ | A database for molecular surfaces of protein functional sites, displaying the electrostatic potentials and hydrophobic properties together on the Connolly surfaces of the active sites
Expression Profiler | EP:PPI | http://ep.ebi.ac.uk/EP/PPI/ | Explores protein interaction data using expression data
IntAct Project | IntAct | http://www.ebi.ac.uk/intact/index.html | Database and toolkit for the storage, presentation and analysis of protein interactions
InterWeaver | - | http://interweaver.i2r.a-star.edu.sg/ | A web server of interaction reports
Interaction Prediction through Tertiary Structure | InterPreTS | http://speedy.embl-heidelberg.de/people/patrick/interprets/index.html | Predicts the potential interaction of two proteins from three-dimensional information of protein complexes
Medusa | - | http://www.bork.embl-heidelberg.de/medusa/ | An interface to the STRING protein interaction database and a general graph visualization tool
Surface Properties of Interfaces – Protein Protein interfaces | SPIN-PP Server | http://trantor.bioc.columbia.edu/cgi-bin/SPIN/ | A database of all protein-protein interfaces for protein-protein interactions in the Protein Data Bank
Protein-Protein Interaction Server | - | http://www.biochem.ucl.ac.uk/bsm/PP/server/server_help.html | A means of calculating a series of descriptive parameters for the interface between any two proteins in a three dimensional protein structure
Protein Interactions VisualizatiOn Tool | PIVOT | http://www.cs.tau.ac.il/~rshamir/pivot/ | A Java based tool for visualizing protein-protein interactions
Protein Interaction Map Walker | PIMWalker | http://pim.hybrigenics.com/pimriderext/pimwalker/ | An interactive tool for displaying protein interaction networks
PathBLAST | - | http://chianti.ucsd.edu/pathblast/ | Network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution
Interface for Sequence Prediction Of Target | iSPOT | http://cbm.bio.uniroma2.it/ispot/ | Prediction of protein-protein interactions mediated by families of peptide recognition modules
WebInterViewer | - | http://165.246.44.45/hpid/webforms/Visualization.aspx | Visualizes and analyzes large-scale protein interaction networks in the 3D space
infer Protein Protein Interactions | iPPI | http://www.bioinfo.cu/iPPI/ | Infers protein-protein interactions through homology search
interface for Pfam | iPfam | http://www.sanger.ac.uk/Software/Pfam/iPfam/ | Visualization of protein-protein interactions at domain and amino acid resolutions
Virtual Ligand Screening | - | http://www.molsoft.com/vls.html | A computer technique that simulates the interaction between proteins and small molecules that might be good leads to potential new drugs. Example: HIV-1 protease
Table 2.2. Software tools to predict interactions between proteins
2.2. Problem Statement
Websites with human protein interactions were extensively studied (Tab.2.3.) and
compared with respect to features such as ranking, protein domain, pathway, gene,
annotation, sequence, co-citation and database cross-references. A link alone is not
counted as a feature, except for database cross-references. Features are marked as
complete (+) or incomplete (-).
Feature | DIP | HPRD | STRING | INTACT | OPHID
Interacting proteins list | + | + | + | + | +
Domain Information | - | + | - | - | -
Pathway Information | - | - | - | - | -
Database Crossref. information | + | + | - | - | +
Gene information | + | + | - | + | -
Protein sequence | + | + | + | + | -
Protein annotation Information | - | + | - | - | +
Ranking feature for Protein interactions | - | - | + | - | -
Comments | Only links for source info. | Domain and sequence info. only for input protein but not for all interacting proteins | Emphasis on protein interaction determination from different perspectives | - |
Table 2.3. Comparison of interaction features across human databases
The above comparison shows that none of the existing human interaction databases
provides a complete overview of interacting proteins. Surprisingly, only the STRING database
seems to provide a ranking for interacting protein pairs. HPRD seems to have more
information than the other databases, but that information is limited to the input
protein and does not cover all interacting proteins. Hence this study shows that there is a need
for a complete and reliable resource for protein interactions in the human protein
interactomics field.
2.3. Research Question
How best can we integrate, organize and represent the data of human protein-protein
interactions, domains, pathways, co-citation and annotation information computationally
to extract new biological knowledge? (E.g. validating disease bio-markers in molecular
network context and identifying better drug targets).
2.4. Hypothesis
The literature survey on the life science discovery process, the information integration
environment for life science discovery, and the databases and methods in the life sciences, in
particular protein interactomics and data integration, laid a good foundation for defining the
scope and goals of the research work on human protein interactomics.
Protein interaction networks act as the backbone of current functional genomic research.
These networks lay the foundation for systems biology analysis of the cell. To extract
novel information from protein interaction data, computational development of two basic
types is necessary:
(1) Database infrastructure enabling efficient storage and retrieval, and
database design to accommodate different databases to communicate
with each other, and to allow researchers to access the information in
these databases [32] and
(2) Software tools to identify relationships between data and to generate
hypotheses that can be tested experimentally [33].
A robust data integration system should in turn have the following fundamental
features [13]:
1. Accessing and retrieving relevant data from different databases
2. Transforming the retrieved data into designated data model for integration
3. Providing a rich common data model for exploring retrieved data and presenting
integrated data objects to the end user applications
4. Providing a high-level expressive language to create complex queries across
multiple databases and to facilitate data manipulation, transformation and
integration tasks
5. Managing query optimization and other complex issues
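The features above can be illustrated with a minimal sketch that loads records from two sources into one common relational model and answers a query across them. It uses Python's built-in sqlite3 module purely as a stand-in for a full relational database, and every table name, column name, source label and value is a hypothetical example rather than the schema described in this thesis.

```python
import sqlite3

# In-memory database standing in for an integrated warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interaction (protein_a TEXT, protein_b TEXT, source TEXT, score REAL);
CREATE TABLE pathway (protein TEXT, pathway_name TEXT);
""")

# Records already transformed from two hypothetical sources into the common model.
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?, ?)",
    [
        ("P1", "P2", "source_x", 0.9),
        ("P1", "P3", "source_y", 0.4),
        ("P1", "P4", "source_x", 0.7),
    ],
)
conn.executemany("INSERT INTO pathway VALUES (?, ?)", [("P2", "pathway_1")])

def annotated_partners(conn, protein, min_score):
    """Return partners of a protein above a score cutoff, with pathway annotation."""
    query = """
        SELECT i.protein_b, i.source, i.score, p.pathway_name
        FROM interaction AS i
        LEFT JOIN pathway AS p ON p.protein = i.protein_b
        WHERE i.protein_a = ? AND i.score >= ?
        ORDER BY i.score DESC
    """
    return conn.execute(query, (protein, min_score)).fetchall()

for row in annotated_partners(conn, "P1", 0.5):
    print(row)
```

The LEFT JOIN keeps interaction partners that have no pathway annotation, so a missing annotation shows up as an empty value instead of silently dropping the interaction.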
III. Materials
3.1. Bioinformatics Databases used
The following bioinformatics databases were used for the research work. They cover a
wide variety of biological data, ranging from proteins, domains and pathways to
protein-protein interactions.
3.1.1. Protein Interaction Databases
STRING
STRING [34] is a database of known and predicted protein-protein interactions. The
interactions (Fig.3.1.) include direct or physical and indirect or functional associations.
The associations are obtained from high-throughput experimental data, from the mining
of databases and literature, and from predictions based on genomic context analysis. It
integrates and ranks these associations by comparing them against a common reference
set, and presents evidence in a consistent web interface.
Figure 3.1. Snapshot of the STRING protein interaction database
OPHID
OPHID [35] is an on-line database of human protein-protein interactions. It explores
known and predicted protein-protein interactions, and facilitates bioinformatics initiatives
in exploring protein interaction networks (Fig.3.2.). It has been built by mapping
high-throughput (HTP) model organism (yeast, mouse, Drosophila and C. elegans) data to
human proteins. The database currently contains 47656 interactions with 10652 proteins.
Figure 3.2. Snapshot of the OPHID protein interaction database
3.1.2. Protein Annotation Databases
UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of
information on proteins. It is a central repository of protein sequence and function created
by joining the information contained in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL,
and PIR.
SWISSPROT
The Swiss-Prot Protein Knowledgebase is a curated protein sequence database that
provides a high level of annotation, a minimal level of redundancy and a high level of
integration with other databases [36]. Swiss-Prot distinguishes itself from other
protein sequence databases by three criteria:
a. Annotation:
Two classes of data, the core data and the annotation, can be distinguished in Swiss-Prot.
For each sequence entry the core data (Fig.3.3.) consists of the sequence data, the citation
information and the taxonomic data. The annotation consists of functions of the protein,
posttranslational modifications, domains and sites, secondary structure, quaternary
structure, similarities to other proteins, diseases associated with deficiencies in the
protein, sequence conflicts, variants, etc.
b. Minimal redundancy:
Here all the protein data is merged to minimize the redundancy of the database. If
conflicts exist between various sequencing reports, they are indicated in the feature table
of the corresponding entry.
c. Integration with other databases:
Swiss-Prot is currently cross-referenced to more than 50 different databases. This extensive
network of cross-references allows Swiss-Prot to play a major role in biomolecular database
interconnectivity.
Figure 3.3. Snapshot of the Swiss-Prot manually annotated protein sequence database
The current Swiss-Prot Release is version 49.3 as of 21-Mar-2006, and contains 212425
sequence entries, comprising 77942645 amino acids abstracted from 139653 references.
The database currently contains 13633 human proteins.
TREMBL
UniProtKB/TrEMBL is a computer-annotated protein sequence database complementing
the UniProtKB/Swiss-Prot Protein Knowledgebase. It contains the translations of all
coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence
Databases and also protein sequences extracted from the literature or submitted to
UniProtKB/Swiss-Prot. The database is enriched with automated classification and
annotation. The current TrEMBL Release is version 32.3 as of 21-Mar-2006, and
contains 2666963 entries comprising 857415579 amino acids.
3.1.3. Protein-Domain Databases
PFAM
Pfam is a database of multiple alignments of protein domains or conserved protein
regions.
Figure 3.4. Snapshot of the Pfam protein domain database
There are two types of Pfam families. Pfam-A families are accurate, human-crafted multiple alignments,
whereas Pfam-B is an automatic clustering of the rest of Swiss-Prot using the program
Domainer [37]. Pfam-A (high quality) families are shown as just their names. Pfam-B
(low quality, automatically clustered and aligned) families are shown as Pfam-B_xxxx.
The Pfam-A hits are hyperlinked to the domain annotation. The above figure (Fig.3.4.)
shows the start position, end position, source and score of the Pfam-A domains of protein
FAAH_HUMAN. SwissPfam is an annotated description of how Pfam domains map to
(possibly multidomain) Swiss-Prot entries. The current release has 1285025 entries and
was indexed on 28-Feb-2005.
3.1.4. Pathway Databases
KEGG [38] is a suite of databases and associated software, integrating the current knowledge on
molecular interaction networks in biological processes (PATHWAY database), the
information about the universe of genes and proteins (GENES/SSDB/KO databases), and
the information about the universe of chemical compounds and reactions
(COMPOUND/DRUG/GLYCAN/REACTION databases).
Figure 3.5. KEGG Pathway database showing the Insulin Signaling Pathway
The above figure (Fig.3.5.) visualizes the Insulin Signaling Pathway in detail. There are around 37,110 pathways,
290 reference pathways and 1,411,118 genes in the KEGG database.
3.1.5. Bibliographic Databases
PubMed
PubMed [39] is a service of the U.S. National Library of Medicine that includes over 16
million citations from MEDLINE and other life science journals for biomedical articles
back to the 1950s. PubMed (Fig.3.6.) includes links to full text articles and other related
resources.
Figure 3.6. Snapshot of the PubMed literature database
In addition, it provides a Batch Citation Matcher, which allows users to match their
citations to PubMed citations using bibliographic information such as journal, volume,
issue, page number, and year.
3.2. Software Languages used
3.2.1. Perl
Perl is a programming language well suited to text manipulation, web development and
similar tasks. ActivePerl is ActiveState's quality-assured, ready-to-install distribution of Perl,
available for the AIX, HP-UX, Linux, Mac OS X, Solaris, and
Windows operating systems. The standard ActivePerl distribution 5.6.1.638 was downloaded and installed
on a Windows machine. Perl was used for parsing the Swiss-Prot, TrEMBL, STRING, SwissPfam,
OPHID, and KEGG database files.
3.2.2. PHP
PHP (recursive acronym for PHP: Hypertext Pre-Processor) is a widely used open source
general-purpose scripting language that is especially suited for Web development and can
be embedded into HTML. PHP version 4.3.2 was used for developing the website.
3.2.3. HTML
Hypertext Markup Language (HTML), an application of Standard Generalized Markup
Language (SGML) for electronic publishing, is the specific standard used for the World
Wide Web. It was used to structure information and to create web pages with hypertext
and other content to be displayed in a web browser.
3.2.4. SQL
SQL is an ANSI (American National Standards Institute) standard computer language for
accessing and manipulating database systems. SQL statements were used to retrieve and
update data in a database. The query and update commands together form the Data
Manipulation Language (DML) part of SQL, whereas the commands that create or delete tables form the
Data Definition Language (DDL) part of SQL.
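The DDL/DML distinction can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the Oracle back end; the table name, column names and values are hypothetical and are not taken from the HAPPI schema.

```python
import sqlite3

# An in-memory SQLite database stands in for the Oracle back end (assumption).
conn = sqlite3.connect(":memory:")

# DDL: create a table (hypothetical name and columns).
conn.execute("CREATE TABLE proteins (uniprot_id TEXT PRIMARY KEY, name TEXT)")

# DML: insert, update, and query rows.
conn.execute("INSERT INTO proteins VALUES (?, ?)", ("FAAH_HUMAN", "old name"))
conn.execute(
    "UPDATE proteins SET name = ? WHERE uniprot_id = ?",
    ("new name", "FAAH_HUMAN"),
)
rows = conn.execute("SELECT uniprot_id, name FROM proteins").fetchall()

# DDL: delete the table again.
conn.execute("DROP TABLE proteins")
conn.close()
```

The CREATE and DROP statements change the structure of the database, while INSERT, UPDATE and SELECT only touch the rows inside an existing table.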
3.2.5. PSQL
psql is a terminal-based front end to PostgreSQL. Queries were typed in interactively and
issued to PostgreSQL to obtain the desired results. In a few cases, input was also
given in the form of a file. In normal operation, psql provides a prompt with the name of
the database to which psql is currently connected, followed by the string =>. psql version
8.1.3 was used to access the STRING protein interaction database.
3.3. Relational Databases used
3.3.1. Oracle
Oracle, a flexible and cost-effective database for managing enterprise information,
was chosen as the back end. An Oracle database-mediated information management
system (IMS) was developed to manage human protein interaction information effectively
and efficiently. The human protein interaction data was obtained from the STRING
and OPHID databases, protein annotation data from the UniProt
database, protein domain data from the Pfam database, and protein pathway data from the
KEGG database. BIO10G, an Oracle database available on the inquire-g server, was used for
storing all of the above data in the form of tables under the
nsudhara schema.
Oracle 10g client (10.1.0.2.0) software was downloaded and installed on a Windows
machine. It was then configured for BIO10G by providing the hostname, port number and
net service name. Oracle was found to be an excellent choice for the back end
because of the ease of creating tables and querying information using SQL.
3.3.2. PostgreSQL
PostgreSQL is a free object-relational database server (database management system),
released under a flexible BSD-style license. As the STRING human protein interaction
database is distributed for PostgreSQL, the PostgreSQL database installed and configured on the BIOX
server was used to store the STRING human protein interaction data.
3.4. Software Tools used
3.4.1. Komodo
ActiveState Komodo is a professional integrated development environment
(IDE) for dynamic languages, providing a workspace for editing, debugging,
and testing programs. Komodo 3.5.2 was downloaded and installed on the Windows
platform. Komodo's Rx Toolkit was used for creating regular expressions for parsing the
Swiss-Prot, TrEMBL, Pfam, STRING and KEGG database files. The Rx Toolkit takes a regular
expression and some sample data and reports the matches, the groups, the number of
matches, etc. Once a regular expression had been built and debugged, a Perl program was
written to open a file for input, use the regular expression to parse the information of
interest, and write that information to an output file.
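The open, match and write workflow described above can be sketched as follows. The sketch is written in Python rather than the Perl used in this work, and the regular expression, the sample ID line and the comma-separated output layout are illustrative assumptions, not the expressions actually built with the Rx Toolkit.

```python
import io
import re

# Assumed pattern for a UniProt-style ID line such as
# "ID   FAAH_HUMAN     STANDARD;      PRT;   579 AA."
ID_LINE = re.compile(r"^ID\s+(?P<entry>\S+)\s+.*?(?P<length>\d+) AA\.")


def extract_ids(infile, outfile):
    """Write entry name and sequence length of every ID line to outfile."""
    count = 0
    for line in infile:
        match = ID_LINE.match(line)
        if match:
            outfile.write(f"{match['entry']},{match['length']}\n")
            count += 1
    return count


# In-memory files keep the sketch self-contained; the values are illustrative.
sample = io.StringIO(
    "ID   FAAH_HUMAN     STANDARD;      PRT;   579 AA.\n"
    "//\n"
)
out = io.StringIO()
n = extract_ids(sample, out)
```

Named groups play the role of the groups reported by the Rx Toolkit, so the same pattern can be tested on sample data before it is run over a full database file.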
3.4.2. SQL Loader
SQL*Loader, a bulk loader utility, was used for moving data from parsed external files
into the Oracle database. There are four stages to loading data using SQL*Loader:
1. Create a data file: The data file is a text file that contains parsed data. There
is one record per line and a comma separates each attribute value.
2. Create the relation for the data: The relation must exist in the database
before SQL*Loader can load data into it.
3. Create a control file: The control file describes the structure of the data and
indicates the relation into which the data should be loaded. This file states that
each line or record in the file contains attributes corresponding to the
attributes in the table. The attribute names must be in the same order in the
control file as the attribute values are in the data file.
4. Run SQL*Loader: SQL*Loader reads the control file and loads the data. The
following command is used to execute SQL*Loader from a DOS window:
SQL*Loader creates a number of files as it loads the data. A log file is produced that
describes what happened and any errors that may have occurred.
3.4.3. SSH / SFTP
SSH/SFTP Secure Shell Client (version 3.3.2) is a secure network connection system that
establishes an encrypted connection to a remote machine.
It also provides a secure file transfer program that transfers files between a local machine
and a remote machine or server. SSH was used to connect remotely to the biox, discover and
inquire-g servers from a Windows machine. SFTP was used to transfer files
between the remote servers and the local Windows machine. Files and directories are dragged
from one view and dropped on a target directory in the other, which generates
the corresponding file transfer operations.
3.4.4. Ultra Edit
The UltraEdit-32 text editor was used for easy viewing and editing of code and variables. As a
disk-based text editor, it supports files in excess of 4 GB and uses minimal RAM
even for multi-megabyte files. Its other features include syntax highlighting,
project/workspace support, column/block mode editing, formatting, a hexadecimal editor
and multi-byte support with integrated IME support.
3.4.5. Erwin
AllFusion ERwin Data Modeler, a database development tool, was used to
create the data model. Data models visualize data structures in order to organize and manage
the complexity of the data, the database technologies and the deployment environment.
3.4.6. Toad
Toad for Oracle is a low-overhead tool used to create and execute queries, as
well as to build and manage database objects. Toad version 7.6 was used with Oracle 9i.
When Oracle 9i was upgraded to Oracle 10g, however, the existing Toad version did not
support Oracle 10g, as it was designed to work only with Oracle versions 7.3.4 to 9.2.
As a result, Toad freeware 8.5 was used with Oracle 10g for a limited time.
3.4.7. Aqua Data Studio
Aqua Data Studio, a database query tool, was used to create, edit, and execute SQL
scripts, as well as to browse and visually modify database structures. Aqua Data Studio
4.0.1 was downloaded from IUware and installed on a Windows machine. It provided an
integrated database environment with a single consistent interface to all major relational
databases, such as Oracle and PostgreSQL. The database servers used were Oracle's BIO10G
and PostgreSQL's BIOX. This made it possible to handle multiple tasks simultaneously from
one application.
3.4.8. SQL Tools
SQLTools, a lightweight and robust tool for Oracle, was used for database
development. It is free and small, and does not require any installation.
3.4.9. EndNote
EndNote was used to retrieve, organize, and print bibliographies and bibliographic
references. EndNote 9.0 bibliographic software available at Indiana University was
downloaded and installed on a Windows machine. It was used to search online
bibliographic databases such as PubMed, organize references and images, and create
bibliographies and figure lists instantly. Reading the retrieved abstracts made it possible to decide whether
a reference was relevant to the subject of interest.
3.5. University Computing Resources used
3.5.1. BIOX
The host name for biox is in-info-bio3.ads.iu.edu. Its storage capacity is around 1.1 TB. The
operating system is Red Hat Enterprise Linux 4. The development tools
available on this machine were Python 2.3.4, Perl 5.8.5, PHP 5.02, MySQL 4.1.14,
Oracle 10.2.1, Gcc 3.4.3, Apache 1.3.11, BioPerl 1.5.1 and PostgreSQL 8.1. The server
was used for storing large databases such as STRING, and PostgreSQL was used for
accessing the STRING database.
3.5.2. Zen
Zen is the name of a LaCie server located at Indiana University Purdue University
Indianapolis. The website is http://zen.informatics.iupui.edu and the main purpose of this
server is to store and share bioinformatics project related data. The size of this server is
around 1860 GB. Zen was used to store raw data, parsed data and scripts (Perl, PHP,
control files etc.) related to human protein interactions.
3.5.3. Research Database Complex (RDC)
The Inquire-g database server and the Discover web server constitute the Research Database
Complex at Indiana University. Their host names are inquire-g.uits.indiana.edu and
discover.uits.indiana.edu respectively. They are accessed via ssh2, as it offers a more
secure connection with encrypted text. The Inquire-g server was used to store data in an Oracle
database. The Discover web server was used for publishing the research work on the World Wide
Web.
IV. Procedures and Interventions
4.1. Method Roadmap for HAPPI Database
The Method Roadmap for the HAPPI database (Fig.4.1.) gives an overview of the tasks, including
the input or motivation and the output of this research work. The roadmap offers a quick
understanding of the research work without the need to go through all of the
pages.
[Figure 4.1 flowchart. Input: Understand the complexity of protein functions. Steps: Literature Survey, Data Integration, Quality Assessment of Protein Interactions, HAPPI Development, Case Study. Output: A whole view of functional information of interacting proteins.]
Figure 4.1. Method Roadmap for HAPPI database
4.2. Architecture
The HAPPI database is a classical three-tier web application. The three-tier architecture of
the HAPPI database is discussed in the following paragraphs. The following figures (Fig.4.2.
and Fig.4.3.) give an overview of the structure and technologies of HAPPI's hardware and
software architectures.
Presentation Tier
This layer implements the "look and feel" of an application. It is responsible for the
presentation of data, receiving user inputs and controlling the user interface. It receives
an HTTP request and returns a response in the form of an HTML document. As a web
authoring markup language, HTML was used for defining content structures and
rendering web pages.
[Figure 4.2 diagram labels: Web Browser, World Wide Web, Discover, Apache, Inquire-g, Oracle, Postgres, Web Layer (HTML), Application Layer (PHP), Database Layer.]
Figure 4.2. Hardware Architecture for HAPPI database
Application Logic Tier
This is the layer in which the business logic exists. The business logic of the application
is the logic that decides whether all conditions are met and that implements the use case scenarios. It
processes each request according to the research rules, for example, deciding whether to
reject input data or send it to the database. The bulk of the functionality of the program is
found in the application layer. PHP was used for this tier.
[Figure 4.3 diagram labels. Client side: User, Internet Browser, Rendering, Client side scripting logic (Java Script). Server side: Presentation Layer (Template file, contains all HTML code in application), Logic Layer (Main PHP script, controlling the logic and flow of application), Data Access Layer (Database, Oracle 10g).]
Figure 4.3. Three-Tier Software Architecture: Structure and Technologies
Data Access Tier
This is the layer that manages the persistence of application information. It is powered by
an Oracle relational database server. Functions are used to execute database server-side
processes pertinent to data integrity. Views are mostly used for presenting data to
applications, as they offer some level of security and can be used as aliases to hide the physical
structures of database tables. Database tables are used primarily for storing data.
PHP offers two extension modules that can be used to connect to Oracle: the
standard Oracle functions (ORA) and the Oracle Call Interface functions
(OCI). The OCI extension module was used here to connect to Oracle from PHP because it
offers more options. For example, unlike ORA, OCI includes support for CLOBs,
BLOBs, BFILEs, ROWIDs, etc. The three-tier architecture was chosen
owing to advantages such as flexibility, maintainability, reusability, scalability and
reliability [40].
4.3. Data Integration
4.3.1. Data Warehouse Approach
The data warehouse approach was used as the solution for data integration. Databases were
assembled into a centralized system with a global data schema and an indexing system
for integration and navigation. Data was integrated into a central data repository and
was cleaned and transformed during the loading process. While a variety of data
models are used for data warehouses, including XML and ASN.1, the relational
data model, implemented in Oracle, was chosen. The relational database management system
(RDBMS) offers the advantage of a mature and widely accepted database technology and
a high-level standard query language (SQL). A drawback of the approach is that, as the number of databases in a data
warehouse grows, the cost of storage, maintenance, and updating data can become prohibitive.
Its advantage is that the data are readily accessed without Internet delay or
bandwidth limitation in network connections. Vigorous data cleansing to remove
potential errors and duplications was performed before entering data in the warehouse.
A major strength of a data warehousing approach is that it permits cleansing and filtering
of data because an independent copy of the data is being maintained [13]. Warehousing
exerts a load on the remote sources only at data refresh times, and changes in the remote
sources do not directly affect the warehouse’s availability.
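The kind of cleansing performed before loading can be illustrated with a minimal Python sketch. It assumes that interactions are undirected, so that (A, B) and (B, A) count as the same record, and that identifiers can be compared after trimming and upper-casing; both are assumptions made for illustration and the identifiers are hypothetical.

```python
def clean_interactions(pairs):
    """Drop incomplete rows and duplicate pairs, treating (A, B) as (B, A)."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        a, b = a.strip().upper(), b.strip().upper()
        if not a or not b:
            continue                      # incomplete record: skip it
        key = tuple(sorted((a, b)))       # order-independent key
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Hypothetical identifiers, for illustration only.
raw = [("a_human", "b_human"), ("B_HUMAN", "A_HUMAN "), ("", "C_HUMAN")]
result = clean_interactions(raw)
```

Because the warehouse holds an independent copy of the data, a filter of this kind can run once at load time instead of on every query.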
In a nutshell, for interaction data that must be cleaned, transformed, and curated, and for
which only the best query performance is adequate, data warehousing is probably the best
approach. Thus a limited data warehouse, a popular solution in the life sciences for data
mining of large databases in which carefully prepared datasets are critical
for success, was chosen.
4.3.2. Data Acquisition
The UniProt Knowledgebase (UniProtKB) provides the central database of protein
sequences with accurate, consistent, rich sequence and functional annotation. The data is
available in different formats, i.e. XML, FASTA and flat file. UniProt databases are
accessed from the web at http://www.uniprot.org and downloaded from
http://www.uniprot.org/database/download.shtml
UniProt Knowledgebase Release 6.2 was used. The Swiss-Prot Knowledgebase Release 48.8
file, uniprot_sprot.dat.gz, was downloaded by FTP in flat file format to the ZEN storage
server. The file uniprot_sprot.dat (713 MB) was then extracted using the WinRAR
application.
The TrEMBL Knowledgebase Release 31.2 file, uniprot_trembl.dat.gz, was downloaded by
FTP in flat file format to the ZEN storage server. The file uniprot_trembl.dat (3.3
GB) was then extracted using the WinRAR application.
STRING is a database of known and predicted protein-protein interactions. It uses a
relational database system (PostgreSQL) to store primary data and precomputed
predictions. The data is available in COG mode (flat files), Protein mode (flat files), and
database dumps. It is available free of charge for licensing to academic institutions.
The following figure and table (Fig.4.4. and Tab.4.1.) give an overview of the different
bioinformatics databases whose data was integrated into the HAPPI database, along with
their descriptions, versions and sizes.
[Figure 4.4 diagram: Swissprot, Trembl, Pfam, Kegg, IPI, String and Ophid feed the central HAPPI database. Contributions labeled in the diagram: 12905 curated proteins, 2212675 computer annotated proteins, 52186 domains, 4084 pathways, ID mapping, 114360 co-citations, 2403198 associations, 51207 interactions.]
Figure 4.4. Integration of Protein Annotation, Interaction, Domain,
Sequence and Pathway Data
The KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database was downloaded
by FTP as an anonymous user and then unzipped using the gunzip utility. OPHID, the
Online Predicted Human Interaction Database, is available for download in two formats,
PSI XML and plain text. A form with the terms of download and use was
filled in, and an email was received with a link to download the data. By following the link, the
human protein interaction data, i.e. ophid1140247825213.txt, was downloaded to the ZEN
storage server.
SwissPfam, the domain database of Swiss-Prot and TrEMBL proteins, Release 18, was
downloaded as an anonymous user from sanger.ac.uk/pub/databases/Pfam. In a nutshell, the
following files were downloaded from different bioinformatics database servers for
protein interaction data analysis:
File Downloaded | Description | Download Date | Version | Source | Size
Uniprot_sprot.dat.gz | Swissprot Knowledgebase | 1/18/2006 | 31.2 | UniProt | 713 MB
uniprot_trembl.dat.gz | TrEMBL Knowledgebase | 10/20/2005 | 48.8 | UniProt | 3.3 GB
Interaction.data.v6.2.sql.gz | Protein Interactions | 2/28/2006 | 6.2 | STRING | 1.6 GB
Primary.data.v6.2.sql.gz | Protein Players | 3/1/2006 | 6.2 | STRING | 265 MB
Homology.data.v6.2.sql.gz | Protein homology | 3/1/2006 | 6.2 | STRING | 4.5 GB
Protein.links.detailed.v6.3.txt.gz | Protein network data | 3/1/2006 | 6.3 | STRING | 172 MB
COG.links.detailed.v6.3.txt.gz | Association scores | 3/1/2006 | 6.3 | STRING | 13 MB
Genes.tar.gz | Genes and Pathways | 3/9/2006 | 38 | KEGG | 1.1 GB
Swisspfam.gz | Domains | 10/5/2005 | 18 | PFAM | 72 MB
ophid1140247825213.txt | Human Protein Interactions | 2/20/2006 | Not Available | OPHID | 334 KB
Table 4.1. An overview of Data Acquisition from different data sources
4.3.3. Data Reduction
As the research work is focused on the human organism, steps were taken to reduce the data
for all organisms to the data for Homo sapiens. The table (Tab.4.2.) shows the
total number of records, the number of reduced records, and the identifier used to reduce the records of each
bioinformatics database integrated into the HAPPI database.
Swissprot: A Perl program was written to extract only the human proteins from the full set of
proteins.
Each protein record was analyzed to determine whether it was a
human or non-human record. As the UniProt ID is the identifier that starts each
protein entry, a parser was written to check the first line of each entry. If the protein
identifier had ‘human’ in its name, the entry was considered a human protein and was
written to the output file. Otherwise, it was considered a non-human protein and
was not written to the output file.
Upon executing this parser, 12905 human proteins were written to the output file from a total
of 280594 proteins of all organisms.
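The filtering step can be sketched as follows. This is a Python illustration rather than the Perl parser that was written; it assumes that each entry starts with an ID line and ends with a // line, and that human entries can be recognized by an entry name ending in _HUMAN. The sample entries are reduced to the lines needed for the example.

```python
import io


def is_human(id_line):
    """True when the entry name on the ID line ends in _HUMAN (assumption)."""
    fields = id_line.split()
    return len(fields) > 1 and fields[0] == "ID" and fields[1].endswith("_HUMAN")


def human_entries(handle):
    """Yield complete flat-file entries whose first line names a human protein."""
    entry = []
    for line in handle:
        entry.append(line)
        if line.startswith("//"):        # terminator line ends the entry
            if is_human(entry[0]):
                yield "".join(entry)
            entry = []


# Two shortened, illustrative entries.
sample = io.StringIO(
    "ID   FAAH_HUMAN     STANDARD;      PRT;   579 AA.\n"
    "//\n"
    "ID   FAAH_MOUSE     STANDARD;      PRT;   579 AA.\n"
    "//\n"
)
kept = list(human_entries(sample))
```

Reading one entry at a time keeps memory use small, which matters for flat files of several hundred megabytes or more.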
Trembl: As this database also belongs to the UniProt Knowledgebase, the same procedure as for
Swissprot was followed. A Perl parser was written to extract the human proteins from the
proteins of all organisms. 2,212,675 proteins were reduced to 57,924 human proteins.
SwissPfam: A Perl parser was written to extract human proteins from the proteins of all
organisms. Upon execution, this parser reduced 1530770 total proteins to 52186 human
proteins.
Ophid: As this database consists of only human protein interactions, the data was used as
is in the research work.
String: As STRING uses a PostgreSQL database for data storage, the PostgreSQL database
system residing on the BIOX server was utilized. Under Postgres, a local database named
‘rani’ was created. The SQL dump file of STRING was then executed under this database
using psql. Upon execution of this file, several STRING tables were created and
loaded, and the tables and their contents were analyzed. The STRING interaction table has
30,104,706 interactions for 180 organisms (Tab.4.2.). As the research work is concerned
only with the human organism, another table was created containing only human proteins. As the
taxonomy identifier of the human organism is ‘9606’, each record was checked, and if the
record had the human taxonomy identifier, it was written to the newly
created table. The table contents were then copied to a delimited text file.
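The taxonomy check can be sketched as below. In this work the selection was done inside PostgreSQL; the Python version here only illustrates the rule on a delimited export, and the column layout (taxonomy identifier in the first column, tab delimiter) and the sample rows are assumptions.

```python
import csv
import io

HUMAN_TAXON = "9606"  # taxonomy identifier of the human organism


def human_rows(handle, taxon_column=0, delimiter="\t"):
    """Yield rows whose taxonomy column equals the human taxonomy identifier."""
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) > taxon_column and row[taxon_column] == HUMAN_TAXON:
            yield row


# Hypothetical rows: taxonomy id, two protein ids, a score.
sample = io.StringIO("9606\tP1\tP2\t900\n10090\tP3\tP4\t700\n")
selected = list(human_rows(sample))
```

The identifier is compared as a string so that no numeric conversion of the export is needed.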
Kegg: A Perl parser was written to distinguish between genes with pathways and genes
with no reported pathways. As the main purpose of integrating the Kegg database is to
extract pathway information, the parser was executed to create a new output file containing only the
genes that have pathways. Of around 23,464 genes, only 4084 had
pathway information.
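A minimal Python sketch of this selection is given below. It assumes a flat-file layout in which the field keyword occupies the first 12 columns of a line, continuation lines leave those columns blank, and each record ends with a /// line; the sample records are invented for illustration and the Perl parser that was written may have worked differently.

```python
import io


def genes_with_pathways(handle):
    """Yield (entry, pathways) for every record that has a PATHWAY field."""
    entry, pathways, field = None, [], None
    for line in handle:
        if line.startswith("///"):            # record terminator
            if entry and pathways:
                yield entry, pathways
            entry, pathways, field = None, [], None
            continue
        keyword = line[:12].strip()
        if keyword:
            field = keyword                   # a new field starts on this line
        value = line[12:].strip()
        if field == "ENTRY" and value:
            entry = value.split()[0]
        elif field == "PATHWAY" and value:
            pathways.append(value)


# Two invented records: only the first one carries pathway information.
sample = io.StringIO(
    "ENTRY       1234              CDS       H.sapiens\n"
    "NAME        GENE1\n"
    "PATHWAY     PATH: hsa04910  Insulin signaling pathway\n"
    "///\n"
    "ENTRY       5678              CDS       H.sapiens\n"
    "NAME        GENE2\n"
    "///\n"
)
result = list(genes_with_pathways(sample))
```

Records without a PATHWAY field are skipped, which mirrors the reduction from all genes to the subset with reported pathways.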
Database | Data Reduction Identifier | Total Records | Reduced Records
SwissProt | SwissProt ID | 280,594 | 12,905 proteins
Trembl | TrEMBL ID | 2,212,675 | 57,924 proteins
Pfam | SwissProt ID | 1,530,770 | 52,186 proteins
String | Taxonomy ID | 30,104,706 | 2,403,198 interactions for 17,636 proteins
Kegg | Pathway | 23,464 | 4084 genes with pathways
Ophid | None | 51,207 | 51,207 interactions for 7002 proteins
Table 4.2. An Overview of Data Reduction of Protein Integrated Data
4.3.4. Feature Selection
SwissProt: The structure of a sequence entry was analyzed for selecting the required
features of each protein. Each sequence entry is composed of lines. Different types of
lines, each with their own format, are used to record the various data that make up the
entry. Each line begins with a two-character line code, which indicates the type of data
contained in the line. The current line types and line codes and the order in which they
appear in an entry, are shown in the table below (Tab.4.3.).
Line code | Content | Occurrence in an entry
ID | Identification | Once; starts the entry
AC | Accession number(s) | Once or more
DT | Date | Three times
DE | Description | Once or more
GN | Gene name(s) | Optional
OS | Organism species | Once
OG | Organelle | Optional
OC | Organism classification | Once or more
OX | Taxonomy cross-reference | Once
RN | Reference number | Once or more
RP | Reference position | Once or more
RC | Reference comment(s) | Optional
RX | Reference cross-reference(s) | Optional
RG | Reference group | Once or more (Optional if RA line)
RA | Reference authors | Once or more (Optional if RG line)
RT | Reference title | Optional
RL | Reference location | Once or more
CC | Comments or notes | Optional
DR | Database cross-references | Optional
KW | Keywords | Optional
FT | Feature table data | Optional
SQ | Sequence header | Once
(blanks) | Sequence data | Once or more
// | Termination line | Once; ends the entry
Table 4.3. Line types and codes of a protein sequence entry in Uniprot database [36]
As shown in the above table, some line types are found in all entries, while others are optional.
Some line types occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//). A Perl program was then
written to parse only the required features (given in the table below) of each protein.
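Selecting features by line code can be sketched as follows. This is a Python illustration, not the Perl program that was written; it assumes that the line code occupies the first two characters of a line and that the content starts at the sixth character. The list of wanted codes and the sample lines are illustrative.

```python
from collections import defaultdict


def parse_entry(lines, wanted=("ID", "AC", "DE", "GN", "OS", "OX", "KW")):
    """Group the content of the selected line types by two-character code."""
    features = defaultdict(list)
    for line in lines:
        code, content = line[:2], line[5:].rstrip()
        if code in wanted:
            features[code].append(content)
    return dict(features)


# A shortened, illustrative entry.
entry = [
    "ID   FAAH_HUMAN     STANDARD;      PRT;   579 AA.",
    "OS   Homo sapiens (Human).",
    "OX   NCBI_TaxID=9606;",
    "SQ   SEQUENCE   579 AA;",
    "//",
]
features = parse_entry(entry, wanted=("ID", "OS", "OX"))
```

Line types that can occur more than once in an entry are collected into a list, so repeated codes do not overwrite one another.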
TrEMBL: The general structure of an entry in TrEMBL is identical to that of the SwissProt
database. The data class distinguishes a fully-annotated entry from a computer-annotated entry of each protein. The data class for TrEMBL is Preliminary, whereas for
Swiss-Prot it is Standard. To parse the required features of this database, a Perl program
was written and executed.
Pfam: The domain structure of the Pfam database was analyzed for feature selection.
Domains comprise two types of regions. PfamA regions are the regions of proteins that
are predicted by the Pfam collection of hidden Markov models to belong to a family.
These are strongly trusted matches to a family and are very unlikely to be false matches.
PfamB regions are regions of proteins that belong to a PfamB family. Pfam-B is an
automatically generated supplement to Pfam that is generated from the PRODOM
database. A Perl parser was written to select the domain features of human proteins.
Kegg: Kegg pathway database structure was analyzed for feature selection. A typical
record consists of the name of gene, its description, its pathway information, the source
organism, Ortholog information, chromosome location of gene, motif associated,
database cross-references, codon usage, amino acid sequence and nucleotide sequence. A
Perl parser was written to select gene name, description and pathway information of each
record.
The following table (Tab.4.4.) summarizes the features selected from each record of the
bioinformatics databases integrated into the HAPPI database.
Database | Features Selected
Uniprot | Entry Name, Total number of Amino Acids, Accession Number(s), Date of creation and last modification of the database entry, Description of Protein, Gene name, Organism species, Taxonomy cross-reference(s), Bibliographic cross-reference(s), Database cross-references, Keywords, Sequence, Molecular weight, and Check Value
Pfam | Uniprot Identifier, Uniprot Accession Number, Domain(s) name, Domain(s) description, Domain(s) Identifier, Domain(s) occurrence, and Domain(s) Position
String | Ensembl Protein Identifier, Ensembl Protein Interactor, Neighborhood score, Gene fusion score, Co-occurrence score, Coexpression score, Experimental score, Database score, Text mining score, Physical sub score, Combined score, and Mapping between Ensembl Protein Identifier and Uniprot Protein ID
Ophid | Swissprot Protein Identifier, Swissprot Protein Interactor, Source of Dataset
Kegg | Gene name, Synonym(s) of gene, Description of gene, and Pathway information
Table 4.4. Summary of feature selection from Protein Integrated databases
4.3.5. Meta-data specification
Metadata is data describing data, that is, data that provides documentation on other data
managed within an application or environment. A new metadata record is created for
each dataset. The decision to classify data as a dataset is called ‘granularity’ [41]. A
dataset might be the data from a public data source or experiments. The metadata record
describes all the important information about the dataset, such as where the data was
collected, when it was collected, and who collected it. It contains information that will
enable users to easily interpret the data (such as what column headings mean, what the
units are etc.)
Metadata records are stored in an Oracle database. Although many different metadata
standards are in use in the scientific world, the most widely used one, Dublin Core (DC),
was adopted. The Dublin Core Metadata Element Set consists of 15 elements: Title,
Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier,
Source, Language, Relation, Coverage, and Rights [42]. The metadata elements created in
this project are unique identifier, name of the creator, date created, table name, number
of attributes, names of the attributes, sizes of the attributes, table source name,
associated fact, table subject, table description, and table size (number of records in
the table). The benefits of metadata include data management, duplicate data reduction,
and concise dataset description. Here metadata is mapped to a tabular format.
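As an illustration only, one such metadata record could be represented as sketched below.
The field names follow the project metadata elements listed above; the class name
DatasetMetadata, the attribute names and sizes, and all example values other than the table
name and its record count are assumptions made for this sketch, not the layout actually
stored in the HAPPI metadata table.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List


@dataclass
class DatasetMetadata:
    """One metadata record per dataset (hypothetical layout)."""
    identifier: str              # unique identifier
    creator: str                 # name of the creator
    date_created: date
    table_name: str
    attribute_names: List[str]   # names of the attributes (columns)
    attribute_sizes: List[int]   # sizes of the attributes
    source_name: str             # table source name
    associated_fact: str
    subject: str
    description: str
    table_size: int              # number of records in the table

    @property
    def attribute_count(self) -> int:
        # "Number of attributes" is derived from the name list.
        return len(self.attribute_names)


# Example values are invented for illustration, apart from the table name
# and record count, which are taken from the loading summary.
record = DatasetMetadata(
    identifier="MD-0001",
    creator="example creator",
    date_created=date(2006, 1, 1),
    table_name="ophid_human_interactions",
    attribute_names=["protein_a", "protein_b"],   # hypothetical columns
    attribute_sizes=[20, 20],                     # hypothetical sizes
    source_name="Ophid",
    associated_fact="protein-protein interaction",
    subject="human protein interactions",
    description="Known and predicted human protein interactions",
    table_size=51207,
)

# Mapping the record to a flat row mirrors the tabular storage of metadata.
row = asdict(record)
```

Keeping the attribute count as a derived property is a design choice of this sketch: it
avoids storing a number that could disagree with the list of attribute names.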
4.3.6. Database Design
Program-data independence was achieved by creating independent tables, thus insulating
the data from program changes. End users were determined to be researchers, biologists,
and protein interactomics specialists. These users are considered to be casual by the
database designers.
Data Model
The data model (Fig.4.5.) describes data as entities, relationships and attributes. Entities
are the things about which we seek information.
[Figure: entity-relationship diagram. Entities shown: species, protein_species,
genes_proteins, proteins, human_protein_sequence, crossreferences, database_crossreferences,
bibiliographic_crossreferences, human_protein_interactions, human_protein_domain,
identifiers_proteins, and domain. Several entities carry uniprot_id as a foreign key.]
Figure 4.5.Initial Data Model for HAPPI Database
Attributes are the data we collect about the entities. Relationships provide the structure
needed to draw information from multiple entities. Figure 4.5 displays the initial data
model designed for the HAPPI database. This initial model was later modified to include
the Kegg and Ophid databases. The resulting schema brings together data from Swissprot,
Trembl, String, Ophid, Pfam and Kegg. Selected attributes from each of these databases
contribute to the structure of the HAPPI database. Column names generally describe the
data they contain.
4.3.7. Data Storage or Loading
Oracle’s SQL*Loader was used to load large amounts of parsed data into tables. Once the
data files and the target tables had been created, a control file was written for each
table to tell Oracle how to load data from the corresponding data file. SQL*Loader was
then executed to read the control file and load the data. Each run produced a log file
that records what happened, including any errors that occurred. The following table
(Tab.4.5.) summarizes the different tables used for data storage.
Public Database   Total Records   Parsed/Loaded Records   Schema     Table
SwissProt         280,594         12,905                  nsudhara   human_protein_uniprot_sprot
Trembl            2,212,675       57,924                  nsudhara   human_protein_uniprot_trembl
Pfam              1,530,770       52,186                  nsudhara   human_protein_domain_swisspfam
String            30,104,706      2,403,198               nsudhara   new_human_string_interactions
Ophid             51,207          51,207                  nsudhara   ophid_human_interactions
Kegg              23,464          4,084                   nsudhara   kegg_human_pathway
String            1,533,898       1,114,360               nsudhara   string_human_prot_abstracts
Table 4.5.An overview of tables loaded from Protein Integrated databases
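The loading step described in Section 4.3.7 relies on one control file per table. The
sketch below shows how such a control file and the matching SQL*Loader command line could
be generated. It is a minimal illustration under stated assumptions: the column names, the
'|' field delimiter, the file names and the helper function names are hypothetical, and
the thesis does not give the control files that were actually used.

```python
from typing import Sequence


def render_control_file(table: str, data_file: str,
                        columns: Sequence[str], delimiter: str = "|") -> str:
    """Return the text of a SQL*Loader control file for one table."""
    column_list = ",\n  ".join(columns)
    return (
        "LOAD DATA\n"
        f"INFILE '{data_file}'\n"
        "APPEND\n"                       # add rows to an existing table
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "TRAILING NULLCOLS\n"            # missing trailing fields load as NULL
        f"(\n  {column_list}\n)\n"
    )


def sqlldr_command(user: str, control_file: str, log_file: str) -> str:
    """Build the command line that runs SQL*Loader with a control file."""
    return f"sqlldr userid={user} control={control_file} log={log_file}"


# Hypothetical columns for one of the loaded tables.
control_text = render_control_file(
    table="ophid_human_interactions",
    data_file="ophid_human_interactions.dat",
    columns=["protein_a", "protein_b", "source_dataset"],
)
command = sqlldr_command("nsudhara", "ophid.ctl", "ophid.log")
```

Generating the control files from one template keeps the per-table files consistent, which
matters when, as here, seven tables are loaded in the same way.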
4.4. Query Processing
PHP’s OCI extension module was used to connect PHP to Oracle, since it offers more
options, such as support for CLOBs, BLOBs, BFILEs, and ROWIDs. The OCIError() function
was used to obtain an array containing the error code, message, offset and SQL text. The
error for a specific session or cursor was obtained by supplying the appropriate handle
as an argument to OCIError(). Without any arguments, OCIError() returns the last
encountered error.
The most direct way to speed up the selection of data, i.e. search optimization, is to use
an index. An index is essentially a structure of pointers to rows of data in a table,
ordered so that matching rows can be located without scanning the whole table. Indexes
were created for all the required attributes over the different tables in the database.
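The effect of an index on a lookup can be illustrated with a small, self-contained sketch.
It uses SQLite from the Python standard library purely as a stand-in for Oracle, and the
table, column and index names are hypothetical; the point is only that the same query
changes from a full table scan to an index search once the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interactions ("
    "protein_a TEXT, protein_b TEXT, combined_score INTEGER)"
)
# Fill the table with synthetic rows so the planner has something to search.
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("P%05d" % i, "P%05d" % (i + 1), 500) for i in range(1000)],
)


def query_plan(sql: str, params: tuple = ()) -> str:
    """Return the planner's description of how a query would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)   # last column holds the detail text


query = "SELECT * FROM interactions WHERE protein_a = ?"

plan_before = query_plan(query, ("P00042",))   # full scan of the table
conn.execute(
    "CREATE INDEX idx_interactions_protein_a ON interactions (protein_a)"
)
plan_after = query_plan(query, ("P00042",))    # search through the new index
```

Printing plan_before and plan_after shows the change in access path; the exact wording of
the plan text depends on the SQLite version.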
4.5. User Interface: HAPPI Database
User interface-flow diagrams [43] were used to give a high-level overview of the user
interface for the HAPPI database. This architectural approach was adopted to understand
the complete user interface of the system. Factors such as usability, clarity, simplicity,
and speed were considered in the design of the HAPPI website. The following figure
(Fig.4.6.) gives a high-level overview of the HAPPI website. The HAPPI website is
available at
http://discover.uits.indiana.edu:8340/ProteinInteractions/index.html
[Figure: user interface flow. The HAPPI main page takes a protein as input; the Submit
button leads to a list of interacting proteins with their descriptions and scores for the
given protein. From that list, the Interacting Protein Pair link leads to the genes,
pathways, co-citations and domains of the interacting protein pair, and each Individual
Protein link leads to the sequence, annotation and cross-reference information of that
protein. Outgoing links: Gene link (Entrez Gene), Pathway link (Kegg Pathway), Domain link
(Pfam Domain), Co-citation link (PubMed Literature), and cross-reference links (MIM
Disease, PDB Structure, Prosite, Ensembl Gene, Gene Ontology, Reactome, ProDom, PIR).]
Figure 4.6.HAPPI Database: User Interface Flow Diagram
4.6. Unified Scoring Model
We used a ranking method that works in principle by clustering interaction confidence
scores from different data collection methods. A unified scoring model was developed for
ranking the importance and reliability of human protein-protein interactions integrated
from the String and Ophid protein interaction databases. First, two scoring systems, one
for String and one for Ophid protein interactions, were developed individually; then a
unified mechanism was proposed to combine them. The objective of the unified scoring
model is to assess the quality of protein-protein interactions.
For a given input protein, protein interactions were retrieved from the String and Ophid
protein interaction databases. If an interaction is found only in String, the String
Protein Interaction Confidence score is used; if it is found only in Ophid, the Ophid
Protein Interaction Confidence score is used. If an interaction is found in both
databases, the unified scoring model is used. The final score is assigned to each
interacting pair of proteins.
String Protein Interaction Confidence Scores
Type of String Score    Minimum Score   Maximum Score
Neighborhood score      24              799
Gene Fusion score       1               898
Co-occurrence score     431             975
Co-expression score     200             231
Array score             24              552
Experimental score      24              999
Database score          24              996
Text mining score       24              816
Combined score          150             999
Table 4.6.Analysis of String database scores
String uses several methods (neighborhood, gene fusion, co-occurrence, co-expression,
experiments, databases, text mining) to predict protein interactions. String interaction
data and their confidence scores from these different sources were analyzed by thoroughly
examining the distribution of the scores (Tab. 4.6.), and a 5-star ranking scale was then
developed.
Rating based on combined score distributions
* (1 star): <20%
** (2 star): 20-32%
*** (3 star): 32-70%
**** (4 star): 70-90%
***** (5 star): >90%
Note: the unit of the score is 0.1%
Figure 4.7.String database score distributions
Manual clustering was done to achieve maximum representation of interactions in each
bin. The String combined score ranges from 0.001 to 0.999. Based on the combined score
distribution, a 5-star ranking model was developed (Fig.4.7.). If the String combined
score is less than 0.2, a protein-protein interaction is given 1 star. Two stars are
given for combined scores from 0.2 to 0.32, 3 stars for scores from 0.33 to 0.7, 4 stars
for scores from 0.71 to 0.9, and 5 stars for scores greater than 0.9. Confidence levels
were defined as low, medium, high and highest: a String combined score of at least 20% is
considered low confidence, at least 50% medium confidence, at least 75% high confidence,
and at least 95% the highest confidence. The confidence scores are directly proportional
to the reliability of a given protein-protein interaction.
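The star assignment can be written as a small function. The sketch below uses the
percentage cut-offs listed in Figure 4.7 (20%, 32%, 70% and 90%) and treats a score that
falls exactly on a cut-off as belonging to the lower bin, which matches the ranges quoted
in the text for the upper bins; the function names are hypothetical and this is not the
code used in HAPPI.

```python
def from_string_units(raw_score: int) -> float:
    """Convert a String score given in units of 0.1% (e.g. 999) to a fraction."""
    return raw_score / 1000.0


def star_rating(combined_score: float) -> int:
    """Map a String combined score in [0, 1] to a rating of 1 to 5 stars."""
    if not 0.0 <= combined_score <= 1.0:
        raise ValueError("combined score must lie between 0 and 1")
    if combined_score < 0.20:
        return 1
    if combined_score <= 0.32:
        return 2
    if combined_score <= 0.70:
        return 3
    if combined_score <= 0.90:
        return 4
    return 5


# The minimum and maximum combined scores reported for String (150 and 999).
lowest = star_rating(from_string_units(150))
highest = star_rating(from_string_units(999))
```

Converting the raw String units to a fraction first keeps the thresholds in the same form
as the percentages used in the rating scale.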
Ophid Protein Interaction Confidence Scores
Ophid human protein interaction data was thoroughly examined (Tab.4.7.) from the
perspective of the sources and datasets used in data collection. High confidence scores
were given to direct interactions, medium confidence scores to interactions inferred from
high-quality mammalian data, and low confidence scores to interactions inferred from
low-quality mammalian or non-mammalian data.
Source            Data Set               Confidence   Total Interactions
C. elegans        CORE_1                 0.5          1,223
                  CORE_2                 0.5
                  NON_CORE               0.4
                  LITERATURE             0.5
                  SCAFFOLD               0.5
                  INTEROLOG              0.6
                  CE_DATA                0.5
S. cerevisiae     low                    0.3          18,034
                  medium                 0.35
                  high                   0.4
                  MIPS                   0.4
M. musculus       Mouse                  0.7          1,800
                  AfCS                   0.6
                  Suzuki                 0.6
                  RikenDIP               0.7
                  RikenLit               0.6
                  RikenBIND              0.7
D. melanogaster   FlyHigh                0.65         3,883
                  FlyLow                 0.5
                  FlyCellCycle           0.6
LUMIER            WranaHigh              0.4          620
                  WranaMedium            0.4
                  WranaLow               0.3
H. sapiens        JonesErbB1             0.75         6,396
                  Pawson                 0.75
                  StelzlLow              0.75
                  StelzlMedium           0.75
                  StelzlHigh             0.75
                  VidalHuman_core        0.75
                  VidalHuman_non_core    0.75
Known Human PPI   MINT                   0.8          17,096
                  HPRD                   0.8
                  BIND                   0.8
Total                                                 49,052
Table 4.7.Analysis of Ophid database scores
Ophid [11] was built by mapping high-throughput (HTP) data from yeast, mouse, Drosophila
and C. elegans to human proteins. Confidence scores were assigned to each type of dataset
by taking the reliability of the source and of the dataset into account. High confidence
scores were given to known human protein-protein interactions, such as those from MINT and
BIND, whereas low confidence scores were assigned to predicted human protein-protein
interactions that were built by mapping HTP model organism data to human proteins.
Unified Scoring Method
The unified scoring model (Fig.4.8.) accounts for the fact that an interacting protein
pair may exist in both the String and Ophid protein interaction databases. If a
protein-protein interaction pair is found in both databases, the following steps are
followed:
• If the interaction source from String includes ‘database’ or the Ophid source
indicates ‘String’, the assigned STRING score is used.
• Otherwise, the following scoring formula is used:
Final Score (S) = 1 - (1 - Score [STRING]) * (1 - Score [OPHID])
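The consolidation rules can be sketched as a single function. This is a minimal
illustration, not the code used in HAPPI: the function name, the parameter names and the
way source labels are represented are assumptions, and scores are taken to be fractions
between 0 and 1. The formula can be read as combining two lines of evidence that are
treated as independent, which is also why it is skipped when the two records are known to
share a source.

```python
from typing import Iterable, Optional


def unified_score(string_score: Optional[float],
                  ophid_score: Optional[float],
                  string_sources: Iterable[str] = (),
                  ophid_source: str = "") -> Optional[float]:
    """Consolidate String and Ophid confidence scores for one protein pair.

    A score of None means the pair was not found in that database.
    """
    if string_score is None:
        return ophid_score            # found only in Ophid (or in neither)
    if ophid_score is None:
        return string_score           # found only in String

    # Found in both: check whether the two records may share the same evidence.
    shared_evidence = (
        "database" in {source.lower() for source in string_sources}
        or ophid_source.lower() == "string"
    )
    if shared_evidence:
        return string_score
    return 1.0 - (1.0 - string_score) * (1.0 - ophid_score)


# Worked example: String 0.6 and Ophid 0.5 give 1 - 0.4 * 0.5 = 0.8.
combined = unified_score(0.6, 0.5, string_sources=["experimental"])
```

Because each factor (1 - score) is at most 1, the combined value is never lower than either
input score, so agreement between the two databases can only raise the confidence.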
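[Figure: flowchart. For an input protein, interactors are extracted from STRING and from
OPHID, and each interactor is checked for a matching entry in the other database. An
interactor with no match keeps its own STRING score or OPHID score. For a matching pair,
the STRING score is used when the source is a database source (String) or String (Ophid);
otherwise the combined score S = 1 - (1 - Score [String]) * (1 - Score [Ophid]) is used.]
Figure 4.8.FlowChart for Score Consolidation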
V. Results and Discussion
The diversity of protein-related data was taken into account before designing an optimum
database system. During the development of the HAPPI database, every category of
interaction data was represented, and the database was designed to be visual and
user-friendly. We chose to develop a database, based on bioinformatics analysis, that
includes all the known human proteins and their interactions. The database is publicly
available and can be accessed through the Discovery Informatics and Computing Group
website.
The query web page has been designed to be as simple to use as possible without losing
precision. Figure 5.1 shows a screenshot of the query page; the protein has to be given
as a Uniprot identifier to extract all its known and predicted interactions. Once the
protein is submitted, we get a list of interacting proteins along with their
descriptions, each followed by a 5-star score (Fig.5.2.). Within the program, all the
interactors are sorted by their reliability score. As a result, all 5-star interactors
are shown at the top, followed by 4-star, 3-star, 2-star and 1-star interactors, if any.
This allows the user to view the interactors in order of interaction reliability.
Figure 5.1.The query page of HAPPI database
Figure 5.2.The Interaction Results Page of HAPPI database
In addition, clicking the interaction icon between a pair displays detailed information
on the domains, genes, co-citations and pathways of the interacting proteins (Fig.5.3.).
Within that information, links to the gene, abstract, domain and pathway records are
provided for users interested in further detail. The individual proteins are also
clickable, giving extensive information about a protein, including its sequence
information, annotation information, and related database cross-reference information
(Fig.5.4.). An important aspect of proteomic analysis lies in the information around
interacting proteins and in mapping the corresponding binding sites [10]. By
understanding proteins and their binding partners in the context of a network, insight
into the function of proteins can be obtained.
Figure 5.3.The Interaction Annotation Page of HAPPI database
Figure 5.4.The Protein Annotation Page of HAPPI database
Serine/threonine-protein kinase SGK1 as a Representative Entry in HAPPI Database
Proteins such as NEDD4, INS, P85A, INSR, PDK1, SGK1, PTEN, GRB2, and SCNAA were used to
test the functionality of the HAPPI database. SGK1 was used as a representative entry to
illustrate the breadth and depth of annotation in the HAPPI database and to highlight the
importance of integrating protein interaction data (Tab.5.1.).
SGK1
String
Ophid
HPRD
HAPPI
Interacting
Proteins
NEDD4




hPDK1




IMA2




MK07

SGK3

AKT1/PKB

NEDD4-2








NHERF-2



Q96G51


HIP-2


S6K


PDK2

ENaC




Table 5.1.Comparison of SGK1 interacting proteins across P.I. databases
SGK1 was searched against the String, Ophid, HPRD and HAPPI protein interaction
databases. The top proteins, with a score of more than 4 stars, were retrieved and
compared with the results from the other protein interaction databases. The results
showed that the HAPPI interaction database captures all the interacting proteins that
were identified by the other interaction databases. In addition, other features around
these interacting protein pairs were compared, such as domain information, pathway
information, gene information, sequence information, annotation information and database
cross-reference information (Tab.5.2.). Because the HAPPI database integrates interaction
information from both the String and Ophid databases, it appears to perform better than
the other databases in giving integrated information about an interacting pair of
proteins.
Features compared
String
Ophid
HPRD
List of Interacting Proteins



Scoring




Pathway Information

Gene Information

Annotation Information
Co-citation Information


Domain Information
Sequence Information
HAPPI









Keywords

Table 5.2.Comparison of HAPPI database features with P.I. databases
VI. Case Study
To show the benefits of the HAPPI database, we performed a case study (Fig.6.1.) using
the insulin signaling pathway in collaboration with a biology team on campus. We began by
taking two sets of proteins, set A and set B, that had previously been well studied as
separate processes. We queried these proteins against the HAPPI database and derived
high-confidence protein interaction data sets annotated with known KEGG pathways. We then
organized these protein interactions on a network diagram. The end result shows many
novel hub proteins that connect set A or B proteins. Some hub proteins are even novel
members outside of any annotated pathway, making them interesting targets to validate in
subsequent biological studies.
According to Dr. Blazer-Yost, “The peptide hormone insulin is best known as the agent
that is necessary for the stimulation of glucose uptake into cells. However, this hormone
is also a master regulator of many other biochemical processes including intermediary
metabolism, nutrient absorption, utilization and storage and even specific growth factor
effects. It has been estimated that insulin exerts a direct or indirect influence over
hundreds of biochemical intermediates [44]. The action of insulin is initiated by peptide
binding to a cell surface receptor. The receptor activation controls downstream signaling
elements. The effects of the hormone vary with cell type and are dependent on the
presence and compartmentalization of insulin target proteins and lipids. Because most
cells in the body contain insulin receptors, the number of potential responses is as varied
as the different types of cells found in an intact organism. It is not surprising, therefore,
that some of the insulin stimulated pathways are not completely elucidated (Fig.6.2.).
While there may be some commonality and overlap in the various potential pathways,
traditional approaches makes it difficult to discern all the possible combinations and
permutations even if some of the components of the pathway in a particular cell type are
already known.”
[Figure: flow chart. The insulin-stimulated pathway yields two inputs, the proteins
involved in the insulin pathway and the proteins modified by insulin treatment. Both are
queried against the HAPPI database to derive high-confidence interacting proteins; KEGG
pathways are then studied, a visualization network is developed, and hub proteins that
connect set A and set B proteins are targeted.]
Figure 6.1.Flow Chart for case study
Dr. Blazer-Yost’s research laboratory primarily focuses on the interactions between
ENaC and its associated hormones, i.e. insulin, ADH, and aldosterone. It has been shown
that insulin-stimulated Na+ reabsorption through ENaC is mediated by the phosphoinositide
pathway, but many of the effectors downstream of the initial activation of PI3-kinase
are unknown. The laboratory further investigates the PI3-kinase pathway in ADH-stimulated
Cl- and Na+ transport in mouse principal kidney cortical collecting duct cells, to
understand the role of the PI3-kinase pathway and the biochemical pathways involved in
hormone-induced Cl- and Na+ reabsorption in the distal nephron of the mammalian kidney
[45].
The illumination of such pathways has important medical significance, since common
health conditions such as hypertension, congestive heart failure and renal diseases are
caused by an imbalance in Na+ transport in the kidney.
Until now, traditional methods have been used, such as the mpkCCD cell culture model,
electrophysiology to monitor ion transport, and data interpretation using SigmaPlot. As
advances in supercomputing and informatics continue to play a critical role in research
and development, a network-based bioinformatics solution was proposed to deliver insights
into the behavior of cells, proteins and pathways. Dr. Blazer-Yost mentioned that “This
bioinformatics solution would be helpful in identifying potential unknown signaling
components within the framework of isolated known pathway intermediates. The
insulin-stimulated pathway which forms the starting point for this application results in
increased sodium reabsorption across the polarized epithelial cells formed by the
principal cells of the mammalian distal nephron.”
[Figure: schematic of insulin signaling. Labeled components: insulin, insulin receptor,
IRS, PI3-Kinase (110 and 85), PIP2, PIP3, PDK1, PDK2, SGK, NEDD-4, ENaC transport
vesicle, Na+, aldosterone, nucleus and tight junction. Phosphorylation sites are marked
P, and several steps are marked with question marks.]
Figure 6.2.Insulin Signaling Pathway [Contributed by Dr. Blazer-Yost]
Asking relevant biological questions guided us in understanding the problem and in
finding solutions.
1. How are these signaling pathways assembled?
2. Are there any relevant kinase signaling pathways?
3. What are the key molecules involved in these pathways?
4. What are the proteins involved in this pathway?
5. What domains do these proteins contain?
6. What are the likely interacting proteins and their role in ion transport?
Two sets of proteins were taken as input to this case study. Set A proteins are proteins
that were known to be involved in insulin-stimulated sodium transport in renal cells. Set
B proteins are proteins that were modified by insulin treatment in the
membrane/membrane-associated fraction, as identified by peptide mass fingerprinting via
MALDI-TOF MS [Appendix A]. The latter experiments were performed in the Blazer-Yost and
Witzmann laboratories at IUPUI. These proteins were then queried against the HAPPI
database, and the high-confidence interactors along with their pathways were extracted.
The proteins of interest were then selected based on the research experience of Dr.
Blazer-Yost [Appendix B].
Figure 6.3.Visualization of Insulin Pathway Protein Interaction Network
Using Pathway Studio
After capturing a list of interacting proteins of particular interest and the information
around those proteins, a network was built from the input protein list, the interacting
protein list and the pathways in which these proteins are involved. Figure 6.3 shows the
visualization of the insulin pathway protein interaction network. First, a demo version
of PathwayStudio [46] was used to visualize the network, with proteins represented as
nodes and interactions as edges. Its Search tool was used to find a list of proteins
based on a name or keyword, and its Build Pathway tool was used to find regulatory paths
between two or more proteins. However, the resulting network was too dense to make out
individual protein-protein interactions, and pathways were not shown. The network was
therefore redrawn manually, with pathways drawn as Venn diagrams and interactions as
lines (Fig.6.4.). The input proteins were colored red, except for the P85A protein, which
was colored white. This network gave a good representation of the interacting proteins
and the pathways in which they are present.
[Figure: manually drawn network. Protein labels shown: AL2S7, PICK1, ARHG1, RAC1, CSK,
Q7RTZ3, SRC, CSKP, KPCD, PTPA, GRK4, PTEN, PI52A, P3C2B, P3C2G, PLCD1, PI5PA, PLCG1,
PLCB1, DYRK3, PI4KA, KPCA, P85A, PK3CA, P55G, GRB2, PKD2, RASH, RRAS, INS, RAF1, KPCZ,
MP2K1, IGF2, PDPK1, ELK1, PRKX, INSR, PTN1, IRS, AAPK1, SRBS1, FRAP, CBLB, PTPRF, SHC1,
SOCS1, GTR4, SNX1, AAKG1, SGK1, PLEK, IMA2, AQP2, SCNAA, NEDD4, WWP1, WWP2.]
Figure 6.4.Visualization of Insulin Pathway Protein Interaction Network
The major pathways studied in this case study were regulation of actin cytoskeleton,
proteins of the tight junctional complex, phosphatidylinositol signaling intermediates,
the insulin signaling pathway and ubiquitin-mediated proteolysis. Each pathway is given a
color, and the proteins are placed according to their role in a pathway. Pathways are
represented in the form of a Venn diagram; they overlap one another because a few
proteins exist in more than one pathway. The ubiquitin-mediated proteolysis pathway is
drawn separately, as it shares no proteins with the other four pathways. Interestingly,
however, the proteins in this pathway interact with proteins in the insulin signaling
pathway.
The visualization network suggested several interactions that can now be tested
experimentally. Novel information was found around the SGK, NEDD4, PDPK1 and PTEN
proteins. SGK was already known to interact with NEDD4, which validates this framework.
In addition, SGK was found to interact with PDPK1 in the insulin pathway, which in turn
interacts with KPCA, which in turn interacts with PTEN. As an example of more detailed
information, the phosphatase PTEN appears to be linked to proteins in three of the four
domains. PTEN is a potential research target in the Blazer-Yost laboratory, and this
schema illustrates additional targets of the phosphatase that were not previously
considered. In addition, many of the other proteins that interact directly or indirectly
with components in multiple domains are of potential interest and can now be considered
by the research laboratory for the first time.
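The chain of interactions described above (SGK to PDPK1 to KPCA to PTEN) is the kind of
indirect link that can also be recovered programmatically from a list of interacting
pairs. The sketch below is illustrative only: it contains just the four interactions
named in this section, the function names are hypothetical, and a breadth-first search is
used here as one simple way to find the shortest chain; it is not the procedure used to
draw the network in the case study.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected interaction network from interacting protein pairs."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for protein_a, protein_b in pairs:
        network[protein_a].add(protein_b)
        network[protein_b].add(protein_a)
    return network


def shortest_path(network: Dict[str, Set[str]],
                  start: str, goal: str) -> Optional[List[str]]:
    """Return the shortest chain of interactions from start to goal, if any."""
    if start not in network or goal not in network:
        return None
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:       # walk back to the start protein
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in sorted(network[node]):
            if neighbor not in previous:     # visit each protein only once
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Only the interactions named in the text are included here.
network = build_network([
    ("SGK", "NEDD4"),
    ("SGK", "PDPK1"),
    ("PDPK1", "KPCA"),
    ("KPCA", "PTEN"),
])
chain = shortest_path(network, "SGK", "PTEN")
degrees = {protein: len(partners) for protein, partners in network.items()}
```

Counting the interaction partners of each protein, as in the last line, gives a first
indication of which proteins act as hubs in the network.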
VII. Conclusion
The publication of the draft human genome consisting of 30,000 genes is merely the
beginning of genome biology. A new way to understand the complexity and richness of the
molecular and cellular functions of proteins in biological processes is through the
understanding of biological networks. These networks include protein-protein interaction
networks, gene regulatory networks, and metabolic networks. Hence, interaction databases
documenting protein-protein interactions are a necessary tool for the network biology of
the 21st century.
HAPPI, the Human Annotated Protein-Protein Interaction database, fulfills such needs.
First, the interaction database provides a single relational database platform that
allows users to validate protein interactions by comparing results with previous
experiments. Second, the collection and organization of known and predicted protein
interactions acts as a resource for building up interaction networks into pathways.
Third, information related to the genes, pathways, domains and co-citations of
interacting proteins provides a whole view of the functional information of proteins.
Fourth, the properties of protein networks can be studied.
Directions for Future Work
The bioinformatics databases used may become outdated, depending on how frequently new
versions of each database are released. Keeping the integrated databases at their latest
versions is therefore essential for obtaining relevant interaction data for analysis.
Protein structures make it possible to distinguish among the large number of
domain-domain and protein-protein interactions, which in turn can separate the
biologically relevant interactions from the non-relevant ones. Providing the individual
structures of interacting proteins and the combined structures of protein interactions is
therefore pivotal in enriching the information around interacting proteins. In addition,
this database can be extended to worm, fly, and yeast protein interactions.
As domain-domain interactions play an important role in providing a global view of
protein-protein interactions, assessing the quality and reliability of interacting domain
pairs in the form of scores would help to increase the reliability of protein-protein
interactions.
Data expansion, multi-leveled processes and better graphical displays are other future
considerations for improving the HAPPI database management system. The more protein
interaction databases are integrated into the HAPPI database, the more reliable the
confidence scores of protein-protein interactions become. Therefore, more interaction
databases can be added to the HAPPI database to increase confidence in the interaction
scores. The code should also be optimized so that the results page of the HAPPI database
displays quickly.
VIII. Appendices
Appendix A: List of Proteins and their associated pathways related to Case Study
Protein
AAPK1,AAPK2,AAKG1,AAKB1,AAKB2,AAKG3,AAKG2
Pathway
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Insulin signaling pathway [PATH:hsa04910].
AP1S2,ANXA8,ANXA4,ANX11,ANXA6,ANXA5,ANX13
NO PATHWAY
ARHG1,ARHG7
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
AKT2
AKA12, AL2S7,ACTS,AKAP5,AP3S1,AQP2
NO PATHWAY
BMX
NO PATHWAY
KEGG: TGF-beta signaling pathway
[PATH:hsa04350].
KEGG: Hedgehog signaling pathway
[PATH:hsa04340].
KEGG: Cytokine-cytokine receptor interaction
[PATH:hsa04060].
BMP2
BTK
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
CCL28
KEGG: Cytokine-cytokine receptor interaction
[PATH:hsa04060].
CSKP
KEGG: Tight junction [PATH:hsa04530].
CABIN, CSN3,CENG1
NO PATHWAY
CAV1
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
CBLB
CXA1
KEGG: Gap junction [PATH:hsa04540].
KEGG: Cell Communication [PATH:hsa01430].
CSK
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
CNTN1
KEGG: Cell adhesion molecules (CAMs)
[PATH:hsa04514].
DCAK1,DLG4,DOK1
NO PATHWAY
DYR1B,DYRK2
NO PATHWAY
KEGG: Nicotinate and nicotinamide metabolism
[PATH:hsa00760].
KEGG: Benzoate degradation via CoA ligation
[PATH:hsa00632].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
DYRK3,DYRK4,DYR1A
ELK1
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Dorso-ventral axis formation
[PATH:hsa04320].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
ERBB4
EPHA2,EPHA3
KEGG: Axon guidance [PATH:hsa04360].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
FAK2
FRAP
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Nicotinate and nicotinamide metabolism
[PATH:hsa00760].
KEGG: Benzoate degradation via CoA ligation
[PATH:hsa00632].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
GRK6/GRK7/GRK5/GRK4
GBLP,GRB10,GRB2,GAB2,GAB1,GNDS,GBB3
NO PATHWAY
GTR4
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Insulin signaling pathway [PATH:hsa04910].
GNA12,GBG12
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: MAPK signaling pathway [PATH:hsa04010].
HNRPK
NO PATHWAY
INSR
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Dentatorubropallidoluysian atrophy
(DRPLA) [PATH:hsa05050].
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Dentatorubropallidoluysian atrophy
(DRPLA) [PATH:hsa05050].
KEGG: Type I diabetes mellitus [PATH:hsa04940].
KEGG: Insulin signaling pathway [PATH:hsa04910].
INS
IRS1,IRS2
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Insulin signaling pathway [PATH:hsa04910].
IGF1R
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Adherens junction [PATH:hsa04520].
IKKE
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
ITK
INSRR,IMA2,IRK1,IGF2
NO PATHWAY
JAD1A
NO PATHWAY
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
JAK1
KEGG: Nicotinate and nicotinamide metabolism
[PATH:hsa00760].
KEGG: Benzoate degradation via CoA ligation
[PATH:hsa00632].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KC1G1
KCNE4,KPCD2,KTN1
NO PATHWAY
KPCA
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Cholera - Infection [PATH:hsa05110].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KPCG
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KPCB
KPCZ
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KPCD
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Tight junction [PATH:hsa04530].
KPCI
KEGG: Tight junction [PATH:hsa04530].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KSYK
KLC1,KINN,KINH
NO PATHWAY
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
LCK
LYN
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
MARCS,MYO5A
NO PATHWAY
MP2K1
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Dorso-ventral axis formation
[PATH:hsa04320].
KEGG: Insulin signaling pathway [PATH:hsa04910].
NEDD4
KEGG: Ubiquitin mediated proteolysis
[PATH:hsa04120].
NSF,NHRF2,NED4L,NEK6
NO PATHWAY
NCK1
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Axon guidance [PATH:hsa04360].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
OCRL
PRKX
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Hedgehog signaling pathway
[PATH:hsa04340].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Axon guidance [PATH:hsa04360].
PAK6
PTN1
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
PTN6
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
PTN11
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
PI52A
PLCB2/PLCB1/PLCB4/PLCB3
KEGG: Gap junction [PATH:hsa04540].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
PLCD1
PLCG1
KEGG: Cholera - Infection [PATH:hsa05110].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Cholera - Infection [PATH:hsa05110].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
PLCG2
PCTK3,PTPRE,PICK1,PCBP1,PFD5,PTBP1,PLEK
NO PATHWAY
PDPK1
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
PI4KA,PI5PA
PSA7,PSA2,PSB2,PSA5,PSA1,PSA3
KEGG: Proteasome [PATH:hsa03050].
PLK1
KEGG: Cell cycle [PATH:hsa04110].
PKD2,PTEN,PTN21
NO PATHWAY
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
PK3CA,PK3CD,PK3CB
PK3CG
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
P3C2B,P3C2G
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
P55G
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
P85A
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Apoptosis [PATH:hsa04210].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
KEGG: Insulin signaling pathway [PATH:hsa04910].
KEGG: Cell adhesion molecules (CAMs)
[PATH:hsa04514].
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Insulin signaling pathway [PATH:hsa04910].
PTPRF
PTPA
KEGG: Tight junction [PATH:hsa04530].
RHG01,RABE1,RB33B,RARA,RET
NO PATHWAY
RIPK3,RIPK4,RHOA,RASK,RGS2
NO PATHWAY
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Dorso-ventral axis formation
[PATH:hsa04320].
KEGG: Insulin signaling pathway [PATH:hsa04910].
RAF1
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
RAC1
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Axon guidance [PATH:hsa04360].
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: Gap junction [PATH:hsa04540].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Dorso-ventral axis formation
[PATH:hsa04320].
KEGG: Axon guidance [PATH:hsa04360].
KEGG: Insulin signaling pathway [PATH:hsa04910].
RASH,RRAS,RASM,RRAS2,RASN
RASA1
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Huntington's disease [PATH:hsa05040].
KEGG: Axon guidance [PATH:hsa04360].
RASA2
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Axon guidance [PATH:hsa04360].
SHC1
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Insulin signaling pathway [PATH:hsa04910].
SRC
KEGG: Gap junction [PATH:hsa04540].
KEGG: Tight junction [PATH:hsa04530].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Adherens junction [PATH:hsa04520].
SOCS1
KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
KEGG: Insulin signaling pathway [PATH:hsa04910].
STK11
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
STX1A
KEGG: Parkinson's disease [PATH:hsa05020].
KEGG: SNARE interactions in vesicular transport
[PATH:hsa04130].
SNX1,STK6,SH3K1,SL9A2,SYPH
NO PATHWAY
SNIL1,S22A3,S10A2,SRBS1
NO PATHWAY
SL9A1
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
SCNAA,SCNNB,SCNND,SCNNG,SGK1,SGK2,SGK3
NO PATHWAY
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
SYNJ1,SYNJ2
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
TEC
KEGG: Nicotinate and nicotinamide metabolism
[PATH:hsa00760].
KEGG: Benzoate degradation via CoA ligation
[PATH:hsa00632].
KEGG: Inositol phosphate metabolism
[PATH:hsa00562].
KEGG: Phosphatidylinositol signaling system
[PATH:hsa04070].
TTK
TBB4,TBB2
KEGG: Gap junction [PATH:hsa04540].
KEGG: Adipocytokine signaling pathway
[PATH:hsa04920].
KEGG: Jak-STAT signaling pathway
[PATH:hsa04630].
TYK2
TESK2,TBA2,TUB
NO PATHWAY
TLN1,TIE1
NO PATHWAY
TXK
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Neuroactive ligand-receptor interaction
[PATH:hsa04080].
KEGG: Calcium signaling pathway
[PATH:hsa04020].
V1AR
KEGG: Neuroactive ligand-receptor interaction
[PATH:hsa04080].
V2R
VPS16,VPS41,VP33A
NO PATHWAY
KEGG: Ubiquitin mediated proteolysis
[PATH:hsa04120].
KEGG: Dentatorubropallidoluysian atrophy
(DRPLA) [PATH:hsa05050].
WWP1,WWP2
WBP2
NO PATHWAY
YPEL2
NO PATHWAY
Q96BD6,Q9BQ83,Q15464,Q8WWN9
NO PATHWAY
Q8N556,Q8NAL1,Q8N317
NO PATHWAY
KEGG: B cell receptor signaling pathway
[PATH:hsa04662].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810].
KEGG: MAPK signaling pathway [PATH:hsa04010].
KEGG: Toll-like receptor signaling pathway
[PATH:hsa04620].
KEGG: Leukocyte transendothelial migration
[PATH:hsa04670].
KEGG: Focal adhesion [PATH:hsa04510].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
KEGG: Wnt signaling pathway [PATH:hsa04310].
KEGG: Adherens junction [PATH:hsa04520].
KEGG: Axon guidance [PATH:hsa04360].
Q7RTZ3
KEGG: T cell receptor signaling pathway
[PATH:hsa04660].
KEGG: Natural killer cell mediated cytotoxicity
[PATH:hsa04650].
ZAP70
1433G,1433Z,1433B,1433E,1433F,1433T
KEGG: Cell cycle [PATH:hsa04110].
2A5E,1433F
NO PATHWAY
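The pathway annotations in Appendix A follow a regular textual pattern, "KEGG: <pathway name> [PATH:hsaNNNNN].", sometimes wrapped over two lines. The following is a minimal Python sketch, not part of the original HAPPI code, of how such entries could be split into a pathway name and a KEGG pathway identifier. The function name parse_kegg_entries and the regular expression are illustrative assumptions.

```python
import re

# One Appendix A style entry, e.g.
#   "KEGG: Insulin signaling pathway [PATH:hsa04910]."
# DOTALL lets the pathway name span a line break, as wrapped entries do.
KEGG_ENTRY = re.compile(
    r"KEGG:\s+(?P<name>.+?)\s*\[PATH:(?P<path_id>hsa\d{5})\]\.",
    re.DOTALL,
)


def parse_kegg_entries(text):
    """Return a list of (pathway_name, pathway_id) pairs found in text."""
    entries = []
    for match in KEGG_ENTRY.finditer(text):
        # Collapse the line break and any repeated spaces inside the name.
        name = " ".join(match.group("name").split())
        entries.append((name, match.group("path_id")))
    return entries


# Three entries copied from Appendix A, two of them wrapped.
sample = """KEGG: Type II diabetes mellitus [PATH:hsa04930].
KEGG: Dentatorubropallidoluysian atrophy
(DRPLA) [PATH:hsa05050].
KEGG: Regulation of actin cytoskeleton
[PATH:hsa04810]."""

for name, path_id in parse_kegg_entries(sample):
    print(path_id, name)
```

Rows marked NO PATHWAY simply produce no match, so they would need to be handled separately by whatever code associates protein names with these entries.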
Appendix B: Pathway significant proteins
ADIPOCYTOKINE SIGNALING PATHWAY
    AAPK1/2, AAKG1/2/3, AAKB1/2, AKT2, FRAP, GTR4, IRS1/2, JAK1, STK11, TYK2
INSULIN SIGNALING PATHWAY
    AAPK1/2, AAKG1/2/3, AAKB1/2, AKT2, CBLB, ELK1, FRAP, GTR4, INSR, INS, IRS1/2, KPCI/Z,
    MP2K1, PDPK1, PK3CA/B/D/G, PRKX, PTN1, PTPRF, P55G, P85A, RAF1, RASH/M/N, RRAS, RRAS2,
    SHC1, SOCS1
B CELL RECEPTOR SIGNALING PATHWAY
    AKT2, BTK, KPCB, KSYK, LYN, PTN6, PK3CA/B/D/G, P3C2B/G, P55G, P85A, PLCG2, RAC1,
    RASH/M/N, RRAS, RRAS2, Q7RTZ3
TOLL-LIKE RECEPTOR SIGNALING PATHWAY
    AKT2, IKKE, PK3CA/B/D/G, P3C2B/G, P55G, P85A, RAC1, Q7RTZ3
MAPK SIGNALING PATHWAY
    AKT2, ELK1, GNA12, GBG12, IKKE, KPCA/B/G, MP2K1, PRKX, RAF1, RAC1, RASH/M/N, RRAS,
    RASN, RASA1/2, Q7RTZ3
CELL ADHESION MOLECULES
    CNTN1, PTPRF
NICOTINATE AND NICOTINAMIDE METABOLISM
    DYRK3/4, DYR1A, GRK4/5/6/7, KC1G1, TTK
APOPTOSIS
    AKT2, PK3CA/B/D/G, P3C2B/G, P55G, P85A
TIGHT JUNCTION
    AKT2, CSKP, KPCA/B/D/G/I/Z, RASH/M/N, RRAS, RRAS2, SRC
FOCAL ADHESION
    AKT2, CAV1, IGF1R, KPCA/B, MP2K1, PAK6, PDPK1, PK3CA/B/D/G, P55G, P85A, RAF1, RAC1,
    RASH/M/N, RRAS, RRAS2, SHC1, SRC, Q7RTZ3
JAK-STAT SIGNALING PATHWAY
    AKT2, CBLB, JAK1, PTN6, PTN11, PK3CA/B/D/G, P3C2B/G, P55G, P85A, TYK2
T CELL RECEPTOR SIGNALING PATHWAY
    AKT2, CBLB, ITK, LCK, NCK1, PAK6, PTN6, PLCG1, PK3CA/B/D/G, P3C2B/G, P55G, P85A,
    RASH/M/N, RRAS, RRAS2, TEC, ZAP70
HEDGEHOG SIGNALING PATHWAY
    BMP2, PRKX
REGULATION OF ACTIN CYTOSKELETON
    ARHG1/7, CSK, GNA12, GBG12, INS, MP2K1, PAK6, PI52A, PK3CA/B/D/G, P3C2B/G, P55G, P85A,
    RAF1, RAC1, RASH/M/N, RRAS, RRAS2, SL9A1, Q7RTZ3
GAP JUNCTION
    CXA1, KPCA/B/G, MP2K1, PRKX, PLCB1/2/3/4, RAF1, RASH/M/N, RRAS, RRAS2, SRC, TBB2/4
CYTOKINE-CYTOKINE RECEPTOR INTERACTION
    BMP2, CCL28
BENZOATE DEGRADATION VIA CoA LIGATION
    DYRK3/4, DYR1A, GRK4/5/6/7, KC1G1, TTK
CHOLERA INFECTION
    KPCA, PLCG1/2
INOSITOL PHOSPHATE METABOLISM
    DYRK3/4, DYR1A, GRK4/5/6/7, KC1G1, OCRL, PI52A, PLCB1/2/3/4, PLCD1, PLCG1/2, PI4KA,
    PI5PA, PK3CA/B/D/G, SYNJ1/2, TTK
NEUROACTIVE LIGAND-RECEPTOR INTERACTION
    V1AR, V2R
PHOSPHATIDYLINOSITOL SIGNALING SYSTEM
    DYR1A, DYRK3/4, GRK4/5/6/7, KC1G1, KPCA/B/G, OCRL, PI52A, PLCB1/2/3/4, PLCD1, PLCG1/2,
    PI4KA, PI5PA, PK3CA/B/D/G, P3C2B/G, P55G, P85A, SYNJ1/2, TTK
TYPE II DIABETES MELLITUS
    FRAP, GTR4, INSR, INS, IRS1/2, KPCD/Z, PK3CA/B/D/G, P55G, P85A, SOCS1
UBIQUITIN MEDIATED PROTEOLYSIS
    NEDD4, WWP1/2
DORSO-VENTRAL AXIS FORMATION
    ERBB4, MP2K1, RAF1, RASH/M/N, RRAS, RRAS2
CALCIUM SIGNALING PATHWAY
    ERBB4, FAK2, KPCA/B/G, PRKX, PLCB1/2/3/4, PLCD1, PLCG1/2, V1AR
LEUKOCYTE TRANSENDOTHELIAL MIGRATION
    FAK2, ITK, KPCA/B/G, PTN11, PLCG1, PK3CA/B/D, P85A, RAC1, TXK, Q7RTZ3
ADHERENS JUNCTION
    INSR, IGF1R, PTN1/6, PTPRF, RAC1, SRC, Q7RTZ3
NATURAL KILLER CELL MEDIATED CYTOTOXICITY
    FAK2, KPCA, KSYK, LCK, MP2K1, PAK6, PTN6/11, PLCG1/2, PK3CA/B/D, P85A, RAF1, RAC1,
    RASH/M/N, RRAS, RRAS2, SHC1, Q7RTZ3, ZAP70
CELL CYCLE
    PLK1, 1433B, 1433F, 1433E, 1433G, 1433Z, 1433T
AXON GUIDANCE
    EPHA2/3, NCK1, PAK6, RAC1, RASH/M/N, RRAS, RRAS2, RASA1/2, Q7RTZ3
DENTATORUBROPALLIDOLUYSIAN ATROPHY (DRPLA)
    INSR, INS, WWP1/2
Wnt SIGNALING PATHWAY
    KPCA/B/G, PRKX, PLCB1/2/3/4, RAC1, Q7RTZ3
INSULIN SIGNALING PATHWAY
    AAPK1/2, AAKG1/2/3, AAKB1/2, AKT2, CBLB, ELK1, FRAP, GTR4, INSR, INS, IRS1/2, KPCI/Z,
    MP2K1, PDPK1, PK3CA/B/D/G, PRKX, PTN1, PTPRF, P55G, P85A, RAF1, RASH/M/N, RRAS, RRAS2,
    SHC1, SOCS1
REGULATION OF ACTIN CYTOSKELETON
    ARHG1/7, CSK, GNA12, GBG12, INS, MP2K1, PAK6, PI52A, PK3CA/B/D/G, P3C2B/G, P55G, P85A,
    RAF1, RAC1, RASH/M/N, RRAS, RRAS2, SL9A1, Q7RTZ3
PHOSPHATIDYLINOSITOL SIGNALING SYSTEM
    DYR1A, DYRK3/4, GRK4/5/6/7, KC1G1, KPCA/B/G, OCRL, PI52A, PLCB1/2/3/4, PLCD1, PLCG1/2,
    PI4KA, PI5PA, PK3CA/B/D/G, P3C2B/G, P55G, P85A, SYNJ1/2, TTK
TIGHT JUNCTION
    AKT2, CSKP, KPCA/B/D/G/I/Z, RASH/M/N, RRAS, RRAS2, SRC
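Appendix B abbreviates families of related proteins with a slash notation (for example PK3CA/B/D/G or PTN6/11). The sketch below, in Python, shows one way such entries could be expanded into individual protein names and turned into a protein-to-pathway index. The expansion rule is an assumption inferred from the names used in Appendix A (PK3CA, PK3CB, PK3CD, PK3CG; PTN6, PTN11), and the function names expand_shorthand and invert_index are hypothetical.

```python
import re
from collections import defaultdict


def expand_shorthand(entry):
    """Expand 'PK3CA/B/D/G' into ['PK3CA', 'PK3CB', 'PK3CD', 'PK3CG'].

    Assumed rule: the first part is a full name; a numeric alternative
    replaces the trailing digits of that name, and a letter alternative
    replaces the same number of trailing characters.
    """
    parts = [p.strip() for p in entry.split("/") if p.strip()]
    base = parts[0]
    names = [base]
    for alt in parts[1:]:
        if alt.isdigit():
            stem = re.sub(r"\d+$", "", base)   # PTN6 + 11 -> PTN11
        else:
            stem = base[:-len(alt)]            # PK3CA + B -> PK3CB
        names.append(stem + alt)
    return names


def invert_index(pathway_entries):
    """Map each expanded protein name to the set of pathways listing it."""
    index = defaultdict(set)
    for pathway, entries in pathway_entries.items():
        for entry in entries:
            for protein in expand_shorthand(entry):
                index[protein].add(pathway)
    return index


# Three short rows copied from Appendix B.
pathway_entries = {
    "CHOLERA INFECTION": ["KPCA", "PLCG1/2"],
    "UBIQUITIN MEDIATED PROTEOLYSIS": ["NEDD4", "WWP1/2"],
    "NEUROACTIVE LIGAND-RECEPTOR INTERACTION": ["V1AR", "V2R"],
}
index = invert_index(pathway_entries)
for protein in sorted(index):
    print(protein, sorted(index[protein]))
```

Because the rule is inferred rather than documented, any expanded name should be checked against the protein names in Appendix A before it is used in a query.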
Appendix C: Physical Schema / Rudimentary Data Dictionary
Uniprot Protein Table:
Attribute Name      Type        Size    Constraints
Uniprot_ID          VARCHAR2    20      PK, NOT NULL
Amino_Acids         VARCHAR2    20      NOT NULL
Uniprot_Acc         VARCHAR2    800     NOT NULL
Data_Info           VARCHAR2    2000    NOT NULL
Protein_Desc        VARCHAR2    4000    NOT NULL
Gene                VARCHAR2    800     FK, NOT NULL
Organism            VARCHAR2    100     NOT NULL
Taxonomy_ID         VARCHAR2    100     NOT NULL
Primary_ref_id      VARCHAR2    500     NULL
Db_ref              CLOB        -       NULL
Keywords            VARCHAR2    4000    NULL
Mol_wt              VARCHAR2    50      NOT NULL
Check_value         VARCHAR2    50      NOT NULL
Protein_seq         CLOB        -       NOT NULL
Pfam Domain Table:
Attribute Name      Type        Size    Constraints
Uniprot_ID          VARCHAR2    20      PK, NOT NULL
Uniprot_acc         VARCHAR2    800     NOT NULL
Domain_name         VARCHAR2    100     NULL
Domain_desc         VARCHAR2    1000    NULL
Domain_no           VARCHAR2    100     NULL
Domain_pos          VARCHAR2    400     NULL
Domain_ID           VARCHAR2    100     PK, NOT NULL
String Protein Interaction Table:
Attribute Name                    Type        Size    Constraints
Ensembl_protein_id_a              VARCHAR2    40      PK, NOT NULL
Ensembl_protein_id_b              VARCHAR2    40      PK, NOT NULL
Equiv_nscore                      INTEGER     -       NULL
Equiv_nscore_transferred          INTEGER     -       NULL
Equiv_fscore                      INTEGER     -       NULL
Equiv_pscore                      INTEGER     -       NULL
Equiv_hscore                      INTEGER     -       NULL
Array_score                       INTEGER     -       NULL
Array_score_transferred           INTEGER     -       NULL
Experimental_score                INTEGER     -       NULL
Experimental_score_transferred    INTEGER     -       NULL
Database_score                    INTEGER     -       NULL
Database_score_transferred        INTEGER     -       NULL
Textmining_score                  INTEGER     -       NULL
Textmining_score_transferred      INTEGER     -       NULL
Subscore_physical                 VARCHAR2    10      NULL
Combined_score                    INTEGER     -       NULL
Ophid Protein Interaction Table:
Attribute Name      Type        Size    Constraints
Uniprot_acc1        VARCHAR2    100     FK, NOT NULL
Uniprot_acc2        VARCHAR2    100     FK, NOT NULL
Dataset             VARCHAR2    100     NOT NULL
Kegg Pathway Table:
Attribute Name      Type        Size    Constraints
Gene_name           VARCHAR2    255     FK, NOT NULL
Gene_synonyms       VARCHAR2    255     NOT NULL
Gene_desc           VARCHAR2    255     NULL
Pathway             VARCHAR2    1000    NULL
Protein Identifiers Table:
Attribute Name        Type        Size    Constraints
Uniprot_protein_id    VARCHAR2    40      FK, NOT NULL
Ensembl_protein_id    VARCHAR2    40      FK, NOT NULL
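The String interaction table is keyed by Ensembl protein identifiers, while the other tables use Uniprot identifiers, so the Protein Identifiers Table presumably serves as the bridge between them. The following Python sketch illustrates that join using the standard sqlite3 module rather than the Oracle database described in the thesis. The table names (protein_identifiers, string_interaction), the reduced column set, the SQLite column types, and all rows are placeholders invented for illustration and are not HAPPI data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of two Appendix C tables; column names follow the
# data dictionary, table names and SQLite types are assumptions.
conn.executescript("""
CREATE TABLE protein_identifiers (
    Uniprot_protein_id TEXT NOT NULL,
    Ensembl_protein_id TEXT NOT NULL
);
CREATE TABLE string_interaction (
    Ensembl_protein_id_a TEXT NOT NULL,
    Ensembl_protein_id_b TEXT NOT NULL,
    Combined_score INTEGER,
    PRIMARY KEY (Ensembl_protein_id_a, Ensembl_protein_id_b)
);
""")

# Placeholder rows only, to make the join visible.
conn.executemany(
    "INSERT INTO protein_identifiers VALUES (?, ?)",
    [("PROTA_HUMAN", "ENSP_A"), ("PROTB_HUMAN", "ENSP_B"), ("PROTC_HUMAN", "ENSP_C")],
)
conn.executemany(
    "INSERT INTO string_interaction VALUES (?, ?, ?)",
    [("ENSP_A", "ENSP_B", 900), ("ENSP_A", "ENSP_C", 300)],
)


def interactions_by_uniprot(conn, min_score):
    """Return interactions as Uniprot id pairs with score >= min_score."""
    query = """
        SELECT ia.Uniprot_protein_id, ib.Uniprot_protein_id, s.Combined_score
        FROM string_interaction AS s
        JOIN protein_identifiers AS ia
          ON ia.Ensembl_protein_id = s.Ensembl_protein_id_a
        JOIN protein_identifiers AS ib
          ON ib.Ensembl_protein_id = s.Ensembl_protein_id_b
        WHERE s.Combined_score >= ?
        ORDER BY s.Combined_score DESC
    """
    return conn.execute(query, (min_score,)).fetchall()


rows = interactions_by_uniprot(conn, 500)
print(rows)
```

The composite primary key on the two Ensembl columns mirrors the two PK entries in the String table above; the score threshold is only an example of how Combined_score might be used to filter interactions.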
Description of Tables and Attributes
1. Uniprot Protein Table: stores manually annotated and computationally analyzed
records with protein sequence and functional annotation.
   • Uniprot_ID: Identifies a protein sequence. It usually consists of up to 11
     uppercase alphanumeric characters. The general naming convention can be
     symbolized as X_Y, where X is a mnemonic code of at most 5 alphanumeric
     characters representing the protein name, the ‘_’ sign serves as a separator,
     and Y is a mnemonic species identification code of at most 5 alphanumeric
     characters representing the biological source of the protein. The species code
     is generally made of the first three letters of the genus and the first two
     letters of the species.
   • Amino_Acids: Total number of amino acids in the sequence.
   • Uniprot_Acc: A stable way of identifying entries from release to release;
     includes primary and secondary accession numbers.
   • Data_Info: The dates of creation and last modification of the database entry.
   • Protein_Desc: General descriptive information about the stored sequence.
   • Gene: The name of the gene that codes for the stored protein sequence.
   • Organism: The organism that was the source of the stored sequence.
   • Taxonomy_ID: The identifier of a specific organism in a taxonomic database.
   • Primary_ref_id: Includes Medline, PubMed and Digital Object Identifiers.
   • Db_ref: Pointers to information related to the entry and found in data
     collections other than Uniprot.
   • Keywords: Provide information that can be used to generate indexes of the
     sequence entries based on functional, structural, or other categories.
   • Mol_wt: Molecular weight of the protein, rounded to the nearest mass unit
     (Dalton).
   • Check_value: The 64-bit CRC (Cyclic Redundancy Check) value of the sequence
     (‘CRC64’).
   • Protein_seq: The sequence, in standard IUPAC one-letter amino acid codes.
2. Pfam Domain Table: stores domain information of proteins.
   • Uniprot_ID: Identifies a protein sequence.
   • Uniprot_Acc: A stable way of identifying entries from release to release;
     includes primary and secondary accession numbers.
   • Domain_ID: Unique identifier of the protein domain in the Pfam database.
   • Domain_name: Name of the protein domain.
   • Domain_desc: Description of the protein domain.
   • Domain_occurrence: The number of occurrences of each domain in a particular
     protein.
   • Domain_pos: The start and end position of each occurrence of a domain in a
     protein.
3. String Protein Interaction Table: stores known and predicted protein-protein
interactions with scores. The score is a combined measure from the different
prediction algorithms.
   • Ensembl_protein_id_a: The identifier of the protein.
   • Ensembl_protein_id_b: The identifier of the interacting protein.
   • Nscore: Conserved neighborhood score of the interacting proteins, i.e. genes
     that occur repeatedly in close neighborhood in genomes (maximum allowed
     intergenic distance is 300 base pairs).
   • Co-occurrence score: Shows the presence or absence of linked orthologous
     groups across species.
   • Gene fusion score: Shows the individual gene fusion events per species.
   • Dbscore: Shows that the interaction is documented in databases.
   • Experimental score: Shows that the interaction was obtained from an
     experiment.
   • Text mining score: Shows that the interaction is mentioned in publications.
   • Total score: Sum of all the above-mentioned scores of the interacting proteins.
4. Ophid Interaction Table: stores known and predicted human protein-protein
interactions. It has been built from yeast, mouse, Drosophila and C. elegans
high-throughput (HTP) data.
   • Dataset: The source of the dataset.
   • Uniprot protein_id_a: The unique identifier of a protein.
   • Uniprot protein_id_b: The unique identifier of the interacting protein.
5. Kegg Pathway Table: stores the pathway information of genes.
   • Gene_name: A unique identifier for a human gene.
   • Gene_synonyms: The other names of the human gene.
   • Gene_desc: Description of the gene.
   • Pathway: Identifier of the pathway, i.e. the Kegg pathway id followed by a
     description of the pathway.
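The X_Y naming convention described for Uniprot_ID can be checked mechanically. The Python sketch below is only an illustration under the limits stated above (each part at most 5 uppercase alphanumeric characters); the names split_uniprot_id and species_code are hypothetical and do not come from the HAPPI code.

```python
import re

# X_Y: protein mnemonic, underscore, species mnemonic, each 1 to 5
# uppercase alphanumeric characters as stated in the description.
UNIPROT_ID = re.compile(r"^(?P<protein>[A-Z0-9]{1,5})_(?P<species>[A-Z0-9]{1,5})$")


def split_uniprot_id(uniprot_id):
    """Split an identifier such as 'INSR_HUMAN' into (protein, species)."""
    match = UNIPROT_ID.match(uniprot_id)
    if match is None:
        raise ValueError(f"not an X_Y style identifier: {uniprot_id!r}")
    return match.group("protein"), match.group("species")


def species_code(genus, species):
    """General rule only: first three letters of genus + first two of species."""
    return (genus[:3] + species[:2]).upper()


print(split_uniprot_id("INSR_HUMAN"))
print(species_code("Arabidopsis", "thaliana"))
```

The species rule is described as holding "generally"; HUMAN is an example of a code that does not follow it, so species_code should not be used to validate stored identifiers. Identifiers whose first part is longer than five characters, such as one built from a six-character accession like Q96BD6, are rejected by this pattern and would need a relaxed rule.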
81
References
1. Chen J.Y., Sivachenko A.Y., Li L. Initial large-scale exploration of proteinprotein
interactions in human brain. Proceedings of IEEE Computational
Systems Biology (CSB), Stanford, CA, 2003, 18-23.
2. Briggs S. The Emerging Field of Systems Biology and its potential role in
understanding disease. Division of Biological Sciences, University of California,
San Diego, Biosphere Winter, 2004/5, 7.
3. Golemis E. Toward an Understanding of Protein Interactions. In Protein- Protein
Interactions – A Molecular Cloning Manual. Cold Spring Harbor Laboratory
Press, 2002, 1-5.
4. Nakai K. Protein sorting signals and prediction of subcellular localization. Adv.
Protein Chem. 2000, 54:277-344.
5. Hanash S. Disease proteomics. Nature 2003, 422:226-232.
6. Karin M., Ben-Neriah Y. Phosphorylation meets ubiquitination: The control of
NF- B activity. Annual Rev. Immunology 2000, 18: 621-663.
7. Pawson T., Nash P. Assembly of cell regulatory systems through protein
interaction domains. Science 2003, 300: 445-452.
8. Albert R., Jeong H., Barabasi AL. Error and attack tolerance of complex
networks. Nature 2000, 406: 378-382.
9. Apic G., Gough J., Teichmann S.A. Domain combinations in archaeal, eubacterial
and eukaryotic proteomes. J. Mol. Biol. 2001, 310:311–325.
10. Peri S. et al. Development of Human Protein Reference Database as an Initial
Platform for Approaching Systems Biology in Human. Genome Research 13,
2003, 2363-2371.
11. Kitano H. Computational systems biology. Nature, 2002, 420: 206-210.
12. Birney E., Clamp M., Hubbard T. Databases and tools for browsing genomes.
Annual Rev. Genomics Hum. Genetics, 2002, 3: 293-310.
13. Lacroix Z., Critchlow T. Bioinformatics: Managing Scientific Data. Morgan
Kaufmann series in multimedia information and systems, 2003, 1-32.
14. Golemis E. Protein–Protein Interactions: A Molecular Cloning Manual. Cold
Spring Harbor Laboratory Press 2002.
15. Xenarios I., Eisenberg D. Protein Interaction Databases. Current Opinion in
Biotechnology 2001, 12:334-339.
16. http://campus.queens.edu/faculty/jannr/bio103/helpPages/c11DNA.htm
17. Salwinski L., Eisenberg D. Computational methods of analysis of protein-protein
interactions. Current Opinion Structural Biology 2003, 13:377-382.
82
18. Wojcik J., Schachter V. Protein-protein interaction map inference using
interacting domain profile pairs. Bioinformatics 17, suppl 1, 2001, S296-S305.
19. Wojcik J., Boneca I.G., Legrain P. Prediction, assessment and validation of
protein interaction maps in bacteria. J. Mol. Biol. 323, 2002, 763–770.
20. Pazos F., Valencia A. In silico two-hybrid system for the selection of physically
interacting protein pairs. Proteins 47, 2002, 219–227.
21. Sprinzak E., Margalit H., Correlated sequence-signatures as markers of protein–
protein interaction. J. Mol. Biol. 311, 2001, 681–692.
22. Deng M., Mehta S., Sun F., Chen T. Inferring domain–domain interactions from
protein–protein interactions. Genome Res. 12, 2002, 1540–1548.
23. Lu L., Lu H., Skolnick J. Multiprospector: an algorithm for the prediction of
protein-protein interactions by multimeric threading. Proteins 49, 2002, 350–364.
24. Matthews L.R., Vaglio P., Reboul J., Ge H., Davis P., Garrels J., Vincent S.,
Vidal M. Identification of potential interaction networks using sequence-based
searches for conserved protein-protein interactions or "interologs". Genome Res.
11, 2001, 2120–2126.
25. Zhou X., Kao M.C., Wong W.H. Transitive functional annotation by shortest-path
analysis of gene expression data. Proc. Natl. Acad. Sci. USA 99, 2002, 12783–
12788.
26. Pellegrini M., Marcotte E.M., Thompson J., Eisenberg D., Yeates T.O. Assigning
protein functions by comparative genome analysis: protein phylogenetic profiles.
Proc. Natl. Acad. Sci. USA 96, 1999, 4285–4288.
27. Pazos F., Valencia A. Similarity of phylogenetic trees as indicator of proteinprotein interaction. Protein Eng. 14, 2001, 609–614.
28. Liberles D.A., Thoren A., Heijne G.V., Elofsson A. The use of phylogenetic
profiles for gene predictions. Current Genomics 3, 2002, 131–137.
29. Vert J.P., A tree kernel to analyse phylogenetic profiles. Bioinformatics 18 suppl
1, 2002, S276–S284.
30. Marcotte E.M., Pellegrini M., Ng H.L., Rice D.W., Yeates T.O., Eisenberg D.
Detecting protein function and protein-protein interactions from genome
sequences. Science 1999, 285: 751-753.
31. Enright A. Illioupolos I., Kyrpides N.C., Ouzounis C.A. Protein interaction maps
for complete genomes based on gene fusion events. Nature 1999, 402: 86-90.
32. Clark B. F. Towards a total human protein map. Nature 1981, 292:491-492.
33. Anderson N. G., Anderson N. L. Behring Inst. Mitt. 1979, 63:169 –210.
34. Von M.C., Jensen L.J., Snel B., Hooper S.D., Krupp M., Foglierini M., Jouffre N.,
Huynen M.A., Bork P. STRING: known and predicted protein–protein
associations integrated and transferred across organisms. Nucleic Acids Res.,
2005, 33 Database issue: D433-7.
83
35. Brown K.R., Jurisica I. Online Predicted Human Interaction Database.
Bioinformatics 2005 21(9):2076-2082.
36. Bairoch A., Apweiler R., Wu C. H., Barker W. C., Boeckmann B., Ferro S.,
Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A.,
O'Donovan C., Redaschi N., Yeh L.S. The Universal Protein Resource (UniProt)
Nucleic Acids Res. 33. 2005, D154-159.
37. Bateman A. Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna
A., Marshall M., Moxon S., Sonnhammer E.L., Studholme D.J., Yeats C., Eddy
S.R. The Pfam Protein Families Database. Nucleic Acids Research 2004,
Database Issue 32:D138-D141.
38. Kanehisa M. et al. From genomics to chemical genomics: new developments in
KEGG. Nucleic Acids Res. 2006, 34:D354-357.
39. NCBI http://www.ncbi.nih.gov/
40. Jain P., Kircher M., Parameswaran K.
http://posa3.org/workshops/ThreeTierPatterns/
41. Bellinzona M., Quercia D., Capece J.C., Campbell K.L. MAERC/IFAS Agro
ecosystem research program information system development and
implementation. Technical Report, University of Florida, 2002.
42. Dublin Core Metadata Element, version
http://dublincore.org./documents/dces, 2004.
1.1.:
Reference
Description,
43. Ambler S.W. Agile model driven development with UML2, The Object Primer
Third Edition, 2004, Chapter 6.
44. Blazer-Yost B.L., Vahle J.C., Byars J.M., Bacallao R. Real-time three
dimensional imaging of lipid signal transduction: Apical membrane insertion of
epithelial Na+ channels. Am.J. Physiol. Cell Physiol., 2004, 287:C1569-C1576.
45. Blazer-Yost B.L., Nofziger C. The role of the phosphoinositide pathway in
hormonal regulation of the epithelial sodium channel. Adv. Expert. Med. & Biol.,
2004, 559:359-368.
46. Nikitin A., Egorov S., Daraselia N., Mazo I. Pathway studio – the analysis and
navigation of molecular networks. Bioinformatics Vol.19 no.16 2003, 2155-2157.
84
Curriculum Vitae
SudhaRani Mamidipalli
[email protected]
http://informatics.iupui.edu/people/profile.php?id=230
3856 Cornwallis Lane, Carmel
IN 46032 Phone (317) 873 4746
Objective
Seeking a full-time position in Bioinformatics, Computational Biology, Information Management, Database
Management, Protein Interactomics or related areas.
Education
• M.S. Bioinformatics, Indiana University, Indianapolis, GPA 3.9 / 4.0, May 2006
  o Thesis titled “HAPPI: A Bioinformatics Database Platform Enabling Network Biology
    Studies”, advised by Dr. Jake Yue Chen
• Post Graduate Diploma in Computer Applications, Hyderabad, India
• B.S. Agricultural Science, Acharya N G Ranga Agricultural University, India
• Sun Certified Java2 Programmer, Sun Microsystems, USA
• Certificate in Web Markup & style coding and Access End-User, Indiana University
Skills
Scripting Languages      : Perl, PHP, and JavaScript
Markup Languages         : XML, HTML, and XHTML
Languages                : Java, SQL, PSQL, C, C++, COBOL, JCL, and CICS
Platforms                : UNIX (Solaris), Linux (Red Hat, Suse), Windows, MVS, DOS
Databases                : Oracle, MySQL, DB2, PostgreSQL, MS Access, and SQL Server
Software Tools           : SSH, Exceed, Toad, Erwin, Endnote, Komodo, AquaData Studio
Scientific Tools         : Blast, FASTA, Spotfire, Affymetrix Suite, CyberLab, Phase,
                           Lab Track, Vector NTI, and TotalChrom
Proteomics Tools         : mzXML, ReAdW, mzXML2Other, PeptideProphet
Bioinformatics Databases : String, Ophid, UniprotKB/Swiss-prot, UniprotKB/Trembl, Pfam,
                           Kegg, Ensembl, IPI, Entrez, Refseq
Projects, Research Papers and Presentations
• Gene Expression Data Management and Analysis
• NCBI – Website Analysis
• Medical Databases-Electronic Medical Records
• Intelligent Electronic Laboratory Notebooks
• Querying Multiple Bioinformatics Information Sources- Can Semantic Web Research Help?
• Genetic Algorithms for Protein Folding Simulations
• Surrostat Biomarker Analysis System
• Information Representation, Retrieval and Visual Presentation in Bioinformatics
Professional Experience
* School of Informatics & School of Science, IUPUI
August, 2004 – May, 2006
Research Assistant, Discovery Informatics and Computing Group
 Research on Protein Interactomics: Mining functional links between proteins.
 Developed an application for oligo sequence analysis OligoMatcher and a sequence
annotated feature mapping tool SafMap.
 Installed and Configured the Oracle client and Toad for database development
 Downloaded (ftp), parsed (Perl regular expressions) and loaded data into tables (sqlloader)
 Defined metadata (Dublin Core standard) for data management and duplicity reduction.
 Developed conceptual (Entity-Relationship diagrams), logical and physical data models by
identifying entities, attributes, relationships, and assigning keys
 Normalized data for data integrity, optimized queries, faster index creation and sorting.
 Used data management tools (Spotfire) to visually and statistically interrogate the data.
 Administered Lacie Storage Server (Size: 2TB, users: 17)
 Trained new users of the core informatics resources.
* Dow AgroSciences, Indianapolis
May, 2005 - August, 2005
R&D Intern, Discovery Research Information Management
• Installed, configured, and administered the EMBOSS suite of molecular biology programs,
wrappers for EMBOSS, and wEMBOSS on the DAS bioinformatics intranet, enabling
project, data, and results management.
• Integrated public biological databases into the DAS bioinformatics environment.
• Established a pipeline for tandem mass spectrometry (MS/MS) data analysis, validation, and
quantification for proteomics.
• Solved thorny problems by communicating with DAS users and external experts.
• Evaluated the DAS bioinformatics website and proposed usability enhancements.
• Promoted work to molecular biologists through seminars, conferences, and meetings.
• Influenced Trait Genetics and Technologies scientists in using bioinformatics tools.
* School of Medicine, Indiana University
January, 2005 - May, 2005
Independent Study, Pharmacogenetic Bioinformatics
• Applied bioinformatics to SNP discovery in the INDO gene.
* University Information Technology Services
May, 2004 - August, 2004
Web Analyst, Indiana University
• Collected user and technical requirements from professors and administrators.
• Designed and developed the Music Theory Placement Exam for the School of Music, IUPUI.
* CGI (formerly IMRGlobal), Bangalore, India
October, 1997 - May, 2000
Software Engineer
• Contributed to the analysis, design, development, and maintenance of mainframe applications.
• Conducted weekly project meetings to resolve issues and problems.
• Developed a test strategy covering baseline, unit, parallel, and integration testing of jobs.
• Interacted with the onsite team daily via conference calls.
• Planned and conducted peer reviews for all deliverables.
• Maintained a detailed record of decisions taken.
• Clients included Michelin Tires, Fleming Foods, and John Hancock Insurance.
* Compcore Info tech (India) Ltd., Hyderabad, India
January, 1996 - September, 1997
Y2K Project Trainee
• Downloaded, checked, and unzipped the inventory by sub-system and language.
• Analyzed the programs using the Transform-2000 tool.
• Applied macros to convert the tool outputs to the required format.
• Prepared weekly timesheets to track the hours spent on the project.
• Adhered to quality control standards to ensure the correctness and quality of work.
Publication
• Mamidipalli SudhaRani, Mathew Palakal, Shuyu Dan Li. OligoMatcher: analysis and selection
of specific oligonucleotide sequences for gene silencing by antisense or siRNA. Applied
Bioinformatics Journal. In press.
Book Chapter
• Mamidipalli SudhaRani, Jake Yue Chen. “Network Biology Informatics: Enabling human
protein interactomics studies” in ‘Current Topics in Human Genetics: Studies of Complex
Diseases’. In press.
Abstract
• Arafayene, M., Mamidipalli, S., Philips, S., Cao, D., Flockhart, D., Wilkes, D., Skaar, T.
Identification of Functional Genetic Variants of the Indoleamine 2,3-Dioxygenase Gene.
American Association for Cancer Research Annual Meeting 2006.
Academic Honors
• Featured as a top graduate from the School of Informatics at the IUPUI Commencement
ceremony, 2006.
Professional Activities
• Grand Awards Judge, Intel International Science and Engineering Fair (ISEF 2006) - Medicine
and Health Sciences Category
• Paper Reviewer, ACM Symposium on Applied Computing (SAC 2006) - Bioinformatics Track
Affiliations
• Women and Hi Tech
• Informatics Women’s Organization
• IUPUI Computer Science Club