A platform for the integration of genomic and
proteomic data
Lena Hansson
Master of Science
School of Informatics
University of Edinburgh
2003
Abstract
The goals for this project were to build a standardised, public, web-based, scalable,
modularised, up-to-date integration platform for genomic and proteomic data. This
platform would allow researchers to investigate a single gene, a single protein, a batch
of genes, a batch of proteins or any combination of the above.
The platform can store data for any kind of proteomic experiment, but on the genomic side it is only possible to add microarray experiments at the moment. The platform is focused on being flexible enough to accommodate new kinds of data. My guess is that the first thing to be added to the structure will be clinical data, so that it would be possible to access that kind of information as well. To make it this flexible, I have divided the platform into three separate parts: a genomic, a proteomic and an integration part. I believe that any other solution would have resulted in a less robust system.
I have used a combined approach of link-driven integration, view integration and warehousing to implement the platform, picking the best bits of each. The platform has
been implemented in Oracle 8i and the programs in Perl.
Such a platform could be used, as I have shown, to analyse the data that will become
available as biology enters a new phase, one of systems biology.
Acknowledgements
This project would not have been possible without the help of the following people.
First of all I would like to thank Peter Ghazal and everybody at SCGTI for making this project possible. They have answered my questions, installed software and hardware and given me useful tips along the way. I would also like to thank Alan Pemberton at Veterinary Clinical Studies, for granting me access to his data and answering my questions. Perdita Barran and Jim Creanor at SIRCAMS helped me realise what kind of proteomic data needs to be stored and in what format. Both Alan and Perdita also helped me "test" and "evaluate" the system in the end; thank you for your time.
Douglas Armstrong has made sure that both project and dissertation fulfil the demands of the University.
To all of you, THANK YOU.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Lena Hansson)
Table of Contents

1 Introduction
  1.1 The goal with this project
  1.2 Why is this project interesting to do, now?
  1.3 Long-term goals for SCGTI ("biological relevance")
  1.4 Steps on the way
    1.4.1 Design the proteomic database
    1.4.2 Map genomic data "onto" proteomic data
    1.4.3 Build the integration platform
    1.4.4 Analyse the data
    1.4.5 What will be left to do after this project
    1.4.6 The problem at hand is both simple and complex
    1.4.7 What have I changed in these goals along the way?
  1.5 What have I done?
    1.5.1 The Genomic part
    1.5.2 The Proteomic part
    1.5.3 The Integration part
  1.6 Comparing with the testcase
  1.7 Outline of the dissertation

2 Background information
  2.1 Integration of biological data
    2.1.1 What problems will arise
    2.1.2 What has been done
    2.1.3 What has been done commercially
    2.1.4 Conclusions
  2.2 The three approaches
    2.2.1 Link integration
    2.2.2 View integration
    2.2.3 Warehouse
  2.3 Models and schemas
    2.3.1 MIAME
    2.3.2 MaxD
    2.3.3 PEDRo
  2.4 UniGene
    2.4.1 What is it
    2.4.2 How up-to-date is it?
    2.4.3 What limitations/problems exist
    2.4.4 Which species are covered, and why is that enough
    2.4.5 UniGene is more appropriate than TGI
  2.5 Systems biology and genetic networks

3 Results
  3.1 Technology
  3.2 My solutions for the problems in section 2.1.1
  3.3 From theory to practise
  3.4 The Genomic Part
    3.4.1 Getting the UniGene clustering information (all_species.pl)
  3.5 The Proteomic Part
    3.5.1 The PEDRo schema and my changes to it
    3.5.2 Performing a peptide mass fingerprint
  3.6 The Integration Part
    3.6.1 The Project schema
    3.6.2 Query.pl
    3.6.3 Statistic.pl
    3.6.4 Plot.pl
    3.6.5 ShowTable4.pl
    3.6.6 Identifier.pl
  3.7 Testing and evaluation
    3.7.1 Alan Pemberton
    3.7.2 Perdita Barran
    3.7.3 Results from a UniGene cluster multiple alignment
  3.8 TestCase - A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice
    3.8.1 Background
    3.8.2 Aim
    3.8.3 Method
    3.8.4 Results
    3.8.5 Conclusions
    3.8.6 My comments

4 Discussion
  4.1 Did I achieve my goals?
    4.1.1 Standardisation
    4.1.2 Public
    4.1.3 Web-based
    4.1.4 Scalable
    4.1.5 Modularised
    4.1.6 Up-to-date
  4.2 What would I have done differently
  4.3 What would I have done if I had had more time?
  4.4 Future work

5 Summary

A Appendix A - Biological definitions
  A.1 The Central Dogma of Molecular Biology
  A.2 Amino acids
  A.3 Bases
  A.4 Comparative genomics
  A.5 DNA
  A.6 Gene
  A.7 Genetic diseases
  A.8 Genetic network/pathway
  A.9 Genome
  A.10 Genomics
  A.11 Mass spectrometer
  A.12 Microarrays
  A.13 Molecule
  A.14 Nucleotides
  A.15 Peptide
  A.16 Protein
  A.17 Protein arrays
  A.18 Proteome
  A.19 Proteomics
  A.20 RNA
  A.21 Systems biology

B Abbreviations

Bibliography

List of Figures

1.1 This figure shows the dimensionality problem in biology [19]
2.1 The three parts
2.2 The MaxD schema
2.3 The original schema [52]
3.1 The programs that I have written
3.2 These are the modified tables from PEDRo
3.3 The Search-page
3.4 The combined result from Mascot and MS-Fit
3.5 The project part
3.6 The plot showing proteins vs genes
3.7 The result for UniGene cluster Mm.14046
4.1 The welcome page for the system
Chapter 1
Introduction
The completion of the human genome marked the end of an era and the beginning of a new one. Castillo-Davis et al. [7] claim that researchers are now shifting their attention towards post-genomic analysis techniques. This means that new, more sophisticated software, such as data mining and hypothesis testing, is going to be developed that utilises chemical, biological and evolutionary data. [7] This shift in what to look for, from a single analyte to the whole system, is important. Systems biology tells you that focusing on only one part, for instance on DNA, RNA or proteins alone, will not lead to an understanding of the entire system. And although an understanding of the system is important, it is not until you start looking at the dynamics within the system that real knowledge becomes available. [27] This project is an attempt to design the database needed to store the information that these analyses are going to use.
In an article in Nature [18] in July this year, the author claimed that the next big thing is going to be proteomics. Gershon claims that it remains to be proven whether proteomic information can be turned into real products, even though most drugs on the market target proteins, not genes. The problem is that the threshold for starting to investigate the proteome is higher than for the genome. I say she is wrong: the next big thing is going to be systems biology, where you look at the genome and the proteome simultaneously, "skipping" the middle step that Gershon thinks biology will go through.
Because systems biology has just recently emerged as a subject, there are not that
many products around that support this approach. There are many tools available that
have one purpose but no tool tries to include all information, all technologies and all
methods in one system, to reveal as much information as possible to the biologist.
1.1 The goal with this project
The goal is to build a standardised, public, web-based, scalable, modularised, up-to-date integration platform for genomic and proteomic data.
Standardised data will allow SCGTI (Scottish Centre for Genomic Technology and Informatics) to exchange information with external sources. They are going to export their own data to public repositories (after publication, that is) and import external data into their internal database. This means that if the data is not standardised, then methods for transforming the data from one format to the other will have to be written. Therefore I will try to use standardised data wherever possible. Having the data in a standard format also makes it easier to submit articles, since more and more journals are demanding that the data is accessible in a standard format, in a public repository.
The software needs to be public, since SCGTI probably wants to make it available
to their collaborators.
If the data is available on the web it is automatically going to be public, open to those with a username and password.
It should be scalable since biology is an area where new techniques, new data and new methods are invented all the time. This means that it has to be easy to add new data to the system. If not, then the system will become impossible to maintain and will disappear, just as K2 did. [10]
Modular systems are automatically relatively easy to update and maintain. The
principle of modularising a system is that changes to any one part should have no
effect on the other parts, or at least only marginal effects. Therefore I have tried to
keep the genomic part, the proteomic part and the integration part separate at all times.
Keeping software up-to-date in bioinformatics is hard, and takes a lot of time. Therefore it should be "easy" to keep the most important part, i.e. the "integration bridge", updated.
1.2 Why is this project interesting to do, now?
The reason this project is needed now is that biology is moving towards systems biology, which means that you need information about DNA, RNA and proteins. Today this information is mostly available, but spread out over multiple sources. To integrate these sources, it has to be possible to "translate" mRNA into proteins, and the other way around. One possible use of this platform is in the modelling of genetic networks (see sections 1.3 and 2.5).
The distributed nature of biological data makes it a time-consuming task to gather the sought information for one gene, but to do it for many is a manually intractable task. This means that there is a need for "query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphics summaries of functional information." [12]
There are a lot of tools available for collecting, storing, querying and visualising genomic data. When high-throughput technologies become available for proteomic data, similar tools will probably start to develop. Software for analysing, querying and visualising the integrated data will then be the next thing, but before such software can be used, the data has to be stored in a centralised location, and that is what I have tried to do. I have designed a platform that makes it possible to access both genomic and proteomic data through the same interface, and as a first test case I have designed a visualisation tool for looking at the combined gene and protein expression levels.
1.3 Long-term goals for SCGTI ("biological relevance")
At the end of 1999, an article [25] was published describing how "doctors" were one of the leading causes of death in the US. The article stated that between 50 000 and 100 000 people die from medical errors each year. About 7000 of these deaths are caused by people having adverse reactions to drugs. If it was possible to predict how a patient would react to a specific drug, given the patient's current biological state, then maybe a large number of these tragic deaths could be prevented. To be able to do this, you would have to have the technology to take a "snapshot" of the patient's state. Given this information, and a computer model of the corresponding network, it would then
be possible to predict how the patient would react when the drug is administered (see section A.8). In conclusion, the long-term goal for SCGTI is to be able to predict the response in a patient with a specific disease after the administration of a drug.
To do this, they are going to have to be able to model genetic networks, and to
model these networks, they are going to need a lot of data, and that data is now available
in one interface, through this project.
Other interesting questions that genetic networks should be able to answer are:
1. Why do certain people get certain diseases?
2. What makes some people refractory to certain diseases?
3. Could information about those people be used to treat others?
But before this will be a reality there are some things that need to be done, and they
are discussed in the next section.
1.4 Steps on the way
1.4.1 Design the proteomic database
SCGTI have up to now not collected proteomic information, which means that I am going to design the entire database and fill it with data; I will also write programs for storing, querying and visualising the data. I have used a schema called PEDRo (see section 2.3.3) as a basis for what data should be stored, and in what format, but I have made some changes to this model. I could probably have done an entire MSc project in the area of designing a database for the storage of proteomic data, or even in evaluating PEDRo; as it is, this was only a part of this dissertation, although a big part.
1.4.2 Map genomic data “onto” proteomic data
I will write a program that retrieves the mapping information from NCBI's website and saves it to the database so that it can later be used to map between genomic and proteomic data.
This information will be kept separate from the proteomic database, and from the genomic database that was already implemented. If it is separate then the system will become more modular and scalable, and it will be easier to keep the mapping information up-to-date.
The mapping has to be able to map genes to genes via a UniGene cluster, but it should also be able to map from a gene to the corresponding protein and back again.
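To illustrate how such a separate mapping store could look, here is a minimal Perl/DBI sketch; the connect string, table name and column names are assumptions made for the example, not the actual SCGTI schema.

#!/usr/bin/perl
# Hypothetical sketch: keep accession -> UniGene cluster mappings in a
# schema of their own, so the genomic and proteomic parts stay untouched.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:scgti', 'mapping', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

# One row per GenBank accession, pointing at its UniGene cluster.
my $ins = $dbh->prepare(
    'INSERT INTO unigene_mapping (accession, cluster_id, species)
     VALUES (?, ?, ?)');

sub store_mapping {
    my ($accession, $cluster_id, $species) = @_;
    $ins->execute($accession, $cluster_id, $species);
}

store_mapping('AA118694', 'Mm.14046', 'Mus musculus');   # example row
$dbh->commit;
$dbh->disconnect;

Because the genomic and proteomic parts would only ever join against this one table, refreshing or replacing the mapping need not touch either of them.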
1.4.3 Build the integration platform
Once the genomic and proteomic data is in place, it is time to start integrating. I will
implement two programs that are going to demonstrate how this integrated information
could be used to gain new knowledge. To do this properly I am going to write some
small modules with well-defined functions.
1.4.4 Analyse the data
This part should be kept separate from the previous one. There are two reasons for this. The first is that if this part is integrated into the integration part, then it will be harder to continue adding more tools once this project is finished. This will have to be done since we do not know which tools are going to be useful, and therefore this part has to be easy to change. The second reason is that I will not have the time to look at this part, since the others are going to take too long.
Queries that could be of interest to ask are: How many elements/genes/proteins were analysed? Based on this information it could be possible to investigate whether the expression levels for the elements differ or correspond. If it is known whether or not a specific gene is expressed: how many different forms of proteins are there from this gene? This will answer the question: what dimensionality is there in this problem? Is it worth investigating or is it too complex? (see figure 1.1).
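As an illustration of the dimensionality question above, the following Perl/DBI sketch counts how many array genes map back to each identified protein via its UniGene cluster. The table and column names are invented for the example and do not reflect the platform's real schema.

#!/usr/bin/perl
# Illustrative query only: hypothetical tables protein_hit, unigene_mapping
# and array_element stand in for the real genomic/proteomic/mapping tables.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:scgti', 'integration', 'secret',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(q{
    SELECT p.protein_id,
           COUNT(DISTINCT g.gene_id) AS genes_in_cluster
    FROM   protein_hit     p
    JOIN   unigene_mapping m ON m.cluster_id = p.cluster_id
    JOIN   array_element   g ON g.accession  = m.accession
    GROUP  BY p.protein_id
    ORDER  BY genes_in_cluster DESC
});
$sth->execute;

while (my ($protein, $n_genes) = $sth->fetchrow_array) {
    print "$protein maps back to $n_genes genes on the array\n";
}
$dbh->disconnect;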
1.4.5 What will be left to do after this project
What remains is to model the underlying networks, since this is rather a PhD project than an MSc project in terms of complexity and time. Once these networks are modelled, they will have
to be confirmed through biological experiments by a biologist.
1.4.6 The problem at hand is both simple and complex
1.4.6.1 Biology

Figure 1.1: This figure shows the dimensionality problem in biology [19]
It is simple since there exists a link between DNA, RNA and proteins, and UniGene is trying to provide it. It is complex since it will be dealing with the curse of dimensionality: when integrating biological data it is never possible to assume that the query will only return one result.
Today the mapping between the different parts is done via the identifiers. The demo programs compile a list of proteins and from it find the corresponding genes. I chose this "direction" since the testcase (see section 3.8) has a lot fewer proteins than genes.
This means that it is quite possible that more than one of the genes on the array codes for the same protein, and in that case all those genes would be used to match against the protein. Until you start specifying a mapping table (this spot from this gel corresponds to this spot on this array), this is a limitation that you have to accept.
1.4.6.2 Informatics
There are some problems that arise when trying to integrate any kind of data; these problems are mostly connected to the structure you choose: link-driven integration, view integration or warehousing.
I decided to go with a combination of all three. I chose this since I wanted to keep the data in one database, I wanted to keep it separate and in its original form, and in some places I wanted to link to external resources. This means that I will have to consider all the disadvantages, but also that I get all the benefits. I believe that the approach has been appropriate for this project.
1.4.7 What have I changed in these goals along the way?
I have put a lot more emphasis on getting the data, especially the proteomic data, in place than I first thought I would. Originally I was just going to demonstrate how it could be done, thereby only storing the data that I would need for the mapping itself, but a couple of weeks into the project I found PEDRo and decided to use this schema for my proteomic data. This meant that I had to evaluate the model and also that I ended up designing many more tables. Some of these needed to be filled with data to make the integration work. This part therefore took a lot longer than I had originally planned, which left me with less time at the end than I had hoped, so I did not have time to implement many different analysis tools.
1.5 What have I done?
I have implemented a platform for the integration of genomic and proteomic data. This
platform allows researchers to investigate a single gene, a single protein, many genes,
many proteins or any combination of the above.
The platform really consists of three separate parts, the genomic part, the proteomic part and the integration part, since this allows for the greatest flexibility. Any other solution would have resulted in a less robust system. This way a change in, for instance, how gel data is captured will only affect the PEDRo section, not the MaxD section nor the Project section. Any change will probably affect the integration part, but since this is encapsulated into one single module (see section 3.6.2) the "damage" done should be limited.
1.5.1 The Genomic part
I have implemented a program that gets the relevant information about the UniGene clusters that SCGTI are using and stores it in the database. The program downloads the newest mouse and human UniGene clustering information from the NCBI website. It reads through the entire files, some 1000 MB of data, and stores information about the essential clusters in the database. I decided to limit the program to only the essential information, since otherwise we are talking about a couple of million records, of which fewer than 400 000 are needed, so there would be a lot of unused data. Another reason is that right now SCGTI do not have unlimited space on the database server, so if I were to try to store all the information I would fill up all available space and still not be able to store everything. This is not a big limitation since the program recollects the needed information every two weeks.
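The following is a minimal sketch of the shape such a parser can take: stream through a UniGene release file and keep only the clusters whose member accessions occur on the arrays. The record layout assumed here (ID lines, SEQUENCE ACC=... fields, "//" record separators) and the file names are illustrative, and the real all_species.pl writes its results to the Oracle database rather than printing them.

#!/usr/bin/perl
# Sketch of filtering a UniGene release file down to the essential clusters.
use strict;
use warnings;

my %wanted;                                   # accessions present on the arrays
open my $acc_fh, '<', 'array_accessions.txt' or die "Cannot open accession list: $!";
chomp(my @on_array = <$acc_fh>);
@wanted{@on_array} = ();
close $acc_fh;

my ($cluster, @accs);
open my $fh, '<', 'Mm.data' or die "Cannot open Mm.data: $!";
while (my $line = <$fh>) {
    if    ($line =~ /^ID\s+(\S+)/)               { $cluster = $1 }   # e.g. Mm.14046
    elsif ($line =~ /^SEQUENCE\s+ACC=([^;\s]+)/) { push @accs, $1 }  # member accession
    elsif ($line =~ m{^//}) {                                        # end of one record
        if (grep { exists $wanted{$_} } @accs) {
            # A real run would INSERT the cluster and its accessions via DBI;
            # here we just report it.
            print "$cluster\t", join(',', @accs), "\n";
        }
        ($cluster, @accs) = ();
    }
}
close $fh;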
1.5.2 The Proteomic part
I have designed and implemented a data structure for storing all information about a proteomics experiment; this is based on the PEDRo schema. I have made some changes to this model in order to make it more flexible and more appropriate for integrated access to the data. Using PEDRo as the basis for the database structure helped me "fulfil" some of my goals, such as standardisation and modularisation, and also helped with the flexibility part.
I wrote programs for adding, querying and visualising the data. I also wrote a
program that sends information regarding a peptide mass fingerprint search to two different search engines, Mascot and MS-Fit. This approach will save time for the user and hopefully lead to more reliable protein identifications, since results from both programs can be displayed side by side.
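A rough sketch of how a single submission could drive both engines is shown below. The URLs and the form field names (peaks, database, taxonomy) are placeholders only; the real Mascot and MS-Fit CGI interfaces define their own parameter names, which would have to be filled in.

#!/usr/bin/perl
# Sketch: post one peak list to two fingerprint search engines in one step.
use strict;
use warnings;
use LWP::UserAgent;

my $ua    = LWP::UserAgent->new(timeout => 120);
my @peaks = (832.37, 1045.56, 1179.60, 1475.79);       # masses from the spectrum

my %engines = (
    mascot => 'http://mascot.example.org/cgi/search.pl',    # placeholder URL
    msfit  => 'http://msfit.example.org/cgi/msfit.cgi',     # placeholder URL
);

for my $name (sort keys %engines) {
    my $res = $ua->post($engines{$name}, {
        peaks    => join("\n", @peaks),   # placeholder field names
        database => 'SwissProt',
        taxonomy => 'Mus musculus',
    });
    print "$name: ", $res->status_line, "\n";
    # The returned HTML would then be parsed and the two hit lists shown
    # side by side, as the platform's search page does.
}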
1.5.3 The Integration part
I have implemented two programs that demonstrate how the database could be used to gain new knowledge. I did not have as much time left in the end as I would have liked, so I looked at the draft that Pemberton et al. [41] have written. I used one of their tables (see section 3.1) as an example of what you could do with this integrated data, which led to showTable4.pl (see section 3.6.5). Once I had done that I thought one step further and decided to implement the same functionality as a graph instead, thereby making it easier to survey more proteins than the 52 they investigated, so I implemented plot.pl (see section 3.6.4).
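The idea behind plot.pl can be sketched with GD::Graph: plot each protein's expression ratio against the expression ratio of its UniGene-matched gene. The data points below are made up, and the real script reads both values from the database rather than hard-coding them.

#!/usr/bin/perl
# Sketch of a proteins-vs-genes scatter plot in the spirit of plot.pl.
use strict;
use warnings;
use GD::Graph::points;

my @gene_ratios    = (1.2, 0.8, 2.5, 3.1, 0.4);   # example microarray fold changes
my @protein_ratios = (1.0, 1.1, 2.2, 2.9, 0.6);   # example matching gel spot ratios

my $graph = GD::Graph::points->new(600, 400);
$graph->set(
    x_label => 'Gene expression ratio',
    y_label => 'Protein expression ratio',
    title   => 'Proteins vs genes',
) or die $graph->error;

my $gd = $graph->plot([ \@gene_ratios, \@protein_ratios ]) or die $graph->error;

open my $out, '>', 'proteins_vs_genes.png' or die $!;
binmode $out;
print {$out} $gd->png;
close $out;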
1.6 Comparing with the testcase
The integration part shows how identified proteins could be used to connect genomic and proteomic information. There are some differences between the results that I got and those that are in the paper by Pemberton et al. [41]. There are some possible explanations for these discrepancies and I will discuss them in section 3.7.1.1.
1.7 Outline of the dissertation
This first chapter talks about the goals of the project and discusses why the project is of general interest. It shows what needs to be done before such a platform can be implemented and it also talks about the possible uses of the platform.
The next chapter talks about the background and the theories that my project is founded on, such as MIAME, PEDRo and UniGene.
The chapter after that is about the results I got: what I did, how I did it, and what the programs can be used for, as well as how they are supposed to be used. I also discuss the
testcase I used to evaluate my system, and how I used this information to decide what my demo programs should do and how.
Finally I discuss the project. Here I ask myself whether I achieved my goals and what I would have done differently given more time, and describe some possible future work with the platform. Then I sum up the dissertation.
At the end, I have added an appendix, Biological definitions. This chapter can serve as a crash course in biology or simply as a reminder, e.g. of the difference between the genome and the proteome. Words or phrases that are in this "dictionary" are in italics, so as to distinguish them from the running text.
Last I have added a short glossary of the abbreviations that I have used in this dissertation.
Chapter 2
Background information
There are many different programs that allow you to access different genomic and proteomic resources; the problem is that it is hard to integrate them since no common platform is available.
To investigate which genes are involved in a multigenic disease, such as cancer, researchers would start with the on-line resources to determine which genes (if any) have previously been connected to the disease. The problem is that although the information is available, it is spread over different data sources and resides in a variety of models and formats.
Very often the data model for biological data will include sequential data (i.e. lists) and nested structures (trees). This type of information is not well suited to a relational database. The complexity could be dealt with in an object-oriented database, but these have been unsuccessful because the structure of the data changes too often.
It was hard to find information about similar systems. My belief is that if they are out there, they are built by companies that are not about to give up information about how they have done it, just like GeneticXChange.
2.1 Integration of biological data
"Recent years have seen an explosion in the amount of available biological data. More and more genomes are being sequenced and annotated, and protein and gene interaction data are accumulating. Biological databases have been invaluable for managing
these data and for making them accessible. Depending on the data that they contain, the databases fulfill different functions. But, although they are architecturally similar, so far their integration has proven problematic." This quote comes from an article published in May this year by Lincoln Stein [46] and it discusses some problems that arise when you are trying to integrate biological databases. I chose to start this section like this since it describes some of the problems of integrating biological data. The first problem is the amount of data: how much data, and what kind of data, should be stored. The second is the fact that new techniques and methods are constantly being invented, meaning that you do not know which format the data is going to be in. The third is that the data is spread out over different sources, making it hard to overview and access.
2.1.1 What problems will arise
There are, as Buneman et al. [5] discuss, some problems that only arise when you try to integrate big data sources (such as biological ones), and some that are always there no matter what.
1. The scope of a single database would be hard to define. What should be included and what should be excluded? The designers would have to consider experimental design, sequence analysis, results and annotations. Should it be limited to genomic information or should proteomic information also be included? These questions have to be addressed before any integration attempt is made. [5] [46]
2. The biological information that you want to integrate is out there; it is just a matter of finding, and accessing, it. The external data sources all have different interfaces and retrieval facilities. This makes it hard for both the biologist and the bioinformatician. The biologist has to master a lot of different "systems" and the bioinformatician has to understand all these "systems" so that the data the biologist wants can be stored in a format they understand and can use. [46]
There are a lot of different sources, such as relational databases, object-oriented databases, flat files, and programs such as BLAST and FASTA. How do you integrate
these into one GUI without losing the internal representation of the different databases, making sure that the software knows the internal representation and allowing both external and internal databases to change structure with only minimal effects on the software? This is the hardest question. Trying to fit all data into the same structure would probably create more compromises than solutions.
3. Even though the data may look similar at first glance, it will be using different identifiers and different names for the same thing (or the same name for different things). This makes any kind of automatic mapping between sources intractable, and manual ones hard. [46]
4. Not only do you have to deal with different formats, different access methods and other technical problems; an even more important question is: what kind of information will we need tomorrow?
New techniques and new algorithms are making new kinds of data available, as well as updating old ones. This new data also needs to be incorporated. [46] This means that any automated task of getting data, parsing it and storing it in a local data structure would have to be constantly re-written, and this is not feasible. [57] It is very difficult for one single database to handle all these different techniques. In other words, the evolutionary aspect of biological data would make one big database impossible. [5]
5. The queries that these data sources have to be able to answer are steadily increasing in complexity and size. This means that more advanced querying capabilities have to be supported. Another problem is that users get restless if they have to wait for their results; they want them directly. This puts a lot of pressure on both the hardware and the software. [5]
6. The data needs to be standardised. This is especially important if you want to build a high-throughput pipeline for biological discovery. It might be easier to include external modules in the pipeline if the internal data is in a standardised format; otherwise conversion tools have to be built for every external connection, and this is a lot of work. [4] Another point is that users would then know what
to expect from the data, both in terms of what data needs to be collected to follow the standards, and what data could be expected when downloading external information. [51]
2.1.2 What has been done
2.1.2.1 From a genomic perspective
Arrogant
Arrogant is a freely available web-based program that aims to facilitate the identification, annotation and comparison of large collections of genes. The authors describe Arrogant as a "general-purpose tool for designing and analysing experiments involving many genes/clones such as those from expression microarrays or DNA resequencing efforts for variant single nucleotide polymorphisms (SNP) discovery." [28] It could be used to answer questions like: "Which of the genes in the collection are located on chromosome 3, are upregulated by a factor of 3, have potentially polymorphic repeats, and also have homologies in mouse which could be used for knockout experiments?" [28]
Arrogant takes its information from a number of external data sources. This data is then stored locally, to make sure that the program is not dependent on a functioning network. This leads to improved performance and reliability over programs that connect to the external data sources, but it also leads to maintenance issues. [28]
Arrogant uses UniGene to identify the "same" gene from a batch of GenBank accession numbers. This mapping is done in real time, and I could therefore have used this tool to go from a GenBank accession number to a UniGene identifier, instead of writing my own mapping tool. The first problem is that Arrogant cannot be downloaded and each session only handles 6500 genes. I have about 65000 genes today, which means that I would have to split up my data into 10 sets, send each set to Arrogant and then compile the results. This just seemed like more work than it was worth. The second problem is that I doubt how up-to-date the program is: when I tried it the results did not correspond to the results from the NCBI website.
MatchMiner
MatchMiner is a software tool for navigating among multiple genes, or gene products, simultaneously. The program can, among other things, translate a GenBank accession number to a UniGene number. The program is split in two parts, lookup and merge. Lookup translates the identifiers, and merge takes two input files and merges the similar genes in the sets. [6] This program looked very interesting since it was freely available and downloadable as a Java program. But I tried it on 6 of my genes, 6 I knew were present in UniGene, and none of them was found. This led to the conclusion that the program probably promises more than it can deliver. (http://discover.nci.nih.gov/matchminer/html/MatchMinerLookup.jsp)
GeneMerge
GeneMerge is a web-based Perl program that returns functional and genomic data for given genes, as well as statistical rank scores for over-representation of particular functions or categories. In other words, it focuses on those functions and categories that are most abundant in the dataset. This also sounds like a good resource, but when I checked out the home page (http://genemerge.bioteam.net), it looked like you had to provide the interaction data yourself instead of the program providing it. So in order to use this tool you would first have to find out which interactions your genes take part in, and that information might not always be accessible or easy to find. Another drawback is that it is only interested in significantly overrepresented characteristics. [7] Whether this limitation is acceptable or not depends on the intended use.
Resourcerer
"Resourcerer: a database for annotating and linking microarray resources within and across species" [55]. You have to be able to cross-reference data from different species to gain biological knowledge; not until then can you truly transfer inferences regarding gene expression and disease state from a model organism, such as the rat, to the organism of interest, such as humans. Resourcerer was developed to do just this. It builds on analysis of ESTs and gene sequences provided by the TIGR Gene Index (TGI).
Resourcerer can also be used within the same species, by identifying the similar and different genes in two data sets. [55]
2.1.2.2 From an integrational point of view
DAVID
DAVID is a program developed to mine biological information, with the purpose of combining functional descriptive data with intuitive graphical displays. DAVID also provides visualisation tools that show biochemical pathway maps and conserved protein domain annotations. DAVID is updated weekly to make sure that the information is as up-to-date as possible. One big drawback with DAVID is that it only allows one extra column for annotation. When you load up your list of identifiers it is possible to add extra information, such as experimental design, but it has to be restricted to one column. One good thing is that the dataset is stored throughout the session, allowing you to switch between the views without having to upload the datasets again and again. The thing I like about DAVID is that it allows you to sub-query the data in however many steps you like, while it is still easy to get back to the top level.
2.1.3 What has been done commercially
GeneticXChange
GeneticXChange offers "High Throughput Data Integration And Analysis For Optimising Drug Discovery". [14] This sounded just right, but the fact that it is a commercial product limited the amount of information available. I did manage to find out that GeneticXChange leaves the data in its original form [15] and depends on wrappers to support different data formats. [16] Had the program been freeware, I think that it would have been really interesting to investigate, but as it is, I could not do much investigation.
2.1.4 Conclusions
There is a reason that this section mostly focuses on what has been done on the genomic side, and it is simply this: there are no public repositories for proteomic data and therefore it has not been as interesting to develop tools for that side. We are now, as I have mentioned, seeing a shift in attention, and it will lead to public repositories and tools being made available. One reason for still including information about the genomic side in this dissertation is that when you are searching for tools that integrate
biological data, that is what you get. So I chose to include it since it shows what is being considered as integration of biological data today, while this dissertation is trying to show what could be done.
All the attempts that I have described above are based on one common concept: the user loads a list of identifiers to a web page and gets a collection of links back. No one tries to integrate the data itself, just the identifiers. This means that you cannot annotate the data in any way, nor can you make sure that the data is curated or up-to-date. There seems to be a common "idea" that you should build a server that functions as a connection between the different data sources. This server is responsible for linking the data sources, creating the interface and interacting with the users. The functional parts therefore need to consist of some kind of "translation layer" that takes the raw data and transforms it into a structure that the server can handle.
Another conclusion is that more programs use UniGene as identifier than TGI.
I mailed NCBI and told them that I wanted an up-to-date program that would translate a GenBank accession number into a UniGene cluster id, and asked them to recommend something, and they told me to write my own parser, so I did. [37]
2.2 The three approaches
There are three approaches to the integration of biological data: link-driven federation, view integration and data warehousing. [10]
Buneman et al. [11] write: "Various approaches to integration are appearing throughout the bioinformatics community and it is unlikely that there will be a single satisfactory strategy." I think they are right; I think that you have to mix these different solutions to get one that will work in most circumstances, and so I have (see section 3.3.0.6).
2.2.1 Link integration
Link-driven integration is driven by the user. The user has a starting point and from it follows explicit hyperlinks to other sources to collect all interesting data. [10] Since
this approach builds on the structure of the internet, it has been the most successful one. Researchers are already familiar with the web and know how to move around it.
There are some serious problems with link integration. The first one is vulnerability to naming clashes and ambiguities. The names of genes, for instance, are not the same across different species. This makes it hard to compare results from different species, and that makes model organisms less useful than they could be. The second concerns update issues. An external link assumes that the target is still valid, up-to-date and relevant. The third is that the integration is in the hands of the researcher. It is up to the individual researcher to decide how the data is connected and how it should be interpreted. [46]
2.2.2 View integration
View integration leaves the data in its source format but builds an environment around these sources to make them look like one big database. [10] The interface/environment is given a query and splits it up into smaller parts. These parts get sent to the different databases, and the interface/environment then compiles the different results to make it look like one big database has been queried.
There are some real possibilities with this approach, but in spite of that it has not been successful yet. The main reason is probably that it is slow, since the processing capability is limited by the slowest data source: the interface/environment has to wait for all results before it can show the result to the user. There is another reason for these systems not being in widespread use: they are difficult to implement and maintain. [46]
K2 is the most extensive attempt at view integration so far. [10]
2.2.3 Warehouse
This approach tries to take all the data and bring it into one big database. Therefore the first step is to design a unified database model that can accommodate all the information that is going to be integrated. The next step is to develop the software that will fetch the data from the external data sources, transform it to the appropriate
format and then save it to the database. This database can then be used to query the
data in all the ways that the separate external sources can, and also in new integrated
ways. This is faster than the view integration since the data is collected in one location.
The biggest problem is keeping the data up-to-date, since new data is being added and invented and the designs of the source databases change. All this means that the software that fetches the data from the external sources has to be constantly modified to accommodate these changes.
Stein writes that the most ambitious attempt at a warehouse approach was the Integrated Genome Database (IGD) project, which survived about a year before it collapsed. IGD was about combining human sequence data with genetic and physical maps of the human genome. The project collapsed because the software that imported the data had to be re-written on average once every two weeks, and this finally became too much. [46] This leads to the conclusion that any data in such a repository will probably be out-of-date, meaning that the scientist will not have access to the newest data. This also leads to the conclusion that one big data source is not a good idea.
The authors of the article "K2/Kleisli and GUS: Experiments in integrated access to genomic data sources" [10] conclude that there is no clear winner between view integration and the warehouse approach; which strategy should be used therefore depends on the situation.
2.3 Models and schemas
I chose to split up my program, and this dissertation, into three parts: a genomic, a proteomic and an integration part. This has allowed me to use those standards that are out there and has made the system more modularised. I think that this split has made the system more robust, since a change to one of the parts will not affect the others. The resulting structure also makes the system more scalable, since new data could easily be added into the existing structure without any other changes having to be made. So by splitting the project into three parts, I have come a long way towards realising 3 out of my 6 goals. The database is standardised (more or less), it is scalable (just add another type of data and connect it to projectexperiment) and it is modularised. This is not all I
have done to fulfil my goals, but it got me some part of the way (see section 4.1).
Figure 2.1: The three parts
2.3.1 MIAME
MGED (the Microarray Gene Expression Data society, mged.org) is a consortium of companies, organisations and research groups that come together to decide upon standards to be followed when performing microarray experiments. Among their guidelines is a document called MIAME (Minimum Information About a Microarray Experiment). The aim of this document is to make sure that all the information that is needed to reproduce an experiment (and get the same results) is recorded and easily accessible. This means that the information specified in this document somehow has to be present in your data structure. MIAME aims to "guide the development of microarray databases and data management software." This means that new software being developed for microarray data will hopefully be based on MIAME. MIAME should not be considered a set of rules but should be regarded as a guideline, and if you want to store more information, that is encouraged.
Although MIAME concentrates on the content rather than the structure of the data, it does provide a conceptual structure. This structure includes information about array
design description, features, control elements, experimental design, samples, hybridisation procedures and measurements, among other things. [32]
2.3.2 MaxD
MaxD is a database developed by the bioinformatics department at the University of Manchester. It aims to store the information needed for microarray experiments to follow the MIAME guidelines. Although it does do that to a certain extent, some information needs to be stored in the non-descriptive field "description" rather than in a column of its own. There are a lot of useful tools available, such as MaxDLoader, MaxDView and MAGE-ML export functions [31], that compensate for this drawback.
2.3.3 PEDRo
The field of proteomics is moving towards high-throughput technologies such as protein arrays. To take advantage of this, it is necessary that standards are developed: standards for storing experimental data analogous to MIAME. Since proteomic data is more complex than genomic data, it is very important that the metadata is both extensive and in place before any high-throughput technologies are implemented.
PEDRo is described as a "modular, expansible model" [51], and since I am trying to build a modular system this is very convenient. The authors are referring to the fact that it is possible to connect the table AnalyteProcessingStep to any other table, thereby making it possible to add new tables when new techniques are invented.
PEDRo stands for Proteomics Experiments Data Repository, and it has two goals [50]:
1. “The repository should contain sufficient information to allow users to recreate
any of the experiments whose results are stored within it.”
2. “The information stored should be organised in a manner reflecting the structure
of the experimental procedures that generated it.”
Figure 2.2: The MaxD schema

To encompass all this information the model needs to be fairly large, but it cannot be made too large, because then it will not be used. [50] Much of this information would be available in the lab anyhow. There are advantages to storing all this data: non-standard queries, such as by extraction technique, are then made possible, and once integrated, even more advanced search and querying facilities could be added. An even more important benefit is that it should be easier to exchange information, since all the data is collected in one location. [51]
There exists a need for public repositories for proteomic data that allow for rigorous comparisons between gels, arrays and so on. These repositories must capture information such as where the data comes from, who made it and how, as well as annotations and identifications. PEDRo could become that standard.
2.4 UniGene
2.4.1 What is it
"UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene as well as related information such as the tissue types in which the gene has been expressed and map location". [34] In other words, it is an algorithm for clustering GenBank sequences, making sure that the same gene has one unique UniGene cluster id and that all the GenBank accession numbers that exist for that gene are linked with one another through this id. This makes it possible to realise that genes A and B really are the same gene, since they are in the same cluster.
When the clustering gets updated the old information is not stored at NCBI. A cluster id will only be retired, never reused. It could be retired for one of the following reasons: the sequences in the cluster might be retracted by the submitters because they are found to have contaminants, two clusters may be joined, making one id retire, or a cluster could be split into two or more clusters. [35] SCGTI have decided that old clustering information is of no interest, because the changes depend mostly on the fact that the GenBank sequences have been updated by their authors, or that the algorithm has been enhanced. In the future this history could be interesting, but for now it is just changing too often to be of interest, so I delete all old information before entering the new data.
Figure 2.3: The original schema [52]

If you "find" a sequence, you will probably name it according to some internal naming scheme, making it impossible to assume that Rad17 in humans is the same gene as in C. elegans, but UniGene works around this by assigning a name to a cluster of genes instead. Otherwise it would be impossible to use the name as a description of what the gene is, or does. [35]
2.4.2 How up-to-date is it?
Not only are the well-characterised genes represented, but hundreds of thousands of novel expressed sequence tags (ESTs) have also been included. These ESTs are constantly changing, meaning that the UniGene clusters are too. The update rate varies from one week to one month. [35] This is not ideal but will probably have to be considered acceptable. NCBI say: "...it should be noted that the procedures for automated sequence clustering are still under development and the results may change from time to time as improvements are made." [36]
2.4.3 What limitations/problems exist
ESTs are characterised by being short (about 400-600 bases) and relatively inaccurate (around 2% errors). They are obtained by doing one single read, meaning that it is a low-cost method; no attempt is made to characterise or identify the clones: they are simply identified by their sequences. Therefore they will be redundant and some will represent contaminants or cloning artifacts. It is much more expensive to do a full high-quality sequencing, so it is not worth it in all cases, and therefore the ESTs are an invaluable resource, even though they contain errors. About 66% of all submissions to GenBank are still ESTs. [36]
It is important to eliminate low-quality or apparently artificial sequences prior to clustering, because even a small level of noise can have a large corrupting effect on the result. Therefore sequences of foreign origin are removed, as well as mRNA and mitochondrial sequences. For a sequence to be included in UniGene it has to have at least 100 bp (base pairs) and be of high quality and non-repetitive. NCBI require that the relations show a "dove-tail" relationship, meaning that they have to extend as far as possible, preferably to include both ends.
For a given set of sequences it is important to determine whether they are derived from the same gene or not (obviously, since that is what UniGene is trying to do), but some level of mismatch has to be tolerated since the ESTs could have substitution errors in them; allowing too much would cause highly similar paralogous genes to cluster together. Multiple incomplete but non-overlapping fragments of the same gene are often recognised when the complete sequence is submitted, which could lead to withdrawals of old sequences.
2.4.4 Which species are covered, and why is that enough
UniGene covers a wide range of organisms, both animals and plants. The species have been chosen to provide a wide range of model species, as well as some additional ones where the amount of ESTs available is greatest, and all species that are fully sequenced. It has species from a wide variety of families to allow for similarity matches across species. More species could be added after requests. [34]
It is enough since, already at the beginning of the Human Genome Project in the late 1980s, it was realised that model organisms would be needed, and therefore a lot of other sequencing projects started at the same time.
If we only look at humans, we will only be able to track those diseases that we discover today, and we will not be able to predict when they are going to start (unless we start gene manipulation on humans); therefore we need model organisms to study, and for that we need their sequences. For a study of a disease in, for instance, a mouse it is possible to create gene knockout experiments, where the gene in question is disabled, or to create a transgenic mouse where a gene is added.
Genes have to a large extent stayed the same through evolution, making it possible to look at gene A in a mouse and realise how gene A in a human will behave under similar conditions. [29] "Comparative genomics is the ultimate key to functional genomics." [29]
It is important to pick the right model organism for what you are studying right
now. The human genome is about 63% similar to that of the mouse, 57% to the fruit
fly, 38% to C. elegans, 20% to Arabidopsis and 15% to baker's yeast, making it possible
to use these organisms in a multitude of different experiments. The different model
organisms are specifically useful for different things; for instance the house mouse has
been used as a model organism for almost all kinds of studies, from cancer to psychiatric disorders, genetics and development, immunology and pharmacology. [3]
2.4.5 UniGene is more appropriate than TGI
2.4.5.1 What is it
TGI is a database maintained by The Institute for Genomic Research (TIGR). TIGR
was founded in 1992 and is interested in structural, functional and comparative analysis of genomes and gene-products. [53]
The TGI database is a collection of the publicly available ESTs and gene sequence
data. The goal is to identify transcripts and to place them in a genomic context. [55]
TGI uses ESTs and coding sequences to assemble tentative consensus sequences.
The ESTs are downloaded daily and cleaned of unwanted information. The genes
and the ESTs are compared pair-wise to identify overlaps using BLAST. A cluster is
formed when any two sequences form a 95% match over a 45 bp long region. [55] As
in UniGene, previously used cluster identifiers are never reused.
2.4.5.2 Comparison to UniGene
There are two big advantages to using UniGene. The first is that UniGene stores
information about protein products as well, whereas TGI is only interested in genes.
The other is that there seems to be more software using UniGene than
TGI. Even TIGR themselves say that their clustering algorithm builds on publications from NCBI. [54]
2.5 Systems biology and genetic networks
Being able to predict how the body would react given the known current state would
mean that all companies interested in drug design could save time on the experimental
side (the results would still have to be validated, but the models would tell you what
to validate). Saving time saves money. Today over 90% of the potential drugs entering
clinical trials fail to make it to market, bringing the average cost of a new drug to $770
million.
Today about 400 proteins are being used as drug targets, leaving a couple of thousand that could be exploited if the technology allowed it. When those 400 proteins
are analysed it seems that they all fall within the boundaries of just a few families.
These families include GPCRs, kinases, proteases and peptidases. [39] This means
that there are many more families that could be investigated, and for that we are going to
need more information, as well as better technologies.
Systems biology is the theory of going beyond the individual parts, such as genes
and proteins, to investigate how they work together to form complex structures. Not
until we have integrated multiple levels of biological information such as genome sequences, proteomic analysis and microarrays can we hope to get a global perspective
on how the system works. When we have this data it will be possible to create computational models of these networks. The models could then be used to enhance our
understanding of how the body works as well as to create new methods for developing drugs since it could be possible to test the drug on a computer before testing it
on humans. [21] All these parts have to be investigated and understood before systems
biology can become the “main stem in biological sciences in this century.” [49]
It might be possible, in 20 years' time, to take a piece of a patient's genome and
analyse it, and the result may be your future health history. This would make it possible
to give preventive medicine, and that might mean we could extend the human
life-span. Another possibility is a change in how we make vaccines:
what if it were possible to make a "super-vaccine" that would work for a whole group of
diseases. [47]
Chapter 3
Results
3.1 Technology
I implemented the database within the existing database structure at SCGTI, which meant
Oracle 8i. The tools I have built are all written in Perl, since Perl is a scripting language
and thereby facilitates fast development. Perl is also the language SCGTI were using for
most of their web development.
Figure 3.1: The programs that I have written
3.2 My solutions for the problems in section 2.1.1
Buneman et al [11] suggested 5 things to do to make integration of biological data work
(see section 2.1.1) and they are: Transformation of various schemas to a common data
model, match semantically related schema objects, integrate the schemas, transform
the data to a federated database on demand and match semantically equivalent data.
I have done none of these. I do not have a common data model, therefore I have not
matched my objects, neither have I integrated the schemas. I have not transformed the
data in any way, nor matched the equivalent data (for instance the sample part). All this
comes from the fact that I decided not to build a warehouse and these recommendations
are for that case. Nor did I decide to build a view integration which also would have
had to deal with these parts, although in a different way. I did a combination which
meant that I had to deal with some, and I could ignore others, for instance I had to
consider naming clashes, but I did not have to solve them.
Below are my solutions to the problems in section 2.1.1.
1. I decided to let the scope be all kinds of proteomic data but only microarray
data on the genomic side, with the possibility of adding more data if the need
should arise. The scope also includes information about the projects that SCGTI
are involved in and information about how to map from a gene to a protein.
2. I let the data stay in its original form. This means that any one part can be
changed without affecting the others; it also means that standardised tools can
be used without any changes having to be made. So I did not write any tools for
getting or transforming the data in any way.
3. I have made sure that it would be easy to add more data to the model. This means
that I do not have to answer the question, rather I work around it leaving it open
and thereby allowing for flexibility.
4. Same as above.
5. Since the data is kept separate I do not have to worry about the fact that identifiers
are called different things, or that different kinds of data are called the same thing. The
integration part is the only part that needs to know this.
6. This is a problem in my system. The demo queries that I have implemented are
very complex. For the 52 proteins the system has to make a couple of hundred simple
queries to the database, it has to make 52 complex ones (matching the proteins
against the genes) and it has to perform 104 t-tests. All this means that the demos
are very slow. There are things that could be done to optimise the queries, but I
did not consider that to be within the scope of this project.
7. MaxD and PEDRo got me a short way towards solving this problem. For MaxD
there are already loaders and export-functions implemented, and the same will
probably be true for PEDRo, should it get accepted as the standard.
3.3 From theory to practice
3.3.0.3 UniGene
UniGene does not contain all biological information, and it is therefore not possible to map
across all genes and proteins. Out of the 52 proteins that I am trying to map, all are
found, which is a good sign.
3.3.0.4 MIAME and MaxD
MaxD is a database schema that follows the guidelines that MIAME specifies, and since
it was already implemented at SCGTI I saw absolutely no reason for changing it, although there are things that I am not totally happy with.
One very simple thing is that there is no database schema on the website for MaxD,
which makes it hard to understand how the tables are connected. After reverse engineering the database, the resulting schema is the one shown in figure 2.2. Once I had this,
it was much easier to understand how the tables were connected.
There are some discrepancies between the MIAME guidelines and MaxD, which is
not good. It is possible to store all information that MIAME requests, but in that case
it has to be stored in the non-descriptive field "description". I say that this is wrong.
It is even worse when you consider that this limits the possible uses of the
system. The more information that is stored, the better it is from an integration point
of view, since this means that there is more information that could be used for novel
investigations, to learn new things.
Then there is something that SCGTI are doing that I am not too happy with: they
are not filling in the table "Gene". This table could have been used as a lookup table that
stores information about a gene, just as "Protein" does in PEDRo. The
information in UniGene could be used instead, but some information is not captured
there. The more information there is in the system, the more ways there are of integrating the
different parts and the more stable the system will be.
3.3.0.5 PEDRo
I used this schema as a foundation when I decided on the database structure. Once I had
implemented the schema I started changing it, but every change has a good reason, so
I am hoping that these changes will be accepted by Taylor et al, so that the new version
that I am suggesting will become the standard.
3.3.0.6 Link Integration, View Integration and a Warehouse
I decided to use a "combination" of the above techniques. There are several reasons
for this. The first is that I wanted the data in its original form, which would have meant
using either link integration or view integration. I also wanted the data in one
database, so that I would not have to wait for external resources; this would have meant
a warehouse. Then I also had to consider the fact that I wanted the user to see
the "database" as one resource, which would have meant either a view integration or a
warehouse. So instead of choosing one and sticking to it, I decided to pick ideas from
the three approaches and implement a mixture.
I decided to let the user interface link to some external resources, such as NCBI
and SwissProt that are fairly stable. I included these links since they contain more
information than is represented in the system.
The Project part, which I built on top of MaxD and PEDRo, could be considered a
view integration, because MaxD does not "know that PEDRo exists" and vice versa. It
is in this layer that querying facilities, such as query.pl, statistic.pl and so on, are
dealt with.
The warehouse aspect comes from the fact that all data is in the same database, and therefore internal, with
no external dependencies, which makes the querying facilities quicker.
So in a way I have used neither link integration, view integration nor a warehouse.
In a way I have implemented a link driven integration, a view integration system and
a warehouse. So to conclude, I have picked the parts that I wanted from the three
approaches to get a good mixture.
3.4 The Genomic Part
3.4.1 Getting the UniGene clustering information (all species.pl)
I have written a program that connects to the NCBI website every two weeks and
downloads the new clustering-information for human and mouse. Once the files are
downloaded, they are unpacked and a Perl program is called. This program parses
the data files (it is in Perl because Perl is good at text-matching and there is a lot of that).
The program reads through about 1000 MB of data, and from it picks out information
about the UniGene clusters that are being used by SCGTI. I have limited the import
to only essential clusters since it is simply too much data otherwise. As it is I have
about 350 000 records of data. If I were to store all information then that would be a
couple of million records and SCGTI simply does not have the space, or the need for
storing all this excess information. But this means that the program somehow has to
know which clusters are used, and this is done through an external connection file. This file
specifies the database path, the username, the password and the SQL statement. These
SQL statements get a list of the GenBank accession numbers that are in the database.
This means that if more tables are added, one simply adds another row to this file to get
those GenBank accession numbers as well. To make it a bit more useful, I also let it
read a file with GenBank accession numbers; this makes it possible for the system to
have information about clusters that are not yet in the database. Since the system has
to match these GenBank accession numbers to those in the files, the information from
the file is stored in hashes and then compared to the identifiers that are of interest. If
one is found it is written to a text file. Once the 200 000 records of UniGene clusters
have been processed, the textfiles are loaded into the database using sqlldr (a program
optimised for loading large quantities of data). First the old information is deleted
and then the new is added; this makes sure that all information gets updated. Using sqlldr
saves time, which is very useful considering that the program takes about 2 hours to run.
This mainly depends on the fact that the server runs out of memory when 2000 entries
are saved to the hashes, so 1000 entries are processed at a time. If this number
could be increased I think that the time required might be severely lowered.
The information stored for a cluster is the corresponding locus link, gene expression information, chromosome, GenBank accession numbers and protein-identifier.
The information is stored in three tables, UniGene, UniGene Mapping and UniGene Protein
(see section 3.6.1).
The stored information is then used to map across species (in BioLink), to map across analytes
and to show experimental information in my program.
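To make the batching idea concrete, here is a minimal Perl sketch of the matching step described above. It is not the real all species.pl: the file names, the record format (one "cluster, accession" pair per line) and the batch size are simplified assumptions used only for illustration.

    #!/usr/bin/perl
    # Sketch only: match UniGene records against wanted GenBank accession
    # numbers in batches of at most 1000 clusters, writing hits to a text
    # file that can then be bulk-loaded with sqlldr.
    use strict;
    use warnings;

    # Accession numbers of interest (in the real program these come from the
    # SQL statements listed in the connection file).
    my %wanted;
    open my $acc_fh, '<', 'wanted_accessions.txt' or die "wanted list: $!";
    while (<$acc_fh>) { chomp; $wanted{$_} = 1 if length }
    close $acc_fh;

    open my $in,  '<', 'unigene_records.txt' or die "UniGene data: $!";
    open my $out, '>', 'unigene_load.txt'    or die "sqlldr input: $!";

    my %batch;                 # at most $batch_size clusters held in memory
    my $batch_size = 1000;     # keeps the server from running out of memory

    while (my $line = <$in>) {
        chomp $line;
        my ($cluster, $accession) = split /\t/, $line;
        next unless defined $accession;
        push @{ $batch{$cluster} }, $accession;

        if (keys(%batch) >= $batch_size) {
            write_matches(\%batch, \%wanted, $out);
            %batch = ();       # empty the hash before the next chunk
        }
    }
    write_matches(\%batch, \%wanted, $out);   # flush the final partial batch

    close $in;
    close $out;                # unigene_load.txt is then loaded with sqlldr

    sub write_matches {
        my ($batch, $wanted, $fh) = @_;
        for my $cluster (sort keys %$batch) {
            for my $accession (@{ $batch->{$cluster} }) {
                print {$fh} "$cluster,$accession\n" if $wanted->{$accession};
            }
        }
    }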
3.5 The Proteomic Part
An entire pipeline for storing, querying and analysing proteomic data.
As can be seen in figure ??, the proteomic part is not independent; all programs are
connected in some way to other programs. I decided to do it like this since the resulting
structure makes it easier for the user, and good programming style should never take
precedence over what the user wants.
3.5.1 The PEDRo schema and my changes to it
Here I had nothing at all to work with, so the first thing I had to decide was what information to store. For this I went to SIRCAMS (Scottish Instrumentation and Resource
Centre for Advanced Mass Spectrometry, http://www.sircams.ed.ac.uk) and investigated what data they use and in what format they prefer it.
When I started this part, I was only going to save the information connected with
identifying a protein so that I could just show how the integration was thought to be
done. After a while, I realised that the platform would be much more useful if it was
possible to store all information about a proteomics experiment in the database, and so
I found PEDRo.
PEDRo is an experimental suggestion for a standard in the proteomics area. The
tool that I have developed should be used for integration which meant that I needed to
make some changes to the schema. I changed it to better accommodate the possibilities
for different high-throughput technologies. I also made some changes that will allow
it to be more flexible. I will describe the changes I made below, but for reference, the
"original" schema is shown in figure 2.3.
3.5.1.1 SpotPeakList
This table connects a Spot to a PeakList. This makes it possible to state that
a certain PeakList comes from a Spot. It is a separate table because otherwise it would
not be possible to send the same spot to many peaklists, for instance with different
ListProcessings connected to it, or some other difference.
3.5.1.2 DBSearch
To this table I added more information about the search, such as a search title, who
performed it and when it was performed. This information could be of interest when
comparing two results, trying to decide which one is the correct one. A database
search may be resubmitted if the results are not quite right, or if new knowledge
becomes available; this added information would then be used to separate the searches.
3.5.1.3 DBSearchResults
I added this table, since a DBSearch-object could get many different results. For instance if the same DBSearch is submitted to both Mascot and Ms-fit, then two results
would have to be associated with each DBSearch, but each would have its own information. The information contained within this table is the program used (ex. “Mascot”), how many results were found and how many peptide masses were submitted.
It also connects to the table "Chosen". It also has an annotation field since biologists
want to be able to comment on everything.
3.5.1.4 Chosen
This table connects DBSearchResults and ProteinHit. This makes it possible to identify more than one Protein for each spot that is being investigated. The table also lets
the user add a probability for the identification, saying how certain it is; it also has an
annotation field. This information could, for instance, also be used by an expert system
that tries to identify proteins. In that case the annotation field could be set to "system" (or
equivalent) and the probability could be calculated based on a set of rules for the system. Or the information could be set by one biologist to show that he/she is not too sure
about this identification, and then another biologist could look at it and either agree or
disagree.
Another reason for this table is that each spot might actually be a mixture of proteins, and then a DBSearchResults has to be associated with many ProteinHits.
3.5.1.5 ProteinHit
The table named ProteinHit and the table named PeptideHit in the original model have
now changed places since one DBSearchResult can get many ProteinHits, and each
ProteinHit consists of many PeptideHits.
Now ProteinHit also contains information about hitnumber (from the program), the
score, the score type (e.g. "ms-fit"), the molecular weight (MW), the isoelectric point (pI), how many
peptides were matched as well as the total sequence coverage for these peptides.
3.5.1.6 PeptideHit
PeptideHit now stores information about the start position and the end position for the
peptide, as well as the mass that was submitted and the one observed, the delta, the
number of missed cleavages and the sequence for the peptide.
3.5.1.7 Protein
The thing I have changed here is the format of the columns: in each case the column
has been extended so that the proteins that I store will fit. Even though the sequence
column has been extended from 500 to 3900 characters, I still get one protein that cannot be
saved, since its sequence is longer than this, but I have left that for now.
3.5.1.8 ListProcessing
Here I removed the column "background threshold" and created a new table Threshold instead. I added a column to the table that stores information about the software,
such as name and version. The software could affect both the results and the output format, so this information needs to be kept for future reference.
3.5.1.9 Threshold
This table is added, so that it will be possible to set different cut-off-values (i.e. thresholds) in different parts of the spectrum. This is useful since there is less noise in the
higher mass ranges, and the threshold in that region could then be set lower than in the
lower range. To make sure that it would be possible to split the spectrum in as many
ranges as the user wants, I decided to add another table instead of just another column
to ListProcessing. This leads to increased flexibility but also to increased complexity.
Threshold contains information about start and end position for the section and the
threshold-value.
3.5.1.10 CutOffValue
This table is also new and it contains a list of the peaks in the spectrum that are not going
to be included in the resulting mass list. These could for instance be all tryptic peaks, if
you know that you used this enzyme to cut the protein and that they are likely to have
contaminated the sample. This list could also contain information about peaks that you
want to ignore when searching for the protein. For instance, it is very common to get
keratin contaminations in the samples.
3.5.1.11 MassSpecExperiment
Here I added a column, “massspecmachine”, that connects this table to a certain MassSpecMachine. This connection needs to be made, since it is not 100% certain that the same
machine will be used in all experiments, though it is likely. If the machine differs, then
so could the results, since the format and detection range will probably be different.
The only reason for not having a connection has to be that they are assuming that you
only have one, and that is probably right. But that machine will probably be upgraded
sooner or later, and then you have to be able to see which experiments were made with
the new, and which were made with the old.
3.5.1.12 The resulting model
When all these changes have been made, the new model looks as follows.
Figure 3.2: These are the modified tables from PEDRo
My hope is that this model will be more robust than the original, and also that the
changes that I propose will be accepted by PSI (Proteomics Standards Initiative, http://psidev.sourceforge.net) as standards for how to capture data in a proteomics experiment.
3.5.2 Performing a peptide mass fingerprint
To perform a peptide mass fingerprint search a list of mass values for the interesting
peptides has to be specified. Today SIRCAMS are using both Mascot and Ms-fit as
search-engines. Based on the results from both they decide upon the identity of the
protein.
This step of searching and identifying is made easier by my system, since
search.html (see below) will accept information about the search and then send
it to both Mascot and Ms-fit. The user can choose to specify the list of masses by
hand, or to connect the search with already existing information through MassSpecExperiment and PeakList (see below).
Figure 3.3: The Search-page
The resulting pages will look exactly like the result pages from the programs (in
fact they are), except for a button for storing the result to the database. Pressing this
button will save the information and it will then be possible to show the result
using show.pl. This page could show just the result from Mascot, or from Ms-fit, but it
could also show the result from both searches at the same time.
This approach has one big advantage over the way they are using the programs
now: it saves time. It saves time since the user does not have to choose the right variables
twice, does not have to wait for the results from two web pages, and does not need
to cut and paste the mass list, since it can be taken from the system automatically. There is
another benefit too: the information from Mascot and Ms-fit can be shown next to
one another, giving the user more information when making the identification.
This would hopefully lead to more accurate identifications.
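As a rough sketch of the double submission, the snippet below posts one mass list to two search engines with LWP::UserAgent and keeps both raw result pages. The URLs and form field names are hypothetical placeholders; the real search.html uses whatever parameters the local Mascot and Ms-fit installations require.

    #!/usr/bin/perl
    # Sketch only: send the same peptide mass list to two search engines and
    # keep both raw result pages so they can be shown side by side and stored
    # later. The URLs and parameter names are placeholders, not real ones.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my @masses = (842.51, 1045.56, 1234.62, 2211.10);   # example mass list
    my $ua = LWP::UserAgent->new(timeout => 300);

    my %engines = (
        mascot => 'http://mascot.example.ac.uk/cgi/search.pl',
        msfit  => 'http://msfit.example.ac.uk/cgi/msfit.cgi',
    );

    my %result;
    for my $name (sort keys %engines) {
        my $response = $ua->post(
            $engines{$name},
            { peptide_masses => join("\n", @masses), enzyme => 'trypsin' },
        );
        $result{$name} = $response->is_success ? $response->content
                                               : 'ERROR: ' . $response->status_line;
    }

    # The two result pages could now be rendered next to each other, with a
    # "store to database" button added, as described above.
    print length($result{$_}), " bytes received from $_\n" for sort keys %result;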
Below is the result from one such search, and as you can see, the results from the programs differ substantially and suggest different proteins. If I had only searched using one
of the resources, I would probably have assumed that the protein the program suggested was the correct one, considering the scores it got, but with this knowledge I
wonder.
3.6 The Integration Part
3.6.1 The Project schema
The database is divided into three separate table-spaces, Project, MaxD and PEDRo.
Even though PEDRo and MaxD could be integrated easily (since they to a certain extent
contain the same information) I have chosen to keep them separate and to duplicate
the information. As I have already mentioned, this is because I keep the data "standardised". There is another reason as well: this approach is more correct than integrating
the data, since an experiment in biology really is a project, a project consisting of different parts such as a microarray experiment, a 2D-gel experiment, sample preparation
and so on. With this data structure this relationship is reflected in the database. This is
also what makes it easy to add new kinds of data. All you have to do is consider the
data a new “kind of experiment” and it will fit in with the existing structure.
Figure 3.4: The combined result from Mascot and Ms-fit

Projectexperiment stores information about what kind of experiment is being performed, through experiment analyte (e.g. DNA, RNA or protein) and experiment technology,
which for instance could be microarray, Western blotting or a clinical trial.
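As a small, hypothetical illustration of how a new kind of data can be registered as just another experiment, the sketch below adds a clinical-trial row to projectexperiment via DBI. The connection string, table and column names are guesses based on the description above, not the exact schema.

    #!/usr/bin/perl
    # Sketch: register a new kind of experiment (here clinical data) as one
    # more row in projectexperiment. Names are illustrative placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:scgtidb', 'user', 'password',
                           { AutoCommit => 0, RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        INSERT INTO projectexperiment
            (project_id, experiment_analyte, experiment_technology)
        VALUES (?, ?, ?)
    });

    # A clinical trial simply becomes another "kind of experiment" for project 42.
    $sth->execute(42, 'clinical', 'Clinical Trial');
    $dbh->commit;
    $dbh->disconnect;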
Another benefit of adding another level on top of PEDRo and MaxD is that it made
it possible to add keywords to both projects and experiments. This table
stores keywords from the MeSH, GO and MGED terminologies. This adds
another dimension to the project/experiment at hand, a dimension that can be queried
to find out which occurrences/species SCGTI are interested in.
Figure 3.5: The project part
3.6.2 Query.pl
This is the Perl module that contains all the SQL queries that query both sides at the
same time. This means that the complex queries are gathered in one place, making the
system more robust and easier to maintain. This module could therefore be said to be
the view-integration part of the system.
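As an illustration of the idea, a stripped-down module of this kind could look like the sketch below. The SQL itself is invented for the example (the table and column names are placeholders); the real query.pl contains the actual joins across the MaxD, PEDRo and Project tables.

    package Query;
    # Sketch of a module that gathers all cross-database SQL in one place.
    # The table and column names in the SQL are placeholders for this example.
    use strict;
    use warnings;

    sub new {
        my ($class, $dbh) = @_;
        return bless { dbh => $dbh }, $class;
    }

    # All programs ask for "expression values for a UniGene cluster" through
    # this one method, so the join only has to be maintained here.
    sub expression_for_cluster {
        my ($self, $cluster_id) = @_;
        my $sql = q{
            SELECT g.intensity, p.spot_volume
            FROM   maxd_measurement g, pedro_spot p, unigene_mapping u
            WHERE  u.cluster_id   = ?
            AND    g.genbank_acc  = u.genbank_acc
            AND    p.swissprot_id = u.protein_id
        };
        return $self->{dbh}->selectall_arrayref($sql, undef, $cluster_id);
    }

    1;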
3.6.3 Statistic.pl
Collecting all statistical tests in one module makes it easy to see which are implemented and which are not. It also makes it easy to update an algorithm, since this is
the only module that needs to be updated.
I have only implemented the Welch t-test (assuming unequal variance) since it is the
only one that my demo-programs are using. This module needs to know if it is going
to query the MaxD or the PEDRo database since it needs to know which information
to request from query.pl. A log-transformation is made on the data since I have only 4
uninfected samples and 4 infected. A log-transformation will smooth the intensities so
that an outlier will not have quite the same influence on the data.
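For reference, the Welch t-statistic is t = (m1 - m2) / sqrt(v1/n1 + v2/n2), with the degrees of freedom given by the Welch-Satterthwaite formula. A minimal sketch of how statistic.pl could compute this on log-transformed intensities is shown below; it is an illustration rather than the exact implementation, and the p-value lookup assumes that the CPAN module Statistics::Distributions is installed.

    package Statistic;
    # Sketch of the Welch t-test (unequal variances) on log-transformed
    # intensities. Assumes positive intensity values and at least two samples
    # per group; Statistics::Distributions is used only for the p-value.
    use strict;
    use warnings;
    use List::Util qw(sum);
    use Statistics::Distributions;

    sub _mean_var {
        my @x    = @_;
        my $n    = scalar @x;
        my $mean = sum(@x) / $n;
        my $var  = sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);
        return ($mean, $var, $n);
    }

    # t  = (m1 - m2) / sqrt(v1/n1 + v2/n2)
    # df = (v1/n1 + v2/n2)^2 / ( (v1/n1)^2/(n1-1) + (v2/n2)^2/(n2-1) )
    sub welch_t_test {
        my ($uninfected, $infected) = @_;        # array refs of raw intensities
        my @a = map { log($_) } @$uninfected;    # log-transform to damp outliers
        my @b = map { log($_) } @$infected;
        my ($m1, $v1, $n1) = _mean_var(@a);
        my ($m2, $v2, $n2) = _mean_var(@b);
        my $t  = ($m1 - $m2) / sqrt($v1 / $n1 + $v2 / $n2);
        my $df = ($v1/$n1 + $v2/$n2) ** 2
               / ( ($v1/$n1) ** 2 / ($n1 - 1) + ($v2/$n2) ** 2 / ($n2 - 1) );
        my $p  = 2 * Statistics::Distributions::tprob(int($df), abs($t));
        return ($t, $df, $p);                    # two-sided p-value
    }

    1;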
3.6.4 Plot.pl
This is a web page that will plot three different graphs. The first graph plots information for those genes and proteins where both expression levels have changed in
a statistically significant way. This is one way to visualise which genes and proteins
it might be interesting to examine further. I designed it since I thought it could
be interesting to see where the changes correspond, i.e. if the gene changes in a
3-fold positive way, will the protein also change 3-fold positively or not. Either way this
data could be used as a basis for understanding the networks.
Figure 3.6: The plot showing proteins vs genes
Plot two will show those proteins that have changed where no statistically significant
change in gene expression could be detected. This could be interpreted as a post-translational change, but we know far too little about this to say anything at all about
these proteins. (See section 3.8 for a discussion of this.)
Plot three will show those genes where no corresponding change in the protein level could be detected. This could be used to determine which proteins are affected
by a pathway, rather than by a single gene.
3.6.5 ShowTable4.pl
This page shows more or less the same information as plot.pl, just in a different format.
The information for the expression levels and the identifiers is displayed in a
table. In the proteomic and genomic columns there are 4 possibilities. Either a statistically
significant change has occurred, or there is no change, in which case "NS" is shown. It is also possible
that the analyte was not expressed in the uninfected samples, only the infected ones, in which case
"Added" is displayed, or that the analyte ceased expression on
infection, in which case "Disappeared" is shown.
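A small sketch of how such a cell could be labelled is shown below; the 0.05 significance threshold and the exact wording are assumptions made for the example.

    #!/usr/bin/perl
    # Sketch of the labelling logic for one table cell (hypothetical threshold).
    use strict;
    use warnings;

    sub label_change {
        my ($in_uninfected, $in_infected, $p_value, $fold_change) = @_;
        return 'Added'       if !$in_uninfected &&  $in_infected;
        return 'Disappeared' if  $in_uninfected && !$in_infected;
        return 'NS'          if $p_value > 0.05;         # no significant change
        return sprintf('%+.1f-fold', $fold_change);      # significant change
    }

    print label_change(1, 1, 0.01, -4.0), "\n";   # prints "-4.0-fold"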
3.6.6 Identifier.pl
This was my first attempt at bridging between the three parts. The page allows the
user to choose to search for a UniGene cluster id, a GenBank accession number or a
SwissProt identifier.
If you choose to search for a UniGene cluster, then all information known for that
cluster will be shown (see below).
Underneath this information a list of the projects where this identifier has been
used is shown. Since you chose to search for a cluster, it will show both genomic
and proteomic experiments.
Should you instead choose to search for a GenBank accession number, then the information known for that GenBank accession number will be shown, together with the genomic experiments where it
has been used. The corresponding information is shown for a SwissProt identifier.
Figure 3.7: The result for UniGene cluster Mm.14046
3.7 Testing and evaluation
3.7.1 Alan Pemberton
I demonstrated the platform and the system to Pemberton. I showed him how the
platform is conceptually built up from three different parts that are connected through
tables that get updated every other week. I also showed him the structure of the data,
though the emphasis of the demonstration was rather on giving him an understanding
of the system, letting him make suggestions as to what the platform could be
used for, and letting him comment if any information was missing or if there was something
"wrong" with the system. He did make some suggestions, like connecting it to
GenMap and adding pathway information to the system. I showed him the integration
part and explained how it is supposed to be used and interpreted. He seemed to like
the idea of plotting the information, making it possible to survey the results graphically.
He seemed interested in becoming involved in the further development of useful
tools for the platform.
3.7.1.1 Comparison with his results
The results I get differ from the ones that Pemberton got. This could have many
causes. We could be using different t-tests, which would give different
results. Another reason could be the fact that I log-transform my intensity values,
but since this should smooth out the results rather than make them more extreme, it
is surprising that I get more statistically significant changes than they did.
I do not know why the results differ, so this is something that has to be
determined before the statistic module is developed further.
3.7.2 Perdita Barran
I showed her what I had done, and her response was: useful! She did
have some comments about one of the fields in search.html (figure 3.3). She suggested that I
ignore the fact that the plotting takes about 30 minutes to run, and just tell the
biologists that this program takes that long to run.
I am supposed to come back in October/November and present it to everybody in
mass-spec, which has to be high praise.
3.7.3 Results from a UniGene cluster multiple alignment
One interesting control that could be made is to take a cluster from UniGene at random
and do a multiple sequence alignment on the GenBank accession numbers that make
up the cluster.
I did the multiple alignment for cluster Hs.288856, Prefoldin 5. For each of the
9 GenBank accession numbers that are in it I got at least 3 of the others in the result from
the EBI FASTA Server (http://www.ebi.ac.uk/fasta33/). In each case the matches for the
sequences were very strong, with E-values of around e-100.
For each alignment there were two other sequences AX590145 and D89667 that
also got very good scores. I then looked at the name of the sequences and 5 out of the
9 were named prefoldin 5. The other 4 I compared to these 2 using NCBI's Sequence
Viewer (http://www.ncbi.nih.gov/entrez/). The result of this comparison was that
AX590145 was obviously not similar to the others, but I did think D89667 looked very
similar; all 5 are c-myc binding proteins from the MM-1 gene. Then I started
comparing the sequence matches manually, and I found that between D89667 and
the others there were no mismatches in the sequences, though there were some at
the beginning and the end of the sequence, meaning that it probably does not fulfil the
"dove-tail" demand.
I am happy with the result from this alignment.
3.8 TestCase - A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice
I have used this dataset to show how the system could be used to gain new knowledge.
I will be discussing the aim of the experiment, the methods used, the results obtained
and conclusions that the authors made.
3.8.1 Background
Before the study was commenced it was known that when this parasite infection attacked mice a combined reaction of different immune responses worked together to
eliminate the infection. The choice of Trichinella spiralis, which is a nematode, was
based on the fact that it causes both profound and stereotypical pathological responses
that are common to many nematode infections. The pathology is most marked at the
time of worm expulsion which often takes place 14 days after infection.
3.8.2 Aim
The aim of the study was to use transcriptomic and proteomic methods to study the gene and protein
changes that occur when the worm is being expelled. The hope was to show some
of the pathways that are involved in this function. The authors' [41] only interest was
the state of the pathways at the time of expulsion and how it differs from that before the
infection, so there are only two data sets.
3.8.3 Method
8 adult female mice were infected with Trichinella spiralis. These results were later
compared to an unaffected control-group.
DNA microarrays were used to examine the genes, and from this a list of significantly
up- or down-regulated genes was compiled.
Then 2D-gels were used to see which proteins had changed. This list of 52 proteins was then examined with peptide mass fingerprint searches to find the proteins that
those peptides build.
The Student's t-test was used to determine which genes had interesting changes,
given the relative spot volumes for the proteins.
3.8.4 Results
In the paper [41] they discuss their results and conclude that some of the results correspond
to information already known and some were not known. They were hoping to show
how the expression of genes and proteins in these pathways corresponds, but since they
only found one case where this happened, this hope was not fulfilled. They conclude
that any changes in protein levels that have no corresponding changes in gene-expression levels could come from post-translational modifications, rather than a change
in the expressed genes.
2671 genes were detected in both the uninfected and the infected samples. Of
these 10% were notably increased on infection and 13% were decreased. 10 genes
were switched off in the infected sample, and 58 were switched on.
In the project they compared the effects of two separate Trichinella spiralis infections on the 2D profiles: 2-9% of the spots matched to control gels were notably
up-regulated, and 10-14% were down-regulated. Significant changes in the protein
expression levels were observed for 16 spots in total.
The changes observed by proteomics were then compared to the changes observed
in transcription of the corresponding genes. In only 1 case out of 11 was a change
in gene transcription (-2.8 fold) accompanied by a significant alteration in the corresponding protein level (-4.0 fold). The other 10 showed no statistically significant
change.
The table below is just a selection of spots, showing what the authors chose to highlight as
interesting.
Gene    Identity                       SP       Peptides   Proteomics   Microarray
ATP5B   ATP synthase beta chain...     P56480   15 (34%)   +7.2         -
ITLN    intelectin (spot 1)            O88310   5 (14%)    new spot     NS
LAP3    cytosol aminopeptidase         Q9CPY7   6 (12%)    -4.0         -2.8
PDIA1   protein disulphide isomerase   P09103   12 (26%)   NS           absent
PGAM1   phosphoglycerate mutase 1      Q9DBJ1   11 (42%)   NS           +1.4
PKM2    pyruvate kinase, M2 isozyme    P52480   12 (25%)   spot lost    NS
PKM2    pyruvate kinase, M2 isozyme    P52480   20 (43%)   spot lost    NS
PNLIP   pancreatic lipase (rat)        P27657   6 (19%)    new spot     -
Table 3.1: The results that Pemberton et al got [41]
3.8.5 Conclusions
They conclude that there were significant changes occurring in the gene expression
profiles at the time of worm expulsion and that some post-translational modifications
could be detected as well. These genes include those that are known to be involved in
the immune system.
There are some drawbacks with both microarray technologies and 2D-gels. For
microarrays it is a question of detection level: signals below a certain cut-off will not
show up, and important information will be lost. Though 2D-gels allow you to look directly
at the protein level, they restrict the range of pI and molecular weight of the
protein. The choice of extraction method will also narrow the detection range. Together
this means that up to a third of the proteins will never be found, and only the most
abundant ones can be identified.
In conclusion, the authors did, with the help of these methods, find some changes
in the expression levels, but they were mostly unrelated.
3.8.6 My comments
With the technologies that are available today (and by available I mean not too expensive and fairly robust) it is more or less possible to investigate the entire genome in
one project but it is far from possible to investigate the entire proteome. That is why
Pemberton et al only investigated 52 proteins. Not until new technologies are invented,
such as protein arrays, will it be possible to really start looking at organisms from a
systems biology approach. When that becomes possible, this platform needs to already be
in place.
Chapter 4
Discussion
4.1 Did I achieve my goals?
4 of the goals were achieved, 2 others were only partly achieved.
4.1.1 Standardisation
It is standardised since it uses MaxD and PEDRo. MaxD encompasses all information specified by MIAME, which is the standard for storing information about microarray
experiments. PEDRo is a part of PSI. PEDRo was changed to improve support for
high-throughput technologies, among other considerations, so it is not totally standardised.
4.1.2 Public
It is right now open to everybody within the University of Edinburgh, but the idea is
that it will be password-protected so that only the people that are entering data into
the database can see it, and then only their own data. The tools themselves are public.
Everything except the passwords/usernames for the database, that is. The only limitation
is that without a similar structure the programs cannot be used, but if anybody is interested in the programs, it is likely that they already have a similar database structure.
Figure 4.1: The welcome page for the system
4.1.3 Web-based
All parts, except for all species.pl, are accessible through the web, through the page
welcome.html that is shown in figure 4.1.
It would not be appropriate for all species.pl to be web-based since it is not something the user starts, nor something that processes data that should be displayed to the
user. The program is a text-parser and as such it should run on the server, hidden from
view. I would not even like to think about how long it would take for the program to
complete a run if it were web-based.
4.1.4 Scalable
All the connection information for the database is in one file, "connectDB.pl", so this
will not be hard to change. Simply change the machine name in the connection string
and all programs will automatically connect to the new database instead.
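A minimal sketch of what such a shared connection file could contain is given below; the DSN, username and password are placeholders, not the real ones.

    package ConnectDB;
    # Sketch of a single shared connection point. Change the host/SID here
    # and every program picks up the new database. All values are placeholders.
    use strict;
    use warnings;
    use DBI;

    sub dbh {
        return DBI->connect(
            'dbi:Oracle:host=dbserver.example.ed.ac.uk;sid=SCGTI',
            'username', 'password',
            { RaiseError => 1, AutoCommit => 1 },
        );
    }

    1;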
MaxD can only be used for microarray data, so any other kind of genomic experiment would have to be added to the database structure, but with the project,
projectexperiment and experiment structure that should be easy.
The table AnalyteProcessingStep in PEDRo makes it possible to tie in any kind of
proteomic experiment to PEDRo.
4.1.5 Modularised
The genomic side really consists of one single program “all species.pl” and the corresponding cron-job.
On the proteomic side there are many programs calling one another; each program
is responsible for one well-defined function, so this makes it modularised, though a
bit harder to oversee. The reason for connecting them is that it makes it easier for the
user, and programming goals are never allowed to supersede the needs of the user.
With the help of query.pl and statistic.pl the integration part is pretty modularised,
and all connecting information is within the same boundaries, making any necessary changes
easier to perform and the whole system easier to maintain. These two "modules" could be considered the glue that ties all the different parts together.
4.1.6 Up-to-date
The whole system depends on the information that gets retrieved in the genomic part.
Today this information is updated by manually calling a script.
4.2 What would I have done differently
Given more space to store the database, I would probably have stored all information
for UniGene, UniGene Mapping and UniGene Protein. Now I can only store information about the ones that we are using, and therefore the tool is a bit restricted as far as
mapping goes. Had I had more space, that part of the system would have been easy to
write, but now I have had some real problems with it. I store the needed information
in hashes, and the system gets out-of-memory errors before 2000 items have been
stored, so my program stores 1000 at a time, then saves the relevant ones to the
file, and then empties the hash, ready for the next 1000... until all 200 000 records have
been processed.
4.3 What would I have done if I had had more time?
• I would definitely have got the cronjob to work, or worked around it somehow.
• I would have put more time into query.pl and statistic.pl, so that they would have
been even more encapsulated and object-oriented. Now the same information
may be in two different places, which is not ideal.
• Even though I was aware of the difficulties from the beginning, I did hope that
I would have more time to implement different analysis tools and different
statistics tools, but I did not. If I had known from the beginning that I would have
more time, then I would have read and tried to understand more statistics and
analysis methods.
• I have absolutely no idea why msfit.cgi all of a sudden decided to crash, and
never worked properly again after that day. What happens is that the function
(LWP::UserAgent) that sends the information to the web-page is returning 400
OK, but the resultset is empty, meaning that the user never gets to see the result.
4.4 Future work
• Change the sequence field in PEDRo.Protein to a Blob or equivalent.
• One of the first things to add to the database-structure will probably be clinical
data since it is part of the foundation of what SCGTI are doing.
• Add the terminologies to the keyword table. I have not done this since the three
terminologies are pretty extensive, and mapping from one to the other two
by hand would take a lot of time; since it is not included in my project scope,
I did not do this mapping.
• It would probably have been interesting to look at other genomic technologies
than microarrays, just to see what is out there, and maybe add one to the
database to show how it could be done.
• There are many exciting things happening in the area of proteomics. This development should be monitored, and when a technique is considered stable enough,
it should be added to PEDRo as a new AnalyteProcessingStep.
• If data loaders are not made for PEDRo, then they need to be written. The reason
that I am talking about data loaders is that today SCGTI has a designated curator
who enters all data, and therefore the program will not need a "flashy" interface; it
should just be a simple tool for adding a large collection of data into the database.
• The platform may be in place, but there are many different kinds of analysis
tools that "should" be implemented once the platform is in place. These include
automatic normalisation tools, different statistical tests, different visualisation
tools and so on. One of the first things to add in this area is the capability to
ask those questions that I raise in step 6 in section 1.4. The next is to make it
possible to search on keywords.
• Another possibility would be to add a table SpotMeasurement in the project
tablespace. This table would map a spot from a 2D-gel experiment onto a measurement from a microarray experiment. This would be a mapping that maps
across the boundaries for cases when the identity is not known.
• When all this is done, genetic networks should be modelled using the data that
is integrated with this platform.
Chapter 5
Summary
I have designed and implemented a platform (in Oracle 8i) for the integration of genomic and proteomic data. I have also written a couple of programs for storing, querying and visualising the data. The integration bridge depends on data that gets updated
every second week, meaning that the results may change from time to time and that
the system is dynamic.
The platform accepts information about projects, microarray experiments and proteomics experiments. It is possible to add other types of data to the platform without
changing the existing structure. The platform is intended to be able to capture all biological data that is needed for an experiment in systems biology. This is a new area of
biology that looks at organisms as systems rather than as well-defined, isolated parts. The
problem is that it is not known what data will be needed for this, so the platform
has to be able to grow and encompass new kinds of data. I believe that I have succeeded
in this; I think that it is relatively easy to add, change or delete a module from the system. The only part that will be affected is the integration bridge. This part has
to know how to get information from all parts of the database; the others do not.
I have designed and implemented a demo-program that can plot relevant changes
in gene-expression levels against those in protein levels. This plot could be used as
a first tool when trying to understand how the networks are connected, or to decide
which genes and proteins should be studied further.
Appendix A
Appendix A - Biological definition
This project is about bioinformatics - the discipline of using computers as a tool to
gain knowledge from biological data. This dissertation is going to be read by people with
no background in biology, but since it demands certain biological knowledge I have
included this appendix for those people that are not entirely sure what the difference
is between the genome and the proteome. It could also be used as a “guide” to how
I have used the terms (this since I am convinced that if you asked five biologists for a
definition, you would get five different answers back).
Terms included in this appendix are in italics so that they are easily noticed. The
appendix is organised in an alphabetical way rather than a logical way so that it will
be easier to find the term you are looking up. In this glossary I have included a short
description of one of the most fundamental concepts for this dissertation, the central
dogma of molecular biology. It will tell you why DNA microarrays will not reveal the
entire picture and why it is interesting to look at the combination of gene and protein
expressions.
A.1 The Central Dogma of Molecular Biology
There are 3 steps involved in this process, and they are:
1. The replication phase. Here DNA (a double-stranded molecule) gets duplicated,
i.e. a perfect copy is made.
2. The transcription phase. By using DNA as a blueprint, a single-stranded RNA
sequence is created, preferably a perfect copy.
3. The translation phase. Proteins are made from RNA. This is done by translating
3 consecutive bases (a codon) into an amino acid. These amino acids get linked
together to eventually build up the proteins. [17]
In short, the central dogma of molecular biology says that our genetic material is
stored in DNA, and that DNA is a static occurrence. In spite of this, the "reflection" of
DNA in our bodies is quite different, for instance compare our toes with our hair.
The DNA gets transcribed into RNA, or more specifically mRNA. The same mRNA is
not always made from the same DNA; there are at least six different possible mRNAs
for each DNA since there are six open reading frames (ORFs). The genetic code works
in groups of 3 (the above-mentioned codons), and transcription can start at site 1, 2 or
3, but it is also possible for the transcription to work "backwards", adding another
3 possibilities and totalling 6. There are even more possibilities if you also take into
consideration the fact that pre-mRNA can be spliced (i.e. cut) at different locations in
the chain, leaving different mRNAs.
The next step is for the RNA to get translated into proteins. This happens by
taking a codon from the mRNA and translating it into the corresponding amino acid. The
genetic code is redundant, meaning that even though there are only 20 different amino
acids there are 64 (4^3) possible codons, so some codons code for the same amino acid.
There are also a start codon and stop codons that tell the translation machinery where to start
translating the mRNA and where to release the finished protein. Once the protein is
created and released it will become modified in different ways so that it can “act”
the right way in the right place. Some parts of the chain may be cut off, functional
groups may be added, it can fold up and/or form different 3-dimensional structures (the
structure of a protein will determine its function). When all this is done, the protein is
transported to the target location, and there it can have different effects depending on
where in the body this is. [17]
The central dogma of molecular biology is then the reason for there being a big
dimensionality issue in biology, as seen in the figure 1.4.6.1.
A.2 Amino acids
Amino acids are the building blocks of peptides and therefore also of proteins. An amino
acid is built up of five parts. In the middle there is a carbon atom (C); to this an NH2
(amino) group, a COOH (carboxyl) group and a hydrogen atom are bound. Apart from this
there is also a variable side chain (R). It is in the side chain that the amino acids differ.
[24]
A.3 Bases
A base pair is a pair of nucleotides. An A is bound to a T, and a G to a C, through hydrogen bonds. Since the two pairs, AT and GC, are of equal length, the diameter of the DNA molecule is uniform. [30]
A.4 Comparative genomics
The effort to sequence (and compare) different species. [3]
A.5 DNA
DNA (deoxyribonucleic acid) is perhaps best known as the storage site for all genetic material, the unit that transfers the parents' traits onto the children. [30] The DNA
contains all information needed to build up every single cell in our bodies. [30]
There are two different ways of transferring DNA from one generation to the next,
either cloning, where the DNA gets transferred without any changes, or sexual
reproduction, where the genetic material from the mother is crossed with that of the father,
forming a unique blend of their genetic material in the offspring. [43]
“Chemically” DNA is a double-stranded molecule that consists of nucleotides.
There are four different nucleotides that are used to build a DNA molecule, and they are
A, C, G and T. These nucleotides form long sequences that are bound together in such
a way that DNA is usually in the form of a double-helix. [30] (p100)
A.6 Gene
Genes can briefly be described as the protein-encoding sequences in the DNA [30].
Genes can be on or off; a gene that is on is said to be expressed, and that means
that the protein that the gene is coding for is being produced. Since it is the proteins
that are responsible for the functions in our bodies, it is very interesting to know which
genes are expressed and which are not. By staying on, the gene makes sure that
its product, the protein, continues to be produced. [30] Gene expression levels could
therefore be used as a measure of which proteins are needed in a given situation.
By using gene expression levels as an indicator it is possible to understand which genes
are involved in which diseases. This knowledge might lead to new drug targets,
since the pharmaceutical industry will then know which proteins are needed.
Today it is mostly proteins that are used as drug targets.
A.7 Genetic diseases
There are in principle two kinds of genetic diseases: simple and complex. The simple
ones are those where a single molecule (often a gene) is solely responsible for the
entire disease state. This is not very beneficial from an evolutionary perspective, so the
number of simple genetic diseases is, in comparison, very low. The complex ones
are those where a genetic network cooperates to create the disease; this group includes
cancer, heart disease and diabetes. (This leads to the conclusion that a genetic network
is an evolutionarily stable form of organisation.) In these diseases, no single gene can
cause the disease on its own; therefore it is likely that more than one mutation has to occur before
the disease breaks out. Genes may contribute to the disease in different degrees, and
they are influenced by their environment to different degrees too. "Unravelling these
networks of events will undoubtedly be a challenge for some time to come, and will be
amply assisted by the availability of the sequence of the human genome.” [33]
A.8 Genetic network/pathway
I am taking genetic networks as an example of what this platform could be used for,
since it is something that researchers are interested in today and since it shows the
way that DNA, RNA and proteins cooperate to build up our bodies.
A genetic network could best be described as the network or circuit which describes
the interactions of genes in a particular system. These networks differ in size; they
could be anything from tiny (2-3 interactions) to huge (tens of thousands of
interactions). [8]
Networks are responsible for the most complex forms of biological phenomena
including cancer. They are a collection of genes that (when expressed) have a common
biological function. This means that a gene from a network cannot perform the
function on its own; it needs the others, and it needs them in different degrees (the weights of the
network). Networks could also be used as a predictive source for how individual genes
will act under certain circumstances: since you know how the network influences
the gene, you will be able to predict how the gene will behave. This information could
possibly be used to predict whether a person will get a specific disease or not.
A pathway is a “sub-network”, i.e. a group of genes highly interconnected to
each other, but not as connected to the “outside” world (outside meaning outside that
pathway). Since they usually are a lot smaller than the networks themselves, they are
often easier to model. But its highly unlikely that they wouldn’t exchange information
with one-another, this means that you will have to be aware that there are information
missing then drawing conclusions based in pathways instead of networks. [2]
The term genetic networks, really encompasses so much more than genes, it refers
to all those things affecting genes as well, such as proteins, mRNA and other small
molecules such hormones and ions, as long as they are connected.
A.9 Genome
The genome is the entire genetic material for a specific organism. [38]
A.10 Genomics
The goal of genomics is to determine the complete DNA sequence for the entire genome,
whereas functional genomics wants to determine the function of the proteome, and
structural genomics is the systematic effort to put together a complete structural description of a defined set of molecules. The ultimate goal of the combined effort of
these is to have the entire genome and proteome mapped. [38]
A.11 Mass spectrometer
Mass spectrometry is an analytical technique that is used for 3 different purposes: the
identification of unknown compounds, the quantification of known materials and the
interpretation of physical and structural properties of ions. [45]
The mass spectrometry in this project has been used to identify unknown substances. This is done by choosing interesting samples (for instance a spot from a
2D-gel). The protein in the sample is cut up into peptides (usually using trypsin as the enzyme) and subjected to a mass spectrometer. The instrument then determines
the weight of these peptides and produces a list of peptide weights for the protein.
These weights are then matched against known peptides, to try to decide which peptides are present. From this list a protein is suggested; this process can be done by
either Mascot (http://www.matrixscience.com) or Ms-fit (http://prospector.uscf.ed.us)
and this technique is known as peptide mass fingerprinting.
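The matching idea can be sketched as follows (a deliberately simplified illustration, not how Mascot or Ms-fit work internally; the protein names, peptide masses and tolerance are invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Simplified peptide mass fingerprinting: count how many measured peptide
    # weights match the theoretical peptide weights of each candidate protein,
    # within a mass tolerance. All values are invented for illustration.
    my %theoretical = (
        proteinX => [ 842.5, 1045.6, 1179.6, 2211.1 ],
        proteinY => [ 701.4, 1045.6, 1500.8 ],
    );
    my @measured  = ( 842.51, 1045.58, 2211.12 );
    my $tolerance = 0.1;    # Daltons

    my %score;
    for my $protein (keys %theoretical) {
        for my $peak (@measured) {
            for my $mass (@{ $theoretical{$protein} }) {
                $score{$protein}++ if abs($peak - $mass) <= $tolerance;
            }
        }
    }

    # The candidate with the most matching peptide weights is suggested.
    for my $protein (sort { $score{$b} <=> $score{$a} } keys %score) {
        print "$protein: $score{$protein} matching peptide weights\n";
    }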
A.12 Microarrays
Microarrays are a technique used to gain a “snapshot” image of the levels of thousands of mRNAs in a biological sample at a given time. The mRNA “profile” is only an indirect measure of gene activity, [9] since if a gene is on it is producing the corresponding protein, and to do that it has to create mRNA.
A.13 Molecule
A collection of atoms bound strongly enough together not to fall apart (at least not easily). A molecule is the smallest building block of any substance and determines the properties of that substance. [44]
A.14 Nucleotides
Nucleotides are the building blocks of DNA. Each nucleotide in DNA is in turn built up of three parts: a sugar (in this case deoxyribose, giving the molecule its name), a phosphate group and a nitrogen-containing base. There are four different bases in DNA: adenine (A), guanine (G), cytosine (C) and thymine (T). The first two are called purines and the last two pyrimidines. The amount of purines in DNA is equal to the amount of pyrimidines. [30] The nucleotides are also known as bases.
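To illustrate that last point, a small Perl sketch (the sequence is invented): counting the bases of both strands of a double-stranded molecule gives equal numbers of purines (A+G) and pyrimidines (C+T).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One invented strand and its complementary strand; counting the bases of
    # both shows that purines (A+G) balance pyrimidines (C+T).
    my $strand = 'ATGCGGATCCTA';
    (my $complement = reverse $strand) =~ tr/ACGT/TGCA/;
    my $double_stranded = $strand . $complement;

    my %count;
    $count{$_}++ for split //, $double_stranded;

    my $purines     = ($count{A} || 0) + ($count{G} || 0);
    my $pyrimidines = ($count{C} || 0) + ($count{T} || 0);
    print "purines = $purines, pyrimidines = $pyrimidines\n";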
A.15 Peptide
A peptide is a short chain of amino acids linked by peptide bonds. Longer peptides are called polypeptides, and even longer ones are proteins. Peptides are usually only 20-30 amino acids long. [30]
A.16 Protein
Proteins are the molecules that act as building blocks in our bodies. They are responsible for forming most of our body structures, such as skin and hair, and for producing most substances, such as enzymes and antibodies. [42] They are also responsible for the metabolism, making sure that every reaction in our bodies happens at the right time and in the right location. [20] In short, they are responsible for more or less everything, except for the skeleton, the spinal cord and some parts of our brains.
The importance of proteins was clear from the very beginning, which is understandable when you consider that the word comes from the Greek “proteos”, meaning the first or the most important. [40]
Proteins are formed by chains of amino acids, varying from two to thousands of residues in length. Since there are 20 naturally occurring amino acids, the number of possible proteins is enormous: a protein of length N can have 20^N possible sequences.
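To give a feeling for how quickly this number grows, a small Perl sketch (the chosen lengths are arbitrary):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Number of possible amino acid sequences of length N,
    # given 20 naturally occurring amino acids.
    for my $n (5, 10, 50, 100) {
        printf "N = %3d  ->  20^N = %.2e possible sequences\n", $n, 20 ** $n;
    }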
A.17 Protein arrays
With protein arrays it will be possible to screen thousands of proteins simultaneously,
just as it is for microarrays.
Protein arrays could be used for many different things, such as detecting the presence or absence of proteins, investigating protein expression levels, determining interactions and functions of specific proteins, or looking at protein interactions: protein-protein, protein-antibody or protein-drug. [13]
A.18 Proteome
The proteome is the protein complement that is coded for by the genome. Although the genome is static (except for mutations), the proteome is dynamic. The proteome at any given time depends on many different factors, of which the surrounding environment is one. [38] The complexity of the proteome far exceeds that of the genome, since a gene can code for many different proteins. So far we do not know exactly how many different proteins there are, but taking into account different splice sites, reading frames and different post-translational modifications, one guess is about 500 000 proteins. [13] Another estimate comes from the Human Genome Project [22]: they think that the human proteome should be at least 10 times larger than that of the fruit fly (Drosophila) and the roundworm (C.elegans), which would mean around 300 000-380 000 different proteins.
A.19 Proteomics
Proteomics is the subject of quantifying the expression levels of the proteome at any given time. Nowadays the term usually also encompasses procedures for determining the function of a set of proteins, which makes it a synonym for functional genomics. [38]
The area encompasses the study of the identity, abundance, distribution, modifications, interactions, structure and functions of proteins.
Proteomics is such a vast area that “experts” predict that there will never be one
technology that can do all steps. [18]
A.20 RNA
The most important function of RNA (RiboNucleic Acid) is to act as an intermediate between DNA and proteins. The double helix of the DNA is unwound and a strand is copied, forming an RNA copy of the DNA. The resulting RNA is then spliced, forming mRNA (messenger-RNA) strands.
Three bases in a row are known as a codon. One by one, tRNAs (transfer-RNAs) bind to these codons, building up a long chain of amino acids that eventually separates from the mRNA and forms a protein.
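A minimal Perl sketch of this codon-to-amino-acid step (only a handful of codons from the standard genetic code are included, and the mRNA sequence is invented):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Translate an mRNA sequence codon by codon. Only a few entries of the
    # standard genetic code are included; the mRNA sequence is invented.
    my %codon_table = (
        AUG => 'Met', UUU => 'Phe', GGC => 'Gly',
        AAA => 'Lys', UAA => 'STOP',
    );

    my $mrna = 'AUGUUUGGCAAAUAA';
    my @protein;
    while ($mrna =~ /(...)/g) {        # read one codon (three bases) at a time
        my $amino_acid = $codon_table{$1} || '?';
        last if $amino_acid eq 'STOP';
        push @protein, $amino_acid;
    }
    print join('-', @protein), "\n";   # prints Met-Phe-Gly-Lys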
There are two big differences between a DNA and an RNA molecule: the first is that RNA is single-stranded while DNA is double-stranded, and the second is that the base T in DNA is replaced by the base U in RNA. [30]
One of the most important things to remember in biology is that there are always exceptions to every rule; for instance, in some organisms (e.g. some viruses) it is RNA, not DNA, that is used as the storage site for the genetic material. [23]
A.21 Systems biology
“Moving past DNA (deoxyribonucleic acid) sequences, researchers are interested in
the corresponding protein sequences, their structure, and function. Beyond sequences,
researchers wish to understand the “space” and “time” dimensions of genes, as for
example, what genes are expressed in which tissues and during what stages of development”. [10]
The human cell involves thousands of interactions between proteins, genes and other biomolecules. It is these interactions that are responsible for switching genes on or off, controlling which proteins are produced and responding to signals from the environment. This means that even small changes in one of these interactions could lead to the onset of a disease. [21] Today, we do not know what happens when a malfunction is introduced into the system. We do not know how stable it is, we do not know the design underlying it, and we do not know whether it would be possible to modify it to make a more robust system. [27]
Appendix B
Abbreviations
Arabidopsis: Arabidopsis thaliana (cress)
C.elegans: Caenorhabditis elegans (worm)
EST: Expressed Sequence Tag
MGED: Microarray Gene Expression Data
MIAME: Minimum Information About a Microarray Experiment
NCBI: National Center for Biotechnology Information
ORF: Open Reading Frame
PEDRo: Proteomics Experiments Data Repository
PSI: Proteomic Standards Initiative
SCGTI: Scottish Centre for Genomic Technology and Informatics
SIRCAMS: Scottish Instrumentation and Resource Centre for Advanced Mass Spectrometry
SQL: Structured Query Language
TIG: TIGR Gene Indices
TIGR: The Institute for Genomic Research
Bibliography
[1] Anastassiou, D. (2002). Computational Modeling of genetic networks. [WWW document] URL http://www.cvn.columbia.edu/courses/Spring2002/ELENE6901.htm
[2] Arkin, A. (2003). Cellular Systems Analysis and Quantitative Biology. [WWW
document]. URL http://www.hhmi.org/research/investigators/arkin.html
[3] Boguski, M.S. (2002). Comparative genomics: The mouse that roared. Nature 420, 515-516 (05 December 2002)
[4] Brazma, A. (2001). Editorial - on the importance of standardisation in life science. Bioinformatics, vol 17, no 2. Pages 113-114
[5] Buneman, P. Davidson, S.B. Overton, C. (1995). Challenges in Integrating Biological Data Sources. [WWW document] URL http://citeseer.nj.nec.com/davidson95challenges.html
[6] Bussey, K.J. & Kane, D. & Sunshine, M. & Narasimhan, S. & Nishizuka, S. & Reinhold, W.C. & Zeeberg, B. & Ajay & Weinstein, J.N. (2003). MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biology 4(4):R27
[7] Castillo-Davis, C. Hartl, D.L. (2002) GeneMerge - post-genomic analysis, data
mining, and hypothesis testing. Bioinformatics. Vol 19. No 7. pp 891-892
[8] Using Artificial Genomes to model Genetic Networks [WWW document] URL
http://homepage.ntlworld.com/cjl.clarke/proposal.html
[9] Cuminskey, M. & Levine, J. & Armstrong, D. (2002). Gene Network Reconstruction Using a Distributed GA with a Backprop Local Search. Proceedings
of the 1st European Workshop on Evolutionary Bioinformatics (EvoBIO 2003),
Springer.
[10] Davidson, S.B. Crabtree, J. Brunk, B.P. Schug, J. Tannen, V. Overton, G.C.
Stoeckert, C.J. (2001). K2/Kleisli and GUS: Experiments in integrated access
to genomic data sources. IBM Systems Journal, vol 40
[11] Davidson, S.B. Overton, C. Buneman, P. (1995). Challenges in Integrating Biological Data Sources. [WWW document] URL http://citesser.nj.nec.com/davidson95challenges.html
[12] Dennis, G. Jr. Sherman, B.T. Hosack, D.A. Yang, J. Gao, W. Lane, H.C. Lempicki, R.A. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology. URL http://genomebiolgy.com/2003/4/5/P3
[13] FunctionalGenomics.org.uk. Protein Arrays Resource Page. [WWW document]
URL http://www.functionalgenomics.org.uk/sections/protein arrays.htm#intro
[14] GeneticXchange. Optimizing Data Integration and Analysis For High Throughput Drug Discovery. [WWW document]. URL http://www.genetixchange.com/v3/product/whitepapers/gxdatasheet1102.pdf
[15] GeneticXchange. Exploiting the life science data explosion to speed new drug discovery. [WWW document]. URL http://www.geneticxchange.com/v3/product/whitepapers/WPexplosion.pdf
[16] GeneticXchange. discoveryHub - “Standard Edition”. [WWW document]. URL
http://geneticxchange.com/v3/index.php?doc=product/standardedition.html&lvl=1
[17] Gerlof, D. (2002). Lecture Notes for the Course Communicating about biological
Data.
[18] Gershon, D. (2003). Proteomics technologies: Probing the proteome Nature 424,
pp 581-587
[19] Ghazal, P. Talk about pathway biology.
[20] Gustafsson, A. (2002). De lägger nya bokstäver till livets alfabet [They are adding new letters to the alphabet of life]. Nyteknik, p 17. 7 March 2002
[21] Halim, N. Withers, M. (2002). Systems Biology: Creating the circuits of life. [WWW document] URL http://www.wi.mit.edu/nap/2002/nap feature sysbio.html
[22] The Human Genome Project. Preliminary findings of the human genome projects. [WWW document] URL http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/H/HGP.html
[23] Hunter, L. AI and molecular biology. [WWW document]. URL http://www.aaai.org//Library/Books/Hunter/01-Hunter.pdf
[24] [WWW document] URL http://www.hyperdictionary.com/dictionary/amino+acid
[25] Institute of Medicine - Division of Health Care. To Err is Human: Building a Safer Health System. [WWW document] URL http://www4.nas.edu/news.nsf/isbn/0309068371?OpenDocument
[26] Johnson, K. & Lin, S. (2003). QA/QC as a Pressing Need for Microarray Analysis: Meeting Report from CAMDA’02. Biotechnique vol 34. Pages S62-S63
[27] Kitano, H. (2002). Systems biology: A Brief Overview. Science (http://www.sciencemag.org), vol 295. 1 March 2002. pp 1662-1664
[28] Kulkarni, A.V. Williams, N.S. Wren, J.D. Mittelman, D. Pertsemlidis, A. Garner,
H.R. (2002). ARROGANT: an application to manipulate large gene collections
Bioinformatics, Vol 18, No 11, 2002, pp 1410-1417
[29] Lesney, M.S. (2001). Ecce homology: A primer on comparative genomics. [WWW document] URL http://pubs.acs.org/subscribe/journals/mdd/v04/i11/11lesney.html
[30] Lodish, H. & Berk, A. & Zipursky, S.L. & Matsudaira, P. & Baltimore, D. & Darnell, J. (2000). Molecular Cell Biology. New York:W.H. Freeman and Company.
pp 5,100
[31] [WWW document] URL http://bioinf.man.ac.uk/microarray/maxd/
[32] MGED. (2002). Minimum Information About a Microarray Experiment - MIAME 1.1 Draft 6. [WWW document]. URL http://www.mged.org/Workgroups/MIAME/miame 1.1.html
[33] NCBI. (2001). Genes and diseases. [WWW document] URL http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91
[34] NCBI. UniGene [WWW document] http://ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
[35] NCBI. Frequently asked questions. [WWW document] http://ncbi.nlm.nih.gov/UniGene/FAQ.shtml
[36] NCBI. (2003). UniGene: A unified view of the transcriptome. [WWW document] URL http://www.ncbi.noh.gov/books/bv.fcgi?call=bv.View...ShowSection&rid=handbook.chapter.857
[37] NCBI. (2003). Mail from NCBI regarding UniGene cluster ids [private communication]
[38] Nilges, M. & Linge, J.P. Bioinformatics - a definition. [WWW document]. URL
http://www.pasteur.fr/recherche/unites/Binfs/definition/bioinformatics definition.html
[39] Passino, M. Structural Bioinformatics in Drug Discovery [PPT document] URL
http://www.sdsc.edu/pb/edu/pharma202/Passino.ppt
[40] Paulun, F. Protein: Vad är det bra för? [Protein: what is it good for?] [WWW document]. URL http://bkspotrsmag.se/artiklar/protein - vad ar det bra for.htm
[41] Pemberton, A.D. & Knight, P.A. & Robertson, K. & Wright, S.H. & Roz, D. & Miller, H.R.P. (2003). A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice [Unpublished article]
[42] Phoenix5. (2002). Phoenix5’s Prostate Cancer Glossary. [WWW document].
URL http://www.phoenix5.org/glossary/protein.htm
[43] Purves, B. Sadava, D. Orians, G. Heller, C. (2001). Life: The Science of Biology, Sixth Edition. Sinauer Associates. pp 165
[44] Quist, P. (1998). Kemisten - molekylernas mästare? [The chemist - master of the molecules?] [WWW document] URL http://www.teknat.umu.se/popvet/POP/kemi.html
[45] Smoot, M. (2001). 2-D gel electrophoresis and protein mass spectrometry. [WWW document] URL http://hesweb1.med.virginia.edu/biostat/teaching/statbio/Spring01/2-D GelMass Spec.ppt
[46] Stein, L. (2003). Integrating Biological Databases. Nature Reviews — Genetics,
Volume 4.
[47] Stewart, B. (2002). An interview with Dr. Leroy Hood [WWW document] URL
http://www.oreillynet.com/lpt/a/1499
[48] Szallasi, Z. (2001). Genetic network analysis - From the bench to computers and
back. 2nd International Conference on Systems Biology
[49] Systems Biology.org. (2003) Systems Biology - English [WWW document] URL
http://www.systems-biology.org/000/
[50] Taylor, C.F. Paton, N.W. Garwood, K.L. Kirby, P.D. Stead, D.A. Yin, Z. Deutsch,
E.W. Selway, L. Walker, J. Riba-Garcia, I. Mohammed, S. Deery, M. Howard,
J. A. Dunkley, T. Aebersold, R. Kell, D.B. Lilley, K.S. Roepstorff, P. Yates, J.R.
Brass, A. Brown, A.J. Cach, P. Gaskell, S.J. Hubbard, S.J. Oliver, S.G (2003) A
systematic approach to modeling, capturing, and disseminating proteomics experimental data Nature, vol 21. Number 3, pp 247-254
[51] Taylor, C.F. Paton, N.W. Garwood, K.L. Kirby, P.D. Stead, D.A. Yin, Z.
Deutsch, E.W. Selway, L. Walker, J. Riba-Garcia, I. Mohammed, S. Deery,
M. Howard, J. A. Dunkley, T. Aebersold, R. Kell, D.B. Lilley, K.S. Roepstorff, P. Yates, J.R. Brass, A. Brown, A.J. Cach, P. Gaskell, S.J. Hubbard, S.J. Oliver, S.G (2003) A systematic approach to modeling, capturing, and disseminating proteomics experimental data [WWW document] URL
http://pedro.man.ac.uk/files/PEDRo PowerPoint talk.zip
[52] Taylor, C.F. [WWW document] URL http://pedro.man.ac.uk/model.shtml
[53] The Institute for Genomic Research. About TIGR. [WWW document] URL
http://www.tigr.org/about
[54] The Institute for Genomic Research. TIGR Gene Indices Software Tools. [WWW
document] URL http://www.tigr.org/tdb/tgi/software
[55] Tsai, J. Sultana, R. Lee, Y. Pertea, G. Karamycheva, S. Antonescu, V. Cho, J.
Paravizi, B. Cheung, F. Quackenbush, J. (2001). RESOURCERER: a database for
annotating and linking microarray resources within and across species. Genome
Biology 2001, 2(11)
[56] Wikipedia. (2003). Model organism. [WWW document] URL http://www.wikipedia.org/wiki/Model organism
[57] Zhong, S. Li, C. Wong, W.H. (2003). ChipInfo: software for extracting gene annotation and gene ontology information for microarray analysis. Nucleic Acids Research. vol 31. No 13. pp 3483-3486