A platform for the integration of genomic and proteomic data

Lena Hansson
Master of Science
School of Informatics
University of Edinburgh
2003

Abstract

The goals for this project were to build a standardised, public, web-based, scalable, modularised, up-to-date integration platform for genomic and proteomic data. This platform would allow researchers to investigate a single gene, a single protein, a batch of genes, a batch of proteins or any combination of the above. The platform can store data for any kind of proteomic experiment, but on the genomic side it is only possible to add microarray experiments at the moment. The platform is focused on being flexible enough to accommodate new kinds of data. My guess is that the first thing to be added to the structure will be clinical data, so that it would be possible to access that kind of information as well. To make it this flexible, I have divided the platform into three separate parts: a genomic, a proteomic and an integration part. I believe that any other solution would have resulted in a less robust system. I have used a combined approach of link-driven integration, view integration and warehousing to implement the platform, picking the best bits of each. The platform has been implemented in Oracle 8i and the programs in Perl. Such a platform could be used, as I have shown, to analyse the data that will become available as biology enters a new phase, one of systems biology.

Acknowledgements

This project would not have been possible without the help of the following people. First of all I would like to thank Peter Ghazal and everybody at SCGTI for making this project possible. They have answered my questions, installed software and hardware and given me useful tips along the way. I would also like to thank Alan Pemberton at Veterinary Clinical Studies for granting me access to his data and answering my questions. Perdita Barran and Jim Creanor at SIRCAMS helped me realise what kind of proteomic data needs to be stored and in what format. Both Alan and Perdita also helped me "test" and "evaluate" the system in the end; thank you for your time. Douglas Armstrong has made sure that both the project and the dissertation fulfil the demands of the University. To all of you, THANK YOU.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Lena Hansson)

Table of Contents

1 Introduction
  1.1 The goal with this project
  1.2 Why is this project interesting to do, now?
  1.3 Long-term goals for SCGTI ("biological relevance")
  1.4 Steps on the way
    1.4.1 Design the proteomic database
    1.4.2 Map genomic data "onto" proteomic data
    1.4.3 Build the integration platform
    1.4.4 Analyse the data
    1.4.5 What will be left to do after this project
    1.4.6 The problem at hand is both simple and complex
    1.4.7 What have I changed in these goals along the way?
  1.5 What have I done?
    1.5.1 The Genomic part
    1.5.2 The Proteomic part
    1.5.3 The Integration part
  1.6 Comparing with the testcase
  1.7 Outline of the dissertation
2 Background information
  2.1 Integration of biological data
    2.1.1 What problems will arise
    2.1.2 What has been done
    2.1.3 What has been done commercially
    2.1.4 Conclusions
  2.2 The three approaches
    2.2.1 Link integration
    2.2.2 View integration
    2.2.3 Warehouse
  2.3 Models and schemas
    2.3.1 MIAME
    2.3.2 MaxD
    2.3.3 PEDRo
  2.4 UniGene
    2.4.1 What is it
    2.4.2 How up-to-date is it?
    2.4.3 What limitations/problems exist
    2.4.4 Which species are covered, and why is that enough
    2.4.5 UniGene is more appropriate than TIG
  2.5 Systems biology and genetic networks
3 Results
  3.1 Technology
  3.2 My solutions for the problems in section 2.1.1
  3.3 From theory to practice
  3.4 The Genomic Part
    3.4.1 Getting the UniGene clustering information (all_species.pl)
  3.5 The Proteomic Part
    3.5.1 The PEDRo schema and my changes to it
    3.5.2 Performing a peptide mass fingerprint
  3.6 The Integration Part
    3.6.1 The Project schema
    3.6.2 Query.pl
    3.6.3 Statistic.pl
    3.6.4 Plot.pl
    3.6.5 ShowTable4.pl
    3.6.6 Identifier.pl
  3.7 Testing and evaluation
    3.7.1 Alan Pemberton
    3.7.2 Perdita Barran
    3.7.3 Results from a UniGene cluster multiple alignment
  3.8 TestCase - A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice
    3.8.1 Background
    3.8.2 Aim
    3.8.3 Method
    3.8.4 Results
    3.8.5 Conclusions
    3.8.6 My comments
4 Discussion
  4.1 Did I achieve my goals?
    4.1.1 Standardisation
    4.1.2 Public
    4.1.3 Web-based
    4.1.4 Scalable
    4.1.5 Modularised
    4.1.6 Up-to-date
  4.2 What would I have done differently
  4.3 What would I have done if I had had more time?
  4.4 Future work
5 Summary
A Appendix A - Biological definitions
  A.1 The Central Dogma of Molecular Biology
  A.2 Amino acids
  A.3 Bases
  A.4 Comparative genomics
  A.5 DNA
  A.6 Gene
  A.7 Genetic diseases
  A.8 Genetic network/pathway
  A.9 Genome
  A.10 Genomics
  A.11 Mass spectrometer
  A.12 Microarrays
  A.13 Molecule
  A.14 Nucleotides
  A.15 Peptide
  A.16 Protein
  A.17 Protein arrays
  A.18 Proteome
  A.19 Proteomics
  A.20 RNA
  A.21 Systems biology
B Abbreviations
Bibliography

List of Figures

  1.1 This figure shows the dimensionality problem in biology [19]
  2.1 The three parts
  2.2 The MaxD schema
  2.3 The original schema [52]
  3.1 The programs that I have written
  3.2 These are the modified tables from PEDRo
  3.3 The Search-page
  3.4 The combined result from Mascot and MS-Fit
  3.5 The project part
  3.6 The plot showing proteins vs genes
  3.7 The result for UniGene cluster Mm.14046
  4.1 The welcome page for the system

Chapter 1
Introduction

The completion of the human genome marked the end of an era and the beginning of a new one. Castillo-Davis et al. [7] claim that researchers are now shifting their attention towards post-genomic analysis techniques. This means that new, more sophisticated software, such as data mining and hypothesis-testing tools, is going to be developed that utilises chemical, biological and evolutionary data. [7]

This shift in what to look for, from a single analyte to the whole system, is important. Systems biology tells you that focusing on only one part, for instance on either DNA, RNA or proteins, will not lead to an understanding of the entire system. Although an understanding of the system is important, it is not until you start looking at the dynamics within the system that real knowledge is available. [27] This project is an attempt to design the database needed to store the information that these analyses are going to use.

In an article in Nature [18] in July this year, the author claimed that the next big thing is going to be proteomics. Gershon claims that it remains to be proven whether proteomic information can be turned into real products, even though most drugs on the market target proteins, not genes. The problem is that the threshold for starting to investigate the proteome is higher than for the genome. I say she is wrong: the next big thing is going to be systems biology, where you look at the genome and the proteome simultaneously, "skipping" the middle step that Gershon thinks biology will go through.

Because systems biology has only recently emerged as a subject, there are not that many products around that support this approach. There are many tools available that have one purpose, but no tool tries to include all information, all technologies and all methods in one system, to reveal as much information as possible to the biologist.

1.1 The goal with this project

The goal is to build a standardised, public, web-based, scalable, modularised, up-to-date integration platform for genomic and proteomic data.

Standardised data will allow SCGTI (Scottish Centre for Genomic Technology and Informatics) to exchange information with external sources. They are going to export their own data to public repositories (after publication, that is) and import external data into their internal database. This means that if the data is not standardised, then methods for transforming the data from one format to the other will have to be written. Therefore I will try to use standardised data wherever possible. Having the data in a standard format also makes it easier to submit articles, since more and more journals are demanding that the data is accessible in a standard format, in a public repository.

The software needs to be public, since SCGTI probably wants to make it available to their collaborators. If the data is available on the web it is automatically going to be public, open to those with a username and password.

It should be scalable, since biology is an area where new techniques, new data and new methods are invented or created all the time. This means that it has to be easy to add new data to the system.
If not, it will become impossible to maintain and will disappear, just as K2 did. [10]

Modular systems are automatically relatively easy to update and maintain. The principle of modularising a system is that changes to any one part should have no effect on the other parts, or at least only marginal effects. Therefore I have tried to keep the genomic part, the proteomic part and the integration part separate at all times.

To keep software up-to-date in bioinformatics is hard and takes a lot of time. Therefore it should be "easy" to keep the most important part, i.e. the "integration bridge", updated.

1.2 Why is this project interesting to do, now?

The reason this project is needed now is that biology is moving towards systems biology, which means that you need information about DNA, RNA and proteins. Today this information is mainly available, but spread out over multiple sources. To integrate these sources, it has to be possible to "translate" mRNA into proteins, and the other way around. One possible use of this platform is in the modelling of genetic networks (see sections 1.3 and 2.5).

The distributed nature of biological data makes it a time-consuming task to gather the sought information for one gene, but to do it for many is a manually intractable task. This means that there is a need for a "query-based access to an integrated database that disseminates biological rich information across large datasets and displays graphics summaries of functional information." [12]

There are a lot of tools available for collecting, storing, querying and visualising genomic data. When high-throughput technologies start becoming available for proteomic data, similar tools will probably start to be developed. Software for analysing, querying and visualising the integrated data will then be the next thing, but before such software can be used, the data has to be stored in a centralised location, and that is what I have tried to do. I have designed a platform that makes it possible to access both genomic and proteomic data through the same interface; as a first test case I have designed a visualisation tool for looking at the combined gene and protein expression levels.

1.3 Long-term goals for SCGTI ("biological relevance")

At the end of 1999, an article [25] was published describing how "doctors" were one of the leading causes of death in the US. The article stated that between 50,000 and 100,000 people die from medical errors each year. About 7,000 of these deaths are caused by people having adverse reactions to drugs. If it were possible to predict how a patient would react to a specific drug, given the patient's current biological state, then maybe a large number of these tragic deaths could be prevented. To be able to do this, you would have to have the technology to take a "snapshot" of the patient's state. Given this information, and a computer model of the corresponding network, it would then be possible to predict how the patient would react when the drug is administered (see section A.8).

In conclusion, the long-term goal for SCGTI is to be able to predict the response of a patient with a specific disease after the administration of a drug. To do this, they are going to have to be able to model genetic networks, and to model these networks they are going to need a lot of data; that data is now available in one interface, through this project. Other interesting questions that genetic networks should be able to answer are:
1. Why do certain people get certain diseases?
2. What makes some people refractory to certain diseases?
3. Could information about those people be used to treat others?

But before this becomes a reality there are some things that need to be done, and they are discussed in the next section.

1.4 Steps on the way

1.4.1 Design the proteomic database

SCGTI have up to now not collected proteomic information, and this means that I am going to design the entire database and fill it with data; I will also write programs for storing, querying and visualising the data. I have used a schema called PEDRo (see section 2.3.3) as a basis for what data should be stored, and in what format, but I have made some changes to this model. I could probably have done an entire MSc project on designing a database for the storage of proteomic data, or even on evaluating PEDRo as it is; here it was only a part of this dissertation, although a big part.

1.4.2 Map genomic data "onto" proteomic data

I will write a program that retrieves the mapping information from NCBI's website and saves it to the database, so that it can later be used to map between genomic and proteomic data. This information will be kept separate from the proteomic database, and from the genomic database that was already implemented. If it is separate then the system will become more modular and scalable, and it will be easier to keep the mapping information up-to-date. The mapping has to be able to map genes to genes via a UniGene cluster, but it should also be able to map from a gene to the corresponding protein and back again.

1.4.3 Build the integration platform

Once the genomic and proteomic data is in place, it is time to start integrating. I will implement two programs that are going to demonstrate how this integrated information could be used to gain new knowledge. To do this properly I am going to write some small modules with well-defined functions.

1.4.4 Analyse the data

This part should be kept separate from the previous one. There are two reasons for this. The first is that if this part is integrated into the integration part, it will be harder to continue adding more tools after this project is finished. This will have to be done, since we do not know which tools are going to be useful, and therefore this part has to be easy to change. The second reason is that I will not have the time to look at this part, since the others are going to take too long.

Queries that could be of interest to ask are: How many elements/genes/proteins were analysed? Based on this information it could be possible to investigate whether the expression levels for the elements differ or correspond. If it is known whether or not a specific gene is expressed: how many different forms of proteins are there from this gene? This will answer the question: what dimensionality is there in this problem? Is it worth investigating or is it too complex? (see figure 1.1 in section 1.4.6.1).

1.4.5 What will be left to do after this project

Modelling the underlying networks will be left, since that is rather a PhD project than an MSc project in terms of complexity and time. Once these networks are modelled they will have to be confirmed through biological experiments by a biologist.

1.4.6 The problem at hand is both simple and complex

1.4.6.1 Biology

Figure 1.1: This figure shows the dimensionality problem in biology [19]

It is simple since there exists a link between DNA, RNA and proteins, and UniGene is trying to provide it.
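To make that link concrete, the short sketch below walks from a set of identified proteins to their UniGene clusters and on to the GenBank accessions of the sequences in those clusters. The accession numbers and the two hard-coded hashes are invented for the illustration; in the platform itself this mapping lives in database tables, not in the program.

#!/usr/bin/perl
# Toy illustration of the gene-protein link via UniGene clusters.
# The accession numbers below are made up; in the real platform the
# two hashes would be filled from the mapping tables in the database.
use strict;
use warnings;

# protein accession -> UniGene cluster id
my %protein_to_cluster = (
    'P00001' => 'Mm.14046',
    'P00002' => 'Mm.1',
);

# UniGene cluster id -> GenBank accessions of the sequences in the cluster
my %cluster_to_genes = (
    'Mm.14046' => [ 'AA000001', 'AA000002' ],
    'Mm.1'     => [ 'AA000003' ],
);

# Walk from a list of identified proteins to the genes on the array.
for my $protein ( sort keys %protein_to_cluster ) {
    my $cluster = $protein_to_cluster{$protein};
    my @genes   = @{ $cluster_to_genes{$cluster} || [] };
    print "$protein -> $cluster -> ", join( ', ', @genes ), "\n";
}

Note that the last step can return more than one gene for a single protein, which is exactly the dimensionality issue discussed next.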
It is complex since it has to deal with the curse of dimensionality: when integrating biological data it is never possible to assume that a query will return only one result. Today the mapping between the different parts is done via the identifiers. The demo programs compile a list of proteins and from it find the corresponding genes. I chose this "direction" since the testcase (see section 3.8) has far fewer proteins than genes. This means that it is quite possible that more than one of the genes on the array codes for the same protein, and in that case all those genes would be used to match against the protein. Until you start specifying a mapping table (this spot on this gel corresponds to this spot on this array), this is a limitation that you have to accept.

1.4.6.2 Informatics

There are some problems that arise when trying to integrate any kind of data; these problems are mostly connected to the structure you choose: link-driven integration, view integration or warehouse. I decided to go with a combination of all three. I chose this since I wanted to keep the data in one database, I wanted to keep it separate and in its original form, and in some places I wanted to link to external resources. This means that I have to consider all the disadvantages, but also that I get all the benefits. I believe that the approach has been appropriate for this project.

1.4.7 What have I changed in these goals along the way?

I have put a lot more emphasis on getting the data, especially the proteomic data, in place than I first thought I would. Originally I was just going to demonstrate how it could be done, thereby only storing the data that I would need for the mapping itself, but a couple of weeks into the project I found PEDRo and decided to use this schema for my proteomic data. This meant that I had to evaluate the model and also that I ended up designing many more tables. Some of these needed to be filled with data to make the integration work. This part therefore took a lot longer than I had originally planned, which left me with less time at the end than I had hoped, so I did not have time to implement many different analysis tools.

1.5 What have I done?

I have implemented a platform for the integration of genomic and proteomic data. This platform allows researchers to investigate a single gene, a single protein, many genes, many proteins or any combination of the above. The platform really consists of three separate parts, the genomic part, the proteomic part and the integration part, since this allows for the greatest flexibility. Any other solution would have resulted in a less robust system. This way a change in, for instance, how gel data is captured will only affect the PEDRo section, not MaxD nor the Project part. Any change will probably affect the integration part, but since this is encapsulated in one single module (see section 3.6.2) the "damage" done should be limited.

1.5.1 The Genomic part

I have implemented a program that gets the relevant information about the UniGene clusters that SCGTI are using and stores it in the database. The program downloads the newest mouse and human UniGene clustering information from the NCBI website. The program reads through the entire files, some 1000 MB of data, and stores information about the essential clusters in the database.
I decided to limit the program to only the essential information, since we are talking about a couple of million records otherwise, and out of these fewer than 400,000 are needed, so there would be a lot of unused data. Another reason is that right now SCGTI do not have unlimited space on the database server, so if I were to try to store all the information I would fill up all available space and still not be able to store everything. This is not a big limitation, since the program will recollect the needed information every two weeks.

1.5.2 The Proteomic part

I have designed and implemented a data structure for storing all information about a proteomics experiment; this is based on the PEDRo schema. I have made some changes to this model in order to make it more flexible and more appropriate for integrated access to the data. Using PEDRo as the basis for the database structure helped me "fulfil" some of my goals, such as standardisation and modularisation, and also helped with flexibility.

I wrote programs for adding, querying and visualising the data. I also wrote a program that sends information regarding a peptide mass fingerprint search to two different search engines, Mascot and MS-Fit. This approach will save time for the user and hopefully lead to more reliable protein identifications, since results from both programs can be displayed side by side.

1.5.3 The Integration part

I have implemented two programs that demonstrate how the database could be used to gain new knowledge. I did not have as much time left in the end as I would have liked, so I looked at the draft that Pemberton et al. [41] have written. I used one of their tables (see section 3.1) as an example of what you could do with this integrated data, and that led to showTable4.pl (see section 3.6.5). Once I had done that I thought one step further and decided to implement the same functionality as a graph instead, thereby making it easier to survey more proteins than the 52 they investigated, so I implemented plot.pl (see section 3.6.4).

1.6 Comparing with the testcase

The integration part shows how identified proteins could be used to connect genomic and proteomic information. There are some differences between the results that I got and those that are in the paper by Pemberton et al. [41]. There are some possible explanations for these discrepancies and I will discuss them in section 3.7.1.1.

1.7 Outline of the dissertation

This first chapter talks about the goals of the project and discusses why this project is of general interest. It shows what needs to be done before such a platform can be implemented, and it also talks about the possible uses of the platform. The next chapter talks about the background and the theories that my project is founded on, such as MIAME, PEDRo and UniGene. The chapter after that is about the results I got: what I did, how I did it, and what the programs can be used for, as well as how they are supposed to be used. I also discuss the testcase I used to evaluate my system, and how I used this information to decide what my demo programs should do and how. Finally I discuss the project: whether I achieved my goals, what I would have done differently, what I would have done with more time, and some possible future work with the platform. Then I sum up the dissertation.

At the end, I have added an appendix, Biological definitions. This chapter can serve as a crash course in biology or simply as a reminder, e.g.
of what the difference is between the genome and the proteome. Words or phrases that are in this "dictionary" are set in italics, to distinguish them from the running text. Last, I have added a short glossary of the abbreviations that I have used in this dissertation.

Chapter 2
Background information

There are many different programs that allow you to access different genomic and proteomic resources; the problem is that it is hard to integrate them, since no common platform is available. To investigate which genes are involved in a multigenic disease, such as cancer, a researcher would start with the online resources to determine which genes (if any) have previously been connected to the disease. The problem is that although the information is available, it is spread over different data sources and resides in a variety of models and formats.

Very often the data model for biological data will include sequential data (i.e. lists) and nested structures (trees). This type of information is not well suited to a relational database. The complexity could be dealt with in an object-oriented database, but these have been unsuccessful because the structure of the data changes too often.

It was hard finding information about similar systems. My belief is that if they are out there, they are built by companies that are not about to give up information about how they have done it, just like GeneticXChange.

2.1 Integration of biological data

"Recent years have seen an explosion in the amount of available biological data. More and more genomes are being sequenced and annotated, and protein and gene interaction data are accumulating. Biological databases have been invaluable for managing these data and for making them accessible. Depending on the data that they contain, the databases fulfill different functions. But, although they are architecturally similar, so far their integration has proven problematic."

This quote comes from an article published in May this year by Lincoln Stein [46], and it discusses some problems that arise when you are trying to integrate biological databases. I chose to start this section with it since it describes some of the problems of integrating biological data. The first problem is the amount of data: how much data, and what kind of data, should be stored? The second is the fact that new techniques and methods are constantly being invented, meaning that you do not know which format the data is going to be in. The third is that the data is spread out over different sources, making it hard to overview and access.

2.1.1 What problems will arise

There are, as Buneman et al., 1995 [5] discuss, some problems that only arise when you try to integrate big data sources (such as biological ones), and some that are always there no matter what.

1. The scope of a single database would be hard to define. What should be included and what should be excluded? The designers would have to consider experimental design, sequence analysis, results and annotations. Should it be limited to genomic information, or should proteomic information also be included? These questions have to be addressed before any integration attempt is made. [5] [46]

2. The biological information that you want to integrate is out there; it is just a matter of finding and accessing it. The external data sources all have different interfaces and retrieval facilities. This makes it hard for both the biologist and the bioinformatician.
The biologist has to master a lot of different "systems" and the bioinformatician has to understand all these "systems" so that the data the biologist wants can be stored in a format they understand and can use. [46] There are a lot of different sources, such as relational databases, object-oriented databases, flat files, and programs such as BLAST and FASTA. How do you integrate these into one GUI without losing the internal representation of the different databases, while making sure that the software knows the internal representation and allowing both external and internal databases to change structure with only minimal effects on the software? This is the hardest question. Trying to fit all data into the same structure would probably create more compromises than solutions.

3. Even though the data may look similar at first glance, it will be using different identifiers and different names for the same thing (or the same name for different things). This makes any kind of automatic mapping between sources intractable, and manual mapping hard. [46]

4. Not only do you have to deal with different formats, different access methods and other technical problems; an even more important question is: what kind of information will we need tomorrow? New techniques and new algorithms are making new kinds of data available, as well as updating old ones. This new data also needs to be incorporated. [46] This means that any automated task of getting data, parsing it and storing it in a local data structure would have to be constantly re-written, and this is not feasible. [57] It is very difficult for one single database to handle all these different techniques. In other words, the evolutionary aspect of biological data would make one big database impossible. [5]

5. The queries that these data sources have to be able to answer are steadily increasing in complexity and size. This means that more advanced querying capabilities have to be supported. Another problem is that users get restless if they have to wait for their results; they want them directly. This puts a lot of pressure on both the hardware and the software. [5]

6. The data needs to be standardised. This is especially important if you want to build a high-throughput pipeline for biological discovery. It might be easier to include external modules in the pipeline if the internal data is in a standardised format; otherwise conversion tools have to be built for every external connection, and this is a lot of work. [4] Another point is that users would then know what
The authors describe Arrogant as a “general-purpose tool for designing and analysing experiments involving many genes/clones such as those from expression microarrays or DNA resequencing efforts for variant single nucleotide polymorphisms (SNP) discovery.” [28] It could be used to answer questions like: “Which of the genes in the collection are located on chromosome 3, are unregulated by a factor of 3, have potentially polymorphic repeats, and also have homologies in mouse which could be used for knockout experiments?” [28] Arrogant takes its information from a number of external data sources. This data in then implemented locally, to make sure that the program is not dependent on a functioning network. This leads to improved performance and reliability over programs that connects to the external datasources, but this also leads to maintenance issues. [28] Arrogant uses UniGene to identify the “same” gene from a batch of GenBank accession numbers. This mapping is done in real time, and I could therefore have used this tool to go from a GenBank accession number to a UniGene identifier, instead of writing my own mapping-tool. The first problem is that Arrogant cannot be downloaded and each session only handles 6500 genes. I have about 65000 genes today, and this means that I would have to split up my data into 10 sets, send each set to Arrogant and then compile the results. This just seemed like a lot more work than it was worth. The second problem is that I doubt how up-to-date the program is: when I tried it the results did not correspond to the results from the NCBI website. Chapter 2. Background information 15 MatchMiner MatchMiner is a software tool for navigating among multiple genes, or gene products, simultaneously. The program can, among other things, translate a GenBank accession number to a UniGene number. The program is split in two parts, lookup and merge. Lookup is translating the identifiers and merge takes two input files and merges the similar genes in the sets. [6] This program looked very interesting since is was freely available and down-loadable as a Java-program. But I tried it on 6 of my genes, 6 I know were present in UniGene, and none of them was found. This led to the conclusion that the program probably promised more than it could deliver. (http://discover.nci.nih.gov/matchminer/html/MatchMinerLookup.jsp) GeneMerge GeneMerge is a web-based Perl-program that returns functional and genomic data for given genes, as well as statistical rank scores for over-representation of particular function or categories. In other words, it is focusing on those functions and categories that are most abundant in the dataset. This also sounds like a good resource, but then I checked out the home-page (http://genemerge.bioteam.net), it looked like you had to provide the interaction data instead of the program providing it. So in order to use this tool you would have to first find what interactions your genes take part in, and that information might not always be accessible or easy to find. Another drawback is that it is only interested in significantly overrepresented characteristics. [7] This is a limitation that I determine if it is accepted or not. 
Resourcerer

"Resourcerer: a database for annotating and linking microarray resources within and across species" [55]

You have to be able to cross-reference data from different species to gain biological knowledge; not until then can you truly derive inferences regarding gene expression and disease state from a model organism, such as the rat, to the organism of interest, such as humans. Resourcerer was developed to do just this. It builds on analysis of ESTs and gene sequences provided by the TIGR gene index (TGI). Resourcerer can also be used within the same species, by identifying the similar and different genes in two data sets. [55]

2.1.2.2 From an integrational point of view

DAVID

DAVID is a program developed to mine biological information, with the purpose of combining functional descriptive data with intuitive graphical displays. DAVID also provides visualisation tools that show biochemical pathway maps and conserved protein domain annotations. DAVID is updated weekly to make sure that the information is as up-to-date as possible. One big drawback with DAVID is that it only allows one extra column for annotation. When you load up your list of identifiers it is possible to add extra information, such as experimental design, but it has to be restricted to one column. One good thing is that the dataset is stored throughout the session, allowing you to switch between the views without having to upload the datasets again and again. The thing I like about DAVID is that it allows you to sub-query the data in however many steps you like, while it is still easy to get back to the top level.

2.1.3 What has been done commercially

GeneticXChange

GeneticXChange is described as offering "High Throughput Data Integration And Analysis For Optimising Drug Discovery". [14] This sounded just right, but the fact that it is a commercial product limited the amount of information available. I did manage to find out that GeneticXChange leaves the data in its original form [15] and depends on wrappers to support different data formats [16]. Had the program been freeware, I think it would have been really interesting to investigate, but as it is, I could not do much investigation.

2.1.4 Conclusions

There is a reason that this section mostly focuses on what has been done on the genomic side, and it is simply this: there are no public repositories for proteomic data, and therefore it is not as interesting to develop tools for that side. We are now, as I have mentioned, seeing a shift in attention, and it will lead to public repositories and tools being made available. One reason for still including information about the genomic side in this dissertation is that when you search for tools that integrate biological data, this is what you get. So I chose to include it, since it shows what is considered integration of biological data today, while this dissertation is trying to show what could be done.

All the attempts that I have described above are based on one common concept: the user loads a list of identifiers to a web page and gets a collection of links back. No one tries to integrate the data itself, just the identifiers. This means that you cannot annotate the data in any way, nor can you make sure that the data is curated or up-to-date. There seems to be a common "idea" that you should build a server that functions as a connection between the different data sources.
This server is responsible for linking the data sources, creating the interface and interacting with the users. The functional parts therefore need to consist of some kind of "translation layer" that takes the raw data and transforms it into a structure that the server can handle.

Another conclusion is that more programs use UniGene as an identifier than TIG. I mailed NCBI and told them that I wanted an up-to-date program that would translate a GenBank accession number into a UniGene cluster id, and asked them to recommend something; they told me to write my own parser, so I did. [37]

2.2 The three approaches

There are three approaches to the integration of biological data: link-driven federations, view integration and data warehousing. [10] Buneman et al. [11] write: "Various approaches to integration are appearing throughout the bioinformatics community and it is unlikely that there will be a single satisfactory strategy." I think they are right: you have to mix these different solutions to get one that will work in most circumstances, and so I have (see section 3.3.0.6).

2.2.1 Link integration

Link-driven integration is driven by the user. The user has a starting point and from it follows explicit hyperlinks to other sources to collect all interesting data. [10] Since this approach builds on the structure of the internet it has been the most successful one. Researchers are already familiar with the web and know how to move around it.

There are some serious problems with link integration. The first one is vulnerability to naming clashes and ambiguities. The names of genes, for instance, are not the same across different species. This makes it hard to compare results from different species, and that makes model organisms less useful than they could be. The second concerns update issues: an external link assumes that the target is still valid, up-to-date and relevant. The third is that the integration is in the hands of the researcher. It is up to the individual researcher to decide how the data is connected and how it should be interpreted. [46]

2.2.2 View integration

View integration leaves the data in its source format but builds an environment around these sources to make them look like one big database. [10] The interface/environment is given a query and splits it up into smaller parts. These parts are sent to the different databases, and the interface/environment then compiles the different results to make it look like one big database has been queried. There are some real possibilities with this approach, but in spite of that it has not been successful yet. The main reason is probably the fact that it is slow: the processing capability is limited by the slowest data source, since the interface/environment has to wait for all results before it can show the result to the user. There is another reason these systems are not in widespread use: they are difficult to implement and maintain. [46] K2 is the most extensive attempt at view integration so far. [10]

2.2.3 Warehouse

This approach tries to take all data and bring it into one big database. Therefore the first step is to design a unified database model that can accommodate all the information that is going to be integrated. The next step is to develop the software that will fetch the data from the external data sources, transform it to the appropriate format and then save it to the database.
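As a rough sketch of this fetch-transform-load step (purely illustrative of the warehouse approach, not of the platform built in this project), a minimal loader could look like the following; the URL, the tab-separated layout and the table gene_annotation are all invented for the example.

#!/usr/bin/perl
# Minimal warehouse-style loader: fetch a flat file from an external
# source, reshape each record to the local schema and insert it.
# The URL, column layout and table name are invented for this sketch.
use strict;
use warnings;
use LWP::Simple qw(get);
use DBI;

my $raw = get('http://example.org/export/genes.tab')
    or die 'Could not fetch the external data source';

my $dbh = DBI->connect( 'dbi:Oracle:warehouse', 'username', 'password',
    { RaiseError => 1, AutoCommit => 0 } );
my $sth = $dbh->prepare(
    'INSERT INTO gene_annotation (accession, symbol, description)
     VALUES (?, ?, ?)' );

for my $line ( split /\n/, $raw ) {
    next if $line =~ /^#/;                        # skip comment lines
    my ( $accession, $symbol, $description ) = split /\t/, $line;
    next unless defined $description;             # drop incomplete rows
    $sth->execute( $accession, uc $symbol, $description );   # "transform" step
}

$dbh->commit;
$dbh->disconnect;

Every time the external source changes its column layout, a loader like this has to change with it, which is exactly the maintenance burden that made projects such as IGD collapse.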
The resulting warehouse can then be used to query the data in all the ways that the separate external sources can, and also in new, integrated ways. This is faster than view integration, since the data is collected in one location. The biggest problem is keeping the data up-to-date, since new data is being added and invented and the design of the source databases changes. All this means that the software that fetches the data from the external sources has to be constantly modified to accommodate these changes.

Stein writes that the most ambitious attempt at a warehouse approach was the Integrated Genome Database (IGD) project, which survived about a year before it collapsed. IGD was about combining human sequence data with genetic and physical maps for the human genome. The project collapsed because the software that imported the data had to be re-written on average once every two weeks, and this finally became too much. [46] This leads to the conclusion that any data in such a repository will probably be out-of-date, meaning that the scientist will not have access to the newest data. It also leads to the conclusion that one big data source is not a good idea.

The authors of the article "K2/Kleisli and GUS: Experiments in integrated access to genomic data sources" [10] conclude that there is no clear winner between view integration and the warehouse; which strategy should be used therefore depends on the situation.

2.3 Models and schemas

I chose to split up my program, and this dissertation, into three parts: a genomic, a proteomic and an integration part. This has allowed me to use those standards that are out there and has made the system more modularised. I think that this split has made the system more robust, since a change to one of the parts will not affect the others. The resulting structure also makes the system more scalable, since new data can easily be added into the existing structure without any other changes having to be made. So by splitting the project into three parts, I have come a long way towards realising three out of my six goals. The database is standardised (more or less), it is scalable (just add another type of data and connect it to projectexperiment) and it is modularised. This is not all I have done to fulfil my goals, but it got me some part of the way (see section 4.1).

Figure 2.1: The three parts

2.3.1 MIAME

Mged.org (the Microarray Gene Expression Data society) is a consortium of companies, organisations and research groups that has come together to decide upon standards to be followed when performing microarray experiments. Among their guidelines is a document called MIAME (Minimum Information About a Microarray Experiment). The aim of this document is to make sure that all the information needed to reproduce an experiment (and get the same results) is recorded and easily accessible. This means that the information specified in this document somehow has to be present in your data structure. MIAME aims to "guide the development of microarray databases and data management software." This means that new software being developed for microarray data will hopefully be based on MIAME. MIAME should not be considered a set of rules but should be regarded as a guideline, and if you want to store more information, that is encouraged.

Although MIAME concentrates on the content rather than the structure of the data, it does provide a conceptual structure. This structure includes information about array
design description, features, control elements, experimental design, samples, hybridisation procedures and measurements, among other things. [32]

2.3.2 MaxD

MaxD is a database developed by the bioinformatics department at the University of Manchester. It aims to store the information needed for microarray experiments to follow the MIAME guidelines. Although it does do that to a certain extent, some information needs to be stored in the non-descriptive field "description" rather than in a column of its own. There are a lot of useful tools available, such as MaxDLoader, MaxDView and MAGE-ML export functions [31], that compensate for this drawback.

Figure 2.2: The MaxD schema

2.3.3 PEDRo

The field of proteomics is moving towards high-throughput technologies such as protein arrays. To take advantage of this, it is necessary that standards are developed for storing experimental data, analogous to MIAME. Since proteomic data is more complex than genomic data, it is very important that the metadata is both extensive and in place before any high-throughput technologies are implemented.

PEDRo is described as a "modular, expansible model" [51], and since I am trying to build a modular system this is very convenient. The authors are referring to the fact that it is possible to connect the table AnalyteProcessingStep to any other table, thereby making it possible to add new tables when new techniques are invented. PEDRo stands for Proteomics Experiments Data Repository, and it has two goals [50]:

1. "The repository should contain sufficient information to allow users to recreate any of the experiments whose results are stored within it."

2. "The information stored should be organised in a manner reflecting the structure of the experimental procedures that generated it."

To encompass all this information the model needs to be fairly large, but it cannot be made too large, because then it will not be used. [50] Much of this information would be available in the lab anyhow. There are advantages to storing all this data: non-standard queries, such as by extraction technique, are then made possible. Once integrated, even more advanced search and querying facilities could be added. An even more important benefit is that it should be easier to exchange information, since all the data is collected in one location. [51]

There exists a need for public repositories for proteomic data that allow for rigorous comparisons between gels, arrays and so on. These repositories must capture information such as where the data comes from, who made it and how, as well as annotations and identifications. PEDRo could become that standard.

Figure 2.3: The original schema [52]

2.4 UniGene

2.4.1 What is it

"UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene as well as related information such as the tissue types in which the gene has been expressed and map location". [34] In other words, it is an algorithm for clustering GenBank sequences, making sure that the same gene has one unique UniGene cluster id and that all the GenBank accession numbers that exist for that gene are linked with one another through this id. This makes it possible to realise that genes A and B really are the same gene, since they are in the same cluster. When the clustering gets updated the old information is not stored at NCBI.
A cluster id will only be retired, never reused. It can be retired for one of the following reasons: the sequences in the cluster might be retracted by the submitters because they are found to have contaminants, two clusters may be joined, retiring one id, or a cluster may be split into two or more clusters. [35] SCGTI have decided that old clustering information is of no interest, because the changes depend mostly on the fact that the GenBank sequences have been updated by their authors, or that the algorithm has been enhanced. In the future this could be interesting, but for now it is just changing too often to be of interest, so I delete all old information before entering the new data.

If you "find" a sequence, you will probably name it according to some internal naming scheme, making it impossible to assume that Rad17 in human is the same gene in C. elegans, but UniGene works around this by assigning a name to a cluster of genes instead. Otherwise it would be impossible to use the name as a description of what the gene is, or does. [35]

2.4.2 How up-to-date is it?

Not only are the well-characterised genes represented, but hundreds of thousands of novel expressed sequence tags (ESTs) have also been included. These ESTs are constantly changing, meaning that the UniGene clusters are too. The update rate varies from one week to one month. [35] This is not ideal but will probably have to be considered acceptable. NCBI say: "...it should be noted that the procedures for automated sequence clustering are still under development and the results may change from time to time as improvements are made." [36]

2.4.3 What limitations/problems exist

ESTs are characterised by being short (about 400-600 bases) and relatively inaccurate (around 2% errors). They are obtained by doing one single read, meaning that it is a low-cost method; no attempt is made to characterise or identify the clones: they are simply identified by their sequences. Therefore they will be redundant, and some will represent contaminants or cloning artefacts. It is much more expensive to do a full high-quality sequencing, so it is not worth it in all cases, and therefore the ESTs are an invaluable resource, even though they contain errors. About 66% of all submissions to GenBank are still ESTs. [36]

It is important to eliminate low-quality or apparently artificial sequences prior to clustering, because even a small level of noise can have a large corrupting effect on the result. Therefore sequences of foreign origin are removed, as well as mRNA and mitochondrial sequences. For a sequence to be included in UniGene it has to be at least 100 bp (base pairs) long, of high quality and non-repetitive. NCBI require that the relations show a "dove-tail" relationship, meaning that they have to extend as far as possible, preferably including both ends.

For a given set of sequences it is important to determine whether or not they are derived from the same gene (obviously, since that is what UniGene is trying to do), but some level of mismatch has to be tolerated since the ESTs can have substitution errors in them; allowing too much would cause highly similar paralogous genes to cluster together. Multiple incomplete but non-overlapping fragments of the same gene are often recognised when the complete sequence is submitted, which can lead to withdrawal of old sequences.
2.4.4 Which species are covered, and why is that enough

UniGene covers a wide range of organisms, both animals and plants. The species have been chosen to provide a wide range of model species, as well as some additional ones where the amount of ESTs available is greatest, and all species that are fully sequenced. It has species from a wide variety of families to allow for similarity matches across species. More species can be added after requests. [34]

This is enough since, already at the beginning of the Human Genome Project in the late 1980s, it was realised that model organisms would be needed; therefore a lot of other sequencing projects started at the same time. If we only look at humans, we will only be able to track those diseases that we discover today, and we will not be able to predict when they are going to start (unless we start gene manipulation on humans); therefore we need model organisms to study, and for that we need their sequences. For a study of a disease in, for instance, a mouse, it is possible to create gene knockout experiments, where the gene in question is disabled, or to create a transgenic mouse where the gene is added. The genes have to a large extent stayed the same through evolution, making it possible to look at gene A in a mouse and realise how gene A in a human will behave under similar conditions. [29] "Comparative genomics is the ultimate key to functional genomics." [29]

It is important to pick the right model organism for what you are studying right now. The human genome is about 63% similar to that of the mouse, 57% to the fruit fly, 38% to C. elegans, 20% to Arabidopsis and 15% to baker's yeast, making it possible to use these organisms in a multitude of different experiments. The different model organisms are specifically useful for different things; for instance, the house mouse has been used as a model organism for almost all kinds of studies, from cancer to psychiatric disorders, to genetics and development, immunology and pharmacology. [3]

2.4.5 UniGene is more appropriate than TIG

2.4.5.1 What is it

TIG is a database maintained by The Institute for Genomic Research (TIGR). TIGR was founded in 1992 and is interested in structural, functional and comparative analysis of genomes and gene products. [53] The TGI database is a collection of the publicly available EST and gene sequence data. The goal is to identify transcripts and to place them in a genomic context. [55] TGI uses ESTs and coding sequences to assemble tentative consensus sequences. The ESTs are downloaded daily and cleaned of unwanted information. The genes and the ESTs are compared pair-wise using BLAST to identify overlaps. A cluster is formed when any two sequences form a 95% match over a 45 bp long sequence. [55] As in UniGene, previously used identifiers for a cluster are never reused.

2.4.5.2 Comparison to UniGene

There are two big advantages to using UniGene. The first one is that UniGene stores information about protein products as well, whereas TIG is only interested in genes. The other is that there seems to be more software using UniGene than TIG. Even TIGR themselves say that their clustering algorithm builds on publications from NCBI. [54]
[54]

2.5 Systems biology and genetic networks

Being able to predict how the body would react given the known current state would mean that all companies interested in drug design could save time on the experimental side (the results would still have to be validated, but the models would tell you what to validate). Saving time saves money. Today over 90% of the potential drugs entering clinical trials fail to make it to market, bringing the average cost of a new drug to $770 million. Today about 400 proteins are used as drug targets, leaving a couple of thousand that could be exploited if the technology allowed it. When those 400 proteins are analysed, it turns out that they all fall within the boundaries of just a few families. These families include GPCRs, kinases, proteases and peptidases. [39] This means that there are many families that could be investigated, and for that we are going to need more information, as well as better technologies.

Systems biology is the theory of going beyond the individual parts, such as genes and proteins, to investigate how they work together to form complex structures. Not until we have integrated multiple levels of biological information, such as genome sequences, proteomic analysis and microarrays, can we hope to get a global perspective on how the system works. When we have this data it will be possible to create computational models of these networks. The models could then be used to enhance our understanding of how the body works, as well as to create new methods for developing drugs, since it could be possible to test a drug on a computer before testing it on humans. [21] All these parts have to be investigated and understood before systems biology can become the "main stem in biological sciences in this century." [49]

It might be possible in 20 years time to take a piece of a patient's genome and analyse it, and the result may be your future health history. This would make it possible to give you preventive medicine, and that might mean that we could extend the human life-span. Another possibility is a change in how we make vaccines: what if it were possible to make a "super-vaccine" that would work for a group of diseases? [47]

Chapter 3 Results

3.1 Technology

I implemented the database within the existing database structure at SCGTI, which meant Oracle 8i. The tools that I have built are all written in Perl, since Perl is a script language and therefore facilitates fast development. Perl is also the language SCGTI were using for most of their web development.

Figure 3.1: The programs that I have written

3.2 My solutions for the problems in section 2.1.1

Buneman et al [11] suggested 5 things to do to make integration of biological data work (see section 2.1.1), and they are: transformation of the various schemas to a common data model, matching semantically related schema objects, integrating the schemas, transforming the data to a federated database on demand, and matching semantically equivalent data. I have done none of these. I do not have a common data model, therefore I have not matched my objects, neither have I integrated the schemas. I have not transformed the data in any way, nor matched the equivalent data (for instance the sample part). All this comes from the fact that I decided not to build a warehouse, and these recommendations are for that case. Nor did I decide to build a view integration, which also would have had to deal with these parts, although in a different way.
I did a combination, which meant that I had to deal with some of these issues and could ignore others; for instance, I had to consider naming clashes, but I did not have to solve them. Below are my solutions to the problems in section 2.1.1.

1. I decided to let the scope be all kinds of proteomic data but only microarray data on the genomic side, with possibilities for adding more data should the need arise. The scope also includes information about the projects that SCGTI are involved in and information about how to map from a gene to a protein.

2. I let the data stay in its original form. This means that any one part can be changed without affecting the others; it also means that standardised tools can be used without any changes having to be made. So I did not write any tools for getting or transforming the data in any way.

3. I have made sure that it would be easy to add more data to the model. This means that I do not have to answer the question; rather, I work around it, leaving it open and thereby allowing for flexibility.

4. Same as above.

5. Since the data is kept separate, I do not have to worry about the fact that identifiers are called different things, or that different kinds of data are called the same thing. The integration part is the only part that needs to know this.

6. This is a problem in my system. The demo queries that I have implemented are very complex. For the 52 proteins, the system has to make a couple of hundred simple queries to the database, 52 complex ones (matching the proteins against the genes), and it has to perform 104 t-tests. All this means that the demos are very slow. There are things that could be done to optimise the queries, but I did not consider that to be within the scope of this project.

7. MaxD and PEDRo got me a short way towards solving this problem. For MaxD there are already loaders and export functions implemented, and the same will probably be true for PEDRo, should it be accepted as the standard.

3.3 From theory to practise

3.3.0.3 UniGene

UniGene does not contain all biological information, and it is therefore not possible to map across all genes and proteins. Out of the 52 proteins that I am trying to map, all are found, which is a good sign.

3.3.0.4 MIAME and MaxD

MaxD is a database schema that follows the guidelines that MIAME specifies, and since it was already implemented at SCGTI I saw absolutely no reason to change it, although there are things that I am not totally happy with. One very simple thing is that there is no database schema on the MaxD website, which makes it hard to understand how the tables are connected. After reverse engineering the database, the resulting schema is the one shown in figure 2.2. Once I had this, it was much easier to understand how the tables were connected.

There are some discrepancies between the MIAME guidelines and MaxD, which is not good. It is possible to store all the information that MIAME requests, but in that case it has to be stored in the non-descriptive field "description". I say that this is wrong. This is even worse when you consider that it limits the possible uses of the system. The more information that is stored, the better it is from an integration point of view, since this means that there is more information that could be used for novel investigations, to learn new things. Then there is something that SCGTI are doing that I am not too happy with: they are not filling in the table "Gene".
This table could have been used as a lookup table that stores information about a gene, just as "Protein" does in PEDRo. The information in UniGene could be used instead, but some information is not captured there. The more information there is in this system, the more ways there are of integrating the different parts, and the more stable the system will be.

3.3.0.5 PEDRo

I used this schema as a foundation when I decided on the database structure. Once I had implemented the schema I started changing it, but every change has a good reason, so I am hoping that these changes will be accepted by Taylor et al, so that the new version that I am suggesting becomes the standard.

3.3.0.6 Link Integration, View integration and a Warehouse

I decided to use a "combination" of the above techniques. There are several reasons for this. The first is that I wanted the data in its original form; that would have meant using either a link integration or a view integration. But I also wanted the data in one database, so that I would not have to wait for external resources; this would have meant a warehouse. Then I also had to consider the fact that I wanted the user to see the "database" as one resource, which would have meant either a view integration or a warehouse. So instead of choosing one and sticking to it, I decided to pick ideas from the three approaches and implement a mixture.

I decided to let the user interface link to some external resources, such as NCBI and SwissProt, that are fairly stable. I included these links since they contain more information than is represented in the system. The Project part, which I built on top of MaxD and PEDRo, could be considered a view integration, because MaxD does not "know that PEDRo exists" and vice versa. It is in this layer that querying facilities, such as query.pl, statistic.pl and so on, are dealt with. The warehouse aspect is that all data is in the same database, and therefore internal, with no external dependencies, which makes the querying facilities quicker. So in a way I have used neither link integration, view integration nor a warehouse; in a way I have implemented a link driven integration, a view integration system and a warehouse. To conclude, I have picked the parts that I wanted from the three approaches to get a good mixture.

3.4 The Genomic Part

3.4.1 Getting the UniGene clustering information (all species.pl)

I have written a program that connects to the NCBI website every two weeks and downloads the new clustering information for human and mouse. Once the files are downloaded they are unpacked, and then a Perl program is called. This program parses the data files (it is in Perl because Perl is good at text matching, and there is a lot of that). The program reads through about 1000 MB of data and from it picks out information about the UniGene clusters that are used by SCGTI. I have limited the import to essential clusters only, since it is simply too much data otherwise. As it is, I have about 350 000 records of data. If I were to store all information, that would be a couple of million records, and SCGTI simply does not have the space, or the need, to store all this excess information. But this means that the program somehow has to know which clusters are used, and this is done through an external connection file. This file specifies the database path, the username, the password and an SQL statement. The SQL statements fetch a list of the GenBank accession numbers that are in the database. This means that if more tables are added, another row is simply added to this file so that those GenBank accession numbers are picked up as well.
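To make the mechanism concrete, the fragment below is a minimal sketch of how a program like all species.pl might read such a connection file and collect the GenBank accession numbers of interest. The file format, column name and connection details are illustrative assumptions, not the exact ones used at SCGTI.

    #!/usr/bin/perl
    # Sketch only: read a connection file where every row holds
    #   database-path|username|password|sql-statement
    # and collect the GenBank accession numbers returned by each statement.
    use strict;
    use warnings;
    use DBI;

    my %wanted;    # GenBank accession number => 1

    open my $fh, '<', 'connection.txt' or die "Cannot open connection file: $!";
    while (my $row = <$fh>) {
        chomp $row;
        next if $row =~ /^\s*(#|$)/;                  # skip comments and blank lines
        my ($dbpath, $user, $password, $sql) = split /\|/, $row;

        my $dbh = DBI->connect("dbi:Oracle:$dbpath", $user, $password,
                               { RaiseError => 1, AutoCommit => 0 });
        my $sth = $dbh->prepare($sql);                # e.g. "SELECT genbank_acc FROM ..."
        $sth->execute;
        while (my ($acc) = $sth->fetchrow_array) {
            $wanted{$acc} = 1;                        # remember every accession number
        }
        $dbh->disconnect;
    }
    close $fh;

    print scalar(keys %wanted), " GenBank accession numbers to look for\n";

The resulting set of accession numbers is then compared against the identifiers found in the downloaded UniGene files, as described next.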
To make it a bit more useful, I also let it read a file with GenBank accession numbers; this makes it possible for the system to have information about clusters that are not yet in the database. Since the system has to match these GenBank accession numbers against those in the data files, the information from the file is stored in hashes and then compared to the identifiers that are of interest. If one is found, it is written to a text file. Once the 200 000 records of UniGene clusters have been processed, the text files are loaded into the database using sqlldr (a program optimised for loading large quantities of data). First the old information is deleted and then the new is added; this makes sure that all information gets updated. Using sqlldr saves time, which is very useful considering that the program takes about 2 hours to run. This is mainly because the server runs out of memory when 2000 entries are saved to the hashes, so 1000 entries are processed at a time. If this number could be increased, I think the time required might be severely lowered.

The information stored for a cluster is the corresponding locus link, gene expression information, chromosome, GenBank accession numbers and protein identifier. The information is stored in three tables: UniGene, UniGene Mapping and UniGene Protein (see section 3.6.1). The stored information is then used to cross species (in BioLink), to cross analytes and to show experimental information in my programs.

3.5 The Proteomic Part

The proteomic part is an entire pipeline for storing, querying and analysing proteomic data. As can be seen in figure ??, the proteomic part is not independent; all programs are connected in some way to other programs. I decided to do it like this since the resulting structure makes it easier for the user, and good programming style should never take precedence over what the user wants.

3.5.1 The PEDRo schema and my changes to it

Here I had nothing at all to work with, so the first thing I had to decide was what information to store. For this I went to SIRCAMS (Scottish Instrumentation and Resource Centre for Advanced Mass Spectrometry, http://www.sircams.ed.ac.uk) and investigated what data they use and in what format they prefer it. When I started this part, I was only going to save the information connected with identifying a protein, so that I could just show how the integration was intended to work. After a while I realised that the platform would be much more useful if it were possible to store all information about a proteomics experiment in the database, and so I found PEDRo.

PEDRo is an experimental suggestion for a standard in the proteomics area. The tool that I have developed should be used for integration, which meant that I needed to make some changes to the schema. I changed it to better accommodate the possibilities of different high-throughput technologies. I also made some changes that allow it to be more flexible. I will describe the changes I made below, but for reference, the "original" schema was shown in figure 2.3.

3.5.1.1 SpotPeakList

This table connects a Spot to a PeakList. This makes it possible to state that a certain PeakList comes from a certain Spot.
It is a separate table because otherwise it would not be possible to connect the same spot to many peak lists, for instance with different ListProcessings connected to it, or some other difference.

3.5.1.2 DBSearch

To this table I added more information about the search, such as a search title, who performed it and when it was performed. This information could be of interest when comparing two results and trying to decide which one is the correct one. A database search may be resubmitted if the results are not quite right, or if new knowledge becomes available, and then this added information can be used to separate the searches.

3.5.1.3 DBSearchResults

I added this table since a DBSearch object can get many different results. For instance, if the same DBSearch is submitted to both Mascot and Ms-fit, then two results have to be associated with each DBSearch, but each has its own information. The information contained within this table is the program used (e.g. "Mascot"), how many results were found and how many peptide masses were submitted. It also connects to the table Chosen. It also has an annotation field, since biologists want to be able to comment on everything.

3.5.1.4 Chosen

This table connects DBSearchResults and ProteinHit. This makes it possible to identify more than one protein for each spot that is being investigated. The table also lets the user add a probability for the identification, saying how certain it is, and it too has an annotation field. This information could for instance also be used by an expert system that tries to identify proteins. The annotation field could then be set to "system" (or equivalent) and the probability could be calculated based on a set of rules for the system. Or the information could be set by one biologist to show that he/she is not too sure about an identification, and then another biologist could look at it and either agree or disagree. Another reason for this table is that each spot might actually be a mixture of proteins, and then a DBSearchResult has to be associated with many ProteinHits.

3.5.1.5 ProteinHit

The table named ProteinHit and the table named PeptideHit in the original model have now changed places, since one DBSearchResult can get many ProteinHits, and each ProteinHit consists of many PeptideHits. ProteinHit now also contains information about the hit number (from the program), the score, the score type (e.g. "ms-fit"), the molecular weight (mw), the isoelectric point (pI), how many peptides were matched and the total sequence coverage for these peptides.

3.5.1.6 PeptideHit

PeptideHit now stores information about the start and end positions of the peptide, as well as the mass that was submitted and the one observed, the delta, the number of missed cleavages and the sequence of the peptide.

3.5.1.7 Protein

The thing I have changed here is the format of the columns; in each case the column has been extended so that the proteins that I store will fit. Even though the sequence column has been extended from 500 to 3900 characters, I have one protein that cannot be saved, since its sequence is longer than this, but I have left that for now.

3.5.1.8 ListProcessing

Here I removed the column "background threshold" and created a new table, Threshold, instead. I added a column to the table that stores information about the software, such as name and version. The software could affect both the results and the output format, so this information needs to be kept for future reference.
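As an illustration of these additions, the following is a minimal sketch of what the new linking table Chosen could look like when created from Perl. The connection details, column names, types and sizes are my own guesses for the example, not the exact definition used in the platform.

    #!/usr/bin/perl
    # Sketch only: create the linking table Chosen (DBSearchResults <-> ProteinHit).
    # Connection details, column names and sizes are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:pedro', 'username', 'password',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE Chosen (
            dbsearchresult_id  NUMBER        NOT NULL,  -- which DBSearchResults row
            proteinhit_id      NUMBER        NOT NULL,  -- which ProteinHit row
            probability        NUMBER(3,2),             -- how certain the identification is (0-1)
            annotation         VARCHAR2(500),           -- biologist comment, or e.g. 'system'
            PRIMARY KEY (dbsearchresult_id, proteinhit_id)
        )
    });

    $dbh->disconnect;

A row in this table is what an expert system, or a second biologist, would update when agreeing or disagreeing with an identification.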
3.5.1.9 Threshold

This table is added so that it is possible to set different cut-off values (i.e. thresholds) in different parts of the spectrum. This is useful since there is less noise in the higher mass ranges, and the threshold in that region can then be set lower than in the lower range. To make sure that it is possible to split the spectrum into as many ranges as the user wants, I decided to add another table instead of just another column in ListProcessing. This leads to increased flexibility but also to increased complexity. Threshold contains information about the start and end positions of the section and the threshold value.

3.5.1.10 CutOffValue

This table is also new, and it contains a list of the peaks in the spectrum that are not going to be included in the resulting mass list. This could for instance be all tryptic peaks, if you know that you used this enzyme to cut the protein and that they are likely to have contaminated the sample. This list could also contain information about peaks that you want to ignore when searching for the protein. For instance, it is very common to get keratin contamination in the samples.

3.5.1.11 MassSpecExperiment

Here I added a column, "massspecmachine", that connects this table to a certain MassSpecMachine. This connection needs to be made since it is not 100% certain that the same machine will be used in all experiments, though it is likely. If the machine differs, then so could the results, since the format and detection range will probably be different. The only reason for not having the connection must be that they are assuming that you only have one machine, and that is probably right. But that machine will probably be upgraded sooner or later, and then you have to be able to see which experiments were made with the new one and which were made with the old.

3.5.1.12 The resulting model

When all these changes have been made, the new model looks as follows.

Figure 3.2: The modified tables from PEDRo

My hope is that this model will be more robust than the original, and also that the changes that I propose will be accepted by PSI (Proteomics Standards Initiative, http://psidev.sourceforge.net) as standards for how to capture data in a proteomic experiment.

3.5.2 Performing a peptide mass fingerprint

To perform a peptide mass fingerprint search, a list of mass values for the interesting peptides has to be specified. Today SIRCAMS are using both Mascot and Ms-fit as search engines. Based on the results from both, they decide upon the identity of the protein. This step of searching and identifying is made easier by my system, since search.html (see below) accepts information about the search and then sends it to both Mascot and Ms-fit. The user can choose to specify the list of masses by hand, or to connect the search with already existing information through MassSpecExperiment and PeakList (see below).

Figure 3.3: The Search-page

The resulting pages will look exactly like the result pages from the programs (in fact they are), except for a button for storing the result in the database. Pressing this button will save the information, and it will then be possible to show the result using show.pl. This page can show just the result from Mascot, or just the result from Ms-fit, but it can also show the results from both searches at the same time.
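The double submission that search.html performs could be written along the following lines with LWP::UserAgent. The URLs and form-field names below are placeholders, since the real CGI parameters of Mascot and Ms-fit are not reproduced here; this is a sketch of the idea rather than the implementation.

    #!/usr/bin/perl
    # Sketch only: send the same peptide mass list to two search engines and
    # keep both result pages. URLs and form-field names are placeholders.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(POST);

    my @masses   = (842.51, 1045.56, 1179.60, 2211.10);   # example peak list
    my $masslist = join "\n", @masses;

    my %engines = (
        mascot => 'http://mascot.example.org/cgi/search.pl',
        msfit  => 'http://msfit.example.org/cgi/search.cgi',
    );

    my $ua = LWP::UserAgent->new(timeout => 300);
    my %result_page;

    for my $name (keys %engines) {
        my $response = $ua->request(POST $engines{$name}, [
            masses   => $masslist,       # the peak list
            enzyme   => 'Trypsin',
            database => 'SwissProt',
        ]);
        die "$name search failed: " . $response->status_line
            unless $response->is_success;
        $result_page{$name} = $response->content;   # shown side by side, stored on request
    }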
This approach has one big advantage over the way the programs are used now: it saves time. It saves time because the user does not have to choose the right variables twice, does not have to wait for the results from two web pages, and does not need to cut and paste the mass list, since it can be taken from the system automatically. There is another benefit too: the information from Mascot and Ms-fit can be shown next to one another, giving the user more information when making the identification. This should hopefully lead to more accurate identifications. Below is the result from one such search, and as you can see, the results from the two programs differ substantially and suggest different proteins. If I had only searched using one of the resources, I would probably have assumed that the protein the program suggested was the correct one, considering the score it got, but with this knowledge I wonder.

Figure 3.4: The combined result from Mascot and Ms-fit

3.6 The Integration Part

3.6.1 The Project schema

The database is divided into three separate table-spaces: Project, MaxD and PEDRo. Even though PEDRo and MaxD could easily be integrated (since they to a certain extent contain the same information), I have chosen to keep them separate and to duplicate the information. As I have already mentioned, this is because I keep the data "standardised". There is another reason as well: this approach is more correct than integrating the data, since an experiment in biology really is a project, a project consisting of different parts such as a microarray experiment, a 2D-gel experiment, sample preparation and so on. With this data structure that relationship is reflected in the database. This is also what makes it easy to add new kinds of data: all you have to do is consider the data a new "kind of experiment" and it will fit in with the existing structure.

ProjectExperiment stores information about what kind of experiment is being performed, through experiment analyte (e.g. DNA, RNA or protein) and experiment technology, which for instance could be microarray, Western blotting or clinical trial. Another benefit that comes from adding another level on top of PEDRo and MaxD is that it made it possible to add keywords to both projects and experiments. This table will store keywords from the MeSH, GO and MGED terminologies. This adds another dimension to the project/experiment at hand, a dimension that can be queried to find out what occurrences/species SCGTI are interested in.

Figure 3.5: The project part

3.6.2 Query.pl

This is the Perl module that contains all the SQL queries that query both sides at the same time. This means that the complex queries are gathered in one place, making the system more robust and easier to maintain. This module could therefore be said to be the view integration part of the system.

3.6.3 Statistic.pl

Collecting all statistical tests in one module makes it easy to see which are implemented and which are not. It also makes it easy to update an algorithm, since this is the only module that needs to be updated. I have only implemented the Welch t-test (assuming unequal variance), since it is the only one that my demo programs use. This module needs to know whether it is going to query the MaxD or the PEDRo database, since it needs to know which information to request from query.pl. A log-transformation is made on the data since I only have 4 uninfected samples and 4 infected. A log-transformation will smooth the intensities so that an outlier will not have quite the same influence on the data.
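For reference, below is a minimal sketch of the Welch t-test as statistic.pl could compute it on log-transformed intensities. It returns the t statistic and the Welch-Satterthwaite degrees of freedom; the p-value would then be looked up in a t-distribution, for example with a CPAN module. The example intensity values are invented.

    #!/usr/bin/perl
    # Sketch only: Welch t-test (unequal variances) on log-transformed intensities.
    use strict;
    use warnings;
    use List::Util qw(sum);

    sub mean_var {
        my @x    = @_;
        my $n    = scalar @x;
        my $mean = sum(@x) / $n;
        my $var  = sum(map { ($_ - $mean) ** 2 } @x) / ($n - 1);   # sample variance
        return ($mean, $var, $n);
    }

    sub welch_t {
        my ($a, $b) = @_;                          # two array refs of raw intensities
        my @loga = map { log($_) } @$a;            # log-transform to damp outliers
        my @logb = map { log($_) } @$b;
        my ($m1, $v1, $n1) = mean_var(@loga);
        my ($m2, $v2, $n2) = mean_var(@logb);
        my $t  = ($m1 - $m2) / sqrt($v1 / $n1 + $v2 / $n2);
        # Welch-Satterthwaite approximation of the degrees of freedom
        my $df = ($v1 / $n1 + $v2 / $n2) ** 2 /
                 (($v1 / $n1) ** 2 / ($n1 - 1) + ($v2 / $n2) ** 2 / ($n2 - 1));
        return ($t, $df);
    }

    # 4 uninfected and 4 infected samples, as in the demo data set (values invented)
    my ($t, $df) = welch_t([1520, 1710, 1490, 1650], [2890, 3100, 2750, 3300]);
    printf "t = %.3f, df = %.1f\n", $t, $df;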
3.6.4 Plot.pl

This is a web page that will plot three different graphs. The first graph plots information for those genes and proteins where both expression levels have changed in a statistically significant way. This is one way to visualise which genes and proteins it might be interesting to examine further. I designed it since I thought it could be interesting to see whether the changes correspond, i.e. if the gene changes 3-fold in the positive direction, will the protein also change 3-fold in the positive direction or not? Either way, this data could be used as a basis for understanding the networks.

Figure 3.6: The plot showing proteins vs genes

Plot two will show those proteins that have changed where no statistically significant change in gene expression could be detected. This could be interpreted as a post-translational change, but we know far too little about this to say anything at all about these proteins. (See section 3.8 for a discussion about this.)

Plot three will show those genes where no corresponding change in the protein level could be detected. This could be used to determine which proteins are affected by a pathway, rather than by a single gene.

3.6.5 ShowTable4.pl

This page shows more or less the same information as plot.pl, just in a different format. The information for the expression levels and the identifiers is displayed in a table. In the proteomic and genomic columns there are 4 possibilities. Either a statistically significant change has occurred, or there is no change, in which case "NS" is shown. It is also possible that the analyte was not expressed in the uninfected samples, only in the infected ones, in which case "Added" is displayed. Finally, the analyte may have ceased expression on infection, in which case "Disappeared" is shown.

3.6.6 Identifier.pl

This was my first attempt at bridging between the three parts. The page allows the user to choose to search for a UniGene cluster id, a GenBank accession number or a SwissProt identifier. If you choose to search for a UniGene cluster, then all information known for that cluster will be shown (see below). Underneath this information, a list of the projects where this identifier has been used is shown. Since you chose to search for a cluster, both genomic and proteomic experiments are shown. Should you instead choose to search for a GenBank accession number, then the information known for that accession number is shown, together with the genomic experiments where it has been used. The corresponding information is shown for a SwissProt identifier.

Figure 3.7: The result for UniGene cluster Mm.14046
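The kind of bridging query that identifier.pl and query.pl rely on could look roughly like the sketch below. The table names follow the text (UniGene, UniGene Mapping and UniGene Protein, here written with underscores), but the column names, connection details and exact joins are assumptions made for the example.

    #!/usr/bin/perl
    # Sketch only: map a GenBank accession number to its UniGene cluster and
    # to the SwissProt identifiers stored for that cluster.
    # Table and column names are assumptions for illustration.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:project', 'username', 'password',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        SELECT u.cluster_id, u.chromosome, p.swissprot_id
        FROM   UniGene u, UniGene_Mapping m, UniGene_Protein p
        WHERE  m.genbank_acc = ?
        AND    m.cluster_id  = u.cluster_id
        AND    p.cluster_id  = u.cluster_id
    });
    $sth->execute('D89667');     # any GenBank accession number

    while (my ($cluster, $chromosome, $swissprot) = $sth->fetchrow_array) {
        print "cluster $cluster (chromosome $chromosome) maps to protein $swissprot\n";
    }
    $dbh->disconnect;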
3.7 Testing and evaluation

3.7.1 Alan Pemberton

I demonstrated the platform and the system to Pemberton. I showed him how the platform is conceptually built up of three different parts that are connected through tables that get updated every other week. I also showed him the structure of the data, though the emphasis of the demonstration was rather on giving him an understanding of the system, letting him make suggestions about what the platform could be used for, and letting him comment on whether any information was missing or whether something was "wrong" with the system. He did make some suggestions, such as connecting it to GenMap to add pathway information to the system. I showed him the integration part and explained how it was supposed to be used and interpreted. He seemed to like the idea of plotting the information, making it possible to survey the results graphically. He seemed interested in becoming involved in the further development of useful tools for the platform.

3.7.1.1 Comparison with his results

The result I get differs from the one that Pemberton got. This could have many causes. We could be using different t-tests, which would give different results. Another reason could be the fact that I log-transform my intensity values, but since this should smooth the results rather than make them more extreme, it is surprising that I get more statistically significant changes than they do. I do not know the reason for the results differing, so this is something that has to be determined before the statistics module is developed further.

3.7.2 Perdita Barran

I showed her what I had done, and her response was: useful! She did have some comments about one of the fields in search.html (figure 3.3). She suggested that I ignore the fact that the plotting takes about 30 minutes to run, and simply tell the biologists that the program takes that long to run. I am supposed to come back in October/November and present it to everybody in mass-spec, which has to be high praise.

3.7.3 Results from a UniGene cluster multiple alignment

One interesting control that can be made is to take a cluster from UniGene at random and do a multiple sequence alignment on the GenBank accession numbers that make up the cluster. I did the multiple alignment for cluster Hs.288856, Prefoldin 5. For each of the 9 GenBank accession numbers in it, I got at least 3 of the others back in the result from the EBI FASTA server (http://www.ebi.ac.uk/fasta33/). In each case the scores for the sequences were very good, with E-values around e-100. For each alignment there were two other sequences, AX590145 and D89667, that also got very good scores. I then looked at the names of the sequences, and 5 out of the 9 were named prefoldin 5. The other 4 I compared to these 2 using NCBI's Sequence Viewer (http://www.ncbi.nih.gov/entrez/). The result of this comparison was that AX590145 was obviously not similar to the others, but I did think D89667 looked very similar; all 5 are c-myc binding proteins and from the MM-1 gene. Then I started comparing the sequence matching manually, and I found that between D89667 and the others there were no mismatches in the sequences, though there were some at the beginning and the end of the sequence, meaning that it probably does not fulfil the "dove-tail" demand. I am happy with the result of this alignment.

3.8 TestCase - A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice

I have used this dataset to show how the system could be used to gain new knowledge. I will discuss the aim of the experiment, the methods used, the results obtained and the conclusions that the authors made.

3.8.1 Background

Before the study commenced, it was known that when this parasite infects mice, a combination of different immune responses works together to eliminate the infection. The choice of Trichinella spiralis, which is a nematode, was based on the fact that it causes both profound and stereotypical pathological responses that are common to many nematode infections. The pathology is most marked at the time of worm expulsion, which often takes place 14 days after infection.
3.8.2 Aim

The aim of the study was to use transcriptomic and proteomic methods to study the gene and protein changes that occur when the worm is being expelled. The hope was to show some of the pathways that are involved in this function. The authors' [41] only interest was the state of the pathways at the time of expulsion and how it differs from the state before the infection, so there are only two data sets.

3.8.3 Method

Eight adult female mice were infected with Trichinella spiralis. The results were later compared to an uninfected control group. DNA microarrays were used to examine the genes, and from this a list of significantly up- or down-regulated genes was compiled. Then 2D-gels were used to see which proteins had changed. This list of 52 proteins was then examined with peptide mass fingerprint searches to find the proteins that those peptides build up. The Student's t-test was used to determine which changes were interesting, given the relative spot volumes for the proteins.

3.8.4 Results

In paper [41] they discuss their results and conclude that some of the results correspond to information already known and some do not. They were hoping to show how the expression of genes and proteins in these pathways correspond, but since they only found one case where this happened, this hope was not fulfilled. They conclude that any changes in protein levels that have no corresponding changes in gene-expression levels could come from post-translational modifications, rather than from a change in the expressed genes.

2671 genes were detected in both the uninfected and the infected samples. Of these, 10% were notably increased on infection and 13% were decreased. 10 genes were switched off in the infected sample, and 58 were switched on. In the project they compared the effects of two separate Trichinella spiralis infections on the 2D profiles: 2-9% of the spots matched to control gels were notably up-regulated, and 10-14% were down-regulated. Significant changes in the protein expression levels were observed for 16 spots in total.

The changes observed by proteomics were then compared to the changes observed in transcription of the corresponding genes. In only 1 case out of 11 was a change in gene transcription (-2.8 fold) accompanied by a significant alteration in the corresponding protein level (-4.0 fold). The other 10 showed no statistically significant change. The table below is just a selection of spots, showing what he chose to highlight as interesting.

Gene    Identity                        SP       Peptides    Proteomics   Microarray
ATP5B   ATP synthase beta chain...      P56480   15 (34%)    +7.2         -
ITLN    intelectin (spot 1)             O88310   5 (14%)     new spot     NS
LAP3    cytosol aminopeptidase          Q9CPY7   6 (12%)     -4.0         -2.8
PDIA1   protein disulphide isomerase    P09103   12 (26%)    NS           absent
PGAM1   phosphoglycerate mutase 1       Q9DBJ1   11 (42%)    NS           +1.4
PKM2    pyruvate kinase, M2 isozyme     P52480   12 (25%)    spot lost    NS
PKM2    pyruvate kinase, M2 isozyme     P52480   20 (43%)    spot lost    NS
PNLIP   pancreatic lipase (rat)         P27657   6 (19%)     new spot     -

Table 3.1: The results that Pemberton et al got [41]

3.8.5 Conclusions

They conclude that there were significant changes occurring in the gene expression profiles at the time of worm expulsion, and that some post-translational modifications could be detected as well. These genes include those that are known to be involved in the immune system. There are some drawbacks with both microarray technologies and 2D-gels. For microarrays it is a question of detection level.
Signals below a certain cut-off will not show up, and important information will be lost. Though 2D-gels allow you to look directly at the protein level, they restrict the range of the pI and the molecular weight of the protein. The choice of extraction method will also narrow the detection range. Together this means that up to a third of the proteins will never be found, and only the most abundant ones can be identified. In conclusion, the authors did, with the help of these methods, find some changes in the expression levels, but they were mostly unrelated.

3.8.6 My comments

With the technologies that are available today (and by available I mean not too expensive and fairly robust), it is more or less possible to investigate the entire genome in one project, but it is far from possible to investigate the entire proteome. That is why Pemberton et al only investigated 52 proteins. Not until new technologies are invented, such as protein arrays, will it be possible to really start looking at organisms from a systems biology approach. When that becomes possible, this platform needs to already be in place.

Chapter 4 Discussion

4.1 Did I achieve my goals?

Four of the goals were achieved; two others were only partly achieved.

4.1.1 Standardisation

It is standardised since it uses MaxD and PEDRo. MaxD encompasses all information specified by MIAME, which is the standard for storing information about microarray experiments. PEDRo is part of PSI. PEDRo was changed to improve support for high-throughput technologies, among other considerations, so that part is not totally standardised.

4.1.2 Public

It is right now open to everybody within the University of Edinburgh, but the idea is that it will be password-protected so that only the people who enter data into the database can see it, and then only their own data. The tools themselves are public, everything except the passwords/usernames for the database, that is. The only limitation is that without a similar structure the programs cannot be used, but if anybody is interested in the programs, it is likely that they already have a similar database structure.

Figure 4.1: The welcome page for the system

4.1.3 Web-based

All parts, except for all species.pl, are accessible through the web, through the page welcome.html shown in figure 4.1. It would not be appropriate for all species.pl to be web-based, since it is not something the user starts, nor something that processes data that should be displayed to the user. The program is a text parser, and as such it should run on the server, hidden from view. I would not even like to think about how long it would take the program to complete a run if it were web-based.

4.1.4 Scalable

All the connection information for the database is in one file, "connectDB.pl", so this will not be hard to change. Simply change the machine name in the connection string and all programs will automatically connect to the new database instead. MaxD can only be used for microarray data, so any other kind of genomic experiment would have to be added to the database structure, but with the Project, ProjectExperiment and Experiment structure that should be easy. The table AnalyteProcessingStep in PEDRo makes it possible to tie any kind of proteomic experiment into PEDRo.

4.1.5 Modularised

The genomic side really consists of one single program, "all species.pl", and the corresponding cron job.
On the proteomic side there are many programs calling one another. Each program is responsible for one well-defined function, so this makes it modularised, though a bit harder to oversee. The reason for connecting them is that it makes it easier for the user, and programming goals are never allowed to supersede the needs of the user. With the help of query.pl and statistic.pl the integration part is fairly modularised, and all connecting information is within the same boundaries, making any necessary changes easier to perform and the whole system easier to maintain. These two "modules" could be considered the glue that ties all the different parts together.

4.1.6 Up-to-date

The whole system depends on the information that gets retrieved in the genomic part. Today this information is updated by manually calling a script.

4.2 What would I have done differently?

Given more space to store the database, I would probably have stored all information for UniGene, UniGene Mapping and UniGene Protein. Now I can only store information about the ones that we are using, and therefore the tool is a bit restricted as far as mapping goes. Had I had more space, that part of the system would have been easy to write, but now I have had some real problems with it. I store the needed information in hashes, and the system gets out-of-memory errors before 2000 items have been stored, so my program stores 1000 at a time, saves the relevant ones to the file, empties the hash, and then stores the next 1000, until all 200 000 records have been processed.

4.3 What would I have done if I had had more time?

• I would definitely have got the cron job to work, or worked around it somehow.

• I would have put more time into query.pl and statistic.pl, so that they would have been even more encapsulated and object-oriented. Now the same information may be in two different places, which is not ideal.

• Even though I was aware of the difficulties from the beginning, I did hope that I would have more time to implement different analysis tools and different statistics tools, but I did not. If I had known from the beginning that I would have more time, then I would have read and tried to understand more statistics and analysis steps.

• I have absolutely no idea why msfit.cgi all of a sudden decided to crash, and never worked properly again after that day. What happens is that the function (LWP::UserAgent) that sends the information to the web page returns 400 OK, but the result set is empty, meaning that the user never gets to see the result.

4.4 Future work

• Change the sequence field in PEDRo.Protein to a Blob or equivalent.

• One of the first things to add to the database structure will probably be clinical data, since it is at the foundation of what SCGTI are doing.

• Add the terminologies to the keyword table. I have not done this since the three terminologies are pretty extensive, mapping from one to the other two by hand would take a lot of time, and it is not included in my project scope.

• It would probably have been interesting to look at genomic technologies other than microarrays, just to see what is out there, and maybe add one to the database to show how it could be done.

• There are many exciting things happening in the area of proteomics. This development should be monitored, and when a technique is considered stable enough, it should be added to PEDRo as a new AnalyteProcessingStep.
• If data loaders are not made for PEDRo, then they need to be made. The reason I am talking about data loaders is that today SCGTI has a designated curator who enters all data; therefore the program will not need a "flashy" interface, it should just be a simple tool for adding a large collection of data to the database.

• The platform may be in place, but there are many different kinds of analysis tools that "should" be implemented once the platform is in place. These include automatic normalisation tools, different statistical tests, different visualisation tools and so on. One of the first things to add in this area is the capability to ask those questions that I raise in step 6 in section 1.4. The next is to make it possible to search on keywords.

• Another possibility would be to write a table SpotMeasurement in the Project tablespace. This table would map a spot from a 2D-gel experiment onto a measurement from a microarray experiment. This would be a mapping that maps across the boundaries for cases where the identity is not known.

• When all this is done, genetic networks should be modelled using the data that is integrated with this platform.

Chapter 5 Summary

I have designed and implemented a platform (in Oracle 8i) for the integration of genomic and proteomic data. I have also written a couple of programs for storing, querying and visualising the data. The integration bridge depends on data that gets updated every second week, meaning that the results may change from time to time and that the system is dynamic. The platform accepts information about projects, microarray experiments and proteomics experiments. It is possible to add other types of data to the platform without changing the existing structure.

The platform is intended to be able to capture all biological data that is needed for an experiment in systems biology. This is a new area of biology that looks at organisms as systems rather than as well-defined, isolated parts. The problem is that it is not known what data will be needed for this, so the platform has to be able to grow and encompass new kinds of data. I believe that I have succeeded in this; I think that it is relatively easy to add, change or delete a module from the system. The only part that will be affected is the integration bridge. This part has to know how to get information from all parts of the database; the others do not.

I have designed and implemented a demo program that can plot relevant changes in gene-expression levels against those in protein levels. This plot could be used as a first tool when trying to understand how the networks are connected, or to decide which genes and proteins should be studied further.

Appendix A - Biological definitions

This project is about bioinformatics, the discipline of using computers as a tool to gain knowledge from biological data. This dissertation is going to be read by people with no background in biology, but since it demands certain biological knowledge I have included this appendix for those people who are not entirely sure what the difference is between the genome and the proteome. It could also be used as a "guide" to how I have used the terms (this since I am convinced that if you asked five biologists for a definition, you would get five different answers back). Terms included in this appendix are in italics so that they are easily noticed.
The appendix is organised alphabetically rather than logically, so that it is easier to find the term you are looking up. In this glossary I have included a short description of one of the most fundamental concepts for this dissertation, the central dogma of molecular biology. It will tell you why DNA microarrays will not reveal the entire picture and why it is interesting to look at the combination of gene and protein expression.

A.1 The Central Dogma of Molecular Biology

There are 3 steps involved in this process, and they are:

1. The replication phase. Here DNA (a double-stranded molecule) gets duplicated, i.e. a perfect copy is made.

2. The transcription phase. By using DNA as a blueprint, a single-stranded RNA sequence is created, preferably a perfect copy.

3. The translation phase. Proteins are made from RNA. This is done by translating 3 consecutive bases (a codon) into an amino acid. These amino acids get linked together to eventually build up the proteins. [17]

In short, the central dogma of molecular biology says that our genetic material is stored in DNA, and that DNA is a static occurrence. In spite of this, the "reflection" of DNA in our bodies is quite different; for instance, compare our toes with our hair. The DNA gets transcribed into RNA, or more specifically mRNA. The same mRNA is not always made from the same DNA; there are at least six different possible mRNAs for each DNA since there are six open reading frames (ORFs). The genetic code works in groups of 3 (the above-mentioned codon), and transcription can start at site 1, 2 or 3, but it is also possible for the transcription to work "backwards", adding another 3 possibilities, totalling 6. If you also take into consideration the fact that pre-mRNA can be spliced (i.e. cut) at different locations in the chain, the result is even more different mRNAs.

Next the RNA gets translated into proteins. This happens by taking a codon from the mRNA and translating it into the corresponding amino acid. The genetic code is redundant, meaning that even though you only have 20 different amino acids you have 64 (4^3) possible codons, so some codons code for the same amino acid. There are also a start codon and stop codons that indicate where translation of the mRNA starts and where the finished protein ends. Once the protein is created and released, it becomes modified in different ways so that it can "act" the right way in the right place. Some parts of the chain may be cut off, functional groups may be added, and it can fold up and/or form different 3-dimensional structures (the structure of a protein will determine its function). When all this is done, the protein is transported to the target location, and there it can have different effects depending on where in the body this is. [17]

The central dogma of molecular biology is thus the reason for there being a big dimensionality issue in biology, as seen in figure 1.4.6.1.

A.2 Amino acids

Amino acids are the building blocks of peptides and therefore also of proteins. An amino acid is built up of five parts. In the middle there is a carbon atom (C); to this an NH2 group (an amine group), a COOH group (a carboxyl group) and a hydrogen atom are bound. Apart from this there is also a variable side chain (R). It is in the side chain that the amino acids differ. [24]

A.3 Bases

A base pair consists of two nucleotides: an A is bound to a T, and a G to a C, through hydrogen bonds.
Since the two pairs, AT and GC, are of equal length, the diameter of the DNA molecule is uniform. [30]

A.4 Comparative genomics

The effort to sequence (and compare) different species. [3]

A.5 DNA

DNA (DeoxyriboNucleic Acid) is perhaps best known as the storage site for all genetic material, the unit that transfers the parents' traits to the children. [30] The DNA contains all information needed to build up every single cell in our bodies. [30] There are two different ways of transferring DNA from one generation to the next: cloning, where the DNA gets transferred without any changes, and sexual reproduction, where the genetic material from the mother is crossed with that of the father, forming a unique blend of their genetic material in the offspring. [43]

"Chemically" DNA is a double-stranded molecule that consists of nucleotides. There are four different nucleotides used to build a DNA molecule: A, C, G and T. These nucleotides form long sequences that are bound together in such a way that DNA is usually in the form of a double helix. [30] (p. 100)

A.6 Gene

Genes can briefly be described as the protein-encoding sequences in the DNA [30]. Genes can be on or off; a gene that is on is said to be expressed, and that means that the protein the gene codes for is being produced. Since it is the proteins that are responsible for the functions in our bodies, it is very interesting to know which genes are expressed and which are not. By staying on, the gene makes sure that its product, the protein, continues to be produced. [30] Gene expression levels can therefore be used as a measure of which proteins are needed in a given situation. By using gene expression levels as an indicator it is possible to understand which genes are involved in which diseases. This knowledge might lead to new drug targets, since the pharmaceutical industry then will know which proteins are needed. Today it is mostly proteins that are used as drug targets.

A.7 Genetic diseases

There are in principle two kinds of genetic diseases: simple and complex. The simple ones are those where a single molecule (often a gene) is solely responsible for the entire disease state. This is not very beneficial from an evolutionary perspective, so the number of simple genetic diseases is, in comparison, very low. The complex ones are those where a genetic network cooperates to create the disease; this group includes cancer, heart disease and diabetes. (This leads to the conclusion that a genetic network is an evolutionarily stable form of organisation.) In these diseases no single gene can affect the disease on its own, so it is likely that more than one mutation has to occur before the disease breaks out. Genes may contribute to the disease to different degrees, and they are influenced by their environment to different degrees too. "Unravelling these networks of events will undoubtedly be a challenge for some time to come, and will be amply assisted by the availability of the sequence of the human genome." [33]

A.8 Genetic network/pathway

I am taking genetic networks as an example of what this platform could be used for, since it is something that researchers are interested in today, and since it shows the way that DNA, RNA and proteins cooperate to build up our bodies.
A genetic network can best be described as the network or circuit that describes the interactions of the genes in a particular system. These networks differ in size; they can be anything from tiny (2-3 interactions) to huge (tens of thousands of interactions). [8] Networks are responsible for the most complex forms of biological phenomena, including cancer. They are a collection of genes that (when expressed) have a common biological function. This means that a gene from a network cannot perform the function on its own; it needs the others, and it needs them to different degrees (the weights of the network). Networks could also be used as a predictive source for how individual genes will act under certain circumstances: since you know how the network influences the gene, you will be able to predict how the gene will behave. This information could possibly be used to predict whether a person will get a specific disease or not.

A pathway is a "sub-network", i.e. a group of genes highly interconnected with each other, but not as connected to the "outside" world (outside meaning outside that pathway). Since they usually are a lot smaller than the networks themselves, they are often easier to model. But it is highly unlikely that they would not exchange information with one another, which means that you have to be aware that there is information missing when drawing conclusions based on pathways instead of networks. [2]

The term genetic network really encompasses much more than genes; it refers to all those things affecting genes as well, such as proteins, mRNA and other small molecules such as hormones and ions, as long as they are connected.

A.9 Genome

The genome is the entire genetic material of a specific organism. [38]

A.10 Genomics

The goal of genomics is to determine the complete DNA sequence for the entire genome. Functional genomics wants to determine the function of the proteome, and structural genomics is the systematic effort to put together a complete structural description of a defined set of molecules. The ultimate goal of the combined effort of these is to have the entire genome and proteome mapped. [38]

A.11 Mass spectrometer

Mass spectrometry is an analytical technique that is used for 3 different purposes: the identification of unknown compounds, the quantification of known materials and the interpretation of physical and structural properties of ions. [45] In this project mass spectrometry has been used to identify unknown substances. This is done by choosing an interesting sample (for instance a spot from a 2D-gel). The protein in the sample is cut up into peptides (usually using trypsin as the enzyme) and subjected to a mass spectrometer. The instrument then determines the weight of these peptides and produces a list of peptide weights for the protein. These weights are then matched against known peptides, to try to decide which peptides are present. From this list a protein is suggested; this process can be done by either Mascot (http://www.matrixscience.com) or Ms-fit (http://prospector.ucsf.edu), and the technique is known as peptide mass fingerprinting.

A.12 Microarrays

Microarrays are a technique used to gain a "snapshot" image of the levels of thousands of mRNAs in a biological sample at a given time. The mRNA "profile" is just an indirect measure of the gene activity, [9] because if a gene is on it is producing the corresponding protein, and to do that it has to create mRNA.
A.13 Molecule

A collection of atoms that have strong enough bonds between them not to fall apart (at least not easily). It is the smallest building block of any substance, and it determines the properties of that substance. [44]

A.14 Nucleotides

Nucleotides are the building blocks of DNA. The nucleotides in DNA are in turn built up of three parts: a sugar (in this case deoxyribose, giving the molecule its name), a phosphate group and a nitrogen-containing base. There are four different bases in DNA: adenine (A), guanine (G), cytosine (C) and thymine (T). The first two are called purines, and the last two pyrimidines. The amount of purines in the DNA is equal to the amount of pyrimidines. [30] The nucleotides are also known as bases.

A.15 Peptide

A peptide is a short chain of amino acids linked by peptide bonds. Longer peptides are called polypeptides, and even longer ones are proteins. Peptides are usually only 20-30 amino acids long. [30]

A.16 Protein

Proteins are the molecules that act as building blocks in our bodies. They are responsible for forming most of our body structure, such as skin and hair, and for producing most substances, such as enzymes and antibodies. [42] They are also responsible for the metabolism and make sure that every reaction in our bodies happens at the right time, in the right location. [20] In short, they are responsible for more or less everything, except for the skeleton, the spinal cord and some parts of our brains. The importance of proteins was clear from the very beginning, which is understandable when you consider that the word comes from the Greek "proteos", meaning the first or the most important. [40] Proteins are formed by chains of amino acids, varying from two to thousands of amino acids. Since there are 20 naturally occurring amino acids, there is an essentially unlimited number of possible proteins: a protein N amino acids long gives 20^N possibilities.

A.17 Protein arrays

With protein arrays it will be possible to screen thousands of proteins simultaneously, just as it is for microarrays. Protein arrays could be used for many different things, such as detecting the presence or absence of proteins, investigating protein expression levels, determining interactions and functions for specific proteins, or looking at protein interactions: protein-protein, protein-antibody or protein-drug. [13]

A.18 Proteome

The proteome is the protein complement that is coded for by the genome. Although the genome is static (except for mutations), the proteome is dynamic. The proteome at any given time depends on many different factors, of which the surrounding environment is one. [38] The complexity of the proteome far exceeds that of the genome, since a gene can code for many different proteins. So far we do not know exactly how many different proteins there are, but taking into account different splice sites, reading frames and different post-translational modifications, one guess is about 500 000 proteins. [13] Another estimate comes from the Human Genome Project [22]: they think that the human proteome should be at least 10 times larger than that of the fruit fly (Drosophila) and the roundworm (C.elegans), which would mean around 300 000-380 000 different proteins.

A.19 Proteomics

Proteomics is the subject of quantifying the expression level of the proteome at any given time.
A.17 Protein arrays
With protein arrays it will be possible to screen thousands of proteins simultaneously, just as microarrays do for genes. Protein arrays could be used for many different things, such as detecting the presence or absence of proteins, investigating protein expression levels, determining the interactions and functions of specific proteins, or looking at interactions in general: protein-protein, protein-antibody or protein-drug. [13]
A.18 Proteome
The proteome is the protein complement that is coded for by the genome. Although the genome is static (except for mutations), the proteome is dynamic. The proteome at any given time depends on many different factors, of which the surrounding environment is one. [38] The complexity of the proteome far exceeds that of the genome, since a gene can code for many different proteins. So far we do not know exactly how many different proteins there are, but taking into account different splice sites, reading frames and post-translational modifications, one guess is about 500 000 proteins. [13] Another estimate comes from the Human Genome Project [22]: they think that the human proteome should be at least 10 times larger than that of the fruit fly (Drosophila) and the roundworm (C. elegans), which would mean around 300 000 - 380 000 different proteins.
A.19 Proteomics
Proteomics is the subject of quantifying the expression level of the proteome at any given time. Nowadays the term usually also encompasses procedures for determining the function of a set of proteins, which makes it a synonym for functional genomics. [38] The area encompasses the study of the identity, abundance, distribution, modifications, interactions, structure and functions of proteins. Proteomics is such a vast area that "experts" predict that there will never be one technology that can do all the steps. [18]
A.20 RNA
The most important function of RNA (RiboNucleic Acid) is to act as an intermediate product between DNA and proteins. The double helix of the DNA is unwound and each strand is copied, forming two exact RNA copies of the DNA strands. The resulting RNA is then spliced, forming mRNA (messenger RNA) strands. Three bases in a row are known as a codon. One by one, tRNAs (transfer RNA) will bind to these codons, building up a long chain of amino acids that eventually will separate from the mRNA and form a protein. There are two big differences between a DNA and an RNA molecule: the first is that RNA is single-stranded and DNA double-stranded, the second is that the base T in DNA is replaced by the base U in RNA. [30] One of the most important things to remember in biology is that there are always exceptions to every rule; for instance, for some organisms (such as some viruses) it is RNA, not DNA, that is used as the storage site for the genetic material. [23]
A.21 Systems biology
"Moving past DNA (deoxyribonucleic acid) sequences, researchers are interested in the corresponding protein sequences, their structure, and function. Beyond sequences, researchers wish to understand the "space" and "time" dimensions of genes, as for example, what genes are expressed in which tissues and during what stages of development". [10] The human cell consists of thousands of interactions between proteins, genes and other biomolecules. It is these interactions that are responsible for switching genes on or off, controlling which proteins are being produced and responding to signals from the environment. This means that even small changes in one of these interactions could lead to the outbreak of a disease. [21] Today, we do not know what happens when a malfunction is introduced into the system. We do not know how stable the system is, nor do we know the design underlying it, and we do not know whether it would be possible to modify it to make a more robust system. [27]
Appendix B Abbreviations
Arabidopsis  Arabidopsis thaliana (cress)
C. elegans  Caenorhabditis elegans (worm)
EST  Expressed Sequence Tag
MGED  Microarray Gene Expression Data
MIAME  Minimum Information About a Microarray Experiment
NCBI  National Center for Biotechnology Information
ORF  Open Reading Frame
PEDRo  Proteomics Experiments Data Repository
PSI  Proteomics Standards Initiative
SCGTI  Scottish Centre for Genomic Technology and Informatics
SIRCAMS  Scottish Instrumentation and Resource Centre for Advanced Mass Spectrometry
SQL  Structured Query Language
TIG  TIGR Gene Indices
TIGR  The Institute for Genomic Research
Bibliography
[1] Anastassiou, D. (2002). Computational Modeling of genetic networks [WWW document] URL http://www.cvn.columbia.edu/courses/Spring2002/ELENE6901.htm
[2] Arkin, A. (2003). Cellular Systems Analysis and Quantitative Biology. [WWW document] URL http://www.hhmi.org/research/investigators/arkin.html
[3] Boguski, M.S. (2002). Comparative genomics: The mouse that roared. Nature 420, 515-516 (05 December 2002)
[4] Brazma, A. (2001).
Editorial - on the importance of standardisation in life science. Bioinformatics, vol 17, no 2, pp 113-114
[5] Buneman, P. Davidson, S.B. Overton, C. (1995). Challenges in Integrating Biological Data Sources [WWW document] URL http://citeseer.nj.nec.com/davidson95challenges.html
[6] Bussey, K.J. & Kane, D. & Sunshine, M. & Narasimhan, S. & Nishizuka, S. & Reinhold, W.C. & Zeeberg, B. & Ajay & Weinstein, J.N. (2003). MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biology 4(4):R27
[7] Castillo-Davis, C. Hartl, D.L. (2002). GeneMerge - post-genomic analysis, data mining, and hypothesis testing. Bioinformatics, vol 19, no 7, pp 891-892
[8] Using Artificial Genomes to Model Genetic Networks [WWW document] URL http://homepage.ntlworld.com/cjl.clarke/proposal.html
[9] Cuminskey, M. & Levine, J. & Armstrong, D. (2002). Gene Network Reconstruction Using a Distributed GA with a Backprop Local Search. Proceedings of the 1st European Workshop on Evolutionary Bioinformatics (EvoBIO 2003), Springer
[10] Davidson, S.B. Crabtree, J. Brunk, B.P. Schug, J. Tannen, V. Overton, G.C. Stoeckert, C.J. (2001). K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, vol 40
[11] Davidson, S.B. Overton, C. Buneman, P. (1995). Challenges in Integrating Biological Data Sources [WWW document] URL http://citesser.nj.nec.com/davidson95challenges.html
[12] Dennis, G. Jr. Sherman, B.T. Hosack, D.A. Yang, J. Gao, W. Lane, H.C. Lempicki, R.A. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology. URL http://genomebiolgy.com/2003/4/5/P3
[13] FunctionalGenomics.org.uk. Protein Arrays Resource Page. [WWW document] URL http://www.functionalgenomics.org.uk/sections/protein arrays.htm#intro
[14] GeneticXchange. Optimizing Data Integration and Analysis For High Throughput Drug Discovery. [WWW document] URL http://www.genetixchange.com/v3/product/whitepapers/gxdatasheet1102.pdf
[15] GeneticXchange. Exploiting the life science data explosion to speed new drug discovery. [WWW document] URL http://www.geneticxchange.com/v3/product/whitepapers/WPexplosion.pdf
[16] GeneticXchange. discoveryHub - "Standard Edition". [WWW document] URL http://geneticxchange.com/v3/index.php?doc=product/standardedition.html&lvl=1
[17] Gerlof, D. (2002). Lecture Notes for the Course Communicating about Biological Data
[18] Gershon, D. (2003). Proteomics technologies: Probing the proteome. Nature 424, pp 581-587
[19] Ghazal, P. Talk about pathway biology
[20] Gustafsson, A. (2002). De lagger nya bokstaver till livets alfabet [They are adding new letters to the alphabet of life]. Ny Teknik, p 17, 7 March 2002
[21] Halim, N. Withers, M. (2002). Systems Biology: Creating the circuits of life [WWW document] URL http://www.wi.mit.edu/nap/2002/nap feature sysbio.html
[22] The Human Genome Project. Preliminary findings of the human genome projects [WWW document] URL http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/H/HGP.html
[23] Hunter, L. AI and molecular biology. [WWW document] URL http://www.aaai.org//Library/Books/Hunter/01-Hunter.pdf
[24] [WWW document] URL http://www.hyperdictionary.com/dictionary/amino+acid
[25] Institute of Medicine - Division of Health Care. To Err is Human: Building a Safer Health System [WWW document] URL http://www4.nas.edu/news.nsf/isbn/0309068371?OpenDocument
[26] Johnson, K. & Lin, S. (2003). QA/QC as a Pressing Need for Microarray Analysis: Meeting Report from CAMDA'02. BioTechniques, vol 34, pages S62-S63
[27] Kitano, H. (2002). Systems biology: A Brief Overview. Science (http://www.sciencemag.org), vol 295, 1 March 2002, pp 1662-1664
[28] Kulkarni, A.V. Williams, N.S. Wren, J.D. Mittelman, D. Pertsemlidis, A. Garner, H.R. (2002). ARROGANT: an application to manipulate large gene collections. Bioinformatics, vol 18, no 11, pp 1410-1417
[29] Lesney, M.S. (2001). Ecce homology: A primer on comparative genomics [WWW document] URL http://pubs.acs.org/subscribe/journals/mdd/v04/i11/11lesney.html
[30] Lodish, H. & Berk, A. & Zipursky, S.L. & Matsudaira, P. & Baltimore, D. & Darnell, J. (2000). Molecular Cell Biology. New York: W.H. Freeman and Company, pp 5, 100
[31] [WWW document] URL http://bioinf.man.ac.uk/microarray/maxd/
[32] Mged. (2002). Miame 1.1 Draft 6 - Minimum Information About a Microarray Experiment. [WWW document] URL http://www.mged.org/Workgroups/MIAME/miame 1.1.html
[33] NCBI. (2001). Genes and diseases. [WWW document] URL http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91
[34] NCBI. UniGene [WWW document] URL http://ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
[35] NCBI. Frequently asked questions [WWW document] URL http://ncbi.nlm.nih.gov/UniGene/FAQ.shtml
[36] NCBI. (2003). UniGene: A unified view of the transcriptome [WWW document] URL http://www.ncbi.noh.gov/books/bv.fcgi?call=bv.View...ShowSection&rid=handbook.chapter.857
[37] NCBI. (2003). Mail from NCBI regarding UniGene cluster ids [private communication]
[38] Nilges, M. & Linge, J.P. Bioinformatics - a definition. [WWW document] URL http://www.pasteur.fr/recherche/unites/Binfs/definition/bioinformatics definition.html
[39] Passino, M. Structural Bioinformatics in Drug Discovery [PPT document] URL http://www.sdsc.edu/pb/edu/pharma202/Passino.ppt
[40] Paulun, F. Protein: Vad ar det bra for? [Protein: what is it good for?] [WWW document] URL http://bkspotrsmag.se/artiklar/protein - vad ar det bra for.htm
[41] Pemberton, A.D. & Knight, P.A. & Robertson, K. & Wright, S.H. & Roz, D. & Miller, H.R.P. (2003). A combined transcriptomic and proteomic survey of the jejunal epithelial response to Trichinella spiralis infection in mice [Unpublished article]
[42] Phoenix5. (2002). Phoenix5's Prostate Cancer Glossary. [WWW document] URL http://www.phoenix5.org/glossary/protein.htm
[43] Purves, B. Sadava, D. Orians, G. Heller, C. (2001). Life: The Science of Biology, Sixth Edition. Sinauer Associates, pp 165
[44] Quist, P. (1998). Kemisten - molekylernas mastare? [The chemist - master of the molecules?] [WWW document] URL http://www.teknat.umu.se/popvet/POP/kemi.html
[45] Smoot, M. (2001). 2-D gel electrophoresis and protein mass spectrometry. [WWW document] URL http://hesweb1.med.virginia.edu/biostat/teaching/statbio/Spring01/2-D GelMass Spec.ppt
[46] Stein, L. (2003). Integrating Biological Databases. Nature Reviews Genetics, Volume 4
[47] Stewart, B. (2002). An interview with Dr. Leroy Hood [WWW document] URL http://www.oreillynet.com/lpt/a/1499
[48] Szallasi, Z. (2001). Genetic network analysis - From the bench to computers and back. 2nd International Conference on Systems Biology
[49] Systems Biology.org. (2003). Systems Biology - English [WWW document] URL http://www.systems-biology.org/000/
[50] Taylor, C.F. Paton, N.W. Garwood, K.L. Kirby, P.D. Stead, D.A. Yin, Z. Deutsch, E.W. Selway, L. Walker, J. Riba-Garcia, I. Mohammed, S. Deery, M. Howard, J.A. Dunkley, T. Aebersold, R. Kell, D.B. Lilley, K.S. Roepstorff, P. Yates, J.R. Brass, A. Brown, A.J. Cach, P. Gaskell, S.J. Hubbard, S.J.
Oliver, S.G. (2003). A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nature Biotechnology, vol 21, no 3, pp 247-254
[51] Taylor, C.F. Paton, N.W. Garwood, K.L. Kirby, P.D. Stead, D.A. Yin, Z. Deutsch, E.W. Selway, L. Walker, J. Riba-Garcia, I. Mohammed, S. Deery, M. Howard, J.A. Dunkley, T. Aebersold, R. Kell, D.B. Lilley, K.S. Roepstorff, P. Yates, J.R. Brass, A. Brown, A.J. Cach, P. Gaskell, S.J. Hubbard, S.J. Oliver, S.G. (2003). A systematic approach to modeling, capturing, and disseminating proteomics experimental data [WWW document] URL http://pedro.man.ac.uk/files/PEDRo PowerPoint talk.zip
[52] Taylor, C.F. [WWW document] URL http://pedro.man.ac.uk/model.shtml
[53] The Institute for Genomic Research. About TIGR. [WWW document] URL http://www.tigr.org/about
[54] The Institute for Genomic Research. TIGR Gene Indices Software Tools. [WWW document] URL http://www.tigr.org/tdb/tgi/software
[55] Tsai, J. Sultana, R. Lee, Y. Pertea, G. Karamycheva, S. Antonescu, V. Cho, J. Paravizi, B. Cheung, F. Quackenbush, J. (2001). RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biology 2001, 2(11)
[56] Wikipedia. (2003). Model organism [WWW document] URL http://www.wikipedia.org/wiki/Model organism
[57] Zhong, S. Li, C. Wong, W.H. (2003). ChipInfo: software for extracting gene annotation and gene ontology information for microarray analysis. Nucleic Acids Research, vol 31, no 13, pp 3483-3486