VISUALISATION MECHANISMS
Supporting Genotype Analysis
Evangelos Pafilis
A dissertation submitted in part fulfillment of the requirement of the Degree
of M.Res. in Bioinformatics at the University of Glasgow
September 2003
Abstract
Genome visualisation is an important tool which supports the analysis of
genes involved in normal and abnormal activities of living organisms.
Responding to the emerging demands for genotype analysis data display
and genome variation rendering and retrieval, we propose a prototype web
based application that aims to address both of these issues. We
experiment with using an existing visualisation tool in a novel context,
implement software that performs the visualisation logic and develop the
required data storage facilities. The prototype we produced demonstrates
how an implementation with a wider data coverage could satisfy some of
the bioinformatics requirements arising from genetics research.
Acknowledgements
I would like to thank the following people for their help with the project:

Dr. Ela Hunt, for her support and guidance throughout the project

Dr. Fadi Charchar for helping me understand the biological background
of the project

Dr. Bailey for sharing his thoughts on bioinformatics issues

Mr. Andy Jones for sharing his ideas and opinions

Dr. David Leader, Dr. Neil Hanlon, Ms. Eilidh Grant and Ms. Suzan
Fairley for their assistance

Mr. Micha Bayer and Mr. Chris Wu for their help in technical issues

Mr. Jim Tourtouras and Ms. Areti Galani for their continuous support
Abstract
Acknowledgements
Abbreviations
1. Introduction
1.1 Need for visualisation
1.2 Project stimulus
1.3 Genotype Analysis Data
1.4 Single Nucleotide Polymorphisms (SNPs)
1.5 Project Approach
2. Design
2.1 Preparation
2.1.1 Literature and Web Resource Survey
2.1.2 Requirements Capture
2.2 Genotype Analysis Data Visualisation Design
2.3 Y Chromosome SNP Visualisation
2.4 Y Chromosome SNP Database Data Source Design
2.5 Overall Application Design
2.6 Y Chromosome SNP Database design
3. Implementation
3.1 YDB Implementation
3.1.2 Implementing the Relational Schema
3.1.3 Inserting Data
3.2 Presentation Tier
3.3 Application Tier
3.3.1 Genotype Analysis Data Visualisation Middleware
3.3.2 Y Chromosome Variation Visualisation Middleware
3.4 Implementation Summary
4. Testing and Evaluation
4.1 Testing
4.2 Evaluation
5. Discussion
5.1 Design Decisions
5.1.1 Only dbSNP as the Data Source for YDB
5.1.2 Not only SNPs
5.2 Implementation Decisions
5.2.1 MySQL as the RDBMS
5.2.2 Parsers in the Java Language
5.2.3 Both CGI and Servlets as Middleware Technology
5.2.4 DerBrowser as the Visualisation Tool
5.3 Application Criticism: Achievements and Drawbacks
5.4 Further Improvement
6. Conclusion
References
A. Literature
B. Internet Resources
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Abbreviations
ASN1: Abstract Syntax Notation One
BPG: Blood Pressure Group
BHF: British Heart Foundation
CGI: Common Gateway Interface
DIP: Deletion Insertion Polymorphism
HTML: HyperText Markup Language
HTTP: Hypertext Transfer Protocol
JDBC: Java DataBase Connectivity
RDBMS: Relational DataBase Management System
STR: Short Tandem Repeat
dbSNP: Single Nucleotide Polymorphism Database
SNP: Single Nucleotide Polymorphism
SQL: Structured Query Language
TSC: The SNP Consortium
YDB: Y Chromosome Variation DataBase
1. Introduction
1.1 Need for visualisation
The development of high throughput techniques and large-scale studies in
the biological sciences has given rise to an explosive growth in both the
volume and types of data available to the researchers. The genome
sequence of many organisms is now known and is being annotated
constantly. Handling this ever increasing amount of information is no
longer a straightforward issue and therefore computational methods have
to be applied.
Accessing annotated sequence data in textual format can be extremely
laborious and time consuming (Bryce, 2003). A graph that uses different
colours and shapes to render every piece of genomic information
associated with a particular region would ease the process of data
interpretation and help the researcher draw conclusions. Thus, there is a
need for a visualisation tool that displays the exact location and
range of each genetic element in a genomic sequence, provides the
user with facilities such as the retrieval of further information, and is easy
and intuitive to use.
1.2 Project stimulus
The driving force of this project was to provide a group of researchers at
the British Heart Foundation (BHF) Blood Pressure Group (BPG) of the
Division of Cardiovascular & Medical Sciences of the University Of
Glasgow [a] with visualisation tools that would support their ongoing
experiments.
The visualisation of both genotype analysis results and genome variations,
focusing on Single Nucleotide Polymorphisms (SNPs), was the
bioinformatics problem to be solved.
1.3 Genotype Analysis Data
Certain congenic rat strains have been genotyped for a series of markers
and the results were held in a Microsoft Excel® flat file (Figure 1).
Figure 1: Flat file containing genotype analysis data
All the markers shown belong to rat chromosome 2. The rectangles
correspond to a certain genotype according to the colour code shown on
the right. The coloured rectangles give a visual overview of the marker
genotypes for each rat strain. However, if the researcher wishes to place
the markers on their relative locations on the chromosome, checking for
example their distribution, then he/she is forced to work through the
base pair position of each one. In this form of data presentation the real
physical distances on the chromosome of interest are not shown, which
makes the interpretation of data incomplete.
We postulate that there is a need for an application that would accept this
flat file as input and return a visualisation maintaining the same colour code
and simultaneously rendering the markers on their chromosomal position.
1.4 Single Nucleotide Polymorphisms (SNPs)
An SNP is a substitution of one base pair at a given location on the
genome. At position 11,294,479 on human chromosome 7, for example,
some people have an A, while others have a G. On average, SNPs are
spaced every 300 bases throughout the human genome and are estimated
at nearly 10 million (Orgen, 2003). Each is a genomic landmark, a
surveyor's marker that researchers can use to chart the location of disease
genes and heritable traits, for instance.
Most SNPs reside outside coding regions, exerting potential influence on
gene regulation and expression. Many researchers value these SNPs for
use in association studies and whole-genome linkage-disequilibrium
mapping. In this type of analysis, maps of common, genome-wide
polymorphisms are used to unearth variations that are associated with, but
not causative of, medical conditions.
Some polymorphisms occur in protein-coding regions (cSNPs) and may
directly contribute to disease, disease susceptibility, and drug metabolism,
by altering gene function. (Orgen, 2003).
BPG researchers have already documented that the Y chromosome
harbors a locus or loci that contribute to blood pressure variation in
hypertensive and normotensive men (Charchar et al., 2003). Their next
step is to perform haplotype analysis on a general sample of Polish males
and to try to associate high blood pressure with certain combinations of
SNP allelic states, constituting a particular haplotype.
Additionally, because of its sex determining role, the Y chromosome is
male specific and constitutively haploid. It passes from father to son, and,
unlike other chromosomes, largely escapes meiotic recombination¹,². The
importance of escaping recombination is that haplotypes usually pass
intact from generation to generation. They change only by mutation, rather
than the more complex reshuffling that other chromosomes experience,
and so preserve a simpler record of their history. Using binary
polymorphisms³ with low mutation rates, such as SNPs, a unique
phylogeny can therefore easily be constructed (Jobling et al., 2003).
¹ Recombination is the formation of new combinations of alleles through meiotic crossover.
Some authors include intrachromosomal gene conversion under this heading. As this
has been shown on the Y chromosome (Rozen et al., 2003), they prefer not to refer to it as
‘non-recombining’.
² Two segments (the pseudoautosomal regions) do recombine with the X, but these
amount to less than 3 Mb of its ~60-Mb length.
³ For more information on why SNPs tend to exist in binary forms, i.e. having two
alleles, see Brown (2002).
A storage and visualisation mechanism designed specifically for the Y
chromosome SNPs, containing the publicly available knowledge, would
allow researchers to obtain information on each SNP, such as the
experimental conditions required, and thus would assist them in performing
their haplotype analysis and phylogeny construction experiments.
Additionally, the same application could accommodate the storage of their
own generated SNP data.
1.5 Project Approach
This project demonstrates how a web based application can provide
solutions to the bioinformatics issues mentioned in the previous two
sections. An existing visualisation tool was linked with data originating
from two different sources: a flat file that is read and a database that
is queried.
The tasks performed comprised surveying the relevant literature and
web resources, interviewing scientists and capturing the requirements,
developing a database schema, processing data from publicly available
databases, implementing the middleware software, and installing and
administering a web server.
2. Design
2.1 Preparation
2.1.1 Literature and Web Resource Survey
Before designing the web application, the literature and the internet were
surveyed to gather information regarding the human Y chromosome and
SNP web databases.
The Y chromosome literature was studied to gain a deeper understanding
of the BPG's ongoing experiments and to be able to evaluate
biologically the findings emerging through the construction of the
application, e.g. the number of loci found to be associated with SNPs
stored in the database.
A milestone in this background reading was the publication of an article that
analysed and interpreted, from an evolutionary perspective, the near-complete
sequence of Y euchromatin, including thorough gene
identification (Skaletsky et al. 2003).
Querying web search engines [b], [c] for ‘SNP database’ returned, among
others, the Single Nucleotide Polymorphism Database
(dbSNP) (Sherry et al. 2001) [d], established by the National Center for
Biotechnology Information [e], and The SNP Consortium (TSC) website
(The International SNP Map Working Group, 2001) [f].
In addition, SNP data were available in the Ensembl database (Hubbard,
T. et al., 2002) [g], queried through EnsMart [h].
The previously mentioned web resources were used to prepare screen
shots of SNP mapping visualisation and information retrieval facilities
(Appendix A) that were used in the requirements capture interviews.
2.1.2 Requirements Capture¹
The requirements capture procedure involved interviewing molecular
biologists from the BPG and other research groups. The interviewees had
either molecular genetics experience or an interest in bioinformatics. The
former described the specific requirements that the application should
meet, while the latter offered a more generic view of how an SNP storage
and visualisation mechanism should be implemented, focusing on
visualisation topics.
The interviewees were shown SNP records extracted from web databases
and were asked to point out which parts of information are regarded as
essential for their research and should be included in the database to be
created.
In addition, the interviewees were requested to describe the desired
appearance and the functionality of the application based on observations
and remarks on the SNP visualisation and information screen dumps
mentioned in the previous section (Appendix A).
The conclusion was the need for an application with an interactive
interface that would provide the user with an overview of the SNPs with respect
to their chromosomal positions, and tools for displaying the information
available on each of them. This interface should be scrollable and
zoomable and should accommodate an SNP search facility. Finally, it
would be of the utmost importance to render SNPs comparatively with
other genetic elements of the same sequence such as microsatellites and
genes, or even better gene substructures.
¹ The interviews handled only the Y chromosome SNP visualisation topic. The need for
genotype analysis data visualisation emerged after the interviews were conducted.
Since there are certain common elements in the applications that would serve these
two issues independently, they were merged into one.
2.2 Genotype Analysis Data Visualisation Design
In our project we decided to create an application to support genotype
analysis data visualisation. This was accommodated by the following three
tier web application¹ (Figure 2).
Figure 2: Client – web server architecture for the genotype analysis data visualisation
The applet on the client side will be the visualisation tool. The middleware
lying on the application tier will be responsible for converting the genotype
analysis data into an applet compatible format and the data tier will cache
client data on the web server, using files.
¹ The structure of a generic three tier web application is explained in Appendix B
2.3 Y Chromosome SNP Visualisation
The other application our project gave rise to was a Y Chromosome SNP
visualisation, which is accommodated by the following three tier web
application (Figure 3).
Figure 3: Client – web server architecture for the Y chromosome SNP visualisation
The applet on the client side will be the visualisation tool used in the
genotype visualisation. The middleware in the application tier will be
responsible for querying the database, transforming the data format, and
storing the data returned by the transformation in an applet compatible
format. The data tier uses a combination of a database and files.
2.4 Y Chromosome SNP Database Data Source Design
A relational database will hold SNP information. The database schema will
be similar to that of other SNP databases available on the web, but will be
modified to suit the specific needs of the application.
The Y chromosome SNP data will originate from flat file dumps of existing
web databases. A parser will be needed in order to read these flat files.
The parser will select the desired pieces of information and prepare scripts
to load the database (Figure 4). Database loading should be performed by
a database administrator.
[Diagram: Web Database Flat File 1, Web Database Flat File 2 → Parser → Script → Database]
Figure 4: Filling database with data
2.5 Overall Application Design
Since the web applications for the genotype analysis data and Y
Chromosome SNP visualisation share the same presentation tier and the
creation of an applet specific flat file in the application and data tier, they
can be merged into one application. Taking into consideration the
database creation, the whole web application can be summarised as
shown in Figure 5.
Figure 5: Overall web application design
2.6 Y Chromosome SNP Database design
The conceptual schema conceived for the SNP database is illustrated in
the Appendix C entity relationship diagram, and constitutes a modified
version of the dbSNP schema [i].
In dbSNP the term SNP refers to a broad collection of simple genetic
polymorphisms that includes, among others: SNPs, small deletions or
insertions (a.k.a. Deletion Insertion Polymorphisms, DIPs), retroposable
element insertions and microsatellite repeat variations (a.k.a. Short
Tandem Repeats, STRs) (Kitts et al., 2003).
In the local database the term ‘variation’ has been used instead. For this
reason, and because the database is Y chromosome specific, it was named
the Y Chromosome Variation Database (YDB).
According to the YDB conceptual schema a variation has features such as
an internal id that is used as a unique identifier (varID), a unique dbSNP Id
(varRefID), allelic states, neighbouring sequence and, not always, a
heterozygosity estimate. In addition it belongs to a certain class, e.g. SNP,
or DIP, or STR and has a specific type, e.g. not withdrawn or artifact.
Each variation might have a mapweight, a code representing how many
times the variation occurs in the organism's genome, e.g. 1: once, 2: twice,
3: from 3 to 9 times¹.
For every variation there is a certain validation status, e.g. validated by
multiple submissions, or validated by frequency data, or even not
validated. Each validation status is associated with a certain colour, a
feature that can be exploited by the visualisation tool².
¹ For more information on variation classes and types check Kitts et al., 2003 and the
documentation available on dbSNP home page [d].
² For more information on possible validation status check Kitts et al., 2003
A variation might be associated with one or more loci. Each association
has a sequential number and perhaps a type. Each type belongs to a
certain functional class, e.g. locus when the variation lies in the locus
region, in proximity to a gene feature but not in the transcript. For every
functional class YDB keeps an appropriate description. For each locus
an id and a symbol are stored. If the variation associated with a certain
locus is determined to be in a coding region, then the allelic state, the
reading frame, the translated amino acid residue and the position of the
amino acid in the peptide sequence may be defined.
Finally, a variation may have a set of contig hits reporting each genomic
position predicted by in silico sequence analysis. For every hit the strand
and type¹ are stored. For each contig the id and the accession
number are stored. The version of the contig that the variation maps
to is stored in the contig hit entity.
¹ Whether it is an exact base pair position, a range of base pairs or between two base pairs
3. Implementation¹
3.1 YDB Implementation
3.1.2 Implementing the Relational Schema
YDB was implemented using the MySQL² [j] Relational DataBase
Management System (RDBMS). An overview of the tables created is
available in Appendix D. The script file containing the Structured Query
Language (SQL) commands to generate these tables is
CreateAllTablesAndIndices.sql³.
3.1.3 Inserting Data
dbSNP⁴ has been used as the data source for YDB. Human Y
chromosome specific flat files were downloaded through the ftp site [k].
Pieces of the information to be hosted on YDB existed in all three
different flat file types: the chromosome report, the Abstract Syntax
Notation One (ASN1) flat file, and the FastA flat file⁵.
NCBIParser⁶ is a Java [l] command line application that parses the
aforementioned flat files and creates InsertNonFixedData.sql³,
an SQL script containing the appropriate commands to load variation data
into YDB.
‘NonFixedData’ refers to the fact that the data just parsed and loaded are
those being updated on every build of dbSNP, e.g. variations that have
been added or removed.
InsertFixedData.sql³ contains SQL commands to load data that
remain constant in YDB, such as the different variation classes and
mapweights together with their descriptions. This file has been created
manually and complies with the relational schema of YDB and the context of
dbSNP.
¹ Arguments for or against the implementation choices are presented in the Discussion
section of this report.
² Version 3.23.54
³ Contained in the ‘YDB’ folder of the CD
⁴ Build 115
⁵ Contained in the ‘YDB\dbSNPChrYData’ folder of the CD
⁶ Contained in the ‘NCBIParser’ folder of the CD
The YDB data loading procedure is summarised in Figure 6. Figure 7 is
the NCBIParser class diagram. A characteristic of the NCBIParser is that it
implements pattern matching using regular expressions, a central feature
of the Perl programming language (Leader, 2003) that became available in the
Java 2 Standard Edition Software Development Kit Version 1.4 (Hitchens,
2002).
Figure 6: YDB implementation summary
Figure 7: NCBIParser Class Diagram. NCBIParser class holds the methods for parsing
the dbSNP flat files and creating the SQL script file. The rest of the classes are
used in creating objects that would store the information of all the variations.
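As an illustration of the approach described in section 3.1.3 (regular-expression pattern matching followed by SQL script generation), a minimal sketch is given below. It is not the NCBIParser code: the class name, the tab-separated input layout, and the table and column names other than varRefID are assumptions made only for this example, and the real dbSNP flat file formats are not reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariationSqlSketch {

    // Hypothetical line layout: "rs<digits><TAB><class><TAB><position>".
    // This is an assumed format, not the layout of any dbSNP flat file.
    private static final Pattern LINE =
            Pattern.compile("^rs(\\d+)\\t(SNP|DIP|STR)\\t(\\d+)$");

    /** Turns matching lines into INSERT statements; malformed lines are skipped. */
    public static List<String> toInsertStatements(List<String> lines) {
        List<String> sql = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // e.g. a missing value or a word instead of a number
            }
            // Table and column names are illustrative, not the YDB schema.
            sql.add(String.format(
                    "INSERT INTO variation (varRefID, varClass, position) VALUES (%s, '%s', %s);",
                    m.group(1), m.group(2), m.group(3)));
        }
        return sql;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "rs12345\tSNP\t2712345",
                "rs999\tSNP\tunknown"); // rejected: the position is not a number
        toInsertStatements(lines).forEach(System.out::println);
    }
}
```

Skipping lines that do not match the pattern is one possible way of handling the missing or invalid values mentioned in the Testing section; a real parser might prefer to log or report them instead.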
3.2 Presentation Tier
The front end of the application is a HyperText Markup Language
(HTML) web page available at: http://balabio.dcs.gla.ac.uk/pafilisv/,
accessible by any web browser (Figure 8). An experimental version of
DerBrowser (Grigoriev, 1998), a Java applet, is the visualisation tool for
both marker genotypes and Y chromosome variations.
If the user chooses to visualise a genotype analysis data file, then he/she is
prompted to upload the Excel file containing the genotyping experiment
results (Figure 9).
If the user chooses to view Y chromosome variations, then YDB is
accessed and the visualisation tool is invoked.
Figure 8: Visualisation Mechanisms Supporting Genotype Analysis, entry page.
Figure 9: Data file upload interface
3.3 Application Tier
3.3.1 Genotype Analysis Data Visualisation Middleware
After the Excel file has been submitted, a Hypertext Transfer Protocol
(HTTP) POST request invokes a Common Gateway Interface (CGI) [m]
script written in the Perl language [n]¹. Figure 10 summarises the series of
actions performed by this script.
¹ Contained in the ‘WebPage\MarkerMapVisualisation\CGI’ folder of the CD
Figure 10: Implementing marker genotype visualisation business logic
Initially the Excel file is saved on the server side. Then a Perl script, written
by Ashley Pond [o]¹, reads the Excel file and converts it into a tab
delimited text file. Subsequently, the tab delimited text file is parsed
by JMarkerMapParser², a Java command line tool, and a DerBrowser
compatible flat file is created. Finally, the Perl – CGI script returns
an HTML page (Figure 11) where the marker genotypes are
rendered by DerBrowser.
¹ Contained in the ‘WebPage\MarkerMapVisualisation\XLS2TXT’ folder of the CD
² Contained in the ‘JMarkerMapParser’ folder of the CD
Figure 11: Genotype Analysis Data Visualisation. The colour of each box corresponds to
a different genotype according to the colour code shown on top. The scale on
the left hand side corresponds to the physical chromosome length. Individual
rat strains are shown as columns.
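The conversion step of section 3.3.1, from a tab delimited text file to a file that a visualisation tool can read, can be sketched roughly as follows. This is not the JMarkerMapParser code. The input layout (a header row naming the strains, then one row per marker with its base pair position), the genotype codes, the colour names and the output line layout are all assumptions made for illustration; the actual DerBrowser file format is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MarkerMapSketch {

    // Hypothetical genotype-to-colour code; the real colour key is the one
    // defined in the uploaded Excel file.
    private static final Map<String, String> COLOUR = Map.of(
            "AA", "red", "BB", "blue", "AB", "green");

    /**
     * Assumed input: header "Marker<TAB>Position<TAB>strain1<TAB>strain2...",
     * followed by one row per marker.
     * Assumed output: one "strain<TAB>marker<TAB>position<TAB>colour" line
     * per genotype call, so each marker keeps its chromosomal position.
     */
    public static List<String> convert(List<String> tabDelimitedLines) {
        List<String> out = new ArrayList<>();
        if (tabDelimitedLines.isEmpty()) {
            return out;
        }
        String[] header = tabDelimitedLines.get(0).split("\t", -1);
        for (String line : tabDelimitedLines.subList(1, tabDelimitedLines.size())) {
            String[] cells = line.split("\t", -1);
            if (cells.length != header.length) {
                throw new IllegalArgumentException("Row width differs from header: " + line);
            }
            long position = Long.parseLong(cells[1].trim()); // base pair position
            for (int col = 2; col < cells.length; col++) {
                // Genotypes missing from the colour code fall back to grey.
                String colour = COLOUR.getOrDefault(cells[col].trim(), "grey");
                out.add(header[col] + "\t" + cells[0] + "\t" + position + "\t" + colour);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "Marker\tPosition\tS1\tS2",   // made-up strain names
                "M1\t1200000\tAA\tAB");       // made-up marker and position
        convert(input).forEach(System.out::println);
    }
}
```

Rejecting rows whose width differs from the header reflects the need, noted in the Testing section, for the uploaded file to comply with a fixed specification.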
3.3.2 Y Chromosome Variation Visualisation Middleware
The Y chromosome variation visualisation logic is implemented by a
servlet [p], a Java server-side program that contains methods to respond
to HTTP requests, to connect to databases and to return a new HTML
page.
CreateDBrFlatFileServlet.java¹ implements the functionality
displayed in Figure 3. It connects to YDB using Java DataBase
Connectivity (JDBC)² [q], queries the name, position and validation
status colour of every Y chromosome variation (Appendix E), writes these
data in a DerBrowser compatible file and returns an HTML page (Figure
12) with the visualisation.
In order for the servlets to be functional, they must be contained inside a
servlet engine running on a web server. Jigsaw 2.2 [r], both a web server
and a servlet engine, was used.
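The database step performed by the servlet (query YDB through JDBC, then write one line per variation) might look roughly like the sketch below. It is not the code of CreateDBrFlatFileServlet.java: the table and column names in the query are hypothetical stand-ins for the query given in Appendix E, the output line layout is assumed, and the servlet wrapper itself is omitted because the Servlet API is not part of the standard JDK.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class VariationFlatFileSketch {

    // Hypothetical table and column names; the actual query is in Appendix E.
    private static final String QUERY =
            "SELECT v.varRefID, h.position, s.colour "
          + "FROM variation v "
          + "JOIN contig_hit h ON h.varID = v.varID "
          + "JOIN validation_status s ON s.statusID = v.statusID "
          + "ORDER BY h.position";

    /** Formats one variation as a tab-separated line (assumed layout). */
    public static String formatLine(String name, long position, String colour) {
        return name + "\t" + position + "\t" + colour;
    }

    /** Runs the query over an open JDBC connection and writes one line per variation. */
    public static void writeFlatFile(Connection connection, Writer out)
            throws SQLException, IOException {
        try (PreparedStatement statement = connection.prepareStatement(QUERY);
             ResultSet rows = statement.executeQuery()) {
            while (rows.next()) {
                out.write(formatLine(rows.getString(1), rows.getLong(2), rows.getString(3)));
                out.write(System.lineSeparator());
            }
        }
    }

    public static void main(String[] args) {
        // Demonstrates only the formatting step; no database is contacted here.
        System.out.println(formatLine("rs12345", 2712345L, "black"));
    }
}
```

In a servlet, a request-handling method would obtain the connection (for example through java.sql.DriverManager with a MySQL JDBC driver on the classpath), call writeFlatFile with a writer on the target file, and then return the HTML page that embeds the applet.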
Figure 12: Y Chromosome Variation Visualisation, with the physical chromosome scale
on the left, and SNPs shown as black boxes in the first column.
¹ Contained in the ‘WebPage\YDBServlets’ folder of the CD
² JDBC, the application programming interface for connecting Java programs with
database systems
3.4 Implementation Summary
The implementation procedure mentioned in the previous sections is
summarised in Figure 13.
Figure 13: Overall Implementation Summary
4. Testing and Evaluation
4.1 Testing
The whole project comprised a series of incremental steps.
At each stage the correctness of newly added code was tested, using
manual comparison of input and output data.
The dbSNP flat files have been checked for either missing or invalid data,
e.g. a missing value or a word instead of a number. This check was
performed partially during the parsing and partially during the YDB data
loading.
Another possible source of errors is the format of the uploaded Excel file
containing marker genotype data. On the relevant web page (Figure 9)
there are clear directions regarding the specifications that the Excel file
should comply with. A file that can be used as a template is available for
downloading as well.
For both NCBIParser and JMarkerMapParser there were no synthetic
control data samples available. As an alternative, flat files were generated
containing every different combination of data that should be dealt with.
The parsers were fine tuned using these synthetic data files and then
applied to the real data.
The servlet associated with the Y chromosome variation visualisation was
initially constructed using a stand-alone Tomcat 4.0 [s] servlet engine and
was transferred to Jigsaw after it had been debugged. Prompts have been
added to the server command line output in order to provide the web server
administrator with proper feedback (Figure 14). Finally, when the
application was released, it was tested at the BPG location in the Western
Infirmary in order to identify any possible access or configuration issues.
Figure 14: Providing administrator with feedback on the servlet execution
4.2 Evaluation
A molecular biologist and a bioinformatician of the BPG evaluated the web
application. They were given some time to test both the marker genotype
and the Y chromosome variation visualisation and they were asked a
number of specific questions (Appendix F) focusing on user friendliness,
visualisation issues, problems and deficiencies, and suggestions for
further improvement.
5. Discussion
5.1 Design Decisions
5.1.1 Only dbSNP as the Data Source for YDB
The initial concept was that the YDB data would originate from more than
one web database. Considering the time limit for the completion of the
project, we decided to build a prototype with equally developed
components and to propose data integration from several web databases as
a future extension¹.
The selection of dbSNP as the data source is supported by the fact that
dbSNP is integrated with other large public databases, via the LocusLink
database (Sherry et al., 2001). The existence of such links can assist the
identification of entries in different databases that refer to the same
variation and, thus, support data integration.
The plethora of documentation and guidance available on the dbSNP web
site is another reason for selecting dbSNP. The topics ranged from the
biological significance of genome variation to technical details of the
database, which eased the gathering of background information and
the database design.
The fact that dbSNP flat files are divided by chromosome suited the
purpose of the application, which focuses only on the Y chromosome.
5.1.2 Not only SNPs
The information on types of genomic variation other than SNPs could have
been disregarded, since BPG researchers are mainly interested in SNPs.
However, we decided to retain it because, as will be demonstrated in
the ‘Further Improvement’ section, DerBrowser can display these extra
pieces of information in a biologically significant way.
¹ Data integration as a further improvement is presented in the next chapter of the
report
5.2 Implementation Decisions
5.2.1 MySQL as the RDBMS
MySQL was selected as the RDBMS because it offers stability,
support, and low cost (Suehring, 2002). This means that if, for example,
BPG wants to install MySQL locally, doing so will require neither
significant funds nor administrative personnel.
The advantages and disadvantages of MySQL compared to other RDBMSs are
shown in Table 1.
Table 1: Comparison of SQL Implementations¹
5.2.2 Parsers in the Java Language
The parsers needed for the application were written in Java. This
choice reflects our familiarity with the language, whose object-oriented
features allowed us to easily create the required data structures and
algorithms.
¹ From Suehring, 2002. TCO is the abbreviation of Total Cost of Ownership.
5.2.3 Both CGI and Servlets as Middleware Technology¹
The middleware for both the marker genotype and the Y chromosome
variation visualisation could have been implemented with either CGI or
Servlets.
The older and slower² CGI, written in Perl (Speegle, 2002), was selected
to implement marker genotype visualisation. This part of our work required
no complicated tasks, e.g. database connectivity, and Perl makes it easy
to invoke command line applications and scripts. These were the reasons
for this choice.
For the more demanding Y chromosome variation visualisation Java
Servlet technology has been used. Since servlets are Java programs, they
have access to the entire family of Java Application Programming
Interfaces and receive all the benefits of the Java language including
portability and crash protection.
In addition, because one of the application requirements is the interaction
between the applet (DerBrowser) and the database (YDB), servlets can be
used in order to guarantee safety and to overcome security issues
(Speegle, 2002).
5.2.4 DerBrowser as the Visualisation Tool
DerBrowser is a visualisation tool that conforms to the requirements we
identified during our project. It is an applet designed to display objects on a
genomic map; it can query a data source about a selected object, find an
object by name, and provide the user with zooming and scrolling
capabilities. In addition, DerBrowser is written in an older version of Java
and is thus supported by most web browsers.
¹ A comparison of middleware technologies is available in Speegle, 2002, Preface pages x, xi.
² Slower, since it greatly increases the number of processes running on a server.
5.3 Application Criticism: Achievements and Drawbacks
The data processing and visualisation mechanisms developed within the
scope of this project performed well only in providing BPG researchers with
a visualisation tool for viewing their genotype analysis results. As shown in
Figure 11, the desired result of viewing colour-coded marker genotypes with
respect to the markers’ physical location was achieved. However,
extensive use of zooming and scrolling, together
with an ‘artificial increase’ in the length of the markers¹, was necessary.
Figure 15: Y Chromosome Variation Visualisation in the highest zoom possible
¹ Markers by definition indicate a certain point in the genome. In order to improve
visualisation they have been assigned a range whose centre is the genomic position
that they indicate.
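As an illustration only, the range assignment described in the footnote on marker length could look like the following Java sketch. The class name MarkerRange, the halfWidth parameter and the clamping to the chromosome bounds are assumptions made for this example and are not taken from the implemented application.

```java
/** Hypothetical helper: widens a point marker into a displayable range. */
final class MarkerRange {
    final long start; // first base of the displayed range (1-based, inclusive)
    final long end;   // last base of the displayed range (1-based, inclusive)

    private MarkerRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Builds a range centred on the marker position.
     * halfWidth is an assumed display parameter in base pairs. The range is
     * clamped to [1, chromosomeLength], so near the chromosome ends the
     * marker position is no longer exactly in the middle.
     */
    static MarkerRange around(long position, long halfWidth, long chromosomeLength) {
        if (position < 1 || position > chromosomeLength) {
            throw new IllegalArgumentException("position outside chromosome: " + position);
        }
        if (halfWidth < 0) {
            throw new IllegalArgumentException("halfWidth must not be negative");
        }
        long start = Math.max(1L, position - halfWidth);
        long end = Math.min(chromosomeLength, position + halfWidth);
        return new MarkerRange(start, end);
    }

    public static void main(String[] args) {
        // Illustrative values only; not taken from the project data.
        MarkerRange r = MarkerRange.around(1_000_000L, 5_000L, 50_000_000L);
        System.out.println(r.start + "-" + r.end);
    }
}
```

A wider halfWidth makes a marker easier to see at low zoom levels, at the cost of suggesting an extent that the marker does not really have.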
In contrast, the Y chromosome variation visualisation (Figures 12 and
15) did not have the expected outcome. Most of the variation names did
not appear at all, and many variations, due to their proximity, appeared as
a smear. In addition, the requirement of querying YDB and retrieving
variation-specific information was not met.
However, from a broader point of view, the achievement was the
implementation of an application that can serve more than one purpose
and provides a fundamental storage and visualisation facility that can be
extended in many directions.
5.4 Further Improvement
This application demonstrated clearly that DerBrowser is more
useful for displaying large objects than small genomic variations.
The zooming capacity needs to be increased, and this
should be accompanied by smooth scrolling behaviour.
The name display perhaps should be reconsidered, along with the
suggestion of colouring not only the genetic objects but their labels as well.
Before viewing Y chromosome variations, the user should be prompted to
specify what exactly he/she wishes to view and in which way, through drop
down menus, radio- and check-buttons.
For example, he/she might select only the validated variations, or the
variations that exist in a certain region, or those that have a certain degree
of heterozygosity. Furthermore, he/she may choose to view a certain kind of
variation (e.g. SNPs) in one of DerBrowser’s stripes and another kind of
variation (e.g. microsatellites) in the adjacent stripe.
The latter could be of utmost importance. If the user could select to view
concurrently certain genomic variations, such as SNPs, in one stripe and
other genetic elements, such as genes, in the next one, a better
understanding of the positions and significance of the SNPs could be
gained. However, this implies that the database should contain all those
pieces of information, and thus would constitute an integrated biological
database rather than a database of genomic variation. The data
integration issue, though, despite the efforts to solve it, will remain a
difficult problem for the foreseeable future (Stein, 2003).
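The selection options suggested in this section (validated variations only, a certain region, a certain degree of heterozygosity, one kind of variation per stripe) can be sketched as composable filters. The following Java sketch is hypothetical: the Variation class, its field names and the type labels are assumptions for illustration and do not describe existing DerBrowser or YDB code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of user-driven filtering before display. */
final class VariationFilterSketch {

    /** Minimal stand-in for a YDB variation; field names are illustrative. */
    static final class Variation {
        final String refId;
        final String type;           // e.g. "SNP" or "microsatellite"
        final boolean validated;
        final long startPosChr;
        final double heterozygosity;

        Variation(String refId, String type, boolean validated,
                  long startPosChr, double heterozygosity) {
            this.refId = refId;
            this.type = type;
            this.validated = validated;
            this.startPosChr = startPosChr;
            this.heterozygosity = heterozygosity;
        }
    }

    static Predicate<Variation> validatedOnly() {
        return v -> v.validated;
    }

    static Predicate<Variation> inRegion(long from, long to) {
        return v -> v.startPosChr >= from && v.startPosChr <= to;
    }

    static Predicate<Variation> minHeterozygosity(double threshold) {
        return v -> v.heterozygosity >= threshold;
    }

    static Predicate<Variation> ofType(String type) {
        return v -> v.type.equals(type);
    }

    /** Returns the variations that satisfy the criteria, in input order. */
    static List<Variation> select(List<Variation> all, Predicate<Variation> criteria) {
        List<Variation> out = new ArrayList<>();
        for (Variation v : all) {
            if (criteria.test(v)) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example records, not real dbSNP entries.
        List<Variation> all = Arrays.asList(
                new Variation("example1", "SNP", true, 1500L, 0.4),
                new Variation("example2", "microsatellite", true, 9000L, 0.5));
        // One list per stripe: SNPs in one, microsatellites in the adjacent one.
        List<Variation> snpStripe = select(all, ofType("SNP").and(validatedOnly()));
        List<Variation> msStripe = select(all, ofType("microsatellite"));
        System.out.println(snpStripe.size() + " " + msStripe.size());
    }
}
```

Because the criteria are ordinary predicates, the choices made through drop down menus or check-buttons could be combined with and() without writing a separate method for every combination.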
6. Conclusion
This project demonstrated how the features of multi-tier web applications
can be exploited in order to serve the purposes of bioinformatics. The
separation between the presentation layer, the application logic and the
data storage layer supported the segmentation of the problem into smaller
parts and assisted the implementation of components which each deal with
one specific layer.
Existing visualisation tools and web interfaces were used in order to
present biological data in a more explicit manner than would be possible
within a spreadsheet program. However, the existing software had to be
modified and adjusted so that it could serve the exact needs of the user,
as defined during the software requirements analysis.
A variety of middleware technologies were used to implement the
application logic. Those included CGI scripts and Java Servlets.
We used a MySQL database management system to manipulate biological
data. We identified further issues in data integration that would provide the
biologists with valuable complementary and comparative information on
genotypes and genetic variation.
Implementing a web application in order to satisfy the needs of the BPG
research group was not only a software engineering task but required a
significant amount of problem solving and intellectual scrutiny of the data
and the research practice.
References
A. Literature

Brown, T.A. (2002) Genomes, Second Edition. BIOS, pp: 131, Box 5.1.

Charchar, F. et al. (2002) The Y Chromosome Effect on Blood
Pressure in Two European Populations. Hypertension, 39: 353-356. [HTML]

Grigoriev, A. et al. (1998) Distributed environment for physical map
construction. Bioinformatics, 14, 242-258

Hitchens, R. (2002) Java™ NIO. O’Reilly, ISBN: 0-596-00288-2.

Hubbard, T. et al. (2002) The Ensembl genome database project.
Nucleic Acids Research 30, 1, 38–41.

Hunter, B. (2003) Gene Visualisation And Comparison Tool. Final
Year Project Report, Dept. Of Comp. Science, Uni. Of Glasgow, pp: 3,
4.

Jobling, M., Tyler-Smith,C. (2003) The Human Y Chromosome: An
Evolutionary Marker Comes Of Age. Nature 424, 598–612

Kitts, A., Sherry, S. (2003) The Single Nucleotide Polymorphism
Database (dbSNP) of Nucleotide Sequence Variation. NCBI
Handbook, Chapter 5. [HTML]

Leader, D. (2003) David’s Perl. Perl Module Notes MRes
Bioinformatics Glasgow University.

Ogren,M. (2003) Whole-Genome SNP Genotyping. The Scientist,
17,11, 42. [HTML]

Rozen, S. et al. (2003) Abundant gene conversion between arms of
massive palindromes in human and ape Y chromosomes.

Sherry, S.T et al. (2001) dbSNP: the NCBI database of genetic
variation. Nucleic Acids Research. 29:308-311.

Skaletsky, H. et al. (2003) The male-specific region of the human Y
chromosome: a mosaic of discrete sequence classes. Nature 423,
825–837.

Speegle, G. (2002) JDBC: Practical Guide For Java Programmers.
Morgan Kaufmann Publishers, ISBN: 1-55860-736-6

Stein, L. (2003) Integrating Biological Databases. Nature Reviews
Genetics 4, 337-345.

Suehring S. (2002) MySQL Bible. Wiley, pp: 7-14, ISBN: 0-7645-4932-4

The International SNP Map Working Group. (2001) A map of human
genome sequence variation containing 1.4 million SNPs. Nature 409,
928–933.
B. Internet Resources
[a] British Heart Foundation Blood Pressure Group:
http://www.medther.gla.ac.uk/bhf/index.htm
[b] Google Search Engine: http://www.google.com
[c] Scirus Scientific Information Search Engine: http://www.scirus.com
[d] Single Nucleotide Polymorphism Database (dbSNP) Home Page:
http://www.ncbi.nlm.nih.gov/SNP/
[e] National Center for Biotechnology Information Home Page:
http://www.ncbi.nlm.nih.gov/
[f] The SNP Consortium (TSC) Home page: http://snp.cshl.org/
[g] Ensembl Genome Browser: http://www.ensembl.org
[h] Ensmart Data Retrieval Tools Set: http://www.ensembl.org/EnsMart/
[i] dbSNP Schema: ftp://ftp.ncbi.nih.gov/snp/mssql/schema/erd_dbSNP.pdf
and changes: http://www.ncbi.nlm.nih.gov/SNP/snp_schemaChange.htm
[j] MySQL Home Page: http://www.mysql.com
[k] dbSNP FTP site: ftp://ftp.ncbi.nih.gov/snp/
[l] The source for Java Technology http://java.sun.com/
[m] CGI Information Page: http://hoohoo.ncsa.uiuc.edu/cgi/
[n] Perl Home Page: http://www.perl.com/
[o] Excel to Tab delimited Text File Converter
http://sedition.com/perl/excel_to_delim.html
[p] Java Servlet Technology: http://java.sun.com/products/servlet/
[q] JDBC Technology: http://java.sun.com/products/jdbc/
[r] Jigsaw Home Page: http://www.w3.org/Jigsaw/
[s] Tomcat Home page: http://jakarta.apache.org/tomcat/
Appendix A
Screen shots of SNP mapping visualisation and information retrieval
facilities.
Ensembl (Ensmart)
SNP ‘rs3848982’ following the link to UCSC Genome Browser from the dbSNP record
Appendix B
Structure of a generic three tier web application.
Abstract structure of a three tier web architecture. There is a clear distinction among
the presentation layer that displays data to the user, the application layer that implements
business logic and a data layer where data are stored.
Appendix C
The conceptual schema of YDB.
[Figure: entity-relationship diagram of YDB, showing the entities VariationClass,
VariationType, Variation, MapWeight, ValidationStatus, FunctionClass, Strand, Contig,
CtgHit, LocationType, AssociatedLocus, Locus and Chromosome, together with their
attributes and relationship cardinalities.]
Appendix D
Overview of YDB Relational Schema
1. VariationClass (varClassName, varClassDescription)
2. VariationType (varTypeName, varTypeDescription)
3. MapWeight (mapWeigthID, mapWeightDescription)
4. ValidationStatus (validStatusID, validStatusAbbr,
validStatusDescription, validStatusColour)
5. FunctionClass (fxnClassName, fxnClassDescription)
6. Variation (varID, varRefID, varClass, alleles, hetero, hetSError,
validStatus, varitype, varMapWeight, sequence)
7. Locus (locusID, locusSymbol)
8. Strand (strSymbol, strDescription)
9. Contig (contigID, contigAcc)
10. AssociatedLocus (assocSeqNo, asVarID, asLocusID, fxnClass,
readFramePos, varAllele, varResidue, aaPosition)
11. CtgHit (ctgHitID, variationID, ctgID, ctgVersion, startPosCtg,
endPosCtg, ctgStrand, ctgLocationType, chrom, startPosChr,
endPosChr)
12. LocationType (locatTypeID, locatTypeName)
Appendix E
SQL Query
SELECT startPosChr, endPosChr, varRefID, varID,
validStatusColour
FROM Variation, CtgHit, ValidationStatus
WHERE startPosChr <> 0
AND varID = variationID
AND validStatus = validStatusID;
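To make explicit which rows the query returns, the same join can be sketched in plain Java over in-memory rows. This is an illustrative equivalent rather than the code used by the application: the class names, the integer key types and the assumption that varID and validStatusID are unique keys are ours, and only the columns used by the query are modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory equivalent of the Appendix E query. */
final class VariationQuerySketch {

    static final class Variation {
        final int varID; final String varRefID; final int validStatus;
        Variation(int varID, String varRefID, int validStatus) {
            this.varID = varID; this.varRefID = varRefID; this.validStatus = validStatus;
        }
    }

    static final class CtgHit {
        final int variationID; final long startPosChr; final long endPosChr;
        CtgHit(int variationID, long startPosChr, long endPosChr) {
            this.variationID = variationID; this.startPosChr = startPosChr; this.endPosChr = endPosChr;
        }
    }

    /** One result row, with fields in the order of the SELECT list. */
    static final class Row {
        final long startPosChr; final long endPosChr; final String varRefID;
        final int varID; final String validStatusColour;
        Row(long startPosChr, long endPosChr, String varRefID, int varID, String colour) {
            this.startPosChr = startPosChr; this.endPosChr = endPosChr;
            this.varRefID = varRefID; this.varID = varID; this.validStatusColour = colour;
        }
    }

    /** colourByStatusId stands in for the ValidationStatus table (ID to colour). */
    static List<Row> run(List<Variation> variations, List<CtgHit> hits,
                         Map<Integer, String> colourByStatusId) {
        Map<Integer, Variation> byId = new HashMap<>();
        for (Variation v : variations) {
            byId.put(v.varID, v);
        }
        List<Row> rows = new ArrayList<>();
        for (CtgHit h : hits) {
            if (h.startPosChr == 0) continue;                     // WHERE startPosChr <> 0
            Variation v = byId.get(h.variationID);                // AND varID = variationID
            if (v == null) continue;
            String colour = colourByStatusId.get(v.validStatus);  // AND validStatus = validStatusID
            if (colour == null) continue;
            rows.add(new Row(h.startPosChr, h.endPosChr, v.varRefID, v.varID, colour));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up rows for demonstration only.
        Map<Integer, String> colours = new HashMap<>();
        colours.put(1, "green");
        List<Row> rows = run(
                Arrays.asList(new Variation(10, "example", 1)),
                Arrays.asList(new CtgHit(10, 500L, 500L), new CtgHit(10, 0L, 0L)),
                colours);
        System.out.println(rows.size());
    }
}
```

Hits with a chromosome start position of 0, presumably variations without a chromosome position, are skipped, as are hits whose variation or validation status cannot be matched, which mirrors the inner-join behaviour of the SQL statement.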
Appendix F
Questionnaire
Marker Map Visualisation
How easy was it for you to use the module? (Please tick as appropriate)
Very easy
Relatively easy
Intermediate
Hard
Very hard
Were the titles used the appropriate ones? (Yes/No)
If No please suggest some alternative ones:
……….…………………………...………………………………………………
………………………………………………………………………………………
How informative was the visualisation?
Very much
Adequately
Intermediate
Inadequately
Not at all
Did you identify any problems or malfunctions?
………………………………………………………………………………………
….…………………………………………………………………………………..
………………………………………………………………………………………
What would you like to suggest as an amendment or a further
improvement?
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
Y Chromosome Variation
How easy was it for you to use the module? (Please tick as appropriate)
Very easy
Relatively easy
Intermediate
Hard
Very hard
Were the titles used the appropriate ones? (Yes/No)
If No please suggest some alternative ones:
……….…………………………...………………………...………………………
………………………………………………………………………………………
How informative was the visualisation?
Very much
Adequately
Intermediate
Inadequately
Not at all
Did you identify any problems or malfunctions?
………………………………………………………………………………………
….…………………………………………………………………………………
What would you like to suggest as an amendment or a further
improvement?
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
Would you support the creation of a Y Chromosome Integrated Database?
Please support your answer with a few points or examples.
………………………………………………………………………………………
………………………………………………………………………………………
Both Applications
DerBrowser was not originally built in order to render genetic markers. Are
you satisfied with its performance? Please support your answer with a few
points or examples.
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
Do you think that the visualisation software should be extended further?
Please support your answer with a few points or examples.
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………