Bioinformatics Data
Representation and Integration
Ngozi Oleleh
Table of Contents
Introduction to Bioinformatics
Proteins and Sequences
Bioinformatics Tools
The databases
Blast Functions
What is Bioinformatics
• Bioinformatics is the use of computers to study and
handle biological information
• Bioinformatics can be viewed as an integration of
computer science and biology that enhances the
study of biological data, which has proven to be
very extensive
• The role of computer science in this interdisciplinary field is
to store the data (via databases) for future analysis with
biological tools
• The field includes, but is not limited to, the study
of genes, DNA sequences and protein structures
Protein and Sequences
Biological proteins are made up of 20 amino acids
- ala – A
- arg – R
- asn – N
- asp – D
- cys – C
- gln – Q
- glu – E
- gly – G
- his – H
- ile – I
- leu – L
- lys – K
- met – M
- phe – F
- pro – P
- ser – S
- thr – T
- trp – W
- tyr – Y
- val – V
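The code list above can be held as a simple lookup table. The following is a minimal sketch in Python; the names THREE_TO_ONE and to_one_letter are illustrative and are not taken from any particular library.

```python
# Three-letter to one-letter codes for the 20 amino acids listed above.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r.lower()] for r in residues)


print(to_one_letter(["Met", "Ala", "Gly", "Trp"]))
```

A protein sequence written this way is the plain string that sequence databases store and that alignment tools compare.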
Proteins and Sequences
• Combinations of these amino acids make up protein
structures and sequences
• The PDB database contains numerous protein structures,
many of which are similar by sequence alignment or fold
• Bioinformatics studies the differences and similarities of
these protein structures based on sequence similarity
• A sequence is a combination of amino acids.
• These sequences can carry biological data that can be
used to denote information about families of proteins
Bioinformatics Tools
• MAGE
– Used to display single protein structures
• RasMol
– Used to display protein 3D structure
– For pairwise sequence alignment
• ClustalW
– Used for multiple sequence alignment
• AMMP
– Molecular modeling
• Sequence alignment tools
– BLAST (will be looked at extensively)
Biological Databases
• There are over 5000 public biological databases
• These databases contain genomic, proteomic and
microarray data.
• This data consists of gene sequences or the
amino acid sequences of proteins
• Biological databases have become very useful to
scientists. They are important in understanding and
explaining a host of biological phenomena, from the
structure of biomolecules and their interactions, to the
whole metabolism of organisms, to
the evolution of species.
• This knowledge helps facilitate the fight against
diseases, assists in the development of
medications and in discovering basic relationships
amongst species in the history of life.
• The biological knowledge is distributed amongst
many different general and specialized databases.
This sometimes makes it difficult to ensure the
consistency of information.
• Biological databases cross-reference other
databases with accession numbers as one way of
linking their related knowledge together.
• Bioinformatics databases can be grouped into 2
groups: generalized databases and specialized databases
• Generalized databases
– Primary sequence databases (EMBL, GenBank, DDBJ)
– Protein sequence databases (Swiss-Prot, UniProt, UniRef)
– Carbohydrate databases (CarbBank)
– 3D structure databases (PDB, EBI-MSD, NDB)
Specialized Databases
• Specialized databases
– Specialized sequence databases
– Genome databases
– Specialized protein sequence databases
– Specialized structure databases
– Microarray databases
The main focus here is on the generalized databases
Primary Sequence Database
• Primary sequence databases
– EMBL (European Molecular Biology Laboratory
nucleotide sequence database at EBI, Hinxton, UK)
– GenBank (at National Center for Biotechnology
Information, NCBI, Bethesda, MD, USA)
– DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)
Protein Sequence Database
• Protein sequence databases
– SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)
– TrEMBL (=Translated EMBL: computer annotated protein
sequence database at EBI, UK)
– PIR-PSD (PIR-International Protein Sequence Database,
annotated protein database by PIR, MIPS and JIPID at NBRF,
Georgetown University, USA)
– UniProt (Joined data from Swiss-Prot, TrEMBL and PIR)
– UniRef (UniProt NREF (Non-redundant REFerence) database at
– IPI (International Protein Index; human, rat and mouse
proteome database at EBI, UK)
Other Databases
• Carbohydrate databases
– CarbBank (Former complex carbohydrate structure
• 3D structure databases
– PDB (Protein Data Bank curated by RCSB, USA)
– EBI-MSD (Macromolecular Structure Database at EBI,
UK)
– NDB (Nucleic Acid structure Database at Rutgers State
University of New Jersey, USA)
BLAST is a heuristic algorithm for detecting sequence
similarity and is optimized for speed, which makes it suitable
for large-scale analysis
BLAST matches a query sequence against
certain positions of database sequences
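The idea of matching a query to positions in a database sequence can be sketched as a toy "seed and extend" search. This is a simplified illustration and not the real BLAST implementation: it assumes exact word matches as seeds and a plain +1/-1 score, where real BLAST uses a substitution matrix and neighbourhood words. The names find_seeds and extend_right and their parameters are hypothetical.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs, 0-based, where a word of
    length word_size occurs identically in both sequences."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_right(query, subject, i, j, word_size=3,
                 match=1, mismatch=-1, x_dropoff=2):
    """Extend a seed to the right without gaps. Stop once the running score
    falls more than x_dropoff below the best score seen so far.
    Returns (best_score, length_of_best_extension)."""
    score = best = word_size * match
    length = best_len = word_size
    qi, sj = i + word_size, j + word_size
    while qi < len(query) and sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_dropoff:
            break
        qi += 1
        sj += 1
    return best, best_len


query, subject = "MKVLAAG", "TTMKVLQAGTT"
for q_pos, s_pos in find_seeds(query, subject):
    print(q_pos, s_pos, extend_right(query, subject, q_pos, s_pos))
```

Indexing the query words first is what makes the approach fast: each database sequence is scanned once, and only positions that share a word with the query are examined further.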
Quick Diversion
• Blast Example
• Sequence to be queried
Sequences producing significant alignments:                    Score (bits)  E value
pdb|2FXP|A Chain A, Solution Structure Of The Sars-Coronaviru... 82.4 3e-17
pdb|2BEZ|F Chain F, Structure Of A Proteolitically Resistant ... 81.6
pdb|1WNC|A Chain A, Crystal Structure Of The Sars-Cov Spike P... 77.8 7e-16
pdb|1WYY|A Chain A, Post-Fusion Hairpin Conformation Of The S... 76.6 1e-15
pdb|2BEQ|D Chain D, Structure Of A Proteolytically Resistant ... 69.7
pdb|1ZVA|A Chain A, A Structure-Based Mechanism Of Sars Virus... 68.6 5e-13
pdb|1ZV7|A Chain A, A Structure-Based Mechanism Of Sars Virus... 65.9 3e-12
pdb|1ZV8|B Chain B, A Structure-Based Mechanism Of Sars Virus... 65.5 4e-12
pdb|1WDG|A Chain A, Crystal Structure Of Mhv Spike Protein Fu... 25.4 4.7
pdb|2A11|A Chain A, Crystal Structure Of Nuclease Domain Of R... 24.3
BLAST Functions in Databases
– BLAST is one of the most heavily used data analysis
tools available, hence large-scale data analysis
needs to support BLAST functions.
– BLAST support is achieved by defining a set of user-defined
functions that return BLAST results as a table of rows
– Many databases support BLAST functions
– BLAST's 2 major functions are
The Blast Functions
function BLASTP_MATCH (
    query_seq             CLOB,
    seqdb_cursor          REF CURSOR,
    subsequence_from      NUMBER   default 1,
    subsequence_to        NUMBER   default -1,
    filter_low_complexity BOOLEAN  default false,
    mask_lower_case       BOOLEAN  default false,
    sub_matrix            VARCHAR2 default 'BLOSUM62',
    expect_value          NUMBER   default 10,
    open_gap_cost         NUMBER   default 11,
    extend_gap_cost       NUMBER   default 1,
    word_size             NUMBER   default 3,
    x_dropoff             NUMBER   default 15,
    final_x_dropoff       NUMBER   default 25)
return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Parameter Description
• query_seq: The query sequence to search. A sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.
• seqdb_cursor: The cursor parameter supplied by the user when calling the function. Each row it returns should have two columns: the sequence identifier and the sequence string.
• subsequence_from: Start position of the region of the query sequence to be used for the search. The default is 1.
• subsequence_to: End position of the region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.
• filter_low_complexity: TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default value is FALSE.
• mask_lower_case: TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default value is FALSE.
• sub_matrix: Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The options are PAM30, PAM70, BLOSUM80, BLOSUM62 and BLOSUM45. The default is BLOSUM62.
• expect_value: The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.
• open_gap_cost: The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.
• extend_gap_cost: The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.
• word_size: The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.
• x_dropoff: Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.
• final_x_dropoff: The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
• t_seq_id: The sequence identifier of the returned match.
• score: The score of the returned match.
• expect: The expect value of the returned match.
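How subsequence_from and subsequence_to select a region of the query can be sketched as follows. This is an illustration in Python, not the database's own code; it assumes positions are 1-based and inclusive (the source only states the defaults of 1 and -1), and resolve_subsequence is a hypothetical helper name.

```python
def resolve_subsequence(query_seq, subsequence_from=1, subsequence_to=-1):
    """Return the region of query_seq selected by the two parameters.

    Assumption: positions are 1-based and inclusive.
    subsequence_to = -1 means "use the sequence length".
    """
    end = len(query_seq) if subsequence_to == -1 else subsequence_to
    if not (1 <= subsequence_from <= end <= len(query_seq)):
        raise ValueError("subsequence range is outside the query sequence")
    return query_seq[subsequence_from - 1:end]


print(resolve_subsequence("MKVLAAG"))        # defaults select the whole sequence
print(resolve_subsequence("MKVLAAG", 2, 4))  # positions 2 through 4
```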
How the Whole System Works
• Sequences that need to be searched are
inserted into a query table
• INSERT INTO query_db VALUES ('1',
How Does It Work
SELECT t_seq_id, score, expect AS evalue
FROM TABLE(BLASTP_MATCH(
  (SELECT sequence FROM query_db), -- query_seq
  CURSOR(SELECT seq_id, seq_data
         FROM swissprot
         WHERE organism = 'Homo sapiens (Human)'), -- seqdb_cursor
  1, -- subsequence_from
  -1 -- subsequence_to
)) t WHERE t.score > 25;
The Search Procedure
SELECT t.t_seq_id, t.score, t.expect
FROM PROT_DB p, TABLE(BLASTP_MATCH(
  (SELECT sequence FROM query_db WHERE sequence_id = '2'),
  CURSOR(SELECT seq_id, sequence FROM PROT_DB)
)) t WHERE t.t_seq_id = p.seq_id AND t.score > 25
ORDER BY t.expect;
Output Results
T_SEQ_ID  SCORE  EXPECT
--------  -----  ----------
P31946    205    5.8977E-18
Q04917    198    3.8228E-17
P31947    169    8.8130E-14
P27348    198    3.8228E-17
P58107    49     7.24297332
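The rows above can be post-processed on the client side in the same spirit as the WHERE and ORDER BY clauses of the query. The sketch below is illustrative only: significant_hits is a hypothetical helper, and the expect cutoff of 0.01 is an assumed threshold, not one the query above applies.

```python
# (t_seq_id, score, expect) rows copied from the output above.
hits = [
    ("P31946", 205, 5.8977e-18),
    ("Q04917", 198, 3.8228e-17),
    ("P31947", 169, 8.8130e-14),
    ("P27348", 198, 3.8228e-17),
    ("P58107", 49, 7.24297332),
]


def significant_hits(rows, min_score=25, max_expect=0.01):
    """Keep rows above min_score and at or below max_expect,
    ordered from the most to the least significant expect value."""
    kept = [r for r in rows if r[1] > min_score and r[2] <= max_expect]
    return sorted(kept, key=lambda r: r[2])


for seq_id, score, expect in significant_hits(hits):
    print(seq_id, score, expect)
```

A smaller expect value means the match is less likely to have arisen by chance, which is why P58107 passes the score filter of the query but would be dropped by an expect cutoff.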
The Databases and Why
• The ability to perform genome-wide and cross-genome data
analysis can reduce time required for new biological
• Since traditional databases are not built to support location
datatypes, researchers have had to find ways for these
databases to manage biological information so that it
can be queried with a modern database system
• This research has led to a concept called bioindexing
• An index in this context is a way of providing a
mapping between information entities.
• In a traditional database, an index is an auxiliary structure
that speeds up data retrieval by providing a
mapping between a record key and the physical disk
address of the records containing the key
• Bioindexing provides similar functionality to a database
index but also facilitates data integration
• Biological features are generally attached to locations, and
locations are also the basis for maps (a map in this context
is an association of features with a sequence alignment),
alignments (relationships between two genomic sequence
segments) and other complex relationships.
The BLAST Database and Bioindexing
• Bioindexing is essentially an infrastructure for
representing and managing biological knowledge
in a large-scale database system using indexes
• Bioindexing uses a “location” datatype and “BLAST
joins” to efficiently handle and query the large
amount of data.
• Bioindexing is essentially a scheme for connecting
and querying information with modern database systems
Types of Indexing
• Intrinsic Indexing: Indexable bioinformatics
datatypes. Intrinsic indexing permits both the
representation and management of biological
• Extrinsic indexing: an efficient way of
integrating data from heterogeneous
sources such as relational tables, XML files,
standard sequence formats and other sources.
• Extrinsic indexing concerns the functions and
algorithms used to access and connect this
information, even when it is not stored locally
Location (How it is represented)
• Without proper abstraction, users have to implement
their own code to handle location operations
• A location consists of a sequence identifier and an
interval range.
• Integer intervals are modeled as a [lower, upper] structure
• Identifiers are character strings or accession numbers
that denote a particular sequence; the interval
range is a pair of positive integers that
denotes the sub-range within the given sequence
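The representation described above (a sequence identifier plus a [lower, upper] interval) can be sketched as a small value type. This is an illustration in Python rather than the database datatype itself; the class name Location and the methods overlaps and contains are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    seq_id: str   # sequence identifier or accession number
    lower: int    # first position of the interval (inclusive)
    upper: int    # last position of the interval (inclusive)

    def __post_init__(self):
        if not (0 < self.lower <= self.upper):
            raise ValueError("expected positive integers with lower <= upper")

    def overlaps(self, other):
        """True if both locations are on the same sequence and share a position."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.upper
                and other.lower <= self.upper)

    def contains(self, other):
        """True if other lies entirely inside this location on the same sequence."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.lower
                and other.upper <= self.upper)


region = Location("NG_002690", 1, 755217)
clone = Location("NG_002690", 174707, 324834)
print(region.contains(clone), clone.contains(region))
```

Putting the comparisons inside the type is the abstraction the first bullet asks for: callers no longer repeat the boundary arithmetic in every query.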
Complexity (WHERE Clauses) Without
Location Datatypes
EST sequences need to be grouped over consecutive overlapping EST fragments
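-- NOTE: "est" is an assumed table name; the source does not show the table names
SELECT DISTINCT A.unigene_clusterid, A.lower, B.upper
FROM est A, est B
WHERE A.unigene_clusterid = B.unigene_clusterid
AND A.lower < B.upper
AND NOT EXISTS (SELECT * FROM est C
WHERE C.unigene_clusterid = A.unigene_clusterid
AND A.lower < C.lower AND C.lower < B.upper
AND NOT EXISTS (SELECT * FROM est D
WHERE D.unigene_clusterid = A.unigene_clusterid
AND D.lower < C.lower AND C.lower <= D.upper))
AND NOT EXISTS (SELECT * FROM est E
WHERE E.unigene_clusterid = A.unigene_clusterid
AND ((E.lower < A.lower AND A.lower <= E.upper) OR
(E.lower < B.upper AND B.upper < E.upper)));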
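For comparison, the same grouping of overlapping EST fragments is short when intervals are handled directly. The sketch below uses Python with a sort-and-sweep merge; it illustrates the task and is not the method the slides propose. The function name merge_fragments and the sample cluster identifiers are hypothetical.

```python
from collections import defaultdict


def merge_fragments(fragments):
    """fragments: iterable of (cluster_id, lower, upper) tuples.
    Returns {cluster_id: [(lower, upper), ...]} with overlapping
    fragments of the same cluster merged into one interval."""
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in fragments:
        by_cluster[cluster_id].append((lower, upper))

    merged = {}
    for cluster_id, intervals in by_cluster.items():
        intervals.sort()
        result = [intervals[0]]
        for lower, upper in intervals[1:]:
            last_lower, last_upper = result[-1]
            if lower <= last_upper:   # overlaps the group built so far
                result[-1] = (last_lower, max(last_upper, upper))
            else:                     # gap: start a new group
                result.append((lower, upper))
        merged[cluster_id] = result
    return merged


sample = [("Hs.1", 1, 100), ("Hs.1", 50, 150), ("Hs.1", 300, 400),
          ("Hs.2", 10, 20), ("Hs.1", 140, 200)]
print(merge_fragments(sample))
```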
Location Datatype
• A straightforward representation of a location is
a sequence identifier as a character string and the
location interval as a (start, end) pair of integers.
• Other representations are possible, such as
integer codes for sequence identifiers and/or a
(start, length) interval representation
• Most databases use the sequence identifier and a
(start, end) pair of integers. Why? Because of
Simplicity using Location Datatype
“Creation and Insertion”
CREATE TABLE features ( location loc, description text);
-- The Prader-Willi/Angelman syndrome region on chromosome 15
INSERT INTO features VALUES ( 'NG_002690[1..755217]', 'Prader-Willi/Angelman syndrome region' );
INSERT INTO features VALUES ( 'NG_002690[1..174707]', 'AC090602.16' );
INSERT INTO features VALUES ( 'NG_002690[174707..324834]', 'AC124312.5' );
INSERT INTO features VALUES ( 'NG_002690[324835..478258]', 'AC124303.5' );
INSERT INTO features VALUES ( 'NG_002690[478259..606120]', 'AC100774.2' );
INSERT INTO features VALUES ( 'NG_002690[606121..755217]', 'AC124997.4' );
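The location literals used in the INSERT statements follow the pattern identifier[lower..upper]. A minimal sketch of reading that textual form in Python is shown below; parse_location and the regular expression are assumptions made for illustration and say nothing about how the database parses its own input.

```python
import re

# identifier, then "[lower..upper]" with two unsigned integers
_LOCATION_RE = re.compile(
    r"^(?P<seq_id>[^\[\]]+)\[(?P<lower>\d+)\.\.(?P<upper>\d+)\]$"
)


def parse_location(text):
    """Split a literal such as 'NG_002690[1..755217]' into
    (seq_id, lower, upper). Raises ValueError on malformed input."""
    m = _LOCATION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a location literal: {text!r}")
    lower, upper = int(m.group("lower")), int(m.group("upper"))
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return m.group("seq_id"), lower, upper


print(parse_location("NG_002690[1..755217]"))
```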
• The introduction of a location datatype not only provides
a natural and intuitive way to represent biological
information, but also boosts system performance.
• An additional performance increase can be achieved by
supporting a location index scheme.
• Support for indexing schemes in traditional relational
database systems is very limited and inflexible.
• It is limited to a few well-known index
structures, such as B+-tree, hash and R-tree, which can
be used only for a limited set of native datatypes and for
(in)equality and range queries.
• The location datatype supports a set of operations and
functions.
• A major proportion of these functions are related to
interval operations.
• More than 30 interval operations are defined, including
Allen's interval logic [15] (which includes after, before,
contains, during, equals, overlaps, overlapped by,
finishes, finished by, meets, met by, starts and started
by)
• Optimization information (such as ordering,
commutativity or negation) is also provided to permit
optimization of important operations like merge-join,
hash-join or general theta-join.
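The thirteen relations of Allen's interval logic named above can be told apart by comparing endpoints. The following Python sketch is illustrative only: it treats endpoints as points on a line and assumes lower < upper for both intervals, and allen_relation is a hypothetical function name, not an operator of the location datatype.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval a to interval b.
    Each interval is a (lower, upper) pair with lower < upper."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    if a_hi < b_lo:
        return "before"
    if b_hi < a_lo:
        return "after"
    if a_hi == b_lo:
        return "meets"
    if b_hi == a_lo:
        return "met by"
    if a_lo == b_lo and a_hi == b_hi:
        return "equals"
    if a_lo == b_lo:
        return "starts" if a_hi < b_hi else "started by"
    if a_hi == b_hi:
        return "finishes" if a_lo > b_lo else "finished by"
    if b_lo < a_lo and a_hi < b_hi:
        return "during"
    if a_lo < b_lo and b_hi < a_hi:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a_lo < b_lo else "overlapped by"


for a in [(1, 3), (1, 6), (6, 7), (4, 9)]:
    print(a, allen_relation(a, (5, 8)))
```

Exactly one relation holds for any pair of intervals, which is the kind of property an optimizer can exploit when it reasons about negation or commutativity of interval predicates.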
Why the Location Datatype Is Needed
• Here is a simple example to demonstrate the
power of location datatype support. This
example shows a session that painfully
attempts to locate alternatively spliced exon
intervals that intersect with known
homology intervals and associate them with
known protein features from the Pfam and
Swiss-Prot databases.
Complexity without locations
CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.swiss_id, d.query_start, d.query_end,
       d.hit_start + (o.seq_start - d.query_start) / 3 AS hit_start,
       d.hit_start + (o.seq_end - d.query_start) / 3 AS hit_end
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.ug_id = d.ug_id
  AND o.seq_start > d.query_start
  AND o.seq_start < d.query_end
  AND d.e_value < 0.01
GROUP BY o.ug_id, o.seq_start;
SELECT o.*, f.type, f.start, f.end
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.swiss_id = f.swiss_id
  AND o.hit_end >= f.start
  AND o.hit_end <= f.end;
Simplicity using locations
CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.location
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.location @ d.location -- contained
  AND d.e_value < 0.01;
SELECT o.*, f.type, f.location
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.location &< f.location; -- left overlap
Location Support
• Supporting location indexing in a traditional
database implies the need to support interval
indexing
• However, interval indexing is not supported in
traditional databases, and standard join
operations cannot handle intervals
efficiently. This has led to extensive research
on interval indexing.
• Here lies the need for a concept called GiST
(Generalized Search Tree)
• GiST is an efficient solution to the problem of
ineffective interval indexing in traditional databases
• GiST is a balanced search tree in which keys
are maintained in a hierarchical manner. The search
keys used in GiST may be arbitrary predicates, but
each predicate must hold true for all the data stored
below its key.
• GiST searches by traversing the tree in a depth-first
manner. If the query predicate is consistent
with a given search key, GiST continues to search the
subtree below the key; otherwise that subtree is skipped
GiST Implementation
• GiST is implemented using bounding intervals that
cover the range of
– Identifier integers (id_lower, id_upper)
– Intervals in the subtree (lower, upper)
• Under the GiST architecture, interval predicates such
as left, right, overlap, overleft, overright,
contains, contained and equal are all supported
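The search behaviour described above can be sketched with a tiny hand-built tree. This is a simplified illustration in Python, not the GiST implementation: it keeps only the (lower, upper) bounding interval and omits the identifier bounds (id_lower, id_upper), it supports only the overlap predicate, and the Node class and search function are hypothetical names. The leaf intervals are the clone locations from the features table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lower: int                                     # bounding interval of the subtree
    upper: int
    children: list = field(default_factory=list)   # empty for leaves
    payload: str = ""                              # description, set on leaves only


def overlaps(node, lower, upper):
    """Consistency check: can anything under this key overlap the query?"""
    return node.lower <= upper and lower <= node.upper


def search(node, lower, upper):
    """Depth-first search that descends only into subtrees whose
    bounding interval is consistent with the query interval."""
    if not overlaps(node, lower, upper):
        return []                 # prune the whole subtree
    if not node.children:
        return [node.payload]
    hits = []
    for child in node.children:
        hits.extend(search(child, lower, upper))
    return hits


left = Node(1, 324834, [Node(1, 174707, payload="AC090602.16"),
                        Node(174707, 324834, payload="AC124312.5")])
right = Node(324835, 755217, [Node(324835, 478258, payload="AC124303.5"),
                              Node(478259, 606120, payload="AC100774.2"),
                              Node(606121, 755217, payload="AC124997.4")])
root = Node(1, 755217, [left, right])

print(search(root, 400000, 500000))
```

For the query interval 400000 to 500000 the left subtree is rejected at its key and never visited, which is where the speed-up over scanning every row comes from.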
What the GiST Location Index Does
• Bioinformatics databases are being modeled and queried
using functions (as seen in Oracle and IBM DB2)
• An efficient way of modeling these databases is
bioindexing (as seen in the PostgreSQL database)
• The use of an index structure as in bioindexing, where
a location is modeled using a tree structure searched depth-first (DFS), leads to
less complex queries.
• This location index structure leads to faster searching of
the databases
• Speed is very important in bioinformatics
• Using a GiST architecture leads to less complex queries and a
more confined search space for query information.
