A Short Introduction to Analyzing Biological
Data Using Relational Databases
Part I: Introduction to Using the Relational Model of Data in Biology
Alex Ropelewski
Pittsburgh Supercomputing Center
National Resource for Biomedical Supercomputing
Bienvenido Vélez
University of Puerto Rico at Mayagüez
Department of Electrical and Computer Engineering
Learning Objectives
• Awareness of the diverse nature of data
managed by biologists
• Understanding how to model biological data and
their relationships using tables
• Minimizing data redundancy through
normalization
Bioinformatics
The interdisciplinary science of using computational approaches
to analyze, classify, collect, represent and store biological
data with the goal of accelerating and enhancing the
understanding of DNA, RNA and protein molecules.
Figure: multiple sequence alignment of five protein sequences. The number after each row is the position of the last residue shown.

Columns 1-72
CATD_H     : ------MQPSSLLPLALCLLAAPASALVRIPLHKFTSIRRTMSEVG--------------GSVEDLIAKGPV :  52
CHYM_B     : --------MRCLVVLLAVFALSQGTEITRIPLYKGKSLRKALKEHG---------------LLEDFLQKQQY :  49
CARP_YEAST : --------MFSLKALLPLALLLVSANQVAAKVHKAKIYKHELSDEMKEVTFEQHLAHLGQKYLTQFEKANPE :  64
CARP_RHICH : MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQ-------------KAIAKYNKHKIN :  59
PEPA_ASPAW : -------MVVFSKTAALVLGLSSAVSAAPAPTRKGFTINQIARPANKTRTIN-------LPGMYARSLAKFG :  58

Columns 73-144
CATD_H     : SKYSQAVPAVTEGPIPEVLKNYMDAQYYGEIGIGTPPQCFTVVFDTGSSNLWVPSIHCKLLDIACWIHHKYN : 124
CHYM_B     : GISSKYSGFGEVASVP--LTNYLDSQYFGKIYLGTPPQEFTVLFDTGSSDFWVPSIYCKSN--ACKNHQRFD : 117
CARP_YEAST : VVFSREHPFFTEGGHDVPLTNYLNAQYYTDITLGTPPQNFKVILDTGSSNLWVPSNECGSL--ACFLHSKYD : 134
CARP_RHICH : TSTGGIVPDAGVGTVP-MTDYGNDVEYYGQVTIGTPGKKFNLDFDTGSSDLWIASTLCTNCG---SRQTKYD : 127
PEPA_ASPAW : GTVPQSVKEAASKGSAVTTPQNNDEEYLTPVTVGKS--TLHLDFDTGSADLWVFSDELPSSE---QTGHDLY : 125

Columns 145-216
CATD_H     : SDKSSTYVKNGTSFDIHYGSGS-LSGYLSQDTVSVPCQSASSASALGGVKVERQVFGEATKQPGITFIAAKF : 195
CHYM_B     : PRKSSTFQNLGKPLSIHYGTGS-MQGILGYDTVTVSN-----------IVDIQQTVGLSTQEPGDVFTYAEF : 177
CARP_YEAST : HEASSSYKANGTEFAIQYGTGS-LEGYISQDTLSIGD-----------LTIPKQDFAEATSEPGLTFAFGKF : 194
CARP_RHICH : PKQSSTYQADGRTWSISYGDGSSASGILAKDNVNLGG------------LLIKGQTIELAKREAASFANGPN : 187
PEPA_ASPAW : TPSSSATKLSGYTWDISYGDGSSASGDVYRDTVTVGG-----------VTTNKQAVEAASKISSEFVQNTAN : 186
Data Management for Bioinformatics
How to Represent a Molecule?
Chemical formula: C43H66N12O12S2
Structure image with residue labels C, Y, I, Q, N, C, P, L, G
(Image from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Oxytocin.jpg)
Amino acid sequence in one-letter code:
MRLLVLAALLTVGAGQAGLNSRALWQFNGM
IKCKIPSSEPLLDFNNYGCYCGLGGSGTPV
DDLDRCCQTHDNCYKQAKKLDSCKVLVDNP
YTNNYSYSCSNNEITCSSENNACEAFICNC
DRNAAICFSKVPYNKEHKNLDKKNC
Storing Biological Data
• Data organization from a biologist's perspective:
– Sequence (amino acids represented as letters)
– Structure
– Family/Domain
– Species
– Taxonomy
– Function/Pathway
– Disease/Variation
– Publication Journal
– Many others
• How will the data be used?
Storing Biological Data
• Data organization from a computer-science perspective:
– In a flat text file
– In a spreadsheet
– In an image
– In a video animation
– In a relational database
– In a networked (hyperlinked) model
– In any combination of the above
– Others
• How will the data be used?
Retrieving Biological Data
• Reference:
– Find something that I have seen before
– Examples:
• Find out who discovered a DNA sequence or protein
• Find some characteristic of a known sequence or protein
• Discovery:
– Find something new; infer new knowledge
– Examples:
• Find new sequences that evolved from a known common ancestor
• Find sequences that may have a similar function in other organisms
Finding Reference Information
• Reference information searches can be
accomplished:
– By key
• Find a DNA sequence by its accession number
– By attribute (exact)
• Find sequences belonging to C. elegans
– By attribute (inexact)
• Find proteins known to be related to some type of
cancer
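The three kinds of reference search can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name `sequences` and its column names are hypothetical, the rows are borrowed from the BLAST example later in this deck, and a substring match with LIKE is a crude stand-in for a real inexact search such as "related to some type of cancer".

```python
import sqlite3

# Hypothetical in-memory table; names are illustrative, not from a real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    " accession TEXT PRIMARY KEY,"
    " description TEXT,"
    " species TEXT)"
)
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("P14555", "Group IIA Phospholipase A2", "Human"),
        ("P81479", "Phospholipase A2 isozyme IV", "Indian Green Tree Viper"),
        ("P00623", "Phospholipase A2", "Eastern Diamondback Rattlesnake"),
    ],
)

# By key: the accession number identifies exactly one row.
by_key = conn.execute(
    "SELECT description FROM sequences WHERE accession = ?", ("P14555",)
).fetchall()

# By attribute (exact): every row whose species equals the given value.
by_species = conn.execute(
    "SELECT accession FROM sequences WHERE species = ?", ("Human",)
).fetchall()

# By attribute (inexact): here approximated by a substring match.
inexact = conn.execute(
    "SELECT accession FROM sequences WHERE species LIKE ? ORDER BY accession",
    ("%Viper%",),
).fetchall()

print(by_key, by_species, inexact)
```

A search by key returns at most one row, while the two attribute searches may return any number of rows.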
Motivation: Storing Experimental Results
• Recent phenomenon from the biological experimenter's perspective
– Too many results to keep track of by hand
– Need to summarize/aggregate data in order to visualize and extract valuable information
– Need to repeat the same discovery searches to better mine the data
• With different parameters
• Over time to pick up database changes (more sequences, better annotation)
Structured Databases
• All information organized in the same way (data model)
• Language available to
– describe (create) the database
– insert data
– manipulate data
– update data
• Language establishes an abstract data model: data independence
• Programs using the language can work across systems
• Facilitates communication and data sharing
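As one possible illustration of such a language, the sketch below uses SQL through Python's built-in sqlite3 module to describe, insert, query, and update data. The table `runs` and its columns are hypothetical names chosen for this example, and the misspelled matrix name is entered on purpose so that the update step has something to correct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Describe (create) the database: one hypothetical table of search runs.
conn.execute(
    "CREATE TABLE runs (run_num INTEGER PRIMARY KEY, run_date TEXT, matrix TEXT)"
)

# Insert data.
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [(1, "7/21/07", "Pam70"), (2, "7/20/07", "Blosom80")],
)

# Manipulate (query) data.
before = conn.execute("SELECT matrix FROM runs ORDER BY run_num").fetchall()

# Update data: correct the spelling of the matrix name for run 2.
conn.execute("UPDATE runs SET matrix = ? WHERE run_num = ?", ("Blosum80", 2))
after = conn.execute("SELECT matrix FROM runs ORDER BY run_num").fetchall()

print(before, after)
```

The same four statements would be expected to work with little or no change on other SQL systems, which is the data independence point made above.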
Structured Databases
• Examples
– Hierarchical Databases
– Networked Databases
– XML Databases
– Relational Databases
Relational Databases
• Model originally described by Edgar F. Codd in the early 1970s.
• Defined using relational algebra, an offshoot of the algebra of sets and first-order logic.
• Model implemented in many products:
– Commercial: Oracle, Microsoft SQL Server, IBM DB2
– Open source/free: MySQL, PostgreSQL, SQLite
Relations
• A relation is a set of tuples that have the same attributes
– base relations = stored data
– derived relations = computed by applying relational operators
[Figure: a table with its parts labeled Relation, Tuple, and Attribute or Column]
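The difference between base and derived relations can be sketched with sqlite3. The table `matches` is a hypothetical base relation whose rows are taken from the BLAST example in this deck; the SELECT query and the view `pam70_matches` (also a hypothetical name) are derived relations computed by selection and projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base relation: rows that are actually stored.
conn.execute("CREATE TABLE matches (accession TEXT, matrix TEXT, evalue REAL)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?)",
    [
        ("P14555", "Pam70", 4.18e-32),
        ("P81479", "Pam70", 2.68e-52),
        ("P14555", "Blosum80", 3.47e-33),
    ],
)

# Derived relation: selection (WHERE) followed by projection (column list).
derived = conn.execute(
    "SELECT accession, evalue FROM matches"
    " WHERE matrix = 'Pam70' ORDER BY accession"
).fetchall()

# A view names the derived relation without storing its rows.
conn.execute(
    "CREATE VIEW pam70_matches AS"
    " SELECT accession, evalue FROM matches WHERE matrix = 'Pam70'"
)
from_view = conn.execute(
    "SELECT accession, evalue FROM pam70_matches ORDER BY accession"
).fetchall()

print(derived)
print(from_view == derived)
```

Because the view is recomputed from the base relation on each query, later inserts into `matches` show up in `pam70_matches` automatically.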
Example Relational Database Design
Store results from multiple sequence database BLAST searches

Accession  Description                  Species                          Matrix    eValue    Date
P14555     Group IIA Phospholipase A2   Human                            Pam70     4.18E-32  7/21/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Pam70     2.68E-52  7/21/07
P14555     Group IIA Phospholipase A2   Human                            Blosum80  3.47E-33  7/20/07
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper          Blosum80  1.20E-54  7/20/07
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake  Blosum80  1.21E-08  7/20/07
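The redundancy in this single-table design can be made visible by counting how often the same facts are stored. The sketch below uses plain Python tuples holding the five rows of the table; the variable names are illustrative only.

```python
from collections import Counter

# The five rows of the single-table design:
# (accession, description, species, matrix, evalue, date)
flat_rows = [
    ("P14555", "Group IIA Phospholipase A2", "Human", "Pam70", 4.18e-32, "7/21/07"),
    ("P81479", "Phospholipase A2 isozyme IV", "Indian Green Tree Viper", "Pam70", 2.68e-52, "7/21/07"),
    ("P14555", "Group IIA Phospholipase A2", "Human", "Blosum80", 3.47e-33, "7/20/07"),
    ("P81479", "Phospholipase A2 isozyme IV", "Indian Green Tree Viper", "Blosum80", 1.20e-54, "7/20/07"),
    ("P00623", "Phospholipase A2", "Eastern Diamondback Rattlesnake", "Blosum80", 1.21e-08, "7/20/07"),
]

# Facts about a sequence (accession, description, species) stored more than once.
sequence_facts = Counter(row[:3] for row in flat_rows)
repeated_sequences = {
    accession: count
    for (accession, _description, _species), count in sequence_facts.items()
    if count > 1
}

# Facts about a search run (matrix, date) stored once per matching sequence.
run_facts = Counter((row[3], row[5]) for row in flat_rows)

print(repeated_sequences)
print(dict(run_facts))
```

Every repeated fact is a place where a later correction has to be applied to several rows at once, which is the problem normalization addresses.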
An Improved Relational Database Design
Reduce redundancy through normalization

Sequences
Accession  Description                  Species
P14555     Group IIA Phospholipase A2   Human
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake

Matches (Acc# is a foreign key into Sequences)
Acc#    Date     Matrix    eValue
P14555  7/21/07  Pam70     4.18E-32
P81479  7/21/07  Pam70     2.68E-52
P14555  7/20/07  Blosum80  3.47E-33
P81479  7/20/07  Blosum80  1.20E-54
P00623  7/20/07  Blosum80  1.21E-08

Still redundant: the Date and Matrix values repeat across the Matches rows.
An Improved Relational Database Design
Sequences
Accession  Description                  Species
P14555     Group IIA Phospholipase A2   Human
P81479     Phospholipase A2 isozyme IV  Indian Green Tree Viper
P00623     Phospholipase A2             Eastern Diamondback Rattlesnake

Matches
Accession  RunNum  eValue
P14555     1       4.18E-32
P81479     1       2.68E-52
P14555     2       3.47E-33
P81479     2       1.20E-54
P00623     2       1.21E-08

Runs
RunNum  Date     Matrix
1       7/21/07  Pam70
2       7/20/07  Blosum80
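The three-table design can be tried out with sqlite3. This is a minimal sketch, not a definitive schema: the lowercase table and column names and the choice of primary keys are assumptions, and run numbers are assigned so that the data agree with the first single-table design (P81479 scored 2.68E-52 in the Pam70 run of 7/21/07). A join rebuilds the original wide table, and the foreign key rejects a match for an unknown sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY, description TEXT, species TEXT);
    CREATE TABLE runs (
        run_num INTEGER PRIMARY KEY, run_date TEXT, matrix TEXT);
    CREATE TABLE matches (
        accession TEXT REFERENCES sequences(accession),
        run_num INTEGER REFERENCES runs(run_num),
        evalue REAL,
        PRIMARY KEY (accession, run_num));
    """
)
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", [
    ("P14555", "Group IIA Phospholipase A2", "Human"),
    ("P81479", "Phospholipase A2 isozyme IV", "Indian Green Tree Viper"),
    ("P00623", "Phospholipase A2", "Eastern Diamondback Rattlesnake"),
])
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", [
    (1, "7/21/07", "Pam70"),
    (2, "7/20/07", "Blosum80"),
])
conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", [
    ("P14555", 1, 4.18e-32),
    ("P81479", 1, 2.68e-52),
    ("P14555", 2, 3.47e-33),
    ("P81479", 2, 1.20e-54),
    ("P00623", 2, 1.21e-08),
])

# Joining the three tables recovers the original wide table (minus Description).
rebuilt = conn.execute(
    "SELECT s.accession, s.species, r.matrix, m.evalue, r.run_date"
    " FROM matches AS m"
    " JOIN sequences AS s ON s.accession = m.accession"
    " JOIN runs AS r ON r.run_num = m.run_num"
    " ORDER BY r.run_num, s.accession"
).fetchall()

# The foreign key refuses a match that points at a sequence we never stored.
try:
    conn.execute("INSERT INTO matches VALUES (?, ?, ?)", ("X00000", 1, 1.0))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

for row in rebuilt:
    print(row)
print(fk_rejected)
```

Each species name and each run's date and matrix are now stored once, so correcting any of them means changing a single row.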
Key Concepts
• Biologists need to manage information of a large and
diverse nature
• Computer scientists have developed a variety of data
models for storing information
• Relational databases are ideal for representing large
collections of data of a regular tabular nature
• A relational database is a collection of tables
• Tables use rows to represent objects and columns to
represent their attributes
• Tables should be optimized through normalization to
avoid redundancy