Organization of Biological Data
and Databases
Pramod Wangikar
Dept. of Chemical Engineering
IIT Bombay
ORGANIZATION OF BIOLOGICAL DATA
Gene i: Genomics
m-RNA i: Transcriptomics
Protein i: Protein Sequence / Proteomics
Function (Enzyme, hormone etc.)
3-D Structural Database
Primary Structure of Deoxyribonucleic Acid (DNA)
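[Figure: a six-base DNA strand, A C G T T G, drawn with phosphate (P) groups linking the 3’ and 5’ positions of neighbouring nucleotides and a terminal OH.]
The same strand can be written as pApCpGpTpTpG or ACGTTG.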
The Basic Principle of Transcription
[Figure labels: RNA polymerase, double-stranded DNA (5’ end marked), RNA, nucleotides]
The Code
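• 64 possible codons (4 x 4 x 4 base triplets)
• 20 amino acids
[Figure: adjacent mRNA codons 5'... aug uuu ... paired with tRNA anticodons uac (carrying M) and gaa (carrying F).]
[Table: the genetic code. Rows are codons written uxu, uxc, uxa, uxg, cxu, cxc, cxa, cxg, axu, axc, axa, axg, gxu, gxc, gxa, gxg; columns are the middle base x = u, c, a, g; cells hold one-letter amino acid codes with side-chain annotations (C-S, NH+, O-, H:D/A).]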
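The two counts on this slide can be checked with a short sketch. It assumes the standard genetic code, written as a 64-letter string with the bases in U, C, A, G order; the names `BASES`, `AMINO` and `CODON_TABLE` are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code (assumed), one letter per codon in UCAG order.
# '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# product() varies the third base fastest, matching the order of AMINO.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

n_codons = len(CODON_TABLE)
amino_acids = set(CODON_TABLE.values()) - {"*"}
degeneracy = Counter(CODON_TABLE.values())  # how many codons map to each letter

print(n_codons, "codons")
print(len(amino_acids), "amino acids")
print("codons per residue:", dict(degeneracy))
```

Because 64 codons map onto 20 amino acids plus stop signals, most amino acids are written by more than one codon, which is what the `degeneracy` counter tabulates.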
The Flow of Genetic Information
DNA (sequence same as RNA):
5’ ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG 3’
DNA (sequence complementary to RNA):
3’ TGACGTGGTACCCCGAGTCGCTGCCCCTTACCGTGAACCAC 5’
mRNA:
5’ ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG 3’
Initiation signal (AUG) and codons
Protein:
Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
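The flow from DNA to mRNA to protein on this slide can be traced in a few lines. This is a minimal sketch, assuming the standard genetic code and assuming that translation starts at the first AUG; the function names `complement`, `transcribe` and `translate` are hypothetical helpers, not part of any library.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code (assumed), one letter per codon in UCAG order; '*' = stop.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def complement(dna):
    """Base-pair each letter (A-T, C-G) without reversing the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))


def transcribe(coding_strand):
    """The mRNA has the same sequence as the coding strand, with U for T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read codons in frame from the first AUG until a stop or the end."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        residues.append(THREE_LETTER[aa])
    return residues


coding = "ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG"
template = complement(coding)
mrna = transcribe(coding)
protein = "-".join(translate(mrna))
print(template)
print(mrna)
print(protein)
```

The eight bases before the AUG are not translated, so the 41-base message yields 11 residues.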
Memory Requirements for Storing Genomes
00 = a
01 = c
10 = g
11 = t
Prokaryotic: 0.5-7.0 Mbp
Eukaryotic: 10 Mbp - 1000 Gbp
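The two-bit code above implies four bases per byte. The sketch below packs a sequence that way and estimates the storage for the genome sizes quoted on this slide; `pack`, `unpack` and `megabytes` are hypothetical names, and the big-endian bit order within each byte is an assumption made for illustration.

```python
CODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
LETTER = {bits: base for base, bits in CODE.items()}


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.lower()):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data, n_bases):
    """Recover the first n_bases letters from packed bytes."""
    return "".join(
        LETTER[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


def megabytes(n_bases):
    """Storage at 2 bits per base, in units of 10^6 bytes."""
    return n_bases * 2 / 8 / 1e6


packed = pack("acgttg")
restored = unpack(packed, 6)
print(packed.hex(), restored)

for label, n in [("prokaryote, 0.5 Mbp", 0.5e6), ("prokaryote, 7.0 Mbp", 7.0e6),
                 ("eukaryote, 10 Mbp", 10e6), ("eukaryote, 1000 Gbp", 1000e9)]:
    print(f"{label}: {megabytes(n):g} Mbyte")
```

The sequence length has to be stored alongside the packed bytes, since the padding in the final byte is indistinguishable from a run of `a`.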
How Much Data Does a Bacterium (E. coli) Contain?
E. coli and Data size
Genome: 4.6 x 10^6 letters; 1.1 Mbyte
Proteome, Sequence: 1.2 x 10^6 letters, identifier, description, location; 10 Mbyte
Proteome, Structure: 1.3 x 10^7 atoms, atom type, amino acid, charge, 3 co-ordinates per atom; 400 Mbyte
Function: substrate, product, reaction, ligand, specificity, transport; 100 Mbyte
Metabolites, Structure: 800 compounds, 10-50 atoms, bonds, 3 co-ordinates per atom; 50 Mbyte
Metabolites, Connection: interconversion of compounds, reactions, rates, enzymes; 500 Mbyte
Numbers are approximate. The data size increases roughly by three orders of magnitude for the human system.
Minimal Life:
Self-assembly, Catalysis, Replication, Mutation, Selection
[Figure labels: Environment, Monomers, RNA, Growth rate, Cell Boundary]
Maximal Life:
Self-assembly, Catalysis, Replication, Mutation, Selection
[Figure labels: Regulatory & Metabolic Networks, Metabolites, DNA, RNA, Protein, Growth rate, Expression; stem cells, cancer cells, microbes]
Regulation: More biological data
What is regulation? A catalogue of possible scenarios
and the respective courses of action.
The information for regulation can be stored in the form of:
• Protein-protein interaction
• Protein-DNA interaction
• Protein-metabolite interaction
• Molecular switches, controls, set-points, etc.
Genome + Environment: Input file
Biological Machinery: Executable program
Observations: Output file
Can we crack the executable program?
Some useful regulatory signals on Genes
[Figure labels. DNA: upstream activating sequences (UAS), TATA box, m-RNA expression start & end. mRNA: ribosomal binding site, protein synthesis starts, protein synthesis stops. Protein.]
Minimal Gene Complement of Mycoplasma genitalium
DESCRIPTION OF A LIVING CELL / VIRUS
Genome / Genomics: General Capability of the Cell
Transcriptomics: Readiness of the Cell
Proteomics / Protein Map: Physiological state of the cell
Paradigm Shift in the Bioinformatics Age
Conventional Path:
Function -> Structure -> Gene
Bioinformatics Age:
[Figure labels: Gene sequence; Protein Map (2D-PAGE, pI, mol. wt.); Functional Genomics; Structure of Protein; Proteomics; Function]
Possible Relationships Between Databases
Genome Sequence
Transcriptomics
Expression Profile
Protein-DNA interactions
Proteomics
Protein Sequence
Protein Profile
Protein Structure
Protein-Protein Interaction
Protein Function
Metabolome
Phenotype
Combinatorial Problems in Biology
• Prediction of ORF; gene finding
• Prediction of DNA regulatory sites
• DNA regulatory Proteins
• Protein-Protein interactions
• Protein Function
• Prediction of Metabolic capability
• Prediction of Genetic Regulatory Circuits
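To make the first item in this list concrete, here is a deliberately naive open reading frame (ORF) scan. It is a sketch under stated assumptions: only the forward strand is searched, an ORF is taken to run from an ATG to the next in-frame stop codon, and `find_orfs` with its `min_codons` cutoff is a hypothetical helper. Practical gene finders also use the reverse strand, codon statistics and regulatory signals.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG-to-stop ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


toy = "CCATGGCTGAATGGTAAGG"  # made-up sequence for illustration
orfs = find_orfs(toy)
for start, end, frame in orfs:
    print(frame, toy[start:end])
```

The search is combinatorial in the sense that every start codon in every frame is a candidate; the length cutoff is the simplest way to discard the many short, chance ORFs.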
Biological Databases
• Raw databases
• Processed databases
• Querying in databases.
Raw Databases
Conventional Ones
DNA / Gene / Genome Sequence Databases.
EMBL, GenBank, GSDB etc.
> 10^6 genes, doubles every 18 months.
Genome Projects: E. coli, plants, Human, Mouse, etc.
Protein Sequence Databases.
PIR, SwissProt, GenBank, etc.
> 10^5 protein sequences, doubles every 21 months.
Three Dimensional structure Database.
Brookhaven Protein Databank (PDB)
> 20,000 structures, doubles every 24 months.
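The doubling times on this slide imply exponential growth, N(t) = N0 * 2^(t / T). The sketch below projects each database forward under that assumption. It reads the slide's counts as 10^6 genes, 10^5 protein sequences and 2 x 10^4 structures, treats those lower bounds as starting sizes, and assumes the doubling times stay constant; `projected_size` is a hypothetical helper.

```python
import math


def projected_size(n0, doubling_months, months_elapsed):
    """Size after months_elapsed if the count doubles every doubling_months."""
    return n0 * 2 ** (months_elapsed / doubling_months)


def months_to_grow(factor, doubling_months):
    """Months needed to grow by the given factor at a fixed doubling time."""
    return doubling_months * math.log2(factor)


# (starting count, doubling time in months), as read from the slide
DATABASES = {
    "gene sequences": (1e6, 18),
    "protein sequences": (1e5, 21),
    "3-D structures": (2e4, 24),
}

for name, (n0, doubling) in DATABASES.items():
    in_ten_years = projected_size(n0, doubling, 120)
    print(f"{name}: about {in_ten_years:.2e} entries after 10 years, "
          f"{months_to_grow(1000, doubling):.0f} months to grow 1000-fold")
```

Because the doubling times differ, the gap between the number of known sequences and the number of solved structures widens over time under this model.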
Proteomics Database (SwissProt)
• Each protein is identified by pI, mol. wt., mass spectra,
microsequencing, peptide mass fingerprint, etc.
• Entries for E. coli, yeast, human, etc.
Hoogland et al, Nucl. Acids Res. (2000) 28, 286
Clusters of Orthologous Groups (COG) of Proteins:
A Processed Database
• Compares genes from different genomes.
• Forms clusters with similar sequences.
• Each COG contains genes connected through vertical
evolutionary descent.
• 30 genomes (68,571 genes), 2,791 COGs with 45,350 genes
• Assignment of function for genes based on known functions
for some members of the cluster.
• Highly useful for functional assignments for newly sequenced
genomes.
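The clustering and function-transfer steps listed above can be illustrated with a toy example. This is not the actual COG construction procedure, which is more involved; it is a simplified single-linkage grouping over made-up similarity scores. All gene names, scores, the threshold and the helper names `cluster` and `transfer_function` are hypothetical.

```python
from collections import defaultdict


def cluster(pairs, threshold):
    """Group genes linked by a similarity score >= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        ra, rb = find(a), find(b)  # registers both genes even if not linked
        if score >= threshold:
            parent[ra] = rb

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return list(groups.values())


def transfer_function(groups, known):
    """Copy a function to unannotated genes when their cluster agrees on one."""
    predicted = {}
    for group in groups:
        functions = {known[g] for g in group if g in known}
        if len(functions) == 1:
            (function,) = functions
            for g in group:
                if g not in known:
                    predicted[g] = function
    return predicted


# Made-up cross-genome similarities: (gene, gene, score between 0 and 1)
pairs = [("eco_g1", "hin_g7", 0.9), ("hin_g7", "mge_g3", 0.8), ("eco_g2", "hin_g9", 0.2)]
known = {"eco_g1": "function X"}

groups = cluster(pairs, threshold=0.5)
predicted = transfer_function(groups, known)
print(sorted(sorted(g) for g in groups))
print(predicted)
```

Genes from a newly sequenced genome that fall into an annotated cluster inherit a putative function this way; clusters with conflicting or no annotations are left unassigned in this sketch.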
EcoCyc Database: Encyclopedia of E. coli genes and Metabolism
4300 genes, 695 enzymes, 595 reactions, 123 pathways
Blue: E. coli only; Green: both E. coli and H. influenzae.
Karp et al, Nucl. Acids Res. (1998) 26, 50
Querying in Databases
• Based on sequence similarity; gives similar
sequences and the similarity score or expectation
value.
• Normally a BLAST or FASTA search (local alignment).
It can also look for a sequence motif.
• Gene names, biological source, functional category,
cellular location / role.
• Structural features (for known 3-D structures).
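To show what a local-alignment similarity score is, the sketch below implements Smith-Waterman dynamic programming. This is a swapped-in technique chosen for clarity: BLAST and FASTA are faster heuristic searches and also report expectation values, which this example does not compute. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below 0, so alignments are local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, hit_a, hit_b = smith_waterman("TTACGTTGCC", "GGACGTTGAA")
print(score, hit_a, hit_b)
```

A database query repeats this kind of comparison against every stored sequence and ranks the hits by score, which is why heuristic shortcuts are used in practice.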
Bioinformatics:
A multidisciplinary effort is required
• Generation of biological data
• Storage and Retrieval of Data
• Conversion of known biological hypotheses into
mathematical/statistical models
• Building models from data
• Fitting new data to existing models.
• Searching for patterns in data
• Derive new biological knowledge from Data