Biological Information and
Biological Databases
Meena K Sakharkar
Bioinformatics Centre
National University of Singapore
Biological Information
Nature of Life Science Information
• Descriptive
• Classification and Nomenclatural
• Observational and Phenomenological
• Experimental
• Deduced/Computed
• Simulated?
• Theoretical?
Descriptive
Classify and Give Names
• Classification and Nomenclature
• Linnaeus - binomial nomenclature
• Group into kingdoms, phyla, classes, orders,
families, genera, species, subspecies, strains,
etc
• Associate descriptions to these classification
schema, and classify according to description
etc
Observational/Phenomenological
• Like descriptive, yet more active
• Observe a lot of biological phenomena
• Charles Darwin
• Gregor Mendel to McClintock
• Start to do some experiments
Experimental
• From dissections to
complex genetic
engineering
experiments
BioInformatics
• Deduced/Computed
• Simulated?
• Theoretical?
What is BioInformatics?
• Many related terms and buzzwords
• A multiplicity of names:
– bioinformatics
– biocomputing
– biological computing
– computational biology
– computational genomics
– biological data mining
Overview of the challenges of Molecular Biology
Computing
• The huge dataset problem
– automated DNA sequencers
– the Human Genome Project
– bulk sequencing of cDNAs (ESTs)
Human Genome Project
• What is the Human Genome Project?
– 15-year effort formally begun in October 1990, coordinated by the
U.S. Department of Energy and the National Institutes of Health.
– identify all the estimated 80,000 genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make
up human DNA,
– store this information in databases,
– develop tools for data analysis, and
– address the ethical, legal, and social issues (ELSI) that may arise
from the project.
• Who is head of the U.S. Human Genome
Project?
– The DOE Human Genome Program is directed by Ari Patrinos,
and Francis Collins directs the NIH Human Genome Program.
– Ari Patrinos also heads the Department of Energy Office of
Biological and Environmental Research.
• What are the comparative genome sizes of humans and
other organisms being studied?
If compiled in books, the data would fill an estimated 200
volumes the size of a Manhattan telephone book
(at 1000 pages each), and reading it would require 26
years working around the clock
Informatics: Data Collection and Interpretation
HUMAN GENETIC DIVERSITY
• The Ultimate Human Genetic Database
• Any two individuals differ in about 3 x 10^6 bases (0.1%).
• The population is now about 5 x 10^9.
• A catalog of all sequence differences would require 15 x 10^15 entries.
• This catalog may be needed to find the rarest or most complex disease genes.
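The arithmetic behind these figures can be checked with a short sketch. It uses only numbers quoted on the slides (about three million differing bases per pair of individuals, a population of about five billion, and the 3 billion bases of human DNA from the Human Genome Project slide). Treating the catalog size as a simple product is an assumption about how the slide's estimate was reached; the variable names are illustrative.

```python
# Figures quoted on the slides; the simple product below is an assumed
# reading of how the catalog estimate was obtained.
differences_per_pair = 3 * 10**6   # bases differing between two individuals
population = 5 * 10**9             # population figure quoted on the slide
genome_size = 3 * 10**9            # bases in human DNA (Human Genome Project slide)

# Fraction of the genome that differs between two individuals.
fraction_differing = differences_per_pair / genome_size

# Naive catalog size: one entry per differing base per individual.
catalog_entries = differences_per_pair * population

print(f"fraction differing: {fraction_differing:.1%}")
print(f"catalog entries: {catalog_entries:.1e}")
```

The product gives 15 x 10^15 entries, matching the slide; it is an upper bound, since many differences are shared between individuals.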
Databases
Basic Terminology
What is a nucleotide/protein sequence database, and what is a databank?
• Database
A collection of nucleotide/protein sequences and their associated annotations.
• Databank
A group that collects, compiles, maintains and distributes the database.
Fundamental
Dogma
Work from the Code of Life
Deduced and Computed
Information in the Era of
Computational Biology
Databases
• What are the different kinds of databases and their formats?
Nucleic Acid Sequence
EMBL at EBI.
GENBANK at NCBI.
DDBJ in Japan.
Protein Sequence
SWISS PROT
NBRF(PIR)
Database
• Protein structure databases
PDB
• Structural data for proteins/nucleic acids whose 3-D structure has been
solved by X-ray crystallography or NMR.
NRL 3D Database
• NRL_3D is a sequence-structure database.
• Can be used in conjunction with PIR.
• PDB with PIR.
GenBank Entry
EMBL Entry
SwissProt Entry
Other databases
• Genome Databases
– GDB: Genome Data Bank
– OMIM
• Pattern Databases
– Prosite
– TFD
Usage of databases
• Annotation Searches - KW, Authors, Features.
– What is the protein sequence for human insulin?
– What does the 3D structure of calmodulin look like?
– What is the genetic location of cystic fibrosis gene?
– List all introns in rat?
• Homology Searches
– Is there any protein sequence that is similar to mine?
– Is this gene known in any other species?
– Has someone already cloned this sequence?
Usage of databases
• Pattern searches
– Does my sequence contain any known motif (that can give me a
clue about the function)?
– Which known sequences contain this motif?
– Is any part of my sequence recognised by a transcription factor?
– List all known start, splice and stop signals in my genomic
sequence
• Prediction - Use the database as knowledge database
– What may the structure of my protein be?
• Secondary structure prediction
• Modeling by homology
– What is the gene structure of my genomic sequence?
– Which parts of my protein have a high antigenicity?
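A pattern search of the kind listed above ("Does my sequence contain any known motif?") can be sketched with a regular expression. The example below assumes the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}, translated by hand into regex syntax; the function name and the toy sequence are hypothetical, and a real search would run against the pattern database itself.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}, hand-translated to a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, motif=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for every match."""
    return [(m.start() + 1, m.group(1)) for m in motif.finditer(sequence.upper())]


# Hypothetical toy peptide, not taken from any database.
toy_sequence = "MKNGSAANPTLNCTQ"
for position, residues in find_motif(toy_sequence):
    print(position, residues)
```

The "NPT" stretch in the toy peptide is skipped because proline is excluded at the second position of the pattern.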
Usage of Databases
• Comparisons:
– Gene Families
– Phylogenetic Trees
GenBank Growth Chart
[Figure: number of bases in GenBank by date, Dec-82 to Apr-98; vertical axis in bases, from 0 to 1,600,000,000]
Evolutionary basis of Alignment
• Enable the researcher to determine if two
sequences display sufficient similarity to
justify the inference of homology.
• Similarity is an observable quantity that
may be expressed as say %identity or some
other measure.
• Homology is a conclusion drawn from this
data that the two genes share a common
evolutionary history.
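Percent identity, the similarity measure mentioned above, can be computed from two already aligned sequences. This is a minimal sketch assuming one common convention: columns containing a gap are skipped, and identity is taken over the remaining columns. Other denominators are also in use, and the function name and example strings are illustrative only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical toy alignment: 8 ungapped columns, 6 of them identical.
print(percent_identity("MASS-VPPM", "MATSSVPAM"))
```

The number is only the observable similarity; whether it justifies an inference of homology remains a separate judgment.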
Sequence Formats
Fasta Format
>SANJAY REFORMAT of: SANJAY.seq check: 8826 from: 1 to: 573 March 12, 1998
MASSSVPPMITEEEARFEAEVSAVESWWRTDRFRLTRRPYSARDVVSLRGTLHHSYASDQ
MAKKLWRTLKSHQSAGTASRTFGALDPVQVTMMAKHLDTIYVSGWQCSSTHTATNEPGPD
LADYPYNTVPNKVEHLFFAQLYHDRKQHEARVSMTREQRAKTPYVDYLRPIIADGDTGFG
GATATVKLCKLFVERGAAGVHIEDQSSVTKKCGHMAGKVLVAVSEHINRLVAARLQFDVM
GVETVLVARTDAVAATLIQSNVDLRDHQFILGATNPDFKRRSLAAVLSAAMAAGKTGAVL
QAIEDDWLSRAGLMTFSDAVINGINRQLPEYEKQRRLNEWAAATEYSKCVSNEQGREIAE
RLGAGEIFWDWDIARTREGFYRFRGSVEAAVVRGRAFAPHADLIWMETSSPDLVECGKFA
QGMKASHPEIMLAYNLSPSFNWDAAGMTDEEMRDFIPRIAKMGFCWQFITLGGFHADALV
TDTFAREFAKQGMLAYVERIQREERNNGVDTLAHQKWSGANYYDRYLKTVQGGISSTAAM
GKGVTEEQFKEESRTGTRGLDRGGITVNAKSRL
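A record like the one above has a header line starting with ">" followed by one or more lines of sequence. A minimal reader for that layout might look like the following sketch; the function name and the two short records are hypothetical, and production work would normally use an established library instead.

```python
def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record input for illustration.
example = """>seq1 toy protein
MASSSVPPMI
TEEEARFEAE
>seq2 another toy
MAKKLWRTLK
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```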
GCG Format
ckl.seq Length: 473 September 15, 1999 12:25 Type: P Check: 8103 ..
1 MSTKYSASAE SASSYRRTFG SGLGSSIFAG HGSSGSSGSS RLTSRVYEVT
51 KSSASPHFSS HRASGSFGGG SVVRSYAGLG EKLDFNLADA INQDFLNTRT
101 NEKAELQHLN DRFASYIEKV RFLEQQNSAL TVEIERLRGR EPTRIAELYE
151 EEMRELRGQV EALTNQRSRV EIERDNLVDD LQKLKLRLQE EIHQKEEAEN
201 NLSAFRADVD AATLARLDLE RRIEGLHEEI AFLRKIHEEE IRELQNQMQE
251 SQVQIQMDMS KPDLTAALRD IRLQYEAIAA KNISEAEDWY KSKVSDLNQA
301 VNKNNEALRE AKQETMQFRH QLQSYTCEID SLKGTNESLR RQMSEDGGAA
351 GREAGGYQDT IARLEAEIAK MKDEMARHLR EYQDLLNVKM ALDVEIATYR
401 KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
451 IKTIETRDGE VVSESTQHQQ DVM
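In the GCG layout shown above, a header line ending in ".." carries the length and checksum, and each sequence line starts with a position number followed by blocks of ten residues. The sketch below recovers the plain sequence from such a record, assuming exactly that layout; the function name and the shortened toy record are hypothetical, and the checksum is not verified.

```python
import re


def parse_gcg(text):
    """Return (sequence, declared_length) from a GCG-style record."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            header = line
            body = lines[index + 1:]
            break
    else:
        raise ValueError("no header line ending in '..' found")
    # Remove position numbers and the spaces between ten-residue blocks.
    sequence = re.sub(r"[\s\d]", "", "".join(body))
    match = re.search(r"Length:\s*(\d+)", header)
    declared_length = int(match.group(1)) if match else None
    return sequence, declared_length


# Hypothetical shortened record in the same layout as the slide.
toy_record = """toy.seq Length: 20 Type: P ..
1 MSTKYSASAE SASSYRRTFG
"""
sequence, declared_length = parse_gcg(toy_record)
print(sequence, declared_length, len(sequence) == declared_length)
```

Comparing the declared length with the recovered sequence length is a cheap sanity check when converting between formats.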
Taxonomy Database
Blast Results
Examples of the New Biology
1. Full genome-genome comparisons
2. Rapid assessment of polymorphic genetic variations
3. Complete construction of orthologous or paralogous groups of genes
4. Structure determination of large macromolecular
assemblies/complexes
5. Dynamic simulation of realistic oligomeric systems
6. Rapid structural/topological clustering of proteins
7. Prediction of unknown molecular structures; protein folding
8. Computer simulation of membrane structure and dynamic function
9. Simulation of genetic networks and the sensitivity of these pathways to
component stoichiometry and kinetics
10. Integration of observations across scales of vastly different dimensions
and organization to yield realistic environmental models for basic
biology and societal needs
Theoretical?
• The day will dawn when we will have sufficient information to understand
how basic life functions are integrated into a living cell, and how such
cells intercommunicate and interoperate to function as a living whole.
Then maybe, we can start talking about theoretical biology
Categories of BioDbs - by domain of information
• DNA
• RNA
• Protein
• Genomic Mapping
• Pathways
• Structure
• Bibliographic
• Biochemical/Molecular/Miscellaneous
Other categories
• By category of species
• By families or superfamilies of molecules
etc
• Demo
http://www.infobiogen.fr/services/dbcat/
Demonstration of BioDatabases
• Majority of Life
Science databases are
online, accessible with
Web via Internet
• Catalogs of databases
available
• Need for a Registry to
keep track and offer
quality control