Contents
Bioinformatics
•  Examples of biological databases
   •  Nucleic sequences: Genbank, EMBL, and DDBJ
   •  Protein sequences: UniProt
   •  The Gene Ontology (GO) project
•  Issues and perspectives for biological databases
Biomolecular databases
Jacques van Helden
FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st, 2011)
[email protected]
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics
(TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/
Examples of biomolecular databases
•  Sequence and structure databases
   •  Protein sequences (UniProt)
   •  DNA sequences (EMBL, Genbank, DDBJ)
   •  3D structures (PDB)
   •  Structural motifs (CATH)
   •  Sequence motifs (PROSITE, PRODOM)
•  Genome sequences and annotations
   •  Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, ...)
   •  Multiple genomes (Integr8, NCBI, KEGG, TIGR, ...)
•  Molecular functions
   •  Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
   •  Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
   •  Transport (YTPdb)
•  Biological processes
   •  Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
   •  Signal transduction pathways (CSNdb, Transpath)
   •  Protein-protein interactions (DIP, BIND, MINT)
   •  Gene networks (GeneNet, FlyNets)
Databases of databases
•  There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.
•  Every year, the first issue of Nucleic Acids Research is dedicated to biological databases
   •  http://nar.oupjournals.org/
   •  2011 Issue: http://nar.oxfordjournals.org/content/39/suppl_1
•  The same journal maintains a database of databases: the Molecular Biology Database Collection
   •  http://www.oxfordjournals.org/nar/database/c/
•  Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
   •  http://srs.ebi.ac.uk/
Nucleic sequence databases: GenBank, EMBL, and DDBJ
Nucleic sequence databases
•  To publish an article dealing with a sequence, scientific journals require that the sequence first be deposited in a reference database.
•  There are 3 main repositories for nucleic acid sequences.
•  Sequences deposited in any of these 3 databases are automatically synchronized with the other 2.
The sequencing pace
•  Nucleic sequences
   •  Genbank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
      •  126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
      •  191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
•  Entire genomes
   •  GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
   •  http://www.genomesonline.org/gold_statistics.htm
•  Protein sequences
   •  Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
   •  UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
   •  http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Okubo et al. (2006) NAR 34: D6-D9
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                     entries      nucleotides
------------------------------------------------------------------
CON:Constructed                         7,236,371  359,112,791,043
EST:Expressed Sequence Tag             73,715,376   40,997,082,803
GSS:Genome Sequence Scan               34,528,104   21,985,922,905
HTC:High Throughput CDNA sequencing       491,770      594,229,662
HTG:High Throughput Genome sequencing     152,599   25,159,746,658
PAT:Patents                            24,364,832   12,117,896,594
STD:Standard                           13,920,617   37,665,112,606
STS:Sequence Tagged Site                1,322,570      636,037,867
TSA:Transcriptome Shotgun Assembly      8,085,693    5,663,938,279
WGS:Whole Genome Shotgun               88,288,431  305,661,696,545
                                      -----------  ---------------
Total                                 252,106,363  450,481,663,919

Division                                  entries      nucleotides
------------------------------------------------------------------
ENV:Environmental Samples              30,908,230   14,420,391,278
FUN:Fungi                               6,522,586   11,614,472,226
HUM:Human                              32,094,500   38,072,362,804
INV:Invertebrates                      31,907,138   52,527,673,643
MAM:Other Mammals                      40,012,731  145,678,620,711
MUS:Mus musculus                       11,745,671   19,701,637,499
PHG:Bacteriophage                           8,511       85,549,111
PLN:Plants                             52,428,994   55,570,452,118
PRO:Prokaryotes                         2,808,489   28,807,572,238
ROD:Rodents                             6,554,012   33,326,106,733
SYN:Synthetic                           4,045,013      782,174,055
TGN:Transgenic                            285,307      849,743,891
UNC:Unclassified                        8,617,225    4,957,442,673
VRL:Viruses                             1,358,528    1,518,575,082
VRT:Other Vertebrates                  22,809,428   42,568,889,857
                                      -----------  ---------------
Total                                 252,106,363  450,481,663,919
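The per-class figures in the EMBL release table do not add up to the reported totals in the obvious way, so a quick check helps when reading them. The sketch below re-adds the class rows copied from the table. It suggests that the entry total counts every class, while the nucleotide total matches only when the CON class is left out. The explanation offered in the comments (constructed records being assembled from other entries) is an assumption, not something the release notes excerpt states.

```python
# Per-class figures copied from the EMBL Release 113 table:
# class code -> (entries, nucleotides)
EMBL_CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
REPORTED_TOTAL = (252_106_363, 450_481_663_919)  # (entries, nucleotides)


def totals(classes, exclude=()):
    """Sum entries and nucleotides over all classes not listed in `exclude`."""
    entries = sum(e for code, (e, _) in classes.items() if code not in exclude)
    nucleotides = sum(n for code, (_, n) in classes.items() if code not in exclude)
    return entries, nucleotides


all_entries, all_nucleotides = totals(EMBL_CLASSES)
_, nucleotides_without_con = totals(EMBL_CLASSES, exclude={"CON"})

# Entries: the reported total matches the sum over every class, CON included.
print(all_entries == REPORTED_TOTAL[0])
# Nucleotides: the reported total matches only when CON is excluded
# (assumed reason: CON records are built from sequences already counted).
print(nucleotides_without_con == REPORTED_TOTAL[1])
```

The same helper can be pointed at the per-division rows, which sum to the same nucleotide total.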
Adapted from Didier Gonze
Genbank (NCBI - USA)
http://www.ncbi.nlm.nih.gov/Genbank/
The EMBL Nucleotide Sequence Database (EBI - UK)
http://www.ebi.ac.uk/embl/
DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
Size of the nucleic sequence databases
•  Summary of database contents for the 3 main databases of nucleic sequences.
•  Source: NAR database issue January 2006.

Database   URL
DDBJ       http://www.ddbj.nig.ac.jp/
EMBL       http://www.ebi.ac.uk/embl/
GenBank    http://www.ncbi.nlm.nih.gov/
Rows: Sequences; Bases (without shotgun); Bases (including shotgun); Organisms.
Values, in extraction order (original row and column positions not recoverable): 2.0E+06, 1.7E+09, 4.6E+07, 5.1E+10, 1.0E+11, 1.0E+11, 2.0E+05, 2.1E+05
UniProt: protein sequences and functional annotations
UniProt - the Universal Protein Resource
http://www.uniprot.org/
•  Database content (Sept 2012)
   •  UniProtKB:
      •  24,532,088 entries
      •  Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
   •  UniProtKB/Swiss-Prot section (reviewed):
      •  537,505 entries
      •  annotation by experts
      •  high information content
      •  many references to the literature
      •  good reliability of the information
   •  The rest (90% of the entries)
      •  Automatic annotation by sequence similarity.
•  Features
   •  The most comprehensive protein database in the world.
   •  A huge team: >100 annotators + developers.
   •  Annotation by experts: annotators are specialized for different types of proteins or organisms.
   •  Recognized worldwide as an essential resource.
•  References
   •  Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
   •  The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
UniProt example - Human Pax-6 protein: Header: name and synonyms
Taxonomic distribution of the sequences
Taxonomic distribution of the sequences: Within Eukaryotes
UniProt example - Human Pax-6 protein: Human-based annotation by specialists
UniProt example - Human Pax-6 protein: Structured annotation: keywords and Gene Ontology terms
UniProt example - Human Pax-6 protein: Protein interactions; Alternative products
UniProt example - Human Pax-6 protein: Detailed description of regions, variations, and secondary structure
UniProt example - Human Pax-6 protein: Peptidic sequence
UniProt example - Human Pax-6 protein: References to original publications
UniProt example - Human Pax-6 protein: Cross-references to many databases (fragment shown)
3D Structure of macromolecules
PDB - The Protein Data Bank
http://www.rcsb.org/pdb/
Genome browsers
EnsEMBL Genome Browser (Sanger Institute + EBI)
http://www.ensembl.org/
UCSC Genome Browser (University of California Santa Cruz - USA)
http://genome.ucsc.edu/
•  Human gene Pax6 aligned with Vertebrate genomes
•  Drosophila gene eyeless (homolog to Pax6) aligned with Insect genomes
•  Drosophila 120kb chromosomal region covering the Achaete-Scute Complex
ECR Browser
http://ecrbrowser.dcode.org/
EnsEMBL - Example: Drosophila gene Pax6
http://www.ensembl.org/
Integr8 - access to complete genomes and proteomes
http://www.ebi.ac.uk/integr8/
Comparative genomics
Integr8 - genome summaries
Integr8 - clusters of orthologous genes (COGs)
Integr8 - clusters of paralogous genes
http://www.ebi.ac.uk/integr8/
Databases of protein domains
Prosite - protein domains, families and functional sites
http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo
•  Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
•  The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.
•  Note the 6 cysteines, characteristic of this domain.
Prosite - Example of profile matrix
Prosite - Example of sequence logo
Prosite - Example of domain signature
•  The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/
•  Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
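A string-based domain signature of the kind Prosite uses can be matched with an ordinary regular expression. The sketch below translates the usual pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) into a Python regex. The six-cysteine pattern and the test sequence are made up for illustration; they are not the actual Prosite entry for the Zn(2)-C6 domain, and the converter ignores rarer syntax variants.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):        # repetition such as (2) or (5,12)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [ABC] class
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical six-cysteine signature, for illustration only.
toy_pattern = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
toy_regex = prosite_to_regex(toy_pattern)

# Made-up sequence containing one occurrence of the toy signature.
toy_sequence = "MA" + "C" + "AA" + "C" + "A" * 6 + "C" + "A" * 5 + "C" + "AA" + "C" + "A" * 6 + "C" + "K"
match = re.search(toy_regex, toy_sequence)
print(toy_regex)
print(match.span() if match else None)
```

A pattern of this kind only says whether a residue is allowed at a position; profile matrices and HMMs, by contrast, weight each residue per column.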
CATH - Protein Structure Classification
http://www.cathdb.info/
•  CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
   •  Class (C)
   •  Architecture (A)
   •  Topology (T)
   •  Homologous superfamily (H)
•  The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
•  References
   •  Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
   •  Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
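The four CATH levels lend themselves to a simple four-field record. The sketch below assumes the common dotted notation in which a domain is labelled by four integers, one per level (C.A.T.H); the two codes used in the example are only illustrative values, and the helper names are hypothetical.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One integer per CATH level, from the most general to the most specific."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath(code: str) -> CathCode:
    """Parse a dotted four-part code such as '3.40.50.300' (assumed notation)."""
    fields = code.strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {len(fields)}: {code!r}")
    return CathCode(*(int(field) for field in fields))


def shared_levels(a: CathCode, b: CathCode) -> int:
    """Count the leading levels (C, then A, then T, then H) on which two codes agree."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


first = parse_cath("3.40.50.300")
second = parse_cath("3.40.50.720")
# The two illustrative codes agree on Class, Architecture and Topology,
# and differ at the Homologous superfamily level.
print(shared_levels(first, second))
```

Because the classification is strictly hierarchical, the number of shared leading levels is a crude measure of how structurally related two domains are.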
InterPro (EBI - UK)
http://www.ebi.ac.uk/interpro/
InterPro (EBI - UK): Antennapedia-like Homeobox (entry IPR001827)
Ontology definition
•  Ontology: the part of metaphysics that deals with being as being, independently of its particular determinations.
   Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993 (translated from the French).
The Gene Ontology (GO) database
The "bio-ontologies"
•  Answer to the problem of inconsistencies in the annotations
   •  Controlled vocabulary
   •  Hierarchical classification between the terms of the controlled vocabulary
•  E.g.: The Gene Ontology
   •  molecular function ontology
   •  process ontology
   •  cellular component ontology
Gene ontology: processes
Gene ontology: molecular functions
Gene ontology: cellular components
Gene Ontology Database (http://www.geneontology.org/)
Gene Ontology Database - Example: methionine biosynthetic process
QuickGO (http://www.ebi.ac.uk/QuickGO/)
•  A user-friendly Web interface to the Gene Ontology.
•  Graphical display of the hierarchical relationships between terms.
•  Convenient browsing between classes.
Status of GO annotations (NAR DB issue 2006)
•  Term definitions
   •  Biological process terms: 9,805
   •  Molecular function terms: 7,076
   •  Cellular component terms: 1,574
   •  Sequence Ontology terms: 963
•  Genomes with annotation: 30
   •  Excludes annotations from UniProt, which represent 261 annotated proteomes.
•  Annotated gene products
   •  Total: 1,618,739
   •  Electronic only: 1,460,632
   •  Manually curated: 158,107
Remarks on "bio-ontologies"
•  Improvement compared to free text
   •  controlled vocabulary (choice among synonyms)
   •  hierarchical relationships between the concepts
•  A general definition
   •  A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
   •  Nothing to do with the philosophical concept of ontology
•  Multiple possibilities of classification criteria
   •  e.g. compartment subtypes (plasma membrane is a membrane)
   •  e.g. compartment locations (nucleus is inside cytoplasm is inside plasma membrane)
   •  each biologist might wish to define his/her own classification based on his/her needs and scope of interest
   •  impossible to define a unifying standard for all biologists
•  To be useful, should remain purpose-based
•  No representation of molecular interactions
   •  relationships between objects are only hierarchical, not horizontal or cyclic
   •  e.g. does not describe which genes are the target of a given transcription factor
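The idea of a controlled vocabulary whose terms are linked only by hierarchical relationships can be made concrete with a very small data structure. The sketch below stores "is a" links between term names and walks them upward. The hierarchy is a hand-made toy (apart from the "plasma membrane is a membrane" example taken from the remarks above, the links are assumptions for illustration and are not copied from the Gene Ontology).

```python
# Toy "is a" hierarchy written by hand for illustration; not real GO content.
# Each term maps to the list of its direct parents.
IS_A = {
    "plasma membrane": ["membrane"],
    "membrane": ["cellular component"],
    "methionine biosynthetic process": ["amino acid biosynthetic process"],
    "amino acid biosynthetic process": ["biosynthetic process"],
    "biosynthetic process": ["biological process"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_kind_of(term, candidate, is_a=IS_A):
    """True if `candidate` is a direct or indirect parent of `term`."""
    return candidate in ancestors(term, is_a)


print(is_kind_of("plasma membrane", "membrane"))
print(sorted(ancestors("methionine biosynthetic process")))
```

The structure answers "is X a kind of Y?" but has no place for a statement such as "this transcription factor regulates that gene", which is the limitation the last remark points out.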
What is biological function?
•  Function: characteristic action or role of an element or organ within a whole (often opposed to structure).
   Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982 (translated from the French).
•  Function and gene ontology
   •  Understanding the function requires establishing the link between molecular activity and the context in which it takes place (process).
   •  Multifunctionality
      •  The same activity can play different roles in different processes.
         Example: scute gene in Drosophila melanogaster: a transcription factor (activity) involved in sex determination, determination of neural precursors and malpighian tubules (3 processes).
      •  Multiple activities of the same protein in a given process
         Example: the PutA protein in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
Small compounds, reactions and metabolic pathways
LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
Ecocyc, BioCyc and Metacyc - Metabolic pathways
Protein interaction networks and transduction pathways
Microarray databases
Human genome resources
HapMap
http://www.hapmap.org/
•  The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
•  Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.
Issues for biomolecular databases
Issues for biological databases
•  Dealing with biological complexity
•  Data content
   •  Coverage
   •  Information content
•  Data quality
   •  Data structure
   •  Consistency
•  Query capabilities
•  Interfaces
   •  User interfaces
   •  Programmatic interfaces
•  Annotation
•  Funding
Towards biological complexity
•  The main databases currently available focus on one type of molecular entity: nucleic sequences, proteins, compounds, ...
•  This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
•  It becomes more difficult if we want to represent
   •  the interactions between biological objects,
   •  the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, ...)
   •  complex concepts such as "biological function"
Data content
•  Scope of the database
   •  types of biological objects represented
•  Number of entries
   •  coverage of the current knowledge
•  Information content
   •  Level of detail in the description of the biological objects
•  References to the source of information
Data quality
•  Data consistency
   •  always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
   •  even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
   •  spelling mistakes
•  Data structure
   •  distinct fields for distinct attributes of the biological objects
•  Reliability
   •  Evidence? Level of confidence?
   •  Assignment of function by similarity
      •  recursive process -> propagation of errors
Query capabilities
•  Browsing (click and read)
•  Simple search
•  More elaborate search
•  Complex querying
   •  select records with some constraints
   •  select specific fields of some records with constraints on some fields (~SQL SELECT)
   •  ability to return an answer that results from a "live" computation, and was not part of any record of the database
Interfaces
•  User interfaces
   •  user-friendly
   •  convenient browsing
   •  intuitive query forms
   •  visualization (graphical output)
•  Programmatic interfaces
   •  communication with external programs:
      •  other databases (concept of distributed database)
      •  analysis tools
Annotation
•  Problem
   •  The flow of available data is increasing exponentially
•  Strategies
   •  internal curators
   •  selected external experts
   •  public submission
   •  computer-based extraction of information from biological texts
Funding
•  Public funding
   •  Problem: easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
•  Private funding
   •  Industrial companies are
      •  ready to invest in good data and good query capabilities
      •  interested in academic expertise
•  Solutions
   •  All users pay (per query for example)
      •  Note: academic users are anyway funded by public funds
   •  Hybrid solution
      •  access is free for academic users, not for companies
      •  companies can buy the whole database and install it in-house (+ add their own private data)
      •  academia-industry interface is often ensured by a spinoff company
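The data-consistency advice (one ID per object, retrievable by any of its synonyms) can be sketched as a small lookup structure. Everything below is a hypothetical illustration: the class name, the `OBJ:` identifiers and the normalization rule are assumptions, and the names are only sample values.

```python
class SynonymIndex:
    """Store each object once under a stable ID and find it by any known name."""

    def __init__(self):
        self._records = {}     # ID -> record
        self._name_to_id = {}  # normalized name -> ID

    @staticmethod
    def _normalize(name):
        # Ignore case, spaces and punctuation so "Pax-6" and "PAX6" map to the same key.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def add(self, identifier, name, synonyms=()):
        keys = {self._normalize(n) for n in (identifier, name, *synonyms)}
        # Refuse a name that already points to a different object.
        for key in keys:
            owner = self._name_to_id.get(key)
            if owner is not None and owner != identifier:
                raise ValueError(f"{key!r} already refers to {owner}")
        self._records[identifier] = {
            "id": identifier,
            "name": name,
            "synonyms": list(synonyms),
        }
        for key in keys:
            self._name_to_id[key] = identifier

    def get(self, name):
        """Return the record for an ID, a primary name or any synonym."""
        return self._records[self._name_to_id[self._normalize(name)]]


index = SynonymIndex()
index.add("OBJ:0001", "PAX6", synonyms=["Pax-6", "Paired box protein Pax-6"])
index.add("OBJ:0002", "eyeless", synonyms=["ey"])

print(index.get("pax-6")["id"])
print(index.get("EY")["name"])
```

Rejecting a name that is already attached to another ID is one possible policy; a real database might instead need to tolerate ambiguous synonyms and return several candidates.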
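The three levels of complex querying listed under query capabilities map closely onto SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table layout, accession numbers and protein rows are invented for illustration and do not come from any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY, name TEXT, organism TEXT,"
    " length INTEGER, reviewed INTEGER)"
)
# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?, ?)",
    [
        ("X0001", "protein A", "Homo sapiens", 422, 1),
        ("X0002", "protein B", "Homo sapiens", 150, 0),
        ("X0003", "protein C", "Drosophila melanogaster", 857, 1),
        ("X0004", "protein D", "Escherichia coli", 1320, 0),
    ],
)

# 1. Select records with some constraints.
human = conn.execute(
    "SELECT * FROM protein WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()

# 2. Select specific fields of some records, with constraints on other fields.
reviewed_long = conn.execute(
    "SELECT accession, length FROM protein"
    " WHERE reviewed = 1 AND length > 300 ORDER BY accession"
).fetchall()

# 3. Return an answer computed "live" that is not stored in any record.
per_organism = conn.execute(
    "SELECT organism, COUNT(*), AVG(length) FROM protein"
    " GROUP BY organism ORDER BY organism"
).fetchall()

print(len(human))
print(reviewed_long)
print(per_organism)
conn.close()
```

The third query is the interesting one for database design: the counts and averages exist nowhere in the stored records, so the interface has to expose computation, not just retrieval.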