Sequencing the World of Possibilities for Energy & Environment
Information Sources for Genomics
Konstantinos Mavrommatis
Genome Biology Program
MGM workshop. 19 Oct 2010
Databases
• Databases used for the analysis of biological molecules.
• Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
• Why bother?
  • Store information.
  • Organize data.
  • Predict features (genes, functions ...).
  • Predict the functional role of a feature (annotation).
  • Understand relationships (metabolic reconstruction).
Overview
• Sequence databases
  Primary (contain “raw” data)
    • Nucleotide
    • Protein
  Secondary (processed information)
    • Genes
    • Proteins
• Classification databases
  Sequence classification
  Function classification
  Other methods
• Other specialized databases
Primary nucleotide databases
EMBL/GenBank/DDBJ
(http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
• Archive containing all sequences from:
  • genome projects
  • sequencing centers
  • individual scientists
  • patent offices
• The sequences are exchanged between the three centers on a daily basis.
• Database is doubling every 10 months.
• Sequences from >140,000 different species.
• 1400 new species added every month.

Year    Base pairs        Sequences
2004    44,575,745,176    40,604,319
2005    56,037,734,462    52,016,762
2006    69,019,290,705    64,893,747
2007    83,874,179,730    80,388,382
2008    99,116,431,942    98,868,465
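As a rough check on growth rates, the doubling time implied by two yearly totals can be estimated under the assumption of steady exponential growth. This is only a sketch: the function name `doubling_time` is ours, and the inputs are the 2004 and 2008 base-pair totals listed above. Those two totals alone imply a slower rate than the 10 months quoted on the slide, so the two figures may be counting different subsets of the archive.

```python
import math

def doubling_time(size_start, size_end, elapsed):
    """Doubling time, assuming steady exponential growth between two sizes."""
    return elapsed * math.log(2) / math.log(size_end / size_start)

# Base-pair totals for 2004 and 2008, as listed on the slide.
bp_2004 = 44_575_745_176
bp_2008 = 99_116_431_942

years = doubling_time(bp_2004, bp_2008, elapsed=4)
print(f"Implied doubling time: {years:.1f} years ({years * 12:.0f} months)")
```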
Primary protein sequence databases
Contain coding sequences derived from the translation of nucleotide sequences.
• GenBank
  • Valid translations (CDS) from nt GenBank entries.
• UniProtKB/TrEMBL (1996)
  • Automatic CDS translations from EMBL.
  • TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.
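To illustrate what "automatic CDS translation" involves, here is a minimal sketch that translates a coding sequence into a protein sequence. It assumes the standard genetic code, a forward-strand sequence, and reading frame 0. The name `translate_cds` is hypothetical and this is not the pipeline GenBank or TrEMBL actually runs.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order (* = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_cds("ATGGCCTGGTAA"))
```

A real translation step would also need the feature coordinates, the strand, and the translation table appropriate to the organism.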
Errors in databases
There are a lot of errors in the primary sequence databases:
• In the sequences themselves:
  Sequencing errors.
  Cloning vector sequences.
• In the annotations, the free submission of entries results in:
  Inaccuracies, omissions, and even mistakes.
  Inconsistencies between some fields.
Redundancy
• Redundancy is a major problem.
• Entries are partially or entirely duplicated:
  e.g. 20% of vertebrate sequences in GenBank.
[Figure: partial and complete sequence duplications]
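The simplest form of redundancy removal can be sketched as follows: drop entries that are exact copies of another entry or are wholly contained in a longer one. This is an illustration only. The function name `remove_redundant` is hypothetical, and exact substring matching will miss near-duplicates that differ by even one base, which real tools handle with similarity clustering.

```python
def remove_redundant(sequences):
    """Drop entries that duplicate, or are wholly contained in, a kept entry."""
    kept = []
    # Longest first, so a fragment is always compared against longer entries.
    for seq in sorted(set(sequences), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept

# Toy entries: one exact duplicate and one fragment of the first sequence.
entries = ["ATGGCCTGGTAA", "ATGGCCTGGTAA", "GCCTGG", "TTTTCCCC"]
print(remove_redundant(entries))
```

The comparison is quadratic in the number of entries, so this only suits small sets.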
NCBI Derivative Sequence Data
[Figure: diagram with the labels Labs, GenBank, Curators, RefSeq, Genome Assembly, Algorithms, UniGene]
RefSeq
• Curated transcripts and proteins.
  • Reviewed by NCBI staff.
• Model transcripts and proteins.
  • Generated by computer algorithms.
• Assembled genomic regions (contigs).
• Chromosome records.
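These record types can be told apart by the accession prefix. The sketch below maps a simplified subset of RefSeq prefixes onto the categories above. The helper name `refseq_record_type` and the wording of the labels are ours, and the full prefix scheme has more cases than shown.

```python
# Simplified subset of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NM_": "curated transcript (mRNA)",
    "NP_": "curated protein",
    "XM_": "model transcript (mRNA)",
    "XP_": "model protein",
    "NT_": "assembled genomic region (contig)",
    "NC_": "chromosome or other complete genomic molecule",
}

def refseq_record_type(accession):
    """Classify an accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown or non-RefSeq")

for acc in ["NM_000546.6", "XP_000001.1", "NC_000913.3", "AB123456"]:
    print(acc, "->", refseq_record_type(acc))
```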
Secondary protein databases
• UniProt/SWISS-PROT (1986) (http://ca.expasy.org/spro)
  • a curated protein sequence database
  • high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)
  • a minimal level of redundancy
  • high level of integration with other databases
Classification databases
Groups (families/clusters) of proteins based on...
  Overall sequence similarity.
  Local sequence similarity.
  Presence / absence of specific features (active site, signal peptides...).
  Structural similarity.
  ...
These groups contain proteins with similar properties.
  Specific function, enzymatic activity.
  General function.
  Evolutionary relationship.
  ...
Overall sequence similarity
Clusters of orthologous groups (COGs)
• COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
• Each cluster has representatives of at least 3 lineages.
• A function (specific or broad) has been assigned to each COG.
http://www.ncbi.nlm.nih.gov/COG/
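The lineage requirement above can be written as a simple filter. This sketch covers only that one criterion, not how COGs are actually computed. The function name, cluster names, and protein identifiers are hypothetical.

```python
def clusters_spanning_lineages(clusters, min_lineages=3):
    """Keep clusters whose members come from at least `min_lineages` lineages."""
    return {
        name: members
        for name, members in clusters.items()
        if len({lineage for _protein, lineage in members}) >= min_lineages
    }

# Hypothetical candidate clusters, given as (protein id, lineage) pairs.
candidates = {
    "cluster_A": [("p1", "Proteobacteria"), ("p2", "Firmicutes"), ("p3", "Euryarchaeota")],
    "cluster_B": [("p4", "Proteobacteria"), ("p5", "Proteobacteria")],
}
print(list(clusters_spanning_lineages(candidates)))
```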
Profiles & Pfam
• A method for classifying proteins into groups exploits regions of similarity, which contain valuable information (domains/profiles).
• These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
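To show how a profile captures the few conserved positions of a region, the sketch below builds a simple log-odds position-specific scoring matrix from a toy alignment and scores candidate segments against it. This is a deliberately simplified stand-in, not a profile HMM. The toy alignment, the uniform background frequencies, and the pseudocount are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the per-column scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Toy ungapped alignment of a conserved region (hypothetical sequences).
family = ["GDSGG", "GDSGA", "GDSGS"]
profile = build_profile(family)
print(score(profile, "GDSGG"), score(profile, "AAAAA"))
```

A segment matching the conserved columns scores well above an unrelated one, even though the family members differ at the last position.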
Region similarity
Pfam
http://pfam.sanger.ac.uk
HMMs of protein alignments, either local (for domains) or global (covering the whole protein)
TIGRfam
• Full-length alignments.
• Domain alignments.
• Equivalogs: families of proteins with specific function.
• Superfamilies: families of homologous genes.
• HMMs
http://www.tigr.org/TIGRFAMs/
KEGG orthology
Composite pattern databases
• To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
  • Release 28.0 (Aug 10) contains 20837 entries
  • Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/
* It is up to the user to decide if the annotation is correct *
ENZYME
http://ca.expasy.org/enzyme/
KEGG
• Contains information about biochemical pathways and protein interactions.
http://www.kegg.com
Sequencing projects
• GOLD
  • Information for ongoing and finished (meta)genomic projects.
  • Information about the metadata of genomes and metagenomic samples.
http://www.genomesonline.org
Literature search
PubMed
http://www.ncbi.nlm.nih.gov/Pubmed
Specialized databases
There is a large number of databases devoted to specific organisms.
For some model organisms there are often concurrent systems.
These databases are associated with sequencing or mapping projects.
Other specialized databases
Signal transduction, regulation, protein-protein interactions
  • TRANSFAC (Transcription Factor database)
  • BRITE (Biomolecular Relations in Information Transmission and Expression database)
  • DIP (Database of Interacting Proteins)
  • BIND (Biomolecular Interaction Network database)
  • BioCarta
Biochemical pathways
  • KLOTHO (Biochemical Compounds Declarative database)
  • BRENDA (enzyme information system)
  • LIGAND (similar to Enzyme but with more information for substrates)
Gene order and co-occurrence
  • STRING
Gene expression
  • GXD (Mouse Gene Expression Database)
  • The Stanford Microarray Database
3D structures
  • PDB (Protein Data Bank)
  • MMDB (Molecular Modelling Data Base)
  • NRL_3D (Non-Redundant Library of 3D Structures)
  • SCOP (Structural Classification of Proteins)
Mapping
  • GDB (Genome Data Base)
  • EMG (Encyclopedia of Mouse Genome)
  • MGD (Mouse Genome Database)
  • INE (Integrated Rice Genome Explorer)
Polymorphism
  • ALFRED (Allele Frequency Database)
Protein quantification
  • SWISS-2DPAGE
  • PDD (Protein Disease Database)
  • Sub2D (B. subtilis 2D Protein Index)
Molecular interactions
  • DIP (Database of Interacting proteins)
  • BIND (Biomolecular Interaction Network Database)
List of databases
http://www.oxfordjournals.org/nar/database/c
Databanks interconnection
[Figure: cross-links among databanks: Blocks, MIMMAP, REBASE, PDBFINDER, ALI, PROSITEDOC, OMIM, ProDom, PROSITE, SWISSNEW, ENZYME, DSSP, SWISSDOM, HSSP, FSSP, GenBank, PDB, MOLPROBE, SWISS-PROT, NRL_3D, ECDC, EPD, YPDREF, PMD, EMBL, YPD, EMNEW, TFSITE, TrEMBLNEW, ProtFam, FlyGene, TrEMBL, PIR, TFACTOR]
Not all databases are updated regularly.
Changes of annotation in one database are not reflected in others.
Concluding remarks
• We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more...
• They help predict the function, or the network of functions.
• Systems that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way are required.
Thank you for your attention.