Large scale protein sequence clustering
Prof. Dr. Antje Krause
Bioinformatics
Wildau University of Applied Sciences
Abstract
The concept of protein superfamilies, families and domains is
one of the oldest in computational biology. Back in the 60s,
when the first protein sequence database was published as
printed version, Margaret Dayhoff defined the basic principles of
this discipline with only a small number of sequences at hand.
Nowadays, with more than a million protein sequences available
in public databases, a constantly growing number of
uncharacterized proteins from completely sequenced genomes
and still a comparatively small number of known protein
structures, a systematic grouping and characterization of this
data is needed more than ever. This tutorial reviews the
different approaches developed during the last decades and
points out possible challenges waiting in the future.
Antje Krause
Poznań 14.07.2006
Margaret O. Dayhoff
“Dr. Margaret Oakley Dayhoff (1925-1983) was a
pioneer in the use of computers in chemistry and
biology, beginning with her PhD thesis project in
1948. Her work was multi-disciplinary, and used her
knowledge of chemistry, mathematics, biology and
computer science to develop an entirely new
field. She is credited today as a founder of the field
of Bioinformatics. This field is defined as the use of
computers in solving information problems in the life
sciences, mainly involving the creation of extensive
electronic databases on protein sequences and
genomes. Dr. Dayhoff was the first woman in the
field of Bioinformatics.”
http://www.dayhoff.cc/
Margaret O. Dayhoff
• deduce evolutionary
connections of the biological
kingdoms, phyla, and other
taxa from sequence evidence
• collection of all known protein
sequences
• made available to others in
1965 in a small book
• contained sequence
information of 65 proteins
• several releases followed
• resulted in the Protein
Information Resource (PIR)
Protein sequences
>O54090|O54090_SULAC Hypothetical protein (Fragment).
MKILDYSDLVFFRKLTNKMRDPKTRFDVREFINRGEDYLFNYTNKNVGGVDERRRKFLKS
LIFGMAA
>P70723|P70723_ACIAM Orf-2 (Fragment).
MSKNSLDNLGEKALELLKKYPLCDSCLGRCFAKLGYRFANKERGKAIKTYLVLELDRKIK
DHELEDLNEIKEILFNMGKEYLEYLIYLSNEKFQERT
>sptrembl|Q9V2V9|Q9V2V9_PYRAE Rieske iron sulfur protein (ParR).
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI
>O93973|O93973_MALSM Allergen.
MSNVIKKVFNTDKAEAEGSKVADAPQEAGHKGEGFLHDAKDRLQGFAGHGHHNAQNAASG
VAGSAGAGGAPSVPSANVDVTNPVNDASVQGGVEAPRSWSTQLPQSQSVADTTGATSAGR
NNLTQTTSTGSGVNVAAGNVDQDVQHLAPVTRHVHHRHEIEELLREREHHIHQHHIQHHV
QPVVDSEHLAEQIHSRVVPQTTVREVHANTDKDAALMRAVAGNPKDTFTQAAIDRSVIDK
GETVREIVHHHIHNIVQPIIEKETHEYHRIRTTIPTTHITHEAPIVHESTAHQPIRKEDF
LKGGGVLTSTTRSIEEVGLLNLGNNQRTVEGETYTGGLPLSQ
>Q02039|Q02039_RHYSE NIP1 precursor (NIP1 avirulence protein precursor).
MKFLVLPLSLAFLQIGLVFSTPDRCRYTLCCDGALKAVSACLHESESCLVPGDCCRGKSR
LTLCSYGEGGNGFQCPTGYRQC
>Q873M4|Q873M4_MALSM Manganese superoxide dismutase (Fragment).
PFYPIPSALPFPLPIHSLFSRRTRLFRFSRTAARAGTEHTLPPLPYEYNALEPFISADIM
MVHHGKHHQTYVNNLNASTKAYNDAVQAQDVLKQMELLTAVKFNGGGHVNHALFWKTMAP
QSQGGGQLNDGPLKQAIDKEFGDFEKFKAAFTAKALGIQGSGWCWLGLSKTGSLDLVVAK
DQDTLTTHHPIIGWDGWEHAWYLQYKNDKASYLKQWWNVVNWSEAESRYSEGLKASL
>Q2V2P9|Q2V2P9_YEAST Protein YDR119W-A.
MFFSQVLRSSARAAPIKRYTGGRIGESWVITEGRRLIPEIFQWSAVLSVCLGWPGAVYFF
SKARKA
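The entries above are in FASTA format: a header line starting with '>' followed by one or more sequence lines. A minimal sketch of reading such entries is given here. The function names are hypothetical, and the sketch assumes that the last two '|'-separated fields of the identifier are the accession and the entry name, as in the headers shown above.

```python
def parse_fasta(text):
    """Yield (accession, entry_name, description, sequence) for each entry."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _build_record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _build_record(header, chunks)


def _build_record(header, chunks):
    # Assumption: identifier looks like "ACC|NAME" or "db|ACC|NAME".
    identifier, _, description = header.partition(" ")
    fields = identifier.split("|")
    accession, entry_name = fields[-2], fields[-1]
    return accession, entry_name, description, "".join(chunks)


example = """>O54090|O54090_SULAC Hypothetical protein (Fragment).
MKILDYSDLVFFRKLTNKMRDPKTRFDVREFINRGEDYLFNYTNKNVGGVDERRRKFLKS
LIFGMAA
>sptrembl|Q9V2V9|Q9V2V9_PYRAE Rieske iron sulfur protein (ParR).
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
"""
records = list(parse_fasta(example))
for accession, entry_name, description, sequence in records:
    print(accession, entry_name, len(sequence))
```

Sequence lines are simply concatenated, so the line wrapping of the input does not matter.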
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI
Function?
Diseases?
Regulation?
Development?
Structure?
Evolutionary history?
Interactions?
Cellular location?
Tissue?
© David S. Goodsell 1999
Protein structures
• Prediction of protein structure is still not possible from sequence
alone
• Not all mechanisms of protein folding are known
• Experimental protein structure determination
– is time consuming
– is very expensive
– is not always possible (the protein must form a crystal)
– results in only one conformation
– does not show flexible regions
– does not show the protein in its natural environment
– can only be done with globular proteins (difficult with
transmembrane proteins)
Different categories of protein databases
• Protein sequence databases:
– Information about single proteins
• Protein structure databases:
– Information about single proteins
• Protein domain databases:
– Information about functional domains
• Protein (sequence) family databases:
– Information about groups of evolutionarily and functionally
related proteins
• Protein (structure) family databases:
– Information about structural elements
• Gene family databases:
– Information about groups of evolutionarily and functionally
related proteins or genes mainly of completely sequenced
species
Protein sequence databases
• UniProt = Universal Protein Resource
• Integration of Swiss-Prot/TrEMBL and PIR
• http://www.expasy.uniprot.org
• central repository of protein sequence and function
• maintained by
– European Bioinformatics Institute
– Swiss Institute of Bioinformatics
– Georgetown University
Protein sequence databases
• contain experimentally verified entries (Swiss-Prot) ...
• ... and translated entries from DNA databases, namely EMBL (TrEMBL):
– predicted proteins
– hypothetical proteins
– putative proteins
•Problem in the past: no clear difference between
experimentally verified entries/annotation and
predicted entries/annotation
Protein sequence databases
(Swiss-Prot/TrEMBL) → now UniProt!
ExPASy (http://www.expasy.ch)
Expert Protein Analysis System
SIB (http://www.isb-sib.ch)
Swiss Institute of Bioinformatics, Geneva, CH
Swiss-Prot (http://www.expasy.ch/sprot)
Manually curated protein sequence database
TrEMBL (translated EMBL)
Computer-annotated supplement to Swiss-Prot; contains all the translations of EMBL
nucleotide sequence entries not yet integrated
in Swiss-Prot
Protein sequence databases
(PIR-PSD)
NBRF (http://pir.georgetown.edu/nbrf)
National Biomedical Research Foundation
Georgetown, Washington DC, USA
JIPID Japan International Protein Information Database
MIPS (http://mips.gsf.de)
Munich Information Center for Protein Sequences,
GSF, Neuherberg, Munich
PIR (http://pir.georgetown.edu)
Protein Information Resource
Collaboration of NBRF, JIPID and MIPS
PSD (http://pir.georgetown.edu/pirwww/search/textpsd.shtml)
Protein Sequence Database
First published in the Atlas of Protein Sequence and Structure
(1965-1978), the first systematic collection of protein
sequences, generated by Margaret Dayhoff
Pattern search
Pattern construction
[SN]-P-x-[LV]-x(2)-H-A-x(3)-F.
Multiple Sequence Alignment
Patterns
• Use of standard IUPAC one-letter codes for amino acids
• Symbol 'x' for a position where any amino acid is possible
• Ambiguities are indicated by listing the acceptable amino
acids in square brackets '[ ]'
• Ambiguities are also indicated by listing the amino acids
that are not acceptable in curly braces '{ }'
• Elements are separated by '-'
• Repetition of an element is indicated by a numerical
value or a numerical range in parentheses following
that element
• Restriction of the pattern to either the N- or C-terminus of
a sequence is indicated by either starting with a '<'
symbol or ending with a '>' symbol
• A period ends the pattern
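The syntax rules above map almost one-to-one onto regular expressions. The following is a minimal sketch of such a translation; the function name prosite_to_regex is hypothetical, and only the constructs listed on this slide are handled (real PROSITE patterns allow a few more).

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip()
    if pattern.endswith("."):          # a period ends the pattern
        pattern = pattern[:-1]
    at_start = pattern.startswith("<")  # N-terminal restriction
    at_end = pattern.endswith(">")      # C-terminal restriction
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            regex = "."                          # any amino acid
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # not acceptable residues
        else:
            regex = residue                      # single residue or [..] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return ("^" if at_start else "") + "".join(parts) + ("$" if at_end else "")


print(prosite_to_regex("[SN]-P-x-[LV]-x(2)-H-A-x(3)-F."))
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."))
```

The resulting string can be passed to re.search to scan a protein sequence, which also illustrates the yes/no character of pattern matching: a sequence either matches or it does not.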
Example leucine zipper
L-x(6)-L-x(6)-L-x(6)-L.
Coiled-coil
PROSITE Entry PDOC00029
Example C2H2 zinc finger
[Figure: schematic of a C2H2 zinc finger in which two cysteine (C) and two histidine (H) residues coordinate a zinc ion (Zn)]
PROSITE Entry PDOC00028
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
Pattern
Advantages:
• easy and intuitive definition
• simple to use in automated processing
Disadvantages:
• yes/no-decisions: proteins not complying with a certain pattern will never be found although they may contain the domain
• needs multiple alignment
Rule
Advantages:
• easy and intuitive notation
• simple to use in manual processing
• able to model long range dependencies
Disadvantages:
• difficult to use in automated processing
Profile
• position specific scoring/weight matrix with N columns and 20+ rows
• N is the number of columns in a multiple alignment
= length of the multiple alignment
= length of the profile
• each row holds the information about 1 amino acid (IUPAC
code), about gap penalties or other properties
Scoring matrices
(e.g. BLOSUM62)
Average score method to calculate a profile
Multiple Sequence Alignment with N=10 columns and Z=23 rows
Profile with N=10 columns (k) and 20 + 1 rows (j)
C(i,k): Quantity of amino acid i in column k
S(i,j): Score of amino acid i and amino acid j in scoring matrix (e.g. BLOSUM62)
M(L,1) = (C(V,1) / Z) * S(V,L) + (C(I,1) / Z) * S(I,L) = (4 / 23) * 1 + (19 / 23) * 2 = 1.83
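The worked value above can be reproduced with a few lines of code. This is only a sketch of the average score method for a single alignment column: the function names are hypothetical, gaps are ignored, and the scoring table is restricted to the three residues I, L and V (with their BLOSUM62 values) to keep the example short.

```python
from collections import Counter

AMINO_ACIDS = "ILV"  # restricted alphabet for this toy example

# BLOSUM62 scores for the residue pairs used here (keys in sorted order).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "V"): 3,
    ("L", "L"): 4, ("L", "V"): 1, ("V", "V"): 4,
}


def score(a, b):
    """Symmetric lookup in the scoring matrix."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def profile_column(column):
    """Average score of every residue j against one alignment column."""
    counts = Counter(column)   # C(i,k): quantity of residue i in the column
    z = len(column)            # Z: number of sequences in the alignment
    return {
        j: sum(counts[i] / z * score(i, j) for i in counts)
        for j in AMINO_ACIDS
    }


# Column 1 of the slide example: 4 x V and 19 x I in Z = 23 sequences.
column = "V" * 4 + "I" * 19
profile = profile_column(column)
print(round(profile["L"], 2))  # value of M(L,1)
```

Repeating this for every column k of the alignment yields the full profile with N columns.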
Profile
Advantages:
• captures degree of conservation at each position in a multiple alignment
• statistical method
• simple to use in automated processing
Disadvantages:
• difficult to use in manual processing
• needs multiple alignment
• no formal statistical basis
Hidden Markov Model (HMM)
• statistical model where the system being modelled is assumed
to be a Markov process (stochastic process)
• the probability of being in one state depends only on the
previous state
• In a regular Markov model, the states are directly visible to the
observer, and therefore the state transition probabilities are the
only parameters
• HMM adds outputs: each state has a probability distribution
over the possible output tokens
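To make the roles of transition and output probabilities concrete, here is a small sketch of the forward algorithm, which sums the probability of an observed sequence over all hidden state paths. The two states and all probability values are invented for illustration and are not taken from any real protein model.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of the observed symbols, summed over all state paths."""
    # Initialisation: start in state s and emit the first symbol.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    # Recursion: the next state depends only on the previous state.
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model over a two-letter alphabet.
STATES = ("conserved", "variable")
START_P = {"conserved": 0.5, "variable": 0.5}
TRANS_P = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.3, "variable": 0.7},
}
EMIT_P = {
    "conserved": {"L": 0.9, "A": 0.1},
    "variable": {"L": 0.5, "A": 0.5},
}

p = forward_probability("LLA", STATES, START_P, TRANS_P, EMIT_P)
print(p)
```

The observer only sees the emitted symbols ("LLA"), never the states, which is what makes the model "hidden".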
Profile HMM architecture
[Figure: profile HMM architecture with Begin (Start) and End (Terminal) states, Match, Insertion and Delete states, states for N-terminal and C-terminal unaligned sequence, and a joining segment of unaligned sequences; from HMMER User Guide, http://hmmer.wustl.edu]
Profile HMM
Advantages:
• same as for profile
• statistical method with well established formal probabilistic basis
• can use unaligned sequences
Disadvantages:
• no manual processing
• needs a higher number of sequences to give a satisfactory result
Domain databases
• describe functional regions of proteins (called
domains, motifs, signatures...)
• a protein may consist of several and/or different
domains (multi-domain-protein)
• domains can be described with
– patterns (regular expressions)
– rules
– profiles
– Hidden Markov Models
Domain databases
Sanger Institute (http://www.sanger.ac.uk)
The Wellcome Trust Sanger Institute, Hinxton, GB
Pfam (http://www.sanger.ac.uk/Software/Pfam)
– Protein FAMilies database of alignments and HMMs
– sequences from Swiss-Prot and TrEMBL
Prosite (http://www.expasy.ch/prosite)
– database of protein families and domains
– consists of biologically significant sites, patterns and
profiles
– sequences from Swiss-Prot and TrEMBL
– manual annotation made by experts
InterPro
Integrated Resources of Protein
Families, Domains and
Functional Sites
Collaboration of Pfam,
PROSITE, PRINTS, ProDom,
SMART and TIGR
Used for automatic annotation
of entries in TrEMBL
Domain databases
FHCRC
Fred Hutchinson Cancer Research Center,
Seattle, Washington, USA
BLOCKS (http://www.blocks.fhcrc.org)
– multiply aligned ungapped segments corresponding to the
most highly conserved regions of proteins
– automatically derived from InterPro
– originally developed for the creation of scoring matrices
(substitution matrices)
→ BLOSUM62 (BLOcks SUbstitution Matrix)
Domain databases
SMART (http://smart.embl-heidelberg.de/)
Simple Modular Architecture Research Tool
• identification and annotation of genetically mobile domains and
the analysis of domain architectures
• signalling, extracellular and chromatin-associated proteins
Domain and family databases
Suppose we have n homologous protein sequences
• What do they have in common?
• What are the functional regions of these proteins?
• Which regions are conserved, which are not
conserved?
• How can we characterize these proteins/their
functional domains?
• What distinguishes these proteins/their functional
domains from others?
Similarity:
Expressed in score, E-value, % sequence identity,
etc.
Homology:
Relationship due to common ancestry
Orthology:
Genes in the genomes of different species with a
common ancestor (resulting from a speciation
event)
Paralogy:
Genes in the same genome with a common
ancestor (resulting from a duplication event)
But!
Similarity ≠ Homology
• Similarity is a good indicator for homology
• Normally we deduce homology from significant sequence
similarity
• But we cannot deduce sequence similarity from homology!
• Thus we also cannot deduce non-homology from a lack of sequence
similarity!
Database Search
Transitivity
• use of intermediate sequences to derive knowledge
about homology
• if the proteins A and B are homologous and the
proteins B and C are homologous, then A and C are
homologous, too
• this holds even if there is no sequence similarity
detectable between A and C!
Transitivity?
... may be limited to domains!
... but often it's difficult to define domain boundaries!
[Figure: repeated database searches with E-value cutoffs 1e-30, 1e-20 and 1e-10]
Sequence Clustering: Goals
Biologically meaningful partitioning of the data:
• Functional annotation
• Gain of information
• Reduction of the search space
• Selection of prototypic or representative
sequences
• Phylogenetic analyses
• Protein prediction
etc.
Protein Families
• “Protein superfamily” (Dayhoff, 1974):
Group of evolutionarily related proteins
• Hierarchy of homology domains, families, and
superfamilies (Barker, 1996)
Manual classification based on sequence
similarity
• Most current proteins are thought to be the
descendants of no more than 1,000 (structural)
ancestors (Chothia, 1994)
• But no “definition”!
Protein Families
Following M.Dayhoff we can think of a
• Protein superfamily as a group of proteins
– sharing domains
– being evolutionarily related
– showing weak sequence similarity
• Protein family as a group of proteins
– being (closely) evolutionarily related
– (showing at least 50% sequence similarity)
• Homeomorphic protein family as a group of
proteins
– having the same domains in the same order
Single Linkage Clustering
[Figure: single linkage hierarchy cut at weak, conservative and stringent cutoffs/thresholds]
Test data set
• starting with 171,191 redundant sequences from Swiss-Prot
• after all-against-all BLAST database searches:
19,407,137 pairwise values
• after excluding 27,305 fragments (being 90% identical to
another sequence over 95% of their sequence length):
13,083,209 pairwise values
• Reminder:
171,191 sequences
→ 14,653,093,645 possible pairwise values!
• only 0.132% of the sequence pairs result in an E-value < 10!
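The two figures in the reminder follow directly from the numbers on this slide: n sequences give n(n-1)/2 unordered pairs, and the fraction is the number of reported BLAST pairs divided by that total. A short check (the function name is hypothetical):

```python
def possible_pairs(n):
    """Number of unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


n_sequences = 171_191        # redundant Swiss-Prot sequences
n_blast_pairs = 19_407_137   # pairwise values after all-against-all BLAST

n_pairs = possible_pairs(n_sequences)
fraction = n_blast_pairs / n_pairs

print(f"{n_pairs:,}")        # 14,653,093,645
print(f"{fraction:.3%}")     # 0.132%
```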
143,886 non-redundant sequences and 13,083,209 pairwise values
[Figure: panels for sequence overlap thresholds of 10%, 50%, 75% and 90%]
Observations
• When doing single linkage clustering with this data, we can vary
the cutoff on any of the pairwise results of the BLAST searches, i.e., E-value, %
identity, length of local alignment, % alignment length of
sequence length, score, and all combinations!
• With a choice of at least 50% identity we are on the safe side
(this was Margaret Dayhoff’s original value for a protein family!)
• Unfortunately (but no surprise) nature does not behave in
cutoffs
• There are highly conserved protein families (e.g., histones) and
fast evolving protein families (e.g., immunoglobulins)
• Every protein family needs its own cutoff
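The effect of the cutoff can be seen in a small sketch of single linkage clustering with a static E-value cutoff, implemented with a union-find structure. The sequence names and E-values are invented for illustration; only pairs at or below the cutoff are linked.

```python
def single_linkage(sequences, hits, cutoff):
    """Cluster sequences by linking every pair with E-value <= cutoff."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the cluster representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b, evalue in hits:
        if evalue <= cutoff:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for s in sequences:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=sorted)


# Hypothetical pairwise BLAST results: (sequence, sequence, E-value).
hits = [("A", "B", 1e-40), ("B", "C", 1e-15), ("C", "D", 1e-5), ("E", "F", 1e-25)]
sequences = list("ABCDEF")

for cutoff in (1e-30, 1e-20, 1e-10):
    print(cutoff, single_linkage(sequences, hits, cutoff))
```

With the weakest cutoff, A and C end up in one cluster although there is no direct hit between them; they are joined through the intermediate sequence B. No single cutoff is appropriate for all three groups at once, which is the point of the observations above.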
SYSTERS
(SYSTEmatic Re-Searching)
Single linkage hierarchy → Superfamilies
Superfamily distance graph → Family clusters
Superfamilies as well as family clusters are derived
from the structure generated by the data itself
→ no need for a user-defined static cutoff
[Figure: single linkage tree with subtree sizes at the branching points: 259/212,012; 211,975/37; 1/36; 15/21; 1/18; 2/19; 13/5; 4/1]
Algorithm 1: Superfamily determination
Input: Tree T = (V, E) with n leaves (sequences)
Output: Superfamilies
1: for all leaves l_i ∈ V, i ∈ {1, ..., n} do
2:   q ← l_i
3:   I ← 0
4:   sf_i ← l_i
5:   while (q ≠ T_root) do
6:     p ← parent(q)
7:     J ← (subtreesize(p) - subtreesize(q)) / subtreesize(q)
8:     if (J > I) then
9:       I ← J
10:      sf_i ← q
11:    end if
12:    q ← p
13:  end while
14: end for
15: Resolve inclusions by keeping the largest superfamilies
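The pseudocode above can be transcribed fairly literally. The sketch below is one possible reading, not the SYSTERS implementation: the tree is given as hypothetical parent and subtree-size dictionaries, and step 15 is interpreted as dropping every selected node that lies below another selected node. The toy tree is invented and, for brevity, not binary.

```python
def determine_superfamilies(parent, subtree_size, leaves):
    """Pick, for every leaf, the ancestor just below the largest relative size jump."""
    chosen = set()
    for leaf in leaves:
        q, best_jump, superfamily = leaf, 0.0, leaf
        while parent[q] is not None:            # until q is the root
            p = parent[q]
            jump = (subtree_size[p] - subtree_size[q]) / subtree_size[q]
            if jump > best_jump:
                best_jump, superfamily = jump, q
            q = p
        chosen.add(superfamily)

    def has_chosen_ancestor(node):
        node = parent[node]
        while node is not None:
            if node in chosen:
                return True
            node = parent[node]
        return False

    # Step 15: resolve inclusions by keeping only the largest superfamilies.
    return {node for node in chosen if not has_chosen_ancestor(node)}


# Toy tree: a tight group a1..a4 (nodes n1, n2, n3) joined with seven
# loosely attached leaves b1..b7 (node m) at the root.
b_leaves = [f"b{i}" for i in range(1, 8)]
leaves = ["a1", "a2", "a3", "a4"] + b_leaves
parent = {"root": None, "n3": "root", "m": "root", "n2": "n3", "a4": "n3",
          "n1": "n2", "a3": "n2", "a1": "n1", "a2": "n1"}
parent.update({b: "m" for b in b_leaves})
subtree_size = {"root": 11, "n3": 4, "m": 7, "n2": 3, "n1": 2}
subtree_size.update({leaf: 1 for leaf in leaves})

result = determine_superfamilies(parent, subtree_size, leaves)
print(sorted(result))
```

In this toy tree the leaves a1 and a2 select n3, while a3 and a4 first select themselves and are then absorbed into n3 by the inclusion step. The b leaves have no dominant jump above them and remain singletons.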
456 superfamilies with cutoff < 1e-180
64,282 superfamilies in 40,288 separate trees
About 300,000 non-redundant sequences
[Figure: weighted graph with vertices A to G (|V| = 7, |E| = 15) and its minimal cut C]
Stop criterion: $x > \frac{|V|}{2}$ with $x = |E| \cdot \frac{\sum_{i \in C} w(i)}{\sum_{j \in E} w(j)}$
x = 15 * (6 / 42) = 2.14 < (7 / 2) → split graph, process subgraphs
x = 15 * (6 / 25) = 3.6 > (7 / 2) → output graph
Algorithm 2: weighted_HCS
(weighted variant of HCS, Highly Connected Subcluster; Hartuv & Shamir, 1999)
Input: Connected weighted graph G = (V, E) (original HCS: unweighted graph)
Output: Cluster graphs
1: (H1, H2, C) ← mincut(G)
2: $x \leftarrow |E| \cdot \frac{\sum_{i \in C} w(i)}{\sum_{j \in E} w(j)}$ (original HCS: x ← |C|)
3: if (x > (|V| / 2)) then
4:   output G
5: else
6:   weighted_HCS(H1)
7:   weighted_HCS(H2)
8: end if
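The decision in steps 2 and 3 can be checked numerically. The sketch below covers only the stop criterion, not the minimum cut computation or the recursion. The function names are hypothetical, and the inputs are the summary values from the example graph (|V| = 7, |E| = 15, cut weight 6, total edge weight 42 or 25).

```python
def weighted_cut_value(num_edges, cut_weight, total_weight):
    """x = |E| * (sum of cut edge weights / sum of all edge weights)."""
    return num_edges * cut_weight / total_weight


def is_highly_connected(num_vertices, num_edges, cut_weight, total_weight):
    """Stop criterion: output the graph as a cluster if x > |V| / 2."""
    return weighted_cut_value(num_edges, cut_weight, total_weight) > num_vertices / 2


# First example: x = 15 * (6 / 42), below 7 / 2, so the graph is split.
print(round(weighted_cut_value(15, 6, 42), 2), is_highly_connected(7, 15, 6, 42))

# Second example: x = 15 * (6 / 25), above 7 / 2, so the graph is output.
print(round(weighted_cut_value(15, 6, 25), 2), is_highly_connected(7, 15, 6, 25))
```

If every edge has weight 1, the total weight equals |E| and x reduces to the number of cut edges |C|, which is the criterion of the unweighted HCS algorithm.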
[Figure labels: Ephrin type A; Ephrin type B; predicted proteins (C. elegans and Drosophila)]
SLC: Single Linkage Clustering
SF: Superfamilies
SF+SC: Family clusters derived from superfamilies
About 300,000 non-redundant sequences
[Screenshot: SYSTERS web interface (systers.molgen.mpg.de) with the fields Family, Superfamily and Domains]
SYSTERS
• Exploit the self-structuring properties of the
data:
– Determine an individual cutoff for each superfamily
based on the single linkage hierarchy
– Split each superfamily into family clusters based on the
superfamily distance graph
• Automated and independent of static user-defined cutoffs
• Results accessible on the Internet
Protein family databases - ProtoNet
• http://www.protonet.cs.huji.ac.il/
• global classification of proteins into hierarchical clusters
• based on Swiss-Prot sequences, with TrEMBL sequences added
after clustering
• N. Kaplan et al., NAR, 2005, 33(DB)
• 3 different hierarchical clustering methods available, depending on the similarity measure (harmonic, geometric or arithmetic average), based on the BLAST E-value
Protein family databases - CluSTr
• http://www.ebi.ac.uk/clustr/index.html
• automatic hierarchical classification of all sequences in UniProt
• uses Z-Score based on Smith-Waterman comparison:
Z-Score = min(Z(A,B), Z(B,A)) with
Z(A,B) = (Score(A,B) – M) / σ with
M: arithmetic mean,
σ: standard deviation of all results
• R. Petryszak et al., Bioinformatics, 2005, 21(18)
• constructs single-linkage hierarchy
• provides a subset of clusters at several different cutoff values
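As an illustration of the Z-Score formula above, the following sketch computes Z(A,B) from a raw score and a list of background scores, then takes the minimum of both directions. The function names and all score values are hypothetical. How the background scores ("all results") are actually obtained is not specified on the slide, so they are passed in as plain lists here.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Z = (score - M) / sigma, with M and sigma taken from the background."""
    m = mean(background_scores)
    sigma = pstdev(background_scores)
    return (score - m) / sigma


def symmetric_z_score(score, background_ab, background_ba):
    """Z-Score = min(Z(A,B), Z(B,A)), the more conservative direction."""
    return min(z_score(score, background_ab), z_score(score, background_ba))


# Invented numbers: one Smith-Waterman score and two background distributions.
sw_score = 50
background_ab = [10, 30]   # mean 20, standard deviation 10
background_ba = [20, 40]   # mean 30, standard deviation 10

print(symmetric_z_score(sw_score, background_ab, background_ba))
```

Taking the minimum means that a pair is only considered similar if the score stands out against the background in both directions.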
Protein family detection - TribeMCL
• http://www.ebi.ac.uk/research/cgg/tribe/
• uses a Markov Clustering method based on BLAST Evalues
• primarily used for comparing protein sequence sets of
completely sequenced genomes, e.g. in ENSEMBL
• clustering software available
• provides one set of protein families
• more specific than other methods, but less sensitive
            Related          Not related
Found       True positive    False positive
Not found   False negative   True negative
• A.J. Enright et al., NAR, 2002, 30(7)
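The four cells of the table are what the terms "sensitive" and "specific" are computed from. A small sketch, assuming the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); some publications use "specificity" for TP / (TP + FP) instead, so the convention should be checked for each method. The sets in the example are invented.

```python
def confusion_counts(found, related, universe):
    """Count true/false positives and negatives for one protein family."""
    tp = len(found & related)             # found and related
    fp = len(found - related)             # found but not related
    fn = len(related - found)             # related but not found
    tn = len(universe - found - related)  # neither found nor related
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of related proteins that were found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of unrelated proteins that were correctly left out."""
    return tn / (tn + fp)


# Hypothetical example with ten proteins numbered 0..9.
universe = set(range(10))
related = {0, 1, 2, 3}
found = {0, 1, 4}

tp, fp, fn, tn = confusion_counts(found, related, universe)
print(tp, fp, fn, tn)
print(sensitivity(tp, fn), specificity(tn, fp))
```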
But wait a moment...
• Why so many databases?
• Which one is “right”, which one is “wrong”?
• How can we prove that the results are correct?
• We want to answer biological questions with these databases
• Different databases are needed to answer different questions
• There is no “right” or “wrong”
• The benefit highly depends on the questions
• The more concise the question, the more beneficial the answer
Gene family databases
Suppose we have the gene/protein sequences of 2 completely
sequenced species
• Which genes/proteins do these species have in common?
• Which genes/proteins are orthologous?
• Where are the differences?
• Which genes/proteins have paralogs in one or the other
species?
What happens to a duplicated gene?
Duplication-Degeneration-Complementation Model (DDC)
Lynch & Force (Genetics, 1999/2000)
Pairwise-best-hit-method
1. Search with all protein sequences of species A against all protein sequences of species B
2. Remember only the best hits
3. Search with all protein sequences of species B against all protein sequences of species A
4. Remember only the best hits
5. All pairwise-best-hits are assumed to be orthologs
[Figure: all proteins (genes) of species A compared with all proteins (genes) of species B]
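The five steps above can be sketched in a few lines. The function names and the hit lists are hypothetical. The sketch assumes that each search result is available as (query, subject, score) tuples and that a higher score means a better hit, as with BLAST bit scores.

```python
def best_hits(hits):
    """Steps 2 and 4: keep only the best-scoring subject for every query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def pairwise_best_hits(a_vs_b, b_vs_a):
    """Step 5: pairs that are each other's best hit in both directions."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented search results for species A against B (step 1) and B against A (step 3).
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 120)]
b_vs_a = [("b1", "a1", 198), ("b1", "a3", 110), ("b2", "a1", 140), ("b2", "a2", 95)]

print(pairwise_best_hits(a_vs_b, b_vs_a))
```

Here a2 has b2 as its best hit, but the best hit of b2 is a1, so the pair (a2, b2) is not reported. Only (a1, b1) is a pairwise best hit.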
Gene family databases - InParanoid
• http://inparanoid.cgb.ki.se/
• clustering software available
• after determination of main-orthologs, inparalogs are added to
the groups
• inparalogs: duplicated after speciation event
• outparalogs: speciation event after duplication
• uses BLAST
• K. O’Brien et al., NAR, 2005, 33 (DB)
Gene family databases - COGs
• http://www.ncbi.nlm.nih.gov/COG/
• Clusters of Orthologous Groups of proteins
• based on all-against-all sequence search
• proteins form a COG if pairwise-best-hits exist for at least 3 species
• manual post-processing (alignments, trees) of COGs to split COGs of multi-domain-proteins
• R.L. Tatusov et al., 1997, Science, 278
[Figure labels: Species A, Species B, Species C]
Biological databases in general
• the first issue of Nucleic Acids Research (NAR) every year is the database issue
• in 2006 this database collection covered 858 databases