Protein Feature Identification

Protein and Proteome
Annotation
David Wishart
University of Alberta
Edmonton, AB
Annotating 2D Gels
[Figure: 2D gel; spots (p53, Trx, G6PDH) are excised with a gel punch and digested with trypsin]
Lecture 2.5
Is This Annotated?
[Figure: 2D gel with spots labelled p53, Trx and G6PDH]
Information
1) pI
2) MW
3) name (abbr)
4) accession #
5) relative amount
How About This?
Information
1) name (abbr)
2) accession #
3) relative amount
4) coexpressors
Is This Annotated?
>P12345 Sequence 1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
Protein Annotation
• Objective - identify and describe all the
physico-chemical, functional and structural
properties of a protein including its
sequence, accession #, mass, pI,
absorptivity, solubility, active sites, binding
sites, reactions, substrates, homologues,
function, name(s), abundance, location, 2°
structure, 3D structure, domains, pathways,
interacting partners
Protein vs. Proteome
Annotation
• Protein annotation is concerned with one or
a small number (<50) of proteins from one or
several types of organisms
• Proteome annotation is concerned with
entire proteomes (>2000 proteins) from a
specific organism (or for all organisms), hence the
need for speed
Different Levels of
Annotation
• Sparse – typical of many gel or microarray
annotations, usually just includes name
and accession number
• Moderate – typical of many sequence
databases or of experiments aimed at
identifying protein complexes or ligands
• Detailed – not typical (occasionally found
in organism-specific databases)
Different Levels of Database
Annotation
• GenBank (large # of sequences, minimal
annotation)
• PIR (large # of sequences, slightly better
annotation)
• SwissProt (small # of sequences, even
better annotation)
• Organism-specific DB (very small # of
sequences, best annotation)
GenBank Annotation
PIR Annotation
Swiss-Prot Annotation
CCDB Annotation
Ultimate Goal...
• To achieve the same level of
protein/proteome annotation as found in
CCDB for all genes/proteins – from 2D GE
data, from microarray data or for
sequence databases in general
How?
Annotation Methods
• Annotation by homology (BLAST)
– requires a large, well annotated
database of protein sequences
• Annotation by sequence composition
– simple statistical/mathematical methods
• Annotation by sequence features,
profiles or motifs
– requires sophisticated sequence
analysis tools
Annotation by Homology
• Statistically significant sequence matches
identified by BLAST searches against
GenBank (nr), SWISS-PROT, PIR, ProDom,
BLOCKS, KEGG, WIT, Brenda, BIND
• Properties or annotation inferred by name,
keywords, features, comments
Databases Are Key
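As a rough illustration of how annotation can be inferred from significant matches, the sketch below keeps only hits under an E-value cutoff and copies the name and keywords of the best one. The hit records, the cutoff of 1e-5 and the function name `transfer_annotation` are all hypothetical; they are not output from BLAST or from any of the databases listed here.

```python
from typing import Optional

def transfer_annotation(hits: list[dict], max_evalue: float = 1e-5) -> Optional[dict]:
    """Return the annotation of the most significant hit, or None.

    Each hit is a dict with 'id', 'evalue', 'name' and 'keywords' keys
    (a hypothetical, simplified stand-in for a parsed search report).
    """
    significant = [h for h in hits if h["evalue"] <= max_evalue]
    if not significant:
        return None  # nothing significant: leave the protein unannotated
    best = min(significant, key=lambda h: h["evalue"])
    return {"source": best["id"], "name": best["name"],
            "keywords": list(best["keywords"])}

# Made-up hits, for illustration only.
example_hits = [
    {"id": "hitA", "evalue": 3e-12, "name": "thioredoxin",
     "keywords": ["redox-active center"]},
    {"id": "hitB", "evalue": 0.8, "name": "hypothetical protein",
     "keywords": []},
]
annotation = transfer_annotation(example_hits)
```

A real pipeline would also need to weigh alignment coverage and the quality of the source annotation, which is why well-annotated databases matter.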
Sequence Databases
• GenBank
– www.ncbi.nlm.nih.gov/
• EMBL/trEMBL
– www.ebi.ac.uk/trembl/
• DDBJ
– www.nig.ac.jp/
• PIR
– http://pir.georgetown.edu/
• SwissProt
– www.expasy.ch/sprot/
• UniProt
– http://www.pir.uniprot.org/
Structure Databases
• RCSB-PDB
– http://www.rcsb.org/pdb/
• MSD
– http://www.ebi.ac.uk/msd/index.html
• CATH
– http://www.biochem.ucl.ac.uk/bsm/cath/
• SCOP
– http://scop.mrc-lmb.cam.ac.uk/scop/
Expression Databases
• Swiss 2D Page
– http://ca.expasy.org/ch2d/
• SMD
– http://genome-www5.stanford.edu/MicroArray/SMD/
• ArrayExpress
– http://www.ebi.ac.uk/arrayexpress/
• Gene Expr. Omnibus
– http://www.ncbi.nlm.nih.gov/geo/
Metabolism Databases
• KEGG
– http://www.genome.ad.jp/kegg/metabolism.html
• Roche/Boehringer
– http://www.expasy.org/cgi-bin/search-biochem-index
• EcoCyc
– www.ecocyc.org/
• MetaCyc
– http://metacyc.org/
Interaction Databases
• BIND
– http://www.blueprint.org/bind/bind.php
• DIP
– http://dip.doe-mbi.ucla.edu/
• MINT
– http://mint.bio.uniroma2.it/mint/
• IntAct
– http://www.ebi.ac.uk/intact/index.html
Bibliographic Databases
• PubMed Medline
– http://www.ncbi.nlm.nih.gov/PubMed/
• Science Citation Index
– http://isi4.isiknowledge.com/portal.cgi
• Your Local eLibrary
– www.XXXX.ca
• Current Contents
– http://www.isinet.com/isi/
Annotation by Homology
An Example
• 76 residue protein from Methanobacterium
thermoautotrophicum (newly sequenced)
• What does it do?
• MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEF
EKIKEMDQILEAGLTALPGLAVDGELKIMGRVAS
KEEIKKILS
PSI-BLAST
Select Database
Conclusions
• Protein is a thioredoxin or glutaredoxin
(function, family)
• Protein has thioredoxin fold (2° and 3D
structure)
• Active site is from residues 11-14 (active
site location)
• Protein is soluble, cytoplasmic (cellular
location)
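The active-site conclusion can be cross-checked directly on the query. Thioredoxin-like proteins commonly carry a Cys-x-x-Cys active-site motif, so a minimal sketch is to scan the example sequence for that pattern. The regular expression and the function name `find_cxxc` are illustrative choices, not part of the PSI-BLAST output.

```python
import re

# Query sequence as transcribed in the example (line breaks removed).
QUERY = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEF"
         "EKIKEMDQILEAGLTALPGLAVDGELKIMGRVAS"
         "KEEIKKILS")

def find_cxxc(sequence: str) -> list[tuple[int, int, str]]:
    """Return (start, end, motif) for every C-x-x-C match, 1-based inclusive."""
    # A lookahead is used so that overlapping matches are also reported.
    return [(m.start() + 1, m.start() + 4, m.group(1))
            for m in re.finditer(r"(?=(C..C))", sequence)]

matches = find_cxxc(QUERY)
print(matches)  # [(11, 14, 'CANC')]
```

The single match at residues 11-14 agrees with the active-site location stated above.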
Annotation Methods
• Annotation by homology (BLAST)
– requires a large, well annotated
database of protein sequences
• Annotation by sequence composition
– simple statistical/mathematical methods
• Annotation by sequence features,
profiles or motifs
– requires sophisticated sequence
analysis tools
Annotation by Composition
• Molecular Weight
• Isoelectric Point
• UV Absorptivity
• Hydrophobicity
Where To Go
Isoelectric Point
• The pH at which a protein has a net charge = 0
• $Q = \sum_i N_i / (1 + 10^{pH - pK_i})$
pKa Values for Ionizable Amino Acids
Residue   pKa      Residue   pKa
C         10.28    H         6
D         3.65     K         10.53
E         4.25     R         12.43
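A minimal sketch of a pI estimate from these pKa values follows. It rests on several assumptions that are not on the slide: each basic side chain (H, K, R) contributes +1/(1 + 10^(pH - pK)), each acidic one (C, D, E) contributes -1/(1 + 10^(pK - pH)), the N- and C-termini and tyrosine are ignored, and the zero of the net charge is found by bisection between pH 0 and 14. The function names are illustrative.

```python
# Side-chain pKa values from the table above.
POSITIVE_PKA = {"H": 6.0, "K": 10.53, "R": 12.43}  # +1 when protonated
NEGATIVE_PKA = {"C": 10.28, "D": 3.65, "E": 4.25}  # -1 when deprotonated

def net_charge(sequence: str, ph: float) -> float:
    """Net side-chain charge at a given pH (termini ignored, an assumption)."""
    charge = 0.0
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge

def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisect on pH; the net charge decreases monotonically as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0

pi_kd = isoelectric_point("KD")  # one Lys, one Asp: midway between the two pKa values
```

For a sequence with one K and one D the result falls midway between 10.53 and 3.65, which is a quick sanity check on the signs. A sequence with no ionizable side chains has no meaningful pI under these assumptions.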
UV Absorptivity
• OD280 = (5690 x #W + 1280 x #Y)/MW x Conc.
• Conc. = OD280 x MW/(5690 x #W + 1280 x #Y)
[Figure: chemical structures of tryptophan (W) and tyrosine (Y)]
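The two formulas can be applied as in the sketch below. The coefficients are the ones given on the slide. The units are an assumption: with MW in Da and a 1 cm path length, the concentration comes out in mg/mL. The function names are illustrative.

```python
def extinction_280(sequence: str) -> int:
    """Molar absorptivity at 280 nm from the W and Y counts (slide coefficients)."""
    return 5690 * sequence.count("W") + 1280 * sequence.count("Y")

def concentration_from_od280(od280: float, sequence: str, mw: float) -> float:
    """Conc. = OD280 x MW / (5690 x #W + 1280 x #Y)."""
    epsilon = extinction_280(sequence)
    if epsilon == 0:
        # Without W or Y the formula divides by zero.
        raise ValueError("no W or Y residues: OD280 cannot be used")
    return od280 * mw / epsilon

# Toy peptide with two Trp and one Tyr; the MW value is made up for illustration.
conc = concentration_from_od280(0.5, "WWY", 12660.0)
```

Proteins without tryptophan or tyrosine raise an error here, since the method does not apply to them.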
Hydrophobicity
• Indicates Solubility
• Indicates Stability
• Indicates Location (membrane or cytoplasm)
• Indicates Globularity or tendency to form spherical structure
Kyte / Doolittle Hydrophobicity Scale
Residue   Hphob    Residue   Hphob
A          1.8     M          1.9
C          2.5     N         -3.5
D         -3.5     P         -1.6
E         -3.5     Q         -3.5
F          2.8     R         -4.5
G         -0.4     S         -0.8
H         -3.2     T         -0.7
I          4.5     V          4.2
K         -3.9     W         -0.9
L          3.8     Y         -1.3
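One common way to use this scale is to average it over the whole sequence or over a sliding window. The sketch below does both. The window length of 9 and the function names are assumptions made for illustration; the slide does not specify them.

```python
# Kyte / Doolittle values from the table above.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

def mean_hydropathy(sequence: str) -> float:
    """Average hydropathy of a non-empty sequence of standard residues."""
    return sum(KYTE_DOOLITTLE[r] for r in sequence) / len(sequence)

def hydropathy_profile(sequence: str, window: int = 9) -> list[float]:
    """Mean hydropathy of every full window along the sequence."""
    return [mean_hydropathy(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]

profile = hydropathy_profile("MKIAILLLVVAAGLDEKRS", window=9)
```

Strongly positive stretches in the profile hint at membrane-spanning or buried regions, and strongly negative ones at exposed, soluble regions.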
Annotation Methods
• Annotation by homology (BLAST)
– requires a large, well annotated
database of protein sequences
• Annotation by sequence composition
– simple statistical/mathematical methods
• Annotation by sequence features,
profiles or motifs
– requires sophisticated sequence
analysis tools
Where To Go
Sequence Feature Databases
• PROSITE - http://www.expasy.ch/
• BLOCKS - http://blocks.fhcrc.org/
• DOMO - http://www.infobiogen.fr/services/domo/
• PFAM - http://pfam.wustl.edu
• PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
• SEQSITE - PepTool
What Can Be Predicted?
• O-Glycosylation Sites
• Phosphorylation Sites
• Protease Cut Sites
• Nuclear Targeting Sites
• Mitochondrial Targeting Sites
• Chloroplast Targeting Sites
• Signal Sequences
• Signal Sequence Cleavage
• Peroxisome Targeting Sites
• ER Targeting Sites
• Transmembrane Sites
• Tyrosine Sulfation Sites
• GPI Anchor Sites
• PEST Sites
• Coiled-Coil Sites
• T-Cell/MHC Epitopes
• Protein Lifetime
• A whole lot more…
Cutting Edge Sequence
Feature Servers
• Membrane Helix Prediction
– http://www.cbs.dtu.dk/services/TMHMM-2.0/
• T-Cell Epitope Prediction
– http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
• O-Glycosylation Prediction
– http://www.cbs.dtu.dk/services/NetOGlyc/
• Phosphorylation Prediction
– http://www.cbs.dtu.dk/services/NetPhos/
• Protein Localization Prediction
– http://psort.nibb.ac.jp/
Subcellular Localization
http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
Proteome Analyst (SubCell)
2° Structure Prediction
• PredictProtein-PHD (72%)
– http://cubic.bioc.columbia.edu/predictprotein/
• Jpred (73-75%)
– http://www.compbio.dundee.ac.uk/~www-jpred/submit.html
• SAM-T02 (75%)
– http://www.cse.ucsc.edu/research/compbio/HMMapps/T02-query.html
• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
Putting It All Together
[Figure: homology, composition and sequence motifs combine to give an annotated protein]
Putting It All Together
• PEDANT
– http://pedant.gsf.de/
• GeneQuiz
– http://jura.ebi.ac.uk:8765/ext-genequiz/
• Magpie
– http://magpie.ucalgary.ca/
• Proteome Analyst
– http://www.cs.ualberta.ca/~bioinfo/PA/
Programs Used By Pedant
• HMMER
• PSORT
• PREDATOR
• COILS
• FGENESH++
• pI
• PROSEARCH
• TargetP
• SAPS
• NCBI-BLAST
• SEG
• InterProScan
• SignalP
• TMHMM
• tRNAscan-SE
• GENSCAN
Databases Used By Pedant
• EMBL
• Blocks
• PIR-PSD
• PDB
• SWISS-PROT
• SCOP
• Functional Cat
• COGs
• PROSITE
• Pfam
• TrEMBL
• STRIDE
http://jura.ebi.ac.uk:8765/gqsrv/submit
GeneQuiz Functions
• Amino acid biosynthesis
• Biosynthesis of cofactors, prosthetic groups, & carriers
• Cell envelope
• Cellular processes
• Central intermediary metabolism
• Energy metabolism
• Fatty acid and phospholipid metabolism
• Other categories
• Purines, pyrimidines, nucleosides, and nucleotides
• Regulatory functions
• Replication
• Transcription
• Translation
• Transport and binding proteins
• Unknown
Home Page
Proteome Analyst
• Uses PSI-BLAST, PSI-PRED and motif
analysis tools
• Extracts keyword information from
homologues and uses Naïve Bayes
classifiers to infer function
• Combines sequence motif and sequence
profile information to complete functional
classification
• Supports custom classifier/ontology
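To make the keyword-based classification step concrete, here is a toy Naïve Bayes classifier over keyword lists, with add-one smoothing. It is only a sketch of the general technique, not Proteome Analyst's implementation. The training keywords are invented, and the class labels are borrowed from the GeneQuiz function list purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples: list[tuple[list[str], str]]):
    """examples: (keywords, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for keywords, label in examples:
        label_counts[label] += 1
        word_counts[label].update(keywords)
        vocabulary.update(keywords)
    return label_counts, word_counts, vocabulary

def predict(model, keywords: list[str]) -> str:
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in keywords:
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: keywords as they might be pulled from homologues.
training = [
    (["kinase", "phosphorylation", "atp-binding"], "Regulatory functions"),
    (["ribosomal", "rrna-binding"], "Translation"),
    (["permease", "membrane", "transport"], "Transport and binding proteins"),
]
model = train(training)
label = predict(model, ["membrane", "transport"])
```

With this toy data the query keywords select "Transport and binding proteins". A real classifier would be trained on many annotated proteins per class.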
BacMap
• Picking up where we left off with the
CCDB… (Google “bacmap”)
• The idea is to generate a visual atlas of all
bacterial chromosomes and plasmids (not
just Escherichia coli), with links
to extensive genome annotation
• Attempt to re-use annotation and graphing
tools originally developed for the CCDB
http://wishart.biology.ualberta.ca/BacMap/
Text Search Tools
Sequence Search Tools
Bacterial Biography Card
Genome Statistics
Proteome Statistics
BacMap
• Each genome has a short description of the
organism and sequence data
• Supports zoomable, hyperlinked, clickable
map views of the genome
• Supports text search of gene names, protein
names and synonyms
• Supports BLAST search and supplies
genome-wide stats
• Currently undergoing a major update
Stothard P, et al. BacMap: an interactive picture atlas of annotated bacterial genomes.
Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D317-20.
What if Your Organism or
Genome isn’t in BacMap?
http://wishart.biology.ualberta.ca/basys/
BASys
• Bacterial Annotation System
• A publicly available web server that
performs automated annotation of
bacterial genomes given only the gene
sequence of a chromosome or plasmid
• Takes about 24 hrs for an average genome
(4 megabases)
• Output includes images and annotation
text (about 70 fields for each gene)
Typical BASys Result
Conclusion
• Genome annotation is the same as proteome
annotation – required after any gene
sequencing and gene ID effort
• Can be done either manually or automatically
• Need for high throughput, automated
“pipelines” to keep up with the volume of
genome sequence data
• Area of active research and development with
about ½ of all bioinformaticians working on
some aspect of this process