Download PPT - Bioinformatics.ca

Protein Subcellular Localization
Shan Sundararaj
University of Alberta
Edmonton, AB
Lecture 4.0
Why is Localization Important?
• Function is dependent on context
• Co-localization of proteins of related function
• Valuable annotation for new proteins
• Design of proteins with specific targets
• Drug targeting
– Accessibility: membrane-bound > cytoplasmic > nuclear
Why is Localization Important?
• 1974 Nobel Prize in Physiology/Medicine
– George Palade
• “for discoveries concerning the structural and
functional organization of the cell”
• 1999 Nobel Prize in Physiology/Medicine
– Günter Blobel
• “for the discovery that proteins have intrinsic
signals that govern their transport and localization
in the cell”
Bacteria
Gram Positive (3-4 states): cytoplasm, cytoplasmic membrane, cell wall, extracellular
Gram Negative (5 states): cytoplasm, cytoplasmic membrane, periplasm, outer membrane, extracellular
Eukaryotic Cell
• Compartmentalized
• Diverse range of specific organelles:
– Plants: chloroplasts, chromoplasts, other plastids
– Muscle: sarcoplasm
– Various endosomes, vesicles
(modified from Voet & Voet, Biochemistry; Wiley-VCH 1992)
Yet more categories…
[Figure panels: Chloroplast; Mitochondrion; Yeast “specific”]
Level of Annotation
• As simple as two states:
– membrane protein vs. non-membrane protein
– secreted protein vs. non-secreted protein
• Gross compartments:
– cytoplasm, inner membrane, periplasm, cell wall, outer membrane, extracellular
– nucleus, mitochondria, peroxisome, vacuole…
• Fine compartments:
– Mitochondrial matrix, bud neck, spindle pole…
– Any of 1425 GO cellular compartments
Localization signaling
• Proteins must have intrinsic signals for their localization – a cellular address
– E.g. N-terminal signal sequences
[Figure: envelope addressed to “321 Nuclear Inner Membrane Lane, Nucleus, Intracellular county, Eukaryotic Cell, CL34V3M3”]
Localization signaling
• Some signals are easily recognizable
– Signal peptidase cleavage site, consensus sequence for secretion → extracellular
– Address printed neatly, postal code
• Others are difficult to understand
– Outer membrane β-barrel proteins, no consensus sequence, few sequence restraints
– Sloppy address, different kind of code that we don’t understand yet
Experimental determination
• Since we don’t fully understand the language of proteins, our knowledge must often come from inference
– Predicting localization is like sorting mail based only on examples of where some mail has gone before
• Important to have good data sets of proteins with known localizations
Datasets
• Organelle_DB (http://organelledb.lsi.umich.edu/)
– 25095 eukaryotic proteins from subcellular proteomics studies
• DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/~guotao/download.html)
– Combines SwissProt and PIR annotations (64051 proteins)
• PSORTDB (http://db.psort.org/)
– Bacterial. 1591 Gram –ve proteins, 574 Gram +ve proteins
• SignalP (http://www.cbs.dtu.dk/ftp/signalp/)
– 940 plant and 2738 human proteins
• YPL (http://bioinfo.mbb.yale.edu/genome/localize/)
– 2956 yeast proteins
Experimental Methods
• Electron microscopy
• GFP tagging / fluorescence microscopy
• Subcellular fractionation + detection
– Western blotting
– Mass spectrometry
Electron Microscopy
• Highest resolution, can work at the level of a single protein complex
• Immunolabel proteins of interest in conjunction with colloidal gold, and visualize
• Combined with electron tomography, can even visualize unlabeled complexes
(from Koster and Klumperman, Nat Rev Mol Cell Biol, Sep 2003, S6-10)
Fluorescence Microscopy
• Tag gene at either 3’ or 5’ end
– Using GFP (or RFP, YFP, CFP, etc.)
– Using an epitope tag and a fluorescently labeled antibody
– Be careful not to remove signal peptides!
• Also use a subcellular-specific marker or stain
• Visualize with confocal fluorescence microscopy and analyze images for colocalization
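One common way to put a number on colocalization is the Pearson correlation between the tagged-protein channel and the compartment-marker channel. The sketch below is illustrative only: the function name and the pixel intensities are made up, and real analyses normally work on background-corrected images rather than short lists.

```python
import math

def pearson_colocalization(channel_a, channel_b):
    """Pearson correlation between two equally sized lists of pixel intensities."""
    if len(channel_a) != len(channel_b) or not channel_a:
        raise ValueError("channels must be non-empty and the same length")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant channel has no defined correlation")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical flattened pixel intensities (illustrative values only).
gfp_channel = [10, 200, 180, 15, 12, 220]
marker_channel = [12, 190, 170, 20, 10, 210]
r = pearson_colocalization(gfp_channel, marker_channel)
print(f"Pearson r = {r:.3f}")
```

Values near +1 suggest the two signals rise and fall together; values near 0 suggest no linear relationship.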
Specific co-labeling (yeast)
• Early Golgi: Cop1
• Endosome: Snf7
• ER to Golgi: Sec13
• Golgi apparatus: Anp1
• Late Golgi: Chc1
• Lipid particle: Erg6
• Mitochondrion: MitoTracker
• Nucleus: DAPI
• Nucleolus: Sik1
• Nuclear periphery: Nic96
• Peroxisome: Pex3
• Vacuole: FM4-64
[Figure: nuclear-specific DAPI staining]
Subcellular Fractionation
Tissue homogenate
→ 1000 g. Pellet: unbroken cells, nuclei, chloroplasts. Transfer supernatant.
→ 10,000 g. Pellet: mitochondria. Transfer supernatant.
→ 100,000 g. Pellet: microsomal fraction (ER, Golgi, lysosomes, peroxisomes). Supernatant: cytosol, soluble enzymes.
Detergent Fractionation
Cells
→ Extraction with Digitonin/EDTA. Supernatant: cytoplasmic fraction. Continue with pellet.
→ Extraction with Triton X-100/EDTA. Supernatant: organelle membranes. Continue with pellet.
→ Extraction with SDS/EDTA. Supernatant: nuclear fraction. Pellet: cytoskeletal (in SDS).
Fractionation → Identification
• Once fractionated, take compartment of interest and separate proteins
– 2D gel or chromatography
• Identify separated proteins
– Mass spectrometry for high-throughput
– Western blot for specific proteins
Fractionation in proteomics
High-Throughput Experiments
• Kumar et al., Genes Dev 2002, 16:707-719
– Epitope-tagged >60% of ORFs, visualized with fluorescently labeled antibody
– 2744 localizations (44% of S. cerevisiae genes)
• Huh et al., Nature 2003, 425:686-691
– GFP tagged all ORFs, RFP tagged compartments
– 4156 localizations (75% of S. cerevisiae genes)
• Combined, now nearly 87% of yeast proteins have a localization annotation
High-Throughput Experiments
• Lopez-Campistrous et al., Mol Cell Proteomics, 2005
– Subcellular fractionation of E. coli, 2D-gel separation, MS-MS
– 2,160 localizations to cytoplasm, inner membrane, periplasm, and outer membrane
Predictions from known data
• Enough experimental data exists to build highly accurate computational predictors of localization
Predictions from known data
• Different information used for predictions:
1) Sequence motifs
- N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
- C-terminal: peroxisome import signal, ER retention signal
- Mid-sequence: nuclear localization signals
2) Amino acid composition
- AA frequency, dipeptide composition.
3) Homology
- Sequence comparison to proteins of known localization
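As a rough illustration of item 2, amino acid and dipeptide composition can be computed directly from a sequence and used as numeric features. This is a minimal sketch: the function names are hypothetical, and the example sequence is the short fragment shown later in the Proteome Analyst slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each observed dipeptide (overlapping pairs of standard residues)."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence is too short for dipeptide composition")
    counts = Counter(pairs)
    return {p: c / len(pairs) for p, c in counts.items()}

composition = aa_composition("MAKSATIVTL")
dipeptides = dipeptide_composition("MAKSATIVTL")
print(composition["A"], len(dipeptides))
```

The dipeptide dictionary is kept sparse here; a classifier would usually expand it to a fixed 400-element vector.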
N-terminal signal peptides
• Common structure of signal peptides:
– positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.
Feature | Eukaryotes | Prokaryotes: Gram-negative | Prokaryotes: Gram-positive
Total length (avg) | 22.6 aa | 25.1 aa | 32.0 aa
n-regions | only slightly Arg-rich | Lys+Arg-rich | Lys+Arg-rich
h-regions | short, very hydrophobic | slightly longer, less hydrophobic | very long, less hydrophobic
c-regions | short, no pattern | short, Ser+Ala-rich | longer, Pro+Thr-rich
-3,-1 positions | small, neutral residues | almost exclusively Ala | almost exclusively Ala
+1 to +5 region | no pattern | rich in Ala, Asp/Glu, and Ser/Thr | rich in Ala, Asp/Glu, and Ser/Thr
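A crude way to look for a candidate h-region is to slide a window along the N-terminus and pick the stretch with the highest mean hydrophobicity. The sketch below uses the Kyte-Doolittle hydropathy scale, which is a choice made here for illustration and not something the slides specify. The window size, the 35-residue N-terminal cutoff, and the example sequence are also assumptions.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def most_hydrophobic_window(seq, window=8, n_term=35):
    """Return (start, mean_hydropathy) of the most hydrophobic N-terminal window."""
    region = seq.upper()[:n_term]
    if len(region) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(region) - window + 1):
        segment = region[start:start + window]
        # Unknown residue codes contribute 0.0 (an arbitrary neutral choice).
        score = sum(KYTE_DOOLITTLE.get(a, 0.0) for a in segment) / window
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score

# Illustrative sequence with a signal-peptide-like N-terminus.
start, score = most_hydrophobic_window("MKKTAIAIAVALAGFATVAQAAPKDNT")
print(start, round(score, 3))
```

This only flags a hydrophobic stretch; it does not model the n-region, the c-region, or the cleavage site.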
N-terminal signal peptides
More work to do
• Multiple bacterial secretion pathways
• C-terminal signal peptides
• Internal mitochondrial transit peptides
• Structural aspects of targeting
• Gene re-localization
• Still a lot to discover in how signaling works!
Computational methods for predicting localization
• Expert rule based methods
• Artificial Neural Nets (ANN)
• Hidden Markov Models (HMM)
• Naïve Bayes (NB)
• Support Vector Machines (SVM)
• Combination of above methods
Naïve Bayes
• Assumption:
– Features are conditionally independent, given class labels
• Structure:
– 1 level tree
– Class labels: root
– Features: leaf nodes
[Figure: root node C with leaf nodes F1, F2, …, F7]
• Prediction:
– $\mathrm{class}(f) = \arg\max_c P(C=c)\, P(F=f \mid C=c)$
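The prediction rule can be sketched in a few lines. Under the independence assumption, P(F=f | C=c) factors into a product over individual features, which the code below evaluates in log space. The class names, feature names, and all probabilities are made-up illustrative values, not trained parameters.

```python
import math

def naive_bayes_predict(features, priors, likelihoods):
    """Return the class c maximizing P(C=c) * prod_i P(F_i=f_i | C=c)."""
    best_class, best_log_p = None, float("-inf")
    for label, prior in priors.items():
        log_p = math.log(prior)
        for name, value in features.items():
            # Assumes every probability is non-zero (e.g. after smoothing).
            log_p += math.log(likelihoods[label][name][value])
        if log_p > best_log_p:
            best_class, best_log_p = label, log_p
    return best_class

# Hypothetical priors and per-feature conditional probabilities.
priors = {"cytoplasmic": 0.6, "secreted": 0.4}
likelihoods = {
    "cytoplasmic": {"signal_peptide": {True: 0.05, False: 0.95},
                    "kw_signal": {True: 0.10, False: 0.90}},
    "secreted": {"signal_peptide": {True: 0.90, False: 0.10},
                 "kw_signal": {True: 0.80, False: 0.20}},
}

prediction = naive_bayes_predict(
    {"signal_peptide": True, "kw_signal": True}, priors, likelihoods
)
print(prediction)
```

Working with logarithms avoids numerical underflow when many features are multiplied together.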
Artificial Neural Network
• Excellent for modeling nonlinear input/output relationships
• Robust to noise in training data
• Widely used in bioinformatics
[Figure: network with Input, Hidden, and Output layers]
Support Vector Machines
• Input vectors are separated into positive vs. negative instances
• Map to new feature space
• Find hyperplane that best separates the two classes by distance (largest margin)
[Figure: hyperplane w.x + b = 0 with normal vector w; half-space w.x + b > 0 is class +1, half-space w.x + b < 0 is class -1]
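Once a hyperplane has been found, classifying a new point only requires checking which half-space it falls in. The sketch below shows that decision rule alone, not the training step that finds w and b. The weights and points are hypothetical.

```python
def svm_decision(w, x, b):
    """Return +1 if w.x + b > 0, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Points exactly on the hyperplane (score == 0) are assigned -1 here by convention.
    return 1 if score > 0 else -1

# Hypothetical 2-D hyperplane and two query points.
w, b = [1.0, -1.0], 0.5
print(svm_decision(w, [2.0, 1.0], b))   # w.x + b = 1.5
print(svm_decision(w, [0.0, 2.0], b))   # w.x + b = -1.5
```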
Evaluating Predictors - Precision
            Predicted +   Predicted -
True +      TP            FN
True -      FP            TN
• # of proteins correctly labeled as “cyt” divided by the total # of proteins labeled as “cyt”
• How often the label is correct
• If there are 90 proteins correctly labeled as “cyt”, and 10 proteins incorrectly labeled as “cyt”, then the precision is 90/100 = 0.90.
Evaluating Predictors - Sensitivity
            Predicted +   Predicted -
True +      TP            FN
True -      FP            TN
• # of proteins correctly labeled as cytoplasmic divided by the total # of proteins that are cytoplasmic
• “How many of the true results were retrieved” (also called “recall” or “true positive rate”; not the same as overall accuracy)
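Both measures come straight from the confusion-matrix counts: precision = TP / (TP + FP) and sensitivity (recall) = TP / (TP + FN). The sketch below reuses the 90 correct and 10 incorrect “cyt” labels from the precision example. The count of 30 missed cytoplasmic proteins is an invented number for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that were retrieved (sensitivity)."""
    return tp / (tp + fn)

tp, fp = 90, 10   # from the precision example
fn = 30           # hypothetical: cytoplasmic proteins given some other label

print(precision(tp, fp))
print(recall(tp, fn))
```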
Predictions from known data
• Different information used for predictions:
1) Sequence motifs
- N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
- C-terminal: peroxisome import signal, ER retention signal
- Mid-sequence: nuclear localization signals
2) Amino acid composition
- AA frequency, dipeptide composition, hydrophobicity
3) Homology
- Sequence comparison to proteins of known localization
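The C-terminal motifs in item 1 are the simplest to check by string matching. The sketch below tests only the canonical forms, a C-terminal SKL tripeptide for peroxisomal import and a C-terminal KDEL for ER retention. Real signals have variants that this ignores, and the example sequences are made up.

```python
# Canonical C-terminal motifs only; known variants are deliberately left out.
C_TERMINAL_MOTIFS = {
    "SKL": "peroxisome import signal",
    "KDEL": "ER retention signal",
}

def c_terminal_signals(seq):
    """Return labels of canonical C-terminal motifs found at the end of seq."""
    seq = seq.upper().rstrip("*")  # drop a trailing stop-codon marker if present
    return [label for motif, label in C_TERMINAL_MOTIFS.items() if seq.endswith(motif)]

# Hypothetical sequences for illustration.
print(c_terminal_signals("MAAAGGSKL"))
print(c_terminal_signals("MKTLLAKDEL"))
print(c_terminal_signals("MAKSATIVTL"))
```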
TargetP, SignalP, *P
http://www.cbs.dtu.dk/services/
Sequence-based methods
• TargetP (85-90% recall)
– Predicts mitochondria/chloroplast/secreted
– Contains SignalP and ChloroP
• LipoP
– lipoproteins and signal peptides in Gram negative bacteria
• SecretomeP
– non-classical secretion in eukaryotes
SignalP result
• Common structure of signal peptides:
– positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.
Cleavage site
Prediction: Signal peptide
Signal peptide probability: 0.945
Signal anchor probability: 0.000
Max cleavage site probability: 0.723 between pos. 28 and 29
Organellar Prediction
• Predotar (http://www.inra.fr/predotar/) (80% recall)
– Mitochondrial and plastid sequences; N-terminal sequences
• MitoPred (http://mitopred.sdsc.edu/) (82% recall)
– Mitochondrial; PFAM domains, AA composition
• MitoProteome (http://www.mitoproteome.org/)
– Database of experimentally determined human mitochondrial proteins
• MitoP (http://ihg.gsf.de/mitop2/)
– Combines data from multiple experimental and computational sources to give a consensus score for each “mitochondrial” protein in yeast and human
The PSORT Family
• PSORT – plant sequences
– Expert rule-based system
• PSORT II – eukaryotic sequences
– Probabilistic tree
• iPSORT – eukaryotic N-term. signal sequences
– ANN
• PSORT-B – bacterial sequences
• WoLF PSORT – eukaryotic
– Updated (2005) version of PSORT II
PSORT-B
http://www.psort.org/psortb/
PSORT-B - methods
• Signal peptides: Non-cytoplasmic
• AA composition/patterns
– SVMs trained for each location vs. all other locations
• Transmembrane helices: Inner membrane
– HMMTOP
• PROSITE motifs: all localizations
• Outer membrane motifs: Outer membrane
• Homology to proteins of known localization
– SCL-BLAST
• Integration with a Bayesian network
PSORT-B results
SeqID: Unannotated_bacterial2
Analysis Report:
  CMSVM       Unknown      [No details]
  CytoSVM     Cytoplasmic  [No details]
  ECSVM       Unknown      [No details]
  HMMTOP      Unknown      [No internal helices found]
  Motif       Unknown      [No motifs found]
  OMPMotif    Unknown      [No motifs found]
  OMSVM       Unknown      [No details]
  PPSVM       Unknown      [No details]
  Profile     Unknown      [No matches to profiles found]
  SCL-BLAST   Cytoplasmic  [matched 118438: Cyto. protein]
  SCL-BLASTe  Unknown      [No matches against database]
  Signal      Unknown      [No signal peptide detected]
Localization Scores:
  Cytoplasmic          9.97
  CytoplasmicMembrane  0.01
  Periplasmic          0.01
  OuterMembrane        0.00
  Extracellular        0.00
Final Prediction:
  Cytoplasmic          9.97
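When many reports like this need to be processed, the score block can be pulled out with a small parser. The sketch below assumes a plain-text layout with one “name score” pair per line between the “Localization Scores:” and “Final Prediction:” headings. That layout is inferred from this slide, not taken from a documented PSORT-B output format.

```python
def parse_localization_scores(report):
    """Extract {localization: score} from the 'Localization Scores' block."""
    scores = {}
    in_scores = False
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Localization Scores"):
            in_scores = True
            continue
        if line.startswith("Final Prediction"):
            break
        parts = line.split()
        if in_scores and len(parts) == 2:
            scores[parts[0]] = float(parts[1])
    return scores

# Score block copied from the slide, in the assumed one-pair-per-line layout.
REPORT = """Localization Scores:
  Cytoplasmic          9.97
  CytoplasmicMembrane  0.01
  Periplasmic          0.01
  OuterMembrane        0.00
  Extracellular        0.00
Final Prediction:
  Cytoplasmic          9.97
"""

scores = parse_localization_scores(REPORT)
best = max(scores, key=scores.get)
print(best, scores[best])
```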
Proteome Analyst
http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
Proteome Analyst - Method
Training: Training Sequences → Machine Learning Algorithm → Classifier
>Extracellular<AFP1_BRANA…
MAKSATIVTL …
>Extracellular<AFP2_RAPSA…
ACRAGMEEP…
…
Prediction: Unknown Sequence → Classifier → Predicted Class
>?<Fly_01…
MDLRATSSND…
…
>Cytoplasm<Fly_01
…
Proteome Analyst - Feature Extraction
[Figure: Sequence (MAKSATIVTL …) → PSI-BLAST against Swiss-Prot → Homologs (>AFP1_ARATH, >AFP1_HUMAN, >AFP1_SINAL, …) → Features]
Proteome Analyst: Feature Extraction
• TOP 3 Homologs
– AFP1_ARATH
– AFP1_BRANA
– AFP2_ARATH
• KW
– Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid
• DR: InterPro
– IPR002118; IPR003614
• CC: Subcellular location
– Secreted
• Token Set:
{Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid; IPR002118; IPR003614; Secreted}
PASub - Results
[Figure: bar chart of the contribution of each token (feature), on a log scale]
PASub - Interpretation
• Bars represent -log probability, so a small difference in bar length is a large difference in probability!
• Naïve Bayes chosen as classifier because of transparency of method
– Each token gives a (log) probability that can be summed and shown graphically
– Neural network actually has higher recall
• Can change token set, ask to explain with different features
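The bar heights can be thought of as one -log probability per token, which add up to the -log of the product over all tokens. The sketch below illustrates that bookkeeping only. The per-token probabilities are invented, the floor for unseen tokens is an arbitrary choice, and this is not Proteome Analyst's actual code.

```python
import math

def token_contributions(tokens, token_probs, floor=1e-6):
    """Return {token: -log P(token | class)}; unseen tokens get a small floor probability."""
    return {t: -math.log(token_probs.get(t, floor)) for t in tokens}

# Hypothetical P(token | class = extracellular) values.
probs_extracellular = {"Signal": 0.5, "Secreted": 0.25, "IPR002118": 0.1}

bars = token_contributions(["Signal", "Secreted", "IPR002118"], probs_extracellular)
total = sum(bars.values())
for token, height in bars.items():
    print(f"{token}: {height:.3f}")
print(f"total: {total:.3f}")
```

Because the scale is logarithmic, a one-unit difference in bar height (natural log) corresponds to a factor of about 2.7 in probability.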
Save Time: Pre-computed Genomes
• PSORTDB
– http://db.psort.org
– Browse, search, BLAST, download
– 103 Gram –ve bacteria, 45 Gram +ve bacteria
• Proteome Analyst (PA-GOSUB)
– http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB/
– Browse, search, BLAST, download
– 15 bacterial and 8 eukaryotic