Bioinformatics Approaches to
Identifying Candidate Effector
Molecules of S. typhimurium
Matthew Sylvester
12/1/03
Endocytic Trafficking
[Diagram: the Salmonella-containing vacuole and the lysosome, with SPI-1, SPI-2, and "?" marked. Labeled host markers: Lamp-1, H+ ATPase, Cathepsins, Transferrin R, Man. 6-PR. Bacterial effector proteins shown: SseJ, SifA, SseXs, and several others.]
Selection of S. typhimurium
Proteins
• Salmonella effectors are secreted into the host cell via
either the Salmonella pathogenicity island 1 (SPI1) or SPI2
type three secretion system (TTSS)
• We chose only those proteins shown experimentally in the
literature to go out through one or both of these systems
(see PubMed at http://ncbi.nlm.nih.gov)
• The seventeen identified SPI1 and SPI2-associated
effectors were considered as one group for subsequent
analysis
• As the N-terminal 150 amino acids have been shown to
contain conserved sequences for several SPI2 effectors, we
compared this region (Miao and Miller, 2000)
Alignment of SPI-2 Effector
Proteins
Miao E and Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium.
PNAS (2000) 97(13), pp. 7539-7544.
Published alignment of known and putative SPI2 effectors identified
by a BLAST (Basic Local Alignment Search Tool) search and then
aligned using ClustalW. Note the presence of the WEK(I/M)XXFF
motif from approx. aa 31-38.
BLAST
• Tries to find the most “similar” proteins
• Compares a query to sequences in a database and each
comparison is given a score (higher scores are more
similar)
• Scoring matrices (substitution-based) are used to assign a
score based on the probability of each residue substitution
• Gap penalties are negative scores
• The alignment score is the sum of scores at each position
• Significance of the overall alignment is given by a p-value or an E-value
– E-value = expectation value: the number of different
alignments with scores equivalent to or better than S
that are expected to occur in a database search by
chance. The lower the E-value, the more significant the
score.
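The scoring scheme in the bullets above can be sketched in a few lines of Python. This is only an illustration: the substitution scores and the gap penalty below are made-up values rather than entries from a real matrix, the function name is hypothetical, and BLAST itself uses affine gap costs and heuristics that are not modeled here.

```python
# Toy substitution scores (illustrative values only, not a real matrix).
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("D", "D"): 6,
    ("A", "S"): 1, ("A", "D"): -2, ("S", "D"): 0,
}
GAP_PENALTY = -4  # assumed flat penalty per gap position


def alignment_score(row_a, row_b):
    """Sum the per-position scores of two aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP_PENALTY
        elif (x, y) in TOY_SCORES:
            total += TOY_SCORES[(x, y)]
        else:
            # The table is symmetric, so try the reversed pair.
            total += TOY_SCORES[(y, x)]
    return total


# A/A + S/A + D/D + gap + A/A = 4 + 1 + 6 - 4 + 4
print(alignment_score("ASD-A", "AADSA"))
```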
Building Substitution Matrices: Part I
Blocks: Local ungapped alignment with
rows = protein segments and
columns = amino acid position
 1  A D E P Q D A
 2  A C E P D D A
 …
10  S D E P Q D A
New Sequence: A D E P Q R A
- Count the number of matches and mismatches between the
new sequence and every other sequence in the block.
- We have 9 AA matches and 1 AS mismatch in pos. 1
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks.
PNAS (1992) 89(22), pp. 10915-10919.
Building Substitution Matrices: Part II
Next, sum the results of each column, store results
in a table and add the new sequence to the group
By successively adding new sequences, we get a
table with all possible pairs
If we have 9 A's and 1 S in the first column,
we get 1 + 2 + … + 8 = 36 possible AA pairs,
9 AS or SA pairs, and 0 SS pairs
If w = width of the block (in amino acid positions) and s = number of sequences,
we have w*s*(s-1)/2 total possible pairs.
Here, we have 36 + 9 = 45, or 1*10*9/2 = 45
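The pair counting for one column can be checked with a short Python sketch. The function name is hypothetical; the column is the 9 A's and 1 S from the example.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        # Sort so that AS and SA are stored under the same key.
        pairs["".join(sorted((a, b)))] += 1
    return pairs


first_column = "A" * 9 + "S"  # 9 A's and 1 S, as in the example
pairs = count_column_pairs(first_column)
s = len(first_column)

print(pairs["AA"], pairs["AS"], pairs["SS"])
print(sum(pairs.values()), 1 * s * (s - 1) // 2)  # both totals use w = 1
```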
Calculating the Lod (log-odds) Matrix
• Let fij be the total number of amino acid
pairs in the frequency table at position i,j
(1<=j<=i<=20)
• Then the observed proportion for each
amino acid pairing is:
$$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}}$$
• We have fAA=36 and fAS=9, so qAA=36/45
and qAS=9/45
Calculating the Lod Matrix II
• Now we need the expected probabilities of occurrence for
each amino acid pair
• If we assume that the observed frequencies of each amino
acid are the population frequencies, we have
$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$
• For our example, pA=36/45+(9/45)/2=0.9 and
pS=(9/45)/2=0.1
• Then the expected probability (eij) of occurrence is pi*pj for
i=j and pi*pj+pj*pi=2*pi*pj for i!=j
• We have expected probability of AA=0.9*0.9=0.81,
AS=2*0.9*0.1=0.18, SS=0.1*0.1=0.01
Calculating the Lod Matrix III
• Then we calculate the log-odds score in bits
as sij=log2(qij/eij): if we see more pairs than
expected, sij>0; if we see as many as
expected, sij=0; and if we see fewer than
expected, sij<0
• Multiplying sij by 2 (giving half-bit units) and rounding to the
nearest integer, we obtain our values for the
block substitution matrix (BLOSUM)
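The three Lod slides can be traced numerically for the one-column example (fAA=36, fAS=9, fSS=0). The Python below is a sketch with hypothetical variable names; treating a never-observed pair as "no score" is an assumption made here to avoid log2(0). With so little data both observed pairs land close to their expected frequencies, so both entries round to 0.

```python
import math

# Pair counts from the worked example (first column: 9 A's and 1 S).
f = {"AA": 36, "AS": 9, "SS": 0}
total = sum(f.values())  # 45 pairs

# Observed proportion of each pair.
q = {pair: n / total for pair, n in f.items()}

# Residue frequencies: an i,i pair counts fully, an i,j pair counts half.
p = {"A": q["AA"] + q["AS"] / 2, "S": q["SS"] + q["AS"] / 2}

# Expected proportion of each pair if residues paired independently.
e = {"AA": p["A"] ** 2, "AS": 2 * p["A"] * p["S"], "SS": p["S"] ** 2}


def log_odds_bits(pair):
    """s_ij = log2(q_ij / e_ij); None when the pair was never observed."""
    if q[pair] == 0:
        return None
    return math.log2(q[pair] / e[pair])


def blosum_entry(pair):
    """Log-odds in half-bit units, rounded to the nearest integer."""
    s = log_odds_bits(pair)
    return None if s is None else round(2 * s)


for pair in f:
    print(pair, log_odds_bits(pair), blosum_entry(pair))
```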
Clustering
• To prevent “double-counting” amino acid
contributions from closely related proteins,
sequences are clustered and counted as a single
sequence in counting amino acids
• Thus, if two sequences are identical at >X% of
their aligned positions, they are clustered and their
contributions are averaged
• In our example, if we were to cluster 8 of our
sequences with A in the first position, the cluster would
count as one A and we would have 2 A's and 1 S
• These matrices will be denoted BLOSUM X, such
as BLOSUM 62
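One way to picture the clustering step is to give every sequence a weight of 1/(size of its cluster) before counting residues. The Python below is a minimal sketch under that assumption; the function name and the cluster labels are hypothetical, and the clusters are supplied by hand rather than derived from a percent-identity threshold.

```python
from collections import Counter, defaultdict


def weighted_residue_counts(column, cluster_ids):
    """Count residues in a column so that each cluster counts as one sequence."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for residue, cluster in zip(column, cluster_ids):
        counts[residue] += 1 / sizes[cluster]
    return dict(counts)


column = "A" * 9 + "S"          # first column of the example block
cluster_ids = [0] * 8 + [1, 2]  # 8 of the A sequences form one cluster

print(weighted_residue_counts(column, cluster_ids))
```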
Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in related
proteins; identical amino acids are given high positive scores,
frequently observed substitutions get lower positive scores, and
seldom observed substitutions get negative scores.
Related Calculations
• Relative entropy
measures the average information, in bits per position,
that distinguishes an alignment from
chance
• Expected score in bit units
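As an illustration of the relative entropy, the sketch below evaluates H = sum of qij * log2(qij/eij) for the one-column example used earlier (qAA=36/45, qAS=9/45, eAA=0.81, eAS=0.18, eSS=0.01). Skipping pairs with qij=0 is the usual convention and is assumed here; the function name is hypothetical.

```python
import math

# Observed (q) and expected (e) pair proportions from the worked example.
q = {"AA": 36 / 45, "AS": 9 / 45, "SS": 0.0}
e = {"AA": 0.81, "AS": 0.18, "SS": 0.01}


def relative_entropy(q, e):
    """H = sum of q_ij * log2(q_ij / e_ij); unobserved pairs contribute 0."""
    return sum(qv * math.log2(qv / e[pair]) for pair, qv in q.items() if qv > 0)


H = relative_entropy(q, e)
print(f"H = {H:.4f} bits")
```

The value is small because this toy column barely departs from chance expectations.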
Bioinformatics Approaches:
Primary Structure
[Flowchart: three search routes starting from the protein sequences]
• Protein sequences → ClustalW alignment → Hmmer motif creation → Hmmer database search → input sequences and other database proteins returned
• Protein sequences → ClustalW alignment → TRVI consensus creation (consensus sequence developed) → search database with motif, allowing for gaps and protein substitutions → input sequences and other database proteins returned
• Protein sequences → MEME domain creation → MAST domain database search → input sequences and other database proteins returned
Primary Sequence Search
Methodology
Hmmer search of aligned sequences:
• Hmmer uses hidden Markov models to make a profile probability
matrix of amino acids from aligned sequences
• The matrix is searched against the appropriate genome database
TRVI search allowing for gaps and substitutions:
• A motif is developed by allowing for a flexible number of gaps
wherever there are gaps in the alignment
• Substitutions of amino acids with similar properties are allowed
• The motif is searched against the appropriate genome database
MEME/MAST search of unaligned sequences:
• Identifies a specified number of domains (probability matrices) across
a subset of the input sequences
• The domains are searched against the appropriate genome database
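The idea of searching with a motif that permits substitutions can be illustrated with a regular expression. This is not the TRVI implementation; it is a sketch that turns the published WEK(I/M)XXFF motif into a pattern, where X matches any residue and (I/M) matches either listed residue. The function names and the test sequence are made up, and flexible gaps are not modeled.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'WEK(I/M)XXFF' into a regular expression."""
    # '(I/M)' becomes the character class '[IM]'.
    pattern = re.sub(
        r"\(([A-Z/]+)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        motif,
    )
    # 'X' stands for any residue.
    return pattern.replace("X", ".")


def find_motif(motif, sequence):
    """Return (1-based start position, matched text) for each motif hit."""
    regex = re.compile(motif_to_regex(motif))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


hypothetical_sequence = "MSTLNWEKIAQFFGA"  # made-up sequence for illustration
print(motif_to_regex("WEK(I/M)XXFF"))
print(find_motif("WEK(I/M)XXFF", hypothetical_sequence))
```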
How Hmmer Works:
Profile Hidden Markov Models
for Protein Sequence Analysis
• http://hmmer.wustl.edu/
Hmmer Architecture
• Squares are match states (consensus
positions), diamonds are insertions, circles
are deletions and beginning/end. Arrows
indicate state transitions.
Hidden Markov Model Background
From PMMB—Sandrine Dudoit
See also http://www.ai.mit.edu/~murphyk/Bayes/rabiner.pdf
More Hidden Markov Model Background
Still More Background
Hmmer Intro
• Each M/D/I set of states is a node, and the nodes are determined by the data and the
multiple sequence alignment
• Each M state aligns with a single amino acid and carries a
vector of 20 probabilities determined by the proportion of
times that an amino acid has shown up in a position in a
multiple sequence alignment
• Capable of handling gapped alignments
• At each node either the M (amino acid aligned) or D state
is used; I states occur between nodes and have a self-transition
• Arrows are transition probabilities and are estimated from the
residues in each column of the multiple sequence
alignment
• S,N,C,T,J are “special states” that are algorithm-dependent
and controlled externally
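The vector of 20 probabilities carried by a match state can be sketched as column frequencies from the alignment. The pseudocount below is an assumption added so that unseen residues do not get probability zero; it is a simple stand-in, not the prior that Hmmer itself applies, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probabilities(column, pseudocount=1.0):
    """Estimate match-state emission probabilities from one alignment column."""
    # Ignore gap characters and anything that is not a standard residue.
    counts = Counter(r for r in column if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


probs = emission_probabilities("AAAAAAAAAS")  # 9 A's and 1 S
print(round(probs["A"], 3), round(probs["S"], 3), round(probs["W"], 3))
```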
Intermediate Hmmer
• Want to calculate P(S|M), where the sum over the
space of all sequences should be 1
• …The rules of the HMM allow us to do this
• It is implied that the insertion lengths follow a geometric
distribution
• From a multiple sequence alignment “seed”,
Hmmer makes a consensus sequence and searches
databases against this consensus sequence
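The geometric distribution of insertion lengths follows from the insert state's self-transition: staying in the I state k-1 times and then leaving has probability t^(k-1) * (1-t). The sketch below assumes a single self-transition probability t; the value 0.5 and the function name are illustrative only.

```python
def insert_length_probability(k, self_transition):
    """P(an insertion has exactly k residues) for I->I probability t, k >= 1."""
    if k < 1:
        raise ValueError("an insertion has at least one residue")
    return self_transition ** (k - 1) * (1 - self_transition)


t = 0.5  # illustrative self-transition probability
for k in range(1, 5):
    print(k, insert_length_probability(k, t))

# The probabilities over all lengths sum to 1 (checked here up to k = 60).
print(sum(insert_length_probability(k, t) for k in range(1, 61)))
```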
Hmmer Results
ClustalW Alignment of SPI1 Effectors
ClustalW Alignment of All Known Effectors
Analysis of TRVI-Putative
Cytoplasmic Proteins
• Literature search
– YciE not found
– YciF classified as a putative structural protein by Blattner et al.
• BLAST searches
– STM0274 almost exactly SciI (S. typhimurium); other homologies
to ImpC and ImpD (Rhizobium leguminosarum), and conserved
hypotheticals; no literature on SciI, ImpC, or ImpD
– YciF has homologies to other putative structural proteins in
Shigella and E. coli. Also homologous to several conserved
hypotheticals
– YciE has homologies to YciE from E. coli and other putative
cytoplasmic/structural proteins in other species (YciE and YciF do
not hit each other)
– STM3767 homologous to a 4-hydroxy-2-oxoglutarate aldolase and
several hypothetical proteins
– STM4192 homologous to a nucleoprotein/polynucleotide-associated enzyme,
hypothetical protein YaiL from E. coli, and
hypotheticals (YaiL not in literature)
Analysis of TRVI-Microarray
Proteins
• SseJ and YciE show up
• fruF is part of the phosphoenolpyruvate:
fructose phosphotransferase system
• STM1181 is a putative flagella basal body
part
S. typhimurium MEME Motif Summary
MEME MAST Analysis
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein
(STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and
STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion
regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB
(part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate
transporter), STM4502
• Domains 1, 3, and 5 look to be important for SPI2 secretion
• The other domains are important for small, related subsets of proteins
MEME Including Putative Cytoplasmic
Proteins
S. typhimurium Search Results
Summary
Hmmer search of aligned sequences:
Only the input sequences (+ 2 theoretically secreted proteins) were returned. SPI1
and SPI2 effectors both have significant E-values from a combined matrix.
TRVI search allowing for gaps and substitutions:
56 hits returned. Possibly interesting hits include SseI, 5 LysR family proteins, 5
putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane
proteins, and 3 flagellar proteins. 4 proteins (FruF, SseJ, YciE, and a putative
flagellar protein) were also identified in a DNA microarray screen under
SPI2-inducing conditions with cholesterol.
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein (STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of
needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter),
STM4502
Primary Structure Conclusions
• The best lead may be YciE, a putative
cytoplasmic protein found with two
different search methods
• The methods did not give the same output
• Hypothetical proteins found in the literature
such as SipD, SptP (SPI1) and SpiC, SrfJ,
SseB,C,D (SPI2) were not found
• Not all proteins that go out via SPI2
have the WEK(I/M)XXFF motif
• There is not a clear SPI1 motif
Secondary Structure Prediction
• Psipred structure prediction server used
• Predictions made by two feed-forward neural
networks based on PSI-BLAST output
• N-terminal motif (MEME 3)—random coil in all
SPI2 proteins
• First SPI2 motif at aa 31-38 (MEME 1)—
examples are SseJ, SifA, SifB(+F), SlrP(+F), SseI,
SspH1(+F)
• Second SPI2 motif at aa 105-120 (no MEME)—
entirely random coil except for a small segment of
SspH2
Secondary Structure Prediction of
SifA
Alpha-helical Wheel (SifA,SifB)
WEK(I/M)XXFF is the
conserved motif among
SPI2 effectors from aa 34-41
(positions 1, 2, 3, 4, 7).
All show this profile but
SseJ (position 7 is polar, but there is still a hydrophobic face).
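A helical wheel places successive residues 100 degrees apart around the helix axis, assuming an ideal alpha helix with 3.6 residues per turn. The sketch below lists the wheel angle for each of the eight motif positions; the function name is hypothetical and no hydrophobicity scale is applied.

```python
DEGREES_PER_RESIDUE = 100  # ideal alpha helix: 3.6 residues per 360-degree turn


def helical_wheel_angles(n_residues):
    """Angle (in degrees) of each residue around the helix axis."""
    return [(i * DEGREES_PER_RESIDUE) % 360 for i in range(n_residues)]


motif = ["W", "E", "K", "I/M", "X", "X", "F", "F"]  # WEK(I/M)XXFF
angles = helical_wheel_angles(len(motif))
for position, (residue, angle) in enumerate(zip(motif, angles), start=1):
    print(position, residue, angle)
```

Residues whose angles fall close together sit on the same face of the helix, which is how a wheel plot reveals a hydrophobic face.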
SspH1 Secondary Structure
SspH2 Alpha-Helical Wheel
SseG Secondary Structure
SseG Alpha-helical Wheel
SopD Alpha-Helical Wheel
Secondary Structure Conclusion
• A hydrophobic face on the alpha helix
containing the conserved motif may be at least in
part responsible for the translocation signal
• Other seemingly important domains do not
have secondary structure (other than
random coils)
• I have not looked at the SPI1 effectors nor
the putative cytoplasmic proteins in this
regard
3D Structure Prediction and
Comparison:
Ab initio
• Prediction based solely upon the primary
amino acid sequence of the protein
• Rosetta (David Baker at U. of
Washington) has done fairly well at CASP
competitions
• Accuracy of predictions still in question
3D Prediction and Comparison:
Homology Modeling
• BLAST the protein of interest against proteins in the Brookhaven
Protein Data Bank (PDB)
• If there is significant homology (approx. 30% sequence identity), then a
model for the protein of interest can be determined based
on the known structure(s) of the other protein(s)
• This model can be compared to other known or predicted
models to determine similarity
• The main flaw is that if there is not a sequence with
significant homology that has been crystallized, this
method cannot be used
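The approx. 30% figure is checked against the percent identity of an alignment. As a rough sketch, the Python below computes identity over the aligned (non-gap) columns of two rows; the function name is hypothetical, the two sequences are the first two rows of the block example from the substitution matrix slides, and real modeling servers apply their own criteria.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    # Keep only columns where neither row has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


identity = percent_identity("ADEPQDA", "ACEPDDA")
print(f"{identity:.1f}% identity, above 30%: {identity >= 30}")
```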
Results of Swiss-Model Homology
Search of all Putative and Known
Effectors
• Only full-length SspH1, SspH2 and SopE had
enough homology to get structures
• Only SopE gave me a result when I submitted the
first 150 amino acids
• The catalytic domain of SopE has been
crystallized, but the first 77 amino acids are
missing
• Only the Leucine-rich repeat region of SspH1 and
SspH2 could be modeled (amino acids 158 and
higher)
Tertiary Structure Examples
[Figure: Catalytic domain of SopE (starts at aa 77) and Cdc42]
[Figure: SspH1 homology-modeled to YopM. Homology starts at amino acid 158. Geno3D2 used.]
Future Directions
• Do a similar primary structure analysis but
expanding to also include hypothetical
proteins from the literature (19 such
proteins)
• Study the different classes of proteins
known to form the needle, form the
translocon and act as chaperones
• Do secondary structure analysis on the
known SPI1 proteins and on the putative
cytoplasmic proteins just identified
• Try the Rosetta program
Acknowledgments
Kasturi Haldar
Team Salmonella:
Drew “Big Daddy Salmonella” Catron
Everett Roark
Team Malaria:
Paul Cheresh
Carlos Lopez-Estrano
Sean Murphy
Thanos Lykidis
Luisa Hiller
Thomas Akompong
Travis Harrison
Parwez Nawabi
Souvik Bhattacharjee
Team Bioinformatics:
Dhugal Bedford
Veronica Ryskin