Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium
Matthew Sylvester
12/1/03

Endocytic Trafficking
[Diagram labels: SPI-1, SPI-2, Salmonella-containing vacuole, lysosome, bacterial effector proteins (SseJ, SifA, SseXs, and several others), Lamp-1, H+ ATPase, cathepsins, transferrin R, Man. 6-PR]

Selection of S. typhimurium Proteins
• Salmonella effectors are secreted into the host cell via the type three secretion system (TTSS) of either Salmonella pathogenicity island 1 (SPI1) or SPI2
• We chose only those proteins shown experimentally in the literature to be secreted through one or both of these systems (see PubMed at http://ncbi.nlm.nih.gov)
• The seventeen identified SPI1- and SPI2-associated effectors were considered as one group for subsequent analysis
• As the N-terminal 150 amino acids have been shown to contain conserved sequences for several SPI2 effectors, we compared this region (Miao and Miller, 2000)

Alignment of SPI-2 Effector Proteins
Miao E and Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. PNAS. 2000, 97(13), pp. 7539-7544.
Published alignment of known and putative SPI2 effectors identified by a BLAST (Basic Local Alignment Search Tool) search and then aligned using ClustalW. Note the presence of the WEK(I/M)XXFF motif at approx. aa 31-38.

BLAST
• Tries to find the most “similar” proteins
• Compares a query to sequences in a database; each comparison is given a score (higher scores indicate greater similarity)
• Substitution-based scoring matrices assign a score to each aligned residue pair based on the probability of that substitution
• Gap penalties are negative scores
• The alignment score is the sum of the scores at each position
• Significance of the overall alignment is given as a p-value or an E-value
– E-value = expectation value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
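The scoring rule in the BLAST bullets (per-position substitution scores plus negative gap penalties, summed over the alignment) can be sketched as follows. This is a minimal illustration and not BLAST's implementation: the score table, the gap penalty, and the function names are hypothetical values chosen for the example, not entries from a published matrix.

```python
# Toy substitution scores (hypothetical values chosen for illustration,
# not taken from any published matrix). Each pair is stored once and
# looked up symmetrically.
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("A", "S"): 1, ("D", "E"): 2, ("A", "D"): -2, ("A", "E"): -1,
    ("S", "D"): 0, ("S", "E"): 0,
}
GAP_PENALTY = -4  # hypothetical linear gap penalty (a negative score)


def pair_score(a: str, b: str) -> int:
    """Score one aligned column: a gap penalty or a substitution score."""
    if a == "-" or b == "-":
        return GAP_PENALTY
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def alignment_score(row1: str, row2: str) -> int:
    """Sum the per-column scores of two equal-length aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))


if __name__ == "__main__":
    # Columns: A/S, D/D, E/E, gap/E, A/A
    print(alignment_score("ADE-A", "SDEEA"))
```

A real search would use a full 20 x 20 matrix and usually separate gap-open and gap-extend penalties; the single linear penalty here only keeps the sum easy to follow.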
The lower the E-value, the more significant the score.

Building Substitution Matrices: Part I
Blocks: local ungapped alignments with rows = protein segments and columns = amino acid positions
 1  A D E P Q D A
 2  A C E P D D A
 …
10  S D E P Q D A
New sequence: A D E P Q R A
– Count the number of matches and mismatches between the new sequence and every other sequence in the block.
– We have 9 AA matches and 1 AS mismatch in position 1.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. PNAS (1992), pp. 10915-10919.

Building Substitution Matrices: Part II
Next, sum the results of each column, store the results in a table, and add the new sequence to the group.
By successively adding new sequences, we get a table with all possible pairs.
If we have 9 A's and 1 S in the first column, we get 1 + 2 + … + 8 = 36 possible AA pairs, 9 AS or SA pairs, and 0 SS pairs.
If w = width of the block in amino acids and s = number of sequences, we have w*s*(s-1)/2 total possible pairs. Here, we have 36 + 9 = 45, or 1*10*9/2 = 45.

Calculating the Lod (log-odds) Matrix
• Let $f_{ij}$ be the total number of $i,j$ amino acid pairs in the frequency table ($1 \le j \le i \le 20$)
• Then the observed proportion for each amino acid pairing is: $q_{ij} = f_{ij} / \sum_{k=1}^{20} \sum_{l=1}^{k} f_{kl}$
• We have $f_{AA} = 36$ and $f_{AS} = 9$, so $q_{AA} = 36/45$ and $q_{AS} = 9/45$

Calculating the Lod Matrix II
• Now we need the expected probability of occurrence for each amino acid pair
• If we assume that the observed frequencies of each amino acid are the population frequencies, we have $p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2$
• For our example, $p_A = 36/45 + (9/45)/2 = 0.9$ and $p_S = (9/45)/2 = 0.1$
• Then the expected probability of occurrence $e_{ij}$ is $p_i p_j$ for $i = j$ and $p_i p_j + p_j p_i = 2 p_i p_j$ for $i \ne j$
• We have expected probabilities $e_{AA} = 0.9 \times 0.9 = 0.81$, $e_{AS} = 2 \times 0.9 \times 0.1 = 0.18$, $e_{SS} = 0.1 \times 0.1 = 0.01$

Calculating the Lod Matrix III
• Then we calculate the log-odds score in bits as $s_{ij} = \log_2(q_{ij}/e_{ij})$: if we see more than expected, $s_{ij} > 0$; if we see as many as expected, $s_{ij} = 0$; and if we see fewer than expected, $s_{ij} < 0$
• Multiplying s by 2 and rounding to the nearest integer, we
obtain our values for the block substitution matrix (BLOSUM)

Clustering
• To prevent “double-counting” amino acid contributions from closely related proteins, sequences are clustered and counted as a single sequence when counting amino acids
• Thus, if two sequences are identical at >X% of their aligned positions, their contributions are averaged
• In our example, if we were to cluster 8 of the sequences with A in the first position, we would now have 2 A's and 1 S
• These matrices are denoted BLOSUM X, such as BLOSUM 62

Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in related proteins; identical amino acids are given high positive scores, frequently observed substitutions get lower positive scores, and seldom observed substitutions get negative scores.

Related Calculations
• Relative entropy measures the average information, in bits, that distinguishes an alignment from chance
• Expected score in bit units

Bioinformatics Approaches: Primary Structure
[Flowchart] Protein sequences feed three search pipelines:
– ClustalW alignment → Hmmer motif creation → Hmmer database search → input sequences and other database proteins returned
– ClustalW alignment → consensus sequence developed → TRVI consensus creation → search database with motif, allowing for gaps and protein substitutions → input sequences and other database proteins returned
– MEME domain creation → MAST domain database search → input sequences and other database proteins returned

Primary Sequence Search Methodology
Hmmer search of aligned sequences:
• Hmmer uses hidden Markov models to build a profile probability matrix of amino acids from aligned sequences
• The matrix is searched against the appropriate genome database
TRVI search allowing for gaps and substitutions:
• A motif is developed by allowing a flexible number of gaps wherever there are gaps in the alignment
• Substitutions of amino acids with similar properties are allowed
• The motif is searched against the appropriate genome database
MEME/MAST search of unaligned sequences:
•
Identifies a specified number of domains (probability matrices) across a subset of the input sequences
• The domains are searched against the appropriate genome database

How Hmmer Works: Profile Hidden Markov Models for Protein Sequence Analysis
• http://hmmer.wustl.edu/

Hmmer Architecture
• Squares are match states (consensus positions), diamonds are insertion states, and circles are deletion states and the begin/end states. Arrows indicate state transitions.

Hidden Markov Model Background
From PMMB (Sandrine Dudoit)
See also http://www.ai.mit.edu/~murphyk/Bayes/rabiner.pdf

More Hidden Markov Model Background

Still More Background

Hmmer Intro
• Each node is a set of M/D/I states, determined by the data and the multiple sequence alignment
• Each M state aligns with a single amino acid and carries a vector of 20 probabilities determined by the proportion of times each amino acid appears at that position in the multiple sequence alignment
• Capable of handling gapped alignments
• At each node either the M (amino acid aligned) or D state is used; I states occur between nodes and can self-transition
• Arrows are transition probabilities, which are estimated from the residues in each column of the multiple sequence alignment
• S, N, C, T, J are “special states” that are algorithm-dependent and controlled externally

Intermediate Hmmer
• We want to calculate P(S|M), where the sum over the space of all sequences should be 1
• … The rules of the HMM allow us to do this
• This implies that insertion lengths follow a geometric distribution
• From a multiple sequence alignment “seed”, Hmmer makes a consensus sequence and searches databases against this consensus sequence

Hmmer Results

ClustalW Alignment of SPI1 Effectors

ClustalW Alignment of All Known Effectors

Analysis of TRVI-Putative Cytoplasmic Proteins
• Literature search
– YciE not found
– YciF classified as a putative structural protein by Blattner et al.
• BLAST searches
– STM0274 almost exactly SciI (S.
typhimurium); other homologies to ImpC and ImpD (Rhizobium leguminosarum) and to conserved hypotheticals; no literature on SciI, ImpC, or ImpD
– YciF has homologies to other putative structural proteins in Shigella and E. coli; also homologous to several conserved hypotheticals
– YciE has homologies to YciE from E. coli and other putative cytoplasmic/structural proteins in other species (YciE and YciF do not hit each other)
– STM3767 is homologous to a 4-hydroxy-2-oxoglutarate aldolase and several hypothetical proteins
– STM4192 is homologous to a nucleoprotein/polynucleotide-associated enzyme, hypothetical protein YaiL from E. coli, and other hypotheticals (YaiL not in literature)

Analysis of TRVI-Microarray Proteins
• SseJ and YciE show up
• fruF is part of the phosphoenolpyruvate:fructose phosphotransferase system
• STM1181 is a putative flagellar basal body component

S. typhimurium MEME Motif Summary

MEME MAST Analysis
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (D-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
• Domains 1, 3, and 5 look to be important for SPI2 secretion
• The other domains are important for small, related subsets of proteins

MEME Including Putative Cytoplasmic Proteins

S. typhimurium Search Results Summary
Hmmer search of aligned sequences: Only the input sequences (+ 2 theoretically secreted proteins) were returned. SPI1 and SPI2 effectors both have significant E-values from a combined matrix.
TRVI search allowing for gaps and substitutions: 56 hits returned. Possibly interesting hits include SseI, 5 LysR family proteins, 5 putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane proteins, and 3 flagellar proteins. 4 proteins (FruF, SseJ, YciE, and a putative flagellar protein) were also identified in a DNA microarray screen under SPI2-inducing conditions with cholesterol.
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (D-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502

Primary Structure Conclusions
• The best lead may be YciE, a putative cytoplasmic protein found with two different search methods
• The methods did not give the same output
• Hypothetical proteins found in the literature such as SipD, SptP (SPI1) and SpiC, SrfJ, SseB,C,D (SPI2) were not found
• Not all proteins that are secreted via SPI2 have the WEK(I/M)XXFF motif
• There is no clear SPI1 motif

Secondary Structure Prediction
• Psipred structure prediction server used
• Predictions are made by two feed-forward neural networks based on PSI-BLAST output
• N-terminal motif (MEME 3): random coil in all SPI2 proteins
• First SPI2 motif at aa 31-38 (MEME 1): examples are SseJ, SifA, SifB(+F), SlrP(+F), SseI, SspH1(+F)
• Second SPI2 motif at aa 105-120 (no MEME): entirely random coil except for a small segment of SspH2

Secondary Structure Prediction of SifA

Alpha-helical Wheel (SifA, SifB)
WEK(I/M)XXFF is the conserved motif among SPI2
effectors from aa 34-41 (positions 1, 2, 3, 4, 7). All show this profile except SseJ (position 7 is polar, but there is still a hydrophobic face).

SspH1 Secondary Structure

SspH2 Alpha-Helical Wheel

SseG Secondary Structure

SseG Alpha-helical Wheel

SopD Alpha-Helical Wheel

Secondary Structure Conclusion
• A hydrophobic face on the alpha helix containing the conserved motif may be at least in part responsible for the translocation signal
• Other seemingly important domains do not have secondary structure (other than random coil)
• I have not looked at the SPI1 effectors or the putative cytoplasmic proteins in this regard

3D Structure Prediction and Comparison: Ab initio
• Prediction based solely upon the primary amino acid sequence of the protein
• Rosetta (David Baker, U. of Washington) has done fairly well at CASP competitions
• Accuracy of predictions still in question

3D Prediction and Comparison: Homology Modeling
• BLAST the protein of interest against proteins in the Brookhaven Protein Data Bank (PDB)
• If there is significant homology (approx. 30%), then a model for the protein of interest can be built based on the known structure(s) of the other protein(s)
• This model can be compared to other known or predicted models to determine similarity
• The main flaw is that if there is no sequence with significant homology that has been crystallized, this method cannot be used

Results of Swiss-Model Homology Search of All Putative and Known Effectors
• Only full-length SspH1, SspH2 and SopE had enough homology to get structures
• Only SopE gave me a result when I submitted the first 150 amino acids
• The catalytic domain of SopE has been crystallized, but the first 77 amino acids are missing
• Only the leucine-rich repeat region of SspH1 and SspH2 could be modeled (amino acids 158 and higher)

Tertiary Structure Examples
Catalytic domain of SopE (starts at aa 77) and Cdc42
SspH1 homology-modeled to YopM. Homology starts at amino acid 158. Geno3D2 used.
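The hydrophobic-face argument from the helical wheel slides can be checked with a short calculation. The sketch below assumes an ideal alpha helix (100 degrees of rotation per residue) and a simple hydrophobic residue set; both are assumptions rather than settings taken from the slides. The peptide WEKIGSFF is a hypothetical instance of WEK(I/M)XXFF with G and S filling the two X positions, and the function names are illustrative.

```python
DEGREES_PER_RESIDUE = 100  # ideal alpha helix: 3.6 residues per 360-degree turn

# Residues treated as hydrophobic in this sketch (an assumed grouping).
HYDROPHOBIC = set("AVILMFWC")


def wheel_angles(peptide: str) -> list:
    """Angular position of each residue on the helical wheel, in degrees."""
    return [(i * DEGREES_PER_RESIDUE) % 360 for i in range(len(peptide))]


def hydrophobic_positions(peptide: str) -> list:
    """(1-based position, residue, angle) for each hydrophobic residue."""
    return [
        (i + 1, aa, angle)
        for i, (aa, angle) in enumerate(zip(peptide, wheel_angles(peptide)))
        if aa in HYDROPHOBIC
    ]


def smallest_arc(angles: list) -> int:
    """Width in degrees of the smallest arc of the wheel covering all angles."""
    distinct = sorted({a % 360 for a in angles})
    if len(distinct) < 2:
        return 0
    gaps = [
        (distinct[(k + 1) % len(distinct)] - distinct[k]) % 360
        for k in range(len(distinct))
    ]
    # The covering arc is the full circle minus the largest empty gap.
    return 360 - max(gaps)


if __name__ == "__main__":
    motif = "WEKIGSFF"  # hypothetical instance of WEK(I/M)XXFF
    hits = hydrophobic_positions(motif)
    print(hits)
    print(smallest_arc([angle for _, _, angle in hits]))
```

Under these assumptions the hydrophobic residues of the example peptide fall within a 120 degree arc, which is the kind of one-sided clustering a helical wheel plot makes visible. Real sequences need their actual residues at the X positions, and a different hydrophobicity grouping may change the result.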
Future Directions
• Do a similar primary structure analysis, expanded to include hypothetical proteins from the literature (19 such proteins)
• Study the different classes of proteins known to form the needle, form the translocon, and act as chaperones
• Do secondary structure analysis on the known SPI1 proteins and on the putative cytoplasmic proteins just identified
• Try the Rosetta program

Acknowledgments
Kasturi Haldar
Team Salmonella: Drew “Big Daddy Salmonella” Catron, Everett Roark
Team Malaria: Paul Cheresh, Carlos Lopez-Estrano, Sean Murphy, Thanos Lykidis, Luisa Hiller, Thomas Akompong, Travis Harrison, Parwez Nawabi, Souvik Bhattacharjee
Team Bioinformatics: Dhugal Bedford, Veronica Ryskin
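As a backup to the substitution-matrix slides, the single-column example (9 A's and 1 S) can be reproduced numerically. This is a minimal sketch of the pair-counting and log-odds steps for one unclustered column; the function names are illustrative, and it is not the Henikoff implementation.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(column: str) -> Counter:
    """Count unordered residue pairs (f_ij) within one block column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(column: str) -> dict:
    """Map each observed pair to (score in bits, score in rounded half-bits)."""
    f = pair_counts(column)
    total = sum(f.values())  # equals w*s*(s-1)/2 with w = 1
    q = {pair: n / total for pair, n in f.items()}  # observed proportions

    # Residue frequencies: p_i = q_ii + sum over j != i of q_ij / 2
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        bits = math.log2(q_ab / expected)
        scores[(a, b)] = (bits, round(2 * bits))
    return scores


if __name__ == "__main__":
    column = "A" * 9 + "S"  # first column of the example block
    print(pair_counts(column))
    for pair, (bits, half_bits) in log_odds(column).items():
        print(pair, round(bits, 3), half_bits)
```

Only observed pairs receive a score: SS never occurs in this column, so its observed proportion is 0 and the logarithm is undefined. With a single column, both observed pairs land close to their expected frequencies and round to 0; informative scores come from pooling counts over many blocks.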