Identification of specificity-determining positions in protein alignments
Mikhail Gelfand
Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS
ECCB2005, Madrid

Motivation
• Large protein families with general function assigned by homology, not much functional information
• Much less structural data. Not many structures with substrates, cofactors etc.
• Some specificity assignments from comparative genomics
=>
• Search for specificity-determining positions in alignments
  – identification of functional sites
  – prediction of specificity
  – understanding and eventually re-design of function

Specificity (of transporters) from comparative genomics: three examples
1. New specificities in a little studied family
[Figure: phylogenetic tree of the transporter family (Pasteurellaceae, clostridia, Archaea and other taxa) with upstream regulatory elements marked: S-box (rectangle frame), MetJ (circle frame), LYS-element (circles), Tyr-T-box (rectangles). Labelled branches: LysT, TyrT, MetT, MleN (malate/lactate).]
2. Misleading homology: the PnuC family of transporters
[Figure labels: the THI elements; the RFN elements]
3. A nightmare.
The NiCoT family of nickel-cobalt transporters

SDP (Specificity-Determining Position)
Alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them
SDP is not equivalent to a functionally important position

Measure of specificity: mutual information
$$I_p = \sum_{\text{all specificity groups } i}\ \sum_{\text{all amino acids } \alpha} f_p(\alpha,i)\,\log\frac{f_p(\alpha,i)}{f_p(\alpha)\,f(i)}$$
• $f_p(\alpha,i)$ = count of amino acid α in group i at position p divided by the total number of sequences
• $f_p(\alpha)$ = frequency of amino acid α in position p
• $f(i)$ = fraction of proteins in group i

Taking into account the structure of the phylogenetic tree: random shuffling and linear regression
Z-score:
$$Z_p = \frac{I_p - I_p^{\exp}}{\sigma\!\left(I_p^{\exp}\right)}$$
=> positions that are more specific than expected given the tree

Smoothing: pseudocounts and similarity between amino acid residues
• m(ab) = amino acid substitution matrix
• n(a,i) = count of amino acid a at position i

Automated threshold setting: the Bernoulli estimator
Are 5 SDP with Z-score > 12 better than 10 SDP with Z-score > 9?
With the n observed Z-scores sorted in decreasing order, $Z_1 \ge Z_2 \ge \dots$:
$$k^* = \arg\min_k P(\text{there are at least } k \text{ observed Z-scores} \ge Z_k) = \arg\min_k \left[1 - \sum_{i=n-k+1}^{n} C_n^i\, q^i\, p^{\,n-i}\right]$$
$$p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} \exp\!\left(-\frac{Z^2}{2}\right) dZ, \qquad q = 1 - p$$

Other similar techniques
• Evolutionary trace (Lichtarge et al. 1996, 1997)
  – need structure; gradual construction of group-specific consensus
• Evolutionary rate shifts (DIVERGE, Gu et al. 2002)
  – positions with group-specific evolutionary rate
• Surface patches of slowly evolving residues (Rate4Site, Pupko et al.
2002) – need structure
• PCA in the sequence space (Casari et al., 1995)
• Correlated mutations (Pazos and Valencia, 2002)
• Prediction of functional sub-types (Hannenhalli and Russell, 2000)
  – relative entropy of HMM profiles for groups

SDPpred: Web interface
Input: multiple alignment of proteins divided into specificity groups

=== AQP ===
%sp|Q9L772|AQPZ_BRUME
-------------------------------------mlnklsaeffgtfwlvfggcgsa
ilaa--afp-------elgigflgvalafgltvltmayavggisg--ghfnpavslgltv
iiilgsts------------------------------slap-----------------qlwlfwvaplvgavigaiiwkgllgrd-------------------------------------
%sp|P48838|AQPZ_ECOLI
-------------------------------------mfrklaaecfgtfwlvfggcgsa
vlaa--gfp-------elgigfagvalafgltvltmafavghisg--ghfnpavtiglwa
lvihgatd------------------------------kfap-----------------qlwffwvvpivggiiggliyrtllekrd------------------------------------
%tr|Q92ZW9
-------------------------------------mfkklcaeflgtcwlvlggcgsa
vlas--afp-------qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglav
iiilgsth------------------------------rrvp-----------------qlwlfwiaplfgaaiagivwksvgeefrpvd---------------------------------
=== GLP ===
%sp|P11244|GLPF_ECOLI
----------------------------msqt---stlkgqciaeflgtglliffgvgcv
aalkvag---------a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwl
glilaltd------------------------------dgn--------------g-vpr
-flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl------------
%sp|P44826|GLPF_HAEIN
----------------------------mdks-----lkancigeflgtalliffgvgcv
…

SDPpred: Output
• Alignment of the family with the SDPs highlighted (Alignment view)
• Detailed description of each SDP (List of SDPs)
• Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)

Transcription factors from the LacI family
• Training set: 459 sequences, average length: 338 amino acids, 85 specificity groups
– 44 SDPs
  10 residues contact NPF (analog of the effector)
  7 residues in the effector contact zone (5Å<dmin<10Å)
  6 residues in the intersubunit
contacts
  5 residues in the intersubunit contact zone (5Å<dmin<10Å)
  7 residues contact the operator sequence
  6 residues in the operator contact zone (5Å<dmin<10Å)

LacI from E. coli
SDP clusters at the subunit contact region
[Figure: LacI (lactose repressor) from E. coli (1jwl). Labels: Cluster I, Cluster II, Effector, DNA operator]

Overall statistics (LacI of E. coli)
• Total 348 amino acids
• 44 SDP
Legend:
  Non-contacting residues (distance to the DNA, effector, or the other subunit >10Å)
  Contact zone (may be functional)
  Contacting residues (distance to the DNA, effector, or the other subunit <5Å)

Membrane channels of the MIP family
• Training set: 17 sequences, average length 280 amino acids, 2 specificity groups: aquaporins & glyceroaquaporins
– 21 SDPs
  8 residues contact glycerol (substrate) (dmin<5Å)
  8 residues oriented to the channel
  5 residues in the contacts with other subunits

GlpF from E. coli
Two SDP clusters at the contact of subunits forming the tetramer
• Cluster I: 20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit
• Cluster II: Glu43
[Figure: GlpF (glycerol facilitator) from E. coli (1fx8). Labels: Substrate (glycerol), Subunit I]

Overall statistics (GlpF from E. coli)
• Total 281 amino acids
• 21 SDP
Legend:
  Non-contacting residues (distance to the substrate, or another subunit >10Å)
  Contact zone (may be functional)
  Contacting residues (distance to the substrate, or another subunit <5Å)

Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor
• IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA) using either NAD or NADP as a cofactor in organisms from prokaryotes to higher eukaryotes
• IMDH: catalyzes oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi, the cofactor is NAD
[Figure: phylogenetic tree. Branch labels: Mitochondria, Eukaryota, Archaea, Bacteria]

Selecting specificity groups
1. By substrate: all IDHs vs. all IMDHs
2. By cofactor: all NAD-dependent vs.
all NADP-dependent
3. Four groups: IDH (NADP) type I, IDH (NADP) type II, IDH (NAD), IMDH (NAD)
[Figure: the family tree shown once per grouping, with branches labelled IDH (NADP) type I, IDH (NADP) type II, IDH (NAD), IMDH (NAD)]

Predicted SDPs
[Figure: NADP-dependent IDH from E. coli (1ai2) with the substrate (isocitrate) and the cofactor (NADP). Panel captions: most SDPs near the substrate; SDPs near the substrate and the cofactor; SDPs near the substrate, the cofactor and the other subunit; SDPs, the cofactor and the substrate]
• Nicotinamide nucleotide: 100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized
• Adenine nucleotide: 344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor

SDPs predicted for different groupings
[Figure: diagram of the SDP sets for the three groupings (cofactor-specific SDPs, substrate-specific SDPs, four groups). Residues shown: 208Arg, 337Ala, 100Lys, 300Ala, 105Thr, 229His, 154Glu, 103Leu, 233Ile, 158Asp, 115Asn, 305Asn, 308Tyr, 155Asn, 231Gly, 327Asn, 344Lys, 287Gln, 164Glu, 351Val, 345Tyr, 241Phe, 38Gly, 40Asp, 104Thr, 107Val, 152Phe, 323Ala, 245Gly, 161Ala, 232Asn, 162Gly, 36Gly, 45Met, 31Tyr, 341Thr, 97Val, 98Ala]
Color code:
  Contacts cofactor
  Contacts substrate AND cofactor
  Contacts substrate
  Contacts substrate AND the other subunit
  Contacts the other subunit

Overview
• Transcription factors: contacts with the cofactor and the DNA
• Transporters: contacts with the substrate
• Enzymes: contacts with the substrate and the cofactor
And all:
• contacts between subunits

Protein-DNA interactions
Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance <cutoff from a protein atom)
[Figure panels: CRP, PurR, IHF, TrpR]
The observed correlation does not depend on the distance cutoff

CRP/FNR family of regulators
[Figure: tree of the family with binding-site consensus per branch]
  CooA (Desulfovibrio): TGTCGGCnnGCCGACA
  CRP (Gamma-proteobacteria): TTGTGAnnnnnnTCACAA
  FNR (Gamma-proteobacteria): TTGATnnnnATCAA
  HcpR (Desulfovibrio): TTGTgAnnnnnnTcACAA

Correlation between contacting nucleotides and amino acid residues
Row labels of the alignment (organisms): DD DV EC YP VC DD DV EC YP VC
• CooA in Desulfovibrio spp.
• CRP in Gamma-proteobacteria
• HcpR in Desulfovibrio spp.
• FNR in Gamma-proteobacteria

COOA  ALTTEQLSLHMGATRQTVSTLLNNLVR
COOA  ELTMEQLAGLVGTTRQTASTLLNDMIR
CRP   KITRQEIGQIVGCSRETVGRILKMLED
CRP   KXTRQEIGQIVGCSRETVGRILKMLED
CRP   KITRQEIGQIVGCSRETVGRILKMLEE
HCPR  DVSKSLLAGVLGTARETLSRALAKLVE
HCPR  DVTKGLLAGLLGTARETLSRCLSRMVE
FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
FNR   TMTRGDIGNYLGLTVETISRLLGRFQK

Contacting residues: REnnnR
• TG: 1st arginine
• GA: glutamate and 2nd arginine
Binding-site consensus, in the order of the alignment (CooA, CRP, HcpR, FNR):
  TGTCGGCnnGCCGACA
  TTGTGAnnnnnnTCACAA
  TTGTgAnnnnnnTcACAA
  TTGATnnnnATCAA

The correlation holds for other factors in the family

Factor | Organisms | Consensus | Specific aa | Metabolic system | Inducer
CRP | Enterobacteria & Vibrio & Pasteurellaceae | TTGTGAnnnnnnTCACAA | R E R | catabolic repression | cAMP
VFR | Pseudomonas sp. | TTGTGAnnnnnnTCACAA | R E R | virulence | cAMP
CLP | Xanthomonas & Xylella sp. | nTGTGAnnnnnnTCACAn | R E R | phytopathogenicity | ? (not cAMP)
FNR & ANR | Gamma-proteobacteria | nnTTGATnnnnATCAAnn | V E R | response to anaerobiosis | O2, NO
FNR | Beta-proteobacteria | nnTTGATnnnnATCAAnn | L E R | response to anaerobiosis | O2
FNR & FixK | Alpha-proteobacteria | nnTTGATnnnnATCAAnn | I/L E R | nitrogen fixation | O2
DNR & Nnr | Pseudomonas & Paracoccus | nnTTGATnnnnATCAAnn | P E R | denitrification | NO, NO2
FNR | Bacillus sp. | nTGTGAnnTAnnTCACAn | R E R | response to anaerobiosis | O2-low conditions
PrfA | Listeria | nnTTAACAnnTGTTAAnn | S S R | virulence | ?
NtcA | Cyanobacteria | ntGTAnCnnnnGnTACan | R V R | nitrogen metabolism | 2-oxoglutarate
CysR | Cyanobacteria | ? | R V R | sulfate utilization | sulfate?
CooA | Desulfovibrio sp. and R. rubrum | nTGTCGGCnnGCCGACAn | R Q T | CO utilization | CO
HcpR* | Desulfovibrio sp. | TTGTgAnnnnnnTcACAA | R E R | prismane & sulfate reduction | ?
HcpR* | Desulfuromonas acetoxidans, Desulfotalea psychrophila | atTTGAccnnggTCAAat | S/P E R | prismane | ?
HcpR* | Clostridia, Bacteroides, Thermotogales, Fusobacteria, Treponema | ctGTAACawwtCTTACag | R P R | prismane | ?
HcpR* | ~P. gingivalis | nTGTCGCnnnnGCGACAn | R A R | prismane | ?
HcpR* | ~C. difficile | nnGGATnnnnnnATCCnn | R S R | prismane | ?
HcpR* | ~T. tengcongensis, D. hafniense | nTGTGAnnnnnnTCACAn | R E R | prismane | ?
HcpR* | ~Acidithiobacillus ferrooxidans | nCTTGATTnnAATCAAGn | P E R | prismane | ?
ArcR | Bacillus, Enterococcus sp. | nTGTGAnATATnTCACAn | R E A/S | arginine catabolism | O2
CprK | Desulfitobacterium dehalogenans | nnTTAnTGnnCAnTAAnn | H V R/K | halorespiration | aromatics
FlpA&B | Lactococcus lactis | nnTTGATnnnnATCAAnn | P E R | ? | Eh, O2

Plans and perspectives. Protein-DNA interactions
LacI family of transcriptional regulators (each branch represents a subfamily) … and their signals
[Figure: tree of the LacI family with 43 numbered subfamilies, including EC_LacI, EC_PurR, EC_RbsR, EC_GalR, EC_GalS, EC_FruR, EC_TreR, EC_CytR, BS_CcpA and others. Signal legend: D-galactose & galactosides; maltose & trehalose; sucrose; D-fructose; D-ribose; D-xylose]
1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites

Plans and perspectives. Experimental verification
• A new family of Ni/Co transporters
• No structural data
• Specificity predicted by comparative genomics
• Predicted SDPs form several clusters in the alignment and are located on the same sides of alpha-helices
• Mutational analysis

Terminators of translation in prokaryotes / decoding of stop-codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA)
Fragment of the alignment (117 pairs).
SDPs are shown by black boxes above the alignment.
“Interesting” positions: invariant, SDPs, variable rate.
SDPs and invariant positions: two decoding sites?

Plans and perspectives
• Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions
• Automated identification of specificity groups based on the analysis of the phylogenetic tree
• Protein-DNA interactions
• Identification of protein-protein contact surfaces

Publications
• N.J.Oparina, O.V.Kalinina, M.S.Gelfand, L.L.Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
• O.V.Kalinina, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13: 443-456.
• O.V.Kalinina, P.S.Novichkov, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.
• O.V.Kalinina, M.S.Gelfand, A.A.Mironov, A.B.Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.
• L.A.Mirny, M.S.Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321: 7-20.
• L.Mirny, M.S.Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30: 1704-1711.
• http://math.belozersky.msu.ru/~psn/

Acknowledgements
• Leonid Mirny (Harvard, MIT)
• Olga Kalinina
• Andrei A. Mironov
• Alexandra B. Rakhmaninova
• Dmitry Rodionov
• Olga Laikova

• Howard Hughes Medical Institute
• Ludwig Institute of Cancer Research
• Russian Fund of Basic Research
• Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”
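
Appendix sketch: scoring a column and choosing the cutoff
The two numerical steps described in the SDP slides (mutual information of an alignment column with the specificity grouping, and the Bernoulli estimator for the number of SDPs) can be illustrated with a minimal Python sketch. It is not the SDPpred implementation. The function names `mutual_information` and `bernoulli_cutoff` and the example data are hypothetical, and the sketch assumes Z-scores are already available: the tree-aware shuffling, the linear regression and the pseudocount smoothing from the talk are omitted.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Mutual information between residues in one column and group labels.

    column: one residue per sequence; groups: one group label per sequence.
    Frequencies are counts divided by the total number of sequences.
    No smoothing is applied (an assumption of this sketch).
    """
    n = len(column)
    joint = Counter(zip(column, groups))  # counts of (amino acid, group)
    residue_counts = Counter(column)      # counts of each amino acid
    group_counts = Counter(groups)        # counts of each group
    total = 0.0
    for (residue, group), count in joint.items():
        f_joint = count / n
        f_residue = residue_counts[residue] / n
        f_group = group_counts[group] / n
        total += f_joint * math.log(f_joint / (f_residue * f_group))
    return total


def bernoulli_cutoff(z_scores):
    """Pick k minimizing P(at least k of n Z-scores are >= Z_k).

    Z-scores are treated as independent standard normal under the null.
    Plain floating point is used, so extremely large Z-scores may underflow;
    a production version would work with logarithms.
    """
    z = sorted(z_scores, reverse=True)
    n = len(z)
    best_k, best_prob = 0, float("inf")
    for k in range(1, n + 1):
        p = 0.5 * math.erfc(z[k - 1] / math.sqrt(2))  # upper tail P(Z >= Z_k)
        q = 1.0 - p
        prob = sum(math.comb(n, i) * p**i * q**(n - i) for i in range(k, n + 1))
        if prob < best_prob:
            best_k, best_prob = k, prob
    return best_k, best_prob


# Hypothetical toy data: six sequences in two specificity groups.
groups = [0, 0, 0, 1, 1, 1]
specific_column = "AAAGGG"   # conserved within groups, differs between them
conserved_column = "AAAAAA"  # conserved everywhere, carries no specificity
mi_specific = mutual_information(specific_column, groups)
mi_conserved = mutual_information(conserved_column, groups)

# Hypothetical Z-scores: three clear outliers followed by background values.
k_star, prob = bernoulli_cutoff([5.0, 4.8, 4.5, 0.3, 0.1, -0.2, -1.0, -1.5])
print(mi_specific, mi_conserved, k_star)
```

In this toy case the fully conserved column scores zero while the group-specific column scores log 2, which illustrates why an SDP is not the same thing as a conserved, functionally important position. The cutoff routine selects the three outlying Z-scores, since adding the fourth would make the observed set far more probable under the null.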