Hidden Markov models for detecting remote protein homologies
Kevin Karplus, Christian Barrett, Richard Hughey
Georgia Hadjicharalambous

Brief outline
• Description and evaluation of a new hidden Markov model method, SAM-T98, for finding remote homologs of protein sequences.
• Evaluation with three fold-recognition test datasets and a curated database.
• Comparison with WU-BLASTP and DOUBLE-BLAST.
• Results

1. Biology background
– Homologs: chromosomes carrying the same genetic loci; a diploid cell has 2 copies of each homolog, one derived from each parent. For proteins, homologs are sequences that share a common ancestor.
– The structure of a protein can be predicted by using a homology to sequences for which the structure is known.
– Similar structures → similar functions: proteins with similar structure are assumed to have similar functions, so proteins can be classified into families according to their function.
– Remote-homolog detection

2. Statistics background
• Hidden Markov Models
  Q = set of states = {match, insert, delete}
  V = output alphabet = {20 amino acids}
  π(i) = probability of being in state i at time t = 0
  A = transition probabilities = {a_ij}, where a_ij = Pr[entering state j at time t+1 | in state i at time t]
  B = output probabilities = {b_j(k)}, where b_j(k) = Pr[producing v_k at time t | in state j at time t]
• HMMs as profile models
  A profile of a protein family is a labeling of the positions of the amino acids in the secondary structure, together with a probability distribution for each position.

Typical Profile HMM:
• A chain of match, insert and delete states.
• Specific probabilities are assigned to all transitions between nodes, and character costs to the match and insert states.
• Best path: a single path from 'Start' to 'End' in which each character of the sequence is assigned to a successive match or insert state along the path.
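The best-path idea can be sketched with the Viterbi algorithm, written in the (π, A, B) notation of the statistics background. This is a minimal sketch under stated assumptions, not the SAM implementation: every state is assumed to emit a character, so the silent delete states of a real profile HMM are left out, and the two-state model, the three-letter alphabet and all probabilities are hypothetical values chosen only for illustration.

```python
import math

def viterbi(obs, states, pi, A, B):
    """Return (log-probability, state path) of the most probable path.

    Assumes obs is non-empty and every probability used is > 0.
    """
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: math.log(pi[s]) + math.log(B[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for j in states:
            prev = max(states, key=lambda i: best[t - 1][i] + math.log(A[i][j]))
            best[t][j] = (best[t - 1][prev] + math.log(A[prev][j])
                          + math.log(B[j][obs[t]]))
            back[t][j] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical toy model (not taken from the paper).
states = ["match", "insert"]
pi = {"match": 0.9, "insert": 0.1}
A = {"match": {"match": 0.8, "insert": 0.2},
     "insert": {"match": 0.5, "insert": 0.5}}
B = {"match": {"L": 0.7, "K": 0.2, "G": 0.1},
     "insert": {"L": 0.2, "K": 0.3, "G": 0.5}}

log_p, path = viterbi("LGGL", states, pi, A, B)
print(path, math.exp(log_p))
```

For the toy sequence LGGL the best path is match, insert, insert, match, with probability about 0.0055. Working in log space avoids numerical underflow on longer sequences.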
Example

For this work:
• A local alignment procedure was used: it relates part of the sequence to one contiguous path through part of the HMM.
• An HMM is trained on sequences that are members of a protein family; the resulting HMM identifies the amino-acid positions that describe the structure of the family. This HMM is then used to discriminate family members.

TEST SETS

Fold recognition datasets
• FSSP: based on a protein classification tree (Holm and Sander, 1996, 1997); presents a continuously updated structural classification of 3-dimensional protein folds (sequences of 1050 leaves of the FSSP tree, 166 target sequences).
  – Uses DALI to determine structural homology.
  – Classification:
      z-score > 6: homologs
      z-score < 2: non-homologs
      2 < z-score < 6: partly yes or no
  – Even with the best possible classifier, about 2% of the non-self pairs represent homologies to be detected.
  – At the minimum-error point for an optimal classifier, there are about 1.4% homolog pairs.
• SCOP (Structural Classification of Proteins):
  – 2 test sets (Brenner, 1996; Park et al., 1997): identical lists for both target lists.
  – Database of known folds.
  – Homologous pair: both sequences are in the same SCOP superfamily.
  – No 2 sequences had > 40% sequence similarity.
  – Whole-chain test set: 0.6% correct homologies.
  – Domain test set: the same.

Sequence comparison dataset
• Pearson: curated version of the PIR database (Barker et al., 1990).
  – 12 216 sequences total.
  – Set of 67 target sequences.
  – 0.4% were considered correct.
  – PIR (Protein Information Resource) families: close homologs.
The Pearson test is for close-homolog classification, NOT remote-homolog classification.

ALGORITHMS
• WU-BLAST (Basic Local Alignment Search Tool): protein identification software for genes.
  – Set E (expected # of false positives) = 10.
  – Log of the P-value is reported as the score to threshold.
  – Optimum threshold never corresponded to a P-value > 0.005.
• DOUBLE-BLAST: inspired by ISS (Park et al., 1997). ISS was used to recognize remote, evolutionarily related sequence pairs derived from the SCOP database, and was considered to increase detection compared to FASTA. Two-step approach:
  1. A set of close homologs of the target sequence is found in NRP.
  2. Each homolog is used as a query to search the final database.
• SAM-T98: for a single target sequence, finds and multiply aligns a set of homologs and creates an HMM from that multiple alignment. The resulting HMM is then used for the database search.
  – SAM package.
  – When the database is small, the method is used to create an HMM for each sequence in the database.
  – For the fold-recognition tests, HMMs were created for all sequences.
  – For the Pearson test, HMMs were created only for the 67 target sequences.
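The two-step DOUBLE-BLAST idea can be illustrated with a small sketch. Everything here is hypothetical: `toy_score` is a gap-free identity fraction standing in for a real BLAST score, the cutoff and sequences are invented, and keeping the target itself as a query and taking the best score per database entry are assumptions, since the slides do not say how the scores are combined.

```python
def toy_score(a, b):
    """Stand-in for a BLAST score: fraction of identical positions (no gaps)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def two_step_search(target, intermediate_db, final_db, close_cutoff=0.6):
    """Score every entry of final_db against the target and its close homologs."""
    # Step 1: find close homologs of the target in the large database.
    close = [s for s in intermediate_db if toy_score(target, s) >= close_cutoff]
    # Assumption: the target itself is also kept as a query.
    queries = [target] + close
    # Step 2: search the final database with each query, keeping the best score.
    return {name: max(toy_score(q, seq) for q in queries)
            for name, seq in final_db.items()}

# Hypothetical sequences: "remote" resembles the intermediate more than the target.
target = "ACDEFGHIKL"
intermediate_db = ["ACDEFGHWWW", "PPPPPPPPPP"]
final_db = {"remote": "WWWEFGHWWW", "unrelated": "PPPPPPPPPP"}

direct = toy_score(target, final_db["remote"])
hits = two_step_search(target, intermediate_db, final_db)
print(direct, hits)
```

In this toy case the remote sequence scores 0.4 against the target directly but 0.7 through the intermediate homolog, which is the effect the two-step approach relies on.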