* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ppt - University of Illinois Urbana
Multi-state modeling of biomolecules wikipedia , lookup
Magnesium transporter wikipedia , lookup
Molecular evolution wikipedia , lookup
List of types of proteins wikipedia , lookup
Bottromycin wikipedia , lookup
Protein moonlighting wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Community fingerprinting wikipedia , lookup
Western blot wikipedia , lookup
Expanded genetic code wikipedia , lookup
Protein (nutrient) wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Point mutation wikipedia , lookup
Protein adsorption wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Biochemistry wikipedia , lookup
Genetic code wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Proteolysis wikipedia , lookup
Structural alignment wikipedia , lookup
Profile HMMs (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 10, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign 1 Protein Families • • Protein Family - a group of proteins that share a common function and/or structure, that are potentially derived from a common ancestor (set of homologous proteins) Characterizing a Family - Compare the sequence and structure patterns of the family members to reveal shared characteristics that potentially describe common biological properties – Multiple sequence alignment – Motif/Domain - sequence and/or structure patterns common to protein family members (a trait) – Profile is a way to represent a family of proteins 2 Profile of a Protein Family Aligned DNA sequences can be represented by a 4 ·n profile matrix reflecting the frequencies of nucleotides in every aligned position. Protein family can be represented by a 20·n profile representing frequencies of amino acids. 3 Profiles and HMMs • A 20*n profile PWM corresponds to a simple HMM with n sequentially linked match states M1,…,Mn • More complex profiles involve modifications to this simple HMM by adding additional states to model insertions and deletions 4 The Simplest Profile HMM • Linear HMM with “match” states Begin M1 .... Mj .... Mk End Initial state distribution: P(M1)=1.0 Transition probabilities: p(Mi->Mj)=1.0 Output probabilities: p(o|Mi) is from the PWM • Problem: No gap is allowed, but gap is quite common in protein motifs 5 Adding Insert States • Allow one gap between Mi and Mi+1 • Allow one or more gaps between Mi and Mi+1 • Allow zero or more gaps between Mi and Mi+1 Ii Begin Mi Additional parameters: - Output probability for Ii: p(o|Ii) - Transition probabilities involving Ii End • Problem: How to model deletions? 6 Adding Delete States • • • How can we allow deletion of Mi? How can we allow deletion of Mi and Mi+1? How can we allow arbitrary deletions? Dj Begin Mj End Additional parameters: - State transitions involving Dj’s - What about output probabilities? 7 Putting All Together Dj Ij Begin • • Mj End Parameters can be estimated with slight modification in the estimation formulas How many parameters are there in a profile HMM with k match states? 8 Build an HMM from Multiple Alignments Step 1: Decide the consensus region Step 2: Decide which columns are “match” states and which are “insert” states. Step 3: Gaps in the match states are marked as “delete” Step 3: Build an HMM using supervised training HBA_HUMAN HBB_HUMAN MYG_PHYCA GLB3_CHITP GLB5_PETMA LGB2_LUPLU GLB1_GLYDI ...VGA--HAGEY... ...V----NVDEV... ...VEA--DVAGH... ...VKG------D... ...VYS--TYETS... ...FNA--NIPKH... ...IAGADNGAGV... *** ***** Match Insert Delete P(v|M1)=? P(M1->M2)=? 9 Estimation of Probabilities • Maximum Likelihood estimator – P(X|Y)=count(X;Y)/count(Y) • – E.g., p(v|M1)=“count of v in M1 column”/ “all amino acids in M1 column” Bayesian estimator (with conjugate prior) – P(X|Y)=(count(X;Y)+pseudocount(X;Y) )/(count(Y)+totalpseudocount) – E.g., p(v|M1)=(“count of v in M1 column”+1)/ (“all amino acids in M1 column”+20) (Laplace smoothing) P(v|M1)=6/27 P(M1->M2)=7/10 10 Optimal Model Construction • How do we decide which columns to be labeled as “match” or “insert”? • Many approaches – Heuristic approach: Columns with low entropy are more likely “match” (threshold-based) – Bayesian model selection: Choose a model that fits the data the best and is consistent with our prior 11 Profile HMM for Local Alignment • • Output prob. in flanking states use background values p(o|Background). Looping prob. close to 1, e.g. (1- ) for some small . Dj Ij Mj Begin End Q Q 12 Repeat Alignments • • Join the two background states Allow multiple matches Dj Ij Mj Begin Q End 13 Usages of Profile HMMs • Detecting potential membership in a family – Matching a sequence to the profile HMMs – Score a sequence S by p(S|HMM)/p(S|Random) • Return top k best matching profile HMMs for a given sequence • Given an HMM, find additional sequences in the family • Aligning a sequence to an existing family – Decoding the sequence using Viterbi – Using the state transition path to align the sequence with the existing sequences in the family • Many other uses…. 14 What You Should Know • Architecture of a typical profile HMM • How HMMs can be constructed appropriately to model matching, insertion, and deletion of amino acids • How to estimate a profile HMM based on multiple sequence alignment • How a regular profile HMM can be modified to allow multiple matches and local alignment 15