Download ppt - University of Illinois Urbana

Profile HMMs (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 10, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign 1 Protein Families • • Protein Family - a group of proteins that share a common function and/or structure, that are potentially derived from a common ancestor (set of homologous proteins) Characterizing a Family - Compare the sequence and structure patterns of the family members to reveal shared characteristics that potentially describe common biological properties – Multiple sequence alignment – Motif/Domain - sequence and/or structure patterns common to protein family members (a trait) – Profile is a way to represent a family of proteins 2 Profile of a Protein Family Aligned DNA sequences can be represented by a 4 ·n profile matrix reflecting the frequencies of nucleotides in every aligned position. Protein family can be represented by a 20·n profile representing frequencies of amino acids. 3 Profiles and HMMs • A 20*n profile PWM corresponds to a simple HMM with n sequentially linked match states M1,…,Mn • More complex profiles involve modifications to this simple HMM by adding additional states to model insertions and deletions 4 The Simplest Profile HMM • Linear HMM with “match” states Begin M1 .... Mj .... Mk End Initial state distribution: P(M1)=1.0 Transition probabilities: p(Mi->Mj)=1.0 Output probabilities: p(o|Mi) is from the PWM • Problem: No gap is allowed, but gap is quite common in protein motifs 5 Adding Insert States • Allow one gap between Mi and Mi+1 • Allow one or more gaps between Mi and Mi+1 • Allow zero or more gaps between Mi and Mi+1 Ii Begin Mi Additional parameters: - Output probability for Ii: p(o|Ii) - Transition probabilities involving Ii End • Problem: How to model deletions? 6 Adding Delete States • • • How can we allow deletion of Mi? How can we allow deletion of Mi and Mi+1? How can we allow arbitrary deletions? Dj Begin Mj End Additional parameters: - State transitions involving Dj’s - What about output probabilities? 7 Putting All Together Dj Ij Begin • • Mj End Parameters can be estimated with slight modification in the estimation formulas How many parameters are there in a profile HMM with k match states? 8 Build an HMM from Multiple Alignments Step 1: Decide the consensus region Step 2: Decide which columns are “match” states and which are “insert” states. Step 3: Gaps in the match states are marked as “delete” Step 3: Build an HMM using supervised training HBA_HUMAN HBB_HUMAN MYG_PHYCA GLB3_CHITP GLB5_PETMA LGB2_LUPLU GLB1_GLYDI ...VGA--HAGEY... ...V----NVDEV... ...VEA--DVAGH... ...VKG------D... ...VYS--TYETS... ...FNA--NIPKH... ...IAGADNGAGV... *** ***** Match Insert Delete P(v|M1)=? P(M1->M2)=? 9 Estimation of Probabilities • Maximum Likelihood estimator – P(X|Y)=count(X;Y)/count(Y) • – E.g., p(v|M1)=“count of v in M1 column”/ “all amino acids in M1 column” Bayesian estimator (with conjugate prior) – P(X|Y)=(count(X;Y)+pseudocount(X;Y) )/(count(Y)+totalpseudocount) – E.g., p(v|M1)=(“count of v in M1 column”+1)/ (“all amino acids in M1 column”+20) (Laplace smoothing) P(v|M1)=6/27 P(M1->M2)=7/10 10 Optimal Model Construction • How do we decide which columns to be labeled as “match” or “insert”? • Many approaches – Heuristic approach: Columns with low entropy are more likely “match” (threshold-based) – Bayesian model selection: Choose a model that fits the data the best and is consistent with our prior 11 Profile HMM for Local Alignment • • Output prob. in flanking states use background values p(o|Background). Looping prob. close to 1, e.g. (1- ) for some small . Dj Ij Mj Begin End Q Q 12 Repeat Alignments • • Join the two background states Allow multiple matches Dj Ij Mj Begin Q End 13 Usages of Profile HMMs • Detecting potential membership in a family – Matching a sequence to the profile HMMs – Score a sequence S by p(S|HMM)/p(S|Random) • Return top k best matching profile HMMs for a given sequence • Given an HMM, find additional sequences in the family • Aligning a sequence to an existing family – Decoding the sequence using Viterbi – Using the state transition path to align the sequence with the existing sequences in the family • Many other uses…. 14 What You Should Know • Architecture of a typical profile HMM • How HMMs can be constructed appropriately to model matching, insertion, and deletion of amino acids • How to estimate a profile HMM based on multiple sequence alignment • How a regular profile HMM can be modified to allow multiple matches and local alignment 15

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt - University of Illinois Urbana