PROTEIN SECONDARY & SUPER-SECONDARY STRUCTURE PREDICTION WITH HMM
By En-Shiun Annie Lee
CS 882 Protein Folding
Instructed by Professor Ming Li

0 OUTLINE
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

1 INTRODUCTION
1. Introduction *
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

1 Genomics
• Achievements in genomics
  – BLAST (Basic Local Alignment Search Tool)
    • the most cited paper published in the 1990s
    • cited more than 15,000 times
  – Human Genome Project
    • completed April 2003

1 Proteomics
• Precedence to proteomics
  – Protein Data Bank (PDB)
    • 40,132 structures
    • cited more than 6,000 times

1 Proteomics
[Figure: number of protein structures in the Protein Data Bank over time]

1 Secondary Structure
• Importance
  – The known secondary structure may be used as an input for tertiary structure prediction.

1 Protein Structure
• Primary structure
[Figure]

1 Protein Structure
• Secondary structure
[Figure]

1 Secondary Structure
• α-helix
  – Interaction between the i-th and (i+4)-th residues

1 Secondary Structure
• β-sheet/strand
  – Parallel or anti-parallel

1 Secondary Structure
• Coil (loop)

1 Protein Structure
• Tertiary structure
[Figure]

1 Protein Structure
• Super-secondary (2.5) structure
[Figure]

1 Protein Structure
• Quaternary structure
[Figure]

2 PROBLEM
1. Introduction
2. Problem *
3. Methods (4)
4. HMM Examples (3)
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

2 Secondary Structure
• Problem
  – Given:
    • a primary sequence of amino acids a1 a2 … an
  – Find:
    • the secondary structure of each ai, labelled as
      – α-helix = H
      – β-strand = E
      – coil = C

2 Secondary Structure
• Example
  – Given:
    • primary sequence
      GHWIATRGQLIREAYEDYRHFSSECPFIP
  – Find:
    • secondary structure elements
      CEEEEECHHHHHHHHHHHCCCHHCCCCCC
  – Note: the labels come in segments

2 Prediction Quality
• Three-state prediction accuracy
  – Q3 = (# of correctly predicted residues) / (total # of residues) × 100%
  – per-state accuracies Qα, Qβ, Qc
  – Q3 for a random prediction is 33%
  – theoretical limit: Q3 ≈ 90%

2 Prediction Quality
• Segment Overlap (SOV)
  – higher penalties in core segment regions
• Matthews Correlation Coefficient (MCC)
  – prediction errors made for each state

2 True Structures
• Three-dimensional PDB data
  – DSSP (Dictionary of Secondary Structure of Proteins): 8 states, reduced to 3

      DSSP state                          3-state label
      H = α-helix                         H
      G = 3₁₀-helix                       H
      I = π-helix (5-helix)               H
      E = extended strand (β-ladder)      E
      B = residue in isolated β-bridge    E
      T = hydrogen-bonded turn            C
      S = bend                            C
      C = coil                            C

  – STRIDE (an alternative assignment program)

3 METHODS
1. Introduction
2. Problem
3. Methods (4) *
4. HMM Examples (3)
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

3 Sliding Window
• Sliding window
[Figure: a fixed-size window slides along the input sequence (four animation frames)]

3 Four Methods
a. Statistical method
b. Neural network
c. Support vector machine
d. Hidden Markov model

3a Statistical Method
• Propensity
• Ex. Chou–Fasman: Q3 ≈ 50~53%

3b Neural Network
• Ex. PHD: Q3 ≈ 71%

3c SVM
• Ex. PSIPRED: Q3 ≈ 76~78%
  (note: PSIPRED is itself a neural-network method built on PSI-BLAST profiles)

3d HMM Definition
• State set Q
• Output alphabet Σ

3d HMM Definition
• Transition probabilities
  – Tq(p) = probability of entering state p from state q
  – q ∈ Q, p ∈ Q

3d HMM Definition
• Emission probabilities
  – Eq(ai) = probability of emitting letter ai of Σ from state q
  – ai ∈ Σ, q ∈ Q

3d HMM Decoding
• Problem
  – Given:
    • HMM = (Q, Σ, E, T) and
    • sequence S = S1, S2, …, Sn
  – Find:
    • the most probable path of states passed through to generate S
      – X = X1, X2, …, Xn = state sequence

3d HMM Decoding
• Optimize
  – the joint probability Pr[S, X]
    • X = X1, X2, …, Xn = state sequence
    • S = S1, S2, …, Sn
  – Pr[S, X] = Pr[S | X] · Pr[X]

3d HMM Decoding
• Dynamic programming
  – memoryless (Markov property)
  – let Vn(q) = probability of the best state path for S1 … Sn ending in state q
  – Vn(q) = Eq(Sn) · max over p ∈ Q of [ Vn−1(p) · Tp(q) ]
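This recurrence is the Viterbi algorithm. Below is a minimal Python sketch of it; the three states H/E/C, the two-letter emission alphabet, and every probability are toy values invented for illustration, not numbers from the talk.

import math

def viterbi(seq, states, log_T, log_E, log_init):
    """Most probable state path for seq under an HMM (Viterbi DP).
    log_T[p][q] = log T_p(q), the transition p -> q
    log_E[q][a] = log E_q(a), emission of symbol a from state q
    log_init[q] = log probability of starting in state q
    """
    # First DP column: start in state q and emit the first symbol.
    V = [{q: log_init[q] + log_E[q][seq[0]] for q in states}]
    back = [{}]
    for a in seq[1:]:
        prev, col, ptr = V[-1], {}, {}
        for q in states:
            # Best predecessor p maximizing V_{n-1}(p) + log T_p(q).
            best = max(states, key=lambda p: prev[p] + log_T[p][q])
            col[q] = prev[best] + log_T[best][q] + log_E[q][a]
            ptr[q] = best
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda q: V[-1][q])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Toy 3-state model over a two-letter alphabet; all numbers invented.
lg = math.log
states = "HEC"
log_T = {p: {q: lg(0.8 if p == q else 0.1) for q in states} for p in states}
log_E = {"H": {"A": lg(0.7), "G": lg(0.3)},
         "E": {"A": lg(0.2), "G": lg(0.8)},
         "C": {"A": lg(0.5), "G": lg(0.5)}}
log_init = {q: lg(1 / 3) for q in states}
print(viterbi("AAGGGA", states, log_T, log_E, log_init))  # one state per residue

Log probabilities are used so long sequences do not underflow; a real predictor would estimate these tables from labelled PDB sequences rather than use invented numbers.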
4 HMM EXAMPLES
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3) *
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

4a SEMI-HMM
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Semi-HMM *
   b. Profile HMM
   c. Conditional Random Field
5. Proposal

4a Semi-HMM
• Definition
  – Each state can emit a sequence, not a single symbol
  – Move the emission probabilities into the states
  – Models secondary structure segments

4a Segmentation
• Sequence segments
[Figure: a sequence partitioned into structural segments (three animation frames)]
  • T = secondary structural type of each segment, from {H, E, L}
  • S = ends of the individual structural segments
  • R = known amino acid sequence

4a Segmentation
• Sequence segments, example
  • T2 = E = β-strand
  • S2 = 9
  • R2 = R[S1+1 : S2], the residues from position S1+1 through S2

4a Bayesian
• Bayesian formulation
  • R = sequence of ALL amino acid residues
  • S = ends of the segments
  • T = secondary structural types of the segments, from {H, E, L}

4a Bayesian
• Bayesian formulation
  – P(S, T | R) = P(R | S, T) · P(S, T) / P(R)
    1. likelihood P(R | S, T)
    2. prior probability P(S, T)
    3. constant P(R), which does not depend on (S, T), is dropped
  – so P(S, T | R) ∝ P(R | S, T) · P(S, T)

4a Bayesian
• Likelihood
  – factorizes over segments:
    P(R | S, T) = ∏ j = 1 … m  P( R[S(j−1)+1 : Sj] | Sj, Tj )
  • m = total number of segments
  • Sj = end of the j-th segment
  • Tj = secondary structural type of the j-th segment

4a Bayesian
• Likelihood
[Figure: segment likelihood model with distinct position-specific distributions for N-terminus, internal, and C-terminus positions (three animation frames)]

4a BSPPS
• Bayesian Segmentation PPS
[Figure: BSPPS overview (two slides)]

4a Results
• Better than PSIPRED
  – (without homology information)

4b PROFILE-HMM
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Semi-HMM
   b. Profile HMM *
   c. Conditional Random Field
5. Proposal

4b Profile HMM
• Main states
  – Columns of the alignment

4b Profile HMM
• Insertion states

4b Profile HMM
• Deletion states
  – Jump over one or more columns in the alignment

4b Profile HMM
• Combined
[Figure: full profile HMM combining match, insert, and delete states]
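To make that combined layout concrete, here is a small Python sketch of a profile HMM's transition structure. The function name and the exact edge set are my assumptions: this is the textbook topology in which insert and delete states interconnect, and real tools (e.g. HMMER's Plan 7) forbid some of these transitions.

def profile_hmm_topology(n_columns):
    """Transition structure of a classic profile HMM:
    M_k = match state for alignment column k,
    I_k = insert state after column k (I0 sits before column 1),
    D_k = delete state that skips column k.
    """
    edges = {}
    def add(src, *dsts):
        edges.setdefault(src, []).extend(dsts)

    add("begin", "M1", "I0", "D1")
    add("I0", "I0", "M1", "D1")  # self-loop emits extra residues
    for k in range(1, n_columns):
        add(f"M{k}", f"M{k + 1}", f"I{k}", f"D{k + 1}")
        add(f"I{k}", f"I{k}", f"M{k + 1}", f"D{k + 1}")
        add(f"D{k}", f"D{k + 1}", f"M{k + 1}", f"I{k}")
    n = n_columns
    add(f"M{n}", f"I{n}", "end")
    add(f"I{n}", f"I{n}", "end")
    add(f"D{n}", "end")
    return edges

# Print the successors of every state in a 3-column toy profile.
for state, successors in profile_hmm_topology(3).items():
    print(f"{state:>5} -> {', '.join(successors)}")

Match and insert states would each carry an emission distribution over the 20 amino acids; delete states are silent, which is how the model "jumps over" alignment columns.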
4b HMMSTR
• HMM for local protein STRucture
• Pronounced "hamster"

4b I-Sites Library
• I-sites library
  – Motif = a short basic structural fragment
    • 3~19 residues
    • 262 motifs
    • highly predictable
  – Built from non-redundant PDB data (<25% similarity)
  – Motifs fold uniquely across protein families
  – Found by exhaustive motif clustering

4b Build HMM
• States capture
  – the amino acid sequence and
  – structural attributes
• Transitions between states
  – connect adjacent positions in a motif
  – no gap or insertion states

4b Build HMM
• Emission probability distributions
  – b = observed amino acid (20 probability values)
  – d = secondary structure (helix, strand, loop)
  – r = backbone angle region (11 dihedral-angle symbols)
  – c = structural context descriptor (10 context symbols)

4b Build HMM
• Model the I-sites library
  – Each of the 262 motifs is a chain in the HMM
  – Merge states based on similarity of
    • sequence
    • structure

4b HMMSTR Merge
• Ex. β-hairpin
[Figure: merging the serine β-hairpin and type-I β-hairpin motif chains into shared states (four animation frames)]

4b HMMSTR Training
• Input: PDB proteins
• Find
  – the best state sequence for each protein sequence
  – the probability distribution of each amino acid
• Integrate 3 data sets
  – aligned probability distributions
  – amino acid and context information
  – contact map

4b HMMSTR Summary
• 282 nodes
• 317 transitions
• 31 merged motifs

4b HMMSTR Summary
• Introduces structural context at the level of super-secondary structure
• Predicts higher-order 3D tertiary structure
  – Side result: predicts 1D secondary structure

4c CONDITIONAL RANDOM FIELD
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Semi-HMM
   b. Profile HMM
   c. Conditional Random Field *
5. Proposal

4c HMM Disadvantages
• Does not model
  – multiple interacting features
  – long-range dependencies
• Strict independence assumptions

4c Conditional Model
• Allows
  – arbitrary features
  – non-independent features
• Transition probabilities
  – may depend on past and future observations

4c Conditional Model
[Figure: graphical structure of an HMM (directed) vs. a CRF (undirected), with labels y1 … y6 over observations x1 … x6]

4c Random Field
• Random field (undirected graphical model)
  – Let G = (Y, E) be a graph
    • where each vertex Yv is a random variable
  – If P(Yv | all other Y) = P(Yv | neighbours of Yv),
    then Y is a random field

4c Random Field
• Example:
  – P(Y5 | all other Y) = P(Y5 | Y4, Y6)

4c Conditional RF
• Conditional random field
  – Let X = random variable over the data sequences to be labeled
    • observations
  – Let Y = random variable over the corresponding label sequences
    • labels
  – Let G = (V, E) be a graph
    • such that Y = (Yv), v ∈ V, so Y is indexed by the vertices of G
  – If P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v),
    where w ~ v means w and v are neighbours in G,
    then (X, Y) is a conditional random field

4c Conditional RF
• Example:
  – P(Y3 | X, all other Y) = P(Y3 | X, Y2, Y4)

4c HMM vs. CRF
• HMM:
  – maximizes P(x, y | θ) = P(y | x, θ) P(x | θ)
  – transition and emission probabilities
  – each transition/emission depends on only one x
• CRF:
  – maximizes P(y | x, θ)
  – feature functions f(i, j, k) over position i and label pair (j, k)
  – feature functions may depend on all of x

4c Beta-Wrap
• β-helix
  – 3 parallel β-strands
  – connected by coils
• Few solved structures
  – 9 SCOP superfamilies
  – 14 right-handed (RH) structures solved in the PDB
  – the solved structures differ widely

4c Graph Definition
• Let G = (V, E1, E2) be a graph
  – V = nodes/states = secondary structures
  – edges = interactions
• E1
  – edges between adjacent neighbours
  – implied in the model
• E2
  – edges for long-range interactions
  – considered explicitly

4c Beta-Wrap Example
• Simple example:
  – S2 = first β-strand
  – S3 = coil
  – S4 = second β-strand
  – S5 = coil
  – S6 = α-helix

4c Beta-Wrap
• β-helix solution:
[Figure]

5 PROPOSAL
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
   a. Segmentation HMM
   b. Profile HMM
   c. Conditional Random Field
5. Proposal *

5 Difficulties
• Current methods do not infer global interactions
  – e.g. β-sheet interactions
• Protein structure definition constraints

5 Possible Future Work
• Novel methods of secondary structure prediction
  – model prediction as integer programming
• Super-secondary structure prediction

5 Acknowledgement
• Professor Ming Li
  – guidance in knowledge and expertise
• The Bioinformatics lab
  – mentoring a "rookie"
• The class
  – attention and listening