* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download 1+1+1 - Computer Science, Columbia University
Histone acetylation and deacetylation wikipedia , lookup
Signal transduction wikipedia , lookup
Phosphorylation wikipedia , lookup
Hedgehog signaling pathway wikipedia , lookup
P-type ATPase wikipedia , lookup
List of types of proteins wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Protein phosphorylation wikipedia , lookup
Magnesium transporter wikipedia , lookup
Protein (nutrient) wikipedia , lookup
Protein design wikipedia , lookup
Protein folding wikipedia , lookup
Protein moonlighting wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Proteolysis wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Structural alignment wikipedia , lookup
Inferring Protein Structure with Discriminative Learning and Network Diffusion Rui Kuang Department of Computer Science Center for Computational Learning Systems Columbia University Thesis defense, August 16th 2006 Agenda Biological Background Protein Classification with String Kernels Domain Identification and Boundary Detection Protein Ranking with Network Diffusion Conserved Motifs between Remote Homologs Conclusion & Future Research Background Structural classification Domain segmentation Protein ranking Motif discovery Conclusion & Future Research What are proteins? Proteins – encoded by genes A protein (polypeptide chain) is a sequence of amino acid residues Derived from Greek word proteios meaning “of the first rank” by Jöns J. Berzelius in 1838 Amino Acid Polypeptide Chain (Picture courtesy of Branden & Tooze ) Why study proteins? Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers, structural elements… Function depends on unique 3-D structure. Easy to obtain protein sequences, difficult to determine structure. NLAFALSELDRITAQ LKLPRHVEEEAARL YREAVRKGLIRGRS IESVMAACVYAACR LLKVPRTLDEIADIA RVDKKEIGRSYRFI ARNLNLTPKKLF… Sequence fold determine Structure (Picture courtesy of Branden & Tooze ) Function Sequence Space and Structure Space Sequence (>2,500,000) •Homologous proteins share >30% sequence identity, which suggests strong structural similarity. •Remote homologous proteins share similar structure but low sequence similarity. Structure (38,086 known in PDB): discrete groups of folds with unclear boundaries (by 8/14/2006) Remote Homology Detection Remote homology : remote evolutionary relationship conserved structure/function, low sequence similarity <10% sequence identity ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVG TAHIIYNSVDKRLSAVVSYPNADSATVSYDVDLDNVLPEWVRVGLSAS TGLYKETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQG DATTGTDGNLELTRVSSNGPQGSSVGRALFYAPVHIWESSAVVASFEA TFTFLIKSPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN MSLLPVPYTEAASLSTGSTVTIKGRPLVCFLNEPYLQVDFHTEMKEES DIVFHFQVCFGRRVVMNSREYGAWKQQVESKNMPFQDGQEFLSISVLP DKYQVMVNGQSSYTFDHRIKPEAVKMVQVWRDISLTKFNVSYLKR It is often not possible to detect statistically significant sequence alignment between remote homologs DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA D + WEI+ +++ + ++ SGS+G +++G + +VA+K LK E + F EV DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF Protein Domains Proteins often consist of several independent domains fold autonomically often function differently represent fundamental structural, functional and evolutionary units Example: a two-domain protein 3-layer(aba) sandwich at the N-terminal a mainly alpha in an orthogonal bundle at the C-terminal Inferring protein structure/function from sequence similarity For newly sequenced genomes, often homology detection can only identify less than a half of the genes. Remote homology detection and domain segmentation are crucial steps for studying genes with no close homology. Query sequence Boundary identification Domain Family 11 Domain Domain Family 2 Domain Family 3 Domain Family n-2 Domain Family n-1 Domain 2 Domain 3 Sequenc e Alignmen Remote homology detection t We want to correctly segment protein sequences into domains Domain Family n (domain boundary identification) and associate them with their Domain database structural/functional class (remote homology corresponding detection). Protein Structural Classification Background Structural classification Domain segmentation Protein ranking Motif discovery Conclusion & Future Research Protein classification is the prediction of the structural or functional class of a protein from its primary sequence. SCOP: Structural Classification of Proteins Known domain structures are organized in a hierarchy: family, superfamily and fold. SCOP Superfamily : low sequence similarity but functional features suggest probable common evolutionary origin Family : Sequence identity > 30% or functions and structures are very similar Remote Homology Detection in Protein Classification Remote homologs: sequences that belong to the same super-family but not the same family. SCOP Fold Super-family Negative Negative Training Set Test Set Family Positive Training Set Positive Test Set Support Vector Machine (SVM) Classifiers SVM: Large margin-based discriminative learning approach. Find a hyperplane to separate positive data from negative data and also maximize the margin. A Quadratic Programming problem only depends on the inner product between data points. Kernels for Discrete Objects Kernel trick: To train an SVM, can use kernel rather than explicit feature map Can define kernels for sequences, graphs, other discrete objects: { sequences } RN Kernel value is inner product in feature space: K(x, y) = (x), (y) Original string kernels (Watkins, Haussler, Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y). We introduce fast novel string kernels with linear time complexity. Profile Kernel and its Family Tree Three generations Spectrum Kernels Mismatch Kernels Profile Kernels Faster computation, i.e., linear computation time in sequence length. Profile-based string kernels take advantage of abundant unlabeled data to capture homologous/evolutionary information for remote homology detection. Spectrum Kernel Feature map indexed by all possible k-length subsequences (“k-mers”) from alphabet of amino acids, || = 20 Q1:AKQDYYYYE AKQ KQD QDY DYY YYY YYY YYE Feature Space Q2:DYYEIAKQE (AAA-YYY) 1 AKQ 1 DYY 1 DYY 1 YYE 0 EIA 1 YEI 0 IAK 1 EIA 1 KQD 0 IAK 0 KQE 1 1 QDY 0 K-mers captureAKQ some positionsimilarity, KQE 0 YEI 1 independent local 1 YYE 1 but they don’t effectively model 2 YYY 0 evolutionary divergence. K(Q1,Q2)= <(…1…1…0…0…1…0…1…0…1…2),(…1…1…1…1…0…1…0…1…1…0)> =3 Leslie, Eskin and Noble, PSB 2002 Mismatch Kernel For k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches from s Size of mismatch neighborhood is O(||mkm) AKQ CKQ DKQ AKQ … … AAQ AKY ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) AAQ AKY CKQ DKQ Arbitrary mismatch does not model the mutation probability between amino acids. Leslie, Eskin, Weston and Noble, NIPS 2002 Profile Kernel [Kuang & Leslie, JBCB 2005 and CSB 2004] Profile kernel: specialized to protein sequences, probabilistic profiles to capture homology information Semi-supervised approach: profiles are estimated using unlabeled data (about 2.5 million proteins ) E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to query sequence. Profiles are build from multiple sequence alignment to model the positional mutation probability. QUERY LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK....... HOMOLOG1 IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........ HOMOLOG2 LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD...... HOMOLOG3 IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........ A C D … Y L K L 3 -2 1 -1 0 2 -1 0 0 … … … 2 -3 -3 … … … … … … Profile-based k-mer Map P( x) p j (b), b , j 1 x Use profile to define position-dependent mutation neighborhoods: E.g. k=3, =5 and a profile of negative log probabilities A C D … K … Q … Y AKQ A 1 5 4 … 4 … 3 … 2 K 3 4 4 … 1 … 4 … 4 Q 4 1 4 … 4 … 1 … 3 … … … … … … … … … … M k , P x j 1 : j k b b b : log p 1 2 k i j i bi AKQ YKQ (2+1+1<) YKC AKQ AKC (1+1+1<) (1+1+2<) (2+1+1<) ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) AKC AKQ YKC YKQ Efficient Computing with Trie Use trie data structure to organize lexical traversal of all instances of k- mers in the training profiles. Scales linearly with length, O(km_max+1||m_max(|x|+|y|)), where m_max is maximum number of mismatches that occur in any mutation neighborhood. E.g. k=3, =5 Sequence y Sequence x A C D … Q … Y A 1 3 3 … 3 … 2 Q 3 2 2 … 1 … 1 K 2 1 1 … 2 … 3 … A … Q … C D … … … x: 1+1+1< x: 1+1+1 < … y: .5+.6+2 < y: .5+.6+4 > … A C D … Q … Y … A Q Y ….5 2 1 … 2 1 2 … 2 1 4 … … … … … … 2 .6 2 … … … … … … 3 3 3 Update K(x, y) by adding contribution for feature AQC but not AQD … … … … … … Inexact Matching Kernels [Leslie & Kuang, JLMR 2004, KMCB 2004 & COLT 2003] Gappy kernels For g-mer s, g > k, the gapped match set G(g,k)(s) consists of all k-mers t that occur in s with up to (g - k) gaps Wildcard kernels Introduce wildcard character “”, define feature space indexed by k-mers from {}, allowing up to m wildcards Substitution kernels Use substitution matrices to obtain P(a|b), substitution probabilities for residues a, b The mutation neighborhood M(k,)(s) is the set of all kmers t such that - i=1…k log P(si|ti) < Experiments SCOP 1.59 benchmark with 54 experiments Train PSI-BLAST profiles on NR database Comparison against PSI-BLAST and recent SVM- based methods: PSI-BLAST rank: use training sequence as query and rank testing sequences with PSI-BLAST e-value eMotif Kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using trie SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores) Cluster Kernel (Weston et al., 2003): Implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of input sequence Results Results (Cont.) Kernels ROC ROC50 PSI-BLAST 0.743 0.293 eMotif 0.711 0.247 Mismatch(5,1) 0.875 0.416 Gappy(6,4) 0.851 0.387 Substitution(4,6.0) 0.876 0.441 Wildcard(5,2,1.0) 0.881 0.447 SVM-Pairwise 0.866 0.533 Cluster 0.923 0.699 Profile(5,7.5)-5 Iteration 0.984 0.874 Identify Protein Domains and Domain Boundaries Background Structural classification Domain segmentation Protein ranking Motif discovery Conclusion & Future Research SVM-based remote homology detection methods do not rely on sequence alignment. To learn the domain segmentation, we use our SVM classifiers as domain recognizers and find the optimal segmentation giving the maximum sum of the classification scores. SCOP Super-families: Domain recognizers: SVM1 SVM2 SVM3 Query sequence Boundaries SVM4 Algorithms for finding optimal segmentation Assuming we know the number of domains on a protein, we look for the optimal segmentation with the maximum sum of classification scores with dynamic programming. No gaps: T(1, j ) F ( S1, j ), for 1 j N T( c , j ) inf , when j c max 1 k j ( F ( Sk , j )T( c 1,k ) ), otherwise Allowing gaps (can also be solved as a LP problem): T(1, j ) max 1k l j F ( S k ,l ), all 1 j N T( c , j ) inf , when j c max 1 k l j ( F ( Sk ,l )T( c 1,k ) ), otherwise : segment between position i and position j of sequence S F ( s ) : the best classification score of segment s T( c , j ) : the maximum sum of classification scores from c segments on S1,j Si, j Toy Example of Dynamic Programming c=1 1 4 2 2 1 1 c=2 -INF 2 5 8 6 7 c=3 -INF -INF 3 6 9 13 Algorithms for finding optimal segmentation (unconstrained number of domains) The regions across boundaries are less classifiable than other regions within one domain. Use dynamic programming to alternate between domain regions and boundary regions. P0 0, G0 0 Pj max 1i j (lseg ( Si , j ) Gi 1 ) G j max 1i j (l gap ( Si , j ) Pi 1 ), Si, j : segment between position i and position j of sequence S lseg ( s ) : confidence score of assigning s as domain region l gap ( s ) : confidence score of assigning s as boundary region boundary region domain region Experiments Datasets: 25 SCOP folds: 2678 training domains and 471 test chains (189 multi-domain proteins). 40 SCOP super-families: 1917 training domains and 375 test chains (131 multi-domain proteins). Baseline approach: Align test proteins against PSI-BLAST profiles of the training domains, and use the best aligned regions as domain regions. Evaluation: Domain label: significant overlap between the true domain and the predicted domain. Boundaries: both the predicted start and end positions should be close to the true ones. Experiments (Cont.) At least 75% percent positional overlap between the true domain and the prediction. Dataset Fold Dataset (414 domains) Super-family Dataset (288 domains) Boundary distance 25 50 25 50 PSI-BLAST 8.7%(36) 11.4%(47) 5.6%(16) 15.3%(44) DP_nogap 16.4%(68) 26.1%(108) 21.5%(62) 26.7%(77) DP_gap 30.7%(127) 35.8%(148) 25.0%(72) 30.2%(87) DP_alter 18.8%(78) 25.6%(106) 17.7%(51) 26.4%(76) Protein Ranking Background Structural classification Domain segmentation Protein ranking Motif discovery Conclusion & Future Research Protein ranking: search protein database for sequences that share an evolutionary or functional relationship with a given query sequence. Standard protein ranking algorithm: pairwise alignment-based algorithm, PSI-BLAST, can easily detect close homologs. Pairwise alignment-based algorithms are not effective for remote homolog detection. Query Homologous protein Remote homolog Other labeled protein From Local Similarity to Global Structure (RankProp) Weston, Elisseff, Zhou, Leslie, Noble, PNAS 2004 Noble, Kuang, Leslie, Weston, FEBS 2005 Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005 6 2 5 7 3 Query 1 Homologous protein Remote homolog Other labeled protein Unlabeled protein 8 4 1 2 3 6 7 8 4 5 Correct Ranking Ranking based on local similarity 1 2 3 4 5 6 7 8 Cluster assumption: proteins with structural or functional relation tend to be in the same cluster in the network. Diffusion on the protein similarity network to capture cluster structure. RankProp (Cont.) Capture global structure with diffusion. Protein similarity network: Graph nodes: protein sequences in the database Directed edges: weighted by PSI-BLAST e-value Initial ranking score at each node: the similarity to the query sequence Iterative diffusion operation: Yt 1 Y0 KYt Y0 : Initial ranking score Yt : Ranking score at step t K : Normalized connectivity matrix : a parameter for balancing the initial ranking score and propagation MotifProp [Kuang, Weston, Noble, Leslie Bioinformatics, 2005] Motivated by HITS algorithm for page ranking and NLP algorithms Protein-motif network Nodes: proteins and motifs Edges: whether a motif is contained in a protein Motifs: patterns/models built on protein segments conserved during evolution. Often characterize structural/ function properties of a protein. Examples: eMOTIF, PROSITE, Kmers, BLOCKS… Query …FYPGKGHTEDNIVVWLPQYNILVGGCLVKSTSAKDLGNVADAYVNEWSTSIENVLKRYRNINAVVPGHGEVG… Motif Database MotifProp (Cont.) MotifProp can identify motif-rich regions derived from motif ranking to help interpret diffusion algorithm. Low computational cost: protein-motif network is fast to build. Motifs serve as bridges connecting homologous/remote homologous proteins. Query Motif vertices Protein vertices In MotifProp, protein nodes and motif nodes enforce their similarity to the query sequence through propagation. Diffusion in Protein-motif Network MotifProp: ~ Normalize affinity matrix H to H Initialize P and F with the initial activation value Iterate until converge ( (0,1) For all i, Pi P (1 ) 0 ) ~ H Fj ij j For all i, Fi F (1 ) 0 ~ H' ij Pj Query j Motif vertices Protein vertices Experiments 7329 sequences (4246 for training and 3083 for testing) of <95% identity from SCOP 1.59 plus 100,000 proteins from Swiss-Prot. Motif sets: 4-mers, PROSITE and eMOTIF. Experiments (Cont.) Algorithm ROC1 ROC10 ROC50 Sequential MotifProp 0.640 0.663 0.688 k-mer MotifProp 0.621 0.648 0.679 RankProp 0.592 0.667 0.725 PROSITE MotifProp 0.600 0.643 0.664 eMOTIF MotifProp 0.527 0.612 0.666 PSI-BLAST 0.594 0.616 0.641 Conserved Motifs between Remote Homologs Background Structural classification Domain segmentation Protein ranking Motif discovery Conclusion & Future Research We can derive weights of k-mer features from SVM classifiers trained with profile kernel. MotifProp provides activation values on the k-mer features after propagation. Both the SVM weights and Motif activation values can be mapped back to protein sequences to identify conserved structural/functional motifs. Positional contribution to classification score: S x j 1 : j k Px j 1 : j k , , where Δ is the SVM weights or MotifProp activation values on k-mer features. Mapping Discriminative Regions to Structure (Profile Kernel) Ecoli MarA protein (1bl0) In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily Example: Homeodomain-like protein superfamily. Motif Rich Regions (MotifProp) Motif-rich regions on chain B of arsenite oxidase protein from the ISP protein super-family. The PDB annotation and motif-rich regions are given. The 3D protein structure with motif-rich regions in yellow. Conclusions & Contributions Background Structural classification Domain segmentation Protein ranking Motif discovery & angle prediction Conclusion & Future Research Profile-based string kernels exploit compact representation of homology information for better detection of remote homologs. Dynamic Programming-based approach improves multi-label domain classification and domain boundary detection over PSI-BLAST alignment-based approach. MotifProp improves protein ranking over PSI-BLAST by network diffusion on protein-motif network. Interpretation of profile-SVM classifier and MotifProp by motif regions: conserved structural components. Fast kernels for inexact string matching. Classifiers for protein backbone angle prediction (not presented). SVM-FOLD Web Server Future Research Protein function inference by structural genomics and proteomics Identify functional properties of protein structures with kernel methods, e.g. prediction of protein functional sites and structure-based identification of protein-protein interaction sites. Protein function inference from proteomics, e.g. protein function prediction based on protein-protein interaction patterns and protein structures. Protein structure prediction Unified prediction of protein backbone and side chain positions (Phi-Psi angles and rotamers) with energybased cost function. Acknowledgements: Committee Tony Jebara Dept. of Computer Science, Columbia University Christina Leslie (advisor) Center for Computational Learning Systems & C2B2, Columbia University Kathleen Mckeown (chair) Dept. of Computer Science, Columbia University William Stafford Noble Dept. of Genome Science & Dept. of Computer Science, University of Washington Rocco Servedio Dept. of Computer Science, Columbia University Jason Weston Machine Learning Group, NEC Labs (USA) Acknowledgements: Collaborators An-Suei Yang (Genome Research Center, Academia Sinica of Taiwan) Dengyong Zhou (Machine Learning Group, Microsoft) Yoav Freund (Computer Science Department, UCSD) Eugene Ie (Computer Science Department, UCSD) Ke Wang (Computer Science Department, Columbia University) Wei Chu (Center For Computational Learning Systems, Columbia University) Kai Wang (Biomedical Informatics Department, Columbia University) Iain Melvin (Machine Learning Group, NEC) Girish Yao (Computer Science Department, Columbia University) Lan Xu (Molecular Biology Department, The Scripps Research Institute) Publications Structural classification: Profile kernels (JBCB 2005 and CSB 2004) Inexact marching kernels (JMLR 2004, COLT 2003 & KMCB 2004) Protein ranking: RankProp (FEBS 2005 and BMC Bioinformatics 2005) MotifProp (Bioinformatics 2005) Protein local structure prediction Kernel methods based on sliding-window (Bioinformatics 2004) Structured output learning (ongoing research) Protein domain segmentation (In preparation) Protein backbone angle prediction Conformational Phi-Psi Angles States …… 3-D structure (Φ1,Ψ1) A (Φ2,Ψ2) A (Φ3,Ψ3) A (Φ4,Ψ4) Discretization of Phi-Psi angles G (Φ5,Ψ5) B (Φ6,Ψ6) B (Φ7,Ψ7) B (Φ8,Ψ8) B …… B … Sliding-window SVM approach [Kuang, Leslie & Yang 2004] Encode each position independently with sequence information Conformational within a length-k window. States A A A:-3 –4 –4 –4 –3 –4….. A A:0 –1 –1 3 –4 3 4 1….. B B:0 –1 2 1 –3 4 0 –1…… B B:-2 –3 –4 –5 –2 4…… B B:0 –3 –1 –2 –4 –1…… B G G E …… To SVM B B B B B Smoothing: use predictions to train a second sets of SVMs Experiments Datasets: 697 sequences of 97,365 amino acids with sequence identity < 25 % from PDB (PDB_SELECT25). Comparison against: LSBSP1: query against local structure-based sequence profile database. HMMSTR: Hidden Markov Model based on local structural motifs. RankProp in Genome Browser Regularization Framework Closed form solution of MotifProp Y * (1 )( I W ) 1Y 0 Related to the regularization framework in Zhou et. al. NIPS 2003 1 1 Q(Y * ) Wij || Yi* Y j* || 2 i , j 1 n 2 , Where • • n * 0 || Y Y i i || i 1 0 * P P * 0 Initial Ranking : Y Final Ranking : Y * F 0 F Normalized Affinity Matrix : 0 H~ W ~ H' 0 Discussion: Alignment VS K-mers Rangwala and Karypis et. al. 2005 achieved further improvement on a previous benchmark dataset on SCOP 1.53 with kernels defined on profile-profile alignment. Proteins are documents with no punctuation and there is no dictionary!!! Optimal local alignment detects the Length k subsequences summarize local matches most conserved paragraph. Sensitive for detecting homologous Can detect discontinuous and disordered conserved regions between remote homologs proteins. Good for remote homologs with one Can achieve fast computation relatively long conserved region. Well defined k-mer feature space for applying learning algorithms Alignment is easily interpretable. Extracting Discriminative Motif Regions SVM training determines support vector sequence profiles and their weights: (P(xi), i) SVM decision hyperplane normal vector: w = i yi i (P(xi)) Positional contribution to classification score: S x j 1 : j k Px j 1 : j k , w Averaged positional score for positive sequences: S avg x j S x j k q : j 1 q q 1k Map Motif Rich Regions Map final motif activation values to query sequence to find conserved structural components between remote homologs. Determination of Protein Structures X-ray crystallography The interaction of x-rays with electrons arranged in a crystal can produce electron-density map, which can be interpreted to an atomic model. Crystal is very hard to grow. Nuclear magnetic resonance (NMR) Some atomic nuclei have a magnetic spin. Probed the molecule by radio frequency and get the distances between atoms. Only applicable to small molecules. Protein structure prediction Comparative modeling: Where there is a clear sequence relationship between the target structure and one or more known structures. Fold recognition ('threading'): No sequence homology with known structures. Find consistent folds (remote homology detection). Ab initio structure prediction(‘de novo’): Deriving structures, approximate or otherwise, from sequence. RankProp Weston, Elisseff, Zhou, Leslie, Noble, PNAS 2004 Noble, Kuang, Leslie, Weston, FEBS 2005 Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005 Protein similarity network: Graph nodes: protein sequences in the database Directed edges: exp(-Sij/σ), where Sij is the PSIBLAST e-value between ith protein and jth protein. Initial ranking score at each node: the similarity to the query sequence Iterative diffusion operation: Yt+1=Y0+αKYt Y0: Initial ranking score Yt: Ranking score at step t K: Normalized connectivity matrix α: (0,1) a parameter for balancing the initial ranking score and propagation Diffusion in protein similarity network Remote homology can be detected by diffusion from common neighbors in the cluster. Protein network is expensive to build Hard to interpret the ranking Extracting Discriminative Motif Regions Sort positional scores: about 40%50% of positions in positive training sequences contribute 90% of classification score Peaky positional plots discriminative motifs Inferring Protein Structure with Machine Learning Machine learning builds statistical models from data to learn underlining principles automatically. Machine Learning techniques are promising for inferring protein structure: large amount of data but little theory. Solved structures in protein databank (PDB) provide valuable knowledge about protein folding patterns. Number of total Structures New Structures added in a year. Sequence Alignment for Matching Proteins Smith-Waterman algorithm finds the optimal local alignment with maximum substitution scores between two sequences by dynamic programming Alignment-based Algorithms Smith-Waterman algorithm: find optimal local alignments by dynamic programming BLAST & PSI-BLAST [Altschul 1997]: fast approximations of Smith-waterman algorithm. Only extend alignment from short identical stretches potentially contained in true matches. Profiles are built to search database iteratively. SAM-98 [Karplus 1999]: HMM based approach. Build HMM for the query and target sequences. Rank sequences by likelihood computed from HMMs. HITS Algorithm for Page Ranking Good hubs: web pages with many pointers to related pages. Good authorities: web pages pointed to by hubs. Recursive updating enforces good hubs and good authorities. Let N be the set of edges Initialize Hub[A] and Aut[A] to 1 Iterate until converge For all A, Aut[A]=∑(B,A) NHub[B] For all A, Hub[A]=∑(A,B) NAut[B] Normalize Aut and Hub Hubs Authorities Kleinber, 1998