Intelligent Systems for Molecular Biology (ISMB), Michigan 2005

Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King’s Buildings, Edinburgh EH9 3JZ, UK
http://www.bioss.ac.uk/~dirk

Current topics in computational biology
• Structural bioinformatics
• Phylogeny and evolution
• Sequence analysis and gene regulation
• Microarrays and gene expression
• Pathways and systems biology
• Gene ontologies

Position Specific Scoring Matrix (PSSM)
Search for a motif of length $W$ in binding sequences.
$W \times 4$ matrix $\psi_k(l)$: Probability that the nucleotide in the $k$th position, $k \in \{1, \ldots, W\}$, is an $l \in \{A, C, G, T\}$.

Background model for non-binding sequences
4-dim vector $\theta_0(l)$: Probability of nucleotide $l$; this distribution is position-independent.
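The PSSM and background model defined above can be written down as plain arrays. The following is a minimal sketch, not the implementation of any particular motif finder: the example matrix `psi`, the uniform background `theta0`, and the function names are illustrative assumptions. It computes the likelihood ratio P(S | R = 1) / P(S | R = 0), assuming a uniform prior over the N − W + 1 possible motif start positions.

```python
ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def window_ratio(seq, start, psi, theta0):
    """Product over k of psi[k][S_(start+k)] / theta0[S_(start+k)] for one window.

    `start` is a 0-based offset, i.e. the m in "motif starting at position m + 1".
    """
    ratio = 1.0
    for k in range(len(psi)):
        base = INDEX[seq[start + k]]
        ratio *= psi[k][base] / theta0[base]
    return ratio


def likelihood_ratio(seq, psi, theta0):
    """P(S | R=1) / P(S | R=0), averaging over all motif start positions."""
    n, w = len(seq), len(psi)
    if n < w:
        raise ValueError("sequence is shorter than the motif")
    ratios = [window_ratio(seq, m, psi, theta0) for m in range(n - w + 1)]
    return sum(ratios) / (n - w + 1)


# Hypothetical motif of length W = 3 that prefers "TAG"; each row sums to 1.
psi = [
    [0.1, 0.1, 0.1, 0.7],  # position 1: mostly T
    [0.7, 0.1, 0.1, 0.1],  # position 2: mostly A
    [0.1, 0.1, 0.7, 0.1],  # position 3: mostly G
]
theta0 = [0.25, 0.25, 0.25, 0.25]  # uniform background (assumption)

print(likelihood_ratio("TAG", psi, theta0))
print(likelihood_ratio("CCCC", psi, theta0))
```

A ratio above 1 favours the binding model, a ratio below 1 favours the background. For long sequences the products should be accumulated in log space to avoid underflow; this sketch omits that for brevity.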
Sequence $S_1, S_2, \ldots, S_N$

Non-binding sequence: $R = 0$
$$P(S_1, S_2, \ldots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)$$

Binding sequence: $R = 1$, motif starting at position $m + 1$
$$P(S_1, S_2, \ldots, S_N \mid R = 1, \mathrm{start} = m + 1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Binding sequence: $R = 1$, motif starting anywhere
$$P(S_1, S_2, \ldots, S_N \mid R = 1) = \sum_{m=0}^{N-W} P(\mathrm{start} = m + 1)\, P(S_1, S_2, \ldots, S_N \mid R = 1, \mathrm{start} = m + 1) = \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

• JASPAR: Resource for mammalian TF binding sites.
• Learn 64 PSSMs with MEME, and 64 two-component mixture PSSMs.
• For each PSSM, compute the top 1000 scoring hits in 12,253 promoter regions in mouse and human.
• For each mixture, compute the $N_1$ top scoring hits for PSSM1 and the $N_2$ top scoring hits for PSSM2 such that $N_1 + N_2 = 1000$ and $N_1 / N_2$ = mixture ratio.
• Compute the fraction of sites that were conserved between human and mouse.
Conservation statistics: surrogate for functionality.

• De novo motif detection: MEME, Gibbs sampler, etc.
• Still in need of much improvement.
• Use existing biological knowledge by modelling and exploiting regularities that are shared by structurally related TFs (transcription factors).
• FBPs: Familial binding profiles = average binding motifs for a set of related TFs.
• Use FBP as a prior. Problem: Traditional motif finding methods can only incorporate one FBP.
• Novel approach (SOMBRERO): "Prior": Binding profile SOM (BP-SOM) that has previously been trained on a set of "typical" PSSMs (e.g. taken from JASPAR).

          SOM      SOMBRERO
Input     PSSMs    k-mers
Output    FBPs     motif features

Predict TF binding sites
• Third-order Markov model of intergenic DNA to generate random datasets.
• Compute the expected number of matches.
• Rank motifs in terms of significance.

Evaluation
• Third-order Markov model of yeast intergenic sequences with inserted PSSM TF binding sites from yeast and mammals.
• BP-SOM pre-trained on mammalian-specific PSSMs from JASPAR.

Results
• Improved prediction for mammalian TF binding sites.
• No improved prediction for yeast TF binding sites.
• The mammalian-specific "prior" does not adversely affect the performance either.
• Relevant FBPs will be reinforced. Irrelevant FBPs will fade out.

Pathways and systems biology
• Modelling: Predictive mathematical models based on ODEs to explain the spatial locations and dynamics of expression profiles in small gene networks.
• Network reconstruction: Application of machine learning methods to infer static molecular interaction networks from heterogeneous postgenomic data.
• Analysis of network structure: Simplified view of complex interaction networks, e.g., by identifying recurrent patterns across multiple networks to discover biologically meaningful modules.
• Function prediction: A given network, e.g. of protein interactions, is used for functional annotation; that is, the underlying structure of protein interaction maps is used to predict novel protein functions.
Objective
Predict protein-protein interactions from a combination of heterogeneous postgenomic data:
• Protein sequences
• Gene ontologies
• Evolutionary context: Interactions in other species
• Local network topology: Mutual clustering coefficient

Methodology
SVMs and kernel methods

Protein sequences
• Spectrum: frequencies of k-mers.
• Motifs: counts of discrete sequence motifs.
• Pfam domains: compare the sequence with every HMM in the Pfam database, compute the E-value for each protein-HMM comparison.
Pairwise kernel to measure the similarity between two pairs of proteins $(X_1, X_2)$ and $(X_1', X_2')$:
$$K[(X_1, X_2), (X_1', X_2')] = K(X_1, X_1')\,K(X_2, X_2') + K(X_1, X_2')\,K(X_2, X_1')$$

Gene ontologies
Similarity of annotations $\alpha$ and $\alpha'$ scored as
$$\max_{\alpha'' \in \mathrm{ancestors}(\alpha) \cap \mathrm{ancestors}(\alpha')} -\log P(\alpha'')$$

Interactions in other species
$X_1$ and $X_2$: Proteins
$H(X)$: Set of non-yeast proteins that are significant BLAST hits of $X$
$I(i, k)$: Indicator variable for the interaction of proteins $i$ and $k$
$E(X_i, X_k)$: Negative log E-value from BLAST
$$h(X_1, X_2) = \max_{i \in H(X_1),\, k \in H(X_2)} I(i, k)\, \min[E(X_1, X_i), E(X_2, X_k)]$$

Local network topology
Mutual clustering coefficient:
$$\frac{|N(X) \cap N(Y)|}{|N(X) \cup N(Y)|}$$
where $N(X)$ denotes the neighbours of protein $X$ in the interaction network.

Incorporation of measurement reliability
• Soft-margin parameter $C$
• $C_{\mathrm{high}}$ for interactions believed to be reliable and for negative examples
• $C_{\mathrm{low}}$ for interactions not known to be reliable

Findings
• Predictions from sequences alone → modest performance.
• Data integration → improved predictions.
• GO annotations → considerable improvement.
• GO annotations alone → cannot distinguish between physical interactions and members of the same complex.
• Sequence motifs → related to the binding site.
• Future work → include gene expression and ChIP-on-chip data.

Criticism
• Choice of kernels requires some intuition.
• The soft-margin parameters $C_{\mathrm{high}}$, $C_{\mathrm{low}}$ and the kernel weights are chosen arbitrarily, not estimated.

Objective
Infer enzyme networks from the integration of multiple heterogeneous postgenomic data.
• Genome: Sequence data encoded into phylogenetic profiles.
• Transcriptome: Microarray gene expression.
• Proteome: Protein localization.
• Metabolome: Chemical compounds involved, encoded in Enzyme Commission (EC) numbers from KEGG.
Identify genes coding for missing enzymes on known pathways.

Methodology
Deal with the heterogeneity of the data sets by transforming all data into kernel similarity matrices → unified mathematical framework across different types of datasets.
Supervised inference:
• Use partial knowledge of the metabolic network to infer unknown parts.
• Mapping of a gene to a low-dimensional space by exploiting the partial knowledge of the network.
• Diffusion kernels, kernel canonical correlation analysis (KCCA) between the data kernel and the kernel describing the known part of the network.

Findings
[Figure panels: "Unsupervised, without chemical constraints"; "Supervised, with chemical constraints"]
• Predictions from sequences alone → poor performance.
• Data integration → improves predictions.
• Supervised learning → considerable improvement.

Criticism
The choice of kernels seems to require a lot of intuition, and the setting of parameters seems rather arbitrary.
• Expression data: Gaussian RBF kernel, σ = 8.
• Phylogenetic profiles: Gaussian RBF kernel, σ = 3.
• Localization data: Linear kernel.
• Chemical compatibility network: Diffusion kernel, β = 0.01.
• KCCA: number of features = 30, λ = 0.01.
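The kernel settings listed above can be made concrete with a short sketch. It is illustrative only: the toy data, the function names, the RBF parameterization exp(−‖x − y‖² / (2σ²)), and the choice to integrate the data sources by summing their kernel matrices are assumptions, not details taken from the slides. Only the values of σ and β are the ones quoted above. The diffusion kernel is computed as exp(−βL), with L the graph Laplacian, using a truncated Taylor series that is adequate for small β on a small graph.

```python
import math


def rbf_kernel(xs, sigma):
    """Gaussian RBF Gram matrix: K[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def linear_kernel(xs):
    """Linear Gram matrix: K[i][j] = <x_i, x_j>."""
    return [[sum(u * v for u, v in zip(a, b)) for b in xs] for a in xs]


def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adjacency, beta, terms=30):
    """exp(-beta * L) by truncated Taylor series, with L = D - A."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # m = -beta * L
    m = [[-beta * ((degree[i] if i == j else 0.0) - adjacency[i][j])
          for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # current series term, m^order / order!
    for order in range(1, terms):
        term = [[v / order for v in row] for row in matmul(term, m)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


def sum_kernels(*kernels):
    """Element-wise sum; a sum of valid kernels is again a valid kernel."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)] for i in range(n)]


# Toy data for three genes (all values invented for illustration).
expression = [[1.0, 2.0], [1.5, 1.0], [8.0, 9.0]]
phylo_profile = [[1, 0, 1], [1, 0, 0], [0, 1, 1]]
localization = [[1, 0], [1, 0], [0, 1]]
chem_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph 0 - 1 - 2

combined = sum_kernels(
    rbf_kernel(expression, sigma=8.0),
    rbf_kernel(phylo_profile, sigma=3.0),
    linear_kernel(localization),
    diffusion_kernel(chem_graph, beta=0.01),
)
```

Each function returns a symmetric genes × genes matrix, which is what lets heterogeneous data types be combined in one framework. In practice the kernels would usually be normalized or weighted before summation, and how to set those weights is exactly the point raised under Criticism.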
• Proteins and their interaction partners must co-evolve so that divergent changes in one partner’s binding surface are complemented at the interface with the other partner.
• Correlation between evolutionary trees indicates that the proteins from two families interact.
• Predict specific interactions: Establish a mapping between the leaves of the two trees; proteins at equivalent leaves are predicted to interact.

[Figure: Phylogenetic trees of Ntr-sensor histidine kinases and their corresponding regulators]

• Ramani and Marcotte (2003): Mapping that maximizes the correlation between two similarity matrices.
• MCMC to explore the space of all superpositions.
• Disadvantage: Large search space full of local optima.
• New approach: Use topological information of the phylogenetic tree in addition to evolutionary distances.
• Search space: Automorphism group of a phylogenetic tree.

Phylogeny and evolution

Maximum likelihood of evolutionary trees: hardness and approximation
Benny Chor and Tamir Tuller
• Reconstructing the ML tree is computationally intractable (NP-hard).
• Algorithm for approximating the log-likelihood under the condition that the input sequences are sparse.

Detecting coevolving amino acid sites using Bayesian mutational mapping
Matthew W. Dimmic, Melissa J. Hubisz, Carlos D. Bustamante and Rasmus Nielsen
• The evolution of protein sequences is constrained by complex interactions between amino acid residues.
• Harmful substitutions may be compensated for by other substitutions at neighboring sites; residues can coevolve.
• Bayesian phylogenetic approach to the detection of coevolving residues in protein families.
• Based on a coevolutionary Markov model for codon substitution.
Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution
Yun S. Song, Yufeng Wu, and Dan Gusfield
• Computing the minimum number of recombinations needed to explain sequences (with one mutation per site) is NP-hard.
• Practical method for computing both upper and lower bounds on the minimum number of recombinations.

Microarrays and gene expression

Multi-way clustering of microarray data using probabilistic sparse matrix factorization
Delbert Dueck, Quaid D. Morris and Brendan J. Frey
• Multi-way clustering of microarray data using a generative model.
• Probabilistic extension of a previous hard-decision algorithm.
• Allows for varying levels of sensor noise in the data.

Clustering short time series gene expression data
Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph
• Algorithm specifically designed for clustering short (fewer than 10 time points) time series expression data.
• Assigning genes to a predefined set of model profiles.
• Profiles capture the potential distinct patterns that can be expected from the experiment.
• Evaluation using Gene Ontology analysis.
• The proposed algorithm outperforms all competing algorithms.

Experimental design for three-color and four-color gene expression microarrays
Yong Woo, Winfried Krueger, Anupinder Kaur, and Gary Churchill
• Three-color microarray technology is currently available at a reasonable cost.
• Clear guidelines for designing and analyzing three-color experiments do not exist.
• Three-color and four-color cyclic (loop) designs.
• Three-color experiments using the same number of samples (and fewer arrays) will perform as efficiently as two-color experiments.
GenXHC: a probabilistic generative model for cross-hybridization compensation in high-density genome-wide microarray data
Jim C. Huang, Quaid D. Morris, Timothy R. Hughes, and Brendan J. Frey
• Objective: removal of cross-hybridization noise.
• Probabilistic generative model for cross-hybridization.
• Prior knowledge of hybridization similarities between the nucleotide sequences of microarray probes and their target cDNAs.
• Variational learning.