Intelligent Systems for
Molecular Biology
ISMB, Michigan 2005
Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King’s Buildings, Edinburgh EH9 3JZ, UK
http://www.bioss.ac.uk/~dirk
Current topics in computational biology
• Structural bioinformatics
• Phylogeny and evolution
• Sequence analysis and gene regulation
• Microarrays and gene expression
• Pathways and systems biology
• Gene ontologies
Position Specific Scoring Matrix (PSSM)
Search for a motif of length W in binding sequences.
W × 4 matrix $\psi_k(l)$:
Probability that the nucleotide in the $k$th position, $k \in \{1, \dots, W\}$, is $l \in \{A, C, G, T\}$.
Background model for non-binding sequences
4-dim vector $\theta_0(l)$:
Probability of nucleotide $l$; this distribution is position-independent.
Sequence $S_1, S_2, \dots, S_N$
Non-binding sequence: R=0
$$P(S_1, S_2, \dots, S_N \mid R=0) = \prod_{t=1}^{N} \theta_0(S_t)$$
Binding sequence: R=1, motif starting at position m+1
$$P(S_1, S_2, \dots, S_N \mid R=1, \mathrm{start}=m+1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
Binding sequence: R=1, motif starting anywhere
$$P(S_1, S_2, \dots, S_N \mid R=1) = \sum_{m=0}^{N-W} P(\mathrm{start}=m+1)\, P(S_1, S_2, \dots, S_N \mid R=1, \mathrm{start}=m+1) = \frac{1}{N-W+1} \prod_{t=1}^{N} \theta_0(S_t) \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
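The three likelihoods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PSSM is stored as a list of W dictionaries (one per motif position), the background as a single dictionary, and the function names are hypothetical. Indices are zero-based, so `psi[k]` plays the role of ψ_{k+1}.

```python
import math

def background_likelihood(seq, theta0):
    """P(S | R=0): product of position-independent background probabilities."""
    return math.prod(theta0[s] for s in seq)

def motif_likelihood_at(seq, m, psi, theta0):
    """P(S | R=1, start=m+1): background likelihood times the motif/background ratio."""
    ratio = math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]]
                      for k in range(len(psi)))
    return background_likelihood(seq, theta0) * ratio

def motif_likelihood(seq, psi, theta0):
    """P(S | R=1): average over the N-W+1 equally likely start positions."""
    n_starts = len(seq) - len(psi) + 1
    return sum(motif_likelihood_at(seq, m, psi, theta0)
               for m in range(n_starts)) / n_starts

# Toy example (made-up numbers): uniform background, motif of length W=2 favouring "AC".
theta0 = {c: 0.25 for c in "ACGT"}
psi = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
log_odds = math.log(motif_likelihood("GGACGG", psi, theta0)
                    / background_likelihood("GGACGG", theta0))
```

For long sequences the products underflow, so a practical version would work with sums of logarithms instead.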
• JASPAR: Resource for mammalian TF binding sites.
• Learn 64 PSSMs with MEME, and 64 two-component
mixture PSSMs.
• For each PSSM, compute the top 1000 scoring hits in
12,253 promoter regions in mouse and human.
• For each mixture, compute the N1 top scoring hits for
PSSM1 and the N2 top scoring hits for PSSM2 such
that N1 + N2 = 1000 and N1/N2 = mixture ratio.
• Compute the fraction of sites that were conserved
between human and mouse.
Conservation statistics serve as a surrogate for functionality.
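The hit allocation for a mixture and the conservation score reduce to simple arithmetic, sketched below. The function names, the rounding rule, and the representation of sites as hashable identifiers are assumptions made for illustration; the slides do not specify them.

```python
def split_hits(total, ratio):
    """Return (n1, n2) with n1 + n2 == total and n1 / n2 approximately equal to ratio."""
    n1 = round(total * ratio / (1 + ratio))
    return n1, total - n1

def conserved_fraction(hits, conserved_sites):
    """Fraction of predicted sites that also appear in the conserved set."""
    if not hits:
        return 0.0
    return sum(1 for h in hits if h in conserved_sites) / len(hits)

# A mixture ratio of 3 splits 1000 hits into 750 for PSSM1 and 250 for PSSM2.
n1, n2 = split_hits(1000, 3.0)
```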
• De novo motif detection: MEME, Gibbs sampler etc.
• Still in need of much improvement.
• Use existing biological knowledge by modelling and
exploiting regularities that are shared by structurally
related TFs (transcription factors).
• FBPs: Familial binding profiles = average binding motifs
for a set of related TFs.
• Use FBP as a prior. Problem: Traditional motif finding
methods can only incorporate one FBP.
• Novel approach (SOMBRERO): "Prior": Binding profile
SOM (BP-SOM) that has previously been trained on a
set of "typical" PSSMs (e.g. taken from JASPAR).
          SOM      SOMBRERO
Input     PSSMs    k-mers
Output    FBPs     motif features
Predict TF binding sites
• Third-order Markov model of intergenic DNA to
generate random datasets.
• Compute the expected number of matches.
• Rank motifs in terms of significance.
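A minimal sketch of the random-dataset step, assuming the Markov model is estimated by simple transition counts from training sequences. The function names are hypothetical, and counting exact string occurrences is a simplification used here for illustration; it is not the matching rule used by SOMBRERO.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, order=3):
    """Count which nucleotide follows each context of length `order`."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def sample_sequence(counts, length, order=3, rng=random):
    """Generate one random sequence from the fitted transition counts."""
    seq = list(rng.choice(sorted(counts)))       # start from an observed context
    while len(seq) < length:
        nxt = counts.get("".join(seq[-order:]))
        if not nxt:                              # unseen context: fall back to uniform
            seq.append(rng.choice("ACGT"))
            continue
        letters, weights = zip(*sorted(nxt.items()))
        seq.append(rng.choices(letters, weights=weights)[0])
    return "".join(seq)[:length]

def expected_matches(counts, motif, length, n_random=200, seed=0):
    """Average number of exact motif occurrences over random sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_random):
        s = sample_sequence(counts, length, rng=rng)
        total += sum(s.startswith(motif, i) for i in range(len(s) - len(motif) + 1))
    return total / n_random
```

A motif whose observed count in the real data greatly exceeds `expected_matches` would rank as more significant.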
Evaluation
• 3rd-order Markov model of yeast intergenic sequences
with inserted PSSM TFBs from yeast and mammals.
• BP-SOM pre-trained on mammalian-specific PSSMs
from JASPAR.
Results
• Improved prediction for mammalian TFBs.
• No improved prediction for yeast TFBs.
• The mammalian-specific "prior" does not adversely affect
the performance either.
• Relevant FBPs will be reinforced.
Irrelevant FBPs will fade out.
Pathways and systems biology
• Modelling
Predictive mathematical models based on ODEs to explain the spatial locations
and dynamics of expression profiles in small gene networks.
• Network reconstruction
Application of machine learning methods to infer static molecular interaction
networks from heterogeneous postgenomic data.
• Analysis of network structure
Simplified view of complex interaction networks, e.g., by identifying recurrent
patterns across multiple networks to discover biologically meaningful modules.
• Function prediction
A given network, e.g. of protein interactions, is used for functional annotation,
that is, the underlying structure of protein interaction maps is used to predict
novel protein functions.
Objective
Predict protein-protein interactions from a combination of
heterogeneous postgenomic data:
• Protein sequences
• Gene ontologies
• Evolutionary context: Interactions in other species
• Local network topology: Mutual clustering coefficient
Methodology
SVMs and kernel methods
Protein sequences
• Spectrum: frequencies of k-mers.
• Motifs: counts of discrete sequence motifs.
• Pfam domains: compare sequence with every HMM in the Pfam
database, compute E-value for each protein-HMM comparison.
Pairwise kernel to measure the similarity between two pairs of proteins
$(X_1, X_2)$ and $(X_1', X_2')$:
$$K[(X_1, X_2), (X_1', X_2')] = K(X_1, X_1')\,K(X_2, X_2') + K(X_1, X_2')\,K(X_2, X_1')$$
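The pairwise kernel can be built on top of any kernel between single proteins. The sketch below pairs it with a simple spectrum kernel that uses raw k-mer counts rather than normalised frequencies; that choice, the default k=3, and the function names are assumptions for illustration.

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[kmer] * cy[kmer] for kmer in cx)

def pairwise_kernel(pair_a, pair_b, base=spectrum_kernel):
    """Similarity between two protein pairs, symmetric in the order within each pair."""
    (x1, x2), (y1, y2) = pair_a, pair_b
    return base(x1, y1) * base(x2, y2) + base(x1, y2) * base(x2, y1)
```

Summing over both ways of matching the pair members is what makes the result independent of which protein in a pair is listed first.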
Gene ontologies
Similarity of annotations $\alpha$ and $\alpha'$ scored as
$$\max_{\alpha'' \in \mathrm{ancestors}(\alpha) \cap \mathrm{ancestors}(\alpha')} -\log P(\alpha'')$$
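A small sketch of this score on a toy ontology. The ontology is represented as a dictionary from each term to its parents, and each term is counted as its own ancestor; both are assumptions for illustration, as are the term names and probabilities in the example.

```python
import math

def ancestors(term, parents):
    """All ancestors of a term in the ontology DAG, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def go_similarity(a, b, parents, prob):
    """Largest -log P over the common ancestors of two annotations."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return max((-math.log(prob[t]) for t in common), default=0.0)

# Hypothetical toy ontology; prob[t] is the fraction of proteins annotated with t.
parents = {"root": [], "binding": ["root"], "catalysis": ["root"],
           "dna_binding": ["binding"], "rna_binding": ["binding"]}
prob = {"root": 1.0, "binding": 0.5, "catalysis": 0.4,
        "dna_binding": 0.2, "rna_binding": 0.1}
```

Rare common ancestors give high scores, so two proteins sharing only the root term score zero.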
Interactions in other species
$X_1$ and $X_2$: Proteins
$H(X)$: Set of non-yeast proteins that are significant BLAST hits of $X$
$I(i, k)$: Indicator variable for the interaction of proteins $i$ and $k$
$E(X_i, X_k)$: Negative log E-value from BLAST
$$h(X_1, X_2) = \max_{i \in H(X_1),\, k \in H(X_2)} I(i, k)\, \min[E(X_1, X_i), E(X_2, X_k)]$$
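The score h can be sketched as a double loop over the homologs of the two proteins. The data layout is an assumption for illustration: `hits[x]` maps each homolog of x to its negative log E-value (assumed positive for significant hits), and `interacts` is a set of unordered pairs.

```python
def ortholog_interaction_score(x1, x2, hits, interacts):
    """Best min-score over homolog pairs that are known to interact, else 0."""
    best = 0.0
    for i, e1 in hits.get(x1, {}).items():
        for k, e2 in hits.get(x2, {}).items():
            if frozenset((i, k)) in interacts:   # indicator I(i, k) = 1
                best = max(best, min(e1, e2))
    return best

# Hypothetical example: two yeast proteins with two non-yeast homologs each.
hits = {"Y1": {"h1": 30.0, "h2": 5.0}, "Y2": {"h3": 20.0, "h4": 50.0}}
interacts = {frozenset(("h1", "h3")), frozenset(("h2", "h4"))}
score = ortholog_interaction_score("Y1", "Y2", hits, interacts)
```

Taking the minimum means a pair only scores highly when both proteins have a strong homolog in the interacting pair.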
Local network topology
Mutual clustering coefficient: $\frac{|N(X) \cap N(Y)|}{|N(X) \cup N(Y)|}$,
where $N(X)$ denotes the neighbours of protein $X$ in the interaction network.
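A minimal sketch of a mutual clustering coefficient, assuming the common Jaccard form (shared neighbours divided by all neighbours of either protein) and an adjacency dictionary of neighbour sets. Returning 0 when neither protein has neighbours is a convention chosen here.

```python
def mutual_clustering_coefficient(x, y, neighbours):
    """Jaccard overlap of the neighbour sets of two proteins."""
    nx, ny = neighbours.get(x, set()), neighbours.get(y, set())
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0

# Hypothetical network: A and B share two of four distinct neighbours.
network = {"A": {"C", "D", "E"}, "B": {"D", "E", "F"}}
mcc = mutual_clustering_coefficient("A", "B", network)
```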
Incorporation of measurement reliability
• Soft-margin parameter $C$
• $C_{\text{high}}$ for interactions believed to be reliable and for negative
examples
• $C_{\text{low}}$ for interactions not known to be reliable
Findings
• Predictions from sequences alone → modest performance.
• Data integration → improved predictions.
• GO annotations → considerable improvement.
• GO annotations alone → cannot distinguish between physical
interactions and members of the same complex.
• Sequence motifs → related to the binding site.
• Future work → include gene expression and ChIP-on-chip data.
Criticism
• Choice of kernels requires some intuition.
• The soft-margin parameters $C_{\text{high}}$ and $C_{\text{low}}$ and the kernel
weights are set arbitrarily rather than estimated.
Objective
Infer enzyme networks from the integration of multiple heterogeneous
postgenomic data.
• Genome: Sequence data encoded into phylogenetic profiles.
• Transcriptome: Microarray gene expression.
• Proteome: Protein localization.
• Metabolome: Chemical compounds involved, encoded in Enzyme
Commission (EC) numbers from KEGG.
Identify genes coding for missing enzymes on known pathways.
Methodology
Deal with the heterogeneity of the data sets by transforming all data into
kernel similarity matrices → Unified mathematical framework across
different types of datasets.
Supervised inference:
• Use partial knowledge of the metabolic network to infer unknown
parts.
• Mapping of a gene to a low-dimensional space by exploiting the
partial knowledge of the network.
• Diffusion kernels, kernel canonical correlation analysis between the
data kernel and the kernel describing the known part of the network.
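To illustrate how heterogeneous data end up in one framework, the sketch below turns two toy data sets over the same genes into Gram matrices and adds them. The Gaussian parametrisation exp(-d²/(2σ²)), the combination by simple summation, and all names are assumptions for illustration; the supervised KCCA step is not shown.

```python
import math

def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel, assuming the exp(-d^2 / (2 sigma^2)) parametrisation."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2 * sigma ** 2))

def linear_kernel(u, v):
    """Plain inner product."""
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(data, kernel):
    """Kernel value for every pair of rows (genes) in one data set."""
    return [[kernel(u, v) for v in data] for u in data]

def add_kernels(*matrices):
    """Integrate data sets by summing their Gram matrices element-wise."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]

# Hypothetical data for three genes: expression values and localization indicators.
expression = [[1.0, 2.0, 0.5], [0.9, 2.2, 0.4], [5.0, 0.1, 3.0]]
localization = [[1, 0], [1, 0], [0, 1]]
k_expr = gram_matrix(expression, lambda u, v: rbf_kernel(u, v, sigma=8.0))
k_loc = gram_matrix(localization, linear_kernel)
k_all = add_kernels(k_expr, k_loc)
```

Because every data set becomes a gene-by-gene matrix of the same shape, the downstream method never needs to know what the original measurements looked like.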
Findings
[Figure panels: "Unsupervised, without chemical constraints" and "Supervised, with chemical constraints"]
• Predictions from sequences alone → poor performance.
• Data integration → improves predictions.
• Supervised learning → considerable improvement.
Criticism
Choice of kernels seems to require a lot of intuition, and the setting of
parameters seems rather arbitrary.
• Expression data: Gaussian RBF kernel, σ = 8.
• Phylogenetic profiles: Gaussian RBF kernel, σ = 3.
• Localization data: Linear kernel.
• Chemical compatibility network: Diffusion kernel, β = 0.01.
• KCCA: number of features = 30, λ = 0.01.
• Proteins and their interaction partners must co-evolve
so that divergent changes in one partner’s binding
surface are complemented in the interface with the
other partner.
• Correlation between evolutionary trees indicates that the
proteins from two families interact.
• Predict specific interactions: Establish a mapping
between the leaves of the two trees; proteins at
equivalent leaves are predicted to interact.
Phylogenetic trees of Ntr-sensor histidine kinases and their corresponding regulators
• Ramani and Marcotte (2003): Mapping that maximizes the
correlation between two similarity matrices.
• MCMC to explore the space of all superpositions.
• Disadvantage: Large search space full of local optima.
• New approach: Use topological information of the phylogenetic
tree in addition to evolutionary distances.
• Search space: Automorphism group of a phylogenetic tree.
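The idea of choosing the leaf mapping that maximizes the correlation between two similarity matrices can be sketched by brute force for very small families. This exhaustive search is a stand-in for illustration only: it is neither the MCMC search nor the automorphism-based approach, and it becomes infeasible beyond a handful of proteins. The function names are hypothetical.

```python
import math
from itertools import permutations

def upper_triangle(dist, order):
    """Pairwise distances above the diagonal, with leaves taken in the given order."""
    n = len(order)
    return [dist[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    """Pearson correlation; assumes neither vector is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def best_mapping(dist1, dist2):
    """Permutation p such that protein i of family 1 is paired with p[i] of family 2."""
    n = len(dist1)
    ref = upper_triangle(dist1, range(n))
    return max(permutations(range(n)),
               key=lambda p: pearson(ref, upper_triangle(dist2, p)))
```

The number of candidate mappings grows as n!, which is why the search space needs to be restricted or sampled in practice.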
Maximum likelihood of evolutionary trees:
hardness and approximation
Benny Chor and Tamir Tuller
• Reconstructing the ML tree is computationally intractable
(NP-hard).
• Algorithm for approximating log-likelihood under the condition that
the input sequences are sparse.
Detecting coevolving amino acid sites using Bayesian
mutational mapping
Matthew W. Dimmic, Melissa J. Hubisz, Carlos D. Bustamante and Rasmus Nielsen
• The evolution of protein sequences is constrained by complex
interactions between amino acid residues.
• Harmful substitutions may be compensated for by other
substitutions at neighboring sites, so residues can coevolve.
• Bayesian phylogenetic approach to the detection of coevolving
residues in protein families.
• Based on a coevolutionary Markov model for codon substitution.
Efficient computation of close lower and upper
bounds on the minimum number of recombinations
in biological sequence evolution
Yun S. Song, Yufeng Wu, and Dan Gusfield
• Computing the minimum number of recombinations needed to
explain sequences (with one mutation per site) is NP-hard.
• Practical method for computing both upper and lower bounds on
the minimum number of recombinations.
Multi-way clustering of microarray data using
probabilistic sparse matrix factorization
Delbert Dueck, Quaid D. Morris and Brendan J. Frey
• Multi-way clustering of microarray data using a generative model.
• Probabilistic extension of a previous hard-decision algorithm.
• Allows for varying levels of sensor noise in the data.
Clustering short time series gene expression data
Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph
• Algorithm specifically designed for clustering short (less than 10
time points) time series expression data.
• Assigning genes to a predefined set of model profiles.
• Profiles capture the potential distinct patterns that can be expected
from the experiment.
• Evaluation using Gene Ontology analysis.
• The proposed algorithm outperforms all competing algorithms.
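Assigning genes to predefined model profiles can be sketched as a nearest-profile search. Using Pearson correlation as the similarity measure is an assumption made here for illustration, and the profile names and values are hypothetical; how the model profiles are selected is not shown.

```python
import math

def correlation(a, b):
    """Pearson correlation; returns 0.0 if either series is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def assign_to_profiles(genes, profiles):
    """Map each gene to the name of its most highly correlated model profile."""
    return {name: max(profiles, key=lambda p: correlation(series, profiles[p]))
            for name, series in genes.items()}

# Hypothetical model profiles and expression series over four time points.
profiles = {"up": [0, 1, 2, 3], "down": [0, -1, -2, -3], "peak": [0, 2, 0, -1]}
genes = {"g1": [0.0, 0.9, 2.1, 2.8], "g2": [0.0, -1.0, -1.5, -3.0]}
clusters = assign_to_profiles(genes, profiles)
```

Because the profiles are fixed in advance, each gene is assigned independently of the others, which suits series with very few time points.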
Experimental design for three-color and four-color
gene expression microarrays
Yong Woo, Winfried Krueger, Anupinder Kaur, and Gary Churchill
• Three-color microarray technology is currently available at a
reasonable cost.
• Clear guidelines for designing and analyzing three-color experiments
do not exist.
• Three-color and four-color cyclic (loop) designs.
• Three-color experiments using the same number of samples (and
fewer arrays) will perform as efficiently as two-color experiments.
GenXHC: a probabilistic generative model for cross-hybridization
compensation in high-density genome-wide microarray data
Jim C. Huang, Quaid D. Morris, Timothy R. Hughes, and Brendan J. Frey
• Objective: removal of cross-hybridization noise.
• Probabilistic generative model for cross-hybridization.
• Prior knowledge of hybridization similarities between the nucleotide
sequences of microarray probes and their target cDNAs.
• Variational learning.