Identify regulatory modules from gene expression data
Xu Ling
02/09/2005

Introduction

Much of a cell’s activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. Identifying this organization is crucial for understanding cellular responses to internal and external signals.

Genome-wide expression profiles (e.g., from DNA microarrays) provide important information about regulatory mechanisms. With complete genome sequences available, identifying cis-regulatory elements with genome-wide bioinformatics approaches has emerged as a promising solution.

Tasks

What are the underlying mechanisms by which genes are regulated?
* Modules of coregulated genes?
* Regulators (transcription factors)?
* Regulation conditions (TFBSs/motifs, positional and combinatorial constraints)?

General scheme (1)

Clustering-based approaches for finding motifs from gene expression and sequence data
[figure label: classify]

General scheme (2)

Sequence-based (or knowledge-based) approaches for finding motifs from gene expression and sequence data

General scheme (3)

Comparative genomics has also been applied to identify eukaryotic regulatory elements (e.g., Human-Mouse), because evolutionary constraints may conserve functional noncoding sequences across species. Finding a good pair of species to compare and choosing a good sequence conservation threshold are critical, yet such information is not available for most species.

Related work

Predicting gene expression from sequence
Michael A.
Beer and Saeed Tavazoie
Cell, 2004, 117: 185-198

A successful application of existing computational approaches to studying the yeast transcriptional regulation network

Approach

* Clustering (k-means) – modules of coregulated genes
* Motif finding (AlignACE) – putative regulatory elements (TFBSs)
* Bayesian network learning – regulation conditions (motifs, positional and combinatorial constraints)

Bayesian Network

Sequence features (x1,…,xn) -> expression patterns (ei)
* Sequence feature (xi): presence of motifs, positional constraints, and combinatorial constraints
* Expression pattern (ei): a binary one-layer network
* Learning maximizes P(ei|x1,…,xn), the probability that a gene with these sequence features participates in expression pattern i

Properties

* Easy to integrate all kinds of sequence features
* Explicit sequence features
* To keep complex networks from overfitting the training data, a parameter penalizes dense networks.
* The “optimal” network is learned greedily.

Motif finding approaches

* Explicit statistical modeling based
  * Expectation maximization – MEME, …
  * Gibbs sampling – AlignACE, Gibbs Motif sampler, …
  * Others – CONSENSUS, …
* Word enumeration based – MDscan, …

MEME

* Each sequence is broken up into all the overlapping subsequences of length W that it contains.
* Two-component finite mixture model: “Motif” (a set of similar subsequences of fixed width) and “Background” (all other positions in the sequences)
* Motif model: each example of the motif is assumed to be generated by a sequence of independent multinomial random variables.
* Background model: each position that is not part of a motif is generated independently by a multinomial random variable.
* The EM algorithm maximizes the likelihood of the model M given the data D: L(M|D) = p(D|M).

Gibbs motif sampler

* Works with a specific model alignment rather than a weighted average, as EM does.
* Iteratively samples motif models (or possibly the background model) for each subsequence, thereby partitioning motif-encoding regions into different motifs.
* An iterative heuristic method that combines gradient search steps with random jumps in the search space; it is therefore not guaranteed to reach the optimum, but it is less prone than EM to getting stuck at local maxima.
* Identifies the most probable motif models by locating the optimal alignments, which maximize the ratio of the corresponding target probability to the background probability (the maximum a posteriori, or MAP, score).

Future work

* Ab initio motif finding from gene expression and sequence data, attempting new heuristic or statistical models.
* Integrating prior knowledge (e.g., GO) to facilitate identification of regulatory elements and transcriptional networks.
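
To make the Gibbs motif sampler slide more concrete, here is a minimal Python sketch of a site sampler. It is an illustration under simplifying assumptions, not the implementation used by AlignACE or the Gibbs Motif sampler: it assumes exactly one motif occurrence per sequence, a fixed motif width, DNA sequences containing only A, C, G, and T, and a single-nucleotide background model. The function names (`gibbs_site_sampler`, `build_profile`, and so on), the pseudocount of 1, and the summed log-ratio score used as a stand-in for the MAP score are all hypothetical choices made for this example.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def background_freqs(seqs, pseudo=1.0):
    """Single-nucleotide background frequencies estimated from all sequences."""
    counts = {b: pseudo for b in ALPHABET}
    for s in seqs:
        for b in s:
            counts[b] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def build_profile(seqs, positions, width, skip=None, pseudo=1.0):
    """Per-column base frequencies of the current alignment, omitting sequence `skip`."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    for i, (s, p) in enumerate(zip(seqs, positions)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][s[p + j]] += 1
    profile = []
    for col in counts:
        total = sum(col.values())
        profile.append({b: c / total for b, c in col.items()})
    return profile


def log_ratio(window, profile, bg):
    """log of P(window | motif model) / P(window | background model)."""
    return sum(math.log(profile[j][b] / bg[b]) for j, b in enumerate(window))


def alignment_score(seqs, positions, width, bg):
    """Summed log-ratio of all aligned sites (a simplified stand-in for the MAP score)."""
    profile = build_profile(seqs, positions, width)
    return sum(log_ratio(s[p:p + width], profile, bg)
               for s, p in zip(seqs, positions))


def gibbs_site_sampler(seqs, width, iters=200, seed=0):
    """Return (best_score, best_positions) over `iters` sweeps of Gibbs sampling."""
    rng = random.Random(seed)
    bg = background_freqs(seqs)
    # Start from a random alignment: one motif start position per sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    best = (alignment_score(seqs, positions, width, bg), list(positions))
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Rebuild the motif model without sequence i, then resample its site
            # with probability proportional to the motif/background ratio.
            profile = build_profile(seqs, positions, width, skip=i)
            starts = list(range(len(s) - width + 1))
            weights = [math.exp(log_ratio(s[p:p + width], profile, bg))
                       for p in starts]
            positions[i] = rng.choices(starts, weights=weights)[0]
        score = alignment_score(seqs, positions, width, bg)
        if score > best[0]:
            best = (score, list(positions))
    return best
```

A call such as `gibbs_site_sampler(["AAAAACGTGCAAAAA", "AACGTGCAAAAAAAA"], width=6)` returns the best score seen and the corresponding start position in each sequence. Because each site is drawn at random rather than chosen greedily, the sampler can leave a poor alignment, which is the "random jumps" behavior noted on the slide; tracking the best-scoring alignment across sweeps corresponds loosely to locating the optimal alignment.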