Gene Net Analysis: Motifs vs. Correlation

OVERVIEW
Omer Berkman

Contents
- Biological background
- Using gene arrays to decipher gene-regulatory interactions
- Applications…

Hybridization
- A DNA double strand forms by “gluing” together complementary single strands.
- Complementary rule: A-T/U, G-C

Protein production

From DNA to Protein
- Transcription: gene to mRNA. Translation: mRNA to protein.
- Cells express different subsets of their genes in different tissues and under different conditions.

Functional genomics
- The complete sequences of many microbial genomes are already known: the inventory of the building blocks of life has been collected.
- The next stage is “re-assembling the pieces”:
  - Defining the role of each gene in these genomes.
  - Understanding how the genome functions as a whole in the complex natural history of a living organism.
- Knowing when and where a gene is expressed often provides a strong clue as to its biological role.

Transcriptional process
- This process is highly regulated.
- One of the most important ways in which the cell regulates gene expression is by using a feedback loop.
- Some of the proteins are transcription factors.
- These proteins regulate the expression of other genes (and possibly their own expression) by either initiating or repressing transcription.

Transcriptional networks
- One gene can be a regulator of another gene.
- Transcriptional networks are the biochemical networks responsible for regulating the expression of genes in cells.
- In these transcription networks, the nodes represent transcription factors (genes) and the edges represent direct transcriptional regulation. [Shen-Orr 2002, Thieffry 1998]

Transcriptional networks example

Gene-arrays for mRNA analysis
- Differences in cell type or state are correlated with changes in the mRNA levels of the cell's genes.
- The only specific reagent required to measure the abundance of the mRNA for a specific gene is a cDNA sequence.
- DNA microarrays provide a practical and economical tool for studying gene expression on a very large scale.
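
The complementary rule from the hybridization slide is what lets an array probe pick out its target. A minimal Python sketch of that rule follows. The function names `reverse_complement` and `hybridizes` and the example sequences are made up for illustration, and perfect base-pairing is assumed, which real probes do not strictly require.

```python
# Watson-Crick pairing for DNA (A-T, G-C); in RNA, U would take the place of T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with `seq`, read in the same 5'-to-3' sense."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """True if `probe` pairs perfectly with some stretch of `target`.

    Assumption: only exact complementarity counts as hybridization.
    """
    return reverse_complement(probe) in target.upper()


probe = "ATGC"        # hypothetical oligo fixed on the chip
sample = "TTGCATCC"   # hypothetical labeled fragment from the sample

print(reverse_complement(probe))   # GCAT
print(hybridizes(probe, sample))   # True
```

Treating hybridization as an exact substring match is the simplest possible model; it ignores partial matches, which is one reason measured intensities are noisy.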

Affymetrix model for DNA chip
- Now we can infer which genes were expressed and at what intensity.
- Because of some biological processes, the correct sequence will not always hybridize to the oligo.

Gene Arrays / DNA chips
- From “one gene in one experiment” to “massively parallel biological data acquisition”.
- Simultaneously analyzing the expression levels of large numbers of genes provides the opportunity to study the activity of whole genomes.
- Large-scale gene expression analysis reveals the behavior of co-regulated gene networks.

Raw Data
- The curse of dimensionality: thousands of genes versus only a few observations.

Static versus dynamic
We distinguish between static experiments and time series experiments:
- Static:
  - A snapshot in different samples is measured.
  - Data are assumed to be independent and identically distributed.
- Dynamic:
  - A temporal process is measured.
  - Data have strong autocorrelation between successive points.

Temporal observations
- It is possible to produce time-dependent measurements, termed expression matrices.
- These expression matrices are the result of the underlying regulatory network.
- Reverse engineering seeks to extract information from time-series measurements in order to identify regulatory interactions in these genetic networks.

Complications
- The curse of dimensionality
- Extremely noisy observations
- Expensive experiments
- Stochastic nature
- Population averaged
- Feasible time scale
- Partial information
We are facing a hard problem…

1. The curse of dimensionality (Bellman, 1961)
- The number of genes typically far exceeds the number of time points for which data are available, making the problem ill-posed.
- “Traditional statistics” won't help here: the number of samples, relative to the number of genes, does not provide enough information to construct a fully detailed model with high statistical significance.
- New statistical methods and approaches were developed (bootstrap, interpolation, clustering, FDR…).

2. Stochastic nature
- Deterministic versus stochastic.
- Biology has no deterministic processes…

3. Population averaged
- Measurements are obtained as population-averaged data.
- The measurement itself kills or alters the organism.
- This masks the real regulatory interactions (quantization problem).

4. Feasible time scale
Empirical limit on the number of time points:
- The average speed of the biological process determines the number of informative points.
- The error of the method applied has to be smaller than the expression level difference.
(Figure labels: missing regulatory interactions; cost and error)

5. Partial information
- Biological systems are robust, adaptable, and redundant.
- Genes are not the only actors in the game: transcription factors can be of many kinds.
- The regulatory interactions between genes are not deterministic at the mRNA level: a gene has a few independently regulated derivatives.
- mRNA expression data alone give only a partial picture that does not reflect key events such as translation and protein (in)activation.

Fundamental question
- How much information is needed to map the gene-regulatory interactions of a biological system?
- Hertz's estimate [1998] for the number of gene states that must be measured for successful reverse engineering: $P = K \log(N/K)$, where N is the size of the network (e.g. the number of genes) and K is the average number of interactions per gene.

Application 1 [DeRisi 1997]
Exploring the metabolic and genetic control of gene expression
- Investigation of gene expression accompanying the metabolic shift from fermentation to respiration in yeast.
- Identification of genes whose expression was affected by deletion of TUP1 or overexpression of YAP1.

Yeast genome micro-array
- Genes induced or repressed appear in this image as red and green spots, respectively.

Temporal samples

Analysis
- Stable gene expression during exponential growth.
- A marked change was seen as glucose was progressively depleted from the growth media:
  - mRNA levels for 710 genes were induced.
  - mRNA levels for 1030 genes declined.
- The expression patterns observed for previously characterized genes showed concordance with previously published results.
- About half of these differentially expressed genes have no apparent homology to any gene whose function is known. This provides the first small clue to …

Coordinated regulation of functionally related genes
- Genes can be grouped on the basis of the similarities in their expression patterns.

Distinct temporal patterns

Metabolic Diagram
- Red boxes identify genes whose expression increases in the diauxic shift.
- Green boxes identify genes whose expression diminishes in the diauxic shift.

Defining the contributions of individual regulatory genes
- Using a DNA micro-array to identify genes whose expression is affected by mutations in each putative regulatory gene.
- Performing:
  - Deletion of the transcriptional repressor TUP1.
  - Overexpression of the transcriptional activator YAP1.

Deleting the TUP1 gene
- Wild-type yeast cells and cells bearing a deletion of the TUP1 gene were grown.
- mRNA was isolated from the two populations and used to prepare cDNA labeled with green and red.
- The labeled probes were mixed and simultaneously hybridized to the micro-array.
- Red spots on the array represent genes that were induced in the TUP1-deleted strain, and thus presumably repressed by TUP1.

Overexpressing the YAP1 gene
- Complementary DNA from the control and YAP1-overexpressing strains, labeled with Cy3 and Cy5 respectively, was prepared from mRNA isolated from the two strains and hybridized to the micro-array.
- Red spots on the array represent genes that were induced in the strain overexpressing YAP1.

Characterization of regulatory pathways and networks
- Use of a micro-array to characterize the transcriptional consequences of mutations provides a simple and powerful approach.
- This strategy also has an important practical application in drug screening.
- However, one should keep in mind that transcriptional regulation might be complicated.

Application 1 summary
- DNA micro-arrays provide a simple and economical way to explore gene expression patterns on a genomic scale.
- “The greatest challenge now is to develop efficient methods for organizing, interpreting, and extracting insights from the large volumes of data these experiments provide.”
- Technical advances have made array experiments fairly easy to do, but tools for analyzing the data produced have lagged behind.

Application 2 [Friedman 2000]
Using Bayesian Networks to Analyze Expression Data
- Probabilistic approach.
- Bayesian network as a model for genetic networks.

Bayesian networks – definitions
- A representation of a joint probability distribution.
- This representation consists of two components:
  - G is a directed acyclic graph (DAG) whose vertices correspond to the random variables.
  - θ describes a conditional distribution for each variable, given its parents in G.

Simple example

Bayesian networks – properties
- Encodes the Markov assumption: each variable is independent of its non-descendants, given its parents in the graph.
- A graph-based model that captures properties of conditional independence between variables.
- Useful for describing processes composed of locally interacting components.
- Provides models of causal influence.

Equivalence classes
- Let Ind(G) be the set of independence statements (of the form “X is independent of Y given Z”) implied by G.
- More than one graph can imply exactly the same set of independencies.
- Two graphs G’ and G’’ are equivalent if Ind(G’) = Ind(G’’), that is, both graphs are alternative ways of describing the same set of independencies.
- Equivalent graphs have the same underlying undirected graph but might disagree on the direction of some of the arcs (we switch to PDAG).
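
Graph equivalence as defined above can be checked mechanically. The sketch below assumes the standard Verma-Pearl characterization, which the slides do not state explicitly: two DAGs over the same variables imply the same independencies exactly when they share the same undirected skeleton and the same v-structures (colliders a → c ← b with a and b not adjacent). The function names and the three toy graphs are illustrative only.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """All colliders a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def equivalent(g1, g2):
    """Same skeleton and same v-structures (assumes identical node sets)."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)


chain = {("X", "Y"), ("Y", "Z")}            # X -> Y -> Z
reversed_chain = {("Z", "Y"), ("Y", "X")}   # X <- Y <- Z
collider = {("X", "Y"), ("Z", "Y")}         # X -> Y <- Z

print(equivalent(chain, reversed_chain))    # True
print(equivalent(chain, collider))          # False
```

The two chains differ only in arc direction and are equivalent, which is why expression data alone cannot orient those arcs. The collider has the same skeleton but implies a different independence set.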

Learning Bayesian Networks
- Given a training set D of independent instances of X, find a network B = {G, θ} that best matches D.
- Several scoring functions are available.
- Finding the structure G that maximizes the score is known to be NP-hard.
- For heuristic search we need:
  - A decomposable score function, for example S(G:D) = log P(D|G) + log P(G) + C.
  - An iterative search method, for example greedy or stochastic hill climbing, simulated annealing…

Biological (causal) interpretation
- Edges: the parents of a variable are its immediate causes (the parent of a node is a transcription factor for this gene).
- A causal network models the effects of interventions: if X causes Y, then manipulating the value of X affects the value of Y, but not the other way around (if we knock out gene X, this will affect the expression of gene Y, but a knockout of gene Y has no effect on the expression of gene X).

Analyzing Expression Data
- Random variables denote the expression levels of individual genes.
- In addition, we can include random variables that denote other attributes that affect the system (experimental conditions, temporal indicators…).
- We want to learn a network from the available data and use it to answer questions about the system.

Find high-scoring networks
- The data are not informative enough to determine which single model is the right one.
- We therefore focus on features that are common to most of the possible models:
  - Markov relation: indicates that two genes are related in some joint biological interaction or process (there is either an edge between them, or both are parents of another variable (Pearl 1988)).
  - Order relation: X is an ancestor of Y in all the networks of a given equivalence class (the given PDAG contains a directed path from X to Y).

How can we estimate a measure of confidence in the features?
- Bootstrap method (Efron & Tibshirani 1993).
- A method to enlarge our data set by generating “perturbed” versions of our original data set.
- In this way we collect many networks, all of which are fairly reasonable models of the data.
- For each feature f of interest, calculate

  $\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$

  where f(G) is 1 if f is a feature in G, and 0 otherwise.

Local Probability Models
In order to specify a Bayesian network model, we still need to choose the type of local probability models we learn. In the current work, we consider two approaches:
- Multinomial model (discretizing expression levels to -1, 0, 1).
- Linear Gaussian model.

Robustness analysis

Multinomial versus Gaussian
- The two methods highlight different types of connections between genes.

Biological Analysis
- Order relations reveal the existence of dominant genes. Out of all 800 genes, only a few seem to dominate the order (i.e., appear before many genes).
- Top Markov relations reveal gene pairs, most of which are functionally related.
- Nice presentation: http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/

An example of the graphical display of Markov features
- This suits biological knowledge!

Application 2 summary
Using Bayesian networks to model genetic networks:
- Involves thousands of genes while current data sets contain a few dozen samples. This raises problems in computational complexity and the statistical significance of the results.
- Genetic regulation networks are sparse (a gene is assumed to have no more than a few dozen genes directly affecting its transcription). Bayesian networks are especially suited for learning in such sparse domains.
- Did not use any (biological) prior knowledge.
- This theory can provide tools for experimental design.

Dynamic Bayesian Networks
- DBNs are an extension of Bayesian networks, which have been successfully applied to model expression data (Pe’er et al., 2001).
- The main advantage is that, unlike BNs, DBNs allow for cycles, which are common in biological systems.
- In addition, DBNs can also improve our ability to learn causal relationships by relying on the temporal nature of the data.
- DBNs seem like a promising direction for modeling temporal systems, and recently a number of papers discuss this model.

Application 3 [Holter 2000]
Fundamental patterns underlying gene expression
- Algebraic approach.
- Using SVD to model gene expression.

Singular Value Decomposition
- A standard and straightforward analytic procedure that finds eigenvectors, or fundamental patterns of expression with time, of the array matrices.
- The SVD theorem states that the matrix A can be written as $A = U S V^T$.

SVD theorem
- U and V are orthogonal.
- The elements of S are all zero except for the diagonal entries $S_{i,i}$, which are the singular values (square roots of the eigenvalues of $A^T A$).

Characteristic modes
- We define the vectors $X_i$ to be the first r rows of the matrix $S V^T$.
- Those r vectors are the characteristic modes associated with the matrix A.
- The temporal variation of any gene j can be written as a linear combination of these vectors, with coefficients taken from row j of U: $A_j = \sum_{i=1}^{r} U_{j,i} X_i$.

Results
- The first two singular values were significantly greater than the others for all three data sets, but the same is not true in a control calculation on random data.
- Only the first few modes are required to capture the essential features of the expression data in most cases (the modes reflect the genome-wide expression pattern and are not gene-specific).

Gene expression and random data sets

Characteristic modes for the gene expression and random data sets
- The magnitude of the singular value is reflected in the amplitude of each mode.

A reconstruction of the expression profiles

Analysis 1
- A type of “spectral” analysis: a gene expression profile can be precisely represented by specifying the magnitude and sign of the contribution of each of its characteristic modes.
- This suggests that, at a gross level, most time-dependent expression patterns are very simple.
- Data from SVD agree with previous knowledge of expression patterns.

Plot of the coefficients
- Symbols of different colors and shapes are used for genes that belong to the different clusters.

Analysis 2
- The data points (which are not random) are concentrated near the perimeter of a circle or an ellipse, with the interior rather sparsely populated.
- Expression profiles clustered by more conventional methods correspond well to groups of genes with similar coefficients.
- Despite the evolutionary distance between yeast and humans, the observed behavior is both simple and similar.

Application 3 summary
- SVD has uncovered underlying patterns or “characteristic modes” in gene temporal profiles.
- The expression pattern of any particular gene can be represented precisely by a linear combination of the modes with gene-specific coefficients.
- A good approximation of the exact pattern can be obtained by using just a few of the modes, underscoring the simplicity of the gene expression patterns.
- This paradigm may find expression patterns that would not be detected using other methods.

Application 3b [Holter et al 2001]
Dynamic modeling
- In the previous application we treated the gene expression pattern as a “static” image and derived the underlying genome-wide characteristic modes of which it is composed.
- Now we carry out a dynamical analysis, exploring the possible causal relationships among the genes by deducing a time translation matrix for the characteristic modes defined by SVD.
- This matrix predicts future expression levels of genes based on their expression levels at some initial time.

How to deduce a time translation matrix?
- To uniquely and unambiguously determine the $g^2$ elements of the matrix (g being the number of genes), one needs a set of $g^2$ linearly independent equations.
- D’haeseleer [1999] used a nonlinear interpolation scheme to guess the shapes of gene expression profiles between the measured time points (speculative).
- Van Someren [2000] chose to cluster the genes and study the interrelationships between the clusters (based on profile similarity).

Deduce a time translation matrix by using SVD
- The SVD construction gives a linear combination of modes which exactly describes the expression pattern of each gene.
- The modes form a linearly independent basis set.
- The problem is mathematically well defined and tractable if one considers the causal relationships among the modes.

Analysis
- The causal links between the modes, and thence the genes, involve just a few essential connections.
- Any additional connections among the genes must therefore provide redundancy in the network.
- An important corollary is that it may be impossible to determine detailed connectivities among genes with just the micro-array data, because the number of genes greatly exceeds the number of contributing modes.

Application 3b summary
- A model in which the expression levels of the genes at a given time are linear combinations of their levels at a previous time.
- Temporal evolution of the gene expression profiles can be described by using a “time translation” matrix, which reflects the magnitude of the connectivities between genes.
- Because there are only a few essential connections among modes and therefore among genes, additional links provide redundancy in the network.

References
- Yaacov Lindzen’s presentation “Introduction to Micro-arrays”.
- “Genetic network analysis in light of massively parallel biological data acquisition”. Szallasi, 1999, PSB.
- “Exploring the metabolic and genetic control of gene expression on a genomic scale”. DeRisi et al., 1997, Science.
- “Using Bayesian networks to analyze expression data”. Friedman et al., 2000.
- “Fundamental patterns underlying gene expression profiles: Simplicity from complexity”. Holter et al., 2000, Genetics.
- “Dynamic modeling of gene expression data”. Holter et al., 2001, Genetics.
- “Analyzing time series gene expression data”. Bar-Joseph, 2004, Bioinformatics.
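
The SVD construction of Applications 3 and 3b can be illustrated numerically. The sketch below uses NumPy on made-up data (50 hypothetical genes built from two temporal patterns). It computes the characteristic modes as rows of $S V^T$, rebuilds every profile from the first two modes, and fits a time translation matrix for the modes by ordinary least squares. The least-squares fit is an assumption made here for simplicity and is not necessarily the procedure used by Holter et al.

```python
import numpy as np

# Synthetic expression matrix A (genes x time points): two underlying
# temporal patterns mixed with gene-specific weights. All data are made up.
t = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
patterns = np.vstack([np.sin(t), np.cos(t)])      # shape (2, 12)
rng = np.random.default_rng(0)
weights = rng.normal(size=(50, 2))                # 50 hypothetical genes
A = weights @ patterns                            # shape (50, 12)

# A = U S V^T (compact form).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Characteristic modes X_i are the rows of S V^T.
modes = s[:, None] * Vt

# Rebuild each gene profile as a linear combination of the first r modes,
# with coefficients taken from the corresponding row of U.
r = 2
A_approx = U[:, :r] @ modes[:r]

# Time translation matrix M for the modes: X(t + 1) ~ M X(t),
# fitted by least squares through the pseudo-inverse (an assumption).
X_prev, X_next = modes[:r, :-1], modes[:r, 1:]
M = X_next @ np.linalg.pinv(X_prev)

print(np.round(s[:4], 6))                   # only two non-negligible values
print(np.allclose(A, A_approx))             # True
print(np.allclose(M @ X_prev, X_next))      # True
```

Because M acts on r modes, it has only r × r entries to determine instead of g × g, which is the point made in the slides about why the problem becomes well defined in the mode basis.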
