* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Network construction: Literature Review
Survey
Document related concepts
Transcript
Systems Biology: The inference of networks from high dimensional genomics data Ka Yee Yeung Nov 3, 2011 1 Systems Biology • “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” (Nir Friedman) • The goal is to construct models of complex biological systems and diseases (Trey Ideker) 2 An iterative approach High-throughput assays Data handling Experiments Integration of multiple forms of experiments and knowledge Mathematical modeling 3 Multi-disciplinary Science • • • • • • Biology Biotechnology Computer Science Mathematics and Statistics Physics and chemistry Engineering… 4 Networks as a universal language We are caught in an inescapable network of mutuality. ... Whatever affects one directly, affects all indirectly. —Martin Luther King Jr. Internet Electronic Circuit Social Network Gene Regulatory Network Science Special Online Collection: Complex systems and networks: http://www.sciencemag.org/complexity/ 5 Road Map • Definitions: graphical representation of networks • Different types of molecular networks • What can we do with networks? • Network construction methods – Co-expression networks – Bayesian networks – Regression-based methods 6 Graphical Representation of Gene Networks • G=(V,E) where – V: set of nodes (vertices) – E: set of edges between the nodes that represent the relationships between nodes • Directed vs. undirected • Network topology: connectivity structure • Modules: subset of nodes that are more highly interconnected with each other than other nodes in the network Undirected Directed 7 Degree • The degree k of a node is the number of edges connected to it. • In a directed graph, each node has an in-degree and an out-degree. 8 Courtesy of Bill Noble Degree distribution • The degree distribution plots the number of nodes that have a given degree k as a function of k. • The shape of the degree distribution allows us to distinguish among types of networks. 9 Courtesy of Bill Noble Scale-free networks • Most nodes have only one connection; a few hub nodes are highly connected. • The degree distribution is exponential, which yields a straight line on a log plot. • Most biological networks are scale-free. 10 Courtesy of Bill Noble Molecular or Biochemical Pathways • A set of coupled chemical reactions or signaling events. • Nodes are molecules (often substrates) and edges represent chemical reactions. • Represent decades of work in which the underlying chemical reactions are validated. Example: KEGG (Kyoto Encyclopedia of Genes and Genomes) • Contains 410 “pathways” that represent molecular interaction and reaction networks that are manually curated from 149,937 published references (as of 10/25/2011). 11 Molecular Networks Constructed from High-throughput assays (1) Physical interaction network: • A graphical representation of molecular binding interactions such as a protein-protein interaction (PPI) network. • Nodes are molecules; edges represent physical interactions between molecules. • Example: Yeast PPI network in which most interactions derived from large-scale experiments like yeast 2-hybrid data (high false +ve/-ve rates) 12 Molecular Networks Constructed from High-throughput assays (2) Correlation or co-expression network: A graphical representation that averages over observed expression data. Nodes are mRNA molecules, edges represent correlations between expression levels of connected nodes. Bayesian networks: A directed, graphical representation of the probabilities of one observation given another. Nodes represent mRNA molecules; edges represent the probability of a particular expression value given the expression values of the parent nodes. 13 What can we do with these molecular networks? Using the position in networks to describe function Guilt by association Finding the causal regulator (the "Blame Game") 14 Courtesy of Mark Gerstein What can we do with these molecular networks? Hubs tend to be essential! log(Frequency) Power-law distribution Success stories: • Network modeling links breast cancer susceptibility and centrosome dysfunction. Pujana et al. Nature Genetics 2007 • Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. PNAS 2006 15 Courtesy of Mark Gerstein What can we do with these molecular networks? Network-based drug discovery Success stories: • Variations in DNA elucidate molecular networks that cause disease. Chen et al. Nature 2008. • Genetics of gene expression and its effect on disease. Emilsson et al. Nature 2008. 16 Fig 4 Schadt et al. Nature Reviews Drug Discovery 2009 Motivation of gene network inference • Using biochemical methods, it takes 1000’s of person years to assign genes to pathways. • Even for well-studied genomes, the majority of genes are not mapped to known pathways. • As more genomes are being sequenced, and more genes are discovered, we need systematic methods to assign genes to pathways. 17 A gene-regulation function describes how inputs such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level. 18 Kim et al. Science 2009 A gene-regulation function describes how inputs such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level. Modeling DNA sequence-based cisregulatory gene networks. Bolouri & Davidson. 2002. 19 Kim et al. Science 2009 Network construction methods • Co-expression networks • Bayesian networks • Regression-based methods 20 Goal: construct gene networks 21 Early inference of transcriptional regulation: Clustering • Clustering: extract groups of genes that are tightly co-expressed over a range of different experiments. • Pattern discovery • No prior knowledge required • Applications: – Guilt by association (functional annotations) – Extraction of regulatory motifs – Molecular signatures for tissue sub-types 22 Correlation: pairwise similarity 1 1 Experiments X p genes n genes genes X Raw matrix Y Similarity matrix Y n n 4 Correlation (X,Y) = 1 3 2 X 1 Y 0 Z 1 2 3 4 W Correlation (X,Z) = -1 Correlation (X,W) = 1 -1 -2 -3 23 Clustering algorithms • Inputs: – Similarity matrix – Number of clusters or some other parameters • Many different classifications of clustering algorithms: – Hierarchical vs partitional – Heuristic-based vs model-based – Soft vs hard 24 Hierarchical Clustering • Agglomerative (bottom-up) • Algorithm: dendrogram – Initialize: each item a cluster – Iterate: • select two most similar clusters • merge them – Halt: when required number of clusters is reached 25 Hierarchical: Single Link • cluster similarity = similarity of two most similar members - Potentially long and skinny clusters + Fast 26 Hierarchical: Complete Link • cluster similarity = similarity of two least similar members + tight clusters - slow 27 Hierarchical: Average Link • cluster similarity = average similarity of all pairs + tight clusters - slow 28 Co-expression Networks • Co-expression networks – Aka: Correlation networks, association networks – Use microarray data only – Nodes are connected if they have a significant pairwise expression profile association across environmental perturbations • References: – A general framework for weighted gene coexpression network analysis (Zhang, Horvath SAGMB 2005) – WGCNA: an R package for weighted correlation network analysis. (Langfelder, Horvath BMC Bioinformatics 2008) 29 Steps for constructing a co-expression network A) Microarray gene expression data B) Measure concordance of gene expression with correlation C) The Pearson correlation matrix is thresholded to arrive at an adjacency matrix unweighted network Or transformed continuously with the power adjacency function weighted network 30 Example: co-expression network Correlation matrix A A B C D E F G H I J B c D 0.91 0.72 0.84 E F G 0.78 0.88 H I J 0.91 0.72 0.84 Correlation threshold, t=1 0.94 0.78 0.88 0.94 0.75 0.75 0.92 0.92 0.98 0.98 C I H A G B F D J k P(k) 0 10/10 At t=1, there are no edges, so all nodes have degree (k) = 0 E 31 Example: co-expression network Correlation matrix A A B C D E F G H I J B c D 0.91 0.72 0.84 E F G 0.78 0.88 H I J 0.91 0.72 0.84 Correlation threshold, t=0.9 0.94 0.78 0.88 0.94 0.75 0.75 0.92 0.92 0.98 0.98 C G B F D P(k) 0 2/10 1 8/10 I H A k J At t=0.9, there are 4 edges, so 8 nodes have degree (k) = 1 E 32 Example: co-expression network Correlation matrix A B c D 0.91 0.72 0.84 E F G 0.78 0.88 H I J 0.91 0.72 0.84 Correlation threshold, t=0.7 0.94 0.78 0.88 0.94 0.75 0.75 0.92 0.92 0.98 0.98 C I H A G B F D J k P(k) 1 7/10 3 2/10 5 1/10 Log(P(k)) A B C D E F G H I J E Log(k) 33 Bayesian networks • A directed acyclic graph (DAG) such that the nodes represent mRNA expression levels and the edges represent the probability of observing an expression value given the values of the parent nodes. • The probability distribution for a gene depends only on its regulators (parents) in the network. Example: G4 and G5 share a common regulator G2, i.e., they are conditionally independent given G2. factorization of the full joint probability distribution into component conditional distributions. 34 Needham et al. PLOS Comp Bio 2007 Independent Events If G1, …, G5 are independent, then the joint probability p(G1, G2, G3, G4, G5) = p(G1) p(G2) p(G3) p(G4) p(G5) G1 G2 G4 G3 G5 Example: K=“KaYee gives the lecture today”. R=“It is raining outside today” Whether it is rain or shine outside doesn’t affect whether KaYee is giving the lecture today. p(K,R) = p(K) * p(R) 35 Conditional Probability Distributions • Conditional probability distributions: p(B|A) = the probability of B given A. Example: K=“KaYee gives the lecture today”. E=“today’s lecture contains equations” P(E, K) = Probability that Ka Yee gives the lecture today and today’s lecture contains equations = 0.05. P(K)=1/10 = 0.1. P(E|K) = P(E, K) / P(K) = 0.05/0.1 = 0.5. • Score a network (fit) in light of the data: p(M|D) where D=data, M=network structure infer how well a particular network explains the observed data. 36 Conditional Independence Example: K=“KaYee gives the lecture today”. E=“today’s lecture contains equations” C=“today’s slides are in Comic Sans font” K E C If Ka Yee is giving the lecture today, then whether today’s lecture contains equations doesn’t affect whether today’s slides are in Comic Sans. P(E|K,C) = P(E|K) E and C are conditionally independent given K. • In Bayesian networks, each node is independent of its non-descendants, given its parents in the DAG. • Using conditional independence between variables, the joint probability distribution of the models may be represented in a compact manner. 37 Joint Probability Distribution p(G1, G2, G3, G4, G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1, G2, G3) 38 Constructing a Bayesian network • Variables (nodes in the graph) • Add edges to the graph by computing conditional probabilities that characterize the distribution of states of each node given the state of its parents. • The number of possible network structures grows exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible. • Monte Carlo Markov Chain (MCMC) algorithm: – Start with a random network. – Small random changes are then made to the network by flipping, adding, or deleting individual edges. – Accept changes that improve the fit of the network to the data. 39 Bayesian networks • Advantages: – Compact and intuitive representation – Integration of prior knowledge – Probabilistic framework for data integration • Limitation: no feedback loop dynamic Bayesian networks (variables are indexed by time and replicated in the network) • References: – Using Bayesian Network to Analyze Expression Data. Friedman et al. J. Computational Biology 7:601-620, 2000. – A Primer on Learning in Bayesian Networks for Computational Biology. Needham et al. PLOS Computational Biology 2007. 40 What kinds of data contain potential information about gene networks? Large expression sets • Co-expression (correlation of expression levels) implies connectivity • But correlation A B ≠ causality ✔ A B A B A B C Adding causality • Genetic perturbation: DNA variation at A influences RNA variation at B. • Time series: A goes up prior to B. • Prior knowledge 41 Adding genetics data • Quantitative trait locus (QTL): a region of DNA that is associated with a particular trait (eg. Height) • QTL mapping (linkage analysis): correlate the genotypic and phenotypic variation 42 Our data Our experimental design: Time dependencies: ordering of regulatory events. Genotype data: correlate DNA variations in the segregants to measured expression levels BY (lab) Phenotype: RNA levels in response to drug perturbation 6 time points DNA genotype × RM (wild) ... : 95 segregants Genetics of global gene expression. Rockman & Kruglyak. 2006. 43 Experimental design: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu. genes regulators Expression data Genome-wide binding data Literature Other data, e.g. protein-protein interaction, genetic interaction, genotype etc. Supervised learning: integration of external data 0.95 0.23 0.78 Probability that … R regulates g …. Regulators constrained by the external data sources g Variable selection Time series expression data Gene regulatory network Yeung et al. To appear in PNAS. 44 44 Integration of external data Expression data Genome-wide binding data Literature Other data Compute variables (Xi) that capture evidence of regulation for (TF-gene) pairs Y Xi TF-gene Training data: Positive (Y=1) vs. negative (Y=0) training examples Apply logistic regression to determine weights (ai’s) of Xi’s. regulators genes 0.95 0.23 0.78 … …. Probability that R regulates g 45 Constraining candidate regulators R2 R1 g R3 Graphical representation of network as a set of nodes and edges. Goal: To infer parent nodes (regulators) for each gene g using the time series expression data • Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators. • For each gene g, we estimated how likely that each regulator R regulates g (a priori) using the supervised framework and the external data sources. 46 Regression-based approach Use the expression level at time (t-1) to predict the expression levels at time t in the same segregant Potential regulators R t-1 t g Let X(g,t,s) = expression level of gene g at time t in segregant s X(g,t,s) g,s * X(R,t 1,s) Variable selection R is a potential regulator 47 Yeung et al. To appear in PNAS. Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et. al. 1999] • BMA takes model uncertainty into account by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities. Pr( | D) K Pr( | D, Mk )*Pr(Mk | D) k 1 • Output: Posterior probabilities for selected genes and selected models 48 Assessment • Recovery of known regulatory relationships: – We showed significant enrichment between our inferred network and the assessment criteria. • Lab validation of selected sub-networks Child nodes of selected TFs Genes that respond to deletion with rapamycin perturbation … WT TF • Comparison to other methods in the literature. 49 Known # # child Expression Systematic Common binding site references nodes in pattern over Description from SGD Name name from in SGD network A time JASPAR? YDR421W ARO80 YML113W DAT1 YBL103C RTG3 19 17 83 51 57 47 Increasing over time Decreasing over time Increasing over time yes Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids no DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; Arg side chain in its Nterminal pentad Gly-Arg-LysPro-Gly repeat is required for DNA-binding; not essential for viability yes Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways 50 Comparing our networks to the deletion data Our inferred network Validation experiment # child nodes Genes that respond to the deletion # overlap Fisher's test p-value ARO80 51 10 4 9.3 x 10-6 DAT1 57 784 20 0.04 RTG3 47 2288 39 0.03 Deleted TF 51 Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes. Aro80p is a known regulator of ARO9 and ARO10. (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371). 52 Legend: Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes. Magenta: Target genes with known ARO80 binding site. Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream! 53 genes regulators Expression data Genome-wide binding data Literature Supervised learning: integration of external data 0.95 0.23 0.78 Probability that … R regulates g …. Regulators constrained by the external data sources g Other data, e.g. protein-protein interaction, genetic interaction, genotype etc. Goal: incorporate prior probabilities in the variable selection step. Variable selection Time series expression data Gene regulatory network 54 Revisiting our Road Map • Definitions: graphical representation of networks • Different types of molecular networks • What can we do with networks? • Network construction methods – – – – Co-expression networks Bayesian networks Regression-based methods Assessment 55 Thank you’s Data + Biological interpretation Roger Bumgarner Kenneth Dombek Eric Schadt Jun Zhu Method development Adrian Raftery Kenneth Lo John Mittler Special thanks Dr. Rachel Brem Dr. Su-In Lee R01GM084163 R01GM084163-02S2 56