Gene Co-expression Networks
Kirill Bessonov
Nov 25th 2014

Talk Plan
• Networks
  – main components
  – main types
  – biological importance
• Practical on WGCNA package
  – main protocol steps
  – interpretation of network modules
  – demo

Transportation Networks
Computer Networks
Social networks

Network components
• Networks are also called graphs
  – A graph (G) contains
    • Nodes (N): genes, SNPs, cities, PCs, etc.
    • Edges (E): links connecting two nodes (Ni, Nj)

Main types
• Directed
  – Edges have directionality
  – Some links are unidirectional
  – Direction matters
    • Going A → B is not the same as B → A
  – Analogous to chemical reactions
    • Forward rate might not be the same as reverse rate
  – E.g. directed gene regulatory networks (TF → gene)
• Undirected
  – Edges have no directionality
  – Simpler to describe and work with
  – E.g. co-expression networks

Biological networks
• Co-expression
  – For genes that have similar expression profiles
• Directed gene regulatory networks
  – Show directionality of gene interactions
  – Show direction of information flow
  – E.g. transcription factor activating a target gene
• Protein-protein networks
  – Show physical interactions between proteins
  – Concentrate on binding events
• Others
  – Metabolic, differential, Bayesian, etc.

Inferring co-expression networks in R
WGCNA package (Weighted Gene Co-expression Network Analysis)

Main features
• Builds correlation networks
• Correlations are
  – simple to calculate
  – fast on large-scale data
• Supports sign of association (not direction)
• Lots of network metrics (e.g. connectivity)
• Easy identification of modules
  – Reduction of dataset dimensionality

• Construct a network
  – Search for genes with similar expression profiles
• Identify modules in the predicted network
  – Reduce data into gene sets / groups
• Relate modules to external information
  – Find biologically interesting modules
  – E.g. clinical data, biological function (gene ontology, pathways)
• Study module preservation across different data
  – Check robustness of module definition
• Find the key drivers in interesting modules
  – Experimental validation, therapeutics, biomarkers

Steps for constructing a co-expression network
A) Obtain gene expression data
B) Measure co-expression between genes via a correlation coefficient
C) Build correlation matrix = network
  – Adjacency matrix
D) Transform the correlation matrix with the power adjacency function → new adjacency matrix → weighted network

Network = Adjacency Matrix
• Adjacency matrix, A = [aij], encodes how a pair of nodes is connected (if at all)
  – Weighted networks: aij is the edge value (weight)
  – Unweighted networks: aij marks presence or absence of an edge

Scale Free Network Topology
• Scale free topology means
  – presence of hub nodes highly connected to other nodes
  – metabolic networks exhibit scale free topology, at least approximately
  – node connectivity (k) follows a power law
  – p(k) = proportion of nodes that have connectivity k
[Figure: Frequency Distribution of Connectivity; x-axis: Connectivity k, y-axis: Frequency]

How to check Scale Free Topology?
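One way to run this check by hand, as a minimal base-R sketch on toy data (no WGCNA needed): build the power adjacency from correlations, sum edge weights into the connectivity k, then regress log10 p(k) on log10 k and read off R^2. The names expr, beta, adj, k_mid, the toy co-expressed block of 50 genes and the choice of 10 bins are illustrative assumptions; WGCNA's own functions may bin and fit differently.

```r
# Toy data: 40 samples (rows) x 200 genes (columns); all names are illustrative
set.seed(1)
n_samples <- 40
n_genes <- 200
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)

# Give the first 50 genes a shared signal so that they are co-expressed
shared <- rnorm(n_samples)
expr[, 1:50] <- expr[, 1:50] + 1.5 * shared

# Power adjacency function: a_ij = |cor(x_i, x_j)|^beta
beta <- 6
adj <- abs(cor(expr))^beta
diag(adj) <- 0                       # ignore self-connections

# Connectivity k_i = sum of the edge weights of gene i
k <- rowSums(adj)

# Scale free check: bin k, estimate p(k) per bin, fit log10 p(k) ~ log10 k
bins <- cut(k, breaks = 10)
p_k <- as.vector(table(bins)) / length(k)
k_mid <- tapply(k, bins, mean)       # mean connectivity per bin (NA if empty)
keep <- p_k > 0                      # drop empty bins before taking logs
fit <- lm(log10(p_k[keep]) ~ log10(k_mid[keep]))
r2 <- summary(fit)$r.squared
slope <- unname(coef(fit)[2])

cat("R^2 =", round(r2, 2), " slope =", round(slope, 2), "\n")
```

Repeating the last block for several values of beta gives the R^2 versus beta curve used to choose the soft threshold.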
• Check if the obtained network follows scale free topology
• Idea: log-transform p(k) and k and look at scatter plots (a power law appears as a straight line)
• Answer: R^2 can be used to quantify goodness of fit
  – R^2 > 0.6 means that the network follows scale free topology
  – Only a few nodes display high connectivity

Power function transformation
• Idea:
  – transform the correlation matrix via a power function
  – impose scale free topology
  – select the best beta (β)
[Figure: Power function; R^2 plotted against beta]
• Pick the beta (β) that gives the largest R^2

Defining modules
• Based on a hierarchical cluster tree
  – Build a tree and cut it
  – Dynamic tree cutting at optimal height [1]
• Module = branch of a cluster tree

Analysis of modules
[Figure: modules shown as gene groups labeled genes 1, genes 2, genes 3, genes 4]
• Perform gene ontology analysis on genes from each module (e.g. yellow = "genes 1")
• Link modules to clinical data (e.g. weight)
  – Via the module eigengene, e.g. cor(trait, eigengene)

Heatmap view of module
[Figure: heatmap of a module of co-expressed genes; axes labeled genes and tissue samples]
• Vertical bands indicate tight co-expression of module genes

Modules as eigengenes
• All genes in a module can be summarized by one eigengene (i.e. virtual gene)
• Eigengenes allow one to relate modules to each other
  – Allows calculating the distance between modules
• Eigengenes allow one to relate modules to clinical traits and SNPs

Module eigengene = measure of overexpression = average redness
Rows = genes, columns = microarrays
[Figure: bar plot of the brown module eigengene (about -0.1 to 0.4) across samples 1 to 185]
The brown module eigengenes across samples

Analysis of modules
• Relate modules to traits
• Interested in modules with correlation > 0.75 (red)

WGCNA Demo
Simulated data - 5 modules

Simulating expression data (1)
Note: install the Hmisc library first, otherwise WGCNA installation fails
install.packages("Hmisc")
# Simulate data
# Load WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Here are input parameters of the simulation model
# number of samples or microarrays in the training data
no.obs=50
# now we specify the true measures of eigengene significance
# recall that ESturquoise=cor(y,MEturquoise)
ESturquoise=0; ESbrown= -.6;
ESgreen=.6;ESyellow=0
# Note that we don't specify the eigengene significance of the blue module
# since it is highly correlated with the turquoise module.
ESvector=c(ESturquoise,ESbrown,ESgreen,ESyellow)
# number of genes
nGenes1=3000
# proportion of genes in the turquoise, blue, brown, green, and yellow module
# respectively.
simulateProportions1=c(0.2,0.15, 0.08, 0.06, 0.04)
# Note that the proportions don't add up to 1. The remaining genes will be colored grey,
# i.e. the grey genes are non-module genes.
# Set the seed of the random number generator. As a homework exercise change this seed.
set.seed(1)

Simulating expression data (2)
# Step 1: simulate a module eigengene network.
# Training Data Set I
MEgreen=rnorm(no.obs)
scaledy=MEgreen*ESgreen+sqrt(1-ESgreen^2)*rnorm(no.obs)
y=ifelse( scaledy>median(scaledy),2,1)
MEturquoise= ESturquoise*scaledy+sqrt(1-ESturquoise^2)*rnorm(no.obs)
# we simulate a strong dependence between MEblue and MEturquoise
MEblue= 0.6*MEturquoise+ sqrt(1-.6^2) *rnorm(no.obs)
MEbrown= ESbrown*scaledy+sqrt(1-ESbrown^2)*rnorm(no.obs)
MEyellow= ESyellow*scaledy+sqrt(1-ESyellow^2)*rnorm(no.obs)
ModuleEigengeneNetwork1=data.frame(y,MEturquoise,MEblue,MEbrown,MEgreen, MEyellow)

Simulating expression data (3)
dat1=simulateDatExpr5Modules(MEturquoise=ModuleEigengeneNetwork1$MEturquoise,
  MEblue=ModuleEigengeneNetwork1$MEblue,
  MEbrown=ModuleEigengeneNetwork1$MEbrown,
  MEyellow=ModuleEigengeneNetwork1$MEyellow,
  MEgreen=ModuleEigengeneNetwork1$MEgreen,
  nGenes=nGenes1,
  simulateProportions=simulateProportions1)
datExpr = dat1$datExpr;
truemodules = dat1$truemodule;
datME = dat1$datME;
attach(ModuleEigengeneNetwork1)
datExpr=data.frame(datExpr)
ArrayName=paste("Sample",1:dim(datExpr)[[1]], sep="" )
# The following code is useful for outputting the simulated data
GeneName=paste("Gene",1:dim(datExpr)[[2]], sep="" )
dimnames(datExpr)[[1]]=ArrayName
dimnames(datExpr)[[2]]=GeneName
rm(dat1); collectGarbage();
# The following command will save all variables defined in the current session.
save.image("Simulated-dataSimulation.RData");
cat("Note: *.RData file written in ",getwd(), "\n")

Construction of a weighted gene co-expression network (1)
# Load WGCNA package
library(WGCNA)
# Load additional necessary packages
library(cluster)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the previously saved data (the file written by save.image() in the simulation step)
load("Simulated-dataSimulation.RData");
attach(ModuleEigengeneNetwork1)
# Evaluate the scale free topology fit for soft-thresholding powers 1 to 20
sft=pickSoftThreshold(datExpr,powerVector=1:20)
# Column 3 of fitIndices is the slope of the log-log fit; it is negative for a
# scale free network, so -sign(slope)*R^2 is positive when the fit is good
plot(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2],
  xlab="Soft Threshold (power)",ylab="SFT, signed R^2", type="o")
abline(h=0.90,col="red")

Construction of a weighted gene co-expression network (2)
# here we define the adjacency matrix using soft thresholding with beta=6
ADJ1=abs(cor(datExpr,use="p"))^6
# When you have relatively few genes (<5000) use the following code
k=as.vector(apply(ADJ1,2,sum, na.rm=TRUE))
# When you have a lot of genes use the following code
#k=softConnectivity(datE=datExpr,power=6)
# Plot a histogram of k and a scale free topology plot
sizeGrWindow(10,5)
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k, main="Check scale free topology\n")

Definition of co-expression modules (1)
# Many clustering procedures require a dissimilarity matrix as input.
# We define a dissimilarity based on adjacency.
# Turn adjacency into a measure of dissimilarity
dissADJ=1-ADJ1
hierADJ=hclust(as.dist(dissADJ), method="average" )
# Plot the resulting clustering tree together with the true color assignment
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules), dendroLabels = FALSE, hang = 0.03,
  main = "Gene hierarchical clustering dendrogram and simulated module colors" )

Definition of co-expression modules (2)
# static tree cutting
colorStaticADJ=as.character(cutreeStaticColor(hierADJ, cutHeight=.99, minSize=20))
# Plot the dendrogram with module colors
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules, colorStaticADJ),
  dendroLabels = FALSE, abHeight = 0.99, main = "Gene dendrogram and module colors")
# dynamic tree cutting
branch.number=cutreeDynamic(hierADJ,method="tree")
# This function transforms the branch numbers into colors
colorDynamicADJ=labels2colors(branch.number)
sizeGrWindow(10,5)
# colorDynamicHybridADJ is not defined anywhere in these slides. One way to obtain it
# (an assumption, using default settings) is the hybrid variant of dynamic tree cutting,
# which also needs the dissimilarity matrix
colorDynamicHybridADJ=labels2colors(cutreeDynamic(hierADJ, distM=dissADJ, method="hybrid"))
plotDendroAndColors(dendro = hierADJ,
  colors=data.frame(truemodules, colorStaticADJ, colorDynamicADJ, colorDynamicHybridADJ),
  dendroLabels = FALSE, marAll = c(0.2, 8, 2.7, 0.2),
  main = "Gene dendrogram and module colors")

Calculating module eigengenes
# calculate eigengenes for each module
datME=moduleEigengenes(datExpr,colorStaticADJ)$eigengenes
# correlation between modules based on their eigengenes
signif(cor(datME, use="p"), 2)
# dendrogram
dissimME=(1-t(cor(datME, method="p")))/2
hclustdatME=hclust(as.dist(dissimME), method="average" )
# Plot the eigengene dendrogram
par(mfrow=c(1,1))
plot(hclustdatME, main="Clustering tree based on the module eigengenes")
# see expression profiles - diagnostic plots
# show available modules
levels(as.factor(colorStaticADJ))
sizeGrWindow(8,9)
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="blue";
plotMat(t(scale(datExpr[,colorStaticADJ==which.module ]) ),nrgcols=30,rlabels=TRUE,
  clabels=TRUE,rcols=which.module, title=which.module )
ME=datME[, paste("ME",which.module, sep="")]
barplot(ME, col=which.module, main="", cex.main=2, ylab="eigengene expression",xlab="array sample")

Relating modules to trait
# all modules (green and brown modules look interesting)
signif(cor(y,datME, use="p"),2)
# get statistical significance of module association to trait
cor.test(y, datME$MEbrown)
cor.test(y, datME$MEgreen)

References
[1] Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5):719-720
[2] Steve Horvath, Tutorials for the WGCNA package
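Supplement to the "Modules as eigengenes" and "Relating modules to trait" slides: a minimal base-R sketch of what an eigengene is, assuming it is taken as the first singular vector of the scaled module expression matrix. The toy trait, the 30-gene module and all variable names are hypothetical; the exact preprocessing inside moduleEigengenes may differ.

```r
# Toy module: 50 samples, 30 genes that all track a hypothetical binary trait
set.seed(2)
n_samples <- 50
trait <- rep(c(1, 2), each = n_samples / 2)
signal <- as.numeric(scale(trait)) + rnorm(n_samples, sd = 0.5)
module_expr <- sapply(1:30, function(i) signal + rnorm(n_samples))  # samples x genes

# Scale each gene, then take the SVD of the genes x samples matrix
scaled <- scale(module_expr)
sv <- svd(t(scaled))
eigengene <- sv$v[, 1]               # one value per sample: the "virtual gene"

# The sign of a singular vector is arbitrary; align it with average expression
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Share of the module's variance captured by the eigengene
var_explained <- sv$d[1]^2 / sum(sv$d^2)

# Relate the module to the trait through its eigengene
res <- cor.test(trait, eigengene)
cat("cor =", round(unname(res$estimate), 2),
    " variance explained =", round(var_explained, 2), "\n")
```

Because the eigengene is a single vector per module, the same cor.test call can be repeated per module to screen which modules are associated with a trait.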