Gene Co-expression Networks
Kirill Bessonov
Nov 25th 2014
Talk Plan
• Networks
– main components
– main types
– biological importance
• Practical on WGCNA package
– main protocol steps
– interpretation of network modules
– demo
Transportation Networks
Computer Networks
Social networks
Network components
• Networks are also called graphs
– Graph (G) contains
• Nodes (N), also called vertices: genes, SNPs, cities, PCs, etc.
• Edges (E): links connecting two nodes (Ni, Nj)
Main types
• Directed
– Edges have directionality
– Some links are unidirectional
– Direction matters
• Going A → B is not the same as B → A
– Analogous to chemical reactions
• Forward rate might not be the same as reverse
– E.g. directed gene regulatory networks (TF → gene)
• Undirected
– Edges have no directionality
– Simpler to describe and work with
– E.g. co-expression networks
Biological networks
• Co-expression
– Link genes that have similar expression profiles
• Directed gene regulatory networks
– To show directionality between gene interactions
– Show direction of information flow
– E.g. transcription factor activating target gene
• Protein-Protein networks
– Show physical interaction between proteins
– Concentrate on binding events
• Others
– Metabolic, differential, Bayesian, etc.
Inferring co-expression networks in R
WGCNA package
(Weighted Gene Co-expression Network Analysis)
Main features
• Builds correlation networks
• Correlations are
– simple to calculate
– fast on large scale data
• Supports the sign of association (not direction)
• Lots of network metrics (e.g. connectivity)
• Easy identification of modules
– Reduction of dataset dimensionality → good
• Construct a network
– Search for genes with similar expression profiles
• Identify modules in the predicted network
– Reduce data into gene sets / groups
• Relate modules to external information
– Find biologically interesting modules
– E.g. clinical data, biological function (gene ontology, pathways)
• Study module preservation across different data
– Check robustness of module definition
• Find the key drivers in interesting modules
– Experimental validation, therapeutics, biomarkers
Steps for constructing a co-expression network
A) Obtain gene expression data
B) Measure co-expression between genes via a correlation coefficient
C) Build correlation matrix = network
– Adjacency matrix
D) Transform correlation matrix with the power adjacency function → new adjacency matrix → weighted network
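Written out, steps B to D amount to the relation below. Here x_i stands for the expression profile of gene i across samples and β for the soft-thresholding power; the symbols x_i and s_ij are notation introduced here for illustration only. The absolute value matches the unsigned form abs(cor(datExpr))^6 used later in the demo.

```latex
s_{ij} = \left|\operatorname{cor}(x_i, x_j)\right|, \qquad
a_{ij} = s_{ij}^{\beta}, \qquad \beta \ge 1
```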
Network=Adjacency Matrix
• Adjacency matrix, A=[aij], encodes how a pair of nodes is connected (if at all)
– Weighted networks: aij is the edge value (weight)
– Unweighted networks: aij is the presence or absence of an edge
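A minimal base R sketch of the two encodings, assuming a made-up correlation matrix for three genes. The threshold tau, the power beta and all values are hypothetical and chosen only for illustration; removing self-loops by zeroing the diagonal is likewise a choice made for this sketch.

```r
# Toy correlation matrix for three genes (values are made up for illustration).
corMat <- matrix(c(1.0, 0.9, 0.2,
                   0.9, 1.0, 0.1,
                   0.2, 0.1, 1.0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("g", 1:3), paste0("g", 1:3)))

# Unweighted network: an edge is either present (1) or absent (0).
tau <- 0.5                                  # hypothetical hard threshold
adjUnweighted <- (abs(corMat) >= tau) * 1
diag(adjUnweighted) <- 0                    # no self-loops in this sketch

# Weighted network: every pair keeps an edge weight between 0 and 1.
beta <- 6                                   # hypothetical soft-thresholding power
adjWeighted <- abs(corMat)^beta
diag(adjWeighted) <- 0

print(adjUnweighted)
print(signif(adjWeighted, 3))
```

In the unweighted matrix the weak g1-g3 link disappears entirely, while in the weighted matrix it survives with a very small weight.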
Scale Free Network Topology
• Scale free topology means
– presence of hub nodes highly connected to other nodes
– metabolic networks exhibit scale free topology, at least approximately
– Node connectivity (k) follows a power law
– p(k) = proportion of nodes that have connectivity k
[Figure: Frequency Distribution of Connectivity (x-axis: Connectivity k, y-axis: Frequency)]
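The power law for node connectivity is usually written as below. The exponent γ and the constant c are symbols introduced here for illustration. Taking logarithms turns the power law into a straight line, which is why a log-log scatter plot of p(k) against k is a natural diagnostic.

```latex
p(k) \propto k^{-\gamma}
\quad\Longrightarrow\quad
\log_{10} p(k) = c - \gamma \, \log_{10} k
```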
How to check Scale Free Topology?
Check if the obtained network follows scale free topology
Idea: log-transform p(k) and k and look at the scatter plot
Answer: R^2 of a linear fit can be used to quantify goodness of fit
R^2 > 0.6 means that the network follows scale free topology
Only a few nodes display high connectivity
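The check can be sketched in base R as follows. This is a hand-rolled illustration, not the package's implementation: the function name scaleFreeR2, its nBreaks argument, the equal-width binning and the simulated connectivities are all assumptions made here. In the demo, the WGCNA functions scaleFreePlot and pickSoftThreshold are used for this purpose.

```r
# Scale free fit: regress log10(p(k)) on log10(k) and report R^2 and the slope.
# 'scaleFreeR2' and 'nBreaks' are hypothetical names used only in this sketch.
scaleFreeR2 <- function(k, nBreaks = 10) {
  bins <- cut(k, breaks = nBreaks)            # equal-width connectivity bins
  pk   <- as.vector(table(bins)) / length(k)  # p(k): proportion of nodes per bin
  kMid <- as.vector(tapply(k, bins, mean))    # mean connectivity within each bin
  keep <- !is.na(kMid) & pk > 0               # drop empty bins before taking logs
  fit  <- lm(log10(pk[keep]) ~ log10(kMid[keep]))
  c(R2 = summary(fit)$r.squared, slope = unname(coef(fit)[2]))
}

# Simulated connectivities with many small values and a few hubs
# (a Pareto-like sample, used here instead of a real network).
set.seed(1)
k <- runif(2000)^(-1/2)

res <- scaleFreeR2(k)
print(round(res, 2))
```

A negative slope together with a high R^2 is the pattern one would expect from an approximately scale free network.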
Power function transformation
• Idea:
– transform correlation matrix via power function
– Impose scale free topology
– Select the best beta (β)
• Pick the beta that corresponds to the largest R^2
[Figure: Power function, scale free fit R^2 (y-axis) plotted against beta (x-axis)]
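A short base R sketch of what the power function does to individual correlations. The three correlation values and the grid of betas are made up for illustration; the point is only that raising to a power shrinks weak correlations much faster than strong ones.

```r
# Effect of the power function on weak versus strong correlations.
# The correlation values and the beta grid are made up for illustration.
corValues <- c(weak = 0.2, moderate = 0.5, strong = 0.9)
betas <- c(1, 2, 6, 12)

# One row per beta, one column per correlation: a = |cor|^beta
powerTable <- t(sapply(betas, function(b) abs(corValues)^b))
rownames(powerTable) <- paste0("beta=", betas)
print(signif(powerTable, 3))

# The strong/weak ratio grows with beta, so strong links dominate connectivity.
print(signif(powerTable[, "strong"] / powerTable[, "weak"], 3))
```

This is the mechanism by which a larger beta pushes the network towards a few highly connected nodes and many weakly connected ones.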
Defining modules
• based on a hierarchical cluster tree
– Build a tree and cut it
– Dynamic tree cutting at optimal height [1]
Module = branch of a cluster tree
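The "build a tree and cut it" idea can be sketched with base R alone. The example assumes two simulated modules of ten genes each and uses a plain static cut (cutree) as a stand-in; it is not the dynamic tree cut of [1], which chooses branches adaptively. All variable names and the noise level are hypothetical.

```r
set.seed(1)
nSamples <- 40

# Two hypothetical module profiles and ten noisy member genes for each.
e1 <- rnorm(nSamples)
e2 <- rnorm(nSamples)
expr <- cbind(sapply(1:10, function(i) e1 + 0.3 * rnorm(nSamples)),
              sapply(1:10, function(i) e2 + 0.3 * rnorm(nSamples)))
colnames(expr) <- paste0("Gene", 1:20)

# Adjacency -> dissimilarity -> average-linkage tree, as in the demo.
adj  <- abs(cor(expr))^6
tree <- hclust(as.dist(1 - adj), method = "average")

# Static cut into two branches; each branch is treated as one module.
module <- cutree(tree, k = 2)
print(table(module))
```

With real data the number of modules is not known in advance, which is where cutting by height or by the dynamic method becomes useful.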
Analysis of modules
[Figure: GENES grouped into modules (genes 1, genes 2, genes 3, genes 4)]
• Perform gene ontology analysis on genes from each module (e.g. yellow = “genes 1”)
• Link modules to clinical data (e.g. weight)
– Via the module eigengene, e.g. cor(trait, eigengene)
Heatmap view of module
[Figure: heatmap of a module of co-expressed genes across tissue samples]
Vertical bands indicate tight co-expression of module genes
Modules as eigengenes
• Can summarize all genes in a module by one eigengene (i.e. virtual gene)
• Allows one to relate modules to each other
– Allows calculating the distance between modules
• Allows one to relate modules to clinical traits and SNPs
Module eigengene = measure of overexpression = average redness
Rows = genes, Columns = microarrays
[Figure: brown module, eigengene value (axis from -0.1 to 0.4) for each of samples 1 to 185]
The brown module eigengenes across samples
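A minimal base R sketch of summarizing a module by one eigengene, taking the eigengene to be the first principal component of the standardized module expression (the usual definition). The simulated module, the noise level and all variable names are assumptions made for illustration; in the demo the WGCNA function moduleEigengenes does this job.

```r
set.seed(1)
nSamples <- 30

# Hypothetical module: 15 genes that follow one hidden profile plus noise.
hidden  <- rnorm(nSamples)
modExpr <- sapply(1:15, function(i) hidden + 0.4 * rnorm(nSamples))  # samples x genes

# Standardize each gene, then take the first left singular vector
# (one value per sample) as the eigengene.
scaled    <- scale(modExpr)
sv        <- svd(scaled)
eigengene <- sv$u[, 1]

# The sign of a singular vector is arbitrary; align it with the average expression.
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Proportion of the module's variance captured by the eigengene.
propVar <- sv$d[1]^2 / sum(sv$d^2)
print(round(propVar, 2))
print(round(cor(eigengene, hidden), 2))
```

Because the eigengene has one value per sample, it can be correlated with a clinical trait or with another module's eigengene in the same way as a single gene.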
Analysis of modules
• Relate modules to traits
• Interested in modules with correlation > 0.75 (red)
WGCNA Demo
Simulated data - 5 modules
Simulating expression data (1)
Note: install the Hmisc library first, otherwise the WGCNA installation fails
install.packages("Hmisc")
#Simulate data
# Load WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Here are input parameters of the simulation model
# number of samples or microarrays in the training data
no.obs=50
# now we specify the true measures of eigengene significance
# recall that ESturquoise=cor(y,MEturquoise)
ESturquoise=0; ESbrown= -.6;
ESgreen=.6;ESyellow=0
# Note that we don't specify the eigengene significance of the blue module
# since it is highly correlated with the turquoise module.
ESvector=c(ESturquoise,ESbrown,ESgreen,ESyellow)
# number of genes
nGenes1=3000
# proportion of genes in the turquoise, blue, brown, green, and yellow module, respectively.
simulateProportions1=c(0.2,0.15, 0.08, 0.06, 0.04)
# Note that the proportions don't add up to 1. The remaining genes will be colored grey,
# i.e. the grey genes are non-module genes.
# set the seed of the random number generator. As a homework exercise change this seed.
set.seed(1)
Simulating expression data (2)
#Step 1: simulate a module eigengene network.
# Training Data Set I
MEgreen=rnorm(no.obs)
scaledy=MEgreen*ESgreen+sqrt(1-ESgreen^2)*rnorm(no.obs)
y=ifelse( scaledy>median(scaledy),2,1)
MEturquoise= ESturquoise*scaledy+sqrt(1-ESturquoise^2)*rnorm(no.obs)
# we simulate a strong dependence between MEblue and MEturquoise
MEblue= 0.6*MEturquoise+ sqrt(1-.6^2) *rnorm(no.obs)
MEbrown= ESbrown*scaledy+sqrt(1-ESbrown^2)*rnorm(no.obs)
MEyellow= ESyellow*scaledy+sqrt(1-ESyellow^2)*rnorm(no.obs)
ModuleEigengeneNetwork1=data.frame(y,MEturquoise,MEblue,MEbrown,MEgreen, MEyellow)
Simulating expression data (3)
dat1=simulateDatExpr5Modules(MEturquoise=ModuleEigengeneNetwork1$MEturquoise,
MEblue=ModuleEigengeneNetwork1$MEblue,
MEbrown=ModuleEigengeneNetwork1$MEbrown,
MEyellow=ModuleEigengeneNetwork1$MEyellow,
MEgreen=ModuleEigengeneNetwork1$MEgreen,
nGenes=nGenes1,
simulateProportions=simulateProportions1)
datExpr = dat1$datExpr;
truemodules = dat1$truemodule;
datME = dat1$datME;
attach(ModuleEigengeneNetwork1)
datExpr=data.frame(datExpr)
ArrayName=paste("Sample",1:dim(datExpr)[[1]], sep="" )
# The following code is useful for outputting the simulated data
GeneName=paste("Gene",1:dim(datExpr)[[2]], sep="" )
dimnames(datExpr)[[1]]=ArrayName
dimnames(datExpr)[[2]]=GeneName
rm(dat1); collectGarbage();
# The following command will save all variables defined in the current session.
save.image("Simulated-dataSimulation.RData");
cat("Note: *.RData file written in ",getwd(), "\n")
Construction of a weighted gene
co-expression network (1)
# Load WGCNA package
library(WGCNA)
# Load additional necessary packages
library(cluster)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the previously saved data
load("Simulated-dataSimulation.RData");
attach(ModuleEigengeneNetwork1)
sft=pickSoftThreshold(datExpr,powerVector=1:20)
# columns of fitIndices: 1 = power, 2 = scale free fit R^2, 3 = slope of the fit
plot(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="SFT, signed R^2", type="o")
abline(h=0.90,col="red")
Construction of a weighted gene
co-expression network (2)
# here we define the adjacency matrix using soft thresholding with beta=6
ADJ1=abs(cor(datExpr,use="p"))^6
# When you have relatively few genes (<5000) use the following code
k=as.vector(apply(ADJ1,2,sum, na.rm=TRUE))
# When you have a lot of genes use the following code
#k=softConnectivity(datE=datExpr,power=6)
# Plot a histogram of k and a scale free topology plot
sizeGrWindow(10,5)
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k, main="Check scale free topology\n")
Definition of co-expression modules (1)
# Many clustering procedures require a dissimilarity matrix as input.
# We define a dissimilarity based on adjacency.
# Turn adjacency into a measure of dissimilarity
dissADJ=1-ADJ1
hierADJ=hclust(as.dist(dissADJ), method="average" )
# Plot the resulting clustering tree together with the true color assignment
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules), dendroLabels = FALSE, hang = 0.03,
main = "Gene hierarchical clustering dendrogram and simulated module colors" )
Definition of co-expression modules (2)
#static tree cutting
colorStaticADJ=as.character(cutreeStaticColor(hierADJ, cutHeight=.99, minSize=20))
# Plot the dendrogram with module colors
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules, colorStaticADJ),
dendroLabels = FALSE, abHeight = 0.99,
main = "Gene dendrogram and module colors")
#dynamic tree cutting
branch.number=cutreeDynamic(hierADJ,method="tree")
# This function transforms the branch numbers into colors
colorDynamicADJ=labels2colors(branch.number)
# dynamic hybrid tree cutting with default settings; the hybrid method also needs the dissimilarity matrix
colorDynamicHybridADJ=labels2colors(cutreeDynamic(hierADJ, distM=dissADJ, method="hybrid"))
sizeGrWindow(10,5)
plotDendroAndColors(dendro = hierADJ,
colors=data.frame(truemodules, colorStaticADJ,
colorDynamicADJ, colorDynamicHybridADJ),
dendroLabels = FALSE, marAll = c(0.2, 8, 2.7, 0.2),
main = "Gene dendrogram and module colors")
Calculating module eigengenes
#calculate eigengenes for each module
datME=moduleEigengenes(datExpr,colorStaticADJ)$eigengenes
#correlation between modules based on their eigengenes
signif(cor(datME, use="p"), 2)
#dendrogram
dissimME=(1-t(cor(datME, method="p")))/2
hclustdatME=hclust(as.dist(dissimME), method="average" )
# Plot the eigengene dendrogram
par(mfrow=c(1,1))
plot(hclustdatME, main="Clustering tree based on the module eigengenes")
#see expression profiles - diagnostic plots
#show available modules
levels(as.factor(colorStaticADJ))
sizeGrWindow(8,9)
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="blue";
plotMat(t(scale(datExpr[,colorStaticADJ==which.module ]) ),nrgcols=30,rlabels=T,
clabels=T,rcols=which.module,
title=which.module )
ME=datME[, paste("ME",which.module, sep="")]
barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="array sample")
Relating modules to trait
#all modules (green and brown modules look interesting)
signif(cor(y,datME, use="p"),2)
#get statistical significance of module association to trait
cor.test(y, datME$MEbrown)
cor.test(y, datME$MEgreen)
References
[1] Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5):719-720
[2] Steve Horvath, Tutorials for the WGCNA package